AI Engineering Skill Reference — Chapter 6: RAG and Agents
This reference distills the chapter into practical patterns, decision frameworks, code snippets, and checklists you can apply immediately.
RAG (Retrieval-Augmented Generation) — Essentials
RAG = retrieve relevant external context per query, then generate. Use it to:
- Reduce hallucinations
- Minimize tokens (cost/latency)
- Personalize with per-user data
- Keep models up-to-date
Typical architecture:
- Indexing: chunk → embed → store (vector DB + metadata store)
- Querying: rewrite → retrieve (hybrid) → rerank → assemble prompt → generate
- Optional: caches, guards, memory
Minimal RAG Pipeline (text)
- Indexing
- Chunk documents
- Generate embeddings
- Store vectors and metadata (title, tags, timestamps, ids, permissions)
- Build vector index (HNSW or IVF-PQ)
- Build keyword index (BM25/Elasticsearch)
- Querying
- Rewrite query (from chat history)
- Retrieve candidates (hybrid: BM25 + vector search)
- Rerank top-N with cross-encoder or LLM judge
- Build final prompt (instructions + user query + retrieved chunks)
- Generate answer
- Log for evaluation
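A minimal sketch of this querying path, assuming hypothetical helpers (rewrite_query, bm25_search, vector_search, embed, cross_encoder_rerank, build_prompt, log_for_eval) plus the rrf_rank function defined below:
def answer_query(user_input, history, llm, instructions, top_k=5):
    # Rewrite into a self-contained query using chat history
    query = rewrite_query(user_input, history)
    # Hybrid retrieval: term-based and embedding-based candidates
    bm25_ids = bm25_search(query, top_m=200)
    vec_ids = vector_search(embed(query), top_m=200)
    fused = rrf_rank(bm25_ids, vec_ids, top_k=50)
    # Precision-oriented rerank; keep only what fits the prompt budget
    chunks = cross_encoder_rerank(query, fused)[:top_k]
    prompt = build_prompt(instructions, query, chunks)
    answer = llm.generate(prompt)
    log_for_eval(query, chunks, answer)
    return answer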
Retrieval Algorithms — When to Use What
| Dimension | Term-based (BM25/Elasticsearch) | Embedding-based (Vector Search) |
|---|---|---|
| Strengths | Fast, cheap, proven, easy to operate | Semantic match, natural queries, improves with finetuning |
| Weaknesses | Lexical match only, ambiguity | Costly embeddings & vector infra, may miss exact codes |
| Best for | Exact terms, IDs, logs, error codes, legal cites | Ambiguous wording, multi-lingual, paraphrases |
| Cost/Latency | Low | Medium–High (embeddings + ANN search) |
| Tuning | Fewer knobs | Many knobs (model, index, rerankers) |
Best practice: Use hybrid search (term + embedding) with reciprocal rank fusion (RRF) and/or reranking.
Hybrid Search with RRF (Reciprocal Rank Fusion)
- Retrieve top-M from BM25 and top-M from vector search
- Combine by RRF: score(doc) += 1 / (k + rank), k≈60
- Take top-K fused results to rerank
Python example:
def rrf_rank(bm25_list, vec_list, k=60, top_k=10):
    """Fuse two ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for rank, doc_id in enumerate(bm25_list, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    for rank, doc_id in enumerate(vec_list, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    return [doc for doc, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]
Chunking Strategy — Defaults and Variants
Why it matters: retrieval quality, token cost, latency, recall/precision.
Recommended defaults:
- Unit: tokens (model’s tokenizer)
- Chunk size: 500–1,500 tokens
- Overlap: 10–20% of chunk size (e.g., 50–150 tokens)
- Preserve structure boundaries (headings/sections/paragraphs)
- Attach metadata: title, section, doc_id, updated_at
Variations:
- Recursive split: sections → paragraphs → sentences; stop when a piece fits chunk_size (see the sketch after the Pitfalls list)
- Domain splitters: code-aware, Q&A pairs, legal clauses, Chinese sentence segmentation
- Contextual augmentation: prepend a concise “chunk explainer” (50–100 tokens)
Context explainer prompt:
{{WHOLE_DOCUMENT}}
Chunk:
{{CHUNK_CONTENT}}
Write a 50–100 token summary that situates this chunk within the whole document for improved retrieval. Return only the context.
Pitfalls:
- Too small chunks → lost context, high index/search cost
- Too large chunks → low recall, exceeds model/embedding context
- No overlap → boundary loss (“hot dog” → “hot” + “dog”)
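The recursive split sketch referenced above, with token-based sizing and overlap, assuming the tiktoken package (the separator ladder stands in for sections → paragraphs → sentences):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def recursive_split(text, max_tokens=1000, overlap=150,
                    separators=("\n\n", "\n", ". ")):
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return [text]
    # Split on the coarsest separator that actually divides the text
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = (buf + sep + part) if buf else part
                if len(enc.encode(candidate)) > max_tokens and buf:
                    chunks.append(buf)
                    # Carry ~overlap tokens across the boundary
                    tail = enc.decode(enc.encode(buf)[-overlap:])
                    buf = tail + sep + part
                else:
                    buf = candidate
            if buf:
                chunks.append(buf)
            # Recurse in case any piece is still too large
            return [c for chunk in chunks
                    for c in recursive_split(chunk, max_tokens, overlap, separators)]
    # No separator worked: hard-split on token boundaries
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens - overlap)]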
Vector Search — Practical Selection
Common ANN indexes:
- HNSW: high recall, fast queries, larger index; great online retrieval
- IVF-PQ: scalable, compressed, lower memory; slight recall hit acceptable
- Annoy: simple, read-only, good for static catalogs
Quick guidance:
- <5M vectors: HNSW
- 5–200M: IVF-PQ or HNSW with careful memory planning
- >200M: IVF-PQ + sharding; consider hybrid indexing
Key metrics:
- Recall@K (target ≥0.9 for top-K)
- QPS (per shard and aggregate)
- Build time (embedding + index)
- Index size (RAM/SSD requirements)
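A minimal FAISS sketch matching the HNSW defaults above, with a recall@K check against exact search (assumes the faiss-cpu and numpy packages; random vectors stand in for real embeddings):
import faiss
import numpy as np

d = 768                                  # embedding dimension (example)
xb = np.random.rand(100_000, d).astype("float32")  # corpus vectors
xq = np.random.rand(100, d).astype("float32")      # query vectors

index = faiss.IndexHNSWFlat(d, 32)       # M=32 links per node
index.hnsw.efSearch = 100                # higher = better recall, slower
index.add(xb)

exact = faiss.IndexFlatL2(d)             # brute-force ground truth
exact.add(xb)

k = 10
_, approx_ids = index.search(xq, k)
_, exact_ids = exact.search(xq, k)
recall = np.mean([len(set(a) & set(e)) / k
                  for a, e in zip(approx_ids, exact_ids)])
print(f"recall@{k}: {recall:.3f}")       # target >= 0.9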
Retrieval Optimization Tactics
- Reranking
- Cross-encoder rerank: best precision (e.g., bge-reranker, monoT5)
- Time-decay score: prioritize recent data (news/emails/changelogs)
- Positioning: important docs first/last (model primacy/recency)
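A minimal sketch of the time-decay score, using the exponential half-life default from the settings section below (updated_at as a Unix timestamp is an assumption):
import time

def decayed_score(relevance, updated_at, half_life_days=7.0):
    """Halve a document's score for every half_life_days since its last update."""
    age_days = (time.time() - updated_at) / 86_400
    return relevance * 0.5 ** (age_days / half_life_days)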
- Query Rewriting
- Expand/clarify user intent from chat context
- Resolve identities (pronouns, aliases) and guard against missing information
Prompt:
Given the conversation and the last user turn, rewrite the final user query to be fully self-contained and precise. If required information is missing, say "INSUFFICIENT CONTEXT: <what is needed>".
Conversation:
{{HISTORY}}
User:
{{LAST_UTTERANCE}}
Rewrite:
- Contextual Retrieval
- Enrich chunks with:
- Keywords, entities (error codes, product names)
- Titles, section headings, doc summaries
- Canonical Q&A phrasings (for FAQs/support)
- Store metadata in keyword index for hybrid search
RAG Evaluation — What to Measure and How
Retriever-level:
- Context Precision: % of retrieved docs that are relevant
- Context Recall: % of relevant docs that were retrieved (harder; needs exhaustive labels)
- Ranking Metrics: NDCG, MAP, MRR
System-level:
- Answer quality (task metrics or LLM judge)
- Hallucination rate
- Token cost and latency
Practical evaluation loop:
- Build test set: (query, doc corpus, gold relevant docs)
- Compute precision/recall (human or LLM judge)
- Ablate components: chunk sizes, number of chunks, k values, reranker on/off
- Track tokens, latency, cost per query
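A sketch of the retriever-level metrics over such a test set (each test case pairs a query with its set of gold-relevant doc ids; the retriever callable is an assumption):
def context_precision(retrieved, gold):
    return sum(1 for d in retrieved if d in gold) / len(retrieved)

def context_recall(retrieved, gold):
    return sum(1 for d in retrieved if d in gold) / len(gold)

def mrr(retrieved, gold):
    for rank, d in enumerate(retrieved, start=1):
        if d in gold:
            return 1.0 / rank
    return 0.0

def evaluate(test_set, retriever, k=5):
    rows = [(retriever(q)[:k], gold) for q, gold in test_set]
    n = len(rows)
    return {
        "precision@k": sum(context_precision(r, g) for r, g in rows) / n,
        "recall@k": sum(context_recall(r, g) for r, g in rows) / n,
        "mrr": sum(mrr(r, g) for r, g in rows) / n,
    }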
LLM judge prompt (pairwise doc relevance):
Query: {{Q}}
Document: {{DOC}}
Is the document sufficient to help answer the query? Respond with one of: "relevant", "partially relevant", "irrelevant". Provide a one-sentence rationale.
Cost and Latency Controls
- Reserve retrieval budget (e.g., 20–40% of total tokens)
- Limit top_k retrieved (e.g., 3–8, tune per model)
- Cache query embeddings and retrieval results
- Use reranking to prune aggressively; stream generation
- Embed incrementally on updates (changed chunks only)
- Monitor vector DB spend; compress with PQ where appropriate
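A minimal in-process cache sketch for query embeddings and retrieval results (embed and hybrid_retrieve are hypothetical; a shared cache such as Redis with TTLs is preferable once the index updates frequently):
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_embedding(query: str):
    return tuple(embed(query))        # tuples are hashable and cacheable

@lru_cache(maxsize=10_000)
def cached_retrieval(query: str, k: int = 50):
    return tuple(hybrid_retrieve(query, k=k))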
Multimodal RAG (Images, Audio, Video)
Pattern:
- Index: multimodal embeddings (e.g., CLIP for image/text)
- Query: text → embedding
- Retrieve: images and text by similarity
- Prompt: include both retrieved captions and images (if model supports)
Example (CLIP-like):
# Pseudocode
image_index.add([img_embeds], metadata=[{"caption": "...", "id": ...}])
q_embed = clip.encode_text(query)
candidates = image_index.search(q_embed, top_k=20)
reranked = rerank_cross_encoder(query, candidates)
Tabular RAG (Text-to-SQL)
When queries span structured data:
- Steps: intent → schema selection → text-to-SQL → execute → explain
- Use a SQL execution tool with safe read/write gating
- For many tables: schema retriever first (semantic + metadata)
Text-to-SQL pipeline:
plan = classify_intent(query)
schemas = select_schemas(query, all_schemas) # vector + rules
sql = text2sql_model.generate(query, schemas)
result = sql_executor.run(sql)
answer = llm.generate(f"Question: {query}\nSQL:\n{sql}\nResult:\n{result}\nExplain the answer.")
Safety:
- Sandbox execution
- Read-only by default; writes gated by human approval
- Validate SQL against allowlist
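A coarse read-only gate sketch for these rules (keyword blocking is an assumption and easy to evade; pair it with database-level read-only credentials and a sandbox):
GATED = {"insert", "update", "delete", "drop", "alter", "truncate",
         "create", "grant", "revoke", "merge"}

def safe_sql_execute(sql, conn):
    normalized = sql.strip().lower()
    if not normalized.startswith("select"):
        raise PermissionError("Only SELECT statements are allowed.")
    if GATED & set(normalized.split()):
        raise PermissionError("Statement contains a gated keyword.")
    return conn.execute(sql).fetchall()  # run on a read-only connection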
Agents — Practical Architecture
An agent = model + tools + planner. It:
- Understands task (intent)
- Plans (decompose into steps/actions)
- Uses tools (function calling)
- Reflects (verify, correct)
- Acts (optionally write to environment)
- Uses memory (short- and long-term)
Tool Categories
- Knowledge: retrievers, web/search APIs, internal APIs
- Capability: calculator, code interpreter, unit/timezone converters, translators, OCR/captioning
- Write actions: email/send, DB writes, PR creation; strictly gated
Safety:
- Principle of least privilege
- Human-in-the-loop for risky actions
- Guardrails: input validation, policy checks, content filtering
- Code/Prompt injection defenses
Function Calling — Code Pattern
Declare tool inventory with schemas (name, description, parameters). Let the model decide when to call tools; route calls and post results back to the model.
Pseudocode:
import json

tools = [
    {
        "name": "lbs_to_kg",
        "description": "Convert pounds to kilograms.",
        "parameters": {"type": "object",
                       "properties": {"lbs": {"type": "number"}},
                       "required": ["lbs"]}
    },
    {
        "name": "fetch_top_products",
        "description": "Top N products by sales in [start_date, end_date].",
        "parameters": {"type": "object", "properties": {
            "start_date": {"type": "string", "format": "date"},
            "end_date": {"type": "string", "format": "date"},
            "n": {"type": "integer", "minimum": 1, "maximum": 100}
        }, "required": ["start_date", "end_date", "n"]}
    },
]

resp = llm.chat(messages, tools=tools, tool_choice="auto")
if resp.tool_calls:
    for call in resp.tool_calls:
        # Validate parameters against the declared schema before dispatch
        out = call_tool(call.name, call.arguments)
        messages.append({"role": "tool", "name": call.name,
                         "content": json.dumps(out)})
    resp2 = llm.chat(messages, tools=tools)  # continue with tool results
Tips:
- Always log tool name, params, and outputs
- Validate types/ranges; fill defaults
- Return structured outputs (JSON)
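A sketch of schema validation before dispatch, using the jsonschema package (call_tool and the tool-call shape follow the pseudocode above):
import json
import jsonschema

def call_tool_checked(call, tools):
    schema = next(t["parameters"] for t in tools if t["name"] == call.name)
    args = json.loads(call.arguments) if isinstance(call.arguments, str) else call.arguments
    try:
        jsonschema.validate(instance=args, schema=schema)
    except jsonschema.ValidationError as e:
        # Return the error so the model can retry with corrected arguments
        return {"error": f"invalid arguments for {call.name}: {e.message}"}
    return call_tool(call.name, args)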
Planning — Decouple Plan from Execution
Why: avoid running bad plans; enable validation and parallelization.
Agent loop:
- Generate plan (natural language or tool sequence)
- Validate plan (rules + AI judge)
- Execute steps (sequential/parallel/conditional)
- Reflect on outcomes and update plan
- Stop when goal met or max steps reached
Plan schema (natural language, model-agnostic):
{
  "goal": "Find price of last week's best-selling product",
  "steps": [
    {"action": "get_current_date"},
    {"action": "retrieve_best_sellers", "args": {"window": "last_week", "top_k": 1}},
    {"action": "get_product_info", "args": {"product_name": "$.steps[1].output[0].name"}},
    {"action": "answer", "args": {"style": "brief", "include_sources": true}}
  ]
}
Translator: map high-level actions → tool calls; easier to maintain across tool API changes.
Validation rules:
- All actions known and allowed
- Arguments well-typed and in-range
- Steps <= max_steps; risky actions gated
- Dependencies exist (no missing outputs)
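A sketch of these rules as a plan validator (the plan shape follows the schema above; find_step_refs, which extracts "$.steps[n]" references from args, is a hypothetical helper):
def validate_plan(plan, tool_registry, max_steps=10,
                  risky=frozenset({"send_email", "db_write"})):
    errors = []
    steps = plan.get("steps", [])
    if len(steps) > max_steps:
        errors.append(f"too many steps: {len(steps)} > {max_steps}")
    for i, step in enumerate(steps):
        action = step.get("action")
        if action not in tool_registry:
            errors.append(f"step {i}: unknown action '{action}'")
            continue
        if action in risky and not step.get("approved"):
            errors.append(f"step {i}: risky action '{action}' needs approval")
        for ref in find_step_refs(step.get("args", {})):
            if ref >= i:
                errors.append(f"step {i}: depends on output of later step {ref}")
    return errors  # empty list means the plan passes rule checks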
Control Flows in Agents
Support beyond sequential:
- Parallel: run independent fetches concurrently
- Conditional (if-else): branch on tool outputs
- Loop: iterate until condition met (with safety caps)
Example:
# Parallel
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as ex:
    futs = [ex.submit(fetch_price, p) for p in products]
    results = [f.result() for f in futs]

# Conditional
if earnings["surprise"] < -0.05:
    action = "consider_sell"
else:
    action = "hold_or_buy"

# Loop with guard
attempts = 0
while not done and attempts < 5:
    attempts += 1
    plan = refine_plan(last_feedback)
Choose an agent framework that supports these flows natively if your tasks need them.
Reflection and Error Correction
Implement reflection at:
- Pre-execution (plan sanity)
- Step-by-step (after tool outputs)
- Post-execution (goal achieved? constraints satisfied?)
ReAct-style prompt (simplified):
You are solving: {{TASK}}
At each step, produce:
Thought: your reasoning
Action: one of [TOOL_NAME, Finish]
Action Input: JSON arguments
When sufficient, use Action: Finish with the final answer.
History:
{{TRAJECTORY}}
Reflexion loop (pseudocode):
for attempt in range(MAX_ATTEMPTS):
    plan = planner.generate(task, memory)
    if not validator.is_valid(plan):
        continue
    outcome = executor.run(plan)
    score, feedback = evaluator.score(outcome, constraints)
    if score >= PASS:
        return outcome.final_answer
    reflection = llm.generate(f"Why did we fail? Suggest improvements.\n{feedback}")
    memory.update_with_reflection(reflection)
Trade-offs:
- Gains: accuracy and robustness
- Costs: token and latency overhead; cap steps and token budgets
Tool Selection — Practical Process
- Start minimal; add tools only if they measurably improve success
- Instrument usage: frequencies, error rates, time per tool
- Ablation: remove a tool → does performance drop?
- If a tool is consistently hard to use (invalid params, low success), simplify or replace it
- Keep tool descriptions concise and precise; include parameter constraints and examples
Analytics to track:
- Tool call count and error rate per tool
- Average tokens/latency/cost contribution
- Common invalid parameter patterns
- Tool transition pairs (X→Y) to identify compound tools
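A sketch of these analytics over logged trajectories (the log record fields are assumptions):
from collections import Counter

calls, errors, transitions = Counter(), Counter(), Counter()

def record(trajectory):  # trajectory: list of {"tool": str, "ok": bool}
    for prev, cur in zip(trajectory, trajectory[1:]):
        transitions[(prev["tool"], cur["tool"])] += 1
    for entry in trajectory:
        calls[entry["tool"]] += 1
        if not entry["ok"]:
            errors[entry["tool"]] += 1

error_rate = {t: errors[t] / calls[t] for t in calls}
compound_candidates = transitions.most_common(5)  # frequent X→Y pairs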
Agent Failure Modes — What to Detect
Planning failures
- Invalid tool names
- Invalid parameters (missing/wrong types/ranges)
- Incorrect parameter values (wrong date window)
- Goal failure (wrong target or constraints violated)
- Premature “done” due to faulty reflection
Tool failures
- Tool output wrong (captioning/SQL)
- Translation layer errors (plan→tool mismatch)
- Missing tool (agent lacks required capability)
Efficiency failures
- Too many steps/calls (cost blowup)
- Slow tools blocking user experience
- Not using parallelizable steps in parallel
Instrumentation checklist:
- Log: plan, tool calls (name/params/out), tokens, time per step, final answer
- Error taxonomy: plan vs tool vs environment
- Threshold alerts: invalid tool rate, avg steps/query, latency SLOs
Agent Evaluation — Metrics and Benchmarks
Plan validity
- % valid plans
- Avg attempts to valid plan
- % invalid tool calls
- % invalid params
- % incorrect param values
Task success
- Success rate under constraints (budget, time)
- LLM-judge/scorer metrics with rubrics
- End-to-end latency and token cost
Efficiency
- Steps per task; tool calls per step
- Time and cost per tool
- Parallelization coverage
Benchmarking tips:
- Build a representative task set with constraints
- Include time-sensitive tasks if relevant
- Add adversarial cases (missing info, injection attempts)
- Compare against baselines (human operator, simpler agent)
Memory for RAG and Agents — Practical Patterns
Memory layers:
- Internal knowledge (model weights) — slow to update
- Short-term (context window) — fast but limited
- Long-term (external store) — scalable and persistent
Short-term memory budgeting:
- Reserve X% of prompt for retrieved context (e.g., 30%)
- Keep recent conversation turns + task-critical state
- Overflow moves to long-term memory
Eviction strategies:
- FIFO (simple, brittle)
- Summarization + entity tracking (preferred)
- Redundancy removal (keep facts, drop verbose)
- Recency + importance scoring
Summarization prompt:
Summarize the conversation into 150–200 tokens focusing on goals, decisions, constraints, and key facts. Maintain entity names and values. Omit chit-chat.
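A minimal eviction sketch combining recency with summarization: keep the most recent turns verbatim and fold overflow into a running summary via the prompt above (count_tokens and summarization_prompt are hypothetical helpers):
def evict(turns, summary, budget_tokens, llm):
    kept, used = [], 0
    for turn in reversed(turns):          # newest first
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.insert(0, turn)
        used += cost
    overflow = turns[:len(turns) - len(kept)]
    if overflow:
        # Fold evicted turns into the long-term summary
        summary = llm.generate(summarization_prompt(summary, overflow))
    return kept, summary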
Conflict handling:
- Mark facts with timestamps/sources
- Prefer latest for volatile facts; retain conflicting views if helpful
- Use LLM to adjudicate contradictions when needed
Long-term memory retrieval:
- Treat like RAG: embed summaries, entities, decisions; hybrid search
- Attach memory snippets to prompts when relevant
Data structures:
- Conversation summary store (by thread/user)
- Fact store (key-value with provenance)
- Task state store (plan, step results, reflections)
- Tool output logs (for audit and learning)
Secure Write Actions — Operational Guardrails
- Allowlist tools; deny by default
- Parameter validation + type checks
- Policy evaluation (e.g., spending limits, data access scope)
- Human approval checkpoints for risky actions (DB writes, financial)
- Sandboxed code execution (no network/filesystem unless allowed)
- Prompt and code injection defenses (sanitize inputs; strip HTML/JS where needed)
- Audit trail (immutable logs of actions and approvals)
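A sketch combining these guardrails into one gated dispatcher (validate_args, the policy and approver objects, and call_tool are assumptions):
import json
import time

WRITE_ALLOWLIST = {"send_email", "create_pr"}  # deny everything else

def execute_write(call, policy, approver, audit_log):
    if call.name not in WRITE_ALLOWLIST:
        raise PermissionError(f"write tool '{call.name}' not allowlisted")
    args = validate_args(call.name, call.arguments)  # schema/type checks
    if not policy.allows(call.name, args):           # e.g., spending limits
        raise PermissionError("policy check failed")
    if not approver.approve(call.name, args):        # human-in-the-loop
        raise PermissionError("human approval denied")
    result = call_tool(call.name, args)
    audit_log.append({"ts": time.time(), "tool": call.name,  # append-only trail
                      "args": args, "result": json.dumps(result)[:1000]})
    return result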
Ready-to-Use Prompts and Schemas
Query Rewriter (self-contained)
Rewrite the last user request to a fully self-contained, specific query.
If essential information is missing, respond with:
INSUFFICIENT CONTEXT: <list required info>
Conversation:
{{HISTORY}}
User: {{LAST_UTTERANCE}}
Rewrite:
LLM Judge (document relevance)
You are grading whether the document helps answer the query.
Query: {{Q}}
Document: {{DOC}}
Respond JSON:
{"label": "relevant|partially_relevant|irrelevant", "rationale": "<short reason>"}
ReAct Step Format
Thought: ...
Action: <TOOL_NAME or Finish>
Action Input: <JSON args or final answer>
Tool Schema Example (JSON Schema)
{
  "name": "fetch_user_payments",
  "description": "Get user's payments between start_date and end_date.",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": {"type": "string"},
      "start_date": {"type": "string", "format": "date"},
      "end_date": {"type": "string", "format": "date"}
    },
    "required": ["user_id", "start_date", "end_date"]
  }
}
Default Settings — Sensible Starting Points
- Chunking: 1,000 tokens, 15% overlap
- Top-k retrieval: 5 (tune 3–8)
- Hybrid search: BM25 M=200 + Vector M=200 → RRF → rerank top 50
- Reranking: cross-encoder top 20 → final top 5 in prompt
- Prompt budget: 30–40% retrieval, 60–70% instructions + user + memory
- Embedding model: strong general model (e.g., bge-large, E5-large) or vendor-provided
- Vector index: HNSW ef_search=100, M=32 for ≤5M vectors
- Time-decay: exponential decay with half-life tuned to domain (e.g., 7 days for news)
Common Pitfalls and How to Avoid Them
- Over-indexing tiny chunks → slow, costly searches; increase chunk size
- Losing critical tokens at chunk boundaries → add overlap, augment with titles/summaries
- Missing keyword match (error codes) in semantic-only systems → hybrid search with metadata
- Stale embeddings after content changes → incremental re-embedding; content hash to detect diffs (sketch after this list)
- Tool misuse (invalid params) → strict schema validation + examples in tool descriptions
- Unbounded agent loops → set step/token caps; require progress checks; watchdog timers
- Security gaps for write tools → least privilege, approvals, sandboxing, audit trails
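A sketch of incremental re-embedding with content hashes, as suggested above (the store and embed interfaces are assumptions):
import hashlib

def reindex(chunks, store, embed):
    for chunk in chunks:
        digest = hashlib.sha256(chunk.text.encode("utf-8")).hexdigest()
        if store.get_hash(chunk.id) == digest:
            continue                      # unchanged: skip re-embedding
        store.upsert(chunk.id, embed(chunk.text), digest, chunk.metadata)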
Quick Decision Frameworks
Should I use RAG or longer context?
- Knowledge base ≤200k tokens and rarely changes → try long-context prompt first
- Otherwise → RAG for scalability and cost control; hybrid retrieval
Term-based vs embedding-based?
- Heavy on IDs/codes/precise keywords → term-based baseline
- Natural language, ambiguity, multilingual → add embeddings
- Best overall → hybrid with reranking
Which ANN index?
- Need high recall, RAM OK → HNSW
- Scale and memory constraints → IVF-PQ (accept slight recall loss)
Do I need agents or just RAG?
- Single-turn Q&A with stable docs → RAG
- Multi-step tasks, tool orchestration, conditional flows → Agents
When to add write actions?
- When read-only automation is valuable and safe; add writes only with explicit safety gates and human approvals
RAG Deployment Checklist
- Define knowledge sources and access controls
- Choose retrieval strategy (BM25, embeddings, hybrid)
- Implement robust chunking + overlap + metadata
- Build vector index (HNSW/IVF-PQ) and BM25 index
- Add query rewriting and reranking
- Assemble prompts with clear system instructions
- Implement evaluation (precision/recall, answer quality)
- Add caches (embedding, retrieval)
- Monitor tokens/cost/latency; optimize top_k and reranking thresholds
Agent Deployment Checklist
- Define environment, tools (read/write), and constraints
- Implement function calling with strict schemas
- Plan/validate/execute loop with step caps
- Reflection: ReAct/Reflexion for robustness
- Translator for natural-language plans → tool calls
- Control flows: sequential, parallel, conditionals, loops
- Safety: validation, approvals, sandbox, logging, injection defenses
- Evaluation: plan validity, success rate, cost/latency, tool error rates
- Tool analytics and ablations for inventory optimization
Memory Management Checklist
- Budget short-term vs retrieval tokens per prompt
- Summarize conversations; track entities/facts with provenance
- Evict using recency + importance + summarization
- Store reflections, plans, and tool outputs for continuity
- Retrieve long-term memory via hybrid search when relevant
- Handle conflicting facts with timestamps and adjudication rules
Example: End-to-End RAG+Agent (Text-to-SQL + Docs)
High-level flow:
- Rewrite query → detect intent (needs SQL? docs? both?)
- If SQL:
- Select schemas → generate SQL → execute → capture results
- Retrieve docs (hybrid) → rerank
- Build final prompt with:
- Instructions
- Rewritten query
- SQL result (if applicable)
- Top-N doc chunks (with sources)
- Generate answer with citations
- Reflect: verify units, constraints, and consistency; if uncertain, ask for clarification
Execution skeleton:
query = rewrite(user_input, history)
intent = classify_intent(query)
sql_result = None
if intent.requires_sql:
    schemas = select_schemas(query, schema_catalog)
    sql = text2sql(query, schemas)
    sql_result = safe_sql_execute(sql)
docs = hybrid_retrieve(query, k=200)
reranked = rerank(query, docs, top_n=5)
prompt = assemble_prompt(instructions, query, sql_result, reranked)
answer = llm.generate(prompt)
verified = verify(answer, constraints)
if not verified.ok:
    answer = clarify_or_correct(answer, verified.feedback)
return answer
This reference is designed for fast decision-making and implementation. Use the defaults to get started, then iterate with evaluation and instrumentation to meet your accuracy, cost, and latency targets.