---
name: agent:rag
description: RAG Pipeline Design - guides through chunking, embedding, vector store selection, retrieval tuning, and RAG alternatives
argument-hint: ["description or path"]
---
# RAG Pipeline Design
Guides the user through designing a Retrieval-Augmented Generation (RAG) pipeline. Based on "Principles of Building AI Agents" (Bhagwat & Gienow, 2025), Part V: RAG (Chapters 17-20).
## When to use
Use this skill when the user needs to:
- Design a RAG pipeline for an agent
- Choose a vector database
- Configure chunking, embedding, and retrieval
- Evaluate whether RAG is even needed (vs. alternatives)
- Tune an existing RAG pipeline for better quality
## Instructions

### Step 1: Do You Actually Need RAG?
Before building a pipeline, apply the principle: **start simple, check quality, get complex.**
Use AskUserQuestion to assess:
## RAG Decision Tree
### Step 1: How large is your corpus?
- **< 200 pages** → Try full context loading first (Gemini 2M, Claude 200K)
- **200-10,000 pages** → Consider agentic RAG (tools that query data) OR traditional RAG
- **> 10,000 pages** → Traditional RAG pipeline is likely needed
### Step 2: What is the query pattern?
- **Factual lookup** ("What is X?") → RAG works well
- **Analytical** ("Compare X and Y across documents") → Agentic RAG may be better
- **Conversational** ("Tell me about...") → Either works
### Step 3: How structured is the data?
- **Highly structured** (tables, databases) → Use tools/APIs, not RAG
- **Semi-structured** (markdown, HTML) → RAG with format-specific chunking
- **Unstructured** (PDFs, free text) → Traditional RAG
Recommended progression:
1. Load the entire corpus into a large context window.
2. Write functions that query the dataset and give them to the agent as tools.
3. Only if steps 1 and 2 fail on quality, build a RAG pipeline.
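Step 2 of the progression can be as simple as handing the agent a plain search function as a tool. A minimal sketch (the keyword scoring and the `make_search_tool` name are illustrative stand-ins for whatever queryable dataset you have — SQL, filesystem, or API):

```python
def make_search_tool(corpus: dict[str, str]):
    """Return a keyword-search function an agent can call as a tool.

    corpus maps document id -> text; a stand-in for any queryable dataset.
    """
    def search(query: str, limit: int = 3) -> list[dict]:
        terms = query.lower().split()
        scored = []
        for doc_id, text in corpus.items():
            # Naive relevance: total occurrences of query terms.
            hits = sum(text.lower().count(t) for t in terms)
            if hits:
                scored.append({"id": doc_id, "score": hits, "text": text})
        scored.sort(key=lambda d: d["score"], reverse=True)
        return scored[:limit]
    return search

search = make_search_tool({
    "faq": "Refunds are processed within 5 days.",
    "policy": "Refund requests require an order id.",
})
results = search("refund")
```

If results from a tool like this are good enough, you have avoided an entire ingestion and embedding pipeline.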
If the user decides RAG is needed, proceed. Otherwise, recommend the simpler alternative.
### Step 2: Chunking Strategy
Design how documents are split into retrievable pieces:
## Chunking Strategy
### Method
| Strategy | Best For | Description |
|----------|----------|-------------|
| Recursive | General text | Splits by paragraph, then sentence, then character |
| Token-aware | LLM optimization | Splits by token count, respects model limits |
| Format-specific | Markdown/HTML/JSON | Uses document structure (headers, tags, keys) |
| Semantic | High quality needs | Uses LLM to identify natural topic boundaries |
**Selected:** [Strategy]
### Parameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Chunk size | [256-1024 tokens] | Balance: smaller = more precise, larger = more context |
| Overlap | [50-200 tokens] | Prevents losing context at chunk boundaries |
| Metadata | [title, source, date, section, page] | Enables filtered retrieval |
### Document-Specific Rules
| Document Type | Chunking Rule |
|--------------|---------------|
| [Markdown docs] | Split on ## headers, keep header as metadata |
| [PDFs] | Page-based with overlap, extract title/section |
| [Code files] | Function/class-level chunks |
| [Chat logs] | Message groups of [N] turns |
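The size/overlap mechanics above can be sketched as a simple word-based splitter (a production splitter would count model tokens, e.g. via a tokenizer, rather than words):

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks of `size` words, repeating
    `overlap` words at each boundary so context is not lost mid-thought."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        # Step forward by size minus overlap, so chunks share a boundary.
        start += size - overlap
    return chunks
```

Note the trade-off encoded in the parameters: larger `overlap` costs more storage and embedding tokens but makes boundary-straddling facts retrievable from either side.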
### Step 3: Embedding Configuration
Choose how chunks become vectors:
## Embedding
### Model Selection
| Model | Dimensions | Quality | Cost | Speed |
|-------|-----------|---------|------|-------|
| OpenAI text-embedding-3-large | 3072 | High | $0.13/M tokens | Fast |
| OpenAI text-embedding-3-small | 1536 | Good | $0.02/M tokens | Fast |
| Voyage voyage-3 | 1024 | High | $0.06/M tokens | Fast |
| Cohere embed-v3 | 1024 | High | $0.10/M tokens | Fast |
| Local (e5-large, BGE) | 1024 | Good | Free (compute) | Varies |
**Selected:** [Model]
### Indexing
| Parameter | Value |
|-----------|-------|
| Dimensions | [From model] |
| Similarity metric | Cosine (most common) |
| Index type | HNSW (default, good balance of speed/accuracy) |
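Cosine similarity, the metric above, compares vector direction and ignores magnitude, which is why embeddings are often stored normalized. A quick self-contained illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Scaling a vector does not change its cosine similarity:
v = [1.0, 2.0, 3.0]
print(round(cosine(v, [2.0, 4.0, 6.0]), 6))  # → 1.0
```

In practice the vector DB computes this for you; the index type (HNSW) only changes how candidates are found, not how similarity is scored.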
### Step 4: Vector Database Selection
Apply the principle: **prevent infra sprawl** — vector DB choice is largely commoditized, so prefer whatever adds the least new infrastructure.
Use AskUserQuestion:
## Vector Database
### Decision Matrix
| Option | When to Choose | Pros | Cons |
|--------|---------------|------|------|
| **pgvector** (Postgres extension) | Already using Postgres | No new infra, familiar SQL, metadata filtering | May need tuning at scale |
| **Pinecone** (managed) | New project, want simplicity | Fully managed, fast, scalable | Additional service + cost |
| **Chroma** (open-source) | Local dev, small scale | Free, easy setup | Self-host in production |
| **Cloud-native** (Cloudflare, DataStax) | Already on that cloud | Integrated billing, low latency | Vendor lock-in |
**Selected:** [Database]
**Rationale:** [Why]
### Step 5: Retrieval Configuration
Design how the agent queries the vector store:
## Retrieval
### Query Strategy
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| topK | [3-10] | Number of chunks to retrieve |
| similarityThreshold | [0.7-0.9] | Min relevance to include |
| reranking | [Yes/No] | Post-retrieval quality boost |
### Hybrid Queries
Combine vector similarity with metadata filters:
| Filter | Type | Example |
|--------|------|---------|
| Date range | Metadata | Only docs from last 30 days |
| Category | Metadata | Only "technical" documents |
| Source | Metadata | Only from "docs.example.com" |
| User access | Metadata | Only docs user has permission to see |
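In memory, a hybrid query is just a metadata filter applied alongside similarity ranking. A sketch, assuming each record carries a vector and a metadata dict (and that vectors are pre-normalized, so dot product equals cosine similarity):

```python
def hybrid_search(records, query_vec, top_k=3, threshold=0.0, **filters):
    """Rank records by similarity after exact-match metadata filtering.

    records: list of {"vector": [...], "meta": {...}, "text": str}.
    filters: metadata constraints, e.g. category="technical".
    """
    def matches(meta):
        return all(meta.get(k) == v for k, v in filters.items())

    scored = []
    for rec in records:
        if not matches(rec["meta"]):
            continue  # filtered out before similarity is even computed
        score = sum(a * b for a, b in zip(rec["vector"], query_vec))
        if score >= threshold:
            scored.append((score, rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:top_k]]
```

Real vector DBs (pgvector, Pinecone) push the filter into the index itself, which matters for access-control filters: chunks the user cannot see should never reach the candidate set.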
### Reranking (Optional)
- **When to use:** Quality matters more than latency
- **How:** Retrieve topK * 3 candidates, rerank with a cross-encoder, return topK
- **Models:** Cohere Rerank, bge-reranker, cross-encoder/ms-marco
- **Cost:** More expensive per query, but runs only on candidates (not full corpus)
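The oversample-then-rerank flow looks like this, with the cross-encoder stubbed out as a scoring callback (`rerank_score` is a placeholder for a real model call such as Cohere Rerank):

```python
def retrieve_and_rerank(search, rerank_score, query, top_k=5):
    """Retrieve top_k * 3 candidates cheaply, then re-order the small
    candidate set with a more expensive scorer and keep the best top_k.

    search(query, limit) -> list of chunks (fast, approximate).
    rerank_score(query, chunk) -> float (slow, accurate).
    """
    candidates = search(query, top_k * 3)
    candidates.sort(key=lambda chunk: rerank_score(query, chunk), reverse=True)
    return candidates[:top_k]
```

The cost asymmetry is the whole point: the expensive scorer runs `top_k * 3` times per query regardless of corpus size.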
### Query Transformation (Optional)
- **HyDE:** Generate a hypothetical answer, use it as the search query
- **Multi-query:** Generate multiple query variations, merge results
- **Step-back:** Abstract the query to a higher level, then search
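Multi-query results are commonly merged with reciprocal rank fusion (RRF), which rewards documents that rank well across several query variations. A sketch (`k=60` is the conventional RRF constant; doc ids stand in for retrieved chunks):

```python
def rrf_merge(result_lists, k=60, top_n=5):
    """Merge ranked lists of doc ids via reciprocal rank fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so ids ranked highly in several lists rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_n]
```

RRF needs no score calibration between lists — only ranks — which is why it pairs well with multi-query, where each variation may produce similarities on a different scale.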
### Step 6: Pipeline Architecture
Bring it all together:
## RAG Pipeline
### Ingestion Pipeline
1. **Load** documents from [source]
2. **Chunk** using [strategy] with [size] tokens, [overlap] overlap
3. **Enrich** metadata: source, date, category, section
4. **Embed** using [model]
5. **Upsert** into [vector DB]
6. **Schedule:** [On change / Nightly / Manual]
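Glued together, ingestion is a short function. In this sketch the chunker, embedder, and store are injected stand-ins (an in-memory dict replaces a real vector DB upsert); the shape, not the names, is the point:

```python
def ingest(docs, chunk, embed, store):
    """Run the ingestion pipeline: chunk each doc, attach metadata,
    embed, and upsert into `store` keyed by (source, chunk index).

    docs: list of {"source": str, "text": str}.
    chunk: text -> list[str].  embed: text -> vector.  store: dict.
    """
    for doc in docs:
        for i, piece in enumerate(chunk(doc["text"])):
            # Upsert semantics: re-running ingestion overwrites the key.
            store[(doc["source"], i)] = {
                "vector": embed(piece),
                "meta": {"source": doc["source"], "chunk": i},
                "text": piece,
            }
    return store
```

Keying by `(source, chunk index)` makes re-ingestion idempotent for unchanged documents — important when the schedule is "on change" rather than a full rebuild.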
### Query Pipeline
1. **Receive** user query
2. **Transform** query (optional: HyDE, multi-query)
3. **Embed** query using [same model as ingestion]
4. **Search** vector DB: topK=[N], filters=[metadata filters]
5. **Rerank** results (optional)
6. **Inject** top chunks into LLM context as `<retrieved_documents>`
7. **Generate** response with source attribution
### Architecture Diagram

```mermaid
graph LR
    subgraph Ingestion
        Docs[Documents] --> Chunk[Chunker]
        Chunk --> Embed[Embedder]
        Embed --> Store[(Vector DB)]
    end
    subgraph Query
        User[User Query] --> QEmbed[Query Embedder]
        QEmbed --> Search[Similarity Search]
        Store --> Search
        Search --> Rerank[Reranker]
        Rerank --> LLM[LLM + Context]
        LLM --> Response[Response]
    end
```
### Step 7: Quality Checklist
## RAG Quality Checklist
### Retrieval Quality
- [ ] Relevant documents consistently in top-K results
- [ ] Metadata filters working correctly
- [ ] No duplicate chunks in results
- [ ] Chunk size balances precision vs. context
### Generation Quality
- [ ] Responses are grounded in retrieved documents
- [ ] Source attribution is accurate
- [ ] Agent says "I don't know" when no relevant chunks found
- [ ] No hallucination beyond retrieved context
### Operational
- [ ] Ingestion pipeline runs on schedule
- [ ] New documents are available within [SLA]
- [ ] Vector DB latency < [target]ms
- [ ] Embedding costs within budget
### Step 8: Summarize and Offer Next Steps
Present all findings to the user as a structured summary in the conversation (including the pipeline diagram). Do NOT write to .specs/ — this skill works directly.
Use AskUserQuestion to offer:
- Implement pipeline — scaffold ingestion and query code
- Skip RAG — if the decision tree said RAG isn't needed, help with the alternative (full context or agentic tools)
- Comprehensive design — run `agent:design` to cover all areas with a spec
## Arguments
`$ARGUMENTS` (`$0`) - Optional description of the knowledge domain or path to existing RAG code
Examples:
- `agent:rag documentation search` — design RAG for a docs search agent
- `agent:rag src/rag/` — review and tune an existing RAG pipeline
- `agent:rag` — start fresh