---
name: evaluating-rag
description: Evaluate RAG systems with hit rate, MRR, and faithfulness metrics, and compare retrieval strategies. Use when testing retrieval quality, generating evaluation datasets, comparing embeddings or retrievers, A/B testing, or measuring production RAG performance.
---
# Evaluating RAG Systems

A guide to measuring RAG performance, comparing strategies, and implementing continuous evaluation, with a focus on key metrics and practical testing approaches.
## When to Use This Skill
- Testing retrieval quality and accuracy
- Generating evaluation datasets for your domain
- Comparing different retrieval strategies (vector vs BM25 vs hybrid)
- A/B testing embedding models or rerankers
- Measuring production RAG performance
- Validating improvements after optimizations
- Comparing the 7 retrieval strategies in `src/` or the `src-iLand/` pipeline
## Key Evaluation Metrics

### Retrieval Metrics

**Hit Rate**: fraction of queries for which a relevant document appears in the top-k results
- Perfect: 1.0 (all queries found relevant docs)
- Good: 0.85+ (85%+ of queries successful)
- Needs work: <0.70

**MRR (Mean Reciprocal Rank)**: quality of the ranking
- Perfect: 1.0 (the relevant doc is always ranked first)
- Good: 0.80+ (the relevant doc is typically in the top 2-3)
- Formula: average of 1/rank of the first relevant doc across queries
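As a concrete sketch (not part of the skill's scripts), both metrics can be computed directly from the rank of the first relevant document for each query, using `None` for queries where nothing relevant was retrieved:

```python
def hit_rate_and_mrr(first_relevant_ranks):
    """Compute hit rate and MRR from per-query ranks (1-based).

    first_relevant_ranks: one entry per query, the rank of the first
    relevant document, or None if none appeared in the top-k.
    """
    n = len(first_relevant_ranks)
    hits = [r for r in first_relevant_ranks if r is not None]
    hit_rate = len(hits) / n
    # A miss contributes 0 to the reciprocal-rank sum
    mrr = sum(1.0 / r for r in hits) / n
    return hit_rate, mrr

# Example: 4 queries; relevant doc at ranks 1, 2, miss, 1
print(hit_rate_and_mrr([1, 2, None, 1]))  # (0.75, 0.625)
```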
### Response Metrics

- **Faithfulness**: response is grounded in the retrieved context (no hallucinations)
- **Correctness**: response is factually accurate against a reference answer
- **Relevancy**: response directly addresses the query
## Quick Decision Guide

### When to Evaluate
- After implementing → Baseline performance
- After optimization → Validate improvements
- Before production → Quality gate
- In production → Continuous monitoring
### What to Measure
- Development → Hit rate + MRR (retrieval quality)
- Production → All metrics (retrieval + response quality)
- A/B testing → Comparative metrics
### Dataset Size
- Quick test → 20-50 Q&A pairs
- Thorough eval → 100-200 pairs
- Production → 500+ pairs
## Quick Start Patterns

### Pattern 1: Basic Retrieval Evaluation
```python
from llama_index.core.evaluation import RetrieverEvaluator

# Create evaluator
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"],
    retriever=retriever
)

# Run evaluation; returns one result object per query
eval_results = await evaluator.aevaluate_dataset(qa_dataset)

# Aggregate per-query metrics into dataset-level scores
hit_rate = sum(r.metric_vals_dict["hit_rate"] for r in eval_results) / len(eval_results)
mrr = sum(r.metric_vals_dict["mrr"] for r in eval_results) / len(eval_results)
print(f"Hit Rate: {hit_rate:.3f}")
print(f"MRR: {mrr:.3f}")
```
### Pattern 2: Generate Evaluation Dataset
```python
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.openai import OpenAI

# Generate Q&A pairs from your documents
llm = OpenAI(model="gpt-4o-mini")
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

# Filter invalid entries (filter_qa_dataset is a custom helper;
# see reference-metrics.md for the filtering code)
qa_dataset = filter_qa_dataset(qa_dataset)

# Save for reuse
qa_dataset.save_json("evaluation_dataset.json")
```
### Pattern 3: Compare Multiple Strategies
```python
strategies = {
    "vector": vector_retriever,
    "bm25": bm25_retriever,
    "hybrid": hybrid_retriever,
    "metadata": metadata_retriever,
}

results = {}
for strategy_name, retriever in strategies.items():
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"],
        retriever=retriever
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    # Aggregate the per-query results into mean scores
    results[strategy_name] = {
        metric: sum(r.metric_vals_dict[metric] for r in eval_results) / len(eval_results)
        for metric in ("hit_rate", "mrr")
    }
    print(f"{strategy_name}: {results[strategy_name]}")

# Find best strategy by hit rate
best_strategy = max(results, key=lambda name: results[name]["hit_rate"])
print(f"\nBest strategy: {best_strategy}")
```
### Pattern 4: Compare With/Without Reranking
```python
from llama_index.core.retrievers import BaseRetriever
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Without reranking
retriever_no_rerank = index.as_retriever(similarity_top_k=5)

# With reranking. Note: as_retriever() does not accept node_postprocessors
# (those belong to the query engine), so wrap the base retriever and apply
# the reranker to its results.
class RerankingRetriever(BaseRetriever):
    def __init__(self, retriever, reranker):
        self._retriever = retriever
        self._reranker = reranker
        super().__init__()

    def _retrieve(self, query_bundle):
        nodes = self._retriever.retrieve(query_bundle)
        return self._reranker.postprocess_nodes(nodes, query_bundle=query_bundle)

retriever_with_rerank = RerankingRetriever(
    index.as_retriever(similarity_top_k=10),
    CohereRerank(top_n=5),
)

# Evaluate both and keep the aggregated scores
scores = {}
for name, retriever in [("no_rerank", retriever_no_rerank),
                        ("with_rerank", retriever_with_rerank)]:
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"],
        retriever=retriever
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    scores[name] = {
        m: sum(r.metric_vals_dict[m] for r in eval_results) / len(eval_results)
        for m in ("hit_rate", "mrr")
    }
    print(f"{name}: Hit Rate={scores[name]['hit_rate']:.3f}, MRR={scores[name]['mrr']:.3f}")

# Calculate relative improvement
improvement = (scores["with_rerank"]["hit_rate"] - scores["no_rerank"]["hit_rate"]) \
    / scores["no_rerank"]["hit_rate"]
print(f"Improvement: {improvement * 100:.1f}%")
```
### Pattern 5: Response Quality Evaluation
```python
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator
)

# Initialize evaluators
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

# Generate response
response = query_engine.query("What is machine learning?")

# Evaluate faithfulness (no hallucinations)
faithfulness_result = faithfulness_evaluator.evaluate_response(
    response=response
)
print(f"Faithfulness: {faithfulness_result.passing}")

# Evaluate relevancy
relevancy_result = relevancy_evaluator.evaluate_response(
    query="What is machine learning?",
    response=response
)
print(f"Relevancy: {relevancy_result.passing}")
```
## Your Codebase Integration

### For src/ Pipeline (7 Strategies)

**Compare All Strategies:**
```python
strategies = {
    "vector": "src/10_basic_query_engine.py",
    "summary": "src/11_document_summary_retriever.py",
    "recursive": "src/12_recursive_retriever.py",
    "metadata": "src/14_metadata_filtering.py",
    "chunk_decoupling": "src/15_chunk_decoupling.py",
    "hybrid": "src/16_hybrid_search.py",
    "planner": "src/17_query_planning_agent.py",
}
# Create an evaluation framework to compare all 7
```
**Baseline Performance:**
- Generate Q&A dataset from your documents
- Evaluate each strategy
- Identify best performer
- Use as baseline for improvements
### For src-iLand/ Pipeline (Thai Land Deeds)

**Thai-Specific Evaluation:**
```python
# Generate Thai Q&A pairs
llm = OpenAI(model="gpt-4o-mini")  # Supports Thai
qa_dataset = generate_question_context_pairs(
    thai_nodes,
    llm=llm,
    num_questions_per_chunk=2
)

# Test with Thai queries
thai_queries = [
    "โฉนดที่ดินในกรุงเทพ",   # Land deeds in Bangkok
    "นส.3 คืออะไร",          # What is NS.3
    "ที่ดินในสมุทรปราการ"     # Land in Samut Prakan
]
```
**Router Evaluation** (`src-iLand/retrieval/router.py`):
- Test index classification accuracy
- Test strategy selection appropriateness
- Measure end-to-end performance
**Fast Metadata Testing:**
- Validate <50ms response time
- Test filtering accuracy
- Compare with/without fast indexing
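The <50ms target above can be checked with a simple percentile timer. This is a sketch: `latency_percentile` is a hypothetical helper, and the lambda stands in for your fast metadata retrieval function.

```python
import time

def latency_percentile(fn, queries, pct=95):
    """Run fn on each query and return the pct-th percentile latency in ms."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    # Nearest-rank percentile
    idx = max(0, int(len(timings) * pct / 100) - 1)
    return timings[idx]

# Example with a stand-in retrieval function; swap in your retriever call
p95 = latency_percentile(lambda q: q.lower(), ["โฉนดที่ดิน"] * 100, pct=95)
print(f"p95 latency: {p95:.2f} ms")
assert p95 < 50, "fast metadata path missed the <50ms target"
```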
## Detailed References

Load these when you need comprehensive details:
- **reference-metrics.md**: Complete evaluation guide
  - All metrics (hit rate, MRR, faithfulness, correctness)
  - Dataset generation techniques
  - A/B testing frameworks
  - Production monitoring
  - Statistical significance testing
- **reference-agents.md**: Advanced techniques
  - Agents (FunctionAgent, ReActAgent)
  - Multi-agent systems
  - Query engines (Router, SubQuestion)
  - Workflow orchestration
  - Observability and debugging
## Common Workflows

### Workflow 1: Create Evaluation Dataset
1. **Prepare representative documents**
   - Sample from different categories
   - Include edge cases

2. **Generate Q&A pairs**

   ```python
   qa_dataset = generate_question_context_pairs(
       nodes,
       llm=llm,
       num_questions_per_chunk=2
   )
   ```

3. **Filter invalid entries**
   - Remove auto-generated artifacts
   - Load `reference-metrics.md` for the filtering code

4. **Manual review (optional)**
   - Check 10-20 samples
   - Ensure question quality

5. **Save for reuse**

   ```python
   qa_dataset.save_json("eval_dataset.json")
   ```
### Workflow 2: Compare Retrieval Strategies
1. **Load evaluation dataset**

   ```python
   from llama_index.core.llama_dataset import LabelledRagDataset
   qa_dataset = LabelledRagDataset.from_json("eval_dataset.json")
   ```

2. **Define strategies to compare**
   - List all retrievers to test
   - For `src/`: all 7 strategies
   - For `src-iLand/`: router + individual strategies

3. **Run evaluation for each**

   ```python
   for name, retriever in strategies.items():
       results[name] = evaluate(retriever, qa_dataset)
   ```

4. **Compare results**
   - Identify the best hit rate
   - Identify the best MRR
   - Consider trade-offs (latency, cost)

5. **Document findings**
   - Record baseline performance
   - Note the best strategies for different query types
### Workflow 3: A/B Test an Optimization
1. **Measure baseline**

   ```python
   baseline_results = evaluate(current_retriever, qa_dataset)
   ```

2. **Apply optimization**
   - Add reranking
   - Change embedding model
   - Adjust chunk size
   - etc.

3. **Measure optimized version**

   ```python
   optimized_results = evaluate(optimized_retriever, qa_dataset)
   ```

4. **Calculate improvement**

   ```python
   improvement = (optimized - baseline) / baseline * 100
   print(f"Hit Rate improvement: {improvement:.1f}%")
   ```

5. **Decide based on data**
   - If improvement > 5%: deploy
   - If improvement < 2%: consider cost/complexity
   - If negative: roll back
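The decision rule above can be captured in a small helper. A sketch: the 5%/2% thresholds are the rule-of-thumb values suggested here, not fixed rules, and `ab_decision` is a hypothetical name.

```python
def ab_decision(baseline_hit_rate, optimized_hit_rate,
                deploy_threshold=0.05, review_threshold=0.02):
    """Return a recommendation based on relative hit-rate improvement."""
    improvement = (optimized_hit_rate - baseline_hit_rate) / baseline_hit_rate
    if improvement < 0:
        return "rollback", improvement
    if improvement > deploy_threshold:
        return "deploy", improvement
    if improvement < review_threshold:
        return "review-cost-complexity", improvement
    # Between the two thresholds: weigh quality gain against added latency/cost
    return "judgment-call", improvement

print(ab_decision(0.80, 0.86))  # relative improvement ≈ 7.5% → deploy
print(ab_decision(0.80, 0.78))  # regression → rollback
```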
### Workflow 4: Production Monitoring
1. **Create production evaluation set**
   - Sample real user queries
   - Include ground truth when available

2. **Set up continuous evaluation**

   ```python
   class ProductionEvaluator:
       def evaluate_query(self, query, response):
           # Log metrics
           # Track over time
           ...
   ```

3. **Define alerts**
   - Hit rate < 0.80 → alert
   - MRR < 0.70 → alert
   - Latency p95 > 2s → alert

4. **Monitor trends**
   - Daily/weekly metrics
   - Detect degradation early

5. **Iterate based on data**
   - Identify failure patterns
   - Generate new test cases
   - Improve weak areas
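The alert thresholds from Step 3 can be expressed as a simple check. A sketch: `check_alerts` is a hypothetical helper, and the threshold values are the ones suggested above.

```python
def check_alerts(metrics, thresholds=None):
    """Return alert messages for metrics that breach their thresholds.

    metrics: dict with keys like 'hit_rate', 'mrr', 'latency_p95_s'.
    thresholds: {metric: ("min" | "max", limit)}; "min" fires when the
    value drops below the limit, "max" when it rises above it.
    """
    thresholds = thresholds or {
        "hit_rate": ("min", 0.80),
        "mrr": ("min", 0.70),
        "latency_p95_s": ("max", 2.0),
    }
    alerts = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported this window
        if kind == "min" and value < limit:
            alerts.append(f"{name}={value} below {limit}")
        if kind == "max" and value > limit:
            alerts.append(f"{name}={value} above {limit}")
    return alerts

print(check_alerts({"hit_rate": 0.75, "mrr": 0.72, "latency_p95_s": 2.5}))
# ['hit_rate=0.75 below 0.8', 'latency_p95_s=2.5 above 2.0']
```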
### Workflow 5: Evaluate All 7 Strategies (src/)
1. **Generate comprehensive dataset**
   - Cover different query types
   - Factual, summarization, comparison

2. **Run each strategy**

   ```bash
   python src/10_basic_query_engine.py          # Vector
   python src/11_document_summary_retriever.py  # Summary
   python src/12_recursive_retriever.py         # Recursive
   python src/14_metadata_filtering.py          # Metadata
   python src/15_chunk_decoupling.py            # Chunk decoupling
   python src/16_hybrid_search.py               # Hybrid
   python src/17_query_planning_agent.py        # Planner
   ```

3. **Collect metrics**
   - Hit rate for each
   - MRR for each
   - Latency for each

4. **Create comparison table**

   | Strategy | Hit Rate | MRR | Latency | Use Case |
   |---|---|---|---|---|
   | Vector | ... | ... | ... | General |
   | Hybrid | ... | ... | ... | Best overall |
   | ... | ... | ... | ... | ... |

5. **Document recommendations**
   - Best for factual queries
   - Best for complex queries
   - Best for production (speed + quality)
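The Step 4 comparison table can be generated from aggregated results with a small formatter. A sketch: `comparison_table` is a hypothetical helper, and it assumes you collected per-strategy metric dicts like those produced in Pattern 3.

```python
def comparison_table(results):
    """Render per-strategy metrics as a markdown table, best hit rate first.

    results: {strategy: {"hit_rate": float, "mrr": float, "latency_ms": float}}
    """
    lines = [
        "| Strategy | Hit Rate | MRR | Latency (ms) |",
        "|---|---|---|---|",
    ]
    for name, m in sorted(results.items(), key=lambda kv: -kv[1]["hit_rate"]):
        lines.append(
            f"| {name} | {m['hit_rate']:.3f} | {m['mrr']:.3f} | {m['latency_ms']:.0f} |"
        )
    return "\n".join(lines)

print(comparison_table({
    "vector": {"hit_rate": 0.85, "mrr": 0.78, "latency_ms": 120},
    "hybrid": {"hit_rate": 0.91, "mrr": 0.84, "latency_ms": 180},
}))
```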
## Evaluation Metrics Reference

### Hit Rate Interpretation
- 1.0 → Perfect (all queries successful)
- 0.90+ → Excellent
- 0.80-0.89 → Good
- 0.70-0.79 → Acceptable
- <0.70 → Needs improvement
### MRR Interpretation
- 1.0 → Perfect ranking (relevant doc always #1)
- 0.85+ → Excellent (relevant doc typically #1 or #2)
- 0.70-0.84 → Good
- 0.50-0.69 → Acceptable
- <0.50 → Poor ranking quality
### Latency Targets
- <100ms → Excellent
- 100-500ms → Good
- 500ms-1s → Acceptable
- >1s → Needs optimization
## Performance Benchmarks

### Embedding Model Comparison (from reference docs)
| Embedding | Reranker | Hit Rate | MRR |
|---|---|---|---|
| JinaAI Base | bge-reranker-large | 0.938 | 0.869 |
| JinaAI Base | CohereRerank | 0.933 | 0.874 |
| OpenAI | CohereRerank | 0.927 | 0.866 |
| OpenAI | bge-reranker-large | 0.910 | 0.856 |
### Typical Improvements
- Adding reranking: +5-15% hit rate
- Hybrid vs vector: +3-8% hit rate
- Optimal chunk size: +2-5% hit rate
- Better embeddings: +3-10% hit rate
## Scripts

This skill includes utility scripts in the `scripts/` directory:
### generate_qa_dataset.py

Generate evaluation Q&A pairs from documents:
```bash
python .claude/skills/evaluating-rag/scripts/generate_qa_dataset.py \
    --documents-dir ./data \
    --output eval_dataset.json \
    --num-questions-per-chunk 2
```
### compare_retrievers.py

Compare multiple retrieval strategies:
```bash
python .claude/skills/evaluating-rag/scripts/compare_retrievers.py \
    --dataset eval_dataset.json \
    --strategies vector,bm25,hybrid \
    --output comparison_results.json
```
Outputs:
- Hit rate and MRR for each strategy
- Performance comparison table
- Recommendations
### run_evaluation.py

Run a comprehensive evaluation:
```bash
python .claude/skills/evaluating-rag/scripts/run_evaluation.py \
    --retriever-config config.yaml \
    --dataset eval_dataset.json \
    --metrics hit_rate,mrr,faithfulness
```
Reports:
- All requested metrics
- Per-query breakdown
- Summary statistics
## Key Reminders

**Dataset Quality:**
- Generate from your actual documents
- Include diverse query types
- Filter invalid auto-generated entries
- Manual review recommended for critical domains
**Evaluation Best Practices:**
- Start with baseline (before optimization)
- Test one change at a time (for clear attribution)
- Use same dataset for comparisons
- Check that gains are statistically significant, not noise (as a rule of thumb, trust improvements above ~5%)
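One lightweight way to run that significance check is a paired bootstrap on per-query hit indicators. A sketch: it assumes you kept the per-query 0/1 hit results for both systems on the same evaluation dataset, and `bootstrap_p_value` is a hypothetical helper.

```python
import random

def bootstrap_p_value(hits_a, hits_b, iterations=10_000, seed=0):
    """Paired bootstrap: how often do resampled queries show no gain of B over A?

    hits_a, hits_b: per-query 0/1 hit indicators from the same eval dataset,
    aligned by query. Small return values suggest the improvement of B is
    unlikely to be noise.
    """
    rng = random.Random(seed)
    n = len(hits_a)
    pairs = list(zip(hits_a, hits_b))
    not_better = 0
    for _ in range(iterations):
        # Resample query indices with replacement, keeping pairs aligned
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        if sum(b for _, b in sample) <= sum(a for a, _ in sample):
            not_better += 1
    return not_better / iterations

# Synthetic example: system B hits on 5 more of 50 queries than system A
hits_a = [1] * 35 + [0] * 15
hits_b = [1] * 40 + [0] * 10
print(f"p ≈ {bootstrap_p_value(hits_a, hits_b):.4f}")
```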
**Production Monitoring:**
- Continuous evaluation on sample queries
- Track trends over time
- Alert on degradation
- Regular dataset refresh
**For Your Pipelines:**
- `src/`: Compare all 7 strategies systematically
- `src-iLand/`: Test with Thai queries and metadata
- Both: Establish baselines before optimizations
## Next Steps

After evaluation:
- **Optimize**: Use the optimizing-rag skill to improve low scores
- **Implement**: Use the implementing-rag skill to rebuild weak components
- **Monitor**: Set up continuous evaluation in production
- **Iterate**: Regular evaluation → optimization → re-evaluation cycle