---
name: libeval
description: >
  libeval - RAG evaluation system. Evaluator orchestrates quality assessment
  using LLM-as-judge patterns. CriteriaEvaluator scores responses against
  rubrics. RecallEvaluator measures retrieval performance. TraceEvaluator
  analyzes execution traces. EvalStore persists results. Use for automated
  quality testing, RAG pipeline evaluation, and agent performance testing.
---
# libeval Skill

## When to Use
- Evaluating RAG agent response quality
- Measuring retrieval recall and precision
- Running automated quality assessments
- Benchmarking agent performance over time
## Key Concepts

- **Evaluator**: Main orchestrator that runs test cases through the agent and collects metrics.
- **CriteriaEvaluator**: Uses LLM-as-judge to score responses against defined criteria and rubrics.
- **RecallEvaluator**: Measures how well the retrieval system returns relevant documents.
- **TraceEvaluator**: Analyzes execution traces for performance and correctness.
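- **EvalStore**: Persists evaluation results so runs can be compared over time.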
## Usage Patterns

### Pattern 1: Run evaluation suite

```javascript
import { Evaluator } from "@copilot-ld/libeval";

const evaluator = new Evaluator(config);
const results = await evaluator.run(testCases);
console.log(results.summary);
```
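The structure of individual test cases is not documented in this skill; the sketch below assumes each case pairs an input prompt with an expected answer, which is an illustrative guess rather than the library's confirmed schema.

```javascript
// Illustrative only: the `input` and `expected` field names are assumptions,
// not a confirmed @copilot-ld/libeval schema.
const testCases = [
  { input: "What is the refund policy?", expected: "Refunds are available within 30 days." },
  { input: "How do I rotate an API key?", expected: "Use the key rotation endpoint." },
];
```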
### Pattern 2: Criteria-based evaluation

```javascript
import { CriteriaEvaluator } from "@copilot-ld/libeval";

const criteria = new CriteriaEvaluator(llmClient);
const score = await criteria.evaluate(response, rubric);
```
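The rubric format is also not shown here; a minimal sketch, assuming a rubric is a plain object mapping criterion names to descriptions (the keys and structure are illustrative, not the library's confirmed schema):

```javascript
// Illustrative rubric; keys and structure are assumptions, not a
// documented @copilot-ld/libeval format.
const rubric = {
  accuracy: "The answer is factually consistent with the retrieved documents.",
  completeness: "The answer addresses every part of the question.",
  grounding: "Claims are supported by the cited sources, not invented.",
};

const score = await criteria.evaluate(response, rubric);
```

RecallEvaluator has no example above. As a rough sketch, recall measurement could compare retrieved document IDs against a known-relevant set; the constructor argument and the `evaluate(retrievedIds, relevantIds)` signature below are assumptions about the API, not documented behavior.

```javascript
import { RecallEvaluator } from "@copilot-ld/libeval";

// Assumed API: the constructor argument and evaluate() signature are
// illustrative, not confirmed by the package.
const recall = new RecallEvaluator(config);
const recallScore = await recall.evaluate(retrievedIds, relevantIds);
```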
## Integration

Configured via `config/eval.yml`. Run via `make eval`. Uses `libllm` for
LLM-as-judge.
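The description above says EvalStore persists results; a possible wiring is sketched below, assuming EvalStore exposes a `save` method, which is an assumption to verify against the package rather than a documented API.

```javascript
import { Evaluator, EvalStore } from "@copilot-ld/libeval";

// Assumed API: the EvalStore constructor argument and `save` method are
// illustrative, not confirmed by @copilot-ld/libeval.
const evaluator = new Evaluator(config);
const store = new EvalStore(config);

const results = await evaluator.run(testCases);
await store.save(results); // keep results for run-over-run benchmarking
```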