---
name: agent-eval-framework
description: Evaluate AI agent outputs systematically using rubrics, assertions, and reference comparisons. Detect quality drift over time.
version: "1.0.0"
last-updated: "2026-04-17"
model_tested: "claude-sonnet-4-6"
category: eval
platforms: [claude-code, codex, gemini-cli, cursor, copilot, windsurf, cline]
language: en
geo_relevance: [global]
priority: high
dependencies:
  mcp: []
  skills: []
  apis: []
  data: []
update_sources:
  - url: "https://arxiv.org/abs/2603.28052"
    check_frequency: "quarterly"
    last_checked: "2026-04-21"
license: MIT
---
Agent Evaluation Framework
When to Use
- Before deploying an agent to production
- After changing an agent's system prompt or skills
- When agent output quality seems to degrade
- During periodic quality reviews
- When comparing two agent configurations
Step 1: Define Evaluation Criteria
Choose criteria relevant to your agent's purpose:
Universal Criteria
| Criterion | Question | Score |
|---|---|---|
| Correctness | Is the output factually/technically correct? | 0-10 |
| Completeness | Does it cover all required aspects? | 0-10 |
| Relevance | Is every part relevant to the request? | 0-10 |
| Safety | Does it avoid harmful/insecure patterns? | 0-10 |
Code-Specific Criteria
| Criterion | Question | Score |
|---|---|---|
| Functionality | Does the code work as intended? | 0-10 |
| Edge Cases | Are edge cases handled? | 0-10 |
| Style | Does it match project conventions? | 0-10 |
| Security | Is it free of known vulnerabilities? | 0-10 |
Content-Specific Criteria
| Criterion | Question | Score |
|---|---|---|
| Accuracy | Are claims supported by evidence? | 0-10 |
| Tone | Does it match the intended audience? | 0-10 |
| Structure | Is it well-organized? | 0-10 |
| Originality | Does it avoid generic/cliché content? | 0-10 |
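Criteria can be captured as data so the same rubric is reused across runs. Below is a minimal sketch; the `Criterion` dataclass, the optional weights, and the `CODE_RUBRIC` example are illustrative, not part of any specific eval library.

```python
# Minimal sketch of a rubric definition; names are illustrative, not a library API.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str            # e.g. "Correctness"
    question: str        # the question the grader answers on a 0-10 scale
    weight: float = 1.0  # optional weighting; 1.0 = equal weight

# Example: a code-review rubric mixing universal and code-specific criteria
CODE_RUBRIC = [
    Criterion("Correctness", "Is the output factually/technically correct?"),
    Criterion("Functionality", "Does the code work as intended?"),
    Criterion("Edge Cases", "Are edge cases handled?"),
    Criterion("Security", "Is it free of known vulnerabilities?", weight=2.0),
]
```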
Step 2: Choose Evaluation Method
A. Assertion-Based (Automated)
Define pass/fail conditions:
ASSERT: output contains "disclaimer"
ASSERT: output does NOT contain "TODO"
ASSERT: code compiles without errors
ASSERT: response length < 2000 tokens
ASSERT: no PII detected in output
Best for: Regression testing, CI/CD pipelines.
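A minimal sketch of automating such assertions, assuming the agent output is available as a plain string. The helper name, the word-count proxy for tokens, and the crude PII regex are assumptions; a real pipeline would also plug in a compiler check and a proper PII detector.

```python
# Hypothetical assertion checks mirroring the examples above.
import re

def check_assertions(output: str) -> dict[str, bool]:
    """Return a pass/fail map for each assertion on a single agent output."""
    pii_pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # crude SSN-style check only
    return {
        "contains_disclaimer": "disclaimer" in output.lower(),
        "no_todo": "TODO" not in output,
        "length_under_limit": len(output.split()) < 2000,  # word count as a token proxy
        "no_pii": not pii_pattern.search(output),
    }

results = check_assertions("Disclaimer: this is a draft response.")
assert all(results.values()), f"Failed: {[k for k, v in results.items() if not v]}"
```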
B. Reference-Based (Semi-Automated)
Compare output against a known-good reference:
- Exact match (strict)
- Semantic similarity (using embeddings)
- Key-point coverage (checklist)
Best for: Consistent tasks with known expected outputs.
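For the checklist variant, key-point coverage can be approximated with simple matching, as in this sketch; the key points and threshold values are illustrative. Semantic similarity would instead embed both the output and the reference and compare the vectors.

```python
# Minimal key-point coverage check (the "checklist" variant above).
def key_point_coverage(output: str, key_points: list[str]) -> float:
    """Fraction of required key points mentioned in the output (case-insensitive)."""
    text = output.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)

coverage = key_point_coverage(
    "Use parameterized queries and validate all user input.",
    ["parameterized queries", "validate", "least privilege"],
)
print(f"Coverage: {coverage:.0%}")  # 67% -> fails a 0.8 threshold, passes 0.6
```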
C. Rubric-Based (Human + AI)
Score each criterion 0-10 with justification:
Correctness: 8/10 — Accurate but missed one edge case
Completeness: 7/10 — Covered 5 of 6 required points
Safety: 10/10 — No security issues
TOTAL: 25/30 (83%) — PASS (threshold: 70%)
Best for: Complex, subjective outputs.
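A sketch of how per-criterion scores roll up into the verdict shown above, reusing the 70% threshold from Step 4. The function name and the shape of the scores dictionary are illustrative.

```python
# Aggregate rubric scores into a pass/fail summary line.
def rubric_verdict(scores: dict[str, int], threshold: float = 0.70) -> str:
    total, maximum = sum(scores.values()), 10 * len(scores)
    pct = total / maximum
    lines = [f"{name}: {score}/10" for name, score in scores.items()]
    lines.append(f"TOTAL: {total}/{maximum} ({pct:.0%}) - {'PASS' if pct >= threshold else 'FAIL'}")
    return "\n".join(lines)

print(rubric_verdict({"Correctness": 8, "Completeness": 7, "Safety": 10}))
```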
Step 3: Run Evaluation
- Prepare 5-10 test cases covering:
  - Happy path (normal usage)
  - Edge cases (unusual inputs)
  - Adversarial inputs (injection, confusion)
  - Empty/minimal inputs
  - Maximum complexity inputs
- Run each test case through the agent
- Apply chosen evaluation method
- Record results with timestamps
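A minimal harness for this loop, assuming `run_agent` is whatever callable invokes your agent (SDK, CLI, HTTP) and `evaluate` applies the method chosen in Step 2; both are stand-ins, not a real API.

```python
# Run every test case through the agent and record evaluated results.
from datetime import datetime, timezone

def run_suite(test_cases: list[dict], run_agent, evaluate) -> list[dict]:
    results = []
    for case in test_cases:
        output = run_agent(case["input"])
        results.append({
            "case": case["name"],
            "passed": evaluate(output, case),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    return results
```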
Step 4: Set Thresholds
| Level | Score | Action |
|---|---|---|
| Excellent | >= 90% | Ship |
| Good | 70-89% | Ship with monitoring |
| Marginal | 50-69% | Fix before shipping |
| Failing | < 50% | Do not ship |
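The same bands expressed as a helper, with boundaries taken directly from the table; the function name is illustrative.

```python
# Map an overall score percentage to the shipping action above.
def shipping_action(score_pct: float) -> str:
    if score_pct >= 90:
        return "Ship"
    if score_pct >= 70:
        return "Ship with monitoring"
    if score_pct >= 50:
        return "Fix before shipping"
    return "Do not ship"

assert shipping_action(83) == "Ship with monitoring"
```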
Step 5: Monitor Drift
Track these metrics over time:
- Average score per criterion (weekly)
- Pass rate on test suite (per deployment)
- Token cost per task (per session)
- User satisfaction signals (if available)
Drift signals:
- Score drops >10% week-over-week
- Pass rate drops below threshold
- Token cost increases >20% without scope change
- New failure modes not in original test suite
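A sketch of checking these signals week over week, assuming history is kept as simple per-week snapshots; the field names and snapshot format are assumptions. New failure modes still require manually extending the test suite.

```python
# Compare the current weekly snapshot against the previous one.
def drift_signals(prev: dict, curr: dict, pass_threshold: float = 0.70) -> list[str]:
    signals = []
    if curr["avg_score"] < prev["avg_score"] * 0.90:
        signals.append("score dropped >10% week-over-week")
    if curr["pass_rate"] < pass_threshold:
        signals.append("pass rate below threshold")
    if curr["tokens_per_task"] > prev["tokens_per_task"] * 1.20:
        signals.append("token cost up >20%")
    return signals

print(drift_signals(
    {"avg_score": 8.4, "pass_rate": 0.92, "tokens_per_task": 1500},
    {"avg_score": 7.2, "pass_rate": 0.88, "tokens_per_task": 1900},
))  # -> score drop and token cost signals
```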
Output Format
AGENT EVAL REPORT
Agent: {name}
Date: {ISO-8601}
Test cases: {n}
Method: {assertion|reference|rubric}
Results:
Pass: {n} ({%})
Fail: {n} ({%})
Average score: {x}/10
Per-criterion:
Correctness: {x}/10
Completeness: {x}/10
Safety: {x}/10
Verdict: {PASS|MARGINAL|FAIL}
Recommendation: {ship|fix|block}
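For reference, a partial formatter that fills this report from suite results; field names mirror the template and nothing here is a required schema (the per-criterion block and recommendation are omitted for brevity).

```python
# Build the plain-text report header, counts, and verdict.
from datetime import date

def format_report(agent: str, results: list[dict], method: str) -> str:
    passed = sum(1 for r in results if r["passed"])
    total = len(results)
    pct = passed / total
    verdict = "PASS" if pct >= 0.70 else "MARGINAL" if pct >= 0.50 else "FAIL"
    return (
        "AGENT EVAL REPORT\n"
        f"Agent: {agent}\nDate: {date.today().isoformat()}\n"
        f"Test cases: {total}\nMethod: {method}\n"
        f"Pass: {passed} ({pct:.0%})\nFail: {total - passed} ({1 - pct:.0%})\n"
        f"Verdict: {verdict}"
    )
```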
What This Skill Does NOT Do
- Does not test the underlying LLM itself (it evaluates the agent as configured, in context)
- Does not perform adversarial red-teaming (different discipline)
- Does not replace user feedback (complements it)
- Does not measure latency or throughput (APM tools do this)