---
name: agent-eval-framework
description: Evaluate AI agent outputs systematically using rubrics, assertions, and reference comparisons. Detect quality drift over time.
version: "1.0.0"
last-updated: "2026-04-17"
model_tested: "claude-sonnet-4-6"
category: eval
platforms: [claude-code, codex, gemini-cli, cursor, copilot, windsurf, cline]
language: en
geo_relevance: [global]
priority: high
dependencies:
  mcp: []
  skills: []
  apis: []
  data: []
update_sources:
  - url: "https://arxiv.org/abs/2603.28052"
    check_frequency: "quarterly"
    last_checked: "2026-04-21"
license: MIT
---
Agent Evaluation Framework
When to Use
- Before deploying an agent to production
- After changing an agent's system prompt or skills
- When agent output quality seems to degrade
- During periodic quality reviews
- When comparing two agent configurations
Step 1: Define Evaluation Criteria
Choose criteria relevant to your agent's purpose:
Universal Criteria
| Criterion | Question | Score |
|---|---|---|
| Correctness | Is the output factually/technically correct? | 0-10 |
| Completeness | Does it cover all required aspects? | 0-10 |
| Relevance | Is every part relevant to the request? | 0-10 |
| Safety | Does it avoid harmful/insecure patterns? | 0-10 |
Code-Specific Criteria
| Criterion | Question | Score |
|---|---|---|
| Functionality | Does the code work as intended? | 0-10 |
| Edge Cases | Are edge cases handled? | 0-10 |
| Style | Does it match project conventions? | 0-10 |
| Security | Is it free of known vulnerabilities? | 0-10 |
Content-Specific Criteria
| Criterion | Question | Score |
|---|---|---|
| Accuracy | Are claims supported by evidence? | 0-10 |
| Tone | Does it match the intended audience? | 0-10 |
| Structure | Is it well-organized? | 0-10 |
| Originality | Does it avoid generic/cliché content? | 0-10 |
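Criteria can be captured as data so the same rubric is reused across runs. Below is a minimal sketch; the `Criterion` dataclass, the optional weights, and the `CODE_RUBRIC` example are illustrative, not part of any specific eval library.

```python
# Minimal sketch of a rubric definition; names are illustrative, not a library API.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str            # e.g. "Correctness"
    question: str        # the question the grader answers on a 0-10 scale
    weight: float = 1.0  # optional weighting; 1.0 = equal weight

# Example: a code-review rubric mixing universal and code-specific criteria
CODE_RUBRIC = [
    Criterion("Correctness", "Is the output factually/technically correct?"),
    Criterion("Functionality", "Does the code work as intended?"),
    Criterion("Edge Cases", "Are edge cases handled?"),
    Criterion("Security", "Is it free of known vulnerabilities?", weight=2.0),
]
```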
Step 2: Choose Evaluation Method
A. Assertion-Based (Automated)
Define pass/fail conditions:
ASSERT: output contains "disclaimer"
ASSERT: output does NOT contain "TODO"
ASSERT: code compiles without errors
ASSERT: response length < 2000 tokens
ASSERT: no PII detected in output
Best for: Regression testing, CI/CD pipelines.
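A minimal sketch of automating such assertions, assuming the agent output is available as a plain string. The helper name, the word-count proxy for tokens, and the crude PII regex are assumptions; a real pipeline would also plug in a compiler check and a proper PII detector.

```python
# Hypothetical assertion checks mirroring the examples above.
import re

def check_assertions(output: str) -> dict[str, bool]:
    """Return a pass/fail map for each assertion on a single agent output."""
    pii_pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # crude SSN-style check only
    return {
        "contains_disclaimer": "disclaimer" in output.lower(),
        "no_todo": "TODO" not in output,
        "length_under_limit": len(output.split()) < 2000,  # word count as a token proxy
        "no_pii": not pii_pattern.search(output),
    }

results = check_assertions("Disclaimer: this is a draft response.")
assert all(results.values()), f"Failed: {[k for k, v in results.items() if not v]}"
```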
B. Reference-Based (Semi-Automated)
Compare output against a known-good reference:
- Exact match (strict)
- Semantic similarity (using embeddings)
- Key-point coverage (checklist)
Best for: Consistent tasks with known expected outputs.
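For the checklist variant, key-point coverage can be approximated with simple matching, as in this sketch; the key points and threshold values are illustrative. Semantic similarity would instead embed both the output and the reference and compare the vectors.

```python
# Minimal key-point coverage check (the "checklist" variant above).
def key_point_coverage(output: str, key_points: list[str]) -> float:
    """Fraction of required key points mentioned in the output (case-insensitive)."""
    text = output.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)

coverage = key_point_coverage(
    "Use parameterized queries and validate all user input.",
    ["parameterized queries", "validate", "least privilege"],
)
print(f"Coverage: {coverage:.0%}")  # 67% -> fails a 0.8 threshold, passes 0.6
```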
C. Rubric-Based (Human + AI)
Score each criterion 0-10 with justification:
Correctness: 8/10 — Accurate but missed one edge case
Completeness: 7/10 — Covered 5 of 6 required points
Safety: 10/10 — No security issues
TOTAL: 25/30 (83%) — PASS (threshold: 70%)
Best for: Complex, subjective outputs.
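A sketch of how per-criterion scores roll up into the verdict shown above, reusing the 70% threshold from Step 4. The function name and the shape of the scores dictionary are illustrative.

```python
# Aggregate rubric scores into a pass/fail summary line.
def rubric_verdict(scores: dict[str, int], threshold: float = 0.70) -> str:
    total, maximum = sum(scores.values()), 10 * len(scores)
    pct = total / maximum
    lines = [f"{name}: {score}/10" for name, score in scores.items()]
    lines.append(f"TOTAL: {total}/{maximum} ({pct:.0%}) - {'PASS' if pct >= threshold else 'FAIL'}")
    return "\n".join(lines)

print(rubric_verdict({"Correctness": 8, "Completeness": 7, "Safety": 10}))
```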
Step 3: Run Evaluation
- Prepare 5-10 test cases covering:
  - Happy path (normal usage)
  - Edge cases (unusual inputs)
  - Adversarial inputs (injection, confusion)
  - Empty/minimal inputs
  - Maximum complexity inputs
- Run each test case through the agent
- Apply chosen evaluation method
- Record results with timestamps
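A minimal harness for this loop, assuming `run_agent` is whatever callable invokes your agent (SDK, CLI, HTTP) and `evaluate` applies the method chosen in Step 2; both are stand-ins, not a real API.

```python
# Run every test case through the agent and record evaluated results.
from datetime import datetime, timezone

def run_suite(test_cases: list[dict], run_agent, evaluate) -> list[dict]:
    results = []
    for case in test_cases:
        output = run_agent(case["input"])
        results.append({
            "case": case["name"],
            "passed": evaluate(output, case),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    return results
```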
Step 4: Set Thresholds
| Level | Score | Action |
|---|---|---|
| Excellent | >= 90% | Ship |
| Good | 70-89% | Ship with monitoring |
| Marginal | 50-69% | Fix before shipping |
| Failing | < 50% | Do not ship |
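The same bands expressed as a helper, with boundaries taken directly from the table; the function name is illustrative.

```python
# Map an overall score percentage to the shipping action above.
def shipping_action(score_pct: float) -> str:
    if score_pct >= 90:
        return "Ship"
    if score_pct >= 70:
        return "Ship with monitoring"
    if score_pct >= 50:
        return "Fix before shipping"
    return "Do not ship"

assert shipping_action(83) == "Ship with monitoring"
```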
Step 5: Monitor Drift
Track these metrics over time:
- Average score per criterion (weekly)
- Pass rate on test suite (per deployment)
- Token cost per task (per session)
- User satisfaction signals (if available)
Drift signals:
- Score drops >10% week-over-week
- Pass rate drops below threshold
- Token cost increases >20% without scope change
- New failure modes not in original test suite
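A sketch of checking these signals week over week, assuming history is kept as simple per-week snapshots; the field names and snapshot format are assumptions. New failure modes still require manually extending the test suite.

```python
# Compare the current weekly snapshot against the previous one.
def drift_signals(prev: dict, curr: dict, pass_threshold: float = 0.70) -> list[str]:
    signals = []
    if curr["avg_score"] < prev["avg_score"] * 0.90:
        signals.append("score dropped >10% week-over-week")
    if curr["pass_rate"] < pass_threshold:
        signals.append("pass rate below threshold")
    if curr["tokens_per_task"] > prev["tokens_per_task"] * 1.20:
        signals.append("token cost up >20%")
    return signals

print(drift_signals(
    {"avg_score": 8.4, "pass_rate": 0.92, "tokens_per_task": 1500},
    {"avg_score": 7.2, "pass_rate": 0.88, "tokens_per_task": 1900},
))  # -> score drop and token cost signals
```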
Output Format
AGENT EVAL REPORT
Agent: {name}
Date: {ISO-8601}
Test cases: {n}
Method: {assertion|reference|rubric}
Results:
Pass: {n} ({%})
Fail: {n} ({%})
Average score: {x}/10
Per-criterion:
Correctness: {x}/10
Completeness: {x}/10
Safety: {x}/10
Verdict: {PASS|MARGINAL|FAIL}
Recommendation: {ship|fix|block}
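For reference, a partial formatter that fills this report from suite results; field names mirror the template and nothing here is a required schema (the per-criterion block and recommendation are omitted for brevity).

```python
# Build the plain-text report header, counts, and verdict.
from datetime import date

def format_report(agent: str, results: list[dict], method: str) -> str:
    passed = sum(1 for r in results if r["passed"])
    total = len(results)
    pct = passed / total
    verdict = "PASS" if pct >= 0.70 else "MARGINAL" if pct >= 0.50 else "FAIL"
    return (
        "AGENT EVAL REPORT\n"
        f"Agent: {agent}\nDate: {date.today().isoformat()}\n"
        f"Test cases: {total}\nMethod: {method}\n"
        f"Pass: {passed} ({pct:.0%})\nFail: {total - passed} ({1 - pct:.0%})\n"
        f"Verdict: {verdict}"
    )
```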
What This Skill Does NOT Do
- Does not test the underlying LLM itself (it evaluates the agent as configured, in context)
- Does not perform adversarial red-teaming (different discipline)
- Does not replace user feedback (complements it)
- Does not measure latency or throughput (APM tools do this)