---
name: agent:eval
description: Agent Evaluation System - designs failure modes, metrics, eval test suites, SME labeling, and production data evaluation pipelines
argument-hint: [spec-name]
---
# Agent Evaluation System
Guides the user through building a comprehensive evaluation system for their AI agent. Applies patterns 10-17 from "Patterns for Building AI Agents" (Bhagwat & Gienow, 2025): failure mode taxonomy, business metrics, cross-referencing, iterating against evals, test suites, SME labeling, production datasets, and live evaluation.
## When to use
Use this skill when the user needs to:
- Define what "good" looks like for an AI agent
- Create a failure mode taxonomy
- Set up business metrics for agent performance
- Build an evaluation test suite
- Design SME labeling workflows
- Plan production data evaluation pipelines
## Instructions
### Step 1: Understand the Agent
Use the AskUserQuestion tool to gather context:
- What does the agent do? (domain, tasks, outputs)
- Who are the end users?
- What are the consequences of wrong outputs? (low = inconvenience, high = financial/legal/safety)
- Is there an existing agent design? (check `.specs/<spec-name>/`)
- Do you have existing test data or production logs?
Read any existing spec documents before proceeding.
### Step 2: List Failure Modes (Pattern 10)
Build a classification of failure reasons. LLM outputs are nondeterministic — you need to understand not just WHAT fails, but WHY.
Use AskUserQuestion to explore failure categories with the user. Start with these common categories and adapt to the domain:
| Category | Description | Example |
|---|---|---|
| Data Quality | Agent received wrong, incomplete, or ambiguous input | Missing fields, contradictory data |
| Reasoning Failure | Agent had correct data but drew wrong conclusions | Incorrect logic chain, hallucinated facts |
| Rule Misapplication | Agent misapplied domain-specific rules or policies | Wrong insurance code, incorrect legal precedent |
| Tool Failure | External tool/API call failed or returned unexpected results | Timeout, wrong API response format |
| Context Failure | Agent lost track of important context | Forgot earlier constraint, ignored user correction |
| Output Format | Correct answer but wrong format or structure | Missing required fields, wrong data types |
Ask the user to identify domain-specific failure modes.
Output:
## Failure Mode Taxonomy
| ID | Category | Failure Mode | Description | Severity |
|----|----------|-------------|-------------|----------|
| F1 | Reasoning | [Name] | [Description] | Critical / High / Medium / Low |
| F2 | Data Quality | [Name] | [Description] | Critical / High / Medium / Low |
| F3 | [Domain] | [Name] | [Description] | Critical / High / Medium / Low |
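Where the taxonomy will be referenced from code (test case tags, SME labels), it helps to mirror it in a small data structure. A minimal sketch in Python; the example entry is invented and should be replaced with the taxonomy agreed with the user:
```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass(frozen=True)
class FailureMode:
    id: str          # e.g. "F1"
    category: str    # e.g. "Reasoning"
    name: str
    description: str
    severity: Severity

# Placeholder entry; replace with the real taxonomy.
TAXONOMY = {
    "F1": FailureMode("F1", "Reasoning", "Hallucinated facts",
                      "Agent asserts facts not supported by the input", Severity.CRITICAL),
}
```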
### Step 3: List Critical Business Metrics (Pattern 11)
Define metrics that connect agent performance to business value. Use AskUserQuestion to identify metrics in three categories:
1. Accuracy metrics (baseline):
- False positive rate
- False negative rate
- Overall accuracy / F1 score
2. Domain-specific outcome metrics:
- What domain-specific outcomes matter? (e.g., missed critical terms in legal, dollar loss in finance, resolution time in support)
3. Human team metrics:
- How does the equivalent human team perform?
- What is the target agent performance vs. human baseline?
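For the baseline accuracy metrics above, a minimal sketch using scikit-learn on binary ground-truth labels (the label arrays are invented for illustration):
```python
from sklearn.metrics import confusion_matrix, f1_score

# y_true: SME ground-truth labels; y_pred: agent outcomes (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_positive_rate = fp / (fp + tn)  # share of true negatives flagged positive
false_negative_rate = fn / (fn + tp)  # share of true positives missed
accuracy = (tp + tn) / len(y_true)
f1 = f1_score(y_true, y_pred)
```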
Ask the user to identify the north star metric — the single most important metric.
Output:
## Business Metrics
### North Star Metric
**[Metric name]:** [Description and why it matters most]
**Current baseline:** [Human performance or current agent performance]
**Target:** [Goal]
### Accuracy Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| False positive rate | [X%] | [Y%] | [How measured] |
| False negative rate | [X%] | [Y%] | [How measured] |
| Overall accuracy | [X%] | [Y%] | [How measured] |
### Domain-Specific Metrics
| Metric | Current | Target | Business Impact |
|--------|---------|--------|----------------|
| [Metric 1] | [X] | [Y] | [Why it matters] |
| [Metric 2] | [X] | [Y] | [Why it matters] |
### Step 4: Cross-Reference Failure Modes and Metrics (Pattern 12)
Map which failure modes drive which metrics. This turns metrics into actionable improvement work.
## Failure Mode → Metric Impact Matrix
| Failure Mode | North Star Impact | Other Metrics Affected | Priority |
|---|---|---|---|
| F1: [Name] | HIGH — directly causes [metric] regression | [Other metrics] | P0 |
| F2: [Name] | MEDIUM — contributes to [metric] | [Other metrics] | P1 |
| F3: [Name] | LOW — rare but severe | [Other metrics] | P2 |
Define the improvement cycle:
## Improvement Cycle
1. **SME Review** — Domain experts review agent outputs, classify failure modes
2. **PM Prioritization** — Cross-reference metrics + failure modes, set next target
- Current: [X%] → Next target: [Y%]
3. **Engineering** — Experiment with fixes using failure-mode-specific datasets
4. **Validation** — Test against past production data, decide go/no-go
### Step 5: Design Eval Test Suite (Patterns 13-14)
Help the user build an evaluation test suite.
Use AskUserQuestion to determine data sources:
- Synthetic data — Use LLM to generate test cases (fastest to start)
- Internal user data — Real data from internal testing
- SME golden answers — Expert-created input/output pairs (highest quality)
- Production data — Real user interactions (most realistic, available later)
Test suite structure:
## Eval Test Suite
### Suite Metadata
- **Total test cases:** [N]
- **Data sources:** [Synthetic / Internal / SME / Production]
- **Evaluation method:** [LLM-as-judge / Exact match / Human review]
- **CI integration:** [Yes/No — run on every code change]
### Evaluation Criteria
| Criterion | Weight | Scoring | Description |
|-----------|--------|---------|-------------|
| Accuracy | 40% | Binary (pass/fail) | Factually correct output |
| Completeness | 25% | Binary | All required information present |
| Relevance | 20% | Binary | Focused on the user's actual question |
| Format | 15% | Binary | Correct structure and data types |
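One way to roll the binary criteria up into a single case score, a sketch assuming the weights in the table above:
```python
WEIGHTS = {"accuracy": 0.40, "completeness": 0.25, "relevance": 0.20, "format": 0.15}

def case_score(results: dict[str, bool]) -> float:
    """Combine binary pass/fail criterion results into a weighted score in [0, 1]."""
    return sum(WEIGHTS[criterion] * passed for criterion, passed in results.items())

# A case that fails only on relevance scores 0.80.
print(case_score({"accuracy": True, "completeness": True,
                  "relevance": False, "format": True}))
```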
### Regression Policy
- **Merge blocker:** Any change that reduces overall accuracy below [X%]
- **Review required:** Any change that regresses accuracy by > [Y%]
- **Paired improvements:** If a regression in one area is necessary, pair with offsetting improvements elsewhere
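A sketch of enforcing this policy in CI; the thresholds and the way accuracy numbers are obtained are placeholders to adapt:
```python
import sys

ACCURACY_FLOOR = 0.90   # merge blocker threshold [X%]
REVIEW_DELTA = 0.02     # regression size [Y%] that requires review

def regression_gate(current: float, baseline: float) -> None:
    """Exit nonzero to block the merge; print a warning to request review."""
    if current < ACCURACY_FLOOR:
        sys.exit(f"BLOCKED: accuracy {current:.1%} is below the {ACCURACY_FLOOR:.0%} floor")
    if baseline - current > REVIEW_DELTA:
        print(f"REVIEW REQUIRED: accuracy regressed by {baseline - current:.1%}")

regression_gate(current=0.93, baseline=0.94)  # passes: above floor, small regression
```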
### Test Case Template
| Field | Description |
|-------|-------------|
| `id` | Unique test case identifier |
| `input` | The user input / agent prompt |
| `expected_output` | The correct or ideal response |
| `failure_modes` | Which failure modes this tests (F1, F2, ...) |
| `metadata` | Source, date added, domain category |
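Concretely, one test case following this template might look like the record below (all values are invented for illustration):
```python
test_case = {
    "id": "tc-0042",
    "input": "Customer asks whether their policy covers flood damage.",
    "expected_output": "Flood damage is excluded under section 4.2; offer the flood rider.",
    "failure_modes": ["F1", "F3"],  # taxonomy IDs this case exercises
    "metadata": {"source": "sme_golden", "added": "2025-01-15", "category": "coverage"},
}
```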
Scoring recommendation: Use binary (pass/fail) or categorical (good/fair/poor) scoring. Avoid numerical scales (1-10) — LLMs are better at categorical than numerical judgment.
### Step 6: SME Labeling Plan (Pattern 15)
Design how subject matter experts will validate agent outputs.
Use AskUserQuestion to understand:
- Who are the domain experts? (role, availability)
- What tools will they use for labeling? (custom UI, spreadsheet, observability tool)
- How many annotators per data point? (recommend 2+ for inter-rater reliability)
## SME Labeling Plan
### Annotators
| Role | Count | Domain | Availability |
|------|-------|--------|-------------|
| [Role 1] | [N] | [Domain area] | [Hours/week] |
### Labeling Schema
Each review includes:
1. **Overall grade:** Pass / Partial / Fail
2. **Category tags:** [List of failure mode IDs that apply]
3. **Subjective feedback:** Free-text explanation (optional)
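A sketch of the label record this schema implies, with field names assumed rather than prescribed:
```python
from typing import Literal, TypedDict

class SMELabel(TypedDict):
    trace_id: str                              # links back to the logged trace
    grade: Literal["pass", "partial", "fail"]  # overall grade
    failure_modes: list[str]                   # applicable taxonomy IDs, e.g. ["F2"]
    feedback: str                              # optional free-text explanation
    annotator: str

label: SMELabel = {
    "trace_id": "trace-8f3a",
    "grade": "partial",
    "failure_modes": ["F2"],
    "feedback": "Correct answer, but ignored the constraint the user gave earlier.",
    "annotator": "sme-1",
}
```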
### Labeling Workflow
1. Agent generates output → logged to observability tool
2. Automated flags trigger review (guardrail violations, CI failures, low-confidence outputs)
3. Random sampling of unflagged outputs ([X%] sample rate)
4. SME reviews full trace: user input → tool calls → reasoning → output
5. SME labels using schema above
6. Labels feed back into eval test suite
### Inter-Rater Reliability
- Metric: Cohen's Kappa / Fleiss' Kappa
- Target: > 0.7 (substantial agreement)
- Calibration: Weekly sync to align on edge cases
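Cohen's kappa for two annotators is a one-liner with scikit-learn; the grades below are invented:
```python
from sklearn.metrics import cohen_kappa_score

# Grades from two annotators on the same six traces.
rater_a = ["pass", "fail", "partial", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "pass", "pass", "partial", "fail"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # aim for > 0.7 before trusting the labels
```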
### Step 7: Production Data Pipeline (Patterns 16-17)
Design how production data flows into the evaluation system.
## Production Data Pipeline
### Data Collection
- **Observability tool:** [Tool name — e.g., LangSmith, Braintrust, custom]
- **Logged fields:** Input, output, tool calls, latency, token usage, model version
- **Storage:** [Where datasets are stored — prefer a versioned dataset store over loose JSONL files]
### Live Evaluation
- **Method:** LLM-as-judge with defined evaluation prompt
- **Scoring:** [Binary / Categorical] — strongly recommended over numerical
- **Sampling:** Evaluate [X%] of production responses
- **Frequency:** [Real-time / Hourly / Daily batch]
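For the sampling step, hashing the trace ID gives a reproducible decision (the same trace is always in or out, no matter which service asks); a sketch with an assumed 5% rate:
```python
import hashlib

SAMPLE_RATE = 0.05  # placeholder for the [X%] sample rate

def should_evaluate(trace_id: str) -> bool:
    """Deterministic sampling: map the trace ID to a uniform bucket in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

print(should_evaluate("trace-8f3a"))
```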
### Evaluation Prompt Template

```
You are evaluating an AI agent's response.

User input: {input}
Agent output: {output}
Expected behavior: {criteria}

Grade the response as PASS or FAIL. Explain your reasoning in one sentence.
```
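A sketch of wiring this template into an LLM-as-judge call, here using the OpenAI SDK as one possible backend; the model name and the example criteria are placeholders:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating an AI agent's response.

User input: {input}
Agent output: {output}
Expected behavior: {criteria}

Grade the response as PASS or FAIL. Explain your reasoning in one sentence."""

def judge(user_input: str, agent_output: str, criteria: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever judge model you trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            input=user_input, output=agent_output, criteria=criteria)}],
    )
    return response.choices[0].message.content

print(judge("What is the refund window?", "30 days.", "States the 30-day refund window."))
```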
### Dataset Versioning
- Version datasets when: new failure modes discovered, distribution shift detected
- Store: inputs, expected outputs, metadata (source, date, failure mode tags)
- Review cadence: [Weekly / Monthly] — check if synthetic data still matches production reality
### Feedback Loop
Production data → SME review → New test cases → Eval suite update → CI regression check
### Step 8: Generate Eval Document
Compile all outputs into `.specs/<spec-name>/agent-eval.md`.
### Step 9: Offer Next Steps
Use AskUserQuestion to offer:
- Create initial test cases — generate synthetic eval data based on the failure modes
- Proceed to security audit — run `agent:secure`
- Full review — run `agent:review`
## Arguments
`$ARGUMENTS` (`$0`) — optional spec name `<spec-name>`; reads the existing agent design from `.specs/<spec-name>/`
Examples:
- `agent:eval customer-support` — design an eval system for the customer-support agent
- `agent:eval` — start fresh; the skill will ask for details