Quality Assurance Agents
Sources: Production evaluation patterns, inspired by anthropics/skills agent architecture
Covers: grader agent prompt, blind comparator agent prompt, spawning patterns, output schemas.
Does NOT cover: the evaluation workflow process itself (see references/evaluation-workflow.md).
Overview
Two reusable agent prompts for evaluating skill output quality:
| Agent | Purpose | When to Use |
|---|---|---|
| Grader | Evaluate assertions against outputs | After every test run |
| Blind Comparator | Judge two outputs without bias | When comparing skill versions |
Spawn these via delegation with the relevant prompt section as instructions.
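For example, a minimal sketch of the delegation pattern in Python, assuming a hypothetical spawn_agent helper that stands in for whatever sub-agent mechanism your harness provides; the paths in the commented example are illustrative:

```python
def spawn_agent(instructions: str) -> str:
    """Placeholder: dispatch a sub-agent with these instructions and return
    its final response. Wire this to your actual agent runtime."""
    raise NotImplementedError


def delegate(template: str, **context: str) -> str:
    """Fill the {placeholders} in a prompt section, then spawn the agent."""
    return spawn_agent(template.format(**context))


# Example usage, where grader_template holds the Grader delegation template below:
# delegate(grader_template,
#          eval_name="fill-tax-form",
#          transcript_path="runs/001/transcript.json",
#          outputs_dir="runs/001/outputs",
#          assertions_list="- All form fields are filled\n- Dates use ISO 8601")
```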
Grader Agent
Role
The Grader reviews execution transcripts and output files, then determines whether each assertion passes or fails, citing evidence for every judgment.
It has two jobs: grade the outputs AND critique the assertions themselves. A passing grade on a weak assertion creates false confidence.
Delegation Template
TASK: Grade the outputs for eval "{eval_name}" against its assertions.
EXPECTED OUTCOME: grading.json with pass/fail for each assertion, evidence,
summary stats, and optionally eval improvement suggestions.
MUST DO:
- Read the transcript at {transcript_path}
- Examine all files in {outputs_dir}
- For each assertion, search for evidence in BOTH transcript and outputs
- PASS only when clear evidence exists AND reflects genuine task completion
- FAIL when evidence is missing, contradicts the assertion, or is merely surface-level
- Extract and verify implicit claims from the output
- Check {outputs_dir}/user_notes.md if it exists
- Critique assertions that are too easy or non-discriminating
- Write results to {outputs_dir}/../grading.json
MUST NOT DO:
- Give partial credit (pass or fail, no middle ground)
- Pass assertions based on surface compliance (right filename but wrong content)
- Skip reading the actual output files
- Assume transcript claims are true without verification
- Nitpick every assertion — only flag genuinely problematic ones
CONTEXT:
- Assertions to evaluate: {assertions_list}
- Transcript path: {transcript_path}
- Outputs directory: {outputs_dir}
Grading Criteria
PASS when:
- Transcript or outputs clearly demonstrate the assertion is true
- Specific evidence can be cited
- Evidence reflects genuine substance, not surface compliance
FAIL when:
- No evidence found
- Evidence contradicts the assertion
- Cannot be verified from available information
- Evidence is superficial (technically satisfied but outcome is wrong)
- Output meets assertion by coincidence, not by doing the work
When uncertain, fail: the burden of proof is on the assertion to pass.
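A minimal sketch of that default-fail rule, recorded in the expectations entry format shown in the output schema below; the substantive flag is a stand-in for the surface-compliance check a real grader performs by reading the output:

```python
def judge(assertion: str, evidence: str | None, substantive: bool) -> dict:
    """Pass only when specific evidence exists AND it reflects genuine task
    completion; anything missing, unverifiable, or superficial fails."""
    return {
        "text": assertion,
        "passed": bool(evidence) and substantive,
        "evidence": evidence or "No evidence found in transcript or outputs",
    }
```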
Claim Extraction
Beyond predefined assertions, extract implicit claims from outputs:
| Claim Type | Example | How to Verify |
|---|---|---|
| Factual | "The form has 12 fields" | Count in output file |
| Process | "Used pypdf to fill the form" | Check transcript tool calls |
| Quality | "All fields filled correctly" | Inspect output vs input |
Flag unverifiable claims. These may reveal issues assertions miss.
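For instance, a factual claim such as "the report covers all 3 regions" can be checked directly against the output file rather than trusted from the transcript. A sketch, with illustrative region names and an assumed text output:

```python
from pathlib import Path


def verify_regions_claim(output_file: Path,
                         regions: tuple[str, ...] = ("North", "South", "West")) -> dict:
    """Verify a factual claim by inspecting the output itself."""
    # Region names are illustrative; adapt the check to the actual claim.
    text = output_file.read_text()
    missing = [r for r in regions if r not in text]
    return {
        "claim": f"The report covers all {len(regions)} regions",
        "type": "factual",
        "verified": not missing,
        "evidence": ("Sections found for " + ", ".join(regions)) if not missing
                    else "Missing sections: " + ", ".join(missing),
    }
```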
Eval Self-Critique
After grading, assess whether the assertions themselves are good:
Flag when:
- An assertion passed but would also pass for a clearly wrong output
- An important outcome has no assertion covering it
- An assertion can't actually be verified from available outputs
Don't flag when:
- Assertion is working as intended
- Suggestion would be nitpicking without real value
Grader Output Schema
{
"expectations": [
{
"text": "The output includes a summary table",
"passed": true,
"evidence": "Found 3-column table in output.md, lines 15-28"
},
{
"text": "Date fields use ISO 8601 format",
"passed": false,
"evidence": "Dates use MM/DD/YYYY format (line 7: '03/15/2025')"
}
],
"summary": {
"passed": 4,
"failed": 1,
"total": 5,
"pass_rate": 0.80
},
"claims": [
{
"claim": "The report covers all 3 regions",
"type": "factual",
"verified": true,
"evidence": "Sections found for North, South, West regions"
}
],
"eval_feedback": {
"suggestions": [
{
"assertion": "Output includes summary table",
"reason": "A table with wrong data would also pass. Consider checking column headers match expected schema."
}
],
"overall": "Assertions check presence but not correctness for 2 of 5 checks."
}
}
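A sketch of assembling and writing grading.json from individual judgments; field names follow the schema above, and the write location mirrors the delegation template:

```python
import json
from pathlib import Path


def write_grading(expectations: list[dict], claims: list[dict],
                  eval_feedback: dict, outputs_dir: Path) -> dict:
    """Compute summary stats and write grading.json one level above outputs_dir."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    grading = {
        "expectations": expectations,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "pass_rate": round(passed / total, 2) if total else 0.0,
        },
        "claims": claims,
        "eval_feedback": eval_feedback,
    }
    (outputs_dir.parent / "grading.json").write_text(json.dumps(grading, indent=2))
    return grading
```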
Blind Comparator Agent
Role
The Blind Comparator compares two outputs WITHOUT knowing which skill produced them, which prevents bias toward a particular approach. It judges purely on output quality and task completion.
When to Use
- Comparing a new skill version against the previous version
- Deciding between two alternative skill approaches
- Validating that improvements actually improved quality
Not needed for every iteration. The grader + human review loop is usually sufficient. Use blind comparison when you want rigorous evidence that version B is better than version A.
Delegation Template
TASK: Compare two outputs for the task "{eval_prompt}" and determine
which is better. You do NOT know which approach produced which output.
EXPECTED OUTCOME: comparison.json with winner, rubric scores, reasoning,
and optionally assertion results.
MUST DO:
- Read output A at {output_a_path}
- Read output B at {output_b_path}
- Generate a task-specific rubric (content + structure dimensions)
- Score each output 1-5 on each criterion
- Determine a winner based on overall quality
- If assertions provided, check them for both outputs (secondary signal)
- Write results to {output_path}
MUST NOT DO:
- Try to guess which skill produced which output
- Favor outputs based on style preferences over correctness
- Declare a tie unless outputs are genuinely equivalent
- Use assertion scores as primary decision factor
CONTEXT:
- Task prompt: {eval_prompt}
- Output A: {output_a_path}
- Output B: {output_b_path}
- Assertions (optional): {assertions_list}
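To keep the comparison blind, the caller (not the comparator) should randomize which output is labeled A or B and keep the mapping outside the comparator's context. A sketch, assuming baseline and candidate output paths:

```python
import random


def assign_blind_labels(baseline_path: str, candidate_path: str) -> dict[str, str]:
    """Randomly map the two outputs to labels A and B before delegation."""
    paths = [baseline_path, candidate_path]
    random.shuffle(paths)
    return {"A": paths[0], "B": paths[1]}  # keep this mapping out of the comparator's context


def resolve_winner(labels: dict[str, str], winner_label: str) -> str:
    """Translate the comparator's verdict ('A', 'B', or 'TIE') back to a path."""
    return labels.get(winner_label, "TIE")
```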
Rubric Generation
The comparator generates a task-specific rubric with two dimensions:
Content Rubric (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All present |
| Accuracy | Significant issues | Minor inaccuracies | Accurate |
Structure Rubric (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Organization | Disorganized | Reasonable | Clear, logical |
| Formatting | Inconsistent | Mostly consistent | Professional |
| Usability | Difficult to use | Usable with effort | Easy to use |
Adapt criteria to the task. PDF form → "Field alignment", "Data placement". Document → "Section structure", "Heading hierarchy". Data → "Schema correctness".
Decision Priority
- Primary: Overall rubric score (content + structure)
- Secondary: Assertion pass rates (if provided)
- Tiebreaker: If truly equal, declare TIE (should be rare)
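The aggregation rule is not prescribed, but the example schema below is consistent with a simple roll-up: per-dimension mean rounded to one decimal, with the overall score as the sum of the two dimensions. A sketch under that assumption:

```python
def rollup(content: dict[str, int], structure: dict[str, int]) -> dict:
    """Roll criterion scores (1-5) into dimension and overall scores."""
    # Aggregation rule assumed from the example schema, not prescribed.
    content_score = round(sum(content.values()) / len(content), 1)
    structure_score = round(sum(structure.values()) / len(structure), 1)
    return {
        "content": content,
        "structure": structure,
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": round(content_score + structure_score, 1),
    }


# rollup({"correctness": 5, "completeness": 5, "accuracy": 4},
#        {"organization": 4, "formatting": 5, "usability": 4})
# -> content_score 4.7, structure_score 4.3, overall_score 9.0
```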
Comparator Output Schema
{
"winner": "A",
"reasoning": "Output A provides complete solution with proper formatting. Output B is missing the date field and has inconsistencies.",
"rubric": {
"A": {
"content": {"correctness": 5, "completeness": 5, "accuracy": 4},
"structure": {"organization": 4, "formatting": 5, "usability": 4},
"content_score": 4.7,
"structure_score": 4.3,
"overall_score": 9.0
},
"B": {
"content": {"correctness": 3, "completeness": 2, "accuracy": 3},
"structure": {"organization": 3, "formatting": 2, "usability": 3},
"content_score": 2.7,
"structure_score": 2.7,
"overall_score": 5.4
}
},
"output_quality": {
"A": {
"score": 9,
"strengths": ["Complete solution", "All fields present"],
"weaknesses": ["Minor style inconsistency in header"]
},
"B": {
"score": 5,
"strengths": ["Readable output", "Correct basic structure"],
"weaknesses": ["Missing date field", "Formatting issues"]
}
},
"expectation_results": {
"A": {"passed": 4, "total": 5, "pass_rate": 0.80},
"B": {"passed": 3, "total": 5, "pass_rate": 0.60}
}
}
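A sketch of a sanity check on comparison.json, assuming the roll-up convention above: the declared winner should hold the higher overall score, and a TIE should only appear when scores are genuinely equal:

```python
import json
from pathlib import Path


def check_comparison(path: Path) -> list[str]:
    """Return consistency problems (empty when the verdict is coherent)."""
    result = json.loads(path.read_text())
    a = result["rubric"]["A"]["overall_score"]
    b = result["rubric"]["B"]["overall_score"]
    winner = result["winner"]
    problems = []
    if winner == "TIE" and a != b:
        problems.append(f"TIE declared but overall scores differ ({a} vs {b})")
    elif winner == "A" and a < b:
        problems.append("Winner A has the lower overall score")
    elif winner == "B" and b < a:
        problems.append("Winner B has the lower overall score")
    return problems
```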
Post-Comparison Analysis
After blind comparison identifies a winner, optionally analyze WHY it won. Read both skills and transcripts to extract actionable improvements.
Analysis Delegation Template
TASK: Analyze why the winning output was better and generate improvement
suggestions for the losing skill.
MUST DO:
- Read the comparison result at {comparison_path}
- Read both skills (winner at {winner_skill_path}, loser at {loser_skill_path})
- Read both transcripts if available
- Identify instruction-following differences
- Generate prioritized improvement suggestions
- Focus on changes that would have changed the outcome
MUST NOT DO:
- Suggest cosmetic changes that wouldn't affect quality
- Speculate without evidence from transcripts
- Recommend changes that would only help this specific test case
Analysis Output Schema
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner",
"loser_skill": "path/to/loser"
},
"winner_strengths": [
"Clear step-by-step instructions for multi-page documents",
"Included validation script that caught formatting errors"
],
"loser_weaknesses": [
"Vague instruction 'process appropriately' led to inconsistent behavior",
"No validation step, errors went uncaught"
],
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process appropriately' with explicit steps",
"expected_impact": "Eliminates ambiguity causing inconsistent behavior"
}
]
}
Suggestion Categories
| Category | What to Change |
|---|---|
| instructions | Prose instructions in SKILL.md or references |
| tools | Scripts or utilities to add/modify |
| examples | Input/output examples to include |
| error_handling | Guidance for handling failures |
| structure | Reorganization of skill content |
| references | Additional reference docs to add |
Priority Levels
- high: Would likely change the outcome of comparison
- medium: Improves quality but may not change win/loss
- low: Nice to have, marginal improvement
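A sketch of how an iteration loop might consume the analysis output, applying high-priority, outcome-changing suggestions first; field names follow the analysis schema above:

```python
import json
from pathlib import Path

PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def prioritized_suggestions(analysis_path: Path) -> list[dict]:
    """Order improvement suggestions so outcome-changing fixes come first."""
    analysis = json.loads(analysis_path.read_text())
    return sorted(analysis["improvement_suggestions"],
                  key=lambda s: PRIORITY_ORDER.get(s["priority"], len(PRIORITY_ORDER)))
```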