Quality Assurance Agents
Sources: Production evaluation patterns, inspired by anthropics/skills agent architecture
Covers: grader agent prompt, blind comparator agent prompt, spawning patterns, output schemas.
Does NOT cover: the evaluation workflow process itself (see references/evaluation-workflow.md).
Overview
Two reusable agent prompts for evaluating skill output quality:
| Agent | Purpose | When to Use |
|---|---|---|
| Grader | Evaluate assertions against outputs | After every test run |
| Blind Comparator | Judge two outputs without bias | When comparing skill versions |
Spawn these via delegation with the relevant prompt section as instructions.
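For example, a minimal sketch of the delegation pattern in Python, assuming a hypothetical spawn_agent helper that stands in for whatever sub-agent mechanism your harness provides; the paths in the commented example are illustrative:

```python
def spawn_agent(instructions: str) -> str:
    """Placeholder: dispatch a sub-agent with these instructions and return
    its final response. Wire this to your actual agent runtime."""
    raise NotImplementedError


def delegate(template: str, **context: str) -> str:
    """Fill the {placeholders} in a prompt section, then spawn the agent."""
    return spawn_agent(template.format(**context))


# Example usage, where grader_template holds the Grader delegation template below:
# delegate(grader_template,
#          eval_name="fill-tax-form",
#          transcript_path="runs/001/transcript.json",
#          outputs_dir="runs/001/outputs",
#          assertions_list="- All form fields are filled\n- Dates use ISO 8601")
```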
Grader Agent
Role
The Grader reviews execution transcripts and output files, then determines whether each assertion passes or fails, citing evidence for every judgment.
It has two jobs: grade the outputs AND critique the assertions themselves. A passing grade on a weak assertion creates false confidence.
Delegation Template
TASK: Grade the outputs for eval "{eval_name}" against its assertions.
EXPECTED OUTCOME: grading.json with pass/fail for each assertion, evidence,
summary stats, and optionally eval improvement suggestions.
MUST DO:
- Read the transcript at {transcript_path}
- Examine all files in {outputs_dir}
- For each assertion, search for evidence in BOTH transcript and outputs
- PASS only when clear evidence exists AND reflects genuine task completion
- FAIL when evidence is missing, contradicts the assertion, or is merely surface-level
- Extract and verify implicit claims from the output
- Check {outputs_dir}/user_notes.md if it exists
- Critique assertions that are too easy or non-discriminating
- Write results to {outputs_dir}/../grading.json
MUST NOT DO:
- Give partial credit (pass or fail, no middle ground)
- Pass assertions based on surface compliance (right filename but wrong content)
- Skip reading the actual output files
- Assume transcript claims are true without verification
- Nitpick every assertion — only flag genuinely problematic ones
CONTEXT:
- Assertions to evaluate: {assertions_list}
- Transcript path: {transcript_path}
- Outputs directory: {outputs_dir}
Grading Criteria
PASS when:
- Transcript or outputs clearly demonstrate the assertion is true
- Specific evidence can be cited
- Evidence reflects genuine substance, not surface compliance
FAIL when:
- No evidence found
- Evidence contradicts the assertion
- Cannot be verified from available information
- Evidence is superficial (technically satisfied but outcome is wrong)
- Output meets assertion by coincidence, not by doing the work
When uncertain, fail: the burden of proof is on the assertion to pass.
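A minimal sketch of that default-fail rule, recorded in the expectations entry format shown in the output schema below; the substantive flag is a stand-in for the surface-compliance check a real grader performs by reading the output:

```python
def judge(assertion: str, evidence: str | None, substantive: bool) -> dict:
    """Pass only when specific evidence exists AND it reflects genuine task
    completion; anything missing, unverifiable, or superficial fails."""
    return {
        "text": assertion,
        "passed": bool(evidence) and substantive,
        "evidence": evidence or "No evidence found in transcript or outputs",
    }
```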
Claim Extraction
Beyond predefined assertions, extract implicit claims from outputs:
| Claim Type | Example | How to Verify |
|---|---|---|
| Factual | "The form has 12 fields" | Count in output file |
| Process | "Used pypdf to fill the form" | Check transcript tool calls |
| Quality | "All fields filled correctly" | Inspect output vs input |
Flag unverifiable claims. These may reveal issues assertions miss.
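For instance, a factual claim such as "the report covers all 3 regions" can be checked directly against the output file rather than trusted from the transcript. A sketch, with illustrative region names and an assumed text output:

```python
from pathlib import Path


def verify_regions_claim(output_file: Path,
                         regions: tuple[str, ...] = ("North", "South", "West")) -> dict:
    """Verify a factual claim by inspecting the output itself."""
    # Region names are illustrative; adapt the check to the actual claim.
    text = output_file.read_text()
    missing = [r for r in regions if r not in text]
    return {
        "claim": f"The report covers all {len(regions)} regions",
        "type": "factual",
        "verified": not missing,
        "evidence": ("Sections found for " + ", ".join(regions)) if not missing
                    else "Missing sections: " + ", ".join(missing),
    }
```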
Eval Self-Critique
After grading, assess whether the assertions themselves are good:
Flag when:
- An assertion passed but would also pass for a clearly wrong output
- An important outcome has no assertion covering it
- An assertion can't actually be verified from available outputs
Don't flag when:
- Assertion is working as intended
- Suggestion would be nitpicking without real value
Grader Output Schema
{
"expectations": [
{
"text": "The output includes a summary table",
"passed": true,
"evidence": "Found 3-column table in output.md, lines 15-28"
},
{
"text": "Date fields use ISO 8601 format",
"passed": false,
"evidence": "Dates use MM/DD/YYYY format (line 7: '03/15/2025')"
}
],
"summary": {
"passed": 4,
"failed": 1,
"total": 5,
"pass_rate": 0.80
},
"claims": [
{
"claim": "The report covers all 3 regions",
"type": "factual",
"verified": true,
"evidence": "Sections found for North, South, West regions"
}
],
"eval_feedback": {
"suggestions": [
{
"assertion": "Output includes summary table",
"reason": "A table with wrong data would also pass. Consider checking column headers match expected schema."
}
],
"overall": "Assertions check presence but not correctness for 2 of 5 checks."
}
}
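A sketch of assembling and writing grading.json from individual judgments; field names follow the schema above, and the write location mirrors the delegation template:

```python
import json
from pathlib import Path


def write_grading(expectations: list[dict], claims: list[dict],
                  eval_feedback: dict, outputs_dir: Path) -> dict:
    """Compute summary stats and write grading.json one level above outputs_dir."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    grading = {
        "expectations": expectations,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "pass_rate": round(passed / total, 2) if total else 0.0,
        },
        "claims": claims,
        "eval_feedback": eval_feedback,
    }
    (outputs_dir.parent / "grading.json").write_text(json.dumps(grading, indent=2))
    return grading
```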
Blind Comparator Agent
Role
The Blind Comparator compares two outputs WITHOUT knowing which skill produced them, which prevents bias toward a particular approach. It judges purely on output quality and task completion.
When to Use
- Comparing a new skill version against the previous version
- Deciding between two alternative skill approaches
- Validating that improvements actually improved quality
Not needed for every iteration. The grader + human review loop is usually sufficient. Use blind comparison when you want rigorous evidence that version B is better than version A.
Delegation Template
TASK: Compare two outputs for the task "{eval_prompt}" and determine
which is better. You do NOT know which approach produced which output.
EXPECTED OUTCOME: comparison.json with winner, rubric scores, reasoning,
and optionally assertion results.
MUST DO:
- Read output A at {output_a_path}
- Read output B at {output_b_path}
- Generate a task-specific rubric (content + structure dimensions)
- Score each output 1-5 on each criterion
- Determine a winner based on overall quality
- If assertions provided, check them for both outputs (secondary signal)
- Write results to {output_path}
MUST NOT DO:
- Try to guess which skill produced which output
- Favor outputs based on style preferences over correctness
- Declare a tie unless outputs are genuinely equivalent
- Use assertion scores as primary decision factor
CONTEXT:
- Task prompt: {eval_prompt}
- Output A: {output_a_path}
- Output B: {output_b_path}
- Assertions (optional): {assertions_list}
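To keep the comparison blind, the caller (not the comparator) should randomize which output is labeled A or B and keep the mapping outside the comparator's context. A sketch, assuming baseline and candidate output paths:

```python
import random


def assign_blind_labels(baseline_path: str, candidate_path: str) -> dict[str, str]:
    """Randomly map the two outputs to labels A and B before delegation."""
    paths = [baseline_path, candidate_path]
    random.shuffle(paths)
    return {"A": paths[0], "B": paths[1]}  # keep this mapping out of the comparator's context


def resolve_winner(labels: dict[str, str], winner_label: str) -> str:
    """Translate the comparator's verdict ('A', 'B', or 'TIE') back to a path."""
    return labels.get(winner_label, "TIE")
```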
Rubric Generation
The comparator generates a task-specific rubric with two dimensions:
Content Rubric (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All present |
| Accuracy | Significant issues | Minor inaccuracies | Accurate |
Structure Rubric (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Organization | Disorganized | Reasonable | Clear, logical |
| Formatting | Inconsistent | Mostly consistent | Professional |
| Usability | Difficult to use | Usable with effort | Easy to use |
Adapt criteria to the task. PDF form → "Field alignment", "Data placement". Document → "Section structure", "Heading hierarchy". Data → "Schema correctness".
Decision Priority
- Primary: Overall rubric score (content + structure)
- Secondary: Assertion pass rates (if provided)
- Tiebreaker: If truly equal, declare TIE (should be rare)
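The aggregation rule is not prescribed, but the example schema below is consistent with a simple roll-up: per-dimension mean rounded to one decimal, with the overall score as the sum of the two dimensions. A sketch under that assumption:

```python
def rollup(content: dict[str, int], structure: dict[str, int]) -> dict:
    """Roll criterion scores (1-5) into dimension and overall scores."""
    # Aggregation rule assumed from the example schema, not prescribed.
    content_score = round(sum(content.values()) / len(content), 1)
    structure_score = round(sum(structure.values()) / len(structure), 1)
    return {
        "content": content,
        "structure": structure,
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": round(content_score + structure_score, 1),
    }


# rollup({"correctness": 5, "completeness": 5, "accuracy": 4},
#        {"organization": 4, "formatting": 5, "usability": 4})
# -> content_score 4.7, structure_score 4.3, overall_score 9.0
```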
Comparator Output Schema
{
"winner": "A",
"reasoning": "Output A provides complete solution with proper formatting. Output B is missing the date field and has inconsistencies.",
"rubric": {
"A": {
"content": {"correctness": 5, "completeness": 5, "accuracy": 4},
"structure": {"organization": 4, "formatting": 5, "usability": 4},
"content_score": 4.7,
"structure_score": 4.3,
"overall_score": 9.0
},
"B": {
"content": {"correctness": 3, "completeness": 2, "accuracy": 3},
"structure": {"organization": 3, "formatting": 2, "usability": 3},
"content_score": 2.7,
"structure_score": 2.7,
"overall_score": 5.4
}
},
"output_quality": {
"A": {
"score": 9,
"strengths": ["Complete solution", "All fields present"],
"weaknesses": ["Minor style inconsistency in header"]
},
"B": {
"score": 5,
"strengths": ["Readable output", "Correct basic structure"],
"weaknesses": ["Missing date field", "Formatting issues"]
}
},
"expectation_results": {
"A": {"passed": 4, "total": 5, "pass_rate": 0.80},
"B": {"passed": 3, "total": 5, "pass_rate": 0.60}
}
}
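A sketch of a sanity check on comparison.json, assuming the roll-up convention above: the declared winner should hold the higher overall score, and a TIE should only appear when scores are genuinely equal:

```python
import json
from pathlib import Path


def check_comparison(path: Path) -> list[str]:
    """Return consistency problems (empty when the verdict is coherent)."""
    result = json.loads(path.read_text())
    a = result["rubric"]["A"]["overall_score"]
    b = result["rubric"]["B"]["overall_score"]
    winner = result["winner"]
    problems = []
    if winner == "TIE" and a != b:
        problems.append(f"TIE declared but overall scores differ ({a} vs {b})")
    elif winner == "A" and a < b:
        problems.append("Winner A has the lower overall score")
    elif winner == "B" and b < a:
        problems.append("Winner B has the lower overall score")
    return problems
```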
Post-Comparison Analysis
After blind comparison identifies a winner, optionally analyze WHY it won. Read both skills and transcripts to extract actionable improvements.
Analysis Delegation Template
TASK: Analyze why the winning output was better and generate improvement
suggestions for the losing skill.
MUST DO:
- Read the comparison result at {comparison_path}
- Read both skills (winner at {winner_skill_path}, loser at {loser_skill_path})
- Read both transcripts if available
- Identify instruction-following differences
- Generate prioritized improvement suggestions
- Focus on changes that would have changed the outcome
MUST NOT DO:
- Suggest cosmetic changes that wouldn't affect quality
- Speculate without evidence from transcripts
- Recommend changes that would only help this specific test case
Analysis Output Schema
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner",
"loser_skill": "path/to/loser"
},
"winner_strengths": [
"Clear step-by-step instructions for multi-page documents",
"Included validation script that caught formatting errors"
],
"loser_weaknesses": [
"Vague instruction 'process appropriately' led to inconsistent behavior",
"No validation step, errors went uncaught"
],
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process appropriately' with explicit steps",
"expected_impact": "Eliminates ambiguity causing inconsistent behavior"
}
]
}
Suggestion Categories
| Category | What to Change |
|---|---|
| instructions | Prose instructions in SKILL.md or references |
| tools | Scripts or utilities to add/modify |
| examples | Input/output examples to include |
| error_handling | Guidance for handling failures |
| structure | Reorganization of skill content |
| references | Additional reference docs to add |
Priority Levels
- high: Would likely change the outcome of comparison
- medium: Improves quality but may not change win/loss
- low: Nice to have, marginal improvement
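A sketch of how an iteration loop might consume the analysis output, applying high-priority, outcome-changing suggestions first; field names follow the analysis schema above:

```python
import json
from pathlib import Path

PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def prioritized_suggestions(analysis_path: Path) -> list[dict]:
    """Order improvement suggestions so outcome-changing fixes come first."""
    analysis = json.loads(analysis_path.read_text())
    return sorted(analysis["improvement_suggestions"],
                  key=lambda s: PRIORITY_ORDER.get(s["priority"], len(PRIORITY_ORDER)))
```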