name: waza-runner
description: |
  Run evaluations on Agent Skills to measure their effectiveness.
  USE FOR: "run skill evals", "evaluate my skill", "test skill quality", "check skill triggers", "skill compliance check", "measure skill performance", "run evals on [skill-name]", "grade skill execution".
  DO NOT USE FOR: writing skills (use skill-authoring), improving frontmatter (use sensei), or general testing unrelated to skills.
metadata:
  author: spboyer
  version: "1.0"

Skill Eval Runner

Evaluate Agent Skills like you evaluate AI Agents

This skill runs evaluations on other skills to measure their effectiveness using the same patterns that power AI agent evaluations.

When to Use

Running quality evaluations on a skill
Testing whether a skill triggers on the correct prompts
Measuring skill behavior quality
Generating eval reports for CI/CD

Commands

Run Evals

Run evals on <skill-name>

Initialize Eval Suite

Create evals for <skill-name>

Generate Report

Generate eval report for <skill-name>

Workflow

Check for Eval Suite: Look for eval.yaml in the skill directory
Load Tasks: Parse task definitions from tasks/*.yaml
Execute: Run each task through the configured graders
Report: Output results in JSON or Markdown format
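
The four steps above can be sketched in Python as follows. This is only an illustration of the flow, not waza's actual implementation: `run_suite`, `report_as_json`, and the `Grader` signature are hypothetical names, and the task files are read as raw text rather than parsed against the real task schema.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical grader signature: receives the raw task text, returns True on pass.
Grader = Callable[[str], bool]


def run_suite(skill_dir: Path, graders: dict[str, Grader]) -> dict:
    # 1. Check for Eval Suite: eval.yaml must exist in the skill directory.
    if not (skill_dir / "eval.yaml").is_file():
        raise FileNotFoundError(f"no eval.yaml in {skill_dir}")

    # 2. Load Tasks: collect tasks/*.yaml (raw text here; a real runner
    #    would parse the YAML into task definitions).
    task_files = sorted((skill_dir / "tasks").glob("*.yaml"))

    results = []
    for task_file in task_files:
        task_text = task_file.read_text(encoding="utf-8")
        # 3. Execute: run the task through every configured grader.
        outcome = {name: grader(task_text) for name, grader in graders.items()}
        results.append({
            "task": task_file.name,
            "graders": outcome,
            "passed": all(outcome.values()),
        })

    # 4. Report: summarize as a plain dict that can be serialized.
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results) if results else 0.0
    return {"tasks": results, "pass_rate": pass_rate}


def report_as_json(report: dict) -> str:
    # JSON rendering of the report; a Markdown renderer would sit alongside it.
    return json.dumps(report, indent=2)
```

Keeping graders as plain callables means code, LLM, and human graders could share one interface in a sketch like this; whether waza does so is not stated here.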

Metrics Measured

Metric	Description	Default Threshold
Task Completion	Did the skill accomplish the goal?	80%
Trigger Accuracy	Was the skill invoked on the correct prompts?	90%
Behavior Quality	Tool calls, efficiency, reasoning	70%
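
As a rough illustration of how the default thresholds in the table might be applied, the sketch below compares metric scores against them. The metric keys `task_completion` and `trigger_accuracy` follow the sample output in this document; `behavior_quality`, the function name `check_thresholds`, and the choice that a score equal to its threshold passes are assumptions.

```python
# Default thresholds from the table above, expressed as fractions of 1.0.
# "behavior_quality" is an assumed key name.
DEFAULT_THRESHOLDS = {
    "task_completion": 0.80,
    "trigger_accuracy": 0.90,
    "behavior_quality": 0.70,
}


def check_thresholds(scores, thresholds=DEFAULT_THRESHOLDS):
    """Return {metric: passed} for every metric that has a score.

    Assumption: a score exactly equal to its threshold counts as a pass.
    """
    return {
        metric: scores[metric] >= minimum
        for metric, minimum in thresholds.items()
        if metric in scores
    }


# Example: task completion clears 80%, trigger accuracy misses 90%.
verdicts = check_thresholds({"task_completion": 0.9, "trigger_accuracy": 0.85})
```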

Grader Types

Code Graders: Deterministic assertions, regex matching
LLM Graders: Model-as-judge with configurable rubrics
Human Graders: Manual review workflow
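
A code grader in the sense above (a deterministic assertion based on regex matching) could look like the following sketch. The factory name `regex_grader` and its interface are hypothetical and are not taken from the Grader Reference.

```python
import re


def regex_grader(pattern):
    """Build a deterministic grader that passes when `pattern` occurs in the output."""
    compiled = re.compile(pattern)

    def grade(output: str) -> bool:
        # search() matches anywhere in the text, not only at the start.
        return compiled.search(output) is not None

    return grade


# Example: require that the skill's output mentions a results file.
mentions_results = regex_grader(r"results\.json")
```

Because the same input always yields the same verdict, graders of this kind are cheap to run on every commit, while LLM and human graders cover judgments a regex cannot express.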

Example Usage

Running Evals

# From CLI
waza run ./my-skill/eval.yaml

# Output to file
waza run ./my-skill/eval.yaml -o results.json

Interpreting Results

{
  "summary": {
    "pass_rate": 0.85,
    "composite_score": 0.82
  },
  "metrics": {
    "task_completion": { "score": 0.9, "passed": true },
    "trigger_accuracy": { "score": 0.95, "passed": true }
  }
}
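
For CI use, a report shaped like the sample above can be checked with a few lines of Python. This sketch assumes the file written by `-o results.json` has exactly the structure shown; `failed_metrics` is a hypothetical helper, not part of waza.

```python
import json


def failed_metrics(results_json: str) -> list[str]:
    """Return the names of metrics whose "passed" flag is not true."""
    report = json.loads(results_json)
    metrics = report.get("metrics", {})
    return sorted(
        name for name, metric in metrics.items()
        if not metric.get("passed", False)
    )


# The sample report from this section, inlined for illustration.
SAMPLE = """
{
  "summary": {"pass_rate": 0.85, "composite_score": 0.82},
  "metrics": {
    "task_completion": {"score": 0.9, "passed": true},
    "trigger_accuracy": {"score": 0.95, "passed": true}
  }
}
"""

failures = failed_metrics(SAMPLE)
```

A CI step could read the real file, call `failed_metrics` on its contents, and exit with a non-zero status when the returned list is not empty.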

References

Eval Specification - Full eval.yaml schema
Writing Tasks - Task definition guide
Grader Reference - Available graders
