---
name: eval-audit-review
description: Audit SoulMap AI evals so datasets, assertions, source markers, and golden responses stay source-backed, failure-oriented, and hard to game.
---
# Eval audit review
Use this skill when the task is to inspect or improve the trustworthiness of SoulMap's eval system rather than only adding one more case.
## Do not use this skill for

- routine edits to `evals/datasets/` as the main task; use `eval-suite-maintainer`
- broad release consistency review; use `release-readiness-review`
- Python-only cleanup with no eval question; use `python-maintainer`
## Mission
Keep evals honest, source-backed, and useful against real failure modes instead of optimizing for easy green runs.
## Sources to check first

- `evals/README.md`
- `evals/datasets/`
- `tests/contract/`
- `tests/eval_regression/`
- `src/soulmap/devtools/evals/`
- the source Markdown or Python files each eval claims to protect
## What to look for
- evals that pass because assertions are too loose
- cases with no clear source backing in `AGENTS.md`, `skills/`, or `templates/`
- wording checks that drift from runtime examples
- evaluator logic that is brittle, fuzzy, or easy to satisfy accidentally
- important failure modes that appear in code or docs but are not represented in datasets
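To make "assertions that are too loose" concrete, here is a minimal sketch of a gameable check next to a sharper one. The function names, the refund-policy scenario, and the pass conditions are invented for illustration; they are not SoulMap's real eval schema.

```python
# Hypothetical sketch: a loose eval assertion vs. a sharper one.
# The scenario and pass conditions are illustrative, not real SoulMap checks.

def loose_check(response: str) -> bool:
    # Gameable: passes for nearly any answer that mentions the topic at all.
    return "refund" in response.lower()

def sharp_check(response: str) -> bool:
    # Ties the pass condition to the behavior that actually matters:
    # both the policy window and the escalation path must appear.
    text = response.lower()
    return "30 days" in text and "escalate" in text

# A vacuous answer slips past the loose check but fails the sharp one.
assert loose_check("Refunds are great!")
assert not sharp_check("Refunds are great!")
assert sharp_check("Within 30 days we escalate to billing.")
```

The audit question for each existing assertion is the same: would a wrong-but-plausible answer still pass it?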
## Workflow
- Identify the failure mode or product contract the eval is supposed to protect.
- Check whether the current dataset, harness, and source files all describe the same thing.
- Tighten assertions only where the behavior is actually important.
- Prefer a few sharp cases over many noisy ones.
- Add or update `source_markers` when confidence needs to be explicit.
- Run the matching eval commands, then the closest pytest contracts.
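The `source_markers` step above can be sketched as a dataset entry that records its own backing. This is a hypothetical shape, not the real dataset schema: the field names (`expected_substrings`, `source_markers`, `file`, `reason`) and the file path are assumptions to adapt to whatever the actual datasets use.

```python
# Hypothetical dataset entry carrying explicit source markers.
# Field names and the referenced path are illustrative, not the real schema.
case = {
    "id": "refund-window-001",
    "prompt": "How long do users have to request a refund?",
    "expected_substrings": ["30 days"],
    "source_markers": [
        # Each marker names the file the assertion is backed by and why.
        {"file": "templates/refund_policy.md",
         "reason": "states the 30-day refund window"},
    ],
}

def has_source_backing(case: dict) -> bool:
    # An audit-quality case carries at least one concrete source marker.
    return bool(case.get("source_markers"))

assert has_source_backing(case)
```

A case without markers is not automatically wrong, but it is the first place an auditor should look for drift between the eval and the source it claims to protect.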
## Expected output
### Findings
List the eval weaknesses first, especially loose assertions, stale source links, or blind spots.
### Fixes
Summarize the dataset, harness, or contract changes that improved audit quality.
### Validation
State which eval and pytest commands were run.
## Definition of done
The audited eval surface should be:
- harder to game accidentally
- clearly tied back to real source files or runtime behavior
- focused on meaningful failure modes
- validated with the exact commands maintainers actually use