---
name: eval-audit-review
description: Audit SoulMap AI evals so datasets, assertions, source markers, and golden responses stay source-backed, failure-oriented, and hard to game.
---
# Eval audit review
Use this skill when the task is to inspect or improve the trustworthiness of SoulMap's eval system rather than only adding one more case.
## Do not use this skill for

- routine edits to `evals/datasets/` as the main task; use `eval-suite-maintainer`
- broad release consistency review; use `release-readiness-review`
- Python-only cleanup with no eval question; use `python-maintainer`
## Mission
Keep evals honest, source-backed, and useful against real failure modes instead of optimizing for easy green runs.
## Sources to check first

- `evals/README.md`
- `evals/datasets/`
- `tests/contract/`
- `tests/eval_regression/`
- `src/soulmap/devtools/evals/`
- the source Markdown or Python files each eval claims to protect
## What to look for
- evals that pass because assertions are too loose
- cases with no clear source backing in `AGENTS.md`, `skills/`, or `templates/`
- wording checks that drift from runtime examples
- evaluator logic that is brittle, fuzzy, or easy to satisfy accidentally
- important failure modes that appear in code or docs but are not represented in datasets
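To make "assertions that are too loose" concrete, here is a minimal sketch of a gameable check next to a sharper one. The function names, the refund-policy scenario, and the pass conditions are invented for illustration; they are not SoulMap's real eval schema.

```python
# Hypothetical sketch: a loose eval assertion vs. a sharper one.
# The scenario and pass conditions are illustrative, not real SoulMap checks.

def loose_check(response: str) -> bool:
    # Gameable: passes for nearly any answer that mentions the topic at all.
    return "refund" in response.lower()

def sharp_check(response: str) -> bool:
    # Ties the pass condition to the behavior that actually matters:
    # both the policy window and the escalation path must appear.
    text = response.lower()
    return "30 days" in text and "escalate" in text

# A vacuous answer slips past the loose check but fails the sharp one.
assert loose_check("Refunds are great!")
assert not sharp_check("Refunds are great!")
assert sharp_check("Within 30 days we escalate to billing.")
```

The audit question for each existing assertion is the same: would a wrong-but-plausible answer still pass it?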
## Workflow
- Identify the failure mode or product contract the eval is supposed to protect.
- Check whether the current dataset, harness, and source files all describe the same thing.
- Tighten assertions only where the behavior is actually important.
- Prefer a few sharp cases over many noisy ones.
- Add or update `source_markers` when confidence needs to be explicit.
- Run the matching eval commands, then the closest pytest contracts.
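The `source_markers` step above can be sketched as a dataset entry that records its own backing. This is a hypothetical shape, not the real dataset schema: the field names (`expected_substrings`, `source_markers`, `file`, `reason`) and the file path are assumptions to adapt to whatever the actual datasets use.

```python
# Hypothetical dataset entry carrying explicit source markers.
# Field names and the referenced path are illustrative, not the real schema.
case = {
    "id": "refund-window-001",
    "prompt": "How long do users have to request a refund?",
    "expected_substrings": ["30 days"],
    "source_markers": [
        # Each marker names the file the assertion is backed by and why.
        {"file": "templates/refund_policy.md",
         "reason": "states the 30-day refund window"},
    ],
}

def has_source_backing(case: dict) -> bool:
    # An audit-quality case carries at least one concrete source marker.
    return bool(case.get("source_markers"))

assert has_source_backing(case)
```

A case without markers is not automatically wrong, but it is the first place an auditor should look for drift between the eval and the source it claims to protect.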
## Expected output
### Findings
List the eval weaknesses first, especially loose assertions, stale source links, or blind spots.
### Fixes
Summarize the dataset, harness, or contract changes that improved audit quality.
### Validation
State which eval and pytest commands were run.
## Definition of done
The audited eval surface should be:
- harder to game accidentally
- clearly tied back to real source files or runtime behavior
- focused on meaningful failure modes
- validated with the exact commands maintainers actually use