---
name: enhance-evals
description: Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.
disable-model-invocation: true
argument-hint: [area to improve]
allowed-tools: Bash, Read, Write, Edit, Glob, Grep, Agent
effort: high
---
# Enhance the Eval System
You are improving the boredgame.lol recommendation engine evaluation system. The user wants the eval system itself to be better -- more accurate tests, better failure detection, smarter judging, more useful output.
## Full Context You MUST Know
### System Architecture
- `evals/runner.ts` (626 lines) - Core runner: parallel execution, LLM judge, constraint checker, regression tracking, persistent logging
- `evals/types.ts` (218 lines) - Type definitions for cases, results, runs, metrics
- `evals/metrics.ts` (109 lines) - NDCG@K, Precision@K, MRR, Hit Rate calculations
- `evals/llm-judge.ts` (105 lines) - GPT-4o-mini rates overall result quality 0-10
- `evals/constraint-checker.ts` (110 lines) - Detects player count, time, complexity, and game type violations
- `evals/compare-runs.ts` (139 lines) - Side-by-side run comparison with regression detection
- `evals/summary.ts` (109 lines) - Run summary viewer with history mode
- `evals/analyze-failures.ts` (254 lines) - Failure pattern categorization + most-missing-game tracking
- `evals/generate-cases.ts` - 130 hand-curated base cases
- `evals/generate-expanded-cases.ts` - +177 systematic variations
- `evals/generate-massive.ts` - LLM-generated cases in the thousands (GPT-4o-mini, batched)
- `evals/cases.json` - Currently ~3,028 cases across 16 categories
### Known Weaknesses in the Eval System
- **Pass/fail criteria may be too strict:** A case fails if ANY ideal game (relevance >= 2) is missing. But the results might be excellent games the eval case didn't anticipate. The LLM judge catches this (a 7.14/10 average vs a 68.4% pass rate means many "failing" cases have good results).
- **`idealGames` are sometimes wrong:** Some generated cases reference games that don't exist in the DB, or use slightly wrong names (e.g., "Castles of Burgundy" vs "The Castles of Burgundy").
- **LLM judge uses a 0-10 scale:** Research (arXiv 2411.15594) shows 0-2 scales with per-dimension scoring are more reliable. The current judge gives a single holistic score.
- **No serendipity metric:** Research shows users want accuracy + novelty. We don't measure whether recommendations include unexpected-but-delightful games.
- **No familiarity balance metric:** The optimal mix is 20-30% recognizable games, 70-80% discovery. We don't track this.
- **Catalog coverage is only 0.5%:** Only ~400-500 unique games appear across all recommendations. This could indicate test cases are too similar or that the engine has narrow candidate fetching.
- **Generated cases may have quality issues:** LLM-generated cases (GPT-4o-mini at temp 0.9) may include unrealistic queries, wrong game names, or contradictory constraints.
- **No confidence intervals:** We report point estimates without standard errors. With 3,000+ cases we should use statistical significance testing (a minimal sketch follows this list).
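For the confidence-interval gap, a minimal sketch of a Wilson score interval for the aggregate pass rate. The function name and its placement in `evals/metrics.ts` are assumptions, not existing code:

```ts
// Hypothetical helper for evals/metrics.ts: 95% Wilson score interval
// for a binomial proportion (e.g. pass rate over n cases).
export function wilsonInterval(
  passes: number,
  total: number,
  z = 1.96 // z-score for 95% confidence
): { lower: number; upper: number } {
  if (total === 0) return { lower: 0, upper: 0 };
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const margin =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { lower: center - margin, upper: center + margin };
}

// e.g. 68.4% over 3,028 cases -> roughly +/- 1.7 percentage points:
// const { lower, upper } = wilsonInterval(2071, 3028);
```

At ~3,000 cases the interval is narrow, which is exactly why reporting it makes category-level deltas (where n is much smaller) easier to trust or dismiss.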
### Research-Backed Improvements to Consider
From `docs/research/recommendation-eval-methodology.md` and `evals/RECOMMENDATIONS.md`:
- Pairwise LLM comparison instead of absolute scoring (more reliable)
- Per-dimension scoring: Rate mechanic match, theme match, constraint satisfaction separately
- Chain-of-thought reasoning in LLM judge (explain BEFORE scoring)
- Trust buster detection: Flag obviously-wrong individual results (Chess for party game query)
- Constraint violation breakdown by type (time vs player count vs complexity)
- Serendipity@K: relevant AND dissimilar to obvious matches (see the sketch after this list)
- Familiarity-discovery ratio: % of results that are well-known vs obscure
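As a starting point for Serendipity@K, a minimal sketch assuming each recommendation can be scored for similarity against the case's obvious matches (e.g. embedding similarity to the `idealGames`). The `ScoredRec` shape and the 0.8 cutoff are assumptions, not current `types.ts` definitions:

```ts
// Hypothetical Serendipity@K for evals/metrics.ts: fraction of the
// top-K results that are relevant AND not close to an obvious match.
interface ScoredRec {
  gameId: string;
  relevant: boolean;         // relevance >= 2 under the current labels
  obviousSimilarity: number; // 0..1, high = close to an obvious match
}

export function serendipityAtK(
  recs: ScoredRec[],
  k: number,
  similarityCutoff = 0.8 // assumed threshold; tune against labeled runs
): number {
  const topK = recs.slice(0, k);
  if (topK.length === 0) return 0;
  const serendipitous = topK.filter(
    (r) => r.relevant && r.obviousSimilarity < similarityCutoff
  );
  return serendipitous.length / topK.length;
}
```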
### The 16 Categories and Their Current Performance
| Category | Cases | Pass Rate | Notes |
|---|---|---|---|
| mechanic-focused | 530 | ~32% | Weakest. BGG mechanic alias gap is the root cause. |
| multi-constraint | 384 | ~36% | Combined constraints are hard to satisfy. |
| theme-focused | 356 | ~86% | Strong. Theme matching works well. |
| video-game | 262 | ~100% | Perfect. No cross-contamination. |
| similar-to | 212 | ~83% | Good. "Like X" queries work. |
| mood-vibe | 189 | ~29% | Weak. Missing Patchwork, Jaipur for chill. |
| player-count | 177 | ~60% | Moderate. Some constraint violations. |
| time-constraint | 164 | ~50% | Moderate. Time violations persist. |
| free-text-intent | 159 | ~73% | Good. Natural language decent. |
| edge-case | 153 | ~100% | Perfect. Handles garbage gracefully. |
| negative-preference | 123 | ~73% | Good. Respects exclusions. |
| designer-search | 116 | ~42% | Weak. Non-designer games mixed in. |
| complexity | 112 | ~44% | Moderate. Misses gateway games. |
| real-user-feedback | 78 | ~33% | Weak. BGG user issues persist. |
| regression | 9 | ~89% | Good. Past bugs mostly fixed. |
| party-game | 4 | ~50% | Too few cases. |
## What To Do
Based on what the user asks (or `$ARGUMENTS`), pick the right enhancement:
### If they want to fix broken/inaccurate test cases:

- Read `evals/cases.json` and look for cases with wrong game names, unrealistic queries, or contradictory constraints
- Cross-reference `idealGames` against the actual database (query Supabase or use the validate script from `scripts/validate-eval-cases.ts`; a sketch of the idea follows this list)
- Fix the cases in the appropriate generator (`generate-cases.ts`, `generate-expanded-cases.ts`, or `generate-massive.ts`) and regenerate
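A sketch of the cross-referencing step, in the spirit of `scripts/validate-eval-cases.ts` (which may already do this differently). It assumes a Supabase `games` table with a `name` column, and handles `idealGames` entries that are either plain strings or objects with a `name` field:

```ts
// Hypothetical validation sketch: report idealGames names that don't
// exist verbatim in the database. Adjust table/column names to the
// real schema before using.
import { createClient } from '@supabase/supabase-js';
import rawCases from './cases.json';

const cases = rawCases as Array<{
  idealGames?: Array<string | { name: string }>;
}>;

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

async function findUnknownIdealGames(): Promise<string[]> {
  const names = [
    ...new Set(
      cases.flatMap((c) =>
        (c.idealGames ?? []).map((g) => (typeof g === 'string' ? g : g.name))
      )
    ),
  ];
  const { data, error } = await supabase
    .from('games')
    .select('name')
    .in('name', names);
  if (error) throw error;
  const known = new Set(data.map((g) => g.name));
  // Misses are either typos ("Castles of Burgundy" vs
  // "The Castles of Burgundy") or games missing from the DB.
  return names.filter((n) => !known.has(n));
}

findUnknownIdealGames().then((unknown) => console.log(unknown));
```

In practice the `.in()` filter would need batching (a few hundred names per request) to stay under URL length limits with thousands of distinct names.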
### If they want better metrics:

- Read `evals/metrics.ts` and `evals/types.ts`
- Add new metric calculations (serendipity, familiarity balance, catalog coverage per run; a familiarity-balance sketch follows this list)
- Update `computeCaseMetrics()` and `computeAggregateMetrics()` in the runner
- Update the report formatter to display new metrics
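A minimal sketch of the familiarity-balance metric against the 20-30% target above. It assumes each recommended game exposes a BGG rank (or similar popularity signal); the field name and the rank-1,000 cutoff are assumptions to tune, not existing definitions:

```ts
// Hypothetical familiarity metric for evals/metrics.ts: fraction of
// results a typical user would recognize. Target band: 0.2-0.3.
interface RecGame {
  gameId: string;
  bggRank: number | null; // null = unranked/obscure
}

const FAMILIAR_RANK_CUTOFF = 1000; // assumed proxy for "recognizable"

export function familiarityRatio(recs: RecGame[]): number {
  if (recs.length === 0) return 0;
  const familiar = recs.filter(
    (g) => g.bggRank !== null && g.bggRank <= FAMILIAR_RANK_CUTOFF
  );
  return familiar.length / recs.length;
}
```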
### If they want a better LLM judge:

- Read `evals/llm-judge.ts`
- Implement per-dimension scoring (mechanic match, theme match, constraint satisfaction as separate 0-2 scores; a sketch of the verdict shape follows this list)
- Add a chain-of-thought reasoning requirement (explain BEFORE scoring)
- Consider a pairwise comparison mode for A/B testing system versions
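One way the per-dimension verdict could be shaped in `evals/llm-judge.ts`; all field names here are illustrative, not the current judge's output:

```ts
// Hypothetical per-dimension judge verdict. The two key properties:
// (1) each dimension is a 0-2 scale rather than one holistic 0-10,
// (2) reasoning comes before the scores -- with structured outputs,
//     listing it first in the schema nudges the model to generate the
//     chain of thought before committing to numbers.
interface JudgeVerdict {
  reasoning: string;            // chain-of-thought, written first
  mechanicMatch: 0 | 1 | 2;     // 0 = miss, 1 = partial, 2 = strong
  themeMatch: 0 | 1 | 2;
  constraintSatisfaction: 0 | 1 | 2;
  trustBusters: string[];       // obviously-wrong results, e.g. Chess for a party query
}
```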
### If they want more/better test cases:

- Identify which categories are underrepresented (party-game has only 4 cases!)
- Add hand-curated cases in `generate-cases.ts` for the weakest categories (an example case shape follows this list)
- Run `npm run eval:generate-massive` to fill gaps with LLM-generated cases
- Validate new cases against the database
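A sketch of what one hand-curated party-game case might look like. The field names, ID scheme, and game choices are illustrative; check the real `EvalCase` shape in `evals/types.ts` before copying:

```ts
// Hypothetical hand-curated case for generate-cases.ts -- the
// party-game category currently has only 4 cases.
const partyGameCase = {
  id: 'party-loud-group-8p',
  category: 'party-game',
  query: 'loud party game for 8 people, nothing that needs a big table',
  constraints: { players: 8 },
  idealGames: [
    { name: 'Wavelength', relevance: 3 },
    { name: 'Just One', relevance: 3 },
    { name: 'Telestrations', relevance: 2 },
  ],
};
```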
### If they want better reporting/visualization:

- Read `evals/summary.ts`, `evals/compare-runs.ts`, and `evals/analyze-failures.ts`
- Add new views (e.g., trend over time, per-game analysis, category drill-down; a trend sketch follows this list)
- Consider adding an HTML report generator for richer visualization
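A sketch of a trend-over-time view. It assumes each run is persisted as a JSON file with an ISO timestamp and an aggregate pass rate; match the real format produced by the runner's persistent logging:

```ts
// Hypothetical trend view: print pass rate per run, oldest first,
// with a crude ASCII bar. Field names are assumptions.
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

interface RunSummary {
  timestamp: string; // assumed ISO-8601, so string sort = time sort
  passRate: number;  // 0..1
}

export function printPassRateTrend(runsDir: string): void {
  const runs: RunSummary[] = readdirSync(runsDir)
    .filter((f) => f.endsWith('.json'))
    .map((f) => JSON.parse(readFileSync(join(runsDir, f), 'utf8')))
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  for (const run of runs) {
    const bar = '#'.repeat(Math.round(run.passRate * 40));
    console.log(
      `${run.timestamp}  ${(run.passRate * 100).toFixed(1)}%  ${bar}`
    );
  }
}
```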
## Important Rules
- READ the relevant files thoroughly before making changes
- ALWAYS regenerate `cases.json` after modifying generators (run the appropriate generate script)
- ALWAYS run a quick eval after changes to verify nothing broke: `source .env.local && npx tsx evals/runner.ts --quick --no-judge`
- Document what you changed in `evals/EVAL-WORKLOG.md`
- Do NOT change engine code -- this skill is about the eval system only