---
name: enhance-evals
description: Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.
disable-model-invocation: true
argument-hint: [area to improve]
allowed-tools: Bash, Read, Write, Edit, Glob, Grep, Agent
effort: high
---
# Enhance the Eval System
You are improving the boredgame.lol recommendation engine evaluation system. The user wants the eval system itself to be better -- more accurate tests, better failure detection, smarter judging, more useful output.
## Full Context You MUST Know
### System Architecture
- `evals/runner.ts` (626 lines) - Core runner: parallel execution, LLM judge, constraint checker, regression tracking, persistent logging
- `evals/types.ts` (218 lines) - Type definitions for cases, results, runs, metrics
- `evals/metrics.ts` (109 lines) - NDCG@K, Precision@K, MRR, Hit Rate calculations
- `evals/llm-judge.ts` (105 lines) - GPT-4o-mini rates overall result quality 0-10
- `evals/constraint-checker.ts` (110 lines) - Detects player count, time, complexity, and game type violations
- `evals/compare-runs.ts` (139 lines) - Side-by-side run comparison with regression detection
- `evals/summary.ts` (109 lines) - Run summary viewer with history mode
- `evals/analyze-failures.ts` (254 lines) - Failure pattern categorization + most-missing-game tracking
- `evals/generate-cases.ts` - 130 hand-curated base cases
- `evals/generate-expanded-cases.ts` - +177 systematic variations
- `evals/generate-massive.ts` - LLM-generated cases in the thousands (GPT-4o-mini, batched)
- `evals/cases.json` - Currently ~3,028 cases across 16 categories
### Known Weaknesses in the Eval System
- **Pass/fail criteria may be too strict:** A case fails if ANY ideal game (relevance >= 2) is missing. But the results might be excellent games the eval case didn't anticipate. The LLM judge catches this (a 7.14/10 average vs a 68.4% pass rate means many "failing" cases have good results).
- **`idealGames` are sometimes wrong:** Some generated cases reference games that don't exist in the DB, or use slightly wrong names (e.g., "Castles of Burgundy" vs "The Castles of Burgundy").
- **LLM judge uses a 0-10 scale:** Research (arXiv 2411.15594) shows 0-2 scales with per-dimension scoring are more reliable. The current judge gives a single holistic score.
- **No serendipity metric:** Research shows users want accuracy + novelty. We don't measure whether recommendations include unexpected-but-delightful games.
- **No familiarity balance metric:** The optimal mix is 20-30% recognizable games, 70-80% discovery. We don't track this.
- **Catalog coverage is only 0.5%:** Only ~400-500 unique games appear across all recommendations. This could indicate test cases are too similar or that the engine has narrow candidate fetching.
- **Generated cases may have quality issues:** LLM-generated cases (GPT-4o-mini at temp 0.9) may include unrealistic queries, wrong game names, or contradictory constraints.
- **No confidence intervals:** We report point estimates without standard errors. With 3,000+ cases we should use statistical significance testing (a minimal sketch follows this list).
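For the confidence-interval gap, a minimal sketch of a Wilson score interval for the aggregate pass rate. The function name and its placement in `evals/metrics.ts` are assumptions, not existing code:

```ts
// Hypothetical helper for evals/metrics.ts: 95% Wilson score interval
// for a binomial proportion (e.g. pass rate over n cases).
export function wilsonInterval(
  passes: number,
  total: number,
  z = 1.96 // z-score for 95% confidence
): { lower: number; upper: number } {
  if (total === 0) return { lower: 0, upper: 0 };
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const margin =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { lower: center - margin, upper: center + margin };
}

// e.g. 68.4% over 3,028 cases -> roughly +/- 1.7 percentage points:
// const { lower, upper } = wilsonInterval(2071, 3028);
```

At ~3,000 cases the interval is narrow, which is exactly why reporting it makes category-level deltas (where n is much smaller) easier to trust or dismiss.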
### Research-Backed Improvements to Consider
From `docs/research/recommendation-eval-methodology.md` and `evals/RECOMMENDATIONS.md`:
- Pairwise LLM comparison instead of absolute scoring (more reliable)
- Per-dimension scoring: Rate mechanic match, theme match, constraint satisfaction separately
- Chain-of-thought reasoning in LLM judge (explain BEFORE scoring)
- Trust buster detection: Flag obviously-wrong individual results (Chess for party game query)
- Constraint violation breakdown by type (time vs player count vs complexity)
- Serendipity@K: relevant AND dissimilar to obvious matches (see the sketch after this list)
- Familiarity-discovery ratio: % of results that are well-known vs obscure
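As a starting point for Serendipity@K, a minimal sketch assuming each recommendation can be scored for similarity against the case's obvious matches (e.g. embedding similarity to the `idealGames`). The `ScoredRec` shape and the 0.8 cutoff are assumptions, not current `types.ts` definitions:

```ts
// Hypothetical Serendipity@K for evals/metrics.ts: fraction of the
// top-K results that are relevant AND not close to an obvious match.
interface ScoredRec {
  gameId: string;
  relevant: boolean;         // relevance >= 2 under the current labels
  obviousSimilarity: number; // 0..1, high = close to an obvious match
}

export function serendipityAtK(
  recs: ScoredRec[],
  k: number,
  similarityCutoff = 0.8 // assumed threshold; tune against labeled runs
): number {
  const topK = recs.slice(0, k);
  if (topK.length === 0) return 0;
  const serendipitous = topK.filter(
    (r) => r.relevant && r.obviousSimilarity < similarityCutoff
  );
  return serendipitous.length / topK.length;
}
```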
### The 16 Categories and Their Current Performance
| Category | Cases | Pass Rate | Notes |
|---|---|---|---|
| mechanic-focused | 530 | ~32% | Weakest. BGG mechanic alias gap is the root cause. |
| multi-constraint | 384 | ~36% | Combined constraints are hard to satisfy. |
| theme-focused | 356 | ~86% | Strong. Theme matching works well. |
| video-game | 262 | ~100% | Perfect. No cross-contamination. |
| similar-to | 212 | ~83% | Good. "Like X" queries work. |
| mood-vibe | 189 | ~29% | Weak. Missing Patchwork, Jaipur for chill. |
| player-count | 177 | ~60% | Moderate. Some constraint violations. |
| time-constraint | 164 | ~50% | Moderate. Time violations persist. |
| free-text-intent | 159 | ~73% | Good. Natural language decent. |
| edge-case | 153 | ~100% | Perfect. Handles garbage gracefully. |
| negative-preference | 123 | ~73% | Good. Respects exclusions. |
| designer-search | 116 | ~42% | Weak. Non-designer games mixed in. |
| complexity | 112 | ~44% | Moderate. Misses gateway games. |
| real-user-feedback | 78 | ~33% | Weak. BGG user issues persist. |
| regression | 9 | ~89% | Good. Past bugs mostly fixed. |
| party-game | 4 | ~50% | Too few cases. |
## What To Do
Based on what the user asks (or `$ARGUMENTS`), pick the right enhancement:
### If they want to fix broken/inaccurate test cases:

- Read `evals/cases.json` and look for cases with wrong game names, unrealistic queries, or contradictory constraints
- Cross-reference `idealGames` against the actual database (query Supabase or use the validate script from `scripts/validate-eval-cases.ts`; a sketch of the idea follows this list)
- Fix the cases in the appropriate generator (`generate-cases.ts`, `generate-expanded-cases.ts`, or `generate-massive.ts`) and regenerate
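A sketch of the cross-referencing step, in the spirit of `scripts/validate-eval-cases.ts` (which may already do this differently). It assumes a Supabase `games` table with a `name` column, and handles `idealGames` entries that are either plain strings or objects with a `name` field:

```ts
// Hypothetical validation sketch: report idealGames names that don't
// exist verbatim in the database. Adjust table/column names to the
// real schema before using.
import { createClient } from '@supabase/supabase-js';
import rawCases from './cases.json';

const cases = rawCases as Array<{
  idealGames?: Array<string | { name: string }>;
}>;

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

async function findUnknownIdealGames(): Promise<string[]> {
  const names = [
    ...new Set(
      cases.flatMap((c) =>
        (c.idealGames ?? []).map((g) => (typeof g === 'string' ? g : g.name))
      )
    ),
  ];
  const { data, error } = await supabase
    .from('games')
    .select('name')
    .in('name', names);
  if (error) throw error;
  const known = new Set(data.map((g) => g.name));
  // Misses are either typos ("Castles of Burgundy" vs
  // "The Castles of Burgundy") or games missing from the DB.
  return names.filter((n) => !known.has(n));
}

findUnknownIdealGames().then((unknown) => console.log(unknown));
```

In practice the `.in()` filter would need batching (a few hundred names per request) to stay under URL length limits with thousands of distinct names.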
### If they want better metrics:

- Read `evals/metrics.ts` and `evals/types.ts`
- Add new metric calculations (serendipity, familiarity balance, catalog coverage per run; a familiarity-balance sketch follows this list)
- Update `computeCaseMetrics()` and `computeAggregateMetrics()` in the runner
- Update the report formatter to display new metrics
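A minimal sketch of the familiarity-balance metric against the 20-30% target above. It assumes each recommended game exposes a BGG rank (or similar popularity signal); the field name and the rank-1,000 cutoff are assumptions to tune, not existing definitions:

```ts
// Hypothetical familiarity metric for evals/metrics.ts: fraction of
// results a typical user would recognize. Target band: 0.2-0.3.
interface RecGame {
  gameId: string;
  bggRank: number | null; // null = unranked/obscure
}

const FAMILIAR_RANK_CUTOFF = 1000; // assumed proxy for "recognizable"

export function familiarityRatio(recs: RecGame[]): number {
  if (recs.length === 0) return 0;
  const familiar = recs.filter(
    (g) => g.bggRank !== null && g.bggRank <= FAMILIAR_RANK_CUTOFF
  );
  return familiar.length / recs.length;
}
```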
### If they want a better LLM judge:

- Read `evals/llm-judge.ts`
- Implement per-dimension scoring (mechanic match, theme match, constraint satisfaction as separate 0-2 scores; a sketch of the verdict shape follows this list)
- Add a chain-of-thought reasoning requirement (explain BEFORE scoring)
- Consider a pairwise comparison mode for A/B testing system versions
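One way the per-dimension verdict could be shaped in `evals/llm-judge.ts`; all field names here are illustrative, not the current judge's output:

```ts
// Hypothetical per-dimension judge verdict. The two key properties:
// (1) each dimension is a 0-2 scale rather than one holistic 0-10,
// (2) reasoning comes before the scores -- with structured outputs,
//     listing it first in the schema nudges the model to generate the
//     chain of thought before committing to numbers.
interface JudgeVerdict {
  reasoning: string;            // chain-of-thought, written first
  mechanicMatch: 0 | 1 | 2;     // 0 = miss, 1 = partial, 2 = strong
  themeMatch: 0 | 1 | 2;
  constraintSatisfaction: 0 | 1 | 2;
  trustBusters: string[];       // obviously-wrong results, e.g. Chess for a party query
}
```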
### If they want more/better test cases:

- Identify which categories are underrepresented (party-game has only 4 cases!)
- Add hand-curated cases in `generate-cases.ts` for the weakest categories (an example case shape follows this list)
- Run `npm run eval:generate-massive` to fill gaps with LLM-generated cases
- Validate new cases against the database
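A sketch of what one hand-curated party-game case might look like. The field names, ID scheme, and game choices are illustrative; check the real `EvalCase` shape in `evals/types.ts` before copying:

```ts
// Hypothetical hand-curated case for generate-cases.ts -- the
// party-game category currently has only 4 cases.
const partyGameCase = {
  id: 'party-loud-group-8p',
  category: 'party-game',
  query: 'loud party game for 8 people, nothing that needs a big table',
  constraints: { players: 8 },
  idealGames: [
    { name: 'Wavelength', relevance: 3 },
    { name: 'Just One', relevance: 3 },
    { name: 'Telestrations', relevance: 2 },
  ],
};
```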
### If they want better reporting/visualization:

- Read `evals/summary.ts`, `evals/compare-runs.ts`, and `evals/analyze-failures.ts`
- Add new views (e.g., trend over time, per-game analysis, category drill-down; a trend sketch follows this list)
- Consider adding an HTML report generator for richer visualization
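A sketch of a trend-over-time view. It assumes each run is persisted as a JSON file with an ISO timestamp and an aggregate pass rate; match the real format produced by the runner's persistent logging:

```ts
// Hypothetical trend view: print pass rate per run, oldest first,
// with a crude ASCII bar. Field names are assumptions.
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

interface RunSummary {
  timestamp: string; // assumed ISO-8601, so string sort = time sort
  passRate: number;  // 0..1
}

export function printPassRateTrend(runsDir: string): void {
  const runs: RunSummary[] = readdirSync(runsDir)
    .filter((f) => f.endsWith('.json'))
    .map((f) => JSON.parse(readFileSync(join(runsDir, f), 'utf8')))
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  for (const run of runs) {
    const bar = '#'.repeat(Math.round(run.passRate * 40));
    console.log(
      `${run.timestamp}  ${(run.passRate * 100).toFixed(1)}%  ${bar}`
    );
  }
}
```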
## Important Rules
- READ the relevant files thoroughly before making changes
- ALWAYS regenerate `cases.json` after modifying generators (run the appropriate generate script)
- ALWAYS run a quick eval after changes to verify nothing broke: `source .env.local && npx tsx evals/runner.ts --quick --no-judge`
- Document what you changed in `evals/EVAL-WORKLOG.md`
- Do NOT change engine code -- this skill is about the eval system only