name: run
description: Launch and manage CodeScaleBench benchmark runs with paired-run guardrails, quick reruns, and execution orchestration.
Skill: Run Benchmarks
Scope
Use this skill when the user asks to:
- Run benchmark suites, rerun failures, or launch gap-fill batches
- Manage multi-account parallel execution
- Execute paired baseline+MCP runs with curation guardrails
- Perform quick reruns of specific tasks or suites
Approval Gate (Required Before Running)
Before executing any benchmark run, confirm with the user:
- Model: which model? (e.g., anthropic/claude-haiku-4-5-20251001 for test runs)
- Suite / selection file: which benchmark suite or --selection-file?
- Config: paired (default), --baseline-only, or --full-only? Which --full-config?
- Parallel slots: how many? (default: auto-detect; use 8+ for multi-account)
- Category: staging (default) or official?
Do NOT launch a run until the user has confirmed these five parameters.
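The gate above can be sketched as a small pre-launch check. This is a minimal illustration, not part of the CodeScaleBench harnesses: the function name approval_gate and its positional-argument order are assumptions made for this example.

```shell
# Hypothetical helper (not shipped with the harnesses).
# Usage: approval_gate MODEL SUITE CONFIG PARALLEL CATEGORY
# Prints "confirmed" and returns 0 only when all five values are non-empty;
# otherwise prints the unconfirmed parameter names and returns 1.
approval_gate() {
  missing=""
  for pair in "model=${1:-}" "suite=${2:-}" "config=${3:-}" \
              "parallel=${4:-}" "category=${5:-}"; do
    key=${pair%%=*}     # text before the first "="
    value=${pair#*=}    # text after the first "="
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$key"
    fi
  done
  if [ -n "$missing" ]; then
    echo "unconfirmed: $missing"
    return 1
  fi
  echo "confirmed"
}
```

A launcher script could call this first and refuse to continue on a non-zero exit status, so an unanswered question blocks the run instead of silently falling back to a default.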
Canonical Commands
- Per-suite default: ./configs/harnesses/<suite>_2config.sh
- Unified selected-task runner: ./configs/harnesses/run_selected_tasks.sh
- Config registry: configs/eval_matrix.json
- Quick rerun: use the --rerun flag on the base command
Run Policy (Mandatory)
- Default execution is paired by task: baseline + sourcegraph_full
- Single-lane runs are gap-fill only:
  - --baseline-only requires valid existing sourcegraph_full counterpart runs
  - --full-only requires valid existing baseline counterpart runs
- Emergency bypass only: ALLOW_UNPAIRED_SINGLE_CONFIG=true
- Account readiness: always run python3 scripts/infra/account_health.py status before launching
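The pairing rule can be illustrated with a short guard. This is a sketch under stated assumptions: the function names are hypothetical, and the RUNS_DIR/<config> directory layout is invented for the example. A real guard would also need to check that the counterpart runs are valid, which a directory test does not do.

```shell
# Hypothetical helpers illustrating the policy; not the harness implementation.

# Echo the config whose existing runs a single-lane flag depends on.
counterpart_for() {
  case "$1" in
    --baseline-only) echo "sourcegraph_full" ;;
    --full-only)     echo "baseline" ;;
    *)               echo "" ;;   # paired run: no counterpart required
  esac
}

# Usage: single_lane_allowed FLAG RUNS_DIR
# Returns 0 when the run may proceed under the policy.
# Assumption: counterpart runs live in RUNS_DIR/<config> (illustrative layout).
single_lane_allowed() {
  needed=$(counterpart_for "$1")
  if [ -z "$needed" ]; then
    return 0                      # paired runs are always allowed
  fi
  if [ "${ALLOW_UNPAIRED_SINGLE_CONFIG:-false}" = "true" ]; then
    return 0                      # emergency bypass
  fi
  [ -d "$2/$needed" ]             # stand-in for "valid counterpart runs exist"
}
```

Keeping the flag-to-counterpart mapping in its own function makes the policy easy to read in one place and keeps the bypass check separate from the existence check.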
Standard Launch Patterns
# Paired per-suite run
./configs/harnesses/pytorch_2config.sh --parallel 4
# Paired selected-task run
./configs/harnesses/run_selected_tasks.sh --benchmark csb_sdlc_pytorch
# Gap-fill baseline only (guarded)
./configs/harnesses/run_selected_tasks.sh --benchmark csb_sdlc_pytorch --baseline-only
# Quick rerun of failed tasks
./configs/harnesses/run_selected_tasks.sh --benchmark csb_sdlc_pytorch --rerun failed
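# Dry-run sketch: compose a launch command from confirmed parameters and print
# it for review instead of executing it. build_launch_cmd is a hypothetical
# helper, and passing --parallel to run_selected_tasks.sh is an assumption
# based on the override described under Infrastructure.

```shell
# Usage: build_launch_cmd BENCHMARK [LANE] [PARALLEL]
# LANE is empty (paired), --baseline-only, or --full-only.
# PARALLEL is optional; when empty, the harness auto-detects.
build_launch_cmd() {
  cmd="./configs/harnesses/run_selected_tasks.sh --benchmark $1"
  if [ -n "${2:-}" ]; then
    cmd="$cmd $2"
  fi
  if [ -n "${3:-}" ]; then
    cmd="$cmd --parallel $3"
  fi
  echo "$cmd"   # print only; the caller decides whether to run it
}

# Example: show the command to the user for confirmation before launching.
build_launch_cmd csb_sdlc_pytorch --baseline-only 4
```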
Infrastructure
- Default environment: Daytona (see docs/DAYTONA.md)
- Parallelism: auto-detected from account count and rate limits; override with --parallel N
- Orchestration: scripts/running/control_plane.py manages multi-account scheduling
- Monitoring: scripts/running/monitor_and_queue.sh watches active runs
Related Skills
- /status: monitor active runs, check completion
- /audit: post-run validation and integrity checks
- /evaluate: extract and score results