---
name: experiment-loop
description: "Autonomous experiment loop: hypothesize > modify > test > evaluate > keep/discard > repeat. Run N experiments automatically with measurable metrics. Works for performance optimization, A/B testing, prompt engineering, and any measurable improvement task."
---
# Experiment Loop
Autonomous, iterative improvement inspired by Karpathy's autoresearch methodology. Define a metric, set a target, and let the loop run until the target is met or the iteration limit is reached.
## The 5-Step Loop
```
1. HYPOTHESIZE -> Form a specific, falsifiable improvement hypothesis
2. MODIFY      -> Apply the minimal code/config/prompt change
3. TEST        -> Run the measurement suite (benchmarks, tests, evals)
4. EVALUATE    -> Compare result against baseline and previous best
5. DECIDE      -> KEEP if better, DISCARD (git stash pop --index) if worse
        |
        v
   Repeat until target met OR max_iterations reached
```
Each iteration is atomic: one hypothesis, one change, one measurement, one decision.
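The skeleton below sketches one atomic iteration as code. It is an illustration in TypeScript, not an API this skill ships: `Harness`, `runLoop`, and every helper on them are hypothetical names.

```typescript
type Direction = "minimize" | "maximize";

interface Experiment {
  baseline: number;
  target: number;
  direction: Direction;
  maxIterations: number;
}

// Hypothetical harness: each method wraps one phase of the loop.
interface Harness {
  snapshot(): Promise<void>;              // git stash checkpoint (see Safety Protocol)
  applyChange(h: string): Promise<void>;  // MODIFY: one minimal change
  measure(): Promise<number | null>;      // TEST: run measurement_cmd; null on failure
  keep(): Promise<void>;                  // KEEP: git stash drop
  restore(): Promise<void>;               // DISCARD: restore from the stash
}

const better = (r: number, best: number, d: Direction) =>
  d === "minimize" ? r < best : r > best;

const met = (r: number, e: Experiment) =>
  e.direction === "minimize" ? r <= e.target : r >= e.target;

async function runLoop(e: Experiment, queue: string[], h: Harness): Promise<number> {
  let best = e.baseline;
  for (let i = 0; i < Math.min(e.maxIterations, queue.length); i++) {
    await h.snapshot();                   // never skip the stash
    await h.applyChange(queue[i]);        // one hypothesis, one change
    const result = await h.measure();     // one measurement
    if (result !== null && better(result, best, e.direction)) {
      best = result;
      await h.keep();                     // one decision: KEEP
    } else {
      await h.restore();                  // worse or failed: DISCARD
    }
    if (met(best, e)) break;              // target met: exit early
  }
  return best;
}
```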
## Experiment Definition

Define an experiment in your task or in `thoughts/EXPERIMENTS.md`:
```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize        # minimize | maximize
  max_iterations: 10         # hard cap, never exceed
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"     # JSON key from bench output
  scope: "src/api/"          # files the loop is allowed to touch
```
## Key Fields

| Field | Description |
|---|---|
| `metric` | Human-readable name of what you are measuring |
| `baseline` | Measured value before any changes (run the measurement first to capture it) |
| `target` | Success condition -- the loop exits when it is met |
| `direction` | `minimize` for latency/size, `maximize` for coverage/score |
| `max_iterations` | Safety cap; the default and the absolute maximum are both 10 |
| `measurement_cmd` | Shell command that produces JSON containing the metric value |
| `scope` | Directories/files the loop is allowed to modify |
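A sketch of how a harness might execute `measurement_cmd` and pull the value out by `measurement_key`, assuming the command prints one JSON object to stdout (dotted paths such as `coverageMap.total.lines.pct` are supported):

```typescript
import { execSync } from "node:child_process";

// Run measurement_cmd and extract measurement_key from its JSON stdout.
// Returns null on any failure so the caller can treat the iteration as DISCARD.
function runMeasurement(cmd: string, key: string): number | null {
  try {
    const out = execSync(cmd, { encoding: "utf8", timeout: 5 * 60_000 });
    const json = JSON.parse(out);
    const value = key.split(".").reduce<any>((obj, k) => obj?.[k], json);
    return typeof value === "number" ? value : null;
  } catch {
    return null; // command failed, timed out, or printed invalid JSON
  }
}
```

A command that redirects its JSON to a file (like the coverage example later in this document) would need a file read instead of stdout parsing.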
## Safety Protocol

Wrap every experiment iteration in a stash checkpoint:

```bash
# Snapshot the current state. `stash push` moves the changes out of the
# working tree, so apply them straight back; the stash entry remains as
# the snapshot. (Assumes uncommitted changes exist; on a clean tree the
# push saves nothing and the checkpoint can be skipped.)
git stash push -u -m "experiment-loop: iteration N baseline"
git stash apply --index

# Run experiment
# ... apply hypothesis change ...
# ... run measurement ...

# Decision
if [ "$better" = true ]; then
  git stash drop                        # KEEP: keep changes, discard snapshot
else
  git checkout -- . && git clean -fd    # drop the experimental edits first
  git stash pop --index                 # DISCARD: restore exactly: staged + unstaged
fi
```
Never skip the stash. Never accumulate multiple iterations without a decision checkpoint. If the measurement command fails or times out, treat it as DISCARD.
## Agent Integration

The experiment loop coordinates three vibecosystem agents:
| Phase | Agent | Role |
|---|---|---|
| Hypothesize | `profiler` | Identify bottlenecks, suggest what to change |
| Modify | `spark` | Apply the focused code change |
| Test + Evaluate | `verifier` / `tdd-guide` | Run benchmarks, tests, and evals; parse results |

Spawn `profiler` once at the start to get the initial hypothesis queue, then run `spark` + `verifier` in a tight loop per iteration.
## Example Experiments

### Bundle Size Reduction
```yaml
experiment:
  name: "optimize-bundle-size"
  metric: "gzipped bundle size (KB)"
  baseline: 420
  target: 300
  direction: minimize
  max_iterations: 10
  measurement_cmd: "npm run build && node scripts/measure-bundle.js"
  measurement_key: "gzipped_kb"
  scope: "src/"
```
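The `measurement_cmd` above references a `scripts/measure-bundle.js` helper that this skill does not ship. A minimal sketch of what such a script could do (the `dist/bundle.js` path is an assumption):

```typescript
import { readFileSync } from "node:fs";
import { gzipSync } from "node:zlib";

// Gzip the built bundle and print JSON carrying the configured
// measurement_key ("gzipped_kb"). The bundle path is project-specific.
const bundle = readFileSync("dist/bundle.js");
const gzippedKb = gzipSync(bundle).length / 1024;

console.log(JSON.stringify({ gzipped_kb: Number(gzippedKb.toFixed(1)) }));
```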
Hypothesis queue to try in order:

- Add tree-shaking for unused lodash imports (use named imports; see the sketch after this list)
- Replace `moment` with `date-fns` (smaller footprint)
- Move large dependencies to dynamic `import()` at route boundaries
- Enable `usedExports: true` in webpack/rollup config
- Replace `axios` with a native `fetch` wrapper
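For the first hypothesis, the change is usually as small as swapping the import style so the bundler can tree-shake (illustrative only; `lodash-es` is the ESM build, or use per-path imports like `lodash/kebabCase` with CJS lodash):

```typescript
// Before: a whole-library import defeats tree-shaking
// import _ from "lodash"; const slug = _.kebabCase("Experiment Loop");

// After: a named import, so only the one function lands in the bundle
import { kebabCase } from "lodash-es";

const slug = kebabCase("Experiment Loop"); // "experiment-loop"
```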
### API Latency
```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize
  max_iterations: 8
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"
  scope: "src/api/"
```
Hypothesis queue:

- Add Redis cache for repeated DB reads (TTL 60s; sketched below)
- Replace N+1 queries with a single JOIN query
- Add connection pool sizing (`max: 20`)
- Move synchronous validation to async parallel (`Promise.all`)
- Add response compression (gzip middleware)
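As an illustration of the first hypothesis, a read-through cache wrapper might look like the sketch below (`ioredis` and the key naming are assumptions; the skill does not prescribe a client):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a reachable Redis instance

// Read-through cache: serve repeated DB reads from Redis, fall back to
// the real query on a miss, and store the result with a 60s TTL.
async function cachedQuery<T>(key: string, query: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;
  const fresh = await query();
  await redis.set(key, JSON.stringify(fresh), "EX", 60); // TTL 60s
  return fresh;
}
```

Wrapping an existing call is then a one-line change, e.g. `cachedQuery(`user:${id}`, () => db.getUser(id))` with a hypothetical `db.getUser`, which keeps the iteration minimal and easy to DISCARD.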
### Test Coverage
```yaml
experiment:
  name: "improve-test-coverage"
  metric: "line coverage (%)"
  baseline: 64
  target: 80
  direction: maximize
  max_iterations: 10
  measurement_cmd: "npm test -- --coverage --json > coverage.json"
  measurement_key: "coverageMap.total.lines.pct"
  scope: "src/"
```
### Prompt Engineering (LLM Eval)
```yaml
experiment:
  name: "improve-extraction-accuracy"
  metric: "extraction F1 score"
  baseline: 0.71
  target: 0.85
  direction: maximize
  max_iterations: 10
  measurement_cmd: "python eval/run_evals.py --output eval/results.json"
  measurement_key: "f1"
  scope: "prompts/"
```
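For reference, F1 is the harmonic mean of precision and recall; how `eval/run_evals.py` counts matches is outside this skill, but the score itself reduces to a few lines:

```typescript
// F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN).
function f1(truePositives: number, falsePositives: number, falseNegatives: number): number {
  if (truePositives === 0) return 0; // both precision and recall are 0
  const p = truePositives / (truePositives + falsePositives);
  const r = truePositives / (truePositives + falseNegatives);
  return (2 * p * r) / (p + r);
}
```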
## Results Log Format

Append each iteration result to `thoughts/EXPERIMENTS.md`:
```markdown
## Experiment: reduce-api-latency
Started: 2026-04-07T10:00:00Z
Baseline: 340ms | Target: 200ms | Direction: minimize

### Iteration 1
- Hypothesis: Add Redis cache for repeated DB reads
- Change: `src/api/users.ts` lines 45-67 -- wrap DB call with cache layer
- Result: 280ms (improvement: -60ms, -17.6%)
- Decision: KEEP
- Cumulative best: 280ms

### Iteration 2
- Hypothesis: Replace N+1 queries with JOIN
- Change: `src/api/users.ts` lines 89-102 -- rewrite fetchWithPosts()
- Result: 210ms (improvement: -70ms, -25%)
- Decision: KEEP
- Cumulative best: 210ms

### Iteration 3
- Hypothesis: Add connection pool sizing max: 20
- Change: `src/db/pool.ts` line 12 -- max: 10 -> 20
- Result: 215ms (regression: +5ms)
- Decision: DISCARD (restored via git stash pop)
- Cumulative best: 210ms

### Final Result
- Target: 200ms | Achieved: 210ms | Status: NEAR_MISS (within 5%)
- Iterations: 3 of 10 used
- Total improvement: -38% from baseline
```
## Iteration Limits and Exit Conditions

| Condition | Action |
|---|---|
| Target met | EXIT -- log SUCCESS, keep all accumulated changes |
| max_iterations reached | EXIT -- log PARTIAL, keep best achieved state |
| 3 consecutive DISCARDs | PAUSE -- re-run profiler for new hypothesis queue |
| Measurement command fails | DISCARD current iteration, continue loop |
| Git stash fails | STOP -- do not continue, report error |
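These conditions compose into a small state check, sketched below (names are illustrative; the 30-minute cap comes from Hard Limits):

```typescript
type Verdict = "CONTINUE" | "EXIT_SUCCESS" | "EXIT_PARTIAL" | "PAUSE_REPROFILE" | "STOP_ERROR";

interface LoopState {
  iteration: number;            // iterations completed so far
  consecutiveDiscards: number;  // reset to 0 on every KEEP
  startedAt: number;            // Date.now() at loop start
}

function checkExit(s: LoopState, targetMet: boolean, maxIterations: number, stashFailed: boolean): Verdict {
  if (stashFailed) return "STOP_ERROR";                              // never continue past a stash failure
  if (targetMet) return "EXIT_SUCCESS";                              // keep all accumulated changes
  if (s.iteration >= maxIterations) return "EXIT_PARTIAL";           // keep best achieved state
  if (s.consecutiveDiscards >= 3) return "PAUSE_REPROFILE";          // ask profiler for a new queue
  if (Date.now() - s.startedAt > 30 * 60_000) return "EXIT_PARTIAL"; // 30-minute wall-clock cap
  return "CONTINUE";
}
```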
## Running the Loop

Invoke this skill by describing the experiment:

```
Use experiment-loop to reduce the API p95 latency from 340ms to under 200ms.
Baseline measurement: npm run bench:api
Max iterations: 8
Scope: src/api/
```
The loop will:

- Read any existing `thoughts/EXPERIMENTS.md` for prior runs on the same metric
- Ask `profiler` for an ordered hypothesis queue
- Execute iterations with safety stashing
- Log each result immediately after measurement
- Report the final state with all changes that were kept
## Hard Limits

- Maximum 10 experiments per invocation (no exceptions)
- Scope must be specified -- loop will not touch files outside scope
- Measurement command must be deterministic (no unbounded network calls)
- Total wall-clock time cap: 30 minutes (prevents runaway loops)
- Never auto-merge to main -- changes stay on current branch