---
name: experiment-loop
description: "Autonomous experiment loop: hypothesize > modify > test > evaluate > keep/discard > repeat. Run N experiments automatically with measurable metrics. Works for performance optimization, A/B testing, prompt engineering, and any measurable improvement task."
---
# Experiment Loop
Autonomous, iterative improvement inspired by Karpathy's autoresearch methodology. Define a metric, set a target, and let the loop run until the target is met or the iteration limit is reached.
## The 5-Step Loop
```
1. HYPOTHESIZE -> Form a specific, falsifiable improvement hypothesis
2. MODIFY      -> Apply the minimal code/config/prompt change
3. TEST        -> Run the measurement suite (benchmarks, tests, evals)
4. EVALUATE    -> Compare result against baseline and previous best
5. DECIDE      -> KEEP if better, DISCARD (git stash pop --index) if worse
        |
        v
   Repeat until target met OR max_iterations reached
```
Each iteration is atomic: one hypothesis, one change, one measurement, one decision.
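The skeleton below sketches one atomic iteration as code. It is an illustration in TypeScript, not an API this skill ships: `Harness`, `runLoop`, and every helper on them are hypothetical names.

```typescript
type Direction = "minimize" | "maximize";

interface Experiment {
  baseline: number;
  target: number;
  direction: Direction;
  maxIterations: number;
}

// Hypothetical harness: each method wraps one phase of the loop.
interface Harness {
  snapshot(): Promise<void>;              // git stash checkpoint (see Safety Protocol)
  applyChange(h: string): Promise<void>;  // MODIFY: one minimal change
  measure(): Promise<number | null>;      // TEST: run measurement_cmd; null on failure
  keep(): Promise<void>;                  // KEEP: git stash drop
  restore(): Promise<void>;               // DISCARD: restore from the stash
}

const better = (r: number, best: number, d: Direction) =>
  d === "minimize" ? r < best : r > best;

const met = (r: number, e: Experiment) =>
  e.direction === "minimize" ? r <= e.target : r >= e.target;

async function runLoop(e: Experiment, queue: string[], h: Harness): Promise<number> {
  let best = e.baseline;
  for (let i = 0; i < Math.min(e.maxIterations, queue.length); i++) {
    await h.snapshot();                   // never skip the stash
    await h.applyChange(queue[i]);        // one hypothesis, one change
    const result = await h.measure();     // one measurement
    if (result !== null && better(result, best, e.direction)) {
      best = result;
      await h.keep();                     // one decision: KEEP
    } else {
      await h.restore();                  // worse or failed: DISCARD
    }
    if (met(best, e)) break;              // target met: exit early
  }
  return best;
}
```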
## Experiment Definition

Define an experiment in your task or in `thoughts/EXPERIMENTS.md`:
```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize        # minimize | maximize
  max_iterations: 10         # hard cap, never exceed
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"     # JSON key from bench output
  scope: "src/api/"          # files the loop is allowed to touch
```
## Key Fields

| Field | Description |
|---|---|
| `metric` | Human-readable name of what you are measuring |
| `baseline` | Measured value before any changes (run the measurement first to capture it) |
| `target` | Success condition -- the loop exits when it is met |
| `direction` | `minimize` for latency/size, `maximize` for coverage/score |
| `max_iterations` | Safety cap; the default and the absolute maximum are both 10 |
| `measurement_cmd` | Shell command that produces JSON containing the metric value |
| `scope` | Directories/files the loop is allowed to modify |
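A sketch of how a harness might execute `measurement_cmd` and pull the value out by `measurement_key`, assuming the command prints one JSON object to stdout (dotted paths such as `coverageMap.total.lines.pct` are supported):

```typescript
import { execSync } from "node:child_process";

// Run measurement_cmd and extract measurement_key from its JSON stdout.
// Returns null on any failure so the caller can treat the iteration as DISCARD.
function runMeasurement(cmd: string, key: string): number | null {
  try {
    const out = execSync(cmd, { encoding: "utf8", timeout: 5 * 60_000 });
    const json = JSON.parse(out);
    const value = key.split(".").reduce<any>((obj, k) => obj?.[k], json);
    return typeof value === "number" ? value : null;
  } catch {
    return null; // command failed, timed out, or printed invalid JSON
  }
}
```

A command that redirects its JSON to a file (like the coverage example later in this document) would need a file read instead of stdout parsing.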
## Safety Protocol

Wrap every experiment iteration in a stash checkpoint:

```bash
# Snapshot the current state. `stash push` moves the changes out of the
# working tree, so apply them straight back; the stash entry remains as
# the snapshot. (Assumes uncommitted changes exist; on a clean tree the
# push saves nothing and the checkpoint can be skipped.)
git stash push -u -m "experiment-loop: iteration N baseline"
git stash apply --index

# Run experiment
# ... apply hypothesis change ...
# ... run measurement ...

# Decision
if [ "$better" = true ]; then
  git stash drop                        # KEEP: keep changes, discard snapshot
else
  git checkout -- . && git clean -fd    # drop the experimental edits first
  git stash pop --index                 # DISCARD: restore exactly: staged + unstaged
fi
```
Never skip the stash. Never accumulate multiple iterations without a decision checkpoint. If the measurement command fails or times out, treat it as DISCARD.
## Agent Integration

The experiment loop coordinates three vibecosystem agents:
| Phase | Agent | Role |
|---|---|---|
| Hypothesize | `profiler` | Identify bottlenecks, suggest what to change |
| Modify | `spark` | Apply the focused code change |
| Test + Evaluate | `verifier` / `tdd-guide` | Run benchmarks, tests, and evals; parse results |

Spawn `profiler` once at the start to get the initial hypothesis queue, then run `spark` + `verifier` in a tight loop per iteration.
## Example Experiments

### Bundle Size Reduction
```yaml
experiment:
  name: "optimize-bundle-size"
  metric: "gzipped bundle size (KB)"
  baseline: 420
  target: 300
  direction: minimize
  max_iterations: 10
  measurement_cmd: "npm run build && node scripts/measure-bundle.js"
  measurement_key: "gzipped_kb"
  scope: "src/"
```
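The `measurement_cmd` above references a `scripts/measure-bundle.js` helper that this skill does not ship. A minimal sketch of what such a script could do (the `dist/bundle.js` path is an assumption):

```typescript
import { readFileSync } from "node:fs";
import { gzipSync } from "node:zlib";

// Gzip the built bundle and print JSON carrying the configured
// measurement_key ("gzipped_kb"). The bundle path is project-specific.
const bundle = readFileSync("dist/bundle.js");
const gzippedKb = gzipSync(bundle).length / 1024;

console.log(JSON.stringify({ gzipped_kb: Number(gzippedKb.toFixed(1)) }));
```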
Hypothesis queue to try in order:

- Add tree-shaking for unused lodash imports (use named imports; see the sketch after this list)
- Replace `moment` with `date-fns` (smaller footprint)
- Move large dependencies to dynamic `import()` at route boundaries
- Enable `usedExports: true` in webpack/rollup config
- Replace `axios` with a native `fetch` wrapper
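For the first hypothesis, the change is usually as small as swapping the import style so the bundler can tree-shake (illustrative only; `lodash-es` is the ESM build, or use per-path imports like `lodash/kebabCase` with CJS lodash):

```typescript
// Before: a whole-library import defeats tree-shaking
// import _ from "lodash"; const slug = _.kebabCase("Experiment Loop");

// After: a named import, so only the one function lands in the bundle
import { kebabCase } from "lodash-es";

const slug = kebabCase("Experiment Loop"); // "experiment-loop"
```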
### API Latency
```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize
  max_iterations: 8
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"
  scope: "src/api/"
```
Hypothesis queue:

- Add Redis cache for repeated DB reads (TTL 60s; sketched below)
- Replace N+1 queries with a single JOIN query
- Add connection pool sizing (`max: 20`)
- Move synchronous validation to async parallel (`Promise.all`)
- Add response compression (gzip middleware)
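As an illustration of the first hypothesis, a read-through cache wrapper might look like the sketch below (`ioredis` and the key naming are assumptions; the skill does not prescribe a client):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a reachable Redis instance

// Read-through cache: serve repeated DB reads from Redis, fall back to
// the real query on a miss, and store the result with a 60s TTL.
async function cachedQuery<T>(key: string, query: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;
  const fresh = await query();
  await redis.set(key, JSON.stringify(fresh), "EX", 60); // TTL 60s
  return fresh;
}
```

Wrapping an existing call is then a one-line change, e.g. `cachedQuery(`user:${id}`, () => db.getUser(id))` with a hypothetical `db.getUser`, which keeps the iteration minimal and easy to DISCARD.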
### Test Coverage
```yaml
experiment:
  name: "improve-test-coverage"
  metric: "line coverage (%)"
  baseline: 64
  target: 80
  direction: maximize
  max_iterations: 10
  measurement_cmd: "npm test -- --coverage --json > coverage.json"
  measurement_key: "coverageMap.total.lines.pct"
  scope: "src/"
```
### Prompt Engineering (LLM Eval)
```yaml
experiment:
  name: "improve-extraction-accuracy"
  metric: "extraction F1 score"
  baseline: 0.71
  target: 0.85
  direction: maximize
  max_iterations: 10
  measurement_cmd: "python eval/run_evals.py --output eval/results.json"
  measurement_key: "f1"
  scope: "prompts/"
```
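For reference, F1 is the harmonic mean of precision and recall; how `eval/run_evals.py` counts matches is outside this skill, but the score itself reduces to a few lines:

```typescript
// F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN).
function f1(truePositives: number, falsePositives: number, falseNegatives: number): number {
  if (truePositives === 0) return 0; // both precision and recall are 0
  const p = truePositives / (truePositives + falsePositives);
  const r = truePositives / (truePositives + falseNegatives);
  return (2 * p * r) / (p + r);
}
```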
## Results Log Format

Append each iteration result to `thoughts/EXPERIMENTS.md`:
```markdown
## Experiment: reduce-api-latency
Started: 2026-04-07T10:00:00Z
Baseline: 340ms | Target: 200ms | Direction: minimize

### Iteration 1
- Hypothesis: Add Redis cache for repeated DB reads
- Change: `src/api/users.ts` lines 45-67 -- wrap DB call with cache layer
- Result: 280ms (improvement: -60ms, -17.6%)
- Decision: KEEP
- Cumulative best: 280ms

### Iteration 2
- Hypothesis: Replace N+1 queries with JOIN
- Change: `src/api/users.ts` lines 89-102 -- rewrite fetchWithPosts()
- Result: 210ms (improvement: -70ms, -25%)
- Decision: KEEP
- Cumulative best: 210ms

### Iteration 3
- Hypothesis: Add connection pool sizing max: 20
- Change: `src/db/pool.ts` line 12 -- max: 10 -> 20
- Result: 215ms (regression: +5ms)
- Decision: DISCARD (restored via git stash pop)
- Cumulative best: 210ms

### Final Result
- Target: 200ms | Achieved: 210ms | Status: NEAR_MISS (within 5%)
- Iterations: 3 of 10 used
- Total improvement: -38% from baseline
```
## Iteration Limits and Exit Conditions

| Condition | Action |
|---|---|
| Target met | EXIT -- log SUCCESS, keep all accumulated changes |
| max_iterations reached | EXIT -- log PARTIAL, keep best achieved state |
| 3 consecutive DISCARDs | PAUSE -- re-run profiler for new hypothesis queue |
| Measurement command fails | DISCARD current iteration, continue loop |
| Git stash fails | STOP -- do not continue, report error |
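These conditions compose into a small state check, sketched below (names are illustrative; the 30-minute cap comes from Hard Limits):

```typescript
type Verdict = "CONTINUE" | "EXIT_SUCCESS" | "EXIT_PARTIAL" | "PAUSE_REPROFILE" | "STOP_ERROR";

interface LoopState {
  iteration: number;            // iterations completed so far
  consecutiveDiscards: number;  // reset to 0 on every KEEP
  startedAt: number;            // Date.now() at loop start
}

function checkExit(s: LoopState, targetMet: boolean, maxIterations: number, stashFailed: boolean): Verdict {
  if (stashFailed) return "STOP_ERROR";                              // never continue past a stash failure
  if (targetMet) return "EXIT_SUCCESS";                              // keep all accumulated changes
  if (s.iteration >= maxIterations) return "EXIT_PARTIAL";           // keep best achieved state
  if (s.consecutiveDiscards >= 3) return "PAUSE_REPROFILE";          // ask profiler for a new queue
  if (Date.now() - s.startedAt > 30 * 60_000) return "EXIT_PARTIAL"; // 30-minute wall-clock cap
  return "CONTINUE";
}
```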
## Running the Loop

Invoke this skill by describing the experiment:

```
Use experiment-loop to reduce the API p95 latency from 340ms to under 200ms.
Baseline measurement: npm run bench:api
Max iterations: 8
Scope: src/api/
```
The loop will:

- Read any existing `thoughts/EXPERIMENTS.md` for prior runs on the same metric
- Ask `profiler` for an ordered hypothesis queue
- Execute iterations with safety stashing
- Log each result immediately after measurement
- Report the final state with all changes that were kept
## Hard Limits

- Maximum 10 experiments per invocation (no exceptions)
- Scope must be specified -- loop will not touch files outside scope
- Measurement command must be deterministic (no unbounded network calls)
- Total wall-clock time cap: 30 minutes (prevents runaway loops)
- Never auto-merge to main -- changes stay on current branch