name: red-queen
description: "Adversarial evolutionary QA using Digital Red Queen. Code and tests coevolve via deterministic state machine execution."
The Red Queen: Deterministic Adversarial Evolution
Liza Path
The deterministic state machine lives at:
~/.claude/skills/red-queen/liza-advanced.nu
All commands use this absolute path:
L="$HOME/.claude/skills/red-queen/liza-advanced.nu"
nu $L init
nu $L task-add drq-session --spec_ref README.md
# etc.
Always set L at the start of every session.
"It takes all the running you can do, to keep in the same place." — The Red Queen
Design Principle: Deterministic Over AI
The evolutionary loop has two kinds of operations:
| Operation | Who Does It | Why |
|---|---|---|
| Generate test commands | AI (creative) | Requires understanding of product semantics |
| Execute tests | Shell (deterministic) | run-shell-cmd — exit code is ground truth |
| Select survivors | Liza (deterministic) | exit_code != expect_exit → survivor. No AI judgment. |
| Lock regressions | Liza (deterministic) | done_when append — permanent, computed, no AI |
| Gate transitions | Liza (deterministic) | assert-can-submit, assert-can-review, assert-can-merge |
| Track generations | Liza gen-start (deterministic) | Increments generation counter on blackboard |
| Record survivor | Liza gen-survivor (deterministic) | Locks regression + updates landscape atomically |
| Record discard | Liza gen-discard (deterministic) | Updates landscape tests_run without adding regression |
| Regression check | Liza validate (deterministic) | Runs ALL done_when checks, all must pass |
| Landscape scoring | Liza landscape (deterministic) | Computes fitness, crown status from blackboard fields |
| Mutation testing | Liza mutate (deterministic) | Runs cargo-mutants, missed mutants → survivors |
| Spec mining | Liza spec-mine (deterministic) | Extracts testable promises from README, CLI, doctests |
| Quality gates | Liza quality-gate (deterministic) | FP gates + DRY + coverage checks, failures → survivors |
| Fowler review | Liza fowler-review (deterministic) | AST analysis + test smells + security, violations → survivors |
AI generates test commands. Everything else is deterministic code.
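The selection rule in the table (exit_code != expect_exit → survivor) can be sketched in a few lines of shell. This is only an illustration of the comparison, not how liza-advanced.nu implements it; the function name classify and the use of a child sh -c are assumptions made for this example.

```shell
# classify CMD EXPECT_EXIT
# Runs CMD in a child shell and prints "survivor" when the observed exit
# code differs from EXPECT_EXIT, otherwise "discard".
# (Hypothetical helper for illustration; not part of liza.)
classify() {
    local cmd="$1"
    local expect_exit="$2"
    local actual_exit=0
    # Capture the exit code without aborting the caller on failure.
    sh -c "$cmd" >/dev/null 2>&1 || actual_exit=$?
    if [ "$actual_exit" -ne "$expect_exit" ]; then
        echo "survivor"
    else
        echo "discard"
    fi
}

classify "true" 0      # prints: discard
classify "exit 2" 0    # prints: survivor
classify "exit 2" 2    # prints: discard
```

Nothing in the function inspects stdout or stderr, which is the point: the outcome depends on the exit code alone.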
The Algorithm
DRQ_DETERMINISTIC(target_binary, liza_script):
    # PHASE 0: Initialize deterministic state
    nu $L init
    nu $L task-add drq-session --spec_ref README.md
    # AI: Read docs, generate initial done_when checks (the contract)
    for each promise discovered:
        nu $L task-add-check drq-session "<verification_cmd>" --expect_exit=0
    # Claim the session
    nu $L claim drq-session red-queen
    generation = 0
    landscape = {}  # dimension → {tests_run: 0, survivors: 0}
    loop:
        # INCREMENT: Deterministic
        nu $L gen-start drq-session
        # EVOLVE: AI generates test commands for this generation
        # This is the ONLY step that uses AI creativity
        challengers = AI_generate_test_commands(landscape, generation)
        # EXECUTE + SELECT: Deterministic. Run each, check exit code
        for cmd, dimension in challengers:
            result = shell(cmd)
            if result.exit_code != 0:  # BUG FOUND
                # Lock regression + update landscape atomically (DETERMINISTIC)
                nu $L gen-survivor drq-session "<dimension>" "<cmd>" --severity <SEVERITY>
                # File bead IMMEDIATELY, not at session end
                br create --title "[Red Queen] <SEVERITY>: <finding>" --type=bug --priority=<N>
            else:
                # No bug: update landscape tests_run only (DETERMINISTIC)
                nu $L gen-discard drq-session "<dimension>"
        # ADAPT: Deterministic, computed by liza from blackboard fields
        nu $L landscape drq-session
        # REGRESSION: Deterministic. Liza runs ALL done_when checks
        nu $L coder-submit drq-session red-queen
        nu $L validate drq-session
        # LINEAGE REPLAY: Every predecessor must be defeated
        nu $L lineage-replay drq-session
        # CARNAGE: Track kill rate and lethality
        nu $L carnage drq-session
        # AUTOMATED WEAPONS: Run deterministic analysis tools
        nu $L spec-mine drq-session $TARGET_DIR --bin $BIN
        nu $L quality-gate drq-session $TARGET_DIR
        nu $L fowler-review drq-session $TARGET_DIR
        nu $L mutate drq-session $TARGET_DIR  # Rust only
        # ESCALATION: Auto-promote severity for bleeding dimensions
        nu $L escalate drq-session
        # EQUILIBRIUM: Requires 3 consecutive zero-survivor gens + all dims exhausted
        # Dormant dimensions reawaken every 5 gens; the Queen never rests
    # VERDICT: Computed from blackboard state
    nu $L landscape drq-session
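The consecutive-generation half of the equilibrium condition can be sketched as below. The function name at_equilibrium and the idea of passing survivor counts as arguments are assumptions for illustration; the second half of the condition (all dimensions exhausted) and the 5-generation reawakening are not modeled here.

```shell
# at_equilibrium COUNT...
# Arguments are survivor counts per generation, oldest first.
# Returns 0 when the most recent three generations all had zero survivors.
# (Hypothetical helper; liza derives this from blackboard state.)
at_equilibrium() {
    local streak=0
    local count
    for count in "$@"; do
        if [ "$count" -eq 0 ]; then
            streak=$((streak + 1))
        else
            streak=0   # any survivor resets the streak
        fi
    done
    [ "$streak" -ge 3 ]
}

if at_equilibrium 4 2 0 0 0; then echo "equilibrium"; else echo "keep evolving"; fi
if at_equilibrium 4 0 0 1 0 0; then echo "equilibrium"; else echo "keep evolving"; fi
```

The first call prints equilibrium and the second prints keep evolving, because the single survivor in the fourth generation resets the streak to zero.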
What Liza Controls (Deterministic)
Every gate, transition, and regression check is enforced by liza-advanced.nu:
State Machine
UNCLAIMED → claim → IN_PROGRESS → coder-submit → READY_FOR_REVIEW
→ validate (runs ALL done_when) → approve/reject
→ APPROVED → merge → MERGED
Each transition has a deterministic gate:
| Gate | What It Checks | Code |
|---|---|---|
| assert-can-submit | Status is IN_PROGRESS, agent_id set | Lines 132-139 |
| assert-can-review | Status is READY_FOR_REVIEW, commit SHA present, validation results exist | Lines 142-156 |
| assert-can-merge | Status is APPROVED, review decision is APPROVED | Lines 159-167 |
| assert-no-test-weakening | Changed files don't touch tests/ unless explicitly allowed | Lines 170-177 |
No AI judgment in any gate. Pure field checks.
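As a rough illustration of what "pure field checks" means, the documented transitions can be written as a lookup on (state, action). This sketch covers only the arrows shown in the diagram; the function name next_state is hypothetical, the target of reject is left out because the diagram does not state it, and the real gates also check fields such as agent_id and commit SHA that are not modeled here.

```shell
# next_state STATE ACTION
# Prints the successor state for a documented transition.
# Returns 1 and prints nothing for any other combination.
# (Illustrative only; the real gates live in liza-advanced.nu.)
next_state() {
    case "$1:$2" in
        UNCLAIMED:claim)           echo "IN_PROGRESS" ;;
        IN_PROGRESS:coder-submit)  echo "READY_FOR_REVIEW" ;;
        READY_FOR_REVIEW:approve)  echo "APPROVED" ;;
        APPROVED:merge)            echo "MERGED" ;;
        *)                         return 1 ;;
    esac
}

next_state UNCLAIMED claim                            # prints: IN_PROGRESS
next_state UNCLAIMED merge || echo "transition refused"
```

An illegal request such as merging an unclaimed task is refused by the table itself, with no judgment call involved.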
Validation (The Ratchet)
cmd-supervisor-validate (lines 356-414) is the ratchet mechanism:
for each done_when check:
    result = run-shell-cmd(check.cmd)
    pass = (result.exit_code == check.expect_exit)
    # DETERMINISTIC: exit code comparison, nothing else
all_ok = ALL checks pass
# If ANY check fails → validation fails → no approve possible
This is the core evolutionary mechanism: every test that ever broke the code becomes a permanent done_when entry. The validation command runs ALL of them. The code must pass ALL of them. There is no AI involved in this check — it's pure exit code comparison.
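A minimal shell sketch of the ratchet follows. It assumes a made-up storage format (one check per line, expected exit code and command separated by a tab) purely so the example is self-contained; liza keeps done_when in its blackboard YAML, and validate_all is a hypothetical name.

```shell
# validate_all FILE
# FILE holds one check per line: "<expect_exit><TAB><cmd>".
# Runs every check (it does not stop at the first failure) and returns 0
# only when each command produced its expected exit code.
validate_all() {
    local file="$1"
    local all_ok=0
    local expect_exit cmd actual_exit
    while IFS="$(printf '\t')" read -r expect_exit cmd; do
        [ -n "$cmd" ] || continue
        actual_exit=0
        # </dev/null keeps the check from consuming the rest of FILE.
        sh -c "$cmd" </dev/null >/dev/null 2>&1 || actual_exit=$?
        if [ "$actual_exit" -ne "$expect_exit" ]; then
            echo "FAIL (exit $actual_exit, expected $expect_exit): $cmd"
            all_ok=1
        fi
    done < "$file"
    return "$all_ok"
}
```

Because the file only ever grows, a change that passed yesterday's checks can still fail today's, which is the ratchet effect described above.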
Regression Locking
cmd-regress (lines 522-555) adds tests to the permanent bank:
# Without --force: deterministically verifies the test fails on champion, passes on candidate
champion_test = run-shell-cmd(cmd) # must FAIL on current code
candidate_test = run-shell-cmd(cmd) # must PASS after fix
# Only then: append to done_when
This prevents false positives from entering the lineage. Deterministic: two shell commands, two exit code checks.
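The two-sided check can be sketched as follows. To keep the example self-contained it models "champion" and "candidate" as two directories and runs the same command in each; that layout, the bank file format, and the name lock_regression are all assumptions of this sketch, not a description of cmd-regress.

```shell
# lock_regression CMD CHAMPION_DIR CANDIDATE_DIR BANK_FILE
# Appends CMD to BANK_FILE only if it fails in CHAMPION_DIR (the bug is
# reproduced) and passes in CANDIDATE_DIR (the fix works).
lock_regression() {
    local cmd="$1" champion="$2" candidate="$3" bank="$4"
    if (cd "$champion" && sh -c "$cmd") >/dev/null 2>&1; then
        echo "rejected: test passes on champion" >&2
        return 1
    fi
    if ! (cd "$candidate" && sh -c "$cmd") >/dev/null 2>&1; then
        echo "rejected: test fails on candidate" >&2
        return 1
    fi
    # Both exit code checks held, so the test may join the bank.
    printf '0\t%s\n' "$cmd" >> "$bank"
}
```

A test that already passes on the champion proves nothing about the bug, and a test that still fails on the candidate proves nothing about the fix, so either case is rejected before the bank is touched.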
Execution Protocol
Phase 0: Probe (AI + Deterministic)
AI does: Read README, --help, source code. Discover promises.
Deterministic does: Register each promise as a done_when check.
# Initialize state machine
nu $L init
nu $L task-add drq-session --spec_ref README.md
# AI discovers promises, then registers each deterministically:
nu $L task-add-check drq-session "factory help" --expect_exit=0
nu $L task-add-check drq-session "factory version" --expect_exit=0
nu $L task-add-check drq-session "factory new -s test-slug 2>/dev/null" --expect_exit=0
# ... one per promise
# Claim
nu $L claim drq-session red-queen
Generation N: Evolve → Execute → Select → Regress
AI does: Generate 3-10 test commands based on the landscape. Everything else is deterministic.
L="$HOME/.claude/skills/red-queen/liza-advanced.nu"
DIM="error-handling"
# Start generation
nu $L gen-start drq-session
# AI generates test commands (the creative part)
# Execute each, let exit code decide survivor vs discard
# Challenger 1
factory bogus 2>/dev/null
if [ $? -eq 0 ]; then
    # BUG: should have failed. Lock survivor + file bead
    nu $L gen-survivor drq-session "$DIM" "factory bogus 2>/dev/null; test \$? -ne 0" --severity CRITICAL
    br create --title "[Red Queen] CRITICAL: bogus command exits 0" --type=bug --priority=0
else
    nu $L gen-discard drq-session "$DIM"
fi
# Challenger 2
factory new 2>/dev/null
if [ $? -eq 0 ]; then
    nu $L gen-survivor drq-session "$DIM" "factory new 2>/dev/null; test \$? -ne 0" --severity MAJOR
    br create --title "[Red Queen] MAJOR: new without --slug exits 0" --type=bug --priority=1
else
    nu $L gen-discard drq-session "$DIM"
fi
# Show landscape (fitness computed from blackboard)
nu $L landscape drq-session
# REGRESS: Run full lineage (deterministic)
nu $L coder-submit drq-session red-queen
nu $L validate drq-session
Landscape Scoring (Deterministic)
After each generation, compute fitness scores from raw counts:
# Landscape is a simple ratio: survivors / tests_run per dimension
# No AI judgment — pure arithmetic
# Example after generation 3:
# error-handling: 5 survivors / 8 tests = 0.625 (high fitness — keep probing)
# setup: 1 survivor / 6 tests = 0.167 (low fitness — fewer tests next gen)
# edge-cases: 0 survivors / 4 tests = 0.000 (exhausted if 0 for 2 gens)
# state-management: 3 survivors / 3 tests = 1.000 (everything breaks — maximum pressure)
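The ratio in the example comments can be reproduced with a one-line awk call. This is a sketch of the arithmetic only; the function name fitness is hypothetical, and printing 0.000 when no tests have run yet is an assumption made here to avoid dividing by zero.

```shell
# fitness SURVIVORS TESTS_RUN
# Prints survivors / tests_run to three decimals.
# Assumption: a dimension with zero tests run scores 0.000.
fitness() {
    awk -v s="$1" -v t="$2" 'BEGIN {
        if (t == 0) { printf "0.000\n"; exit }
        printf "%.3f\n", s / t
    }'
}

fitness 5 8   # prints: 0.625
fitness 1 6   # prints: 0.167
fitness 3 3   # prints: 1.000
```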
Allocation rule (deterministic, escalating):
- Dimensions with fitness > 0.7: allocate 6+ challengers (HEMORRHAGING)
- Dimensions with fitness > 0.5: allocate 5 challengers (HIGH PRESSURE)
- Dimensions with fitness > 0.3: allocate 4 challengers (CONTESTED)
- Dimensions with fitness > 0.1: allocate 3 challengers (PROBING)
- Dimensions with fitness 0: allocate 2 challengers (COOLING — double-tap, never single)
- Dimensions exhausted for 3+ gens: 0 challengers (but reawaken every 5 gens)
- ALL allocations multiplied by escalation factor: 1.0x + 0.5x per 2 generations
- Equilibrium requires 3 consecutive zero-survivor generations (not 2)
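One way to read the allocation rule as code is sketched below. Several details are assumptions that the list leaves open and are marked as such in the comments: fitness values between 0 and 0.1 are treated like cooling dimensions, "6+" is taken as a base of 6, the escalation factor steps up once every second generation, and fractional results are rounded up. Exhausted dimensions and the 5-generation reawakening are not modeled. The name allocate is hypothetical.

```shell
# allocate FITNESS GENERATION
# Prints the challenger count for one dimension.
allocate() {
    awk -v f="$1" -v g="$2" 'BEGIN {
        if      (f > 0.7) base = 6   # HEMORRHAGING (assumption: "6+" means base 6)
        else if (f > 0.5) base = 5   # HIGH PRESSURE
        else if (f > 0.3) base = 4   # CONTESTED
        else if (f > 0.1) base = 3   # PROBING
        else              base = 2   # COOLING (assumption: also covers 0 < f <= 0.1)
        factor = 1.0 + 0.5 * int(g / 2)   # assumption: +0.5 every 2nd generation
        n = base * factor
        if (n > int(n)) n = int(n) + 1    # assumption: round fractional counts up
        print n
    }'
}

allocate 0.625 0   # prints: 5
allocate 0.625 4   # prints: 10
allocate 0.167 2   # prints: 5
```

In the last call the base of 3 is multiplied by 1.5 to give 4.5, which this sketch rounds up to 5.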
Coevolution (Fix → Regress → Repeat)
When the DRQ loop is fixing code (not just reporting):
# 1. AI fixes the code (creative)
# 2. Liza validates ALL done_when (deterministic)
nu $L coder-submit drq-session red-queen
nu $L validate drq-session
# If validate fails:
# → The failing check is ALREADY in done_when (it was there before)
# → The fix introduced a regression
# → Fix the regression, re-validate
# → Repeat until validate passes
# If validate passes:
# → ALL historical tests pass
# → Proceed to next generation
The ratchet is entirely in liza's done_when list and validate command. No AI decides if something regressed — the exit code does.
Automated Weapons
Four deterministic analysis commands run real tools (not AI heuristics) and record their results as survivors or discards in the landscape. Each requires an active generation (run gen-start first), except spec-mine, which uses task-add-check directly.
mutate — Rust Mutation Testing
Runs cargo-mutants and records uncaught mutants as gen-survivors.
nu $L mutate drq-session /path/to/rust/project --file src/lib.rs --function parse --timeout 300
- Requires: cargo-mutants installed, active generation
- MissedMutant → gen-survivor (dimension: mutation), auto-escalates to CRITICAL for pub API (lib.rs, mod.rs)
- CaughtMutant → gen-discard
- done_when: cd $dir && cargo mutants -f <file> --re <fn>, expect_exit=0
- Gracefully handles missing cargo-mutants with a warning
spec-mine — Language-Agnostic Promise Extraction
Mines any project for testable promises. Does NOT require an active generation — uses task-add-check directly to lock permanent checks.
nu $L spec-mine drq-session /path/to/project --bin myapp --readme /path/to/README.md
What it mines:
| Source | Method | Dimension |
|---|---|---|
| README.md | Extract fenced bash/shell/sh/console blocks | spec-readme |
| CLI --help | Run $bin --help, parse subcommands, add $bin $subcmd --help checks | spec-help |
| Rust doctests | Grep /// # Examples → cargo test --doc | spec-doctest |
| Python doctests | Grep >>> → python -m doctest <file> | spec-doctest |
| Debt markers | Grep TODO/FIXME/HACK/XXX → observation check | spec-debt |
| Rust type safety | cargo clippy -- -D clippy::unwrap_used | spec-type-safety |
- Language detection: Cargo.toml=rust, gleam.toml=gleam, pyproject.toml=python, package.json=node
- All checks are permanent ratchet entries
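The manifest-based language detection listed above could look roughly like this. The mapping is taken from the list; the order in which manifests are checked when several exist, the "unknown" fallback, and the name detect_language are assumptions of this sketch.

```shell
# detect_language DIR
# Prints rust, gleam, python, or node depending on which manifest exists.
# Assumption: manifests are checked in the order listed, first match wins.
detect_language() {
    local dir="$1"
    if   [ -f "$dir/Cargo.toml" ];     then echo "rust"
    elif [ -f "$dir/gleam.toml" ];     then echo "gleam"
    elif [ -f "$dir/pyproject.toml" ]; then echo "python"
    elif [ -f "$dir/package.json" ];   then echo "node"
    else
        echo "unknown"
        return 1
    fi
}
```

A caller can branch on the printed value to decide which checks apply, for example running the Rust doctest and clippy checks only when the result is rust.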
quality-gate — Deterministic Code Quality Checks (from tdd15)
Runs a battery of deterministic quality checks. Each failing check → gen-survivor.
nu $L quality-gate drq-session /path/to/project
FP Gates (Rust):
| Gate | Command | Dimension |
|---|---|---|
| No Panic | cargo clippy -- -D clippy::unwrap_used -D clippy::expect_used -D clippy::panic | fp-gate-no-panic |
| Exhaustive Match | cargo clippy -- -D clippy::wildcard_enum_match_arm | fp-gate-exhaustive |
| Format | cargo fmt --check | fp-gate-format |
| Lint | cargo clippy -- -D warnings | fp-gate-lint |
| Tests | cargo test | fp-gate-tests |
| Coverage | cargo tarpaulin --skip-clean --out json (< 80% → survivor) | fp-gate-coverage |
FP Gates (Gleam): gleam format --check, gleam build, gleam test
DRY Check (Rust): cargo clippy -- -D clippy::redundant_clone -D clippy::manual_map -D clippy::unnecessary_wraps → dimension quality-dry
Test Quality: tokei JSON → test-to-code ratio < 0.5 → survivor (dimension: quality-test-coverage)
fowler-review — Martin Fowler Code + Test Quality Review
Tool-based (not grep heuristic) review of source code AND tests. Each violation → gen-survivor.
nu $L fowler-review drq-session /path/to/project --complexity-threshold 15 --fn-length-threshold 50 --file-length-threshold 250 --nesting-threshold 4 --coverage-threshold 80.0
4a. Structural Analysis — rust-code-analysis-cli (Mozilla tree-sitter):
| Metric | Threshold | Dimension |
|---|---|---|
| Cyclomatic complexity | > 15 | fowler-complexity |
| Function length (SLOC) | > 50 | fowler-large-fn |
| Nesting depth | > 4 | fowler-deep-nesting |
| Cognitive complexity | clippy flag | fowler-cognitive |
4b. AST Pattern Matching — ast-grep (tree-sitter queries):
| Pattern | Dimension |
|---|---|
| .unwrap() | fowler-unwrap |
| .expect($MSG) | fowler-expect |
| todo!() / unimplemented!() | fowler-todo |
4c. Clippy Extended:
| Check | Dimension |
|---|---|
| Dead code / unused imports | fowler-dead-code |
| DRY violations | fowler-dry |
| Error handling (unwrap/expect) | fowler-error-handling |
| Wildcard enum matches | fowler-exhaustive |
4d. Test Code Review:
| Smell | Method | Dimension |
|---|---|---|
| No assertions | rg -c 'assert' <file> = 0 for test fn | fowler-test-no-assert |
| Test-to-code ratio | tokei JSON, < 0.5 → survivor | fowler-test-ratio |
| Coverage | cargo llvm-cov --fail-under-lines $threshold | fowler-test-coverage |
| Happy path only | < 30% of tests cover error paths | fowler-test-happy-only |
| Flaky indicators | sleep calls in tests | fowler-test-flaky |
| Test isolation | static mut / lazy_static / Mutex in tests | fowler-test-isolation |
4e. Security & Supply Chain:
| Check | Tool | Dimension |
|---|---|---|
| Unsafe code | cargo-geiger | fowler-unsafe |
| Security vulns | cargo-audit | fowler-security |
| Unused deps | cargo-udeps | fowler-unused-deps |
| License issues | cargo-deny | fowler-licenses |
4f. File Size & Documentation:
| Check | Method | Dimension |
|---|---|---|
| File > 250 lines | tokei per-file | fowler-file-size |
| Comment ratio < 5% | tokei totals | fowler-documentation |
Required tools (Rust): rust-code-analysis-cli, ast-grep/sg, cargo-llvm-cov, cargo-geiger, cargo-audit, cargo-udeps, cargo-deny, tokei, rg. Missing tools are skipped gracefully.
What AI Does vs What Code Does
| Step | AI | Deterministic Code |
|---|---|---|
| Discover promises | Reads docs, generates check commands | task-add-check stores them |
| Generate challengers | Creates test commands per dimension | — |
| Execute challengers | — | run-shell-cmd captures exit code |
| Classify survivor | — | exit_code != expect_exit → survivor |
| Lock regression | — | task-add-check appends to done_when |
| File bead | — | br create at same moment as task-add-check — never deferred |
| Score landscape | — | survivors / tests_run per dimension |
| Allocate next gen | Uses landscape scores to decide dimensions | Scores are computed, allocation follows rules |
| Gate transitions | — | assert-can-* functions |
| Validate full lineage | — | validate runs all done_when checks |
| Fix code | Writes code changes | — |
| Detect regression | — | validate fails → regression exists |
| Mine specs | — | spec-mine extracts promises from README, CLI, doctests |
| Quality gates | — | quality-gate runs FP/DRY/coverage checks |
| Code review | — | fowler-review runs AST analysis + test smells + security |
| Mutation testing | — | mutate runs cargo-mutants, missed → survivors |
| Track state | — | Blackboard YAML (atomic save) |
AI touches: test command generation, code fixes, promise discovery. Code handles: selection, regression, gates, state, validation, scoring.
Verdict Format
The verdict is computed from blackboard state, not AI narrative:
# Extract deterministic facts from blackboard
nu $L show --task=drq-session
# The verdict fields are all computable:
# - generations: count of generation loops executed
# - lineage_size: length of done_when list
# - survivors_by_dimension: group done_when entries by dimension tag
# - crown_status: if ALL done_when pass → DEFENDED
# if CRITICAL survivors exist → FORFEIT
# else → CONTESTED
THE RED QUEEN'S VERDICT
═══════════════════════════════════════════════════════════════
Champion: [product name]
Generations: [N]
Lineage: [M] survivors (done_when entries)
Final: CROWN DEFENDED | CROWN CONTESTED | CROWN FORFEIT
FITNESS LANDSCAPE (computed from test results)
═══════════════════════════════════════════════════════════════
Dimension Tests Survivors Fitness Status
───────────────────── ───── ───────── ─────── ──────────
[computed per dimension from raw counts]
PERMANENT LINEAGE (done_when entries)
═══════════════════════════════════════════════════════════════
[Each entry from done_when: cmd, expect_exit, generation added, dimension]
FULL VALIDATION
═══════════════════════════════════════════════════════════════
[Output of: nu $L validate drq-session]
All checks pass: YES/NO
Failed checks: [list]
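The crown_status rule listed among the verdict fields can be sketched as a three-way branch, evaluated in the order the rule is written. The inputs here (the exit code of validate and a count of CRITICAL survivors) and the name crown_status as a shell function are assumptions about how the two facts would be supplied.

```shell
# crown_status VALIDATE_EXIT CRITICAL_COUNT
# VALIDATE_EXIT is 0 when every done_when check passed.
# CRITICAL_COUNT is the number of CRITICAL survivors on record.
crown_status() {
    local validate_exit="$1" critical_count="$2"
    if [ "$validate_exit" -eq 0 ]; then
        echo "CROWN DEFENDED"
    elif [ "$critical_count" -gt 0 ]; then
        echo "CROWN FORFEIT"
    else
        echo "CROWN CONTESTED"
    fi
}

crown_status 0 0   # prints: CROWN DEFENDED
crown_status 1 2   # prints: CROWN FORFEIT
crown_status 1 0   # prints: CROWN CONTESTED
```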
Finding Report
Each finding maps to a done_when entry — the deterministic artifact:
[GEN-{gen}-{n}] {SEVERITY}: {title}
═══════════════════════════════════════════════
Generation: {N}
Dimension: {landscape dimension}
Command: {exact command — this IS the done_when entry}
Expected Exit: {expect_exit}
Actual Exit: {what was observed}
Stdout: {captured}
Stderr: {captured}
done_when entry: { cmd: "<cmd>", expect_exit: <N> }
Locked by: task-add-check (deterministic, permanent)
═══════════════════════════════════════════════
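A finding in roughly this shape can be produced mechanically from a single command run. The sketch below captures the exit code, stdout, and stderr and fills in the template fields; the function name report_finding and its argument order are hypothetical, and the decorative rule lines are omitted for brevity.

```shell
# report_finding GEN N SEVERITY TITLE DIMENSION EXPECT_EXIT CMD
# Runs CMD once and prints a finding report built from what was observed.
report_finding() {
    local gen="$1" n="$2" severity="$3" title="$4"
    local dimension="$5" expect_exit="$6" cmd="$7"
    local out err actual_exit=0
    out="$(mktemp)"
    err="$(mktemp)"
    sh -c "$cmd" >"$out" 2>"$err" || actual_exit=$?
    cat <<EOF
[GEN-$gen-$n] $severity: $title
Generation: $gen
Dimension: $dimension
Command: $cmd
Expected Exit: $expect_exit
Actual Exit: $actual_exit
Stdout: $(cat "$out")
Stderr: $(cat "$err")
done_when entry: { cmd: "$cmd", expect_exit: $expect_exit }
EOF
    rm -f "$out" "$err"
}
```

Every field comes from an argument or from the captured run, so two runs against the same binary yield the same report.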
Multi-Agent Orchestration
RED QUEEN (Orchestrator)
│
├─ SCOUT AGENT (Task: Explore)
│ └─ Phase 0: Read docs, generate initial done_when commands
│ └─ OUTPUT: List of "nu $L task-add-check" commands
│
├─ GENERATION AGENTS (Task: general-purpose)
│ └─ INPUT: Landscape scores, lineage (done_when list), generation N
│ └─ AI DOES: Generate test commands
│ └─ DETERMINISTIC: Execute, check exit codes, call task-add-check for survivors
│ └─ OUTPUT: Survivor list, updated landscape scores
│
├─ SOURCE AUDITOR (Task: code-reviewer)
│ └─ Identifies code patterns that inform landscape dimensions
│ └─ OUTPUT: Suggested dimensions to add to landscape
│
└─ ORCHESTRATOR computes:
└─ Landscape fitness (arithmetic)
└─ Equilibrium check (3 consecutive zero-survivor gens)
└─ Crown status (validate pass/fail)
└─ Verdict (from blackboard state)
Rules of Engagement
- Deterministic over AI: if it can be computed, compute it. AI generates test commands only.
- Exit codes are ground truth, not AI interpretation of output
- done_when is the lineage: every survivor is a permanent shell command with an expected exit code
- validate is the ratchet: runs ALL done_when, deterministic pass/fail
- Gates are code: assert-can-* functions, not AI judgment
- Landscape is arithmetic: survivors / tests_run, not AI scoring
- State is YAML: blackboard.yml, atomic save, auditable
- AI creativity is bounded: generate test commands, fix code, read docs. Nothing else.
- No AI decides if something passes; the shell exit code decides
- File beads at selection, not at session end: br create happens the same moment as task-add-check. Every survivor gets a bead immediately. Never defer.
- The Queen always returns: today's done_when is tomorrow's regression gate
- Escalating pressure: challenger counts increase with generation number (1.0x + 0.5x per 2 gens). The codebase faces ever-growing armies.
- Defeat ALL predecessors: use lineage-replay to verify current code defeats every warrior from every generation. New code must beat the entire lineage.
- Anti-stagnation: dormant dimensions reawaken every 5 generations. The Queen never truly rests.
- Severity escalation: 3+ consecutive survivors in a dimension auto-promote severity. Persistent wounds become critical.
- Double-tap cooling dimensions: never send just 1 challenger to a cooling dimension. Always 2+. Confirm the kill.
- Carnage tracking: monitor kill rate and lethality per dimension. The codebase must constantly defend itself.
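The severity escalation rule can be pictured as a small promotion ladder. The text above only says that 3+ consecutive survivors auto-promote severity; the one-step ladder used here (MINOR to MAJOR, MAJOR to CRITICAL, everything else unchanged) and the name escalate_severity are assumptions for illustration, not the behavior of the escalate command.

```shell
# escalate_severity SEVERITY STREAK
# STREAK is the number of consecutive survivors in the dimension.
# Prints the (possibly promoted) severity.
escalate_severity() {
    local severity="$1" streak="$2"
    if [ "$streak" -lt 3 ]; then
        echo "$severity"   # below the threshold: unchanged
        return 0
    fi
    case "$severity" in
        MINOR) echo "MAJOR" ;;      # assumed ladder step
        MAJOR) echo "CRITICAL" ;;   # assumed ladder step
        *)     echo "$severity" ;;  # CRITICAL and others stay as they are
    esac
}

escalate_severity MINOR 2   # prints: MINOR
escalate_severity MINOR 3   # prints: MAJOR
escalate_severity MAJOR 5   # prints: CRITICAL
```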
Severity Classification
| Severity | Deterministic Criteria | Landscape Effect |
|---|---|---|
| CRITICAL | Core workflow command returns wrong exit code (0 on error, non-0 on success) | dimension.fitness = 1.0 (maximum pressure) |
| MAJOR | Documented command fails or produces wrong output (verified by grep/diff, not AI) | dimension.fitness += 0.3 |
| MINOR | Output doesn't match documented format (verified by pattern match, not AI) | dimension.fitness += 0.1 |
| OBSERVATION | AI-only judgment (no deterministic check possible) | Not added to done_when |
Note: OBSERVATION is the only severity that relies on AI judgment. CRITICAL/MAJOR/MINOR all have deterministic verification commands in done_when.
Anti-Patterns
| Anti-Pattern | Problem | Deterministic Way |
|---|---|---|
| AI decides if test passed | Nondeterministic, unreproducible | Exit code comparison only |
| AI scores the landscape | Subjective, varies between runs | survivors / tests_run ratio |
| AI decides when to stop | No convergence guarantee | Equilibrium: 0 survivors for 3 consecutive gens |
| AI gates transitions | Bypassable, inconsistent | assert-can-* functions in liza |
| Tests not persisted | Lost between sessions | done_when in blackboard YAML |
| AI judges regression | Flaky, depends on prompt | validate runs all checks, exit code comparison |
| Narrative verdict | Different every run | Computed from blackboard fields |
Quality Gates (All Deterministic)
- nu $L show --task=drq-session returns valid state
- At least 3 generations executed (check generation counter)
- Every survivor has a done_when entry (check done_when length >= survivor count)
- nu $L validate drq-session passes (all done_when checks green)
- Landscape scores computed (survivors / tests_run for each dimension)
- Equilibrium checked (0-survivor generations counted)
- Crown status derived from validate result + survivor severities
Session Completion
# 1. Beads already filed (filed at survivor selection, not here)
# Verify: br list --status=open | grep "Red Queen"
# 2. Verify lineage integrity (deterministic)
nu $L validate drq-session
# 3. Show final state (deterministic)
nu $L show --task=drq-session
# 4. Push
git add . && git commit -m "test(red-queen): gen N — <verdict>"
git push
Skill Version: 7.0.0
Last Updated: January 2026
Status: Production-Ready
Model: Deterministic Adversarial Evolution (AI generates tests, code decides outcomes)