---
name: dev-testing
description: Capsem testing policy and workflow. Use whenever running tests, writing new tests, or verifying changes work. Covers the three test tiers (unit, smoke, full), TDD red-green-refactor, adversarial security testing, coverage policy, and the mandatory end-to-end VM validation. For VM-specific tests see dev-testing-vm, for hypervisor tests see dev-testing-hypervisor, for frontend tests see dev-testing-frontend.
---
# Testing
## Test tiers
Three tiers, fast to thorough. Every change must pass all three before it ships.
| Command | What | VM? |
|---|---|---|
| `just test` | Everything: unit tests (llvm-cov, warnings-as-errors for service crates) + cross-compile + frontend + all Python integration tests + injection + benchmarks | Yes |
| `just smoke` | Quick end-to-end: repack + sign + boot + capsem-doctor + MCP + service integration (~30s) | Yes |
`just test` is the single source of truth. There is no "fast" tier that skips integration tests -- that's how the "Connection refused" bug shipped while tests said green. Individual `test-*` recipes exist for targeted debugging, but `just test` is the gate.
## TDD workflow
Write tests first:
- Write failing tests that capture expected behavior
- Verify they fail for the right reason
- Write minimal implementation to pass them
- Refactor
Without a failing test first, it's easy to write tests that pass by accident or don't actually verify the behavior you intended.
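A minimal red-green sketch of that loop in pytest form (the `service_client` fixture, the rename endpoint, and the expected status codes are hypothetical stand-ins, purely to show the shape):

```python
# Step 1: write this BEFORE the feature exists.
def test_rename_rejects_path_traversal(service_client):
    resp = service_client.post("/vms/abc123/rename", json={"name": "../etc"})
    assert resp.status_code == 400  # expected behavior, captured up front

# Step 2: run it and confirm it fails for the RIGHT reason (e.g. 404 because
# the route doesn't exist yet), not because of a typo or a broken fixture.
# Step 3: write the minimal handler until it passes. Step 4: refactor.
```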
## Parallel tests as dogfooding (n=4 is non-negotiable)
`just test` runs the Python suite under `pytest -n 4 --dist=loadfile`. Four real VMs boot simultaneously. This is the canary, not just a speed-up. We ship Capsem as a multi-VM sandbox for AI agents -- if our own test suite cannot safely boot 4 concurrent VMs, real users running an agent farm will hit the exact same bug. Treat any concurrency flake as a Capsem-side bug, not a test-tuning problem:
- "Suspend timed out" under load -> service IPC handling is racy, not "bump the timeout"
- "Session did not become ready" -> Apple VZ resource serialization, VirtioFS lock contention, or service handling concurrent provisions; investigate, don't suppress
- Two tests both want the same VM name -> name-collision bug in `validate_vm_name`/registry, not "isolate test names better"
- Stale socket between tests -> service didn't reap a child cleanly, real production bug
Anti-patterns when a test flakes under `-n 4`:
- Adding `time.sleep()` to "let things settle" -- masking a race
- Bumping the per-test timeout -- buying time for a real bug to manifest in prod instead of CI
- Marking the test `serial` so it runs alone -- defeating the dogfooding signal
The host has plenty of headroom (48 GB RAM, 14 cores; 4 VMs at 2 GB / 2 CPU each = 8 GB / 8 cores). If concurrency surfaces a flake, fix the product, then re-run. Bumping `-n` higher (8, 12) is the natural follow-on once n=4 is stable -- real users will run more.
## Orphan processes across runs are a product bug (not a test bug)
If a previous `just test` run at `-n 4` was interrupted (ctrl-C, pytest-xdist worker death, host crash) and the NEXT run flakes with "vm-ready never asserted", UDS "connection refused", or mysterious HTTP 500s -- the cause is companion processes from the interrupted run still alive under PID 1. `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` will make the flake vanish, but that is cleanup after the fact. The fix is on the COMPANION side: every spawned companion (gateway, tray, and any new one) must use `capsem-guard::install(parent_pid, lock_path)` to enforce (a) refuse-standalone, (b) singleton, (c) self-exit on parent death. See /dev-rust-patterns lesson 18. Regression tests live in `tests/capsem-service/test_companion_lifecycle.py` -- never remove them; when adding a new companion, extend that file.
Never `pkill -f capsem-` with a broad pattern during test debugging: `capsem-` matches `--crate-name capsem-core` in running rustc/cargo invocations and will SIGKILL the compiler mid-build. Use a binary-path pattern like `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` instead.
## When `-n 1` is actually the right answer: multi-service-only gotchas
One narrow class of concurrency bug belongs at `-n 1`, not `-n 4`: bugs that only exist when two capsem-service processes run on the same host. Apple's Virtualization.framework does not tolerate overlapping `saveMachineStateToURL` / `restoreMachineStateFromURL` calls on sibling VMs, and we serialize with a per-service `tokio::sync::Mutex` (`ServiceState::save_restore_lock`). That lock is in-process, so it only serializes VMs inside one service. Production always has exactly one service per host per user, so the lock is sufficient in real deployments.
`tests/capsem-mcp/test_stress_suspend_resume.py` runs under pytest-xdist, which spawns one capsem-service per worker. At `-n 2` and above, worker A's service can't see worker B's lock, and you re-expose the bug that never happens in production. This is the one case where the "n=4 dogfoods concurrency" rule doesn't apply -- the concurrency being tested would never happen outside the test harness. Keep this harness at `-n 1`. Full context and the failure signature live in docs/src/content/docs/gotchas/concurrent-suspend-resume.md.
This is NOT a blanket license to run any flaky test at `-n 1`. If you're tempted to demote another test, first ask: "Would this failure occur in production with one capsem-service and N VMs?" If yes, it belongs at `-n 4`; fix the product.
## Adversarial testing
Capsem is a security product. Every security-relevant feature needs tests that actively try to break invariants. Think like an attacker:
- Can a corp-blocked domain be snuck through another provider's list?
- Does an overlapping wildcard in allow+block always deny?
- Does malformed input (empty strings, unicode, huge payloads, invalid JSON) get rejected?
- Can path traversal escape the VirtioFS sandbox?
- Can a guest process modify its own binaries?
Stress-test boundary conditions. Write tests for the attacks you'd attempt yourself.
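Malformed-input checks in particular collapse nicely into a parametrized attack table. A sketch of that style (the `service_client` fixture, the endpoint, and the accepted status codes are hypothetical):

```python
import pytest

# Attack table: each entry is something an attacker could plausibly send.
MALFORMED_NAMES = [
    "",                    # empty string
    "a" * 10_000,          # huge payload
    "../../etc/passwd",    # path traversal
    "vm\x00name",          # embedded null byte
    "vm name",             # whitespace
    "vm-\u202ename",       # unicode, incl. RTL override
]

@pytest.mark.parametrize("name", MALFORMED_NAMES)
def test_provision_rejects_malformed_names(service_client, name):
    resp = service_client.post("/vms", json={"name": name})
    assert resp.status_code in (400, 422), f"accepted hostile name {name!r}"
```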
### Security invariants to verify in tests
When touching security-relevant code, check these invariants have test coverage:
| Invariant | What to test | Where |
|---|---|---|
| VirtioFS share is guest/ only | session_dir/guest/ exists, symlinks resolve, host-only files (session.db, serial.log) are outside the share | capsem-core::lib::tests |
| UDS sockets are 0600 | After bind, verify permissions exclude other users | capsem-process |
| Process env is cleared | env_clear() called, only allowlisted vars passed | capsem-service spawn tests |
| No process::exit on guest I/O | Control channel close causes loop break, not exit | capsem-process |
| Sensitive logs are 0600 | serial.log created with restricted permissions | capsem-process |
| Gateway auth on all routes | Every route except GET / returns 401 without token | capsem-gateway::auth::tests |
| Auth rate limiting | 429 after threshold, resets after window | capsem-gateway::auth::tests |
| CORS rejects external origins | Only localhost/127.0.0.1/tauri allowed | capsem-gateway::tests |
| Body size limit | 413 for >10MB payloads | capsem-gateway::proxy::tests |
| VM ID validation | Path traversal (../), dots, spaces, null bytes rejected | capsem-gateway::terminal::tests |
| Rootfs read-only | squashfs mounted ro, guest binaries 555 | capsem-doctor in-VM tests |
| Suspend reports errors | IPC failure and timeout both return 500, not silent success | capsem-service tests |
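Most of these reduce to a few assertions once the setup exists. For instance, the UDS-permissions invariant, sketched in pytest form (the `running_service` fixture is hypothetical; the real coverage lives in capsem-process):

```python
import os
import stat

def test_uds_socket_mode_is_0600(running_service):
    # running_service is a hypothetical fixture exposing the bound socket path
    mode = stat.S_IMODE(os.stat(running_service.socket_path).st_mode)
    assert mode == 0o600, f"socket mode {oct(mode)} is visible to other users"
```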
## Test fixture anti-pattern: masking races with polling
If all test fixtures wait/poll before asserting, the tests will never catch server-side race conditions. For every endpoint that talks to a VM socket, write at least one test that calls it IMMEDIATELY after provision (no `wait_exec_ready`, no `ready_vm` fixture). The server must handle readiness internally.
Pattern to avoid (masks the bug -- server never needs wait logic because client always waits):
fixture calls provision -> fixture polls wait_exec_ready -> test calls exec
Required test pattern (catches the bug -- if server doesn't wait, test fails):
test calls provision -> test immediately calls exec -> server handles wait
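In sketch form (client helper names are hypothetical):

```python
def test_exec_immediately_after_provision(service_client):
    # Deliberately NO wait_exec_ready and NO ready_vm fixture: if the server
    # doesn't handle readiness internally, this test fails -- which is the point.
    vm = service_client.provision(name="race-probe")
    result = service_client.exec(vm.id, ["true"])  # fired immediately
    assert result.exit_code == 0
```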
See `tests/capsem-service/test_svc_exec_ready.py` for the regression tests that enforce this.
### `wait_exec_ready` is a single call, not a loop
`wait_exec_ready` (in `tests/helpers/service.py`, `tests/helpers/mcp.py`, `tests/capsem-gateway/test_gw_e2e.py`) makes one exec call with the server-side timeout passed through. The server's `handle_exec` calls `wait_for_vm_ready` internally, which polls until the VM is ready. Do NOT add client-side retry loops -- that creates a double-wait where each retry can block for the full server timeout (30s client retries x 30s server wait = pathological cascade). One wait, one place.
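The shape, reduced to a sketch (client names beyond `wait_exec_ready` itself are hypothetical):

```python
# Correct: a single exec call; the SERVER polls wait_for_vm_ready internally.
def wait_exec_ready(client, vm_id, timeout_s=30):
    return client.exec(vm_id, ["true"], timeout=timeout_s)

# WRONG -- client-side retries around a server that already waits.
# Worst case blocks for retries * timeout_s: the pathological cascade.
#
#   for _ in range(30):
#       try:
#           return client.exec(vm_id, ["true"], timeout=timeout_s)
#       except TimeoutError:
#           continue
```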
### Exec latency regression gate
`tests/capsem-serial/test_boot_timing.py::test_exec_latency_under_1_5_seconds` asserts that provision-to-first-exec completes in under 1.5s. If this test fails, investigate boot time (the `boot_timeline` spans in process.log), not the wait mechanism.
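The gate is a plain timing assertion around the first exec; roughly (helper names hypothetical):

```python
import time

def test_exec_latency_under_1_5_seconds(service_client):
    start = time.monotonic()
    vm = service_client.provision(name="latency-probe")
    service_client.exec(vm.id, ["true"])  # first exec, no warm-up allowed
    elapsed = time.monotonic() - start
    assert elapsed < 1.5, f"provision-to-first-exec took {elapsed:.2f}s"
```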
## Where tests live
- Rust unit: sibling `tests.rs` file, not inline `mod tests { ... }`. See the next subsection.
- Rust integration: `crates/capsem-core/tests/`
- In-VM diagnostics: `guest/artifacts/diagnostics/test_*.py` (see dev-testing-vm)
- Hypervisor: KVM + Apple VZ tests (see dev-testing-hypervisor)
- Frontend: `frontend/src/lib/__tests__/` (see dev-testing-frontend)
- Python (builder): `tests/test_*.py`
- Python integration (service daemon): `tests/capsem-*/` directories, each with its own conftest.py and pytest marker
### Rust unit tests: sibling `tests.rs` pattern
Every Rust module keeps its unit tests in a sibling `tests.rs`, not an inline `mod tests { ... }` block. The parent module declares:
```rust
// foo.rs OR foo/mod.rs
// ... production code ...

#[cfg(test)]
mod tests;
```
and the tests go in `tests.rs` in the same directory:
```rust
// tests.rs -- sibling of foo.rs or child of foo/
use super::*;

#[test]
fn roundtrip() { /* ... */ }
```
Why. Inline `#[cfg(test)] mod tests { ... }` blocks are appended at the bottom of prod files and commonly account for 50–99% of the file's line count. That means every Read, grep, and scroll to reach production code walks past thousands of test lines first. Several modules in this codebase hit 4,000+ lines that way before extraction. Agents and humans both read faster when prod code isn't buried.
Mechanics.
- `tests.rs` is a submodule of the parent file -- `use super::*;` works, private items are visible, and `#[cfg(test)]` on the `mod tests;` declaration still gates compilation.
- For files that don't yet have a sibling directory (e.g. lib.rs, foo.rs), put `tests.rs` next to them in the same `src/` directory.
- For files that are already `foo/mod.rs`, put `tests.rs` inside `foo/`.
- Attributes on the inline `mod tests` block (e.g. `#[allow(unused_imports)]`) move onto the declaration: `#[cfg(test)]\n#[allow(unused_imports)]\nmod tests;`.
Extraction recipe (for any remaining inline `mod tests { ... }`):
- Move the block body (everything between the outer `{` and `}`) into a new sibling `tests.rs`.
- Dedent one indentation level so contents read as top-level items.
- Replace the old inline block with `#[cfg(test)] mod tests;` (plus any attributes that were on the original).
- `cargo test -p <crate>` should pass identically.
When to push back. If you see a new PR or agent output adding an inline `mod tests { ... }` block, request it be moved to `tests.rs` before merge. Exceptions are narrow: tiny helper modules under ~50 lines total where inline tests plus prod code fit on one screen, or a module that's already a test-only helper.
## Integration test suites
All Python integration tests live under `tests/capsem-*/` and use pytest markers. Each suite has a dedicated `just` recipe.
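One way a suite's conftest.py can stamp its marker onto everything it collects, so `-m <marker>` selects the whole suite -- a sketch, not the literal conftest:

```python
# tests/capsem-snapshots/conftest.py (sketch)
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        item.add_marker(pytest.mark.snapshot)  # suite-wide marker
```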
| Suite | Directory | Marker | VM? | What it tests |
|---|---|---|---|---|
| Service API | capsem-service/ | integration | Yes | HTTP endpoints: provision, list, info, exec, logs, file I/O, delete |
| CLI | capsem-cli/ | integration | Yes | CLI subcommands via subprocess |
| MCP | capsem-mcp/ | mcp | Yes | MCP server black-box (stdio, tool routing) |
| Session DB | capsem-session/ | session | Yes | Telemetry: net/model/tool/mcp/fs/snapshot events |
| Snapshots | capsem-snapshots/ | snapshot | Yes | Auto/manual snapshots, revert |
| Isolation | capsem-isolation/ | isolation | Yes | Multi-VM filesystem + network isolation |
| Security | capsem-security/ | security | Yes | Binary perms, codesigning, asset integrity, env blocklist |
| Config | capsem-config/ | config | Yes | Limits, resource bounds, hot-reload |
| Bootstrap | capsem-bootstrap/ | bootstrap | No | Setup flow, dev tools, asset checks |
| Stress | capsem-stress/ | stress | Yes | 5 concurrent VMs, rapid create/delete |
| Build chain | capsem-build-chain/ | build_chain | Yes | cargo build -> codesign -> pack -> manifest -> boot |
| Guest | capsem-guest/ | guest | Yes | Network, services, filesystem, env inside guest |
| Cleanup | capsem-cleanup/ | cleanup | Yes | Process killed, socket removed, session dir removed |
| Codesign | capsem-codesign/ | codesign | No | All binaries signed, entitlements present (FAIL not skip) |
| Serial | capsem-serial/ | serial | Yes | Console logs, boot timing < 30s |
| Session lifecycle | capsem-session-lifecycle/ | session_lifecycle | Yes | DB exists, schema, events, survives shutdown |
| Config runtime | capsem-config-runtime/ | config_runtime | Yes | CPU/RAM applied in guest, blocked domains |
| Recipes | capsem-recipes/ | recipe | No | just run-service, just doctor, cargo build |
| Recovery | capsem-recovery/ | recovery | Yes | Stale socket/instances, orphaned process, double service |
| Rootfs artifacts | capsem-rootfs-artifacts/ | rootfs | No | Artifact files, build context, doctor consistency |
| Session exhaustive | capsem-session-exhaustive/ | session_exhaustive | Yes | Per-table data validation, cross-table FK integrity |
| Install | capsem-install/ | install | No | Native installer: layout, auto-launch, service install, setup wizard, update, uninstall, lifecycle, reinstall, error paths |
Composite recipe: `just test-vm` runs build-chain + guest + cleanup + codesign + serial + session-lifecycle + config-runtime + recovery. `just test-install` runs the install suite in Docker with systemd. `just test` runs everything.
## Test matrix: what runs where
### Rust crate CI coverage
| Crate | Tests | CI macOS | CI Linux | Smoke | Full |
|---|---|---|---|---|---|
| capsem-core | ~1695 | Yes | Yes | No | Yes |
| capsem-agent | ~71 | Yes | No | No | Yes |
| capsem-logger | ~47 | Yes | Yes | No | Yes |
| capsem-proto | ~132 | Yes | Yes | No | Yes |
| capsem-gateway | ~38 | Yes | No | No | Yes |
| capsem-service | ~109 | Yes | Yes | No | Yes |
| capsem (CLI) | ~140 | Yes | Yes | No | Yes |
| capsem-mcp | ~67 | Yes | Yes | No | Yes |
| capsem-tray | ~47 | Yes | No | No | Yes |
| capsem-process | ~62 | Yes | No | No | Yes |
| capsem-app | ~35 | Check | No | No | Yes |
### Python integration suite tier map
| Suite | Marker | VM? | CI | Smoke | Full |
|---|---|---|---|---|---|
| capsem-bootstrap | bootstrap | No | Run | No | Yes |
| capsem-codesign | codesign | No | Run | No | Yes |
| capsem-rootfs-artifacts | rootfs | No | Run | No | Yes |
| capsem-mcp | mcp | Yes | Collect | Yes | Yes |
| capsem-service | integration | Yes | Collect | Yes | Yes |
| capsem-cli | integration | Yes | Collect | Yes | Yes |
| capsem-gateway | gateway | Yes | Collect | Yes | Yes |
| capsem-e2e | e2e | Yes | Collect | No | Yes |
| capsem-session | session | Yes | Collect | No | Yes |
| capsem-session-lifecycle | session_lifecycle | Yes | Collect | No | Yes |
| capsem-session-exhaustive | session_exhaustive | Yes | Collect | No | Yes |
| capsem-security | security | Yes | Collect | No | Yes |
| capsem-isolation | isolation | Yes | Collect | No | Yes |
| capsem-snapshots | snapshot | Yes | Collect | No | Yes |
| capsem-config | config | Yes | Collect | No | Yes |
| capsem-config-runtime | config_runtime | Yes | Collect | No | Yes |
| capsem-guest | guest | Yes | Collect | No | Yes |
| capsem-cleanup | cleanup | Yes | Collect | No | Yes |
| capsem-stress | stress | Yes | Collect | No | Yes |
| capsem-recovery | recovery | Yes | Collect | No | Yes |
| capsem-serial | serial | Yes | Collect | No | Yes |
| capsem-lifecycle | integration | Yes | Collect | No | Yes |
| capsem-build-chain | build_chain | Yes | Collect | No | Yes |
| capsem-recipes | recipe | No | Run | No | Yes |
| capsem-install | install | No | Yes (Docker) | No | Yes |
"Run" = tests execute in CI. "Collect" = imports verified (--collect-only) but tests skip (need VM). "Yes (Docker)" = runs in dedicated Docker+systemd CI job.
## Coverage targets
| Component | Floor | Enforced | Where |
|---|---|---|---|
| Rust workspace | 70% | --fail-under-lines 70 | CI (cargo llvm-cov), just test |
| Python builder | 90% | --cov-fail-under=90 | CI (pytest), just test |
| capsem-service | 80% | Codecov component | codecov.yml |
| capsem-mcp | 80% | Codecov component | codecov.yml |
| capsem-gateway | 80% | Codecov component | codecov.yml |
| capsem (CLI) | 80% | Codecov component | codecov.yml |
### Coverage
- Rust: `cargo llvm-cov` via `just test` (floor: 70% line coverage)
- Python: `--cov-fail-under=90`

`codecov.yml` maps components to code paths. Update it when files or directories are added, moved, or renamed.
## Fast debug with capsem MCP tools
When the capsem MCP server is configured, Claude Code has direct VM control via MCP tools -- no shell commands or just recipes needed. This is the fastest way to test changes interactively because you stay in the conversation loop: create a VM, run commands, inspect results, fix code, repeat.
### The tools
| Tool | What it does |
|---|---|
| capsem_create | Spin up a fresh VM (returns VM id). Named VMs are persistent. |
| capsem_run | One-shot: boot temp VM, exec command, destroy, return output |
| capsem_exec | Run a command inside a running guest |
| capsem_stop | Stop VM (persistent: preserve state; ephemeral: destroy) |
| capsem_resume | Resume a stopped persistent VM |
| capsem_read_file | Read a file from the guest filesystem |
| capsem_write_file | Write a file into the guest |
| capsem_inspect_schema | Get session.db table schema |
| capsem_inspect | Run SQL against session.db (telemetry) |
| capsem_list | Show all VMs (running + stopped persistent) |
| capsem_info | VM details (config, status, persistent, PID) |
| capsem_delete | Destroy VM and wipe all state |
| capsem_persist | Convert running ephemeral VM to persistent |
| capsem_purge | Kill all temp VMs (all=true includes persistent) |
| capsem_fork | Fork a running/stopped VM into a reusable image |
| capsem_image_list | List all user images |
| capsem_image_inspect | Inspect a specific image's metadata |
| capsem_image_delete | Delete a user image |
### Debug workflow
Quick one-shot (no VM management): `capsem_run` with the command you want to test.
Iterative debugging (long-lived VM):
- Create: `capsem_create` -- boots a fresh VM in ~10s
- Test: `capsem_exec` with the command you want to verify (e.g., `capsem-doctor -k net`, `cat /etc/resolv.conf`, `curl https://example.com`)
- Inspect: `capsem_read_file` to check config files, logs; `capsem_inspect` to query telemetry tables
- Iterate: fix code on host, rebuild (`just build`), create a new VM to test again
- Cleanup: `capsem_delete` when done
### When to use MCP tools vs just recipes
| Scenario | Use |
|---|---|
| Quick check: "does this command work in the guest?" | capsem_run |
| Read a guest file to understand state | capsem_read_file |
| Verify telemetry was recorded correctly | capsem_inspect with SQL query |
| Full regression suite | just test |
| Build + boot + validate in one shot | just smoke |
| Benchmark performance | just bench |
MCP tools are for fast, targeted checks during development. Just recipes are for comprehensive validation before committing.
### Common debug queries
```sql
-- Check network events for a domain
SELECT * FROM net_events WHERE domain LIKE '%example%' ORDER BY timestamp DESC LIMIT 10;

-- Verify MCP tool calls were logged
SELECT server_name, tool_name, decision, duration_ms FROM mcp_calls ORDER BY timestamp DESC;

-- Check model API calls
SELECT provider, model, status_code, duration_ms FROM model_calls ORDER BY timestamp DESC;

-- File system events
SELECT operation, path, success FROM fs_events ORDER BY timestamp DESC LIMIT 20;
```
## End-to-end validation is not optional
After any change touching guest binaries, network policy, telemetry, MCP, or VM lifecycle:
just run "capsem-doctor"-- verifies sandbox integrity inside the VM- After telemetry/logging changes: run a real session and verify with
just inspect-sessionthat all 6 tables (net_events, model_calls, tool_calls, tool_responses, mcp_calls, fs_events) are populated correctly
## When tests fail
Never dismiss a test failure as "pre-existing" or "unrelated." Every failure must be investigated. Follow the dev-debugging workflow:
- Do not change the test to make it pass. The test is evidence. Changing the assertion to match broken behavior destroys that evidence.
- Reproduce and diagnose first. Understand why it fails before writing any fix. See the dev-debugging skill for the full methodology: reproduce with a test, diagnose root cause, then fix comprehensively.
- Fix the code, not the test. If the test is genuinely wrong (not the code), explain in detail why the test's expectation is incorrect before changing it.
## Platform gating tests
`cargo test --test platform_gating` scans all .rs files under `crates/` for macOS-only and Linux-only symbols (`libc::clonefile`, `AppleVzHypervisor`, `KvmHypervisor`, `FICLONE`, etc.) and verifies they appear inside `#[cfg(target_os = "...")]` blocks. This catches ungated platform APIs before they reach CI. Run this test when adding any platform-specific code.
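In spirit, the scan does something like the following grossly simplified Python sketch; the real test is a Rust integration test and, unlike this sketch, verifies the symbol actually sits inside the `#[cfg(...)]` block's scope:

```python
from pathlib import Path

MACOS_ONLY = ["clonefile", "AppleVzHypervisor"]
LINUX_ONLY = ["KvmHypervisor", "FICLONE"]

def test_platform_symbols_are_cfg_gated():
    for rs in Path("crates").rglob("*.rs"):
        text = rs.read_text()
        for sym in MACOS_ONLY + LINUX_ONLY:
            if sym in text:
                # Simplification: only checks the attribute exists somewhere
                # in the file, not that it encloses the symbol's use.
                assert '#[cfg(target_os' in text, f"{rs}: ungated {sym}"
```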
## Testable design
Extract logic into capsem-core -- never embed business logic in the app layer where it's coupled to Tauri. If you can't test something without booting a VM or launching the GUI, it belongs in core.