---
name: dev-testing
description: Capsem testing policy and workflow. Use whenever running tests, writing new tests, or verifying changes work. Covers the three test tiers (unit, smoke, full), TDD red-green-refactor, adversarial security testing, coverage policy, and the mandatory end-to-end VM validation. For VM-specific tests see dev-testing-vm, for hypervisor tests see dev-testing-hypervisor, for frontend tests see dev-testing-frontend.
---
# Testing
## Test tiers
Three tiers, fast to thorough. Every change must pass all three before it ships.
| Command | What | VM? |
|---|---|---|
| `just test` | Everything: unit tests (llvm-cov, warnings-as-errors for service crates) + cross-compile + frontend + all Python integration tests + injection + benchmarks | Yes |
| `just smoke` | Quick end-to-end: repack + sign + boot + capsem-doctor + MCP + service integration (~30s) | Yes |
`just test` is the single source of truth. There is no "fast" tier that skips integration tests -- that's how the "Connection refused" bug shipped while tests said green. Individual `test-*` recipes exist for targeted debugging, but `just test` is the gate.
## TDD workflow
Write tests first:
- Write failing tests that capture expected behavior
- Verify they fail for the right reason
- Write minimal implementation to pass them
- Refactor
Without a failing test first, it's easy to write tests that pass by accident or don't actually verify the behavior you intended.
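A minimal red-green sketch of that loop in pytest form (the `service_client` fixture, the rename endpoint, and the expected status codes are hypothetical stand-ins, purely to show the shape):

```python
# Step 1: write this BEFORE the feature exists.
def test_rename_rejects_path_traversal(service_client):
    resp = service_client.post("/vms/abc123/rename", json={"name": "../etc"})
    assert resp.status_code == 400  # expected behavior, captured up front

# Step 2: run it and confirm it fails for the RIGHT reason (e.g. 404 because
# the route doesn't exist yet), not because of a typo or a broken fixture.
# Step 3: write the minimal handler until it passes. Step 4: refactor.
```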
## Parallel tests as dogfooding (n=4 is non-negotiable)
`just test` runs the Python suite under `pytest -n 4 --dist=loadfile`. Four real VMs boot simultaneously. This is the canary, not just a speed-up. We ship Capsem as a multi-VM sandbox for AI agents -- if our own test suite cannot safely boot 4 concurrent VMs, real users running an agent farm will hit the exact same bug. Treat any concurrency flake as a Capsem-side bug, not a test-tuning problem:
- "Suspend timed out" under load -> service IPC handling is racy, not "bump the timeout"
- "Session did not become ready" -> Apple VZ resource serialization, VirtioFS lock contention, or service handling concurrent provisions; investigate, don't suppress
- Two tests both want the same VM name -> name-collision bug in `validate_vm_name`/registry, not "isolate test names better"
- Stale socket between tests -> service didn't reap a child cleanly, real production bug
Anti-patterns when a test flakes under `-n 4`:
- Adding `time.sleep()` to "let things settle" -- masking a race
- Bumping the per-test timeout -- buying time for a real bug to manifest in prod instead of CI
- Marking the test `serial` so it runs alone -- defeating the dogfooding signal
The host has plenty of headroom (48 GB RAM, 14 cores; 4 VMs at 2 GB / 2 CPU each = 8 GB / 8 cores). If concurrency surfaces a flake, fix the product, then re-run. Bumping `-n` higher (8, 12) is the natural follow-on once n=4 is stable -- real users will run more.
## Orphan processes across runs are a product bug (not a test bug)
If a previous `just test` run at `-n 4` was interrupted (ctrl-C, pytest-xdist worker death, host crash) and the NEXT run flakes with "vm-ready never asserted", UDS "connection refused", or mysterious HTTP 500s -- the cause is companion processes from the interrupted run still alive under PID 1. `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` will make the flake vanish, but that is cleanup after the fact. The fix is on the COMPANION side: every spawned companion (gateway, tray, and any new one) must use `capsem-guard::install(parent_pid, lock_path)` to enforce (a) refuse-standalone, (b) singleton, (c) self-exit on parent death. See /dev-rust-patterns lesson 18. Regression tests live in `tests/capsem-service/test_companion_lifecycle.py` -- never remove them; when adding a new companion, extend that file.
Never `pkill -f capsem-` with a broad pattern during test debugging: `capsem-` matches `--crate-name capsem-core` in running rustc/cargo invocations and will SIGKILL the compiler mid-build. Use a binary-path pattern like `pkill -f "target/debug/capsem-(service|process|gateway|tray|mcp)"` instead.
## When `-n 1` is actually the right answer: multi-service-only gotchas
One narrow class of concurrency bug belongs at `-n 1`, not `-n 4`: bugs that only exist when two capsem-service processes run on the same host. Apple's Virtualization.framework does not tolerate overlapping `saveMachineStateToURL` / `restoreMachineStateFromURL` calls on sibling VMs, and we serialize with a per-service `tokio::sync::Mutex` (`ServiceState::save_restore_lock`). That lock is in-process, so it only serializes VMs inside one service. Production always has exactly one service per host per user, so the lock is sufficient in real deployments.
`tests/capsem-mcp/test_stress_suspend_resume.py` runs under pytest-xdist, which spawns one capsem-service per worker. At `-n 2` and above, worker A's service can't see worker B's lock, and you re-expose the bug that never happens in production. This is the one case where the "n=4 dogfoods concurrency" rule doesn't apply -- the concurrency being tested would never happen outside the test harness. Keep this harness at `-n 1`. Full context and the failure signature live in docs/src/content/docs/gotchas/concurrent-suspend-resume.md.
This is NOT a blanket license to run any flaky test at `-n 1`. If you're tempted to demote another test, first ask: "Would this failure occur in production with one capsem-service and N VMs?" If yes, it belongs at `-n 4`; fix the product.
## Adversarial testing
Capsem is a security product. Every security-relevant feature needs tests that actively try to break invariants. Think like an attacker:
- Can a corp-blocked domain be snuck through another provider's list?
- Does an overlapping wildcard in allow+block always deny?
- Does malformed input (empty strings, unicode, huge payloads, invalid JSON) get rejected?
- Can path traversal escape the VirtioFS sandbox?
- Can a guest process modify its own binaries?
Stress-test boundary conditions. Write tests for the attacks you'd attempt yourself.
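Malformed-input checks in particular collapse nicely into a parametrized attack table. A sketch of that style (the `service_client` fixture, the endpoint, and the accepted status codes are hypothetical):

```python
import pytest

# Attack table: each entry is something an attacker could plausibly send.
MALFORMED_NAMES = [
    "",                    # empty string
    "a" * 10_000,          # huge payload
    "../../etc/passwd",    # path traversal
    "vm\x00name",          # embedded null byte
    "vm name",             # whitespace
    "vm-\u202ename",       # unicode, incl. RTL override
]

@pytest.mark.parametrize("name", MALFORMED_NAMES)
def test_provision_rejects_malformed_names(service_client, name):
    resp = service_client.post("/vms", json={"name": name})
    assert resp.status_code in (400, 422), f"accepted hostile name {name!r}"
```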
### Security invariants to verify in tests
When touching security-relevant code, check these invariants have test coverage:
| Invariant | What to test | Where |
|---|---|---|
| VirtioFS share is guest/ only | session_dir/guest/ exists, symlinks resolve, host-only files (session.db, serial.log) are outside the share | capsem-core::lib::tests |
| UDS sockets are 0600 | After bind, verify permissions exclude other users | capsem-process |
| Process env is cleared | env_clear() called, only allowlisted vars passed | capsem-service spawn tests |
| No process::exit on guest I/O | Control channel close causes loop break, not exit | capsem-process |
| Sensitive logs are 0600 | serial.log created with restricted permissions | capsem-process |
| Gateway auth on all routes | Every route except GET / returns 401 without token | capsem-gateway::auth::tests |
| Auth rate limiting | 429 after threshold, resets after window | capsem-gateway::auth::tests |
| CORS rejects external origins | Only localhost/127.0.0.1/tauri allowed | capsem-gateway::tests |
| Body size limit | 413 for >10MB payloads | capsem-gateway::proxy::tests |
| VM ID validation | Path traversal (../), dots, spaces, null bytes rejected | capsem-gateway::terminal::tests |
| Rootfs read-only | squashfs mounted ro, guest binaries 555 | capsem-doctor in-VM tests |
| Suspend reports errors | IPC failure and timeout both return 500, not silent success | capsem-service tests |
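Most of these reduce to a few assertions once the setup exists. For instance, the UDS-permissions invariant, sketched in pytest form (the `running_service` fixture is hypothetical; the real coverage lives in capsem-process):

```python
import os
import stat

def test_uds_socket_mode_is_0600(running_service):
    # running_service is a hypothetical fixture exposing the bound socket path
    mode = stat.S_IMODE(os.stat(running_service.socket_path).st_mode)
    assert mode == 0o600, f"socket mode {oct(mode)} is visible to other users"
```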
## Test fixture anti-pattern: masking races with polling
If all test fixtures wait/poll before asserting, the tests will never catch server-side race conditions. For every endpoint that talks to a VM socket, write at least one test that calls it IMMEDIATELY after provision (no `wait_exec_ready`, no `ready_vm` fixture). The server must handle readiness internally.
Pattern to avoid (masks the bug -- server never needs wait logic because client always waits):
fixture calls provision -> fixture polls wait_exec_ready -> test calls exec
Required test pattern (catches the bug -- if server doesn't wait, test fails):
test calls provision -> test immediately calls exec -> server handles wait
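In sketch form (client helper names are hypothetical):

```python
def test_exec_immediately_after_provision(service_client):
    # Deliberately NO wait_exec_ready and NO ready_vm fixture: if the server
    # doesn't handle readiness internally, this test fails -- which is the point.
    vm = service_client.provision(name="race-probe")
    result = service_client.exec(vm.id, ["true"])  # fired immediately
    assert result.exit_code == 0
```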
See `tests/capsem-service/test_svc_exec_ready.py` for the regression tests that enforce this.
### `wait_exec_ready` is a single call, not a loop
`wait_exec_ready` (in `tests/helpers/service.py`, `tests/helpers/mcp.py`, `tests/capsem-gateway/test_gw_e2e.py`) makes one exec call with the server-side timeout passed through. The server's `handle_exec` calls `wait_for_vm_ready` internally, which polls until the VM is ready. Do NOT add client-side retry loops -- that creates a double-wait where each retry can block for the full server timeout (30s client retries x 30s server wait = pathological cascade). One wait, one place.
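The shape, reduced to a sketch (client names beyond `wait_exec_ready` itself are hypothetical):

```python
# Correct: a single exec call; the SERVER polls wait_for_vm_ready internally.
def wait_exec_ready(client, vm_id, timeout_s=30):
    return client.exec(vm_id, ["true"], timeout=timeout_s)

# WRONG -- client-side retries around a server that already waits.
# Worst case blocks for retries * timeout_s: the pathological cascade.
#
#   for _ in range(30):
#       try:
#           return client.exec(vm_id, ["true"], timeout=timeout_s)
#       except TimeoutError:
#           continue
```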
### Exec latency regression gate
`tests/capsem-serial/test_boot_timing.py::test_exec_latency_under_1_5_seconds` asserts that provision-to-first-exec completes in under 1.5s. If this test fails, investigate boot time (the `boot_timeline` spans in process.log), not the wait mechanism.
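The gate is a plain timing assertion around the first exec; roughly (helper names hypothetical):

```python
import time

def test_exec_latency_under_1_5_seconds(service_client):
    start = time.monotonic()
    vm = service_client.provision(name="latency-probe")
    service_client.exec(vm.id, ["true"])  # first exec, no warm-up allowed
    elapsed = time.monotonic() - start
    assert elapsed < 1.5, f"provision-to-first-exec took {elapsed:.2f}s"
```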
## Where tests live
- Rust unit: sibling `tests.rs` file, not inline `mod tests { ... }`. See the next subsection.
- Rust integration: `crates/capsem-core/tests/`
- In-VM diagnostics: `guest/artifacts/diagnostics/test_*.py` (see dev-testing-vm)
- Hypervisor: KVM + Apple VZ tests (see dev-testing-hypervisor)
- Frontend: `frontend/src/lib/__tests__/` (see dev-testing-frontend)
- Python (builder): `tests/test_*.py`
- Python integration (service daemon): `tests/capsem-*/` directories, each with its own conftest.py and pytest marker
### Rust unit tests: sibling `tests.rs` pattern
Every Rust module keeps its unit tests in a sibling `tests.rs`, not an inline `mod tests { ... }` block. The parent module declares:
```rust
// foo.rs OR foo/mod.rs
// ... production code ...

#[cfg(test)]
mod tests;
```
and the tests go in `tests.rs` in the same directory:
```rust
// tests.rs -- sibling of foo.rs or child of foo/
use super::*;

#[test]
fn roundtrip() { /* ... */ }
```
Why. Inline `#[cfg(test)] mod tests { ... }` blocks are appended at the bottom of prod files and commonly account for 50–99% of the file's line count. That means every Read, grep, and scroll to reach production code walks past thousands of test lines first. Several modules in this codebase hit 4,000+ lines that way before extraction. Agents and humans both read faster when prod code isn't buried.
Mechanics.
- `tests.rs` is a submodule of the parent file -- `use super::*;` works, private items are visible, and `#[cfg(test)]` on the `mod tests;` declaration still gates compilation.
- For files that don't yet have a sibling directory (e.g. lib.rs, foo.rs), put `tests.rs` next to them in the same `src/` directory.
- For files that are already `foo/mod.rs`, put `tests.rs` inside `foo/`.
- Attributes on the inline `mod tests` block (e.g. `#[allow(unused_imports)]`) move onto the declaration: `#[cfg(test)]\n#[allow(unused_imports)]\nmod tests;`.
Extraction recipe (for any remaining inline `mod tests { ... }`):
- Move the block body (everything between the outer `{` and `}`) into a new sibling `tests.rs`.
- Dedent one indentation level so contents read as top-level items.
- Replace the old inline block with `#[cfg(test)] mod tests;` (plus any attributes that were on the original).
- `cargo test -p <crate>` should pass identically.
When to push back. If you see a new PR or agent output adding an inline `mod tests { ... }` block, request it be moved to `tests.rs` before merge. Exceptions are narrow: tiny helper modules under ~50 lines total where inline tests plus prod code fit on one screen, or a module that's already a test-only helper.
## Integration test suites
All Python integration tests live under `tests/capsem-*/` and use pytest markers. Each suite has a dedicated `just` recipe.
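One way a suite's conftest.py can stamp its marker onto everything it collects, so `-m <marker>` selects the whole suite -- a sketch, not the literal conftest:

```python
# tests/capsem-snapshots/conftest.py (sketch)
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        item.add_marker(pytest.mark.snapshot)  # suite-wide marker
```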
| Suite | Directory | Marker | VM? | What it tests |
|---|---|---|---|---|
| Service API | capsem-service/ | integration | Yes | HTTP endpoints: provision, list, info, exec, logs, file I/O, delete |
| CLI | capsem-cli/ | integration | Yes | CLI subcommands via subprocess |
| MCP | capsem-mcp/ | mcp | Yes | MCP server black-box (stdio, tool routing) |
| Session DB | capsem-session/ | session | Yes | Telemetry: net/model/tool/mcp/fs/snapshot events |
| Snapshots | capsem-snapshots/ | snapshot | Yes | Auto/manual snapshots, revert |
| Isolation | capsem-isolation/ | isolation | Yes | Multi-VM filesystem + network isolation |
| Security | capsem-security/ | security | Yes | Binary perms, codesigning, asset integrity, env blocklist |
| Config | capsem-config/ | config | Yes | Limits, resource bounds, hot-reload |
| Bootstrap | capsem-bootstrap/ | bootstrap | No | Setup flow, dev tools, asset checks |
| Stress | capsem-stress/ | stress | Yes | 5 concurrent VMs, rapid create/delete |
| Build chain | capsem-build-chain/ | build_chain | Yes | cargo build -> codesign -> pack -> manifest -> boot |
| Guest | capsem-guest/ | guest | Yes | Network, services, filesystem, env inside guest |
| Cleanup | capsem-cleanup/ | cleanup | Yes | Process killed, socket removed, session dir removed |
| Codesign | capsem-codesign/ | codesign | No | All binaries signed, entitlements present (FAIL not skip) |
| Serial | capsem-serial/ | serial | Yes | Console logs, boot timing < 30s |
| Session lifecycle | capsem-session-lifecycle/ | session_lifecycle | Yes | DB exists, schema, events, survives shutdown |
| Config runtime | capsem-config-runtime/ | config_runtime | Yes | CPU/RAM applied in guest, blocked domains |
| Recipes | capsem-recipes/ | recipe | No | just run-service, just doctor, cargo build |
| Recovery | capsem-recovery/ | recovery | Yes | Stale socket/instances, orphaned process, double service |
| Rootfs artifacts | capsem-rootfs-artifacts/ | rootfs | No | Artifact files, build context, doctor consistency |
| Session exhaustive | capsem-session-exhaustive/ | session_exhaustive | Yes | Per-table data validation, cross-table FK integrity |
| Install | capsem-install/ | install | No | Native installer: layout, auto-launch, service install, setup wizard, update, uninstall, lifecycle, reinstall, error paths |
Composite recipe: `just test-vm` runs build-chain + guest + cleanup + codesign + serial + session-lifecycle + config-runtime + recovery. `just test-install` runs the install suite in Docker with systemd. `just test` runs everything.
## Test matrix: what runs where
### Rust crate CI coverage
| Crate | Tests | CI macOS | CI Linux | Smoke | Full |
|---|---|---|---|---|---|
| capsem-core | ~1695 | Yes | Yes | No | Yes |
| capsem-agent | ~71 | Yes | No | No | Yes |
| capsem-logger | ~47 | Yes | Yes | No | Yes |
| capsem-proto | ~132 | Yes | Yes | No | Yes |
| capsem-gateway | ~38 | Yes | No | No | Yes |
| capsem-service | ~109 | Yes | Yes | No | Yes |
| capsem (CLI) | ~140 | Yes | Yes | No | Yes |
| capsem-mcp | ~67 | Yes | Yes | No | Yes |
| capsem-tray | ~47 | Yes | No | No | Yes |
| capsem-process | ~62 | Yes | No | No | Yes |
| capsem-app | ~35 | Check | No | No | Yes |
### Python integration suite tier map
| Suite | Marker | VM? | CI | Smoke | Full |
|---|---|---|---|---|---|
| capsem-bootstrap | bootstrap | No | Run | No | Yes |
| capsem-codesign | codesign | No | Run | No | Yes |
| capsem-rootfs-artifacts | rootfs | No | Run | No | Yes |
| capsem-mcp | mcp | Yes | Collect | Yes | Yes |
| capsem-service | integration | Yes | Collect | Yes | Yes |
| capsem-cli | integration | Yes | Collect | Yes | Yes |
| capsem-gateway | gateway | Yes | Collect | Yes | Yes |
| capsem-e2e | e2e | Yes | Collect | No | Yes |
| capsem-session | session | Yes | Collect | No | Yes |
| capsem-session-lifecycle | session_lifecycle | Yes | Collect | No | Yes |
| capsem-session-exhaustive | session_exhaustive | Yes | Collect | No | Yes |
| capsem-security | security | Yes | Collect | No | Yes |
| capsem-isolation | isolation | Yes | Collect | No | Yes |
| capsem-snapshots | snapshot | Yes | Collect | No | Yes |
| capsem-config | config | Yes | Collect | No | Yes |
| capsem-config-runtime | config_runtime | Yes | Collect | No | Yes |
| capsem-guest | guest | Yes | Collect | No | Yes |
| capsem-cleanup | cleanup | Yes | Collect | No | Yes |
| capsem-stress | stress | Yes | Collect | No | Yes |
| capsem-recovery | recovery | Yes | Collect | No | Yes |
| capsem-serial | serial | Yes | Collect | No | Yes |
| capsem-lifecycle | integration | Yes | Collect | No | Yes |
| capsem-build-chain | build_chain | Yes | Collect | No | Yes |
| capsem-recipes | recipe | No | Run | No | Yes |
| capsem-install | install | No | Yes (Docker) | No | Yes |
"Run" = tests execute in CI. "Collect" = imports verified (--collect-only) but tests skip (need VM). "Yes (Docker)" = runs in dedicated Docker+systemd CI job.
## Coverage targets
| Component | Floor | Enforced | Where |
|---|---|---|---|
| Rust workspace | 70% | --fail-under-lines 70 | CI (cargo llvm-cov), just test |
| Python builder | 90% | --cov-fail-under=90 | CI (pytest), just test |
| capsem-service | 80% | Codecov component | codecov.yml |
| capsem-mcp | 80% | Codecov component | codecov.yml |
| capsem-gateway | 80% | Codecov component | codecov.yml |
| capsem (CLI) | 80% | Codecov component | codecov.yml |
### Coverage
- Rust: `cargo llvm-cov` via `just test` (floor: 70% line coverage)
- Python: `--cov-fail-under=90`

`codecov.yml` maps components to code paths. Update it when files or directories are added, moved, or renamed.
## Fast debug with capsem MCP tools
When the capsem MCP server is configured, Claude Code has direct VM control via MCP tools -- no shell commands or just recipes needed. This is the fastest way to test changes interactively because you stay in the conversation loop: create a VM, run commands, inspect results, fix code, repeat.
### The tools
| Tool | What it does |
|---|---|
| capsem_create | Spin up a fresh VM (returns VM id). Named VMs are persistent. |
| capsem_run | One-shot: boot temp VM, exec command, destroy, return output |
| capsem_exec | Run a command inside a running guest |
| capsem_stop | Stop VM (persistent: preserve state; ephemeral: destroy) |
| capsem_resume | Resume a stopped persistent VM |
| capsem_read_file | Read a file from the guest filesystem |
| capsem_write_file | Write a file into the guest |
| capsem_inspect_schema | Get session.db table schema |
| capsem_inspect | Run SQL against session.db (telemetry) |
| capsem_list | Show all VMs (running + stopped persistent) |
| capsem_info | VM details (config, status, persistent, PID) |
| capsem_delete | Destroy VM and wipe all state |
| capsem_persist | Convert running ephemeral VM to persistent |
| capsem_purge | Kill all temp VMs (all=true includes persistent) |
| capsem_fork | Fork a running/stopped VM into a reusable image |
| capsem_image_list | List all user images |
| capsem_image_inspect | Inspect a specific image's metadata |
| capsem_image_delete | Delete a user image |
### Debug workflow
Quick one-shot (no VM management): `capsem_run` with the command you want to test.
Iterative debugging (long-lived VM):
- Create: `capsem_create` -- boots a fresh VM in ~10s
- Test: `capsem_exec` with the command you want to verify (e.g., `capsem-doctor -k net`, `cat /etc/resolv.conf`, `curl https://example.com`)
- Inspect: `capsem_read_file` to check config files, logs; `capsem_inspect` to query telemetry tables
- Iterate: fix code on host, rebuild (`just build`), create a new VM to test again
- Cleanup: `capsem_delete` when done
### When to use MCP tools vs just recipes
| Scenario | Use |
|---|---|
| Quick check: "does this command work in the guest?" | capsem_run |
| Read a guest file to understand state | capsem_read_file |
| Verify telemetry was recorded correctly | capsem_inspect with SQL query |
| Full regression suite | just test |
| Build + boot + validate in one shot | just smoke |
| Benchmark performance | just bench |
MCP tools are for fast, targeted checks during development. Just recipes are for comprehensive validation before committing.
### Common debug queries
```sql
-- Check network events for a domain
SELECT * FROM net_events WHERE domain LIKE '%example%' ORDER BY timestamp DESC LIMIT 10;

-- Verify MCP tool calls were logged
SELECT server_name, tool_name, decision, duration_ms FROM mcp_calls ORDER BY timestamp DESC;

-- Check model API calls
SELECT provider, model, status_code, duration_ms FROM model_calls ORDER BY timestamp DESC;

-- File system events
SELECT operation, path, success FROM fs_events ORDER BY timestamp DESC LIMIT 20;
```
## End-to-end validation is not optional
After any change touching guest binaries, network policy, telemetry, MCP, or VM lifecycle:
just run "capsem-doctor"-- verifies sandbox integrity inside the VM- After telemetry/logging changes: run a real session and verify with
just inspect-sessionthat all 6 tables (net_events, model_calls, tool_calls, tool_responses, mcp_calls, fs_events) are populated correctly
## When tests fail
Never dismiss a test failure as "pre-existing" or "unrelated." Every failure must be investigated. Follow the dev-debugging workflow:
- Do not change the test to make it pass. The test is evidence. Changing the assertion to match broken behavior destroys that evidence.
- Reproduce and diagnose first. Understand why it fails before writing any fix. See the dev-debugging skill for the full methodology: reproduce with a test, diagnose root cause, then fix comprehensively.
- Fix the code, not the test. If the test is genuinely wrong (not the code), explain in detail why the test's expectation is incorrect before changing it.
## Platform gating tests
`cargo test --test platform_gating` scans all .rs files under `crates/` for macOS-only and Linux-only symbols (`libc::clonefile`, `AppleVzHypervisor`, `KvmHypervisor`, `FICLONE`, etc.) and verifies they appear inside `#[cfg(target_os = "...")]` blocks. This catches ungated platform APIs before they reach CI. Run this test when adding any platform-specific code.
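In spirit, the scan does something like the following grossly simplified Python sketch; the real test is a Rust integration test and, unlike this sketch, verifies the symbol actually sits inside the `#[cfg(...)]` block's scope:

```python
from pathlib import Path

MACOS_ONLY = ["clonefile", "AppleVzHypervisor"]
LINUX_ONLY = ["KvmHypervisor", "FICLONE"]

def test_platform_symbols_are_cfg_gated():
    for rs in Path("crates").rglob("*.rs"):
        text = rs.read_text()
        for sym in MACOS_ONLY + LINUX_ONLY:
            if sym in text:
                # Simplification: only checks the attribute exists somewhere
                # in the file, not that it encloses the symbol's use.
                assert '#[cfg(target_os' in text, f"{rs}: ungated {sym}"
```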
## Testable design
Extract logic into capsem-core -- never embed business logic in the app layer where it's coupled to Tauri. If you can't test something without booting a VM or launching the GUI, it belongs in core.