---
name: dev-benchmark
description: Capsem benchmarking with capsem-bench. Use when running benchmarks, adding new benchmark categories, interpreting results, or investigating performance regressions. Covers all 7 benchmark categories (disk, rootfs, startup, http, throughput, snapshot, all), the JSON output format, and how to add new benchmarks.
---
# Benchmarking

## Quick start

```bash
just bench                         # Run all benchmarks in VM (~2 min)
just run "capsem-bench snapshot"   # Snapshot benchmarks only
just run "capsem-bench disk"       # Disk I/O only
just test                          # Full validation including benchmarks
```

## capsem-bench
A Python tool that runs inside the VM. It prints Rich tables to stderr (human-readable) and writes structured JSON to `/tmp/capsem-benchmark.json` (machine-readable).

Location: `guest/artifacts/capsem_bench/` (Python package, invoked via the `capsem-bench` shell wrapper)
## Benchmark categories
| Category | Command | What it measures |
|---|---|---|
| disk | capsem-bench disk | Sequential/random I/O on scratch disk (write/read throughput, IOPS) |
| rootfs | capsem-bench rootfs | Read-only rootfs performance (sequential + random 4K reads) |
| startup | capsem-bench startup | Cold-start latency for python3, node, claude, gemini, codex |
| http | capsem-bench http [URL] [N] [C] | HTTP throughput through MITM proxy (requests/sec, latency percentiles) |
| throughput | capsem-bench throughput | 100MB download through MITM proxy (end-to-end MB/s) |
| snapshot | capsem-bench snapshot | Snapshot create/list/changes/revert/delete via MCP (ms per op at 10/100/500 files) |
| all | capsem-bench | All of the above |
## Snapshot benchmarks
Tests the full MCP snapshot pipeline end-to-end (guest CLI -> MCP server -> vsock -> host gateway -> filesystem). Measures at 3 workspace sizes (10, 100, 500 files):
- create: Populate workspace, create named snapshot via MCP
- list: List all snapshots with change diffs
- changes: List changed files since checkpoint
- revert: Revert a single file from snapshot
- delete: Delete the snapshot
Key metrics: per-operation latency in ms. Regressions in `create` usually mean the clone or hash stage got slower. Use `RUST_LOG=capsem=debug` to see the per-stage breakdown (`clone_ws_ms`, `clone_sys_ms`, `hash_ms`).
## JSON output format

```json
{
  "version": "0.3.0",
  "timestamp": 1711561234.5,
  "hostname": "capsem",
  "disk": { "seq_write_mbps": 450, ... },
  "rootfs": { ... },
  "startup": { "python3": { "min_ms": 45, "mean_ms": 48, "max_ms": 52 }, ... },
  "http": { "rps": 120, "p50_ms": 42, ... },
  "throughput": { "throughput_mbps": 85, ... },
  "snapshot": {
    "10_files": { "create_ms": 120, "list_ms": 50, ... },
    "100_files": { "create_ms": 250, ... },
    "500_files": { "create_ms": 800, ... }
  }
}
```
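If you need the numbers outside the Rich tables, the report can be consumed directly. A minimal sketch, assuming only the layout shown above (sections exist only for the categories that were actually run); the script itself is hypothetical and not part of capsem-bench:

```python
"""Summarize /tmp/capsem-benchmark.json -- a minimal sketch.

Assumes the report layout documented above; only numeric metrics are
printed, as dotted names like snapshot.10_files.create_ms.
"""
import json
from pathlib import Path

REPORT = Path("/tmp/capsem-benchmark.json")  # default output path


def numeric_leaves(obj, prefix=""):
    """Yield (dotted_name, value) pairs for every numeric leaf in the report."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from numeric_leaves(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        yield prefix, obj


def main() -> None:
    report = json.loads(REPORT.read_text())
    print(f"capsem-bench {report.get('version')} on {report.get('hostname')}")
    for name, value in sorted(numeric_leaves(report)):
        if name != "timestamp":
            print(f"  {name:45s} {value}")


if __name__ == "__main__":
    main()
```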
## Environment variables

- `CAPSEM_BENCH_DIR`: Test directory for disk benchmarks (default: `/root`)
- `CAPSEM_BENCH_SIZE_MB`: Write test size in MB (default: 256)
## Investigating slowness

### Snapshot performance

- Run snapshot benchmark: `just run "capsem-bench snapshot"`
- Check per-stage timing: `RUST_LOG=capsem=debug just run "capsem-bench snapshot"` -- look for `snapshot_into_slot timing` log lines showing `clone_ws_ms`, `clone_sys_ms`, `hash_ms`
- Check session data: `just inspect-session` -- the MCP tool usage section shows avg duration per snapshot operation
- Query detailed durations: `just query-session "SELECT tool_name, duration_ms FROM mcp_calls WHERE tool_name LIKE 'snapshot%' ORDER BY duration_ms DESC LIMIT 20"`
Common causes:
- `clone_ws_ms` high: Large workspace, or APFS clonefile falling back to a byte copy
- `hash_ms` high: Many files in the workspace (walkdir overhead), or a slow filesystem
- `compact` slow: Merging many snapshots with overlapping files
### Disk I/O regression

- Run: `just run "capsem-bench disk"`
- Compare sequential write/read throughput against a known-good baseline (see the comparison sketch after this list)
- Check if VirtioFS mode changed (block mode has different I/O characteristics)
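A minimal comparison sketch, assuming a known-good report was saved earlier as `disk-baseline.json` (hypothetical path) and that the `disk` metrics are higher-is-better (MB/s, IOPS):

```python
"""Flag disk-benchmark regressions against a saved baseline -- a sketch.

Assumes a previously captured report saved as disk-baseline.json
(hypothetical path) and higher-is-better disk metrics.
"""
import json
from pathlib import Path

CURRENT = Path("/tmp/capsem-benchmark.json")
BASELINE = Path("disk-baseline.json")  # hypothetical: a previously saved report
THRESHOLD = 0.20                       # flag drops larger than 20%


def main() -> None:
    current = json.loads(CURRENT.read_text()).get("disk", {})
    baseline = json.loads(BASELINE.read_text()).get("disk", {})
    for metric in sorted(baseline):
        old, new = baseline[metric], current.get(metric)
        if not isinstance(old, (int, float)) or not isinstance(new, (int, float)) or not old:
            continue
        change = (new - old) / old
        status = "REGRESSION" if change < -THRESHOLD else "ok"
        print(f"{metric:25s} {old:10.1f} -> {new:10.1f}  ({change:+7.1%})  {status}")


if __name__ == "__main__":
    main()
```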
## Adding a new benchmark

- Create a new module in `guest/artifacts/capsem_bench/` (e.g., `mytest.py`) with a `mytest_bench()` function that returns a dict and prints a Rich table (see the module sketch after this list)
- Add the mode name to `VALID_MODES` in `__main__.py`
- Wire it into `main()` with the `if mode in ("name", "all"):` pattern (lazy import)
- Update this skill and the benchmarking doc page
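A hypothetical module sketch of that pattern; `mytest`, `mytest_bench()`, and the metric name are made up, and the wiring comments assume the lazy-import dispatch described above:

```python
"""guest/artifacts/capsem_bench/mytest.py -- hypothetical example module.

Sketch of the pattern described above: human-readable Rich table to
stderr, plain dict returned for the JSON report. Names are made up.
"""
import sys
import time

from rich.console import Console
from rich.table import Table


def mytest_bench() -> dict:
    start = time.monotonic()
    # ... perform the work being measured here ...
    elapsed_ms = (time.monotonic() - start) * 1000

    results = {"mytest_ms": round(elapsed_ms, 2)}

    # Human-readable output goes to stderr so stdout stays clean.
    table = Table(title="mytest benchmark")
    table.add_column("Metric")
    table.add_column("Value", justify="right")
    for name, value in results.items():
        table.add_row(name, str(value))
    Console(file=sys.stderr).print(table)

    return results


# Wiring sketch for __main__.py (variable names are assumptions):
#   1. Add "mytest" to VALID_MODES.
#   2. Inside main():
#        if mode in ("mytest", "all"):
#            from .mytest import mytest_bench   # lazy import
#            results["mytest"] = mytest_bench()
```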
## Host-side lifecycle benchmark

Profiles individual VM lifecycle operations from the host. Runs outside the guest via pytest, not via capsem-bench.

```bash
uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py -xvs
```

Location: `tests/capsem-serial/test_lifecycle_benchmark.py`
### Operations measured
| Operation | What it times |
|---|---|
| provision | HTTP POST /provision to service (VM creation + process spawn) |
| exec_ready | First echo ready exec succeeds (VM boot + vsock handshake) |
| exec | Simple echo ok on a running VM |
| delete | HTTP DELETE /delete/{name} (VM teardown + cleanup) |
### Output

- Per-run breakdown printed to stdout
- Summary table with min/mean/max per operation
- JSON saved to `benchmarks/lifecycle/data_{version}.json` (committed to git for historical tracking)
### Regression gates
Every operation must complete in under 1.2 seconds. The test runs 3 cycles and asserts each individual operation stays under the gate.
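A sketch of that gate pattern only; `measure_operation` is a hypothetical stand-in for the real timing code, and the actual assertions live in `tests/capsem-serial/test_lifecycle_benchmark.py`:

```python
"""Gate-pattern sketch -- not the real test.

Pattern: run N cycles and assert that every individual sample of every
operation stays under the gate.
"""
import time

GATE_SECONDS = 1.2
CYCLES = 3


def measure_operation(name: str) -> float:
    """Hypothetical: time one lifecycle operation against the service."""
    start = time.monotonic()
    # ... call the service endpoint for `name` here ...
    return time.monotonic() - start


def test_lifecycle_gate():
    for cycle in range(CYCLES):
        for op in ("provision", "exec_ready", "exec", "delete"):
            elapsed = measure_operation(op)
            assert elapsed < GATE_SECONDS, (
                f"{op} took {elapsed:.2f}s in cycle {cycle} (gate {GATE_SECONDS}s)"
            )
```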
## Host-side fork benchmark

Profiles fork (image creation) and boot-from-image. Same test file, separate test function.

```bash
uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py::test_fork_benchmark -xvs
```
### Operations measured
| Metric | What it measures | Gate |
|---|---|---|
| fork | POST /fork/{id} — APFS clonefile of rootfs overlay + workspace | < 500ms |
| image_size | Actual disk usage of forked image (blocks, not logical size) | < 12MB |
| boot_provision | POST /provision with image param — clone image into new session | < 1200ms |
| boot_ready | First exec succeeds on the image-booted VM | < 1200ms |
| pkg_survived | Packages installed via apt survive fork (rootfs overlay) | must pass |
| ws_survived | Files written to /root/ survive fork (VirtioFS workspace) | must pass |
### Output

- Per-run breakdown with timing + survival status
- Summary table with min/mean/max + gate thresholds
- JSON saved to `benchmarks/fork/data_{version}.json` (committed to git for historical tracking)
### When to run (fork)

- After changes to fork/image code (`capsem-core/src/image.rs`)
- After changes to the VirtioFS session layout (`capsem-core/src/lib.rs`)
- After changes to disk usage reporting (`session/maintenance.rs`)
- After changes to the boot-from-image path in `capsem-service` or `capsem-process`
- Before cutting a release
### When to run (lifecycle)

- After changes to the boot path (`capsem-process`, `capsem-init`, `capsem-core/vm/boot.rs`)
- After changes to the VM teardown / delete path
- After changes to the service daemon (`capsem-service`)
- Before cutting a release
## Tests

- In-VM benchmark test: `just run "capsem-bench all"`
- In-VM availability: `test_utilities.py::test_utility_available[capsem-bench]`
- Host-side lifecycle: `uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py::test_lifecycle_benchmark -xvs`
- Host-side fork: `uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py::test_fork_benchmark -xvs`
- Both host-side: `uv run pytest tests/capsem-serial/test_lifecycle_benchmark.py -xvs`
- Full run: `just bench` or `just test`
## Benchmark data directory

Host-side benchmarks save versioned JSON to `benchmarks/` (committed to git):

```
benchmarks/
  fork/data_0.16.1.json       # Fork speed, image size, data survival
  lifecycle/data_0.16.1.json  # Provision, exec-ready, exec, delete
```
These data files feed the documentation benchmark page at `docs/src/content/docs/benchmarks/results.md`. Before a release, run both benchmarks and update the results page with the new numbers. See /release-process for the full checklist.
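To eyeball release-to-release drift before updating the results page, you can diff the previous version's data file against the new one. A sketch, assuming only that the files hold numeric metrics in (possibly nested) JSON; the exact layout of these files and the earlier version number are not specified here:

```python
"""Diff two versioned benchmark data files -- a sketch.

Assumes numeric metrics in (possibly nested) JSON; walks the numeric
leaves both files share and prints the change.
"""
import json
import sys
from pathlib import Path


def leaves(obj, prefix=""):
    """Yield (dotted_name, value) pairs for every numeric leaf."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaves(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        yield prefix, obj


def main(old_path: str, new_path: str) -> None:
    old = dict(leaves(json.loads(Path(old_path).read_text())))
    new = dict(leaves(json.loads(Path(new_path).read_text())))
    for key in sorted(old.keys() & new.keys()):
        delta = new[key] - old[key]
        print(f"{key:40s} {old[key]:>10.2f} -> {new[key]:>10.2f}  ({delta:+.2f})")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Example invocation (the earlier `data_0.16.0.json` file name is hypothetical): `python diff_bench.py benchmarks/lifecycle/data_0.16.0.json benchmarks/lifecycle/data_0.16.1.json`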