name: chaos-engineering
description: Test system resilience through controlled failures. Use when validating fault tolerance, disaster recovery, or system reliability. Covers chaos experiments.
allowed-tools: Read, Write, Bash, Glob, Grep
Chaos Engineering
Principles
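Build a Hypothesis: State the expected steady-state behavior before injecting any failure
Minimize Blast Radius: Start with a single host or a small share of traffic, then widen the scope
Run in Production: Real traffic and real dependencies expose failures that test environments hide
Automate: Script experiments so they are repeatable and can run on a schedule
Minimize Impact: Define abort conditions and a rollback step before the experiment starts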
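The automation and abort-condition principles can be combined in a small wrapper that starts a fault, polls a health check, and stops the fault as soon as the check fails. The sketch below is illustrative only: `run_with_abort` and its arguments are hypothetical names, not part of any chaos tool, and the health check is whatever command suits the system under test.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a fault command under an abort condition.
# Usage: run_with_abort <max_seconds> <health_cmd> <fault_cmd> [args...]
# Prints "completed", "aborted", or "timeout" and returns 0, 1, or 2.
run_with_abort() {
  local max_seconds=$1 health_cmd=$2
  shift 2

  "$@" &                      # start the fault in the background
  local fault_pid=$!
  local elapsed=0

  while kill -0 "$fault_pid" 2>/dev/null; do
    # Abort condition: the health check fails while the fault is active.
    if ! eval "$health_cmd" >/dev/null 2>&1; then
      kill "$fault_pid" 2>/dev/null || true
      wait "$fault_pid" 2>/dev/null || true
      echo "aborted"
      return 1
    fi
    # Hard time limit so a forgotten experiment cannot run indefinitely.
    if [ "$elapsed" -ge "$max_seconds" ]; then
      kill "$fault_pid" 2>/dev/null || true
      wait "$fault_pid" 2>/dev/null || true
      echo "timeout"
      return 2
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done

  # The fault command finished on its own.
  wait "$fault_pid" 2>/dev/null || true
  echo "completed"
  return 0
}
```

A call might look like `run_with_abort 60 'curl -fsS http://localhost:8080/health' stress --cpu 4 --timeout 60s`, where the health URL is a placeholder. The helper only signals the process it started, so faults that change system state (for example a `tc` rule) still need their own rollback.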
Experiment Process
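Steady State: Measure the metrics that define normal behavior (for example latency, error rate, throughput)
Hypothesis: "System will maintain X under condition Y"
Introduce Variables: Inject one failure at a time
Observe: Compare the same metrics to the steady state
Analyze: Confirm or disprove the hypothesis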
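The five steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: `within_tolerance` and `run_experiment` are hypothetical helpers, the metric is a single integer (for example p99 latency in milliseconds), and the measure, inject, and rollback commands are supplied by the caller.

```shell
#!/usr/bin/env bash
# within_tolerance <baseline> <observed> <tolerance_pct>
# Succeeds when observed deviates from baseline by at most tolerance_pct percent.
# Integer metrics only (assumption made to keep the arithmetic in the shell).
within_tolerance() {
  local baseline=$1 observed=$2 tolerance_pct=$3
  local diff=$((observed - baseline))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  [ $((diff * 100)) -le $((baseline * tolerance_pct)) ]
}

# run_experiment <measure_cmd> <inject_cmd> <rollback_cmd> <tolerance_pct>
# Returns 0 if the hypothesis held, 1 if it was disproved.
run_experiment() {
  local measure_cmd=$1 inject_cmd=$2 rollback_cmd=$3 tolerance_pct=$4
  local baseline observed

  baseline=$(eval "$measure_cmd")       # 1. steady state
  echo "hypothesis: metric stays within ${tolerance_pct}% of ${baseline}"  # 2. hypothesis
  eval "$inject_cmd"                    # 3. introduce the variable
  observed=$(eval "$measure_cmd")       # 4. observe
  eval "$rollback_cmd"                  # undo the fault before analyzing

  # 5. analyze: compare the observation to the steady state
  if within_tolerance "$baseline" "$observed" "$tolerance_pct"; then
    echo "confirmed: observed ${observed}"
  else
    echo "disproved: observed ${observed}"
    return 1
  fi
}
```

A real measurement usually needs time to settle after the fault is injected, so in practice the inject command would include a wait, or the measure command would average over a window.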
Common Experiments
Network Failures
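# Add 100 ms of latency to outgoing traffic on eth0 (requires root)
tc qdisc add dev eth0 root netem delay 100ms
# Switch to 10% packet loss; use "replace" because "add" fails when a root qdisc already exists
tc qdisc replace dev eth0 root netem loss 10%
# Inspect the active qdisc
tc qdisc show dev eth0
# Remove the netem qdisc and restore the default
tc qdisc del dev eth0 root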
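Because a forgotten netem rule keeps degrading traffic until it is removed, it can help to wrap injection and removal in one function. The following is a sketch, not an established tool: `run`, `inject_netem`, and the `DRY_RUN` variable are hypothetical names, and the default mode only prints the commands so the script can be reviewed before it touches an interface.

```shell
#!/usr/bin/env bash
# Set DRY_RUN=0 to execute commands; the default (1) only prints them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# inject_netem <iface> <duration_seconds> <netem args...>
# Applies a netem rule, holds it for the given duration, then removes it.
inject_netem() {
  local iface=$1 duration=$2
  shift 2
  # Roll back if the script is interrupted mid-experiment.
  trap "run tc qdisc del dev $iface root; exit 130" INT TERM
  # "replace" works whether or not a root qdisc is already present.
  run tc qdisc replace dev "$iface" root netem "$@"
  run sleep "$duration"
  run tc qdisc del dev "$iface" root
  trap - INT TERM
}
```

For example, `inject_netem eth0 60 delay 100ms loss 10%` prints the three commands it would run. With `DRY_RUN=0` and root privileges it applies both latency and loss for 60 seconds and then restores the interface.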
Resource Exhaustion
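# CPU stress: 4 busy workers for 60 seconds
stress --cpu 4 --timeout 60s
# Memory stress: 2 workers allocating 1G each (2G total) for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk fill: write a 1 GiB file of zeros (if /tmp is a tmpfs, this consumes memory instead of disk)
dd if=/dev/zero of=/tmp/fill bs=1M count=1024
# Remove the file to end the disk experiment
rm /tmp/fill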
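A disk-fill experiment is easier to keep within a small blast radius if it refuses to run when too little space would remain. The sketch below assumes GNU `dd` (for `bs=1M`) and POSIX `df -Pk` output; `fill_disk`, `cleanup_fill`, and the `chaos-fill` file name are hypothetical.

```shell
#!/usr/bin/env bash
# fill_disk <dir> <size_mb> <min_free_mb>
# Writes size_mb MiB of zeros to <dir>/chaos-fill, unless doing so would
# leave less than min_free_mb MiB available on that filesystem.
fill_disk() {
  local dir=$1 size_mb=$2 min_free_mb=$3
  local free_mb
  # df -Pk reports available space in 1 KiB blocks in column 4 of line 2.
  free_mb=$(df -Pk "$dir" | awk 'NR==2 {print int($4 / 1024)}')
  if [ $((free_mb - size_mb)) -lt "$min_free_mb" ]; then
    echo "refusing: fill would leave less than ${min_free_mb} MB free" >&2
    return 1
  fi
  dd if=/dev/zero of="$dir/chaos-fill" bs=1M count="$size_mb" 2>/dev/null
}

# cleanup_fill <dir>
# Removes the fill file; safe to call even if the file does not exist.
cleanup_fill() {
  rm -f "$1/chaos-fill"
}
```

For instance, `fill_disk /tmp 1024 2048` writes the same 1 GiB as the command above, but only when at least 2 GiB would remain free afterwards, and `cleanup_fill /tmp` ends the experiment.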