name: incident-management
description: Handle production incidents effectively. Use when responding to outages, conducting post-mortems, or improving reliability. Covers incident response and blameless culture.
allowed-tools: Read, Write, Glob, Grep
Incident Management
Incident Severity
| Level | Impact | Response Time |
|-------|--------|---------------|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Low impact | Next business day |
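The severity levels and response times above can be encoded so that tooling computes a response deadline when an incident is opened. The following is a minimal sketch, not a prescribed implementation: the names `Severity`, `RESPONSE_TARGET`, and `response_deadline` are hypothetical, "Immediate" is modeled as a zero-length window, and "Next business day" is left unresolved because it depends on a business calendar this document does not define.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Severity levels and their impact descriptions, as listed above."""
    SEV1 = "Complete outage"
    SEV2 = "Major degradation"
    SEV3 = "Minor degradation"
    SEV4 = "Low impact"


# Response targets per level. Assumptions: SEV1 ("Immediate") is a zero
# window; SEV4 ("Next business day") is None because resolving it needs a
# business calendar that is outside the scope of this sketch.
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(0),
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: None,
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for SEV4."""
    target = RESPONSE_TARGET[severity]
    if target is None:
        return None
    return detected_at + target
```

For example, `response_deadline(Severity.SEV2, datetime(2024, 1, 1, 12, 0))` yields 12:15 on the same day, which an alerting tool could compare against the time of first acknowledgement.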
Incident Response
1. Detect
   - Monitoring alerts
   - Customer reports
   - Error logs
2. Triage
   - Assess severity
   - Assign incident commander
   - Create communication channel
3. Investigate
   - Check recent changes
   - Review logs and metrics
   - Identify root cause
4. Mitigate
   - Apply quick fix
   - Rollback if needed
   - Communicate status
5. Resolve
   - Confirm fix
   - Monitor for recurrence
   - Close incident
6. Learn
   - Post-mortem meeting
   - Document findings
   - Create action items
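The six phases above can be tracked as a simple state machine so that an incident record always shows where the response stands. This is a sketch under one loud assumption: phases advance strictly in order. Real incidents often loop back (for example, from Mitigate to Investigate when a quick fix fails), which this version does not model. The `Phase` and `Incident` names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    """The six response phases, numbered as in the list above."""
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    MITIGATE = 4
    RESOLVE = 5
    LEARN = 6


class Incident:
    """Tracks which phase an incident is in (assumes linear progression)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self) -> Phase:
        """Move to the next phase; raise if already in the final phase."""
        if self.phase is Phase.LEARN:
            raise ValueError("incident is already in the final phase")
        self.phase = Phase(self.phase.value + 1)
        self.history.append(self.phase)
        return self.phase
```

Keeping `history` alongside the current phase gives the post-mortem a ready-made skeleton for its timeline, provided the caller also records timestamps.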
Post-Mortem Template
# Post-Mortem: [Incident Title]
## Summary
[Brief description of what happened]
## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]
## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]
## Root Cause
[What caused this incident]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## What Went Well
- [Positive 1]
- [Positive 2]
## What Could Be Improved
- [Improvement 1]
- [Improvement 2]
## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
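
The Timeline and Impact sections of the template are related: the duration can be derived from the first and last timeline entries. The sketch below illustrates that, using hypothetical helpers `render_timeline` and `incident_duration_hours`. It assumes entries are `(HH:MM, event)` pairs in chronological order within a single calendar day; an incident that crosses midnight would need full timestamps.

```python
from datetime import datetime


def render_timeline(timeline):
    """Format (HH:MM, event) pairs as the template's timeline bullets."""
    return "\n".join(f"- {time} - {event}" for time, event in timeline)


def incident_duration_hours(timeline):
    """Hours between the first and last entries (assumes same-day times)."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    end = datetime.strptime(timeline[-1][0], fmt)
    return (end - start).total_seconds() / 3600


# Illustrative data only, not taken from a real incident.
timeline = [
    ("14:05", "Alert fired"),
    ("14:20", "Rollback started"),
    ("15:35", "Resolved"),
]
print(render_timeline(timeline))
print(f"- Duration: {incident_duration_hours(timeline)} hours")
```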