name: "Incident Response" description: "Structured production incident triage, resolution, and post-mortem. Apply when production systems are down, degraded, or behaving unexpectedly. Covers detection, containment, resolution, and learning." allowed-tools: Read, Write, Edit, Bash, Grep, Glob version: 2.1.0 compatibility: Claude Opus 4.6, Sonnet 4.6, Claude Code v2.1.x updated: 2026-03-26

Incident Response

Structured workflow for production incidents: detect, contain, resolve, learn.

When to Use

Production service is down or degraded
Users reporting errors or data issues
Monitoring alerts firing
Security incident detected
Data integrity issue discovered

Severity Classification

Severity	Impact	Response Time	Examples
SEV-1	Full outage, data breach, revenue loss	Immediate	Service down, auth broken, data leak
SEV-2	Major degradation, subset of users affected	< 30 min	Slow responses, feature broken, partial outage
SEV-3	Minor impact, workaround available	< 4 hours	UI bug, non-critical feature down
SEV-4	No user impact, internal issue	Next business day	Log errors, monitoring gaps

Phase 1: Detection & Triage (First 5 Minutes)

## Incident Triage

**Time detected:** [timestamp]
**Reporter:** [who noticed]
**Severity:** SEV-[1-4]

### What's happening?
[One sentence description of symptoms]

### Who's affected?
[All users / subset / internal only]

### What changed recently?
[Deployments, config changes, traffic spikes]

Quick Diagnostic Commands

# Check recent deployments
git log --oneline -5 --since="2 hours ago"

# Check application logs
tail -100 /var/log/app/error.log | grep -i "error\|fatal\|panic"

# Check system resources
top -bn1 | head -20
df -h
free -h

# Check database connectivity
pg_isready -h $DB_HOST

# Check external service health
curl -sI https://api.stripe.com/v1 -o /dev/null -w "%{http_code}"

Phase 2: Containment (Minutes 5-15)

Goal: Stop the bleeding. Don't fix root cause yet.

Containment Action	When to Use
Rollback deployment	Problem started after a deploy
Feature flag disable	New feature causing issues
Scale up	Traffic/load related
Failover to backup	Primary system unrecoverable
Rate limit	Being overwhelmed by requests
Block bad actor	Malicious traffic identified

# Quick rollback (if deployment caused it)
git revert HEAD --no-edit && git push

# Or revert to last known good deployment
# (platform-specific: Vercel, Railway, etc.)

Phase 3: Resolution

Hypothesis-Driven Debugging

## Debugging Log

### Hypothesis 1: [Most likely cause]
Evidence for: [what supports this]
Evidence against: [what contradicts this]
Test: [how to verify]
Result: [confirmed / rejected]

### Hypothesis 2: [Second most likely]
...

Common Root Causes

Symptom	Common Cause	Quick Fix
500 errors spike	Bad deployment	Rollback
Slow responses	Database query regression	Kill slow queries, add index
Connection timeouts	Connection pool exhaustion	Restart, increase pool size
OOM crashes	Memory leak	Restart, set memory limits
Auth failures	Token/cert expiry	Rotate credentials
Data inconsistency	Race condition	Add locking, retry logic

Phase 4: Post-Mortem

Write within 48 hours of resolution. Blameless — focus on systems, not people.

# Post-Mortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** [start time] to [resolution time] ([X] minutes)
**Severity:** SEV-[X]
**Author:** [Name]

## Summary
[2-3 sentence summary of what happened and impact]

## Timeline
| Time | Event |
|------|-------|
| HH:MM | [First symptom / alert] |
| HH:MM | [Incident declared] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix deployed] |
| HH:MM | [Confirmed resolved] |

## Root Cause
[Technical explanation of what went wrong]

## Impact
- Users affected: [number or percentage]
- Duration: [minutes]
- Revenue impact: [if applicable]
- Data affected: [if applicable]

## What Went Well
- [Good thing 1]
- [Good thing 2]

## What Went Wrong
- [Process/system gap 1]
- [Process/system gap 2]

## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|:--------:|
| [Preventive measure] | [Name] | [Date] | P1 |
| [Detection improvement] | [Name] | [Date] | P2 |
| [Process improvement] | [Name] | [Date] | P3 |

## Lessons Learned
[What the team should take away from this]

Communication Template

## Status Update: [Incident Title]

**Status:** Investigating / Identified / Monitoring / Resolved
**Impact:** [Who/what is affected]
**Current action:** [What we're doing right now]
**ETA:** [When we expect resolution, if known]
**Next update:** [When the next status update will be]

ナビゲーション

Skillsとは？

リンク

Incident Response

Incident Response

When to Use

Severity Classification

Phase 1: Detection & Triage (First 5 Minutes)

Quick Diagnostic Commands

Phase 2: Containment (Minutes 5-15)

Phase 3: Resolution

Hypothesis-Driven Debugging

Common Root Causes

Phase 4: Post-Mortem

Communication Template

Sources

関連スキル(⚙️ DevOps)