name: incident-analysis
description: Analyze and resolve production incidents using systematic investigation, root cause analysis, and autonomous remediation
Incident Analysis and Resolution Skill
This skill guides you through systematic incident response, from initial alert to resolution and documentation.
When to Use This Skill
Use this skill when:
- A production system is degraded or failing
- Users are reporting issues or errors
- Metrics show abnormal behavior
- You need to investigate and resolve an incident autonomously
Incident Response Workflow
MANDATORY FIRST STEP:
Before using any tools, use the Read tool to read:
.claude/skills/incident-analysis/phases/triage.md
This file contains Phase 1 instructions and will tell you which file to read next.
DO NOT proceed with tool calls until you've read Phase 1.
The complete workflow consists of 6 phases:
- Triage & Assessment (2-5 min) → phases/triage.md
- Investigation (5-10 min) → phases/investigation.md
- Root Cause Analysis (2-5 min) → phases/rca.md
- Remediation Planning (2-3 min) → phases/remediation.md
- Execution (2-5 min) → phases/execution.md
- Communication & Documentation (3-5 min) → phases/documentation.md
Each phase file contains a "Next Step" section that directs you to the next phase file.
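The phase chain can be pictured as an ordered table. The sketch below is illustrative only and is not part of the skill: the `Phase` type and the `next_phase` and `total_budget` helpers are hypothetical names, and the phase files themselves remain the authority on what to read next. It shows the six phases in order, how "Next Step" resolves, and the overall time budget implied by the per-phase estimates.

```python
from typing import List, NamedTuple, Optional, Tuple


class Phase(NamedTuple):
    name: str
    path: str         # relative to .claude/skills/incident-analysis/
    min_minutes: int  # lower bound of the stated time estimate
    max_minutes: int  # upper bound of the stated time estimate


# Order and estimates copied from the workflow list above.
PHASES: List[Phase] = [
    Phase("Triage & Assessment", "phases/triage.md", 2, 5),
    Phase("Investigation", "phases/investigation.md", 5, 10),
    Phase("Root Cause Analysis", "phases/rca.md", 2, 5),
    Phase("Remediation Planning", "phases/remediation.md", 2, 3),
    Phase("Execution", "phases/execution.md", 2, 5),
    Phase("Communication & Documentation", "phases/documentation.md", 3, 5),
]


def next_phase(current_path: str) -> Optional[Phase]:
    """Return the phase that follows current_path, or None after the last one."""
    paths = [p.path for p in PHASES]
    i = paths.index(current_path)  # raises ValueError for an unknown path
    return PHASES[i + 1] if i + 1 < len(PHASES) else None


def total_budget() -> Tuple[int, int]:
    """Sum the per-phase estimates into an overall (min, max) in minutes."""
    return (
        sum(p.min_minutes for p in PHASES),
        sum(p.max_minutes for p in PHASES),
    )
```

Summing the estimates this way puts a full run at roughly 16 to 33 minutes, which is a useful sanity check when an incident is running long.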
Available Tools
MCP servers provide the necessary tools:
- monitoring-analysis server - System metrics, log analysis, health checks
- workflow-orchestration server - Incident tickets, remediation, notifications
Tools are available throughout all phases as needed.
Key Principles
Progressive Disclosure: The phase files reveal detailed instructions progressively. Read each phase file in sequence - do not skip ahead or assume you know what to do.
Autonomous Execution: Make decisions based on evidence. Take action. Show reasoning.
TodoWrite: Create todos for all 6 phases at the start. Update each todo's status as you progress.
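As a rough picture of the todo bookkeeping, the sketch below models one todo per phase and advances them one at a time. The dictionary shape (`content`, `status`) and the status values are assumptions made for illustration, not the confirmed schema of the TodoWrite tool, and `initial_todos` and `start_next` are hypothetical helpers.

```python
from typing import Dict, List, Optional

PHASE_NAMES = [
    "Triage & Assessment",
    "Investigation",
    "Root Cause Analysis",
    "Remediation Planning",
    "Execution",
    "Communication & Documentation",
]


def initial_todos() -> List[Dict[str, str]]:
    """One pending todo per phase, created before any investigation starts."""
    # Assumed shape: the real TodoWrite payload may use different fields.
    return [{"content": name, "status": "pending"} for name in PHASE_NAMES]


def start_next(todos: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Complete the in-progress todo, then start the first pending one.

    Returns the todo that was started, or None when every phase is done.
    """
    for todo in todos:
        if todo["status"] == "in_progress":
            todo["status"] = "completed"
    for todo in todos:
        if todo["status"] == "pending":
            todo["status"] = "in_progress"
            return todo
    return None
```

Keeping exactly one todo in progress at a time mirrors the rule that phases are read and executed in sequence.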
Success Criteria
Incident is resolved when:
- ✅ Root cause identified with 90%+ confidence
- ✅ Remediation executed and verified
- ✅ Metrics returned to baseline
- ✅ No new errors occurring
- ✅ Team notified and incident documented
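The criteria above amount to a checklist in which every item must hold. A minimal sketch of that check follows, assuming a hypothetical `IncidentStatus` record whose field names are invented here for illustration. How each value is actually measured (for example, what counts as "baseline") is left to the phase files and the monitoring tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentStatus:
    root_cause_confidence: float  # 0.0 to 1.0, estimated during root cause analysis
    remediation_verified: bool    # remediation executed and confirmed effective
    metrics_at_baseline: bool     # monitored metrics are back to normal
    new_error_count: int          # errors observed since remediation
    team_notified: bool
    documented: bool


def unmet_criteria(status: IncidentStatus) -> List[str]:
    """Return the success criteria that are not yet satisfied."""
    unmet = []
    if status.root_cause_confidence < 0.90:
        unmet.append("root cause confidence below 90%")
    if not status.remediation_verified:
        unmet.append("remediation not verified")
    if not status.metrics_at_baseline:
        unmet.append("metrics not at baseline")
    if status.new_error_count > 0:
        unmet.append("new errors still occurring")
    if not (status.team_notified and status.documented):
        unmet.append("notification or documentation incomplete")
    return unmet


def is_resolved(status: IncidentStatus) -> bool:
    """An incident counts as resolved only when every criterion holds."""
    return not unmet_criteria(status)
```

Returning the list of unmet criteria, rather than a bare boolean, makes it easier to report what still blocks closure.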