name: error-analysis description: This skill should be used when analyzing errors, stack traces, and logs to identify root causes and implement fixes.
Error Analysis Skill
Systematically analyze errors and logs to find root causes.
When to Use
- Debugging production errors
- Analyzing stack traces
- Investigating log patterns
- Performing root cause analysis
- Categorizing error types
Reference Documents
- Log Patterns - Common error patterns by type
- Root Cause Analysis - RCA techniques
- Error Categorization - Classifying errors
- Fix Patterns - Common fixes by error type
Analysis Workflow
1. Gather Information
## Error Report
### Error Message
TypeError: Cannot read property 'id' of undefined at UserService.getUser (src/services/user.ts:45:23) at async Router.handle (src/api/routes.ts:67:12)
### Context
- **Environment:** Production
- **Time:** 2024-01-15 14:30 UTC
- **Frequency:** 15 occurrences in last hour
- **Affected Users:** ~5% of requests
- **Recent Changes:** Deploy at 14:00 UTC
2. Categorize Error
## Error Classification
**Type:** Runtime Error
**Category:** Null/Undefined Reference
**Severity:** High (user-facing)
### Common Causes
1. Missing data validation
2. Race condition
3. API contract change
4. Data migration issue
3. Root Cause Analysis
## Root Cause Investigation
### Timeline
- 14:00 - Deploy to production
- 14:25 - First error reported
- 14:30 - Error rate increased
### Hypothesis 1: Deploy introduced bug
- Check: git diff between versions
- Result: New code path added
### Hypothesis 2: Data issue
- Check: Query for affected users
- Result: All have specific condition
### Root Cause
Deploy introduced code that assumes `user.profile` exists,
but 5% of users don't have profiles (legacy accounts).
4. Implement Fix
## Fix Implementation
### Short-term (Hotfix)
```typescript
// Add null check
const userId = user?.profile?.id ?? user.id;
Long-term
- Add data validation at API boundary
- Migrate legacy accounts
- Add regression test
## Error Categories
### By Origin
| Category | Description | Example |
|----------|-------------|---------|
| Client | User/browser errors | Invalid input |
| Server | Application errors | Null reference |
| Infrastructure | System errors | Connection timeout |
| External | Third-party errors | API rate limit |
### By Severity
| Level | Impact | Response |
|-------|--------|----------|
| Critical | System down | Immediate |
| High | Major feature broken | Hours |
| Medium | Degraded experience | This sprint |
| Low | Minor inconvenience | Backlog |
## Stack Trace Analysis
### Reading Stack Traces
```markdown
## Stack Trace Components
Error: Connection refused ← Error type and message at Database.connect ← Where error was thrown (src/db/connection.ts:23) ← File and line at async initialize ← Call chain (src/server.ts:45) at async main ← Entry point (src/index.ts:12)
### Key Questions
1. Where was the error thrown? (top of stack)
2. What called that code? (stack trace)
3. What was the input/state? (logs)
4. What changed recently? (git history)
Common Patterns
## Null Reference
TypeError: Cannot read property 'x' of undefined
**Check:** Variable existence, API response shape
## Connection Error
Error: ECONNREFUSED 127.0.0.1:5432
**Check:** Service running, network config, credentials
## Timeout
Error: Operation timed out after 30000ms
**Check:** Service health, query performance, network
## Memory
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed
**Check:** Memory leaks, large data processing
Log Analysis
Correlation
## Correlating Logs
### Using Request ID
```bash
grep "req-12345" application.log
Timeline Reconstruction
14:30:00.123 [req-12345] User login started
14:30:00.456 [req-12345] Fetching user profile
14:30:00.789 [req-12345] ERROR: Profile not found
14:30:00.790 [req-12345] TypeError: Cannot read 'id'
Pattern Identification
# Count error types
grep "ERROR" app.log | cut -d: -f2 | sort | uniq -c | sort -rn
# Find error spikes
grep "ERROR" app.log | cut -d' ' -f1 | uniq -c
## Resolution Tracking
### Fix Verification
```markdown
## Fix Verification Checklist
- [ ] Error no longer reproducible locally
- [ ] Unit test added for fix
- [ ] Integration test added
- [ ] Deployed to staging
- [ ] Verified in staging
- [ ] Deployed to production
- [ ] Monitoring error rate
- [ ] Error rate returned to baseline
Post-mortem Template
# Incident Post-mortem: [Title]
## Summary
Brief description of what happened.
## Timeline
- HH:MM - Event
- HH:MM - Detection
- HH:MM - Resolution
## Impact
- Users affected: X
- Duration: Y hours
- Revenue impact: $Z
## Root Cause
Detailed explanation.
## Resolution
What was done to fix it.
## Lessons Learned
1. What went well
2. What went poorly
3. Where we got lucky
## Action Items
- [ ] Prevent: [Action]
- [ ] Detect: [Action]
- [ ] Respond: [Action]