---
name: investigate-incident
description: Investigate platform incidents, perform RCA, create incident documentation, and follow alert runbooks in the Kagenti platform
---
# Investigate Incident Skill
This skill helps you investigate incidents, perform root cause analysis (RCA), and create comprehensive incident documentation.
## When to Use
- After alerts fire (check runbooks first)
- When tests fail unexpectedly
- User reports service unavailable
- Pods in CrashLoopBackOff or ImagePullBackOff
- Services showing degraded performance
- After deployment issues
## What This Skill Does
- **Follow Runbooks**: Execute alert-specific investigation steps
- **Gather Evidence**: Collect logs, metrics, events, and pod status
- **Root Cause Analysis**: Identify underlying issues
- **Document Incidents**: Create structured RCA in TODO_INCIDENTS.md
- **Track Fixes**: Plan and validate remediation
## Investigation Workflow
### 1. Check If Alert Has Runbook
```bash
# List all available runbooks
ls -1 docs/runbooks/alerts/*.md

# For a specific alert, check its annotation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print('Runbook URL:', rule['annotations'].get('runbook_url', 'N/A'))
"
```
### 2. Follow Runbook Steps
Runbooks are located at: `docs/runbooks/alerts/<alert-uid>.md`

Standard runbook sections:
- **Meaning** - What this alert indicates
- **Impact** - Business/platform impact
- **Diagnosis** - Investigation commands
- **Mitigation** - How to fix
Example: follow the Prometheus Down runbook:
```bash
# From docs/runbooks/alerts/prometheus-down.md

# 1. Check pod status
kubectl get pods -n observability -l app=prometheus

# 2. Check pod logs
kubectl logs -n observability deployment/prometheus --tail=100

# 3. Check events
kubectl get events -n observability --field-selector involvedObject.name=prometheus --sort-by='.lastTimestamp'

# 4. Test the Prometheus endpoint
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready
```
### 3. Gather Comprehensive Evidence
**Pod Status & Events**:
```bash
# Get pod status in the namespace
kubectl get pods -n <namespace>

# Detailed pod description
kubectl describe pod <pod-name> -n <namespace>

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# All failing pods across the platform
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull|Pending"
```
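To avoid losing evidence when pods restart, it can help to snapshot everything into a timestamped directory up front. This is a sketch; the `incident-evidence/` output path is an assumption, adjust it to wherever you keep investigation artifacts:

```bash
# Capture a point-in-time evidence bundle before anything is restarted or deleted
NS="<namespace>"
OUT="incident-evidence/$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$OUT"
kubectl get pods -n "$NS" -o wide > "$OUT/pods.txt"
kubectl get events -n "$NS" --sort-by='.lastTimestamp' > "$OUT/events.txt"
for pod in $(kubectl get pods -n "$NS" -o name); do
  kubectl describe "$pod" -n "$NS" > "$OUT/describe-${pod#pod/}.txt"
  kubectl logs "$pod" -n "$NS" --all-containers=true --tail=500 \
    > "$OUT/logs-${pod#pod/}.txt" 2>&1 || true
done
echo "Evidence saved to $OUT"
```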
**Logs**:
```bash
# Current container logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Previous container (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# Specific container in a pod
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers in a pod
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Query Loki for errors (last 5 minutes)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="<namespace>"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -u -v-5M +%s)000000000" \
  --data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool
```
**Metrics**:
```bash
# Check if the service is up
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up{job="<job-name>"}' | python3 -m json.tool

# Check replica availability
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="<name>"}' \
  | python3 -m json.tool

# Check pod restarts
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{pod=~"<pod-pattern>"}' \
  | python3 -m json.tool
```
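Since every metrics check goes through the same Grafana pod, a small shell helper keeps queries readable. A sketch only; the `promq` name is a convenience for this session, not an existing platform script:

```bash
# Hypothetical helper: run an instant PromQL query via the Grafana pod
promq() {
  kubectl exec -n observability deployment/grafana -- \
    curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
    --data-urlencode "query=$1" | python3 -m json.tool
}

# Usage
promq 'up{job="<job-name>"}'
promq 'kube_deployment_status_replicas_available{deployment="<name>"}'
```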
**ArgoCD Application Status**:
```bash
# Check application health
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check sync status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -E "Degraded|OutOfSync"

# View recent sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
```
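For a machine-readable list of unhealthy applications, `argocd app list` also supports JSON output. A sketch; the field paths follow the standard Application status schema:

```bash
# List applications that are not Healthy or not Synced
argocd app list -o json --port-forward --port-forward-namespace argocd --grpc-web \
  | python3 -c "
import sys, json
apps = json.load(sys.stdin)
for app in apps:
    health = app['status']['health']['status']
    sync = app['status']['sync']['status']
    if health != 'Healthy' or sync != 'Synced':
        print(f\"{app['metadata']['name']}: health={health} sync={sync}\")
"
```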
### 4. Identify Root Cause
**Common Root Causes**:

- **Configuration Error**
  - Check recent Git commits: `git log --oneline -10`
  - Check ArgoCD diff: `argocd app diff <app-name> --port-forward ...`
  - Validate YAML: `kustomize build components/...`
- **Image Issues**
  - Check image availability: `docker exec kagenti-demo-control-plane crictl images | grep <image>`
  - Check ImagePullBackOff: `kubectl describe pod <pod-name> | grep -A10 "Events"`
- **Resource Constraints**
  - Check pod resources: `kubectl top pods -n <namespace>`
  - Check node resources: `kubectl top nodes`
  - Check OOM kills: `kubectl get events -A | grep OOM`
- **Dependency Failure**
  - Check if the dependent service is healthy
  - Check service endpoints: `kubectl get endpoints -n <namespace>`
  - Test connectivity: `kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- curl http://service-name`
- **mTLS/Network Issues**
  - Check Istio sidecar: `kubectl get pods -n <namespace>` (pods should show 2/2)
  - Check PeerAuthentication: `kubectl get peerauthentication -A`
  - Check sidecar logs: `kubectl logs -n <namespace> <pod-name> -c istio-proxy`
- **Certificate Issues**
  - Check certificate status: `kubectl get certificate -A`
  - Check cert-manager logs: `kubectl logs -n cert-manager deployment/cert-manager`
### 5. Create Incident Documentation
Create an entry in `TODO_INCIDENTS.md`:
## Incident #X: [Alert Name] - [Brief Description]
**Status**: 🔴 Active / 🟡 Investigating / 🟢 Resolved
**Detected**: 2025-11-17 08:31:40 UTC
**Severity**: Critical / Warning / Info
**Components Affected**:
- Component 1
- Component 2
### Summary
Brief description of what happened.
### Investigation
**Timeline**:
- 08:31 - Alert fired
- 08:35 - Checked pod status, found CrashLoopBackOff
- 08:40 - Reviewed logs, identified error message
- 08:45 - Identified root cause
**Evidence Collected**:

1. **Pod Status**:
   ```
   NAME            READY   STATUS             RESTARTS
   component-xxx   0/2     CrashLoopBackOff   5
   ```
2. **Error Logs**:
   ```
   ERROR: Failed to connect to database: connection refused
   ```
3. **Events**:
   ```
   Back-off restarting failed container
   ```
### Root Cause Analysis
**Root Cause**: [Specific technical reason]
**Why it happened**:
- Contributing factor 1
- Contributing factor 2
**Why alert fired**:
- PromQL query: `query_here`
- Query returned: `value`
- Threshold: `> threshold`
### Resolution
**Fix Applied**:
```bash
# Commands to fix the issue
kubectl apply -f ...
```

**Verification**:
```bash
# Commands to verify the fix
kubectl get pods -n <namespace>
# Output showing healthy state
```

**Time to Resolution**: XX minutes

### Lessons Learned
- **What went well**: Early detection via alerting
- **What could improve**: Need better validation before deploy
- **Action items**:
  - Add pre-deployment validation
  - Update runbook with this scenario
  - Add integration test for this case

### Related
- Alert: `alert-uid`
- Runbook: `docs/runbooks/alerts/alert-uid.md`
- Git commits: abc1234, def5678
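To keep entries consistent, a small helper can append a pre-filled skeleton to TODO_INCIDENTS.md. This is a sketch; the incident number, alert name, and severity are placeholders you fill in during the investigation:

```bash
# Hypothetical helper: scaffold a new incident entry with the detection time pre-filled
cat >> TODO_INCIDENTS.md <<EOF

## Incident #X: [Alert Name] - [Brief Description]

**Status**: 🔴 Active
**Detected**: $(date -u '+%Y-%m-%d %H:%M:%S UTC')
**Severity**: TBD
**Components Affected**:
- TBD

### Summary
TBD

### Investigation
**Timeline**:
- $(date -u '+%H:%M') - Alert fired
EOF
```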
### 6. Validate Fix
**After applying fix**:
```bash
# 1. Verify pods are healthy
kubectl get pods -n <namespace>
# 2. Check alert stopped firing
kubectl exec -n observability deployment/grafana -- \
curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
-u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
for a in firing:
    print(f\"{a['labels']['alertname']}: {a['labels']['severity']}\")
"
# 3. Run integration tests
pytest tests/integration/test_<component>.py -v
# 4. Check platform status
./scripts/platform-status.sh
# 5. Verify in Grafana UI
open https://grafana.localtest.me:9443/alerting/list
```
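When waiting for a rollout to settle after the fix, `kubectl rollout status` and `kubectl wait` give a bounded, scriptable check. A sketch; the label selector and timeouts are assumptions to adapt per component:

```bash
# Block until the fixed deployment is rolled out and its pods report Ready
kubectl rollout status deployment/<name> -n <namespace> --timeout=180s
kubectl wait --for=condition=Ready pod -l app=<name> -n <namespace> --timeout=120s
```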
## Common Investigation Patterns
### Pattern 1: Pod Won't Start
```bash
# Investigation flow
kubectl get pods -n <namespace>                  # Get pod status
kubectl describe pod <pod-name> -n <namespace>   # Check events
kubectl logs <pod-name> -n <namespace>           # Check logs (if available)

# Common causes:
# - ImagePullBackOff: Image not found or not loaded
# - CrashLoopBackOff: Application exits on startup
# - Pending: Resource constraints or scheduling issues
# - Init:Error: Init container failed
```
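The exact waiting reason (ImagePullBackOff, CrashLoopBackOff, etc.) can also be pulled straight from the pod status without scrolling through `describe` output. A minimal jsonpath sketch:

```bash
# Print each container's waiting reason and message, if any
kubectl get pod <pod-name> -n <namespace> -o jsonpath=\
'{range .status.containerStatuses[*]}{.name}{": "}{.state.waiting.reason}{" - "}{.state.waiting.message}{"\n"}{end}'
```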
### Pattern 2: Service Unavailable
```bash
# Investigation flow
kubectl get pods -n <namespace> -l app=<service>      # Check pods
kubectl get svc -n <namespace> <service-name>         # Check service
kubectl get endpoints -n <namespace> <service-name>   # Check endpoints
kubectl get httproute -n <namespace>                  # Check routes (if using Gateway API)

# Test connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl -v http://<service-name>.<namespace>.svc:PORT
```
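If endpoints exist but requests still fail, it is worth ruling out DNS before digging into mTLS. A sketch using a throwaway busybox pod:

```bash
# Verify the service name resolves inside the cluster
kubectl run debug-dns -n <namespace> --image=busybox:1.36 --rm -it --restart=Never \
  -- nslookup <service-name>.<namespace>.svc.cluster.local
```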
### Pattern 3: High Resource Usage
```bash
# Investigation flow
kubectl top pods -n <namespace> --sort-by=memory   # Memory usage
kubectl top pods -n <namespace> --sort-by=cpu      # CPU usage

# Check limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Check for OOM kills
kubectl get events -n <namespace> | grep OOM
```
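Pods running without memory limits are a common cause of node pressure and OOM kills; custom columns make them easy to spot. A minimal sketch, where missing limits show as `<none>`:

```bash
# Show each pod's container memory limits; <none> indicates no limit set
kubectl get pods -n <namespace> -o custom-columns=\
'POD:.metadata.name,CONTAINER:.spec.containers[*].name,MEM_LIMIT:.spec.containers[*].resources.limits.memory'
```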
### Pattern 4: Frequent Restarts
```bash
# Investigation flow
kubectl get pods -n <namespace>                    # Check the RESTARTS column
kubectl describe pod <pod-name> -n <namespace>     # Check restart reason
kubectl logs <pod-name> -n <namespace> --previous  # Logs before the crash

# Check restart metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{pod="<pod-name>"}[1h])'
```
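To see which pods restart most across the whole platform rather than checking one pod at a time, a `topk` query over the same metric works well. A sketch, reusing the Grafana-pod query pattern from the Metrics section:

```bash
# Top 5 pods by restart count over the last hour, across all namespaces
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, increase(kube_pod_container_status_restarts_total[1h]))' \
  | python3 -m json.tool
```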
### Pattern 5: ArgoCD App OutOfSync
```bash
# Investigation flow
argocd app get <app-name> --port-forward ...    # Get status
argocd app diff <app-name> --port-forward ...   # See differences

# Check last sync
argocd app history <app-name> --port-forward ...

# Force sync if needed
argocd app sync <app-name> --force --prune --port-forward ...
```
## Integration with Other Skills
Use the **check-alerts** skill to find which alerts are firing:
```
# This automatically invokes the check-alerts skill
"What alerts are currently firing?"
```
Use the **check-logs** skill to query specific error patterns:
```
# This automatically invokes the check-logs skill
"Show me error logs from keycloak namespace in the last 10 minutes"
```
Use the **check-metrics** skill to verify service health:
```
# This automatically invokes the check-metrics skill
"What's the CPU usage of pods in observability namespace?"
```
## Incident Priority Guidelines
**Critical (P0)**:
- Platform-wide outage
- All users affected
- Data loss risk
- Security breach

**High (P1)**:
- Major feature unavailable
- Multiple users affected
- Performance severely degraded

**Medium (P2)**:
- Minor feature unavailable
- Some users affected
- Workaround available

**Low (P3)**:
- Cosmetic issue
- Minimal impact
- Can be fixed in next release
## Related Documentation
- `TODO_INCIDENTS.md` - Active incident tracking
- `docs/runbooks/alerts/` - Alert-specific runbooks
- CLAUDE.md Alert Monitoring section - Alert workflow
- `docs/04-observability/ALERT_TESTING_GUIDE.md` - Alert testing
## Pro Tips
- **Always check the runbook first**: Save time by following proven steps
- **Capture evidence early**: Logs and events may be lost if the pod restarts
- **Use `--previous` logs**: For crashed pods, the previous container's logs are critical
- **Check Git history**: Recent commits often reveal configuration changes
- **Document as you go**: Don't wait until the end to write the RCA
- **Verify the fix completely**: Check pods, alerts, tests, and platform status
- **Update the runbook**: Add new scenarios discovered during the investigation