name: coreweave-incident-runbook
description: 'Incident response runbook for CoreWeave GPU workload failures. Use when inference services are down, GPUs are unavailable, or when responding to production incidents on CoreWeave. Trigger with phrases like "coreweave incident", "coreweave outage", "coreweave runbook", "coreweave service down".'
allowed-tools: Read, Bash(kubectl:*), Grep
version: 1.0.0
license: MIT
author: Jeremy Longshore jeremy@intentsolutions.io
tags:
- saas
- gpu-cloud
- kubernetes
- inference
- coreweave
compatibility: Designed for Claude Code

CoreWeave Incident Runbook

Triage Steps

# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
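The raw nvidia-smi table from step 4 can be hard to scan during an incident. As a minimal sketch, the helper below reads the CSV form of the same data and flags GPUs whose memory is nearly full. The function name `gpus_near_full` and the 95 percent default are assumptions for illustration; the query fields (`index`, `name`, `memory.used`, `memory.total`) are standard nvidia-smi properties.

```shell
# Flag GPUs whose memory use is at or above a percentage threshold (default 95).
# Reads CSV rows of index,name,memory.used,memory.total (MiB, no units) on stdin.
# gpus_near_full is a hypothetical helper name, not an nvidia-smi feature.
gpus_near_full() {
  awk -F', *' -v limit="${1:-95}" \
    '$4 > 0 && ($3 / $4) * 100 >= limit { print "GPU " $1 " (" $2 "): " $3 "/" $4 " MiB" }'
}

# Example usage against a live pod (not run here):
#   kubectl exec "$(kubectl get pod -l app=inference -o name | head -1)" -- \
#     nvidia-smi --query-gpu=index,name,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpus_near_full 90
```

An empty result means no GPU crossed the threshold; it does not by itself prove the GPUs are healthy.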

Common Incidents

Inference Service Down

Check pod status and events
If OOMKilled: the container exceeded its memory limit; reduce batch size, raise the memory limit, or upgrade GPU
If ImagePullBackOff: check registry credentials
If Pending: check GPU quota and availability
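The branching above can be sketched as a small shell helper that maps a pod state reason to the matching runbook step. The function name `triage_action` is hypothetical, and `ErrImagePull` is included because Kubernetes reports it before backing off to `ImagePullBackOff`. This is a starting point under those assumptions, not a complete classifier.

```shell
# Map a pod state reason to the runbook action listed above.
# triage_action is a hypothetical helper name, not part of kubectl.
triage_action() {
  case "$1" in
    OOMKilled)
      echo "reduce batch size or upgrade GPU" ;;
    ImagePullBackOff|ErrImagePull)
      echo "check registry credentials" ;;
    Pending)
      echo "check GPU quota and availability" ;;
    *)
      echo "check pod status and events" ;;
  esac
}

# Example: list each pod's phase and waiting reason on a live cluster (not run here):
#   kubectl get pods -l app=inference -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```

Note that `OOMKilled` usually appears under `lastState.terminated.reason` after a restart, and `Pending` is a pod phase rather than a container reason, so both fields are worth checking.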

GPU Node Failure

Pods managed by a Deployment are rescheduled automatically if other GPU nodes have capacity
If no capacity: scale down non-critical workloads
Contact CoreWeave support for extended outages
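The steps above might look like the following sketch. It assumes non-critical deployments carry a label such as `tier=non-critical`; that label, the helper names, and the `KUBECTL` override variable are all hypothetical and should be adapted to the cluster's own conventions.

```shell
# KUBECTL can be overridden (for example with "echo kubectl") to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Print nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Scale every deployment matching a label selector to zero replicas.
# The default selector "tier=non-critical" is a hypothetical label.
scale_down_noncritical() {
  $KUBECTL scale deployment -l "${1:-tier=non-critical}" --replicas=0
}

# Example usage against a live cluster (not run here):
#   kubectl get nodes -l gpu.nvidia.com/class --no-headers | not_ready_nodes
#   KUBECTL="echo kubectl" scale_down_noncritical   # preview only
#   scale_down_noncritical                          # actually scale down
```

Cordoned nodes show up as `Ready,SchedulingDisabled`, so the filter lists them too, which is usually what is wanted when looking for lost capacity.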

Model Loading Failure

Check that the HuggingFace token secret exists
Verify the model name spelling
Check that the PVC has sufficient storage
Review container logs for download errors
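A rough sketch of these checks follows. The secret name `hf-token` and PVC name `model-cache` are placeholders, since the runbook does not name them, and the error patterns are a guess at common download failures rather than an exhaustive list.

```shell
# KUBECTL can be overridden (for example with "echo kubectl") to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Check that the token secret and the model PVC exist.
# "hf-token" and "model-cache" are hypothetical resource names.
check_model_prereqs() {
  $KUBECTL get secret "${1:-hf-token}" &&
  $KUBECTL get pvc "${2:-model-cache}"
}

# Filter container logs (stdin) for common download failure messages.
scan_download_errors() {
  grep -Ei 'unauthorized|forbidden|not found|no space left|timed out|connection (refused|reset)'
}

# Example usage against a live cluster (not run here):
#   check_model_prereqs my-hf-secret my-model-pvc
#   kubectl logs deployment/inference --tail=200 | scan_download_errors
```

An authorization error in the logs points at the token, a "not found" error at the model name, and "no space left" at the PVC.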

Rollback

kubectl rollout undo deployment/inference
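A rollback is only useful if the previous revision actually becomes healthy, so it may help to wait on the rollout afterwards. The sketch below chains the undo with `kubectl rollout status`; the helper name `rollback_and_wait` and the 120s default timeout are assumptions.

```shell
# KUBECTL can be overridden (for example with "echo kubectl") to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Undo the latest rollout, then wait until the restored revision is fully rolled out.
# rollback_and_wait is a hypothetical helper; the 120s default timeout is arbitrary.
rollback_and_wait() {
  $KUBECTL rollout undo "deployment/${1:-inference}" &&
  $KUBECTL rollout status "deployment/${1:-inference}" --timeout="${2:-120s}"
}

# Example usage against a live cluster (not run here):
#   kubectl rollout history deployment/inference   # inspect available revisions first
#   rollback_and_wait inference 300s
```

To return to a specific revision rather than the immediately previous one, `kubectl rollout undo` also accepts `--to-revision=N`.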

Next Steps

For data handling, see coreweave-data-handling.
