name: api-resilience-patterns
description: Implement API resilience patterns — circuit breakers, retry with backoff, rate limiting, bulkhead isolation, timeout management, and graceful degradation.
version: "1.0.0"
last-updated: "2026-04-17"
model_tested: "claude-sonnet-4-6"
category: resilience
platforms: [claude-code, codex, gemini-cli, cursor, copilot, windsurf, cline]
language: en
geo_relevance: [global]
priority: medium
dependencies:
  mcp: []
  skills: []
  apis: []
  data: []
update_sources:
  - url: "https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker"
    check_frequency: "yearly"
    last_checked: "2026-04-21"
license: MIT
API Resilience Patterns
When to Use
- Calling external APIs that might fail or slow down
- Designing microservice communication
- Building agents that call multiple tools/APIs
- Handling rate limits from LLM providers
- Preventing cascade failures
Pattern 1: Circuit Breaker
Prevents repeated calls to a failing service.
States: Closed (normal) → Open (failing) → Half-Open (testing)
| State | Behavior | Transition |
|---|---|---|
| Closed | Forward requests normally | → Open after N consecutive failures |
| Open | Reject immediately (fail fast) | → Half-Open after cooldown period |
| Half-Open | Allow 1 test request | → Closed if success, → Open if fail |
Config: threshold=3 failures, cooldown=30s, half-open-max=1.
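The state table can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not a production implementation: the names `CircuitBreaker` and `CircuitOpenError` and the injectable `clock` parameter are hypothetical, and because calls run one at a time the half-open limit of one test request holds trivially. A concurrent version would need a lock and an explicit in-flight counter.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold      # consecutive failures before opening
        self.cooldown = cooldown        # seconds to stay open before probing
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open, failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe reopens immediately; otherwise open at the threshold.
            if state == "half-open" or self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a call then looks like `breaker.call(lambda: client.get(url))`, where `client` stands in for whatever HTTP client is in use. Callers treat `CircuitOpenError` as a signal to use a fallback instead of waiting.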
Pattern 2: Retry with Exponential Backoff
Attempt 1: immediate
Attempt 2: wait 1s + random(0-500ms)
Attempt 3: wait 2s + random(0-500ms)
Attempt 4: wait 4s + random(0-500ms)
Attempt 5: wait 8s + random(0-500ms)
Max: 5 attempts, 16s max wait
Rules:
- Only retry on transient errors (429, 500, 502, 503, timeout)
- Never retry on other client errors (400, 401, 403, 404); repeating the same request cannot fix them
- Always add jitter to prevent thundering herd
- Set a total timeout budget (not just per-attempt)
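The schedule and rules above might be sketched as follows. Several things here are assumptions for illustration: `ApiError` is a hypothetical exception carrying an HTTP status (with `None` standing for a timeout), the 16s figure is read as a cap on each individual wait, the 60s default budget is borrowed from the timeout table, and `sleep`, `clock`, and `rng` are injectable only to keep the example testable.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
RETRYABLE_STATUS = {429, 500, 502, 503}  # transient codes from the rules above


class ApiError(Exception):
    """Hypothetical error type; status=None represents a timeout."""

    def __init__(self, status: Optional[int]) -> None:
        super().__init__(f"API error (status={status})")
        self.status = status


def is_transient(err: ApiError) -> bool:
    return err.status is None or err.status in RETRYABLE_STATUS


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 16.0,
                  jitter: float = 0.5,
                  rng: Callable[[], float] = random.random) -> float:
    """Wait before `attempt` (attempt 2 waits ~1s, 3 ~2s, 4 ~4s), plus jitter."""
    return min(cap, base * 2 ** (attempt - 2)) + rng() * jitter


def call_with_retry(fn: Callable[[], T], max_attempts: int = 5,
                    budget: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic,
                    rng: Callable[[], float] = random.random) -> T:
    deadline = clock() + budget  # total budget across all attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ApiError as err:
            # Give up on client errors and on the final attempt.
            if not is_transient(err) or attempt == max_attempts:
                raise
            delay = backoff_delay(attempt + 1, rng=rng)
            # Do not start a wait that would overrun the total budget.
            if clock() + delay > deadline:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

When a provider sends a `Retry-After` header with a 429, honoring that value instead of the computed delay is usually the better choice. The sketch omits it to stay short.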
Pattern 3: Rate Limiting (Client-Side)
Respect provider limits proactively:
| Strategy | When | How |
|---|---|---|
| Token bucket | Steady rate with bursts | Refill N tokens/sec, consume per request |
| Sliding window | Strict per-minute limits | Track timestamps of last N requests |
| Queue-based | Ordered processing | FIFO queue with configurable concurrency |
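As an illustration of the first row, a token bucket can be sketched like this. The class name `TokenBucket`, its method names, and the injectable `clock` are hypothetical; the sketch is not thread-safe and leaves the decision of whether to wait or reject to the caller.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens will be available (0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

A caller that prefers waiting over rejecting can sleep for `wait_time()` and then call `try_acquire()` again. For LLM providers that meter tokens rather than requests, `cost` can carry the estimated token count of the request.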
Pattern 4: Bulkhead Isolation
Isolate failures to prevent cascade:
- Separate connection pools per service
- Separate thread/worker pools per dependency
- If service A fails, services B and C are unaffected
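One lightweight way to approximate separate pools is a concurrency limit per dependency. The sketch below uses a semaphore per dependency name and rejects immediately when a dependency's slots are all in use. `Bulkhead`, `BulkheadFullError`, and the dependency names are hypothetical; real deployments often use dedicated connection pools or executors instead.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class BulkheadFullError(Exception):
    """Raised when a dependency has no free slots left."""


class Bulkhead:
    def __init__(self, limits: Dict[str, int]) -> None:
        # One independent semaphore per dependency, so a saturated
        # dependency cannot consume capacity reserved for the others.
        self._slots = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    @contextmanager
    def slot(self, dependency: str) -> Iterator[None]:
        sem = self._slots[dependency]
        # Non-blocking acquire: fail fast rather than queue behind a slow service.
        if not sem.acquire(blocking=False):
            raise BulkheadFullError(f"{dependency} is at its concurrency limit")
        try:
            yield
        finally:
            sem.release()
```

Usage is `with bulkhead.slot("service_a"): ...` around each outbound call. If service A hangs and fills its slots, calls to service B still get through because they draw from a different semaphore.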
Pattern 5: Timeout Management
| Tier | Timeout | Purpose |
|---|---|---|
| Connection | 5s | Detect unreachable host |
| Request | 30s | Detect slow response |
| Total operation | 60s | Budget for retries included |
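Rule: the total operation timeout is a hard cap on all attempts plus backoff waits. To give every attempt its full request timeout, size it above (max_attempts × request_timeout) + total backoff; otherwise clamp each attempt to the remaining budget. With the values above, 60s covers at most two full 30s attempts. Always set all three.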
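One way to keep the three tiers consistent is to track a single deadline for the whole operation and clamp each attempt's connection and request timeouts to whatever remains. The `Deadline` class below is a hypothetical helper written for this illustration; it returns a `(connect, request)` pair, a shape that some HTTP clients accept directly as their timeout argument.

```python
import time
from typing import Callable, Tuple


class Deadline:
    def __init__(self, total: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.expires_at = clock() + total  # fixed once, shared by all attempts

    def remaining(self) -> float:
        return max(0.0, self.expires_at - self.clock())

    def attempt_timeouts(self, connect: float = 5.0,
                         request: float = 30.0) -> Tuple[float, float]:
        """Per-attempt timeouts, never longer than the remaining total budget."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("total operation budget exhausted")
        return (min(connect, left), min(request, left))
```

Each retry calls `attempt_timeouts()` again, so later attempts automatically get shorter timeouts as the budget drains. Passing the same `Deadline` down to inner calls keeps inner timeouts from exceeding the outer one.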
Pattern 6: Graceful Degradation
| Scenario | Fallback |
|---|---|
| Search API down | Return cached results + "results may not be current" |
| Payment API slow | Queue payment, confirm later |
| AI API rate-limited | Switch to cheaper/faster model |
| Database read replica down | Read from primary (accept perf hit) |
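The first row of the table could look like the following sketch. `CachedSearch`, `SearchResponse`, and the `search_api` callable are hypothetical names; the cache is an unbounded in-memory dict with no expiry, which a real service would replace with a bounded cache and a maximum staleness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchResponse:
    results: List[str]
    stale: bool
    notice: str = ""


class CachedSearch:
    def __init__(self, search_api: Callable[[str], List[str]]) -> None:
        self.search_api = search_api
        self.cache: Dict[str, List[str]] = {}

    def search(self, query: str) -> SearchResponse:
        try:
            results = self.search_api(query)
        except Exception:
            # Broad catch for brevity; real code should catch only the
            # client's transport and timeout errors.
            if query in self.cache:
                return SearchResponse(self.cache[query], stale=True,
                                      notice="results may not be current")
            raise  # nothing cached: surface the failure to the caller
        self.cache[query] = results  # remember the last good answer
        return SearchResponse(results, stale=False)
```

Returning an explicit `stale` flag lets the caller decide how to present degraded results instead of silently passing off old data as fresh.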
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Retry without backoff | Amplifies load on failing service | Exponential backoff + jitter |
| No timeout | Thread/connection leak | Always set timeouts |
| Retry on all errors | Retrying 401 wastes time | Only retry transient errors |
| Sync retry in UI thread | Blocks user interface | Async retry with status feedback |
| Cascading timeouts | Inner timeout exceeds outer timeout, so the callee keeps working after the caller has given up | Budget timeouts from the outside in |
What This Skill Does NOT Do
- Does not implement or prescribe specific libraries (it describes patterns)
- Does not monitor uptime (use APM tools)
- Does not manage API keys or authentication
- Does not handle business logic fallbacks (only infrastructure patterns)