name: design-system
type: reference
description: "Decomposes a product concept into architectural components, domain systems, data models, and integration boundaries. Use when starting system architecture or when the user mentions system design or component breakdown."
effort: 3
allowed-tools: Read, Glob, Grep, Write, Edit, Bash
user-invocable: true
when_to_use: "When designing system architecture, defining domain boundaries, or creating a component breakdown for a new product or feature"
System Design
Phase 1: Clarify requirements (always do this first)
Ask before designing:
- Scale: How many users/requests/day? Read-heavy or write-heavy?
- Consistency: Strong (banking) or eventual (social feed)?
- Availability target: 99.9% (~8.8h/yr downtime) or 99.99% (~53min/yr)?
- Latency budget: p99 < 100ms? < 1s?
- Geography: Single region or multi-region?
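The downtime figures attached to an availability target follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name `downtime_hours_per_year` is made up for illustration:

```python
def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the downtime budget in hours per year for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    hours_per_year = 365 * 24  # 8760 hours, ignoring leap years
    return hours_per_year * (1 - availability_pct / 100)


print(round(downtime_hours_per_year(99.9), 2))        # 8.76 (hours)
print(round(downtime_hours_per_year(99.99) * 60, 1))  # 52.6 (minutes)
```

Each extra "nine" cuts the budget by a factor of ten, which is why the availability target should be agreed on before any redundancy decisions are made.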
Capacity estimation shortcuts
1M requests/day → ~12 req/s avg, ~120 req/s peak (assuming a 10x peak factor)
1KB per request → 1M req/day = ~1GB/day = ~365GB/year
Read:write ratio 10:1 (typical social) → optimize read path first
1 server handles ~1000 req/s (rule of thumb for I/O-bound services)
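The shortcuts above can be turned into a small back-of-envelope calculator. This is a sketch only: the names `Estimate` and `estimate_capacity` are hypothetical, and the defaults (1KB per request, 10x peak factor, 1000 req/s per server) are the rules of thumb listed above, not measured values.

```python
import math
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    avg_rps: float
    peak_rps: float
    storage_gb_per_year: float
    servers_at_peak: int


def estimate_capacity(
    requests_per_day: int,
    bytes_per_request: int = 1_000,   # assumption: ~1KB stored per request
    peak_factor: float = 10.0,        # assumption: peak is 10x the average
    rps_per_server: int = 1_000,      # assumption: I/O-bound rule of thumb
) -> Estimate:
    """Rough capacity numbers derived from daily request volume."""
    avg_rps = requests_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    storage_gb_per_year = requests_per_day * bytes_per_request * 365 / 1e9
    servers_at_peak = max(1, math.ceil(peak_rps / rps_per_server))
    return Estimate(avg_rps, peak_rps, storage_gb_per_year, servers_at_peak)


result = estimate_capacity(1_000_000)
print(f"{result.avg_rps:.1f} req/s avg, {result.peak_rps:.0f} req/s peak")
print(f"{result.storage_gb_per_year:.0f} GB/year, {result.servers_at_peak} server(s)")
```

The server count covers throughput only; redundancy requirements usually push the real number higher than this estimate.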
Component breakdown template
Client layer → Web / Mobile / API consumers
CDN → Static assets, edge caching
API Gateway → Rate limiting, auth, routing, SSL termination
Services → Domain-specific services (User, Order, Payment, Notification)
Cache → Redis for hot data (sessions, rate limits, computed results)
Database → Primary DB + Read replicas
Message queue → Async operations, event-driven decoupling
Storage → Object storage for files (S3/GCS)
Monitoring → Metrics, logs, traces, alerts
Database selection guide
| Need | Choose |
|---|---|
| ACID transactions, relations | PostgreSQL |
| High-scale document store | MongoDB |
| Key-value, cache, pub/sub | Redis |
| Time-series data | TimescaleDB / InfluxDB |
| Graph relationships | Neo4j |
| Full-text search | Elasticsearch |
| Analytical/OLAP | ClickHouse / BigQuery |
Caching strategies
Cache-aside (read): App checks cache → miss → DB → write to cache
Write-through: Write to cache and DB synchronously in the same operation (consistent, slower writes)
Write-behind: Write to cache → async flush to DB (fast writes, risk of loss)
Read-through: Cache handles DB reads automatically
TTL guidelines:
- Sessions: 15-30 min
- User profile: 5 min
- Product catalog: 1 hour
- Config/settings: 24 hours
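The cache-aside read path with a TTL can be sketched as follows. This is an illustration under stated assumptions: `TTLCache` is a tiny in-memory stand-in for a real cache such as Redis, and `get_user_profile` and `load_from_db` are hypothetical names. The 300-second default mirrors the user-profile guideline above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a cache such as Redis (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def get_user_profile(
    user_id: str,
    cache: TTLCache,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 300,  # 5 minutes, per the TTL guideline for profiles
) -> str:
    """Cache-aside read: check the cache, fall back to the DB, then populate."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached              # cache hit
    value = load_from_db(user_id)  # cache miss: read from the source of truth
    cache.set(key, value, ttl_seconds)
    return value
```

In this pattern the application owns the cache logic; with read-through, the same miss handling would live inside the cache layer instead.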
Message queue patterns
When to use queues:
✓ Async processing (email, PDF generation, notifications)
✓ Rate-limiting downstream services
✓ Decoupling services (order → payment → shipping)
✓ Fan-out (1 event → multiple consumers)
Queue selection:
- RabbitMQ: complex routing, request-reply, low latency
- Kafka: high throughput, event log/replay, stream processing
- SQS: managed, simple, AWS-native, at-least-once delivery (standard queues)
- Redis Streams: lightweight, same infra as cache
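The fan-out pattern (1 event → multiple consumers) can be illustrated with a toy in-process broker. This is a sketch, not a substitute for any of the queues listed above: `InMemoryBroker` and the topic name `order.created` are invented for the example, and there is no persistence, retry, or acknowledgement.

```python
import queue
from collections import defaultdict
from typing import DefaultDict, List


class InMemoryBroker:
    """Toy broker: every subscriber gets its own queue, so each event fans out."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[queue.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> queue.Queue:
        """Register a new consumer on a topic and return its private queue."""
        q: queue.Queue = queue.Queue()
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic: str, event: dict) -> int:
        """Put a copy of the event on every subscriber queue; return the count."""
        subscribers = self._subscribers.get(topic, [])
        for q in subscribers:
            q.put(dict(event))  # copy so consumers cannot mutate each other's data
        return len(subscribers)


broker = InMemoryBroker()
email_queue = broker.subscribe("order.created")
shipping_queue = broker.subscribe("order.created")
delivered = broker.publish("order.created", {"order_id": 1})
print(delivered)                 # 2
print(email_queue.get_nowait())  # {'order_id': 1}
```

The publisher does not know who consumes the event, which is the decoupling property the order → payment → shipping example relies on.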
API design decisions
REST: Standard CRUD, simple clients, team familiarity (default choice)
GraphQL: Multiple clients with different data needs, reduce over-fetching
gRPC: Internal service-to-service, binary protocol, streaming needed
WebSocket: Real-time bidirectional (chat, live updates, collaborative tools)
Scaling patterns
Vertical (scale up): More CPU/RAM — quick, limited ceiling
Horizontal (scale out): More instances — requires stateless services
Database read replicas: Offload read traffic (good for 80%+ read workloads)
Database sharding: Shard by user_id, geography — last resort, complex
CQRS: Separate read/write models — when read/write patterns diverge heavily
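Sharding by user_id needs a routing function that sends the same user to the same shard every time. A minimal sketch using hash-modulo routing; the function name `shard_for` is hypothetical, and SHA-256 is chosen here only because Python's built-in `hash()` for strings is randomized per process and therefore not stable across instances.

```python
import hashlib


def shard_for(user_id: str, shard_count: int) -> int:
    """Map a user_id to a shard index in [0, shard_count) using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shard_count.
    return int.from_bytes(digest[:8], "big") % shard_count


for uid in ("user-1", "user-2", "user-3"):
    print(uid, "->", shard_for(uid, shard_count=4))
```

With plain modulo routing, changing `shard_count` remaps most keys, which is part of why sharding is listed as a last resort; consistent hashing is a common way to reduce that movement.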
Common design mistakes
| Mistake | Better approach |
|---|---|
| Over-engineering for scale you don't have | Start monolith, extract services at clear pain points |
| Synchronous calls to all dependencies | Use async queues for non-critical paths |
| No caching strategy | Cache at API layer + DB query results |
| Storing sessions in the primary DB | Use Redis; per-request session lookups add load to the DB and limit horizontal scaling |
| Single point of failure | Redundancy at every critical layer |