---
name: prometheus-patterns
description: PromQL queries, alerting rules, recording rules, Grafana dashboard JSON, SLO error budgets
---
# Prometheus Patterns

## PromQL Essentials

### Rate and Error Calculations
```promql
# Request rate (per second, 5m window)
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P99 latency from histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P50 latency by endpoint
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
)

# Saturation: CPU usage per pod, as a percentage of the CPU limit
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) * 100
```
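How `histogram_quantile` arrives at those latency numbers can be sketched from first principles. This is a simplified Go model of its linear interpolation between cumulative bucket bounds (ignoring the `+Inf` bucket and other edge cases Prometheus handles), fed with made-up bucket counts:

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket: count of observations <= le.
type bucket struct {
	le    float64 // upper bound (the "le" label)
	count float64 // cumulative count up to le
}

// quantile approximates PromQL's histogram_quantile: find the bucket that
// contains the target rank, then interpolate linearly within it (Prometheus
// assumes observations are uniformly spread inside a bucket).
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lower, prevCount := 0.0, 0.0 // lower bound and cumulative count below the matched bucket
	for _, b := range buckets {
		if b.count >= rank {
			return lower + (b.le-lower)*(rank-prevCount)/(b.count-prevCount)
		}
		lower, prevCount = b.le, b.count
	}
	return lower
}

func main() {
	// Hypothetical cumulative bucket rates: 90 req/s <= 0.1s, 99 <= 0.5s, 100 <= 1s.
	buckets := []bucket{{0.1, 90}, {0.5, 99}, {1, 100}}
	fmt.Printf("p50=%.3f p99=%.3f\n", quantile(0.50, buckets), quantile(0.99, buckets))
}
```

Note the estimate is only as good as the bucket layout: a p99 that falls in a wide bucket is interpolated across that whole bucket, which is why the bucket-boundary anti-pattern below matters.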
### SLO: Error Budget
```promql
# SLO: 99.9% availability over 30 days
# Error budget = 0.1% = 43.2 minutes/month

# Current burn rate (how fast the budget is being consumed);
# division binds tighter than subtraction, so the 1 - ... must be parenthesized
(
  1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
) / (1 - 0.999)

# Remaining error budget (fraction: 1 = untouched, 0 = exhausted)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / (sum(increase(http_requests_total[30d])) * 0.001)
)
```
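The arithmetic behind those comments, as a quick Go sanity check: 0.1% of a 30-day window, and burn rate as the observed error ratio divided by the budgeted one:

```go
package main

import "fmt"

// errorBudgetMinutes returns the allowed downtime for an SLO over a window.
func errorBudgetMinutes(slo, windowDays float64) float64 {
	return (1 - slo) * windowDays * 24 * 60
}

// burnRate is the observed error ratio over the budgeted error ratio:
// 1 means the budget runs out exactly at the end of the window, >1 means sooner.
func burnRate(errorRatio, slo float64) float64 {
	return errorRatio / (1 - slo)
}

func main() {
	fmt.Printf("budget: %.1f min/month\n", errorBudgetMinutes(0.999, 30)) // 43.2
	fmt.Printf("burn at 1%% errors: %.1fx\n", burnRate(0.01, 0.999))      // 10.0
}
```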
## Recording Rules
```yaml
groups:
  - name: sli_rules
    interval: 30s
    rules:
      - record: job:http_request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_error_rate:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
```
## Alerting Rules
```yaml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for {{ $labels.job }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: job:http_latency_p99:5m > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.job }}"

      - alert: ErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Burning error budget 14.4x faster than allowed"
```
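Where 14.4 comes from: in the Google SRE workbook's multiwindow convention, it is the burn rate that spends 2% of a 30-day error budget within one hour. A one-line Go check of that derivation:

```go
package main

import "fmt"

// burnRateThreshold returns the burn-rate multiplier that consumes
// budgetFraction of the error budget within alertWindowHours, out of an
// SLO window of sloWindowHours (e.g. 2% in 1h of a 30-day window => 14.4).
func burnRateThreshold(budgetFraction, alertWindowHours, sloWindowHours float64) float64 {
	return budgetFraction * sloWindowHours / alertWindowHours
}

func main() {
	fmt.Printf("%.1f\n", burnRateThreshold(0.02, 1, 30*24))
}
```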
## Instrumentation (Go)
```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// promauto registers the collectors with the default registry; with plain
	// prometheus.New*Vec you must call prometheus.MustRegister yourself.
	httpRequests = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "handler", "status"},
	)
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
		},
		[]string{"method", "handler"},
	)
)
```
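For Prometheus to see these metrics, the service must expose them (typically `promhttp.Handler()` on `/metrics`, the default `metrics_path`) and be listed as a scrape target. A minimal scrape config sketch, assuming a hypothetical `my-service` target on port 8080:

```yaml
scrape_configs:
  - job_name: my-service      # becomes the "job" label used by the rules above
    scrape_interval: 15s
    static_configs:
      - targets: ["my-service:8080"]
```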
## Checklist

### Anti-Patterns
- High-cardinality labels (user_id, trace_id) causing metric explosion
- Using `avg()` for latency instead of histograms/quantiles
- Missing `for` clause in alerts, causing alert storms
- Recording rules with too-short intervals, wasting resources
- Alerting on symptoms without linking to causes
- Not setting meaningful histogram bucket boundaries
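On the first point: a metric's series count is the product of its label cardinalities, so one unbounded label dwarfs everything else. A quick Go illustration with assumed cardinalities:

```go
package main

import "fmt"

// seriesCount estimates how many time series one metric produces:
// the product of the cardinality of each label.
func seriesCount(cardinalities ...int) int {
	n := 1
	for _, c := range cardinalities {
		n *= c
	}
	return n
}

func main() {
	// method x handler x status, as in http_requests_total above
	// (assumed: 4 methods, 20 handlers, 5 status classes).
	fmt.Println(seriesCount(4, 20, 5)) // 400 series: fine
	// Adding a user_id label with 10k users multiplies everything.
	fmt.Println(seriesCount(4, 20, 5, 10000)) // 4000000 series: explosion
}
```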