SLO, SLI and Error Budgets — Deep Dive¶
A practical guide to implementing SLOs and SLIs: metric selection, error budget calculation, and burn rate alerting.
SLI — What to Measure¶
- Availability: % of requests that succeed
- Latency: % of requests served faster than a threshold (e.g. 99% under 300 ms, which is the same as p99 < 300 ms)
- Throughput: successfully processed operations per second
- Correctness: % of results that are correct
- Freshness: data age below a threshold
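As a rough illustration of the first two SLIs in the list above, the sketch below computes availability and latency as "good events divided by total events" over a handful of in-memory request records. The `Request` record shape, the function names, and the sample data are assumptions made for this example; in practice the counts would come from your metrics system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests served faster than the threshold."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Made-up sample data: one server error and one slow request out of four.
sample = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(503, 40.0),
    Request(200, 450.0),
]
print(f"availability: {availability_sli(sample):.2%}")
print(f"latency < 300 ms: {latency_sli(sample):.2%}")
```

Expressing every SLI as a ratio of good events to total events keeps the later error budget arithmetic the same regardless of which signal is measured.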
SLO Definition¶
# SLO for API Gateway
SLO: 99.9% availability (30-day rolling window)
SLI: sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
Error Budget: 0.1% of requests, equivalent to 43.2 minutes of full downtime per 30 days
# Prometheus recording rule
- record: sli:api_availability:ratio_rate30d
  expr: |
    sum(increase(http_requests_total{status!~"5.."}[30d]))
    / sum(increase(http_requests_total[30d]))
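The 43.2-minute figure follows directly from the SLO and the window length. The sketch below reproduces that arithmetic and also expresses the same budget as a number of failed requests. The function names and the request volume of 10 million are hypothetical; the SLO is passed as a decimal string so that `Fraction` can keep the arithmetic exact.

```python
import math
from fractions import Fraction


def error_budget(slo: str) -> Fraction:
    """Error budget as an exact fraction: 1 - SLO (SLO given as e.g. "0.999")."""
    return 1 - Fraction(slo)


def budget_minutes(slo: str, window_days: int = 30) -> float:
    """Minutes of full downtime the budget allows over the window."""
    window_minutes = window_days * 24 * 60
    return float(error_budget(slo) * window_minutes)


def allowed_failures(slo: str, total_requests: int) -> int:
    """Number of failed requests the budget allows for a given volume."""
    return math.floor(error_budget(slo) * total_requests)


print(budget_minutes("0.999"))                # 30 days * 0.1%
print(budget_minutes("0.9999"))               # one more nine, ten times less budget
print(allowed_failures("0.999", 10_000_000))  # hypothetical monthly volume
```

For a request-based SLI like the one above, the count of allowed failures is the more direct reading of the budget; the minutes figure is the equivalent only if the service were completely down.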
Error Budget & Burn Rate¶
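Error budget = 1 - SLO; for a 99.9% SLO that is 0.1% of requests. Burn rate is the observed error ratio divided by the error budget, so it tells you how fast you’re consuming the budget: a burn rate of 1 spends the budget exactly over the SLO window, while a sustained burn rate of 14.4 spends 2% of a 30-day budget in one hour and all of it in about 50 hours.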
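To make the burn rate arithmetic concrete, here is a minimal sketch assuming a 99.9% SLO over a 30-day window. The function names are hypothetical, and the 1.44% error ratio is a made-up observation chosen to land on a burn rate of 14.4.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days * 24 / rate


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the whole budget spent after `hours` at this burn rate."""
    return rate * hours / (window_days * 24)


rate = burn_rate(error_ratio=0.0144, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in: {hours_to_exhaustion(rate):.0f} h")
print(f"spent in 1 h: {budget_consumed(rate, hours=1):.1%}")
```

Read the other way around, a threshold such as 14.4 can be derived by deciding how much of the budget you are willing to lose before being paged: 2% of a 720-hour window in 1 hour gives 0.02 * 720 / 1 = 14.4.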
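# Multi-window burn rate alert
# 14.4 = burn rate that spends 2% of a 30-day budget in 1 hour; 0.001 = 1 - SLO (99.9%)
# Assumes recording rules sli:error_ratio:rate1h and sli:error_ratio:rate5m are defined elsewhere
- alert: HighErrorBudgetBurn
  expr: |
    (
      sli:error_ratio:rate1h > (14.4 * 0.001)
      and
      sli:error_ratio:rate5m > (14.4 * 0.001)
    )
  labels:
    severity: critical
  annotations:
    summary: "Error budget burn rate above 14.4x"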
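The alert condition above requires both windows to exceed the threshold. The sketch below mirrors that logic in plain Python to show why: the 1-hour window confirms the burn is significant, and the 5-minute window confirms it is still happening. The function name and the sample error ratios are assumptions for illustration; the two inputs stand in for `sli:error_ratio:rate1h` and `sli:error_ratio:rate5m`.

```python
def should_page(
    error_ratio_1h: float,
    error_ratio_5m: float,
    slo: float = 0.999,
    burn_rate_threshold: float = 14.4,
) -> bool:
    """True when both the long and the short window burn faster than the threshold."""
    threshold = burn_rate_threshold * (1.0 - slo)  # 14.4 * 0.001 for a 99.9% SLO
    return error_ratio_1h > threshold and error_ratio_5m > threshold


# Ongoing incident: both windows are well above 1.44% errors.
print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.03))
# Incident already over: the last hour looks bad, the last 5 minutes are clean.
print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.001))
# Brief spike: the last 5 minutes look bad, the hour as a whole does not.
print(should_page(error_ratio_1h=0.001, error_ratio_5m=0.05))
```

Only the first case pages. The second case is what lets the alert resolve quickly once errors stop, and the third keeps short blips from waking anyone up.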
Error Budget Policy¶
- Budget OK: deploy new features, experiment
- Less than 50% of the budget remaining: increased caution
- Budget exhausted: feature freeze, focus on stability
The error budget policy is an agreement between the SRE and product teams.
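A policy like this is easier to apply consistently when the thresholds are written down as code. The sketch below is one possible encoding, assuming that "Budget < 50%" refers to the share of the budget still remaining; the enum, the function names, and the request counts are hypothetical.

```python
from enum import Enum


class PolicyAction(Enum):
    NORMAL = "deploy new features, experiment"
    CAUTION = "increased caution"
    FREEZE = "feature freeze, focus on stability"


def remaining_budget(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # assumption: no traffic means nothing was spent
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def policy_action(remaining: float) -> PolicyAction:
    """Map the remaining budget fraction to the agreed policy tier."""
    if remaining <= 0.0:
        return PolicyAction.FREEZE
    if remaining < 0.5:  # assumption: the 50% threshold is on remaining budget
        return PolicyAction.CAUTION
    return PolicyAction.NORMAL


# Made-up numbers: 600 failures out of 1,000,000 requests against a 99.9% SLO.
left = remaining_budget(slo=0.999, good=999_400, total=1_000_000)
print(f"remaining: {left:.0%} -> {policy_action(left).value}")
```

The exact tiers and thresholds are whatever the SRE and product teams agree on; the value of encoding them is that the decision no longer depends on who is looking at the dashboard.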
Summary¶
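SLIs define what you measure, SLOs set the target, and the error budget quantifies how much unreliability is acceptable. Burn rate alerting then flags fast budget consumption early enough to act before the SLO is breached, which shifts monitoring from reactive to proactive.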