SLO, SLI and Error Budgets — Deep Dive

A practical guide to implementing SLOs and SLIs: metric selection, error budget calculation, and burn rate alerting.

SLI — What to Measure¶

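Availability: % of requests that succeed (for example, responses without a 5xx status)
Latency: % of requests served faster than a threshold (for example, 99% under 300 ms, which is the same target as p99 < 300 ms)
Throughput: successfully processed operations per second
Correctness: % of responses that return the correct result
Freshness: % of data whose age is below a threshold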
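To make the first two SLIs concrete, here is a minimal Python sketch that computes them from a list of request records. The `Request` type, the function names, and the sample data are hypothetical and exist only for illustration; the sketch also assumes that "successful" means "not a 5xx response", matching the SLO definition used later in this article.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code
    latency_ms: float    # time to serve the request, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Hypothetical sample: one server error and one slow request out of four.
sample = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(503, 40.0),
    Request(200, 450.0),
]
print(availability_sli(sample))  # 3 of 4 requests succeeded
print(latency_sli(sample))       # 3 of 4 requests finished under 300 ms
```

Both SLIs are expressed as "good events / all events", which keeps them on the same 0 to 1 scale and makes it straightforward to attach an SLO target to each.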

SLO Definition¶

# SLO for API Gateway
SLO: 99.9% availability (rolling 30-day window)
SLI: sum(http_requests_total{status!~"5.."}) / sum(http_requests_total)
Error Budget: 0.1% = 43.2 minutes per 30 days (0.001 × 43,200 minutes)

# Prometheus recording rule
- record: sli:api_availability:ratio_rate30d
  expr: |
    sum(increase(http_requests_total{status!~"5.."}[30d]))
    / sum(increase(http_requests_total[30d]))
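The 43.2 minutes figure in the SLO definition is plain arithmetic: the budget fraction multiplied by the number of minutes in the window. The sketch below shows that calculation; the function name is hypothetical. Note that expressing a request-based budget in minutes assumes a total outage under roughly uniform traffic, so treat the result as an intuition aid rather than an exact allowance.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed minutes of full unavailability in the window for a given SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


# Compare a few common targets over a 30-day window.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.2f} min / 30 days")
```

Each additional "nine" cuts the budget by a factor of ten, which is why the choice of target has such a large effect on how much room is left for deployments and experiments.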

Error Budget & Burn Rate¶

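Error budget = 1 - SLO, so a 99.9% SLO leaves a budget of 0.1% of requests. Burn rate is the observed error ratio divided by the error budget: a burn rate of 1 spends exactly the whole budget over the SLO window, and a burn rate of 2 spends it in half that time.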
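As a rough illustration of how burn rate relates to the error budget, the sketch below divides an observed error ratio by the budget and estimates how long a full budget would last at that pace. The function names and the example error ratio are assumptions made for this example.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days * 24 / rate


# Hypothetical observation: 1.44% of requests are failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.0144, slo=0.999)
print(round(rate, 1))                       # burn rate
print(round(hours_to_exhaustion(rate), 1))  # hours until a 30-day budget is spent
```

With these inputs the burn rate comes out at about 14.4, meaning a 30-day budget would be gone in roughly 50 hours if nothing changes.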

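# Multi-window burn rate alert
# 0.001 is the error budget (1 - 0.999); 14.4 is the burn rate threshold.
# Both windows must exceed the threshold, so the alert stops firing
# shortly after the errors stop.
# Assumes recording rules sli:error_ratio:rate1h and sli:error_ratio:rate5m exist.
- alert: HighErrorBudgetBurn
  expr: |
    (
      sli:error_ratio:rate1h > (14.4 * 0.001)
      and
      sli:error_ratio:rate5m > (14.4 * 0.001)
    )
  labels:
    severity: critical
  annotations:
    summary: "Error budget burn rate above 14.4x (1h and 5m windows)"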
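The factor 14.4 in the alert above corresponds to spending 2% of a 30-day budget within 1 hour, since 0.02 × 720 hours / 1 hour = 14.4. The Python sketch below reproduces that arithmetic and the two-window condition; the function names and sample error ratios are hypothetical, and the 2% figure is the commonly used choice for this window rather than something stated in the rule itself.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends budget_fraction of the budget in alert_window_hours."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


def should_page(error_ratio_1h: float, error_ratio_5m: float,
                slo: float = 0.999, factor: float = 14.4) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    threshold = factor * (1.0 - slo)
    return error_ratio_1h > threshold and error_ratio_5m > threshold


print(burn_rate_threshold(0.02, 1))  # burn rate for 2% of the budget in 1 hour
print(should_page(0.02, 0.03))       # both windows are burning fast
print(should_page(0.02, 0.001))      # short window has recovered
```

The long window shows that a meaningful share of the budget has already been spent, while the short window confirms the problem is still happening, so the condition clears quickly once errors subside.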

Error Budget Policy¶

More than 50% of the budget remaining: deploy new features, experiment
Less than 50% of the budget remaining: increased caution
Budget exhausted: feature freeze, focus on stability

The error budget policy is an agreement between the SRE and product teams.
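The policy tiers above can be sketched as a small decision function. This is only an illustration under stated assumptions: the 50% boundary is read as "less than half of the budget remaining", the function names are hypothetical, and the request counts are made up. A real policy would use whatever thresholds the teams have agreed on.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total == 0:
        return 1.0  # assumption: no traffic means nothing has been spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to one of the three policy tiers."""
    if remaining <= 0.0:
        return "feature freeze, focus on stability"
    if remaining < 0.5:
        return "increased caution"
    return "deploy new features, experiment"


# Hypothetical window: 600 failed requests out of 1,000,000 against a 99.9% SLO.
remaining = budget_remaining(slo=0.999, good=999_400, total=1_000_000)
print(policy_action(remaining))
```

In this example 600 of the roughly 1,000 allowed failures have been used, leaving about 40% of the budget, which falls into the "increased caution" tier.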

Summary¶

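SLIs define what to measure, SLOs set the target, and the error budget turns that target into decisions. Burn rate alerting flags a budget that is being spent too fast before it is exhausted, which moves monitoring from reactive to proactive.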


CORE SYSTEMS team
