Observability & SRE
Monitoring tells you THAT something is broken. Observability tells you WHY.
Three pillars of observability + SRE processes. SLO/SLI, error budgets, incident management, blameless post-mortems.
Three Pillars of Observability
Metrics (Prometheus)
Numerical data over time: latency, error rate, throughput, saturation. Effective for alerting and trending. Four metric types: counter, gauge, histogram, summary.
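The four metric types differ in semantics. A minimal pure-Python sketch of the first three (this is not the real `prometheus_client` API; bucket boundaries and values are illustrative — a summary, which tracks quantiles client-side, is omitted for brevity):

```python
import bisect

# Minimal sketch of Prometheus metric-type semantics
# (illustrative; not the real prometheus_client library).

class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self): self.value = 0.0
    def inc(self, n=1.0): self.value += n

class Gauge:
    """Value that can go up and down, e.g. requests in flight."""
    def __init__(self): self.value = 0.0
    def set(self, v): self.value = v

class Histogram:
    """Counts observations per bucket; buckets are upper bounds ('le')."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last bucket = +Inf
    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1

requests = Counter(); requests.inc()           # counter only goes up
in_flight = Gauge(); in_flight.set(3)          # gauge moves both ways
latency = Histogram([0.05, 0.1, 0.2, 0.5, 1.0])
latency.observe(0.12)                          # lands in the le=0.2 bucket
```

Histograms are what make latency percentile queries (and therefore latency SLIs) cheap to compute server-side.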
Logs (Loki / Elasticsearch)
Textual event records. Structured logging (JSON) with context (trace ID, user ID, request ID). Correlation with traces for root cause analysis.
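A structured log line might look like this sketch (field names such as `trace_id` and `request_id` are illustrative conventions, not a fixed schema):

```python
import json
import uuid

# Sketch of structured (JSON) logging with correlation context.
# In a real service, trace_id is propagated from the incoming request
# headers rather than generated locally.
def log_event(level, message, **context):
    line = json.dumps({"level": level, "message": message, **context})
    print(line)   # a real setup would emit via a logging handler
    return line

trace_id = uuid.uuid4().hex
entry = log_event("error", "payment failed",
                  trace_id=trace_id, user_id=42, request_id="req-123")
```

Because every line is valid JSON carrying the same correlation fields, Loki or Elasticsearch can filter by `trace_id` and jump straight to the matching trace.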
Traces (Jaeger / Tempo)
Distributed trace across the entire request path. You see how long each service call takes, where the bottleneck is, where it fails. OpenTelemetry for vendor-agnostic instrumentation.
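Conceptually, a trace is a tree of timed spans. A toy sketch of that idea (this is not the OpenTelemetry API; service names and timings are invented):

```python
import time
from contextlib import contextmanager

# Toy span recorder illustrating what a distributed trace captures
# (not the real OpenTelemetry API; names and sleeps are invented).
SPANS = []

@contextmanager
def span(name, parent=None):
    start = time.perf_counter()
    try:
        yield name
    finally:
        SPANS.append({"name": name, "parent": parent,
                      "duration_s": time.perf_counter() - start})

with span("GET /checkout") as root:
    with span("auth-service.verify", parent=root):
        time.sleep(0.01)
    with span("orders-service.query", parent=root):
        time.sleep(0.03)                 # the bottleneck lives here

# The slowest child span points at the bottleneck.
slowest = max((s for s in SPANS if s["parent"]),
              key=lambda s: s["duration_s"])
```

Real instrumentation additionally propagates a trace ID across process boundaries, which is exactly what OpenTelemetry standardizes in a vendor-agnostic way.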
Integration
Click an alert (metric) → see the relevant logs → click through to the trace → see exactly which service call is slow. That correlation is observability, not three isolated tools.
SLO/SLI Framework
SLI (Service Level Indicator): a metric that measures quality from the user's perspective.
- Availability: ratio of successful requests
- Latency: ratio of requests under a threshold (e.g., P99 < 500 ms)
- Throughput: number of processed requests

SLO (Service Level Objective): the target for an SLI.
- "99.9% of requests are successful over a rolling 30-day window"
- "95% of requests have latency < 200 ms"

Error Budget: SLO = 99.9% → error budget = 0.1% ≈ 43 minutes/month.
- Error budget remaining → ship features, experiment, innovate
- Error budget approaching zero → stop feature work, fix reliability
- Error budget exhausted → freeze deploys, focus on stability
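The budget arithmetic, spelled out (the request counts are made-up inputs):

```python
# Error-budget arithmetic for a 99.9% availability SLO over 30 days.
slo = 0.999
window_minutes = 30 * 24 * 60                # 43,200 minutes in the window
budget_minutes = (1 - slo) * window_minutes  # ~43.2 minutes of allowed downtime

# Fraction of the budget already consumed, from observed request counts
# (counts below are made-up inputs):
total_requests, failed_requests = 1_000_000, 400
error_ratio = failed_requests / total_requests
budget_used = error_ratio / (1 - slo)        # 0.4 -> 40% of the budget spent
```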
Alerting Philosophy
Alert on symptoms, not causes:
- ✅ "API error rate > 1% in the last 5 minutes"
- ❌ "CPU > 80%" (CPU can be high while everything works fine)

Multi-window alerting:
- Fast burn: large spike in a short time → page immediately
- Slow burn: slow degradation → ticket, not a page

Page vs. Ticket:
- Page (PagerDuty/OpsGenie): requires immediate action, wakes people up
- Ticket: important but not urgent, handled during work hours
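One way to express the fast/slow burn split in code. The 14.4× and 6× thresholds follow a commonly cited multi-window burn-rate pattern; the observed error ratios are made-up inputs:

```python
# Multi-window burn-rate check for a 99.9% SLO (error budget = 0.1%).
# Thresholds (14.4x fast, 6x slow) follow a commonly used pattern;
# the observed error ratios below are made-up inputs.
error_budget = 0.001

def burn_rate(error_ratio):
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / error_budget

fast = burn_rate(0.02)    # 2% errors over the last hour
slow = burn_rate(0.004)   # 0.4% errors over the last six hours

page = fast >= 14.4       # fast burn: wake someone up now
ticket = slow >= 6        # slow burn: file a ticket for work hours
```

A burn rate of exactly 1× means the budget runs out precisely at the end of the SLO window; 20× means it would be gone in a day and a half.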
SRE Processes
Incident Management
- Detection — Automatic alert on SLO violation
- Triage — Severity classification (P1-P4)
- Response — On-call engineer, runbook, communication
- Mitigation — Rollback, feature flag off, scale up
- Resolution — Root cause fix, deploy
- Post-mortem — Blameless, action items, follow-up
Post-Mortem Template
- Timeline (what happened, chronologically)
- Impact (how many users, for how long)
- Root cause (technical cause)
- Contributing factors (what enabled this to happen)
- Action items (what we’ll do to prevent recurrence)
- Lessons learned
No blame. Goal: systemic improvement, not finding culprits.
Frequently Asked Questions
Monitoring tells you the API is slow. Observability shows you the specific trace: query on the orders table takes 8s due to a missing index. Fix in 5 minutes instead of 5 hours.
Open-source stack (Grafana, Prometheus, Loki, Jaeger): 4-6 weeks implementation. SaaS (Datadog, New Relic): faster setup, higher ongoing cost. We decide based on budget and team capability.
Not necessarily a dedicated SRE team. SRE principles (SLO/SLI, error budgets, post-mortems) can be adopted by any engineering team. Start with principles, not organizational change.