Observability & SRE
Monitoring tells you THAT something is broken. Observability tells you WHY.
Three pillars of observability + SRE processes. SLO/SLI, error budgets, incident management, blameless post-mortems.
Three Pillars of Observability
Metrics (Prometheus)
Numerical data over time: latency, error rate, throughput, saturation. Effective for alerting and trending. Four metric types: counter, gauge, histogram, summary.
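The four metric types differ in semantics. A minimal pure-Python sketch of the first three (this is not the real `prometheus_client` API; bucket boundaries and values are illustrative — a summary, which tracks quantiles client-side, is omitted for brevity):

```python
import bisect

# Minimal sketch of Prometheus metric-type semantics
# (illustrative; not the real prometheus_client library).

class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self): self.value = 0.0
    def inc(self, n=1.0): self.value += n

class Gauge:
    """Value that can go up and down, e.g. requests in flight."""
    def __init__(self): self.value = 0.0
    def set(self, v): self.value = v

class Histogram:
    """Counts observations per bucket; buckets are upper bounds ('le')."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last bucket = +Inf
    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1

requests = Counter(); requests.inc()           # counter only goes up
in_flight = Gauge(); in_flight.set(3)          # gauge moves both ways
latency = Histogram([0.05, 0.1, 0.2, 0.5, 1.0])
latency.observe(0.12)                          # lands in the le=0.2 bucket
```

Histograms are what make latency percentile queries (and therefore latency SLIs) cheap to compute server-side.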
Logs (Loki / Elasticsearch)
Textual event records. Structured logging (JSON) with context (trace ID, user ID, request ID). Correlation with traces for root cause analysis.
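A structured log line might look like this sketch (field names such as `trace_id` and `request_id` are illustrative conventions, not a fixed schema):

```python
import json
import uuid

# Sketch of structured (JSON) logging with correlation context.
# In a real service, trace_id is propagated from the incoming request
# headers rather than generated locally.
def log_event(level, message, **context):
    line = json.dumps({"level": level, "message": message, **context})
    print(line)   # a real setup would emit via a logging handler
    return line

trace_id = uuid.uuid4().hex
entry = log_event("error", "payment failed",
                  trace_id=trace_id, user_id=42, request_id="req-123")
```

Because every line is valid JSON carrying the same correlation fields, Loki or Elasticsearch can filter by `trace_id` and jump straight to the matching trace.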
Traces (Jaeger / Tempo)
Distributed trace across the entire request path. You see how long each service call takes, where the bottleneck is, where it fails. OpenTelemetry for vendor-agnostic instrumentation.
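Conceptually, a trace is a tree of timed spans. A toy sketch of that idea (this is not the OpenTelemetry API; service names and timings are invented):

```python
import time
from contextlib import contextmanager

# Toy span recorder illustrating what a distributed trace captures
# (not the real OpenTelemetry API; names and sleeps are invented).
SPANS = []

@contextmanager
def span(name, parent=None):
    start = time.perf_counter()
    try:
        yield name
    finally:
        SPANS.append({"name": name, "parent": parent,
                      "duration_s": time.perf_counter() - start})

with span("GET /checkout") as root:
    with span("auth-service.verify", parent=root):
        time.sleep(0.01)
    with span("orders-service.query", parent=root):
        time.sleep(0.03)                 # the bottleneck lives here

# The slowest child span points at the bottleneck.
slowest = max((s for s in SPANS if s["parent"]),
              key=lambda s: s["duration_s"])
```

Real instrumentation additionally propagates a trace ID across process boundaries, which is exactly what OpenTelemetry standardizes in a vendor-agnostic way.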
Integration
Click an alert (metric) → see the relevant logs → click through to the trace → see exactly which service call is slow. That correlation is observability, not three isolated tools.
SLO/SLI Framework
SLI (Service Level Indicator): a metric that measures quality from the user's perspective.
- Availability: ratio of successful requests
- Latency: ratio of requests under a threshold (e.g., P99 < 500 ms)
- Throughput: number of processed requests

SLO (Service Level Objective): the target for an SLI.
- "99.9% of requests are successful over a rolling 30-day window"
- "95% of requests have latency < 200 ms"

Error Budget: SLO = 99.9% → error budget = 0.1% ≈ 43 minutes/month.
- Error budget remaining → ship features, experiment, innovate
- Error budget approaching zero → stop feature work, fix reliability
- Error budget exhausted → freeze deploys, focus on stability
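The budget arithmetic, spelled out (the request counts are made-up inputs):

```python
# Error-budget arithmetic for a 99.9% availability SLO over 30 days.
slo = 0.999
window_minutes = 30 * 24 * 60                # 43,200 minutes in the window
budget_minutes = (1 - slo) * window_minutes  # ~43.2 minutes of allowed downtime

# Fraction of the budget already consumed, from observed request counts
# (counts below are made-up inputs):
total_requests, failed_requests = 1_000_000, 400
error_ratio = failed_requests / total_requests
budget_used = error_ratio / (1 - slo)        # 0.4 -> 40% of the budget spent
```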
Alerting Philosophy
Alert on symptoms, not causes:
- ✅ "API error rate > 1% in the last 5 minutes"
- ❌ "CPU > 80%" (CPU can be high while everything works fine)

Multi-window alerting:
- Fast burn: large spike in a short time → page immediately
- Slow burn: slow degradation → ticket, not a page

Page vs. Ticket:
- Page (PagerDuty/OpsGenie): requires immediate action, wakes people up
- Ticket: important but not urgent, handled during work hours
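One way to express the fast/slow burn split in code. The 14.4× and 6× thresholds follow a commonly cited multi-window burn-rate pattern; the observed error ratios are made-up inputs:

```python
# Multi-window burn-rate check for a 99.9% SLO (error budget = 0.1%).
# Thresholds (14.4x fast, 6x slow) follow a commonly used pattern;
# the observed error ratios below are made-up inputs.
error_budget = 0.001

def burn_rate(error_ratio):
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / error_budget

fast = burn_rate(0.02)    # 2% errors over the last hour
slow = burn_rate(0.004)   # 0.4% errors over the last six hours

page = fast >= 14.4       # fast burn: wake someone up now
ticket = slow >= 6        # slow burn: file a ticket for work hours
```

A burn rate of exactly 1× means the budget runs out precisely at the end of the SLO window; 20× means it would be gone in a day and a half.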
SRE Processes
Incident Management
- Detection — Automatic alert on SLO violation
- Triage — Severity classification (P1-P4)
- Response — On-call engineer, runbook, communication
- Mitigation — Rollback, feature flag off, scale up
- Resolution — Root cause fix, deploy
- Post-mortem — Blameless, action items, follow-up
Post-Mortem Template
- Timeline (what happened, chronologically)
- Impact (how many users, for how long)
- Root cause (technical cause)
- Contributing factors (what enabled this to happen)
- Action items (what we’ll do to prevent recurrence)
- Lessons learned
No blame. Goal: systemic improvement, not finding culprits.
Frequently Asked Questions
Monitoring tells you the API is slow. Observability shows you the specific trace: query on the orders table takes 8s due to a missing index. Fix in 5 minutes instead of 5 hours.
Open-source stack (Grafana, Prometheus, Loki, Jaeger): 4-6 weeks implementation. SaaS (Datadog, New Relic): faster setup, higher ongoing cost. We decide based on budget and team capability.
Not necessarily a dedicated SRE team. SRE principles (SLO/SLI, error budgets, post-mortems) can be adopted by any engineering team. Start with principles, not organizational change.