
Observability & SRE

Monitoring tells you THAT something is wrong. Observability tells you WHY.

Three pillars of observability + SRE processes. SLO/SLI, error budgets, incident management, blameless post-mortems.

MTTD: <5 min
MTTR: <30 min
SLO compliance: >99.9%
False positive rate: <5%

Three Pillars of Observability

Metrics (Prometheus)

Numerical data over time. Latency, error rate, throughput, saturation. Effective for alerting and trending. Counter, gauge, histogram, summary.
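The four metric types have distinct semantics. A toy stdlib-only sketch of counter, gauge, and histogram behavior (in production you would use a real client library such as prometheus_client; names here are illustrative):

```python
# Toy sketch of Prometheus metric-type semantics (stdlib only).
import bisect

class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        self.value += amount

class Gauge:
    """Value that can go up and down, e.g. in-flight requests."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into per-bucket latency bins (seconds)."""
    def __init__(self, buckets=(0.1, 0.25, 0.5, 1.0, float("inf"))):
        self.buckets = sorted(buckets)
        self.counts = [0] * len(self.buckets)
        self.total = 0.0
    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
        self.total += v

requests = Counter()
latency = Histogram()
for seconds in (0.05, 0.2, 0.4, 2.0):
    requests.inc()
    latency.observe(seconds)
# requests.value == 4; the 2.0 s outlier lands in the +Inf bucket.
```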

Logs (Loki / Elasticsearch)

Textual event records. Structured logging (JSON) with context (trace ID, user ID, request ID). Correlation with traces for root cause analysis.
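A minimal JSON log formatter using only the stdlib logging module; the field names (trace_id, request_id) are an illustrative schema, not a fixed standard:

```python
# Structured (JSON) logging sketch with correlation context.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Context fields correlate this log line with a trace.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# `extra` attaches the correlation IDs to the log record:
logger.warning("slow query on orders table",
               extra={"trace_id": "abc123", "request_id": "r-42"})
```

Because every line is one JSON object carrying a trace_id, a log backend can jump straight from a log entry to the matching trace.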

Traces (Jaeger / Tempo)

Distributed trace across the entire request path. You see how long each service call takes, where the bottleneck is, where it fails. OpenTelemetry for vendor-agnostic instrumentation.
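A toy stdlib sketch of what spans in a trace record (real services would use the OpenTelemetry SDK; the `span` helper below is invented for illustration):

```python
# Toy span-timing sketch: each span records name, parent, duration.
import time
from contextlib import contextmanager

spans = []  # collected (name, parent, duration_seconds) tuples

@contextmanager
def span(name, parent=None):
    start = time.perf_counter()
    try:
        yield name
    finally:
        spans.append((name, parent, time.perf_counter() - start))

# One request fanning out to two downstream calls:
with span("GET /checkout") as root:
    with span("payment-service", parent=root):
        time.sleep(0.01)
    with span("orders-db query", parent=root):
        time.sleep(0.03)  # the bottleneck shows up as the longest child span

slowest_child = max((s for s in spans if s[1] is not None),
                    key=lambda s: s[2])
```

Comparing child-span durations is exactly how a trace view pinpoints the slow service call within a request.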

Integration

Click on alert (metric) → see relevant logs → click through to trace → see exactly which service call is slow. This is observability — not three isolated tools.

SLO/SLI Framework

SLI (Service Level Indicator): a metric that measures quality from the user's perspective.
  • Availability: ratio of successful requests
  • Latency: ratio of requests under a threshold (e.g., P99 < 500 ms)
  • Throughput: number of processed requests

SLO (Service Level Objective): a target for an SLI.
  • "99.9% of requests are successful over a rolling 30 days"
  • "95% of requests have latency < 200 ms"
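Computing the SLIs from raw request data and checking them against the SLO targets above (the sample data is made up for illustration):

```python
# SLI computation and SLO check, mirroring the targets above.
def availability_sli(results):
    """Ratio of successful requests (True = success)."""
    return sum(results) / len(results)

def latency_sli(latencies_ms, threshold_ms):
    """Ratio of requests faster than the threshold."""
    return sum(1 for l in latencies_ms if l < threshold_ms) / len(latencies_ms)

results = [True] * 998 + [False] * 2   # 99.8% success rate
latencies = [120] * 96 + [350] * 4     # 96% of requests under 200 ms

meets_availability_slo = availability_sli(results) >= 0.999   # False
meets_latency_slo = latency_sli(latencies, 200) >= 0.95       # True
```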

Error Budget: SLO = 99.9% → error budget = 0.1% ≈ 43 minutes/month.
  • Error budget remaining → ship features, experiment, innovate
  • Error budget approaching zero → stop features, fix reliability
  • Error budget exhausted → freeze deploys, focus on stability
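The error-budget arithmetic spelled out (a 99.9% SLO over a 30-day window leaves 0.1% of 43,200 minutes ≈ 43 minutes of allowed downtime):

```python
# Error-budget arithmetic for a 99.9% SLO over a rolling 30 days.
MINUTES_PER_30_DAYS = 30 * 24 * 60          # 43,200 minutes

slo = 0.999
error_budget_ratio = 1 - slo                # 0.1% of the window
budget_minutes = MINUTES_PER_30_DAYS * error_budget_ratio  # ~43.2 min

def budget_remaining(downtime_minutes):
    """Fraction of the monthly error budget still unspent."""
    return 1 - downtime_minutes / budget_minutes

# 10 minutes of downtime so far leaves roughly 77% of the budget.
remaining = budget_remaining(10)
```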

Alerting Philosophy

Alert on symptoms, not causes:
  • ✅ "API error rate > 1% in the last 5 minutes"
  • ❌ "CPU > 80%" (CPU can be high while everything works fine)

Multi-window alerting:
  • Fast burn: large spike in a short time → page immediately
  • Slow burn: slow degradation → ticket, not a page
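A simplified burn-rate sketch of this routing (burn rate = observed error ratio divided by the budget ratio; the 14.4x and 6x thresholds are common defaults from burn-rate alerting practice, not prescriptive values, and a production setup would evaluate each severity over two windows):

```python
# Burn-rate alert routing sketch for a 99.9% SLO.
BUDGET_RATIO = 1 - 0.999   # 0.1% error budget

def burn_rate(error_ratio):
    """How many times faster than allowed the budget is burning."""
    return error_ratio / BUDGET_RATIO

def route_alert(error_ratio_1h, error_ratio_6h):
    # Fast burn: budget burning ~14x too fast over 1 h -> page now.
    if burn_rate(error_ratio_1h) >= 14.4:
        return "page"
    # Slow burn: sustained ~6x burn over 6 h -> ticket for work hours.
    if burn_rate(error_ratio_6h) >= 6:
        return "ticket"
    return "none"
```

Example: a 2% error rate over the last hour burns the budget 20x too fast and pages immediately; a steady 0.7% over six hours only files a ticket.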

Page vs. Ticket:
  • Page (PagerDuty/OpsGenie): requires immediate action, wakes people up
  • Ticket: important but not urgent, handled during work hours

SRE Processes

Incident Management

  1. Detection — Automatic alert on SLO violation
  2. Triage — Severity classification (P1-P4)
  3. Response — On-call engineer, runbook, communication
  4. Mitigation — Rollback, feature flag off, scale up
  5. Resolution — Root cause fix, deploy
  6. Post-mortem — Blameless, action items, follow-up

Post-Mortem Template

  • Timeline (what happened, chronologically)
  • Impact (how many users, for how long)
  • Root cause (technical cause)
  • Contributing factors (what enabled this to happen)
  • Action items (what we’ll do to prevent recurrence)
  • Lessons learned

No blame. Goal: systemic improvement, not finding culprits.

Frequently Asked Questions

Monitoring tells you the API is slow. Observability shows you the specific trace: query on the orders table takes 8s due to a missing index. Fix in 5 minutes instead of 5 hours.

Open-source stack (Grafana, Prometheus, Loki, Jaeger): 4-6 weeks implementation. SaaS (Datadog, New Relic): faster setup, higher ongoing cost. We decide based on budget and team capability.

Not necessarily a dedicated SRE team. SRE principles (SLO/SLI, error budgets, post-mortems) can be adopted by any engineering team. Start with principles, not organizational change.

Have a project?

Let's talk about it.

Schedule a meeting