What you don’t measure, you can’t manage. Here is a complete guide to monitoring.
Three Pillars of Observability¶
- Metrics — numerical data (CPU, latency, error rate)
- Logs — text records of events
- Traces — the path of a request through the system
Metrics — Prometheus¶
- Counter: monotonically increasing value (requests_total)
- Gauge: current value that can go up or down (temperature)
- Histogram: distribution of observations in buckets (request_duration_seconds)
- Summary: percentiles computed on the client side
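To make the difference between the metric types concrete, here is a minimal pure-Python sketch of their semantics. The class names `Counter`, `Gauge`, and `Histogram` and the bucket bounds are illustrative assumptions; this is not the Prometheus client library API, and Summary is left out.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing value; it never goes down."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Current value that can go up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets identified by their upper bound."""
    bounds: list[float]                       # sorted upper bounds (assumed, in seconds)
    counts: list[int] = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        # One extra slot acts as the implicit +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list[int]:
        # Cumulative counts: each bucket includes all smaller buckets.
        out, running = [], 0
        for count in self.counts:
            running += count
            out.append(running)
        return out


requests_total = Counter()
requests_total.inc()
requests_total.inc()

temperature = Gauge()
temperature.set(21.5)

request_duration_seconds = Histogram(bounds=[0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.3, 2.0):
    request_duration_seconds.observe(duration)
```

The cumulative bucket counts are what make it possible to estimate percentiles on the server side later, which is the main practical difference from a Summary.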
Logs — Loki¶
Emit structured JSON logs, ship them to central storage, then query them and alert on them.
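The first step of that pipeline, emitting structured JSON, might look like the sketch below. It uses only the Python standard library; the `JsonFormatter` class, the `fields` key, and the `checkout` logger name are hypothetical choices for illustration. Shipping the lines to Loki is left to an agent such as Promtail.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# An in-memory stream stands in for stdout so the result can be inspected.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("payment failed", extra={"fields": {"order_id": "A-1", "status": 502}})
line = json.loads(buffer.getvalue())
```

Because every record is a flat JSON object, a query can filter on `status` or `order_id` directly instead of parsing free text.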
Traces — Jaeger/Tempo¶
Distributed tracing tracks a request across all microservices. Essential for debugging distributed systems.
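As a rough illustration of how a trace follows a request between services, the sketch below passes a trace ID and parent span ID in request headers. The `Span` type and the `x-trace-id` / `x-parent-span-id` header names are hypothetical; real tracers typically use a standard such as the W3C `traceparent` header and handle timing and export as well.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str              # shared by every span of one request
    span_id: str               # unique per unit of work
    parent_id: Optional[str]   # None for the root span
    service: str


def start_span(service: str) -> Span:
    """Start a new trace at the edge of the system."""
    return Span(secrets.token_hex(16), secrets.token_hex(8), None, service)


def inject(span: Span) -> dict[str, str]:
    """Headers the calling service attaches to its outgoing request."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def extract(service: str, headers: dict[str, str]) -> Span:
    """Continue the same trace on the receiving service."""
    return Span(
        trace_id=headers["x-trace-id"],
        span_id=secrets.token_hex(8),
        parent_id=headers["x-parent-span-id"],
        service=service,
    )


root = start_span("gateway")
child = extract("orders", inject(root))
```

Both spans share one `trace_id`, and the `parent_id` link lets a backend such as Tempo or Jaeger rebuild the call tree.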
SLI/SLO/SLA¶
- SLI (Indicator) — what you measure (P99 latency, availability)
- SLO (Objective) — target (99.9% availability)
- SLA (Agreement) — contract with the client (99.9% + penalties)
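The relationship between an SLI and an SLO can be sketched in a few lines: compute the indicator from raw data, then compare it with the target. The function names and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math


def availability(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    return good / total if total else 1.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for P99 latency."""
    if not values:
        raise ValueError("no observations")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(sli: float, objective: float) -> bool:
    """SLO check for indicators where higher is better."""
    return sli >= objective


sli = availability(good=9995, total=10000)
ok = meets_slo(sli, objective=0.999)
p99 = percentile([float(ms) for ms in range(1, 101)], 99)
```

An SLA would add contractual consequences on top of the same check, which is not something code alone can express.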
Error Budgets¶
A 99.9% SLO leaves 0.1% of the time for failure: about 43 minutes of downtime in a 30-day month. That allowance is the error budget. If you exhaust it, stop shipping new features and fix reliability.
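The arithmetic behind the error budget is short enough to write out. This sketch assumes a 30-day window and an SLO strictly below 100%; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)  # assumes slo < 1.0
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999)        # 0.1% of 43,200 minutes
remaining = budget_remaining(0.999, 50.0)   # 50 minutes of downtime so far
freeze_features = remaining <= 0            # budget spent: prioritize reliability
```

Tracking the remaining fraction rather than raw minutes makes the same rule reusable across services with different SLOs.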
Recommended Stack¶
- Metrics: Prometheus + Grafana
- Logs: Loki + Promtail + Grafana
- Traces: Tempo or Jaeger
- Alerting: Alertmanager + PagerDuty/OpsGenie
- All-in-one: Grafana Cloud (free tier)
Principle¶
Monitor symptoms (error rate, latency), not causes (CPU). Alert on what affects users.
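One way to read this principle in code: the paging decision looks only at user-facing symptoms, while CPU is recorded for diagnosis but never triggers a page. The `Snapshot` type and both thresholds are hypothetical values, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float        # fraction of failed requests (symptom)
    p99_latency_s: float     # P99 latency in seconds (symptom)
    cpu_utilization: float   # 0.0 to 1.0 (cause, useful for diagnosis only)


def should_page(s: Snapshot, max_error_rate: float = 0.01, max_p99_s: float = 1.0) -> bool:
    """Page only when users are affected; CPU is deliberately ignored."""
    return s.error_rate > max_error_rate or s.p99_latency_s > max_p99_s


busy_but_healthy = Snapshot(error_rate=0.001, p99_latency_s=0.2, cpu_utilization=0.95)
idle_but_failing = Snapshot(error_rate=0.08, p99_latency_s=0.2, cpu_utilization=0.10)
```

Here `busy_but_healthy` does not page despite 95% CPU, while `idle_but_failing` does, because only the second one is hurting users.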