What you don’t measure, you can’t manage. Here is a complete guide to monitoring.

Three Pillars of Observability¶

Metrics — numerical data (CPU, latency, error rate)
Logs — text records of events
Traces — the path of a request through the system

Metrics — Prometheus¶

Counter — monotonically increasing (requests_total)
Gauge — current value (temperature)
Histogram — distribution (request_duration_seconds)
Summary — percentiles
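
To make the difference between the metric types concrete, here is a minimal sketch in plain Python. It is a toy model of the semantics, not the Prometheus client library: the class names, the bucket bounds, and the sample durations are all assumptions chosen for illustration. Summary is left out because it needs a streaming quantile estimator.

```python
import bisect


class Counter:
    """Monotonically increasing value; it only ever goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Sorts observations into buckets and reports cumulative counts."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # First bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.total += value

    def cumulative(self):
        running, out = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            out.append((bound, running))
        return out


# Hypothetical metrics named after the examples in the list above.
requests_total = Counter()
temperature = Gauge()
request_duration_seconds = Histogram([0.1, 0.5, 1.0])

for duration in (0.05, 0.3, 0.3, 2.0):
    requests_total.inc()
    request_duration_seconds.observe(duration)
temperature.set(21.5)
```

The cumulative bucket counts are what make histograms cheap to aggregate across instances: counts for the same bound can simply be added together.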

Logs — Loki¶

Structured JSON logs → central storage → query and alerting.
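
The first step of that pipeline, emitting structured JSON, might look like the sketch below. It uses only the Python standard library; the logger name, the field names, and the `extra={"fields": ...}` convention are assumptions made for this example, and shipping the lines to central storage is left to an agent such as Promtail.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: structured context arrives via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream, name="checkout"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# An in-memory stream keeps the example inspectable; a service would
# normally write to sys.stdout and let the log shipper pick it up.
stream = io.StringIO()
log = build_logger(stream)
log.info("payment failed", extra={"fields": {"order_id": "A-1001", "status": 502}})
line = stream.getvalue().strip()
```

Because every line is valid JSON, the storage layer can filter on fields such as `status` instead of parsing free text.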

Traces — Jaeger/Tempo¶

Distributed tracing tracks a request across all microservices. Essential for debugging distributed systems.
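
What ties the services together is a trace ID that travels with the request. The sketch below imitates that with the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). The function names and the two-service scenario are hypothetical, and a real system would use a tracing SDK rather than hand-built headers.

```python
import secrets


def new_traceparent():
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "flags": flags,
    }


def child_traceparent(incoming):
    """Keep the trace ID, mint a new span ID for the downstream call."""
    parent = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)
    return f"{parent['version']}-{parent['trace_id']}-{new_span_id}-{parent['flags']}"


# Hypothetical flow: a gateway receives the request, then calls an orders service.
gateway_header = new_traceparent()
orders_header = child_traceparent(gateway_header)
```

Both headers share one trace ID, which is what lets a backend such as Jaeger or Tempo stitch the spans into a single request path.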

SLI/SLO/SLA¶

SLI (Indicator) — what you measure (P99 latency, availability)
SLO (Objective) — target (99.9% availability)
SLA (Agreement) — contract with the client (99.9% + penalties)
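
As a rough illustration of how an SLI is compared against an SLO, the sketch below computes availability and P99 latency from made-up numbers. The request counts, the latency samples, and the nearest-rank percentile method are assumptions for the example; production systems usually derive these from histogram metrics.

```python
def availability(successes, total):
    """Fraction of requests that succeeded."""
    return successes / total if total else 1.0


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical measurements for one evaluation window.
latencies_ms = list(range(1, 101))  # 1 ms .. 100 ms
sli_availability = availability(successes=9_995, total=10_000)
sli_p99 = percentile(latencies_ms, 99)

slo_availability = 0.999  # the 99.9% objective from the list above
slo_met = sli_availability >= slo_availability
```

Here the measured availability is 99.95%, so the 99.9% objective is met with a small margin.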

Error Budgets¶

A 99.9% SLO allows 0.1% downtime, which is about 43 minutes in a 30-day month. That allowance is your error budget. Once it is exhausted, pause new feature work and fix reliability.
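
The arithmetic behind that figure is small enough to sketch directly. The function names and the 30 minutes of recorded downtime are hypothetical; the 30-day window is an assumption that matches the 43-minute figure.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)  # roughly 43.2 minutes
remaining = budget_remaining(0.999, downtime_minutes=30)

# Policy from the text: once the budget is gone, stop shipping features.
freeze_features = remaining <= 0
```

With 30 minutes already spent, about 13 minutes remain, so feature work can continue for now.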

Recommended Stack¶

Metrics: Prometheus + Grafana
Logs: Loki + Promtail + Grafana
Traces: Tempo or Jaeger
Alerting: Alertmanager + PagerDuty/OpsGenie
All-in-one: Grafana Cloud (free tier)

Principle¶

Monitor symptoms (error rate, latency), not causes (CPU). Alert on what affects users.
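
One way to picture the principle is a paging rule that only looks at user-facing signals. In the sketch below, the thresholds, field names, and sample snapshots are all hypothetical; CPU utilization is recorded but deliberately plays no part in the paging decision.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float       # fraction of failed requests (symptom)
    p99_latency_ms: float   # tail latency seen by users (symptom)
    cpu_utilization: float  # cause-level signal, useful on dashboards only


def page_reasons(snapshot, max_error_rate=0.01, max_p99_ms=500.0):
    """Return the user-facing reasons to page; an empty list means no page."""
    reasons = []
    if snapshot.error_rate > max_error_rate:
        reasons.append("error rate above threshold")
    if snapshot.p99_latency_ms > max_p99_ms:
        reasons.append("p99 latency above threshold")
    return reasons


busy_but_healthy = Snapshot(error_rate=0.001, p99_latency_ms=180.0, cpu_utilization=0.95)
idle_but_failing = Snapshot(error_rate=0.08, p99_latency_ms=220.0, cpu_utilization=0.20)
```

The first snapshot has a hot CPU but happy users, so nobody is paged. The second has an idle CPU and an 8% error rate, so it pages.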

Tags: monitoring, observability, devops

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.
