The Complete Guide to Monitoring

16. 04. 2025 1 min read intermediate

What you don’t measure, you can’t manage. Here is a complete guide to monitoring.

Three Pillars of Observability

  • Metrics — numerical data (CPU, latency, error rate)
  • Logs — text records of events
  • Traces — the path of a request through the system

Metrics — Prometheus

  • Counter — monotonically increasing (requests_total)
  • Gauge — current value (temperature)
  • Histogram — distribution (request_duration_seconds)
  • Summary — percentiles
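
To make the difference between the metric types concrete, here is a minimal pure-Python sketch of their semantics. It is not the Prometheus client library; the class names, method names, and bucket bounds are illustrative assumptions. Summary is left out because it computes quantiles on the client side and needs a more involved algorithm.

```python
import bisect
from typing import List


class Counter:
    """Monotonically increasing value, e.g. requests_total."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """Current value that may go up or down, e.g. temperature."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Sorts observations into buckets, e.g. request_duration_seconds."""

    def __init__(self, bounds: List[float]) -> None:
        self.bounds = sorted(bounds)
        # One slot per upper bound plus a final overflow (+Inf) slot.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.total += value

    def cumulative(self) -> List[int]:
        """Cumulative counts: each bucket includes all smaller ones."""
        running, result = 0, []
        for bucket in self.bucket_counts:
            running += bucket
            result.append(running)
        return result


# Illustrative usage with made-up values.
requests_total = Counter()
requests_total.inc()
requests_total.inc()

temperature = Gauge()
temperature.set(21.5)

request_duration_seconds = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.2, 0.2, 0.7, 3.0):
    request_duration_seconds.observe(duration)
```

The cumulative bucket counts are what make it possible to estimate percentiles later on the query side, which is the main practical reason to prefer a histogram over a plain average.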

Logs — Loki

Emit structured JSON logs, ship them to central storage, then query and alert on them.
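
As a rough illustration of the first step, the sketch below emits one JSON object per log line using only the Python standard library. The logger name `checkout` and the fields `request_id` and `duration_ms` are hypothetical. The example writes to an in-memory buffer so it can be inspected; in practice the handler would typically write to stdout or a file that a log shipper picks up.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        # The field names here are hypothetical examples.
        for key in ("request_id", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment accepted", extra={"request_id": "abc123", "duration_ms": 42})

# Each line is machine-parseable, so fields can be filtered and aggregated.
entry = json.loads(buffer.getvalue())
```

Keeping identifiers such as a request ID in dedicated fields, rather than inside the message text, is what makes the later query and alerting steps straightforward.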

Traces — Jaeger/Tempo

Distributed tracing tracks a request across all microservices. Essential for debugging distributed systems.
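
The core mechanism is context propagation: every service reuses the incoming trace ID and records the caller's span as its parent. The sketch below shows only that idea in plain Python. The header names `x-trace-id` and `x-parent-span-id` and the service names are hypothetical, and this is not the Jaeger or Tempo API; real systems usually rely on a tracing library and a standard header format.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    service: str


def start_span(service: str, headers: Optional[Dict[str, str]] = None) -> Span:
    """Continue the caller's trace if headers carry one, else start a new one."""
    headers = headers or {}
    trace_id = headers.get("x-trace-id") or secrets.token_hex(16)
    parent_id = headers.get("x-parent-span-id")
    return Span(trace_id, secrets.token_hex(8), parent_id, service)


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Headers a service attaches to its downstream calls (hypothetical names)."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


# A request passing through three hypothetical services.
gateway = start_span("gateway")
orders = start_span("orders", outgoing_headers(gateway))
payments = start_span("payments", outgoing_headers(orders))
```

Because all three spans share one trace ID and each points at its parent, a tracing backend can reassemble them into a single request tree.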

SLI/SLO/SLA

  • SLI (Indicator) — what you measure (P99 latency, availability)
  • SLO (Objective) — target (99.9% availability)
  • SLA (Agreement) — contract with the client (99.9% + penalties)
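
The relationship between an SLI and an SLO can be sketched as a measurement compared against a target. The example below computes the two indicators named above from made-up sample data. The 200 ms latency target is an assumption added for illustration, and the percentile uses the simple nearest-rank method rather than any particular monitoring tool's algorithm.

```python
import math
from typing import Sequence


def availability(successes: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    return successes / total if total else 1.0


def percentile(values: Sequence[float], pct: int) -> float:
    """SLI: nearest-rank percentile, e.g. pct=99 for P99 latency."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up measurements for one evaluation window.
latencies_ms = list(range(1, 101))  # 1 ms .. 100 ms
sli_availability = availability(successes=9995, total=10000)
sli_p99_ms = percentile(latencies_ms, 99)

# SLO targets: 99.9% comes from the text, 200 ms is an assumed example.
SLO_AVAILABILITY = 0.999
SLO_P99_MS = 200

availability_ok = sli_availability >= SLO_AVAILABILITY
latency_ok = sli_p99_ms <= SLO_P99_MS
```

An SLA would sit on top of this as a contractual commitment, usually set no stricter than the internal SLO so that the team notices a problem before a penalty applies.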

Error Budgets

A 99.9% SLO allows roughly 43 minutes of downtime in a 30-day month (0.1% of 43,200 minutes). That allowance is your error budget. If you exhaust it, pause new feature work and fix reliability.
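
The arithmetic behind the error budget is short enough to write out. This is a small sketch assuming a 30-day window; the function names and the 10 minutes of recorded downtime are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after the downtime already incurred in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)             # about 43.2 minutes
remaining = remaining_budget_minutes(0.999, 10)  # about 33.2 minutes

# Once nothing is left, the policy in the text applies: stop feature work.
freeze_features = remaining <= 0
```

Tightening the SLO shrinks the budget quickly: under the same assumptions, 99.99% leaves only about 4.3 minutes per 30 days.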

Recommended Stack

  • Metrics: Prometheus + Grafana
  • Logs: Loki + Promtail + Grafana
  • Traces: Tempo or Jaeger
  • Alerting: Alertmanager + PagerDuty/OpsGenie
  • All-in-one: Grafana Cloud (free tier)

Principle

Alert on symptoms (error rate, latency), not causes (CPU). Cause metrics are still useful for diagnosis, but pages should fire only on what affects users.

CORE SYSTEMS team
