Nagios served us well for years. Then came containers, microservices and dynamic infrastructure — and Nagios started choking. The transition to Prometheus with Grafana wasn’t just a tool replacement. It was a completely different way of thinking about monitoring.
Why Prometheus¶
Nagios, Zabbix, Icinga — they all revolve around a statically defined inventory: an agent on each named server reports metrics to a central instance, or the central server polls a fixed list of hosts and checks. Either way, the tool has to know every machine in advance. That works great as long as you have static infrastructure with stable hostnames. Once containers arrive that live for minutes and don't have stable hostnames, the model collapses.
Prometheus works the other way around — a pull model. Prometheus itself goes to the HTTP endpoints of applications and collects metrics. Thanks to service discovery, it automatically finds new instances in Kubernetes, Consul or EC2. A container starts, Prometheus discovers it, starts scraping. The container disappears, Prometheus stops. No manual configuration, no agents.
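For illustration, the simplest possible scrape configuration is nothing more than a static list of targets to pull from; the job name, ports and interval below are placeholders, not values from our setup:
# prometheus.yml: the simplest case, a static list of targets to pull from
scrape_configs:
  - job_name: 'api'
    scrape_interval: 15s          # how often Prometheus pulls /metrics
    metrics_path: /metrics        # the default, shown for clarity
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080']
The static_configs block is exactly the part that service discovery replaces; the Kubernetes variant is shown further down.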
Instrumentation — First-Hand Metrics¶
The key concept of Prometheus is that applications themselves export metrics. No external agent that parses logs or monitors ports. The application exposes a /metrics endpoint with current values — request count, latency, error rate, queue size.
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 48293
http_requests_total{method="POST",path="/api/orders",status="201"} 7841
http_requests_total{method="GET",path="/api/users",status="500"} 23
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 41209
http_request_duration_seconds_bucket{le="0.5"} 47891
http_request_duration_seconds_bucket{le="1.0"} 48102
http_request_duration_seconds_bucket{le="+Inf"} 48173
http_request_duration_seconds_sum 9127.4
http_request_duration_seconds_count 48173
Client libraries exist for Go, Java, Python, Node.js, .NET — practically anything. Instrumenting one service takes an hour. And from that moment on, you have metrics that truly correspond to what’s happening in the application.
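As a sketch of what that hour of work amounts to, here is the official Python client (prometheus_client). The metric names mirror the exposition above; the port and the do_work handler are illustrative placeholders:
from prometheus_client import Counter, Histogram, start_http_server

# Metric objects are created once, at import time, and shared by all requests
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=[0.1, 0.5, 1.0],
)

def do_work(method, path):
    # stand-in for the real request handling
    return 200

def handle_request(method, path):
    with LATENCY.time():                        # records the duration into the histogram
        status = do_work(method, path)          # placeholder for the real handler
    REQUESTS.labels(method=method, path=path, status=str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(8000)                     # serves /metrics on port 8000
    # ... start the application's own server / main loop here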
PromQL — The Language That Will Change Your Monitoring¶
Most monitoring systems can display “CPU is at 87%”. Prometheus can answer questions like “what is the 95th percentile latency for the last 5 minutes for service X, filtered by HTTP method”.
# Error rate for the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Top 5 slowest endpoints
topk(5,
  avg by (path) (
    rate(http_request_duration_seconds_sum[5m])
      / rate(http_request_duration_seconds_count[5m])
  )
)
PromQL has a learning curve. But once you understand it, you won’t want anything else. Aggregations, filters, rate calculations — everything in one query. And you use the same query in dashboards and alert rules.
Grafana — Visualization That Makes Sense¶
Prometheus has its own UI, but it’s spartan. Grafana is the visualization layer that turns Prometheus data into readable dashboards. Graphs, heatmaps, tables, single-stat panels — everything configurable, everything shareable.
Our setup: one dashboard per service (RED metrics — Rate, Errors, Duration), one infrastructure dashboard (CPU, memory, disk, network) and one “war room” dashboard for the on-call engineer with an overview of the entire system. We version dashboards in Git as JSON — provisioning via Grafana API during deployment.
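We push the JSON through the Grafana API, but if file-based provisioning fits your deployment better, the provider definition is a single small YAML file; paths and folder names below are illustrative:
# /etc/grafana/provisioning/dashboards/services.yaml
apiVersion: 1
providers:
  - name: services            # arbitrary provider name
    folder: Services          # Grafana folder the dashboards land in
    type: file
    disableDeletion: true     # keep dashboards even if a JSON file disappears
    options:
      path: /var/lib/grafana/dashboards   # directory holding the JSON from Git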
Alerting — Alertmanager in Practice¶
We define alert rules in Prometheus config. When a condition holds for longer than a specified time, Prometheus sends an alert to Alertmanager. It handles deduplication, grouping, silencing and routing — alerts go to Slack, PagerDuty or email based on severity.
groups:
  - name: sla
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% on {{ $labels.instance }}"
Key lesson: alert on symptoms, not causes. Don’t alert “CPU > 90%”. Alert “p99 latency > 2s” or “error rate > 5%”. CPU can be at 90% and everything works fine. But when users wait 5 seconds for a response, that’s a problem.
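A symptom-based rule in the same format slots straight into the rules list above; the 2-second threshold and the 10-minute window below are placeholders for whatever your SLO actually promises:
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency above 2s"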
Kubernetes Service Discovery¶
In Kubernetes environments, Prometheus automatically discovers pods through the Kubernetes API. Just add annotations to the pod and Prometheus starts scraping. No manual configuration, no maintaining lists of targets.
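For reference, the classic annotation-driven setup looks roughly like this. The prometheus.io/* annotations are a convention wired up through relabel_configs, not something built into Kubernetes itself:
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # honour a custom metrics path if the pod declares one
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # point the scrape at the port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__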
Node Exporter on each node collects system metrics. cAdvisor (integrated in kubelet) provides container metrics. kube-state-metrics exports the state of Kubernetes objects — deployments, replica sets, pods. Together you have a complete picture of the entire cluster.
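One example of why kube-state-metrics earns its place: a single expression that flags deployments which are not fully rolled out (both metric names come from kube-state-metrics):
# deployments where available replicas lag behind the desired count
kube_deployment_spec_replicas != kube_deployment_status_replicas_available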
What Surprised Us¶
Retention — Prometheus isn’t long-term storage. By default it keeps data for 15 days. For long-term trends you need remote storage (Thanos, Cortex, VictoriaMetrics). We started with 30-day retention and for capacity planning we send aggregated data to InfluxDB.
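As a sketch of what the knobs look like: retention is a startup flag, remote write is a block in prometheus.yml. The InfluxDB endpoint and the metric filter below are placeholders, and the exact write path depends on the InfluxDB version:
# startup flag: keep 30 days locally instead of the default 15
--storage.tsdb.retention.time=30d

# prometheus.yml: forward a filtered subset of samples to long-term storage
remote_write:
  - url: http://influxdb:8086/api/v1/prom/write?db=prometheus    # placeholder endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'http_.*|node_.*'        # forward only what capacity planning needs
        action: keep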
High availability — Prometheus doesn’t have native clustering. Solution: two independent instances scrape the same targets. Alertmanager has its own clustering for deduplication. It works, but it’s a different model than “one cluster that takes care of everything”.
Cardinality — too many unique label combinations will kill Prometheus. One team added user_id as a label on HTTP metrics. 50,000 unique users × 10 metrics × 3 HTTP methods = memory explosion. Label values must be low-cardinality.
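A quick way to catch this before it becomes an incident is to ask Prometheus itself which metric names have the most series; note that this query is expensive on a large server, so run it sparingly:
# top 10 metric names by number of active time series
topk(10, count by (__name__) ({__name__=~".+"}))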
Monitoring as Culture¶
Prometheus + Grafana aren’t just tools — they’re catalysts for cultural change. Developers add metrics to code because they want to see how their service behaves in production. The ops team has dashboards instead of manual SSH. The on-call engineer sees problems before customers call. Start with one service, instrument it, create a dashboard. The rest will come naturally.