Observability Stack
We don't say 'it works'. We show data.
We implement an observability stack built on OpenTelemetry, Grafana and Prometheus. Metrics, logs and traces in one place — from code to infrastructure.
Why observability
Monitoring says the server has 95% CPU. Observability says it’s caused by a single SQL query on the /api/orders endpoint that has been scanning the entire table instead of using an index since yesterday’s deploy.
The difference? Monitoring detects symptoms. Observability reveals causes. In a distributed system with dozens of microservices, message queues and databases, it’s not enough to know that something is broken. You need to know why, where and since when — in minutes, not hours.
Three pillars
Observability rests on three types of telemetry data:
Metrics — numerical values over time. CPU, memory, request count, error rate, latency. Fast to query, cheap to store, ideal for dashboards and alerts. Prometheus is the de facto standard.
Logs — structured records of events. A request arrived, a query ran, an error occurred. Context that metrics lack. Loki aggregates logs from your entire infrastructure in one place.
Traces — a request’s path through the entire system. User clicked → API gateway → auth service → order service → database → response. Distributed tracing shows where the request spent time and where it got lost. Tempo/Jaeger visualize the entire journey.
The power is in correlation. Alert on high latency → click on metric → switch to trace → drill down into the log of the specific service. One seamless flow from symptom to cause.
OpenTelemetry — instrumentation without vendor lock-in
OpenTelemetry (OTel) is an open-source framework for collecting telemetry data. A CNCF project, backed by all major vendors, and the de facto industry standard.
Why OTel
Vendor neutrality: You instrument your application once using the OTel SDK. Data is sent through the OTel Collector to anywhere — Grafana stack, Datadog, New Relic, Honeycomb. Changing backends? Change the Collector config, not the application code.
Auto-instrumentation: For most languages (Java, Python, Node.js, Go, .NET) auto-instrumentation is available. Add a single dependency and you get traces for HTTP requests, database queries, gRPC calls — without writing a single line of code. Manual instrumentation is added only where you need specific context.
Standardized semantics: OTel defines conventions for naming metrics, attributes and span names. http.request.method, db.system, server.address — consistent across languages and frameworks. Dashboards and alerts work universally.
OTel Collector
The Collector is the central point for receiving, processing and exporting telemetry data. It accepts data from applications, transforms it (filters, enriches, samples) and sends it to backends.
Deployment patterns:
- Sidecar — Collector as a sidecar container next to each application. Isolation, simplicity.
- Agent — DaemonSet on every node. Collects data from all pods on the node.
- Gateway — centralized Collector cluster. All data flows through a single point. Scales horizontally.
Typically we combine: agents on nodes for collection, gateway for processing and routing.
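A gateway-mode Collector is configured as pipelines wiring receivers, processors and exporters together. A minimal sketch follows — exporter names, hostnames and ports (mimir:9009, loki:3100, tempo:4317) are illustrative, and the available exporters vary by Collector distribution and version:

```yaml
receivers:
  otlp:                      # accept OTLP from apps and node agents
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:            # protect the Collector under load spikes
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch: {}                  # batch before export for throughput

exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
```

The same shape extends to a logs pipeline; filtering, enrichment and tail sampling slot in as additional processors.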
Prometheus — metrics
Prometheus is a time-series database and monitoring system. Pull model — Prometheus actively scrapes metrics from your applications via an HTTP endpoint.
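The pull model means the application only has to expose a plain-text endpoint. In practice you would use a Prometheus client library; this stdlib-only sketch (metric and path names are made up) just shows what Prometheus sees when it scrapes:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counters for this process (hypothetical metric values).
REQUEST_COUNT = {"/api/orders": 42, "/api/products": 17}

def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for path, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{path="{path}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(REQUEST_COUNT).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve it, Prometheus would scrape this on its scrape_interval:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus polls this endpoint on a schedule; the app never pushes anything.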
What we measure
RED metrics for services:
- Rate — requests per second
- Errors — ratio of error responses
- Duration — latency (p50, p95, p99)

USE metrics for infrastructure:
- Utilization — how heavily the resource is used
- Saturation — how overloaded it is (queue depth)
- Errors — resource error rate

Business metrics:
- Orders per minute
- Conversion rate
- Revenue per minute
- Queue depth
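To make the RED definitions concrete, here is a toy computation over raw request samples. In production these numbers come from PromQL over histograms, not in-process lists; the sample data and nearest-rank p95 are purely illustrative:

```python
def red_summary(samples, window_seconds):
    """Compute RED numbers from (status_code, duration_seconds) samples.

    `samples` is a list of (status, duration) tuples observed in the window.
    This only illustrates what Rate / Errors / Duration mean.
    """
    n = len(samples)
    rate = n / window_seconds                            # requests per second
    error_ratio = sum(1 for s, _ in samples if s >= 500) / n
    durations = sorted(d for _, d in samples)
    p95 = durations[min(n - 1, int(0.95 * n))]           # nearest-rank p95
    return {"rate": rate, "error_ratio": error_ratio, "p95": p95}

# 90 fast requests, 5 server errors, 5 slow outliers in a 60s window.
samples = [(200, 0.03)] * 90 + [(500, 0.04)] * 5 + [(200, 1.2)] * 5
summary = red_summary(samples, window_seconds=60)
# The p95 lands on the slow tail (1.2s) even though the median is 30ms —
# which is exactly why percentiles matter more than averages.
```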
PromQL (Prometheus Query Language) enables sophisticated queries: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) — error rate over the last 5 minutes. Alerts defined directly in PromQL.
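The same error-rate expression becomes an alert rule directly. A sketch of a Prometheus rule file — metric names, labels and the runbook URL are assumptions, not a prescription:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m                 # must hold for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% on {{ $labels.service }}"
```

The "for: 5m" clause is what turns a noisy spike into an actionable signal.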
Grafana — visualization and alerting
Grafana brings all data together in one place. Dashboards for Prometheus metrics, Loki logs, Tempo traces — with drill-through between them.
Dashboards
Service overview: Golden signals for every service — request rate, error rate, latency. At a glance you see the health of the entire system. Red? Drill down.
Infrastructure: Node-level metrics — CPU, memory, disk, network. Kubernetes pod metrics. Cluster health. Capacity planning.
Business: Real-time business metrics alongside technical ones. Orders dropping? Is it a business issue (campaign ended) or a technical one (payment gateway not responding)?
Alerting
Grafana alerting with multi-condition rules. Not just “CPU > 90%” — combined conditions: “error rate > 5% AND request rate > 100 rps AND it’s been going on for more than 5 minutes”. We reduce false positives and increase the signal-to-noise ratio.
Notifications: Slack, PagerDuty, OpsGenie, email, webhook. Escalation rules — the alert goes to Slack first; after 15 minutes without acknowledgment it goes to PagerDuty on-call.
Distributed tracing
In a monolith, debugging is straightforward — the stack trace tells you everything. In a microservices architecture, a request passes through dozens of services. Where’s the bottleneck? Distributed tracing answers that.
How it works
Every request gets a unique trace ID. Every operation within the request is a span — with a start time, duration, attributes and a relationship to a parent span. The result: a tree of spans representing the request’s entire path through the system.
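Between services, the trace ID travels in the W3C traceparent HTTP header. OTel SDKs generate and propagate it automatically; this stdlib-only sketch exists purely to demystify the format:

```python
import secrets

def new_traceparent():
    """Create a W3C traceparent header for a new root span.

    Format: version-traceid-spanid-flags.
    """
    trace_id = secrets.token_hex(16)  # 128-bit trace ID, shared by all spans
    span_id = secrets.token_hex(8)    # 64-bit ID of the current span
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled

def child_traceparent(parent_header):
    """Keep the trace ID, mint a new span ID for the downstream call."""
    version, trace_id, _parent_span, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
child = child_traceparent(root)
# Both headers share the trace ID, so the backend can join the spans
# from different services into one tree.
```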
Example: A user opens a product page.
- Frontend → API Gateway (5ms)
- API Gateway → Product Service (2ms)
- Product Service → PostgreSQL query (45ms) ← bottleneck
- Product Service → Redis cache write (1ms)
- Product Service → Recommendation Service (15ms)
- Recommendation Service → ML model inference (12ms)
The trace shows that the 45ms database call is the problem. Click on the span, see the SQL query, explain plan, connection pool stats. Problem identified in minutes.
Sampling
In production with thousands of requests per second, we don’t store every trace. Sampling strategies:
- Head-based sampling: Decision at the start — 10% of traces are recorded. Simple but can miss interesting traces.
- Tail-based sampling: Decision at the end — we record traces with errors, high latency, specific attributes. Smarter, more demanding to implement.
Typically we combine: 5-10% head-based sampling + 100% of traces with errors and high latency.
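The combined policy boils down to one decision per trace. A sketch — real tail-based sampling runs in the Collector after the trace completes, and the thresholds here are illustrative, not recommendations:

```python
import random

def keep_trace(head_ratio, has_error, duration_ms, latency_threshold_ms=1000):
    """Combined sampling: keep 100% of interesting traces,
    a random head-sampled fraction of the rest."""
    if has_error or duration_ms >= latency_threshold_ms:
        return True                       # errors and slow traces always kept
    return random.random() < head_ratio   # e.g. 5-10% of ordinary traffic

keep_trace(0.05, has_error=True, duration_ms=12)   # always kept
keep_trace(0.05, has_error=False, duration_ms=12)  # kept ~5% of the time
```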
Log aggregation with Loki
Loki by Grafana Labs is a log aggregation system inspired by Prometheus. It indexes only labels (metadata), not full text — which makes it significantly cheaper than Elasticsearch.
Structured logging: Applications log in JSON format with standardized fields. trace_id, service, level, user_id — filterable, correlatable with traces and metrics.
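A minimal stdlib sketch of such JSON logging — the service name and context fields are hypothetical, and a real setup would inject trace_id from the active OTel span rather than hard-code it:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line (easy for Loki to label)."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname.lower(),
            "service": "order-api",            # would come from config/env
            "message": record.getMessage(),
            # Extra fields (trace_id, user_id, ...) attached via `extra=`.
            **getattr(record, "ctx", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("order-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment failed",
             extra={"ctx": {"trace_id": "4bf92f3577b34da6", "user_id": "u-123"}})
```

Because every line carries trace_id, one click takes you from the log line to the full distributed trace.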
LogQL: A query language similar to PromQL. {service="order-api"} |= "error" | json | duration > 1s — find error logs from order-api where the operation took longer than one second.
How we implement
- Assessment — we map the current state of monitoring and identify gaps
- Architecture — we design the observability stack based on scale, budget and requirements
- Infrastructure — we deploy Prometheus, Grafana, Loki, Tempo (self-hosted or cloud)
- Instrumentation — OTel auto-instrumentation + manual instrumentation of critical paths
- Dashboards and alerts — golden signals, business metrics, escalation rules
- Runbooks — what to do when an alert fires. Linked directly from the alert.
- Training — the team learns to read dashboards, debug with traces, and write PromQL queries
Stack
Metrics: Prometheus, Thanos/Mimir (long-term storage), VictoriaMetrics.
Logs: Grafana Loki, Elasticsearch (legacy), Fluentd/Fluent Bit.
Traces: Grafana Tempo, Jaeger, Zipkin.
Instrumentation: OpenTelemetry SDK, OTel Collector.
Visualization: Grafana, Grafana Cloud.
Alerting: Grafana Alerting, Alertmanager, PagerDuty, OpsGenie.
Frequently asked questions
What is observability, and how does it differ from monitoring?
Observability is the ability to understand the internal state of a system based on its outputs — metrics, logs and traces. Monitoring tells you 'something is wrong'. Observability tells you 'why it's wrong and exactly where'. For distributed systems it's essential.
Why OpenTelemetry?
OpenTelemetry is a vendor-neutral standard. You instrument once and send data anywhere — Grafana, Datadog, New Relic, Jaeger. No vendor lock-in. Plus it's a CNCF project with massive community adoption.
How much does it cost?
Self-hosted Grafana + Prometheus + Loki + Tempo is free (open-source). You pay for infrastructure — typically 2-5% of total cloud costs. Managed options (Grafana Cloud) have a free tier for smaller projects. The ROI is clear: one incident detected in minutes instead of hours pays for a year of operation.
We already use a commercial monitoring platform. Do we have to switch?
You don't have to switch all at once. OpenTelemetry lets you send data to multiple backends in parallel. You can migrate gradually, compare and decide. Often, though, clients save 40-60% by switching to an open-source stack.
How long does implementation take?
A basic stack (Prometheus + Grafana + alerting) in 1-2 weeks. A full observability stack with distributed tracing, log aggregation and custom dashboards in 4-6 weeks. Application code instrumentation runs in parallel.