
Observability in 2026: From Logs to Telemetry — How to See Inside a Production System

09. 01. 2026 · 9 min read · CORE SYSTEMS · devops

A production system without observability is like driving at night without headlights — you’re moving but can’t see where. In 2026, grepping through logs and a few Grafana dashboards aren’t enough. Distributed systems, microservices, and event-driven architectures require telemetry that connects logs, metrics, and traces into one coherent picture. Here’s how.

Why Logs Aren’t Enough — Three Pillars of Observability

Most teams start with logs. And on a monolith with one server, that works. The problem arises when you have 40 microservices, a request passes through six of them, and you don’t know where two seconds of latency were lost. Logs tell you what happened in one process. They don’t tell you why the entire system is slow.

That’s why observability in recent years has settled on three pillars — and none of them is sufficient on its own.

Logs

Discrete events with context. They tell you what happened. Structured logs (JSON) with correlation IDs are the foundation — plain text logs in production in 2026 are unacceptable.

Metrics

Numerical time-series. They tell you how much and how fast. Request rate, error rate, latency (RED), resource saturation. Aggregated, cheap to store, ideal for alerting.

Traces

The path of a request through the entire system. They tell you where and why. Distributed tracing shows which service is slowing things down, where retries fail, and how latency propagates.

The key is connecting them. When you see a spike in error rate (metric), you want to drill down to the specific traces that failed, and from a trace get to the logs of individual spans. Without this connection, you have three separate tools instead of one observability system.
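
For illustration, here is roughly what a structured log line with trace correlation can look like. The field names (trace_id, span_id, correlation_id) follow common OpenTelemetry and W3C Trace Context conventions, but the exact schema is an assumption and depends on your logging library:

```json
{
  "timestamp": "2026-01-09T10:42:17.312Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "charge declined by upstream provider",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "order-12345"
}
```

Because the trace_id in the log matches the ID of the distributed trace, a backend that indexes this field can jump from any error log straight to the trace that produced it, and back.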

OpenTelemetry as the Standard — Why and How

Before OpenTelemetry (OTel), you had vendor-specific agents: Datadog agent, New Relic agent, Jaeger client, Prometheus client library — each with its own format, its own SDK, and its own lock-in. Switching to a different backend meant rewriting the instrumentation of your entire application.

OTel solves this. It’s a vendor-neutral standard for generating, collecting, and exporting telemetry. In 2026, OTel is the de facto standard — traces and metrics are stable, logs have reached GA stability. Main benefits:

  • One SDK for all three signals. You instrument your application once and export anywhere — Prometheus, Jaeger, Datadog, Grafana Cloud.
  • Auto-instrumentation. For Go, Java, Python, .NET, and Node.js, agents exist that automatically instrument HTTP, gRPC, database clients, and messaging frameworks without code changes (see the deployment sketch after this list).
  • OTel Collector as the central point. A single process receives telemetry from the entire cluster, processes it (filtering, sampling, enrichment), and routes it to backends. Want to add a new backend? Add an exporter to the config — no changes in applications.
  • No vendor lock-in. Today you use Prometheus + Loki + Tempo. Next year you want to migrate to Datadog? Change the exporters in the Collector. The application is untouched.
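
As a hedged sketch of what the auto-instrumentation bullet means in practice, the fragment below shows a Kubernetes container spec for a hypothetical Python service: the standard OTEL_* environment variables point the SDK at the Collector, and the opentelemetry-instrument wrapper activates auto-instrumentation without touching application code. Names like checkout-api, app.py, and the Collector service address are made up for illustration:

```yaml
# Fragment of a Kubernetes Deployment for a hypothetical Python service.
# Requires the opentelemetry-distro and instrumentation packages in the image.
containers:
  - name: checkout-api
    image: registry.example.com/checkout-api:1.4.2
    command: ["opentelemetry-instrument", "python", "app.py"]   # auto-instrumentation wrapper
    env:
      - name: OTEL_SERVICE_NAME
        value: "checkout-api"
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://otel-collector.observability:4317"       # the Collector configured below
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: "deployment.environment=production"
```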

OTel Collector Configuration

A typical production pipeline looks like this. The Collector receives OTLP data, processes it in batches, enriches it with Kubernetes metadata, and exports to three backends:

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
  tail_sampling:
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, k8sattributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch, k8sattributes]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch, k8sattributes, tail_sampling]
      exporters: [otlp/tempo]
```

An important detail: tail sampling. In production, you can’t store 100% of traces — it’s expensive and unnecessary. Tail sampling ensures that errors and slow requests are always stored, while normal traffic is only sampled. Unlike head sampling, where the decision is made at the start of a trace, the sampling decision here is made only after the entire request flow has been seen.

Practical Stack: Grafana + Prometheus + Loki + Tempo

Why this stack? Because it’s open-source, battle-tested, and scalable. And most importantly — all components are designed to work together.

  • Prometheus / Mimir — metrics. Prometheus for smaller deployments, Grafana Mimir for long-term storage and multi-tenancy. PromQL is the lingua franca of metrics — every SRE, every exporter, and every dashboard knows it.
  • Loki — logs. It doesn’t index log content (unlike Elasticsearch), it only indexes labels. That means significantly lower storage costs and operational simplicity. LogQL syntax is close to PromQL, so the transition is painless.
  • Tempo — traces. Column-oriented backend optimized for trace storage. Supports direct integration with Loki and Prometheus — from a trace you drill down to logs, from metrics to traces (a provisioning sketch follows this list). This connection is what turns individual tools into a system.
  • Grafana — visualization and correlation. One UI for all three signals. Explore mode for ad-hoc debugging, dashboards for overview, Alerting for notifications. Grafana in 2026 can correlate metrics → traces → logs in a single flow.
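
To make the trace-to-logs drill-down concrete, here is a hedged sketch of Grafana datasource provisioning that wires Tempo to Loki and Prometheus. The jsonData field names (tracesToLogsV2, serviceMap) and the datasource UIDs vary between Grafana versions and setups, so treat this as an assumption to verify against your installation:

```yaml
# grafana/provisioning/datasources/tempo.yaml (illustrative; verify field names for your Grafana version)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # jump from a span to the matching Loki logs
        filterByTraceID: true
        spanStartTimeShift: "-5m"
        spanEndTimeShift: "5m"
      serviceMap:
        datasourceUid: prometheus    # service graph built from span metrics
```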

Alternatives exist — Elastic Stack (ELK), Datadog, New Relic, Splunk. The choice depends on context. For teams that want control over data, no per-host fees, and the option for self-hosted or managed deployment, the Grafana stack is hard to beat.

Alerting That Works — Not the Kind That Wakes the Whole Team

Bad alerting is worse than no alerting. When the team gets 50 alerts a day and 48 of them are noise, they learn to ignore alerts. And when the real one comes, nobody responds. This phenomenon — alert fatigue — is a real problem that kills incident response.

Three Rules of Good Alerting

  • Actionable. Every alert must have a clear action. If you don’t know what to do with an alert, it shouldn’t exist. “CPU is at 80%” is not an actionable alert — what do you do with it? “Error rate of payment-service exceeded 5% in the last 5 minutes” is actionable because you know customers can’t pay (an example rule follows this list).
  • Runbook. Every alert has a link to a runbook. A document that says: what the alert means, what the impact is, how to diagnose the cause, and how to escalate. An on-call engineer at 3 AM shouldn’t have to think — they should follow a procedure.
  • SLO-based. The alert fires when you’re approaching an SLO violation — not when you exceed an arbitrary threshold. More on this shortly.
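
A minimal sketch of what such an actionable, runbook-linked rule could look like as a Prometheus alerting rule. The metric name http_requests_total and the runbook URL are placeholders, assuming RED-style request counters are exported under that name:

```yaml
groups:
  - name: payment-service-alerts
    rules:
      - alert: PaymentServiceHighErrorRate
        # Error ratio over the last 5 minutes, based on standard request/error counters.
        expr: |
          sum(rate(http_requests_total{job="payment-service", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "payment-service error rate above 5% for 5 minutes, customers cannot pay"
          runbook_url: "https://runbooks.example.com/payment-service/high-error-rate"
```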

Practical tips: use grouping (group related alerts into one notification), inhibition (if the entire cluster is down, you don’t need 200 alerts for each pod), and silencing for planned maintenance. Alertmanager handles all of this out of the box.
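
A hedged Alertmanager sketch of grouping and inhibition, assuming alerts carry severity and cluster labels and that a ClusterDown alert exists; the receiver name is a placeholder:

```yaml
route:
  receiver: oncall-pager
  group_by: ["alertname", "namespace"]   # one notification per alert type and namespace
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

# If the whole cluster is down, suppress the per-pod warnings from the same cluster.
inhibit_rules:
  - source_matchers: ['alertname="ClusterDown"']
    target_matchers: ['severity="warning"']
    equal: ["cluster"]

receivers:
  - name: oncall-pager
    # PagerDuty / Slack / Opsgenie integration omitted for brevity
```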

SLO/SLI Driven Approach — You Measure What Matters

SLI (Service Level Indicator) is a metric that measures service quality from the user’s perspective. SLO (Service Level Objective) is the target value for that SLI. It sounds simple — in practice, it’s a paradigm shift in how you think about monitoring.

Instead of hundreds of alerts on infrastructure metrics, you have a handful of SLOs that tell you whether users are having a good experience. Examples:

  • Availability: 99.9% of requests to /api/checkout return 2xx over the last 30 days.
  • Latency: 95% of requests to /api/search have a response under 200 ms.
  • Correctness: 99.99% of payment transactions are processed correctly.

The key concept is error budget. If your SLO is 99.9% availability, you have a monthly budget of 0.1% — roughly 43 minutes of downtime. As long as you have budget, you can deploy, experiment, make risky changes. When the budget runs low, you slow down and focus on stability.

Error budget burn rate alerting is far more effective than static thresholds. The alert fires when you’re burning budget faster than sustainable — not “error rate is 1%,” but “at the current error rate, you’ll exhaust your monthly budget in 6 hours.” That’s an alert you respond to.

Grafana has native SLO support — you define the SLI query, target value, and period, and Grafana automatically generates dashboards and alerting rules for burn rate. Prometheus recording rules calculate error budget in real time.
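
For teams writing their own rules instead of (or alongside) Grafana SLO, here is a sketch of the classic multi-window burn-rate pattern for the checkout availability SLO above. The metric name and label filters are assumptions; the 14.4 factor is the standard fast-burn threshold for a 99.9% 30-day SLO (at that rate, about 2% of the budget is spent in one hour):

```yaml
groups:
  - name: checkout-availability-slo
    rules:
      # SLI: share of 2xx responses, pre-computed at two window lengths.
      - record: sli:checkout_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="checkout", code=~"2.."}[5m]))
            /
          sum(rate(http_requests_total{job="checkout"}[5m]))
      - record: sli:checkout_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout", code=~"2.."}[1h]))
            /
          sum(rate(http_requests_total{job="checkout"}[1h]))
      # Fast burn: the error rate is 14.4x the budget in both the short and the long window,
      # meaning the monthly 0.1% budget would be gone in roughly two days.
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (1 - sli:checkout_availability:ratio_rate5m) > 14.4 * 0.001
            and
          (1 - sli:checkout_availability:ratio_rate1h) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "checkout is burning its 30-day error budget about 14x faster than sustainable"
```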

How We Build It at CORE SYSTEMS

At CORE SYSTEMS, we consider observability a fundamental infrastructure layer, not a nice-to-have. Every system we deliver to production — whether it’s an information system, data platform, or AI agent — has an observability stack as part of delivery.

Our approach is pragmatic and built on proven principles:

  • Observability from day zero. We handle instrumentation from the start of the project, not as an afterthought after the first production incident. The OTel SDK is part of the application template, the OTel Collector runs in every cluster.
  • SLO-first alerting. Before we write the first alert rule, we define SLOs with the business owner. What’s acceptable latency? What availability? Alerts then derive from error budget burn rate — not from arbitrary thresholds.
  • Runbook for every alert. An alert without a runbook is just noise. Our alerts always have a link to a diagnostic procedure, escalation matrix, and contact for the responsible team.
  • Dashboards as code. We version Grafana dashboards in Git, deploy via CI/CD, use Jsonnet/Grafonnet for templating. No manual clicking in the UI, no “who changed that dashboard.”
  • Cost-aware telemetry. High-cardinality metrics and full-fidelity traces are expensive. We help clients find the right balance between visibility and costs — tail sampling, metric relabeling, log level management.

We work with clients in banking, logistics, and retail — industries where downtime costs real money and regulators require an audit trail. Observability there isn’t a luxury. It’s an operational necessity.

Conclusion: Observability Is an Investment, Not a Cost

A quality observability stack pays for itself at the first production incident. Instead of hours of blind searching through logs — minutes of targeted debugging. Instead of alert fatigue — actionable notifications. Instead of “I think it works” — an SLO dashboard that objectively tells you how you’re doing.

The technologies are available and open-source. OpenTelemetry has unified instrumentation. The Grafana stack provides a complete solution without vendor lock-in. What remains is the decision to do it right — instrument from the start, define SLOs, set up alerting that makes sense, and above all connect all three pillars into one system. Because observability isn’t about tools. It’s about seeing inside your system and making decisions based on data.
