Observability Beyond Logs — OpenTelemetry and the Future of Monitoring 2026

10. 01. 2026 · 9 min read · CORE SYSTEMS · devops

If your monitoring is built primarily on logs, in 2026 you’re probably dealing with two pains: too much data and too few answers. Logs are great for forensic analysis and context, but without tracing, metrics, and profiles, they often just “illuminate” the problem — they don’t show you where and why it occurred. OpenTelemetry (OTel) has shifted observability from a set of tools to a unified telemetry standard that enables cross-signal correlation, and thus meaningful automation (AIOps) without noise.

What Changed: Why “Observability Beyond Logs”

Modern systems are composed of microservices, managed services, serverless functions, frontends, event buses, and third parties. In such a world, a log as a unit of diagnostics is often too local. In practice, this leads to three typical situations:

  • An incident looks like “everything is slow”, but without tracing, you don’t know which dependency is the root cause.
  • A metrics graph shows a spike, but without context, you don’t know which requests and code paths caused it.
  • Logs are expensive (storage + index) and simultaneously the hardest signal to normalize.

The goal of observability in 2026 is not “having data.” The goal is reducing MTTR and increasing reliability through: (1) good signals, (2) consistent semantics, (3) correlation, and (4) automated workflows.

Principle: First define what questions you want to answer (SLO/SLI, golden signals). Only then decide how many logs to send and where.

OpenTelemetry in One Sentence (and Why It Won)

OpenTelemetry is a vendor-agnostic standard for collecting, transporting, and describing telemetry (traces, metrics, logs, and now profiles), built on three building blocks:

  1. API/SDK in the application (instrumentation + context propagation)
  2. OTLP protocol (transport) + semantic conventions (semantics)
  3. OpenTelemetry Collector (receivers → processors → exporters)

This is the fundamental difference from “agents” of the past: OTel is not one product. It’s a standard and ecosystem that allows changing backends without rewriting applications.
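
To make the building blocks concrete, here is a minimal sketch of blocks 1 and 2 from the application's side, assuming the Python SDK with the OTLP gRPC exporter; the service name and the local Collector endpoint are illustrative:

# Minimal tracing setup: the OTel SDK in the app, exporting via OTLP to a Collector.
# Assumes packages: opentelemetry-sdk, opentelemetry-exporter-otlp-proto-grpc.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes describe who is sending telemetry (semantic conventions).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))

# OTLP over gRPC to a local Collector agent (default gRPC port 4317).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-operation"):
    pass  # application logic; the SDK handles context propagation

Swapping the backend later means changing the Collector's exporter, not this application code.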

Signals: Traces vs Metrics vs Logs vs Profiles

Most teams in 2026 have realized that these signals are not competitors but layers. Each answers a different question:

  • Metrics: “What’s happening?” (SLO, capacity, saturation, trends). Low volume, high stability.
  • Traces: “Where is it happening?” (latency in a distributed request, dependency map). Medium volume.
  • Logs: “Why did it happen?” (context, domain events, exceptions). Highest volume, most expensive.
  • Profiles (continuous profiling): “Which code and which line is burning CPU/mem?” (hot paths, contention). Critical for performance and cost.

Practical correlation: a metric shows “CPU spike,” a trace shows “who caused it” (specific endpoint), and a profile says “where exactly in the code” time/CPU was lost.

OTel Architecture in Practice: How Telemetry Flows

The basic data flow looks like this:

App (OTel SDK/Auto-instrumentation)
  └── OTLP (gRPC/HTTP)
       └── OTel Collector (agent or gateway)
            ├── receivers: otlp, prometheus, filelog, ...
            ├── processors: batch, memory_limiter, k8sattributes, transform, tail_sampling, ...
            └── exporters: otlp, prometheusremotewrite, loki, tempo/jaeger, ...
                 └── Backend (Grafana stack / SaaS / data lake)

The key element is the Collector pipeline. It lets you do things you don’t want to implement in every application separately:

  • batching, retry, backpressure, buffering
  • enrichment (Kubernetes metadata, cloud resource attributes)
  • PII redaction and security policy
  • tail-based sampling (sampling only after “evaluating” the trace)
  • routing (different exports for different signals / tenancy)

Agent vs Gateway: Two Deployment Patterns

In 2026, two deployment patterns have stabilized:

  • Agent (DaemonSet / node agent): Collector runs on every node and collects local telemetry (OTLP from pods, host metrics, logs, eBPF). Advantage: low latency, local collection, better isolation.
  • Gateway (central collectors): Collector as a scaled service that handles heavy processing (tail sampling, routing, multi-tenant policy). Advantage: centralized control and less complexity on the node.

The most common architecture is a combination: the agent collects and enriches, the gateway aggregates and decides (sampling/routing).

eBPF: Telemetry Without Instrumentation (and Why It’s Not a “Silver Bullet”)

eBPF has shifted observability into the kernel: instead of modifying applications, you can capture events at the OS level (network, syscalls, scheduling) and get signals even from “black box” processes. In the OTel world, eBPF typically supplements three areas:

  • network maps and latency between processes/pods (who communicates with whom)
  • security and runtime evidence (unexpected binary exec, netflow anomalies)
  • profiling (system-wide stack traces across languages)

Note: eBPF gives you signals even without touching application code, but it won't supply domain context. Without well-defined services, resource attributes, and semantics (semconv), you'll just have another stream of data.

Continuous Profiling: The Fourth Signal That Changes the Game

Profiling has long been ad hoc: you turn on the profiler when there's a problem. But performance and cost issues are often intermittent and load-dependent. Continuous profiling lets you track performance over time, with labels (service, endpoint, region, build), and correlate it with incidents.

In the OTel ecosystem, a new profiles signal has appeared in recent years, and the Collector can already receive and send profiles via OTLP (at the time of writing, typically as experimental / with a feature gate). Also important is work around eBPF profiling: an open-source Linux eBPF profiler exists within the OpenTelemetry project, aiming to integrate as a Collector receiver long-term.

When to Deploy Profiles First

  • latency issues without a clear bottleneck in the trace
  • unexplained CPU/memory spikes (garbage collector, contention, locks)
  • cost optimization (CPU time per request, hot functions)
  • suspected regression after a release (profile by build_sha)

AIOps and Correlation: Reality vs Hype

“AIOps” in 2026 is not about AI magically solving an incident. It works mainly as a correlation and assistive layer on top of quality telemetry:

  • Dedup and grouping of alerts (one incident instead of 200 pager pages)
  • Root-cause candidates based on service topology and changes (deploy, config, infra)
  • Automated queries into trace/log/metric stores and suggested next steps (runbook)
  • Anomaly detection on metrics and profile signals (e.g., changed hot path)

The condition is simple: AI can’t correlate what isn’t consistently named and connected. That’s why in OTel, things like Resource attributes, trace/span ID in logs, consistent service.name, and semantic conventions are key.

Implementation Guide: How to Start with OTel (Without a Big Bang)

1) Define Questions, Not Dashboards

Start with SLOs. Choose 2–4 key user journeys and introduce SLIs (latency, error rate, availability) for them. Only then expand telemetry coverage.
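
As an illustration of backing an SLI with telemetry, below is a sketch that records a per-request latency histogram via the Python metrics API; the meter name and attributes are assumptions loosely following HTTP semantic conventions:

from opentelemetry import metrics

# Assumes a MeterProvider with an OTLP exporter has been configured elsewhere
# (analogous to the tracing setup shown earlier).
meter = metrics.get_meter("checkout")
request_duration = meter.create_histogram(
    "http.server.request.duration",   # semconv-style name
    unit="s",
    description="Duration of inbound HTTP requests",
)

def handle_request(route: str, status_code: int, elapsed_s: float) -> None:
    # Attributes keep the SLI sliceable by user journey without exploding cardinality.
    request_duration.record(
        elapsed_s,
        {"http.route": route, "http.response.status_code": status_code},
    )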

2) Introduce Minimal Semantics (Naming and Attributes)

  • service.name and service.version mandatory
  • Environment identity: deployment.environment (prod/stage)
  • Build metadata: commit SHA, release tag
  • Standard HTTP/DB/RPC attributes per semconv (avoid “custom chaos”)
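
A sketch of what this minimal identity can look like in code, assuming the Python SDK; the concrete values and the build-metadata keys are placeholders, and the same attributes can alternatively be supplied via the standard OTEL_RESOURCE_ATTRIBUTES environment variable:

import os

from opentelemetry.sdk.resources import Resource

# Mandatory identity + environment + build metadata.
resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": "1.4.2",
    "deployment.environment": os.getenv("DEPLOY_ENV", "stage"),
    "build.commit_sha": os.getenv("GIT_SHA", "unknown"),     # illustrative key
    "build.release_tag": os.getenv("RELEASE_TAG", "unknown"),  # illustrative key
})
# Pass `resource` to your TracerProvider / MeterProvider / LoggerProvider.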

3) Instrumentation: Auto vs Manual

Auto-instrumentation (where it makes sense) quickly creates basic traces and metrics. But for domain operations, you’ll want manual spans too:

  • “place_order”, “calculate_price”, “reserve_inventory”
  • attributes: order_id (no PII), tenant, payment_method
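
A sketch of such a manual domain span, assuming the Python API; reserve_inventory and calculate_price stand in for your own domain functions, and the attribute names follow the bullets above:

from opentelemetry import trace

tracer = trace.get_tracer("orders")

def place_order(order, tenant: str):
    # A domain span on top of whatever auto-instrumentation already creates (HTTP, DB).
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", str(order.id))        # identifier, not PII
        span.set_attribute("tenant", tenant)
        span.set_attribute("payment.method", order.payment_method)
        reserve_inventory(order)      # child spans inherit the active context
        return calculate_price(order)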

4) Logs: Correlation Instead of “Log Everything”

Don’t throw away logs — but make them structured events. The biggest win is correlation: a log record should carry trace_id / span_id so you can “jump” from a trace to relevant logs.
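
One way to do this, sketched with Python's standard logging plus the OTel API (many language-specific log appenders and OTel log bridges do this injection for you automatically):

import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Adds trace_id / span_id of the currently active span to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(   # naive JSON formatting, for illustration
    '{"ts":"%(asctime)s","level":"%(levelname)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s","msg":"%(message)s"}'
))
logging.getLogger().addHandler(handler)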

5) Collector: Start with a Reasonable Pipeline

Below is a minimalist example for traces/metrics/logs (OTLP receiver) with batching and basic hygiene. In production, you’ll typically add K8s enrichment and sampling.

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    send_batch_size: 8192
    timeout: 2s

exporters:
  otlp:
    endpoint: YOUR_BACKEND_OTLP_ENDPOINT:4317
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

When You’re Serious: Enrichment, Redaction, and Tail Sampling

Once you have the first services instrumented, the biggest difference comes from “telemetry hygiene” in the Collector: unified attributes from Kubernetes/cloud, filtering sensitive data, and tail sampling that keeps only what’s truly important.

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024

  # Adds metadata from Kubernetes (namespace, pod, labels…)
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name

  # Resource attribute detection (cloud, host, region…)
  resourcedetection:
    detectors: [env, system]

  # Redaction of sensitive values (example — adjust per your policy)
  attributes/redact:
    actions:
      - key: enduser.id
        action: delete
      - key: http.request.header.authorization
        action: delete

  # Tail sampling: decision only after trace completion
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow_requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 2

  batch:
    send_batch_size: 8192
    timeout: 2s

exporters:
  otlp:
    endpoint: YOUR_BACKEND_OTLP_ENDPOINT:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resourcedetection, attributes/redact, tail_sampling, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resourcedetection, attributes/redact, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resourcedetection, attributes/redact, batch]
      exporters: [otlp]

Tip: Do tail-based sampling (e.g., errors + slow requests) on the gateway Collector. That way, applications don't have to make "blind" head-sampling decisions.

6) Profiles: Plan Ahead, Even Though the Signal Is Still Young

Profiling is evolving rapidly in the OTel ecosystem. OpenTelemetry officially announced support for a “profiling” signal and OTLP/Collector support is gradually expanding. In practice today, this means: expect iterations (data model changes, compatibility) and start where you have the biggest performance or cost pain points.

Important: the ideal state isn’t just “having a flamegraph.” The ideal is having profile ↔ trace linking: jumping from a slow request to a profile in the same time window and finding the hot function.

# Conceptual pipeline example for profiles (when your Collector/build supports profiles)
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: YOUR_BACKEND_OTLP_ENDPOINT:4317

service:
  pipelines:
    profiles:
      receivers: [otlp]
      exporters: [otlp]

Note: In some Collector versions, profile support may be behind a feature gate / configuration flag. Treat that as a sign that it's best to roll profiling out as a pilot first (e.g., on one node pool / one service) and keep an eye on backend compatibility.

Practical approach:

  1. Start in one domain (e.g., JVM or Go services with the highest CPU).
  2. Collect a baseline (before optimization) and define KPIs (CPU/request, p95 latency).
  3. Only then automate profile ↔ trace correlation (where the backend supports it).

7) eBPF: Use It Intentionally

eBPF delivers the best ROI on Linux nodes where you want to:

  • get telemetry from processes you can’t instrument
  • profile the “entire node” and reveal noisy neighbors / contention
  • validate whether application data matches reality at the network and OS level

Most Common Mistakes (and How to Avoid Them)

  • OTel without Collectors: sending directly from applications works for starters but you lose central policy (sampling, PII, routing).
  • Inconsistent service.name: breaks correlation and topology.
  • “Everything into logs”: expensive and slow. Prefer metrics + traces for most questions.
  • Missing ownership: telemetry is a product. It needs standards, review, and CI checks (semconv linting, sampling policy); see the sketch after this list.
  • PII in attributes: trace attributes are easily indexed — watch what you put in them.
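
On the ownership point: a minimal CI-style sketch, assuming pytest and the Python SDK's in-memory exporter, that fails the build when agreed-on resource attributes are missing; the required set and build_provider are placeholders for your own conventions:

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

REQUIRED_RESOURCE_ATTRS = {"service.name", "service.version", "deployment.environment"}

def build_provider() -> TracerProvider:
    # In a real check, reuse the exact provider factory your application uses.
    return TracerProvider(resource=Resource.create({
        "service.name": "checkout-service",
        "service.version": "1.4.2",
        "deployment.environment": "stage",
    }))

def test_resource_attributes_present():
    exporter = InMemorySpanExporter()
    provider = build_provider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    provider.get_tracer("ci-check").start_span("probe").end()
    attrs = exporter.get_finished_spans()[0].resource.attributes
    missing = REQUIRED_RESOURCE_ATTRS - set(attrs)
    assert not missing, f"missing resource attributes: {missing}"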

Where It’s Heading: Observability 2026 → 2027

The direction is clear: unified telemetry expands to include profiles, eBPF makes signal collection from infrastructure cheaper, and AIOps moves from "chatbot" to workflows that shorten investigation. Most importantly, the winning teams understand that observability is not about buying a tool; it's a discipline (semantics, data quality, correlation, and clear SLOs).

Want help? The fastest path is an observability assessment: mapping signals, costs, SLO coverage, and designing an OTel roadmap (instrumentation → Collectors → correlation → profiling).
