Prometheus excels at collecting metrics, but a plain deployment has limits: each server is a single node, local retention is short, and there is no global view across multiple clusters. Thanos extends Prometheus with long-term storage, high availability (HA), and a global query layer.
Problems with Plain Prometheus
- Single node: if the Prometheus server goes down, scrapes are missed and locally stored data may be lost
- Retention: typically 15-30 days locally; keeping more consumes too much disk
- Multi-cluster: no global query across clusters
- Dedup: HA Prometheus pairs generate duplicate data
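To make the retention point concrete, here is a rough sizing sketch. It uses the capacity-planning rule of thumb from the Prometheus storage documentation (retention time in seconds × ingested samples per second × bytes per sample). The ingestion rate of 100,000 samples per second and the 2 bytes per sample are assumptions chosen for illustration, not measurements from our setup.

```python
def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate local disk usage for a given retention period.

    bytes_per_sample=2.0 is an assumed upper-end average; real values
    depend on the data and on how well it compresses.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical ingestion rate, used only to compare retention periods.
for days in (15, 30, 365):
    gib = needed_disk_bytes(days, samples_per_second=100_000) / 2**30
    print(f"{days:>3} days: {gib:,.0f} GiB")
```

Disk usage grows linearly with retention, so extending retention from weeks to a year multiplies the local disk requirement accordingly.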
Thanos Architecture
- Sidecar: runs alongside Prometheus and uploads blocks to object storage (S3)
- Store Gateway: serves historical data from S3
- Query: global query frontend that merges data from the sidecars and the store gateway
- Compactor: handles downsampling and compaction in S3
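The merge step performed by Query can be illustrated with a small sketch. This is a deliberately simplified model, not the algorithm Thanos actually implements: it assumes each HA replica is distinguished by an external label (here named `replica`, an assumed name), treats series that differ only in that label as the same series, and keeps the first sample seen for each timestamp. The data and the function names are hypothetical.

```python
from collections import defaultdict

REPLICA_LABEL = "replica"  # assumed name of the label that distinguishes HA replicas


def series_key(labels, replica_label=REPLICA_LABEL):
    """Identity of a series once the replica label is ignored."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))


def merge_and_dedup(sources, replica_label=REPLICA_LABEL):
    """Merge series from several sources into one deduplicated view.

    `sources` is a list of series lists. Each series is a (labels, samples)
    pair, where samples is a list of (timestamp, value) tuples. When two
    replicas report the same timestamp, the first one seen wins.
    """
    merged = defaultdict(dict)
    for source in sources:
        for labels, samples in source:
            bucket = merged[series_key(labels, replica_label)]
            for ts, value in samples:
                bucket.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Recent data from two sidecars (an HA pair) plus older data from the store gateway.
sidecar_a = [({"__name__": "up", "cluster": "eu", "replica": "a"}, [(100, 1.0), (115, 1.0)])]
sidecar_b = [({"__name__": "up", "cluster": "eu", "replica": "b"}, [(100, 1.0), (130, 1.0)])]
store_gateway = [({"__name__": "up", "cluster": "eu", "replica": "a"}, [(40, 1.0), (55, 0.0)])]

result = merge_and_dedup([sidecar_a, sidecar_b, store_gateway])
for key, points in result.items():
    print(dict(key), points)
```

The three inputs collapse into a single `up` series for cluster `eu`, covering both historical and recent timestamps, with the duplicate sample at timestamp 100 reported only once.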
The Result
Retention is effectively unlimited, bounded only by the cost of S3 storage (~$0.023/GB/month). Queries span all clusters. HA pairs no longer produce duplicates in query results. Prometheus remains the scraper; Thanos adds the global layer.
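As a back-of-the-envelope check on the storage cost, the following sketch multiplies stored volume by the price quoted above. The daily upload volume and retention period are hypothetical inputs, and the model ignores request charges, data transfer, and the effect of compaction and downsampling.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # figure quoted above; actual pricing varies by region and tier


def steady_state_gb(daily_block_gb, retention_days):
    """Data held in the bucket once uploads and expiry balance out."""
    return daily_block_gb * retention_days


def monthly_storage_cost(stored_gb, price_per_gb_month=S3_PRICE_PER_GB_MONTH):
    """Monthly storage charge in dollars for the given volume."""
    return stored_gb * price_per_gb_month


# Hypothetical workload: 10 GB of blocks uploaded per day, kept for 100 days.
stored = steady_state_gb(daily_block_gb=10, retention_days=100)
cost = monthly_storage_cost(stored)
print(f"{stored} GB stored, about ${cost:.2f} per month")
```

Under these assumptions, 1,000 GB of stored blocks comes to roughly $23 per month, which is why retention measured in years becomes affordable.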
Alternatives
Cortex takes a similar approach, but its write path differs: Prometheus pushes samples to it via remote write. It is more distributed, but also more complex. VictoriaMetrics is simpler: PromQL-compatible and available as a single binary. For our needs, we chose Thanos because it integrates directly with our existing Prometheus.
Thanos Is the Natural Evolution of the Prometheus Stack
If you’re running Prometheus in production, Thanos is the logical next step for long-term storage and multi-cluster observability.