Relational databases have accompanied us throughout our careers. But what if you need to write millions of events per second, replicate data across data centers, and still guarantee 99.99% availability? A single-master setup in PostgreSQL or Oracle struggles at that scale, so we started experimenting with Apache Cassandra.
Why Not a Relational Database?¶
For a telco client, we were building a system for collecting CDR records (Call Detail Records). Requirements: 50,000 writes per second, 2-year retention, access to data from two geographically separated locations. The classic approach — PostgreSQL with partitioning — hit the limits of vertical scaling. Adding RAM and CPU to a single server has its ceiling.
Cassandra offers horizontal scaling — add a new node to the cluster and data is automatically redistributed. No single point of failure, no master node. Every node is equal.
Architecture: Ring and Partitioning¶
Cassandra organizes nodes into a logical ring. Every row of data has a partition key, from which a hash, the token, is calculated (by default with the Murmur3 partitioner). The token determines which node the data belongs to. With a replication factor of 3, every row is stored on three nodes; if one goes down, the data is still available.
CREATE KEYSPACE telco_cdr
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc_prague': 3,
  'dc_brno': 3
};

CREATE TABLE telco_cdr.call_records (
  phone_number  text,
  call_date     date,
  call_time     timestamp,
  duration      int,
  called_number text,
  cell_id       text,
  PRIMARY KEY ((phone_number, call_date), call_time)
) WITH CLUSTERING ORDER BY (call_time DESC);
The design of the partition key is critical. We chose the composite key (phone_number, call_date) — each partition contains records for one number for one day. This ensures even data distribution and efficient queries of the type “show me calls for number X on date Y”.
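The ring placement described above can be sketched in a few lines of Python. This is a deliberately simplified model, not Cassandra code: MD5 stands in for the Murmur3 partitioner so the example stays dependency-free, and the node names and tokens are made up.

```python
import bisect
import hashlib

def token_for(partition_key):
    """Hash a composite partition key to a position on the ring.
    Real Cassandra uses Murmur3; MD5 is a stand-in for the sketch."""
    raw = ":".join(str(p) for p in partition_key).encode()
    return int(hashlib.md5(raw).hexdigest(), 16) % 2**64

def replicas_for(partition_key, ring, rf=3):
    """Walk the ring clockwise from the key's token and take `rf` nodes."""
    tokens = sorted(ring)                      # ring: {token: node_name}
    i = bisect.bisect_right(tokens, token_for(partition_key))
    return [ring[tokens[(i + k) % len(tokens)]] for k in range(rf)]

# Six hypothetical nodes, one token each (real nodes own many vnodes).
ring = {0: "node1", 2**61: "node2", 2**62: "node3",
        2**63: "node4", 2**63 + 2**61: "node5", 2**63 + 2**62: "node6"}
print(replicas_for(("+420601123456", "2024-03-01"), ring))
```

Note that with NetworkTopologyStrategy the real replica selection also takes data centers and racks into account, which this sketch ignores.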
Tunable Consistency¶
Unlike relational databases with ACID transactions, Cassandra offers tunable consistency. For each query you choose a level:
- ONE — response from one node, fastest, but you risk stale data
- QUORUM — response from a majority of replicas, a good compromise
- ALL — response from all replicas, slowest, but strong consistency
- LOCAL_QUORUM — quorum within a single data center, ideal for geo-replication
For writing CDR records, we chose LOCAL_QUORUM: data is confirmed by a majority of replicas in the local DC and asynchronously replicated to the second one. We read analytical reports at LOCAL_QUORUM as well; because a quorum write and a quorum read must overlap on at least one replica, this guarantees read-after-write consistency within a single location.
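The arithmetic behind these levels is simple enough to write down. A small sketch (the function is ours, but the quorum formula, floor(replicas / 2) + 1, is the actual rule):

```python
def required_acks(level, rf_per_dc, local_dc):
    """How many replica acknowledgements a consistency level needs."""
    total_rf = sum(rf_per_dc.values())
    quorum = lambda n: n // 2 + 1              # majority of n replicas
    return {
        "ONE": 1,
        "QUORUM": quorum(total_rf),            # majority across all DCs
        "ALL": total_rf,
        "LOCAL_QUORUM": quorum(rf_per_dc[local_dc]),  # majority in one DC
    }[level]

rf = {"dc_prague": 3, "dc_brno": 3}            # matches the keyspace above
print(required_acks("LOCAL_QUORUM", rf, "dc_prague"))  # 2
print(required_acks("QUORUM", rf, "dc_prague"))        # 4
```

With RF 3 per DC, LOCAL_QUORUM needs 2 of 3 local replicas, so one local node can be down without affecting writes, and no cross-DC round trip is on the critical path.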
Data Modeling: Query-First Approach¶
The biggest mental shift from relational databases: in Cassandra, you model data around queries, not around entities. There is no normalization, because there are no JOINs to exploit it. Data duplication is normal and desirable.
If you need to display the same data in two different ways (by number and by network cell), you create two tables with the same data but different partition keys. You write to both simultaneously. Disk is cheap, query latency is expensive.
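The dual-write pattern can be sketched as a small fan-out in application code. This is a hypothetical helper, not driver code: call_records follows the schema above, while call_records_by_cell is an assumed second table partitioned by cell instead of phone number.

```python
def fan_out(record):
    """Produce one parameterized INSERT per query table for a single CDR."""
    cols = ["phone_number", "call_date", "call_time",
            "duration", "called_number", "cell_id"]
    placeholders = ", ".join("%s" for _ in cols)
    params = tuple(record[c] for c in cols)
    tables = ["telco_cdr.call_records",          # query: by number + day
              "telco_cdr.call_records_by_cell"]  # query: by cell (assumed)
    return [(f"INSERT INTO {t} ({', '.join(cols)}) VALUES ({placeholders})",
             params) for t in tables]

stmts = fan_out({
    "phone_number": "+420601123456", "call_date": "2024-03-01",
    "call_time": "2024-03-01 10:15:00", "duration": 124,
    "called_number": "+420602987654", "cell_id": "CELL-PRG-042",
})
for query, params in stmts:
    print(query)
```

In production, each statement would be executed via the driver; the point of the sketch is that both tables receive the same values from one write path.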
Operations in Production¶
Compaction¶
Cassandra writes data to immutable files (SSTables). Periodically, the compaction process merges them. We chose DateTieredCompactionStrategy, which suits time-series data where older records don't change (newer Cassandra versions replace it with TimeWindowCompactionStrategy, built on the same idea). Compaction is CPU and I/O intensive; we schedule it outside peak hours.
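The core idea of time-tiered compaction can be illustrated with a toy bucketing function. This is our own simplified sketch, not Cassandra's implementation: SSTables are grouped by the time window of their newest data, and only files within the same window are merged.

```python
from collections import defaultdict

def bucket_by_window(sstables, window_hours=24):
    """Group SSTables by time window; buckets with 2+ files are
    compaction candidates. `sstables` is a list of
    (file_name, newest_write_epoch_seconds) pairs."""
    window = window_hours * 3600
    buckets = defaultdict(list)
    for name, max_ts in sstables:
        buckets[max_ts // window].append(name)
    return {w: files for w, files in buckets.items() if len(files) > 1}

# Three hypothetical SSTables; the first two fall in the same 24h window.
sstables = [("sst-1.db", 1_700_000_000), ("sst-2.db", 1_700_001_000),
            ("sst-3.db", 1_700_100_000)]
print(bucket_by_window(sstables))
```

Because old windows eventually stop receiving new files, old data settles into a few large SSTables that never need to be rewritten again, which is what makes the strategy cheap for append-only time series.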
Repair¶
If a node was offline and missed writes, the data on it may be out of date. The nodetool repair command synchronizes data between replicas. We run it once a week on each node — it’s necessary maintenance, but it puts load on the cluster.
Monitoring¶
Cassandra exports metrics via JMX. We primarily track read/write latency (p99 under 10 ms), pending compactions (should not grow), heap usage, and tombstone count. Too many tombstones (Cassandra's deletion markers) slow down reads, which is why we prefer TTL over explicit DELETE.
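The TTL-over-DELETE preference is easy to apply at write time. A sketch of building such a statement (the helper is ours; the `USING TTL` clause is standard CQL, and the 2-year value matches the retention requirement above):

```python
RETENTION_SECONDS = 2 * 365 * 24 * 3600   # 2-year retention window

def insert_with_ttl(table, cols, ttl=RETENTION_SECONDS):
    """Build a parameterized INSERT that expires rows automatically,
    so no explicit DELETE (and its tombstone) is needed for retention."""
    placeholders = ", ".join("%s" for _ in cols)
    return (f"INSERT INTO {table} ({', '.join(cols)}) "
            f"VALUES ({placeholders}) USING TTL {ttl}")

print(insert_with_ttl("telco_cdr.call_records",
                      ["phone_number", "call_date", "call_time",
                       "duration", "called_number", "cell_id"]))
```

Alternatively, the TTL can be set once as the table's default_time_to_live so individual writes don't have to repeat it.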
Cassandra vs. MongoDB¶
Both are NoSQL, but they solve different problems. MongoDB excels at flexible schemas and ad-hoc queries — it’s closer to a relational database. Cassandra excels at write performance, horizontal scaling, and multi-DC replication. For our CDR records (write-heavy, time-series, geo-distributed), Cassandra was the clear choice.
Results¶
A six-node cluster (3 in each DC) processes 80,000 writes/s with p99 latency under 5 ms. Analytical read queries (all calls for one number over a month) take 15–30 ms. Data replicates between Prague and Brno with latency under 50 ms. Over six months of operation, we had zero downtime; even during planned maintenance (rolling restart), the cluster ran without interruption.
Cassandra Is Not a Replacement for a Relational Database¶
Cassandra is a specialized tool for specific use cases: write-heavy workloads, time-series data, geo-distributed systems. If you need JOINs, ad-hoc queries, or ACID transactions, stay with PostgreSQL.
But if your requirements match — and we believe that with growing data volumes, such projects will become more common — Cassandra is an exceptionally reliable solution.
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us