Apache Kafka — Real-Time Data Streaming in Practice

18. 05. 2020 · 4 min read · CORE SYSTEMS · data

Your batch ETL runs once a day and your dashboard shows yesterday’s data? In 2020, that’s too slow for many businesses. Apache Kafka enables processing millions of events per second in real time. Let’s look at how and when to deploy it.

What Is Kafka and Why You Need It

Apache Kafka is a distributed streaming platform. Originally created at LinkedIn for processing activity logs, it’s now the de facto standard for event-driven architectures. Kafka is not a message queue (even though it’s often used as one) — it’s a distributed commit log with ordering guarantees, persistence, and replay capability.

Key features that differentiate it from RabbitMQ or ActiveMQ:

  • Retention: messages remain in the topic even after being read; consumer groups read independently and can replay at any time (see the consumer sketch after this list)
  • Horizontal scaling: partitioning enables parallel processing — add brokers, increase throughput
  • Ordering guarantee: messages within a single partition are strictly ordered
  • Throughput: millions of messages per second on commodity hardware — LinkedIn processes 7 trillion messages per day
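
To make the replay point concrete, here is a minimal Java consumer sketch. The broker address and the customer-events topic are made-up examples; because the consumer uses a fresh group id and auto.offset.reset=earliest, it starts from the oldest message still within retention, i.e. a full replay, without disturbing any other consumer group.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address and topic name, for illustration only
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-" + System.currentTimeMillis()); // fresh group = independent offsets
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // no committed offset -> start at the oldest retained message

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                // poll returns whatever is available within 500 ms; processing here is at-least-once
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```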

Architecture: Brokers, Topics, and Partitions

A Kafka cluster consists of brokers (servers). Data is organized into topics (logical channels) and each topic is divided into partitions (physical logs). The replication factor determines how many copies of each partition exist — typically 3 for production.
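
As a sketch of how these knobs are set in practice, a topic with an explicit partition count and replication factor can be created with the AdminClient. Broker addresses and the topic name are illustrative assumptions:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateCustomerEventsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Illustrative broker list; replace with your own cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions (the starting point recommended later in this article), 3 replicas
            NewTopic topic = new NewTopic("customer-events", 12, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // blocks until the brokers confirm
        }
    }
}
```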

For our deployment in the banking sector, we chose 5 brokers, replication factor 3, and partitioning by customer ID. This guarantees that all events for a single customer are in one partition — strict ordering without compromise.
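
A minimal producer sketch of that idea: with the customer ID as the message key, the default partitioner hashes the key, so every event for a given customer lands in the same partition and keeps its order. Topic name, key, and payload are invented for the example:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> strict ordering of this customer's events
            String customerId = "customer-42"; // hypothetical key
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"ADDRESS_CHANGED\",\"customerId\":\"customer-42\"}"));
            producer.flush();
        }
    }
}
```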

Kafka Connect — Integration Without Code

Kafka Connect is a framework for streaming data between Kafka and external systems. Source connectors read data into Kafka, sink connectors write them out. Ready-made connectors exist for most databases:

  • Debezium: CDC (Change Data Capture) for PostgreSQL, MySQL, SQL Server, Oracle — captures every database change as an event
  • JDBC Source/Sink: polling-based, simpler but without real-time guarantees
  • Elasticsearch Sink: automatic indexing into Elasticsearch for full-text search
  • S3 Sink: archival to S3 for long-term storage and analytics

On a project for an insurance company, we used Debezium CDC to capture changes in an Oracle database of insurance policies and stream them into a data lake on Azure Blob Storage. From daily batch ETL to real-time — latency under 5 seconds.
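
For illustration, a Debezium Oracle source connector registered through the Kafka Connect REST API looks roughly like the JSON below. Hostnames, credentials, and table names are placeholders, and exact property names differ between Debezium versions (database.server.name, for instance, was later renamed to topic.prefix), so take it as a sketch rather than a drop-in config:

```json
{
  "name": "policies-oracle-cdc",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "tasks.max": "1",
    "database.hostname": "oracle.example.internal",
    "database.port": "1521",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "POLICIES",
    "database.server.name": "insurance",
    "table.include.list": "POLICY_SCHEMA.POLICY,POLICY_SCHEMA.POLICY_EVENT",
    "database.history.kafka.bootstrap.servers": "broker1:9092",
    "database.history.kafka.topic": "schema-changes.policies"
  }
}
```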

Schema Registry — Evolution Without Chaos

Without schema management, Kafka is just a glorified byte pipe. Confluent Schema Registry stores Avro/JSON/Protobuf schemas and enforces compatibility rules. Under the default BACKWARD mode, adding an optional field with a default value passes, while adding a required field without a default is rejected; stricter modes (FORWARD, FULL) also catch changes that would break existing consumers, such as removing a field they still rely on.

In practice, this means the producer team can add a new field to an event without coordinating with all consumer teams. Old consumers simply ignore it, new ones can leverage it. Schema evolution instead of big-bang migration.
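
As an illustration (the event and field names are invented), version 2 of an Avro schema adds one optional field with a default value. The registry's compatibility check lets it through, and consumers still on version 1 simply never see the new channel field:

```json
{
  "type": "record",
  "name": "PolicyChanged",
  "namespace": "com.example.insurance",
  "fields": [
    {"name": "policyId", "type": "string"},
    {"name": "changedAt", "type": "long"},
    {"name": "channel", "type": ["null", "string"], "default": null}
  ]
}
```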

Stream Processing: Kafka Streams vs. Apache Flink

For stream processing in 2020, you have two main choices:

  • Kafka Streams: a Java library, no extra cluster needed, ideal for simpler transformations, filtering, aggregations. Runs as a regular Java application
  • Apache Flink: a full-featured streaming engine, its own cluster, significantly more powerful for complex event processing, windowing, pattern detection

Our recommendation: start with Kafka Streams. If you need complex event processing (CEP), sliding windows over millions of events, or exactly-once processing across systems — move to Flink.
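
To show what the "simpler transformations" end of that spectrum looks like, here is a minimal Kafka Streams sketch that filters a hypothetical payments topic and counts events per customer key. Topic names and the filter condition are assumptions for the example:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PaymentsPerCustomerApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-per-customer"); // also serves as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read payment events keyed by customer id, drop empty payloads, count per customer
        KStream<String, String> payments = builder.stream("payments");
        KTable<String, Long> perCustomer = payments
                .filter((customerId, payload) -> payload != null && !payload.isEmpty())
                .groupByKey()
                .count();
        perCustomer.toStream()
                .to("payments-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because it is just a Java application, you scale it by starting more instances with the same application.id; Kafka distributes the input partitions across them automatically.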

Production Operations — Lessons Learned

After two years of running Kafka clusters in production, we have several takeaways:

  • Monitoring is critical: Kafka JMX metrics into Prometheus + Grafana. Watch consumer lag, under-replicated partitions, request latency
  • Don’t set partition count too high: more partitions = more file handles, slower leader election. 12 partitions per topic is a good starting point
  • Plan retention ahead: 7 days default retention × 100 MB/s ≈ 60 TB per replica, and with replication factor 3 that is roughly 180 TB of raw disk. Plan accordingly
  • ZooKeeper is a critical dependency: run a dedicated ZooKeeper ensemble, not on the same nodes as the brokers (in 2020 you still need ZK; KRaft comes later)
  • Consumer group rebalancing: deploying new consumers can trigger a rebalancing storm; use cooperative rebalancing (Kafka 2.4+, see the config sketch after this list)
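
The last point boils down to one consumer setting. A minimal sketch, assuming the rest of the consumer configuration lives elsewhere:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

import java.util.Properties;

public class CooperativeConsumerConfig {
    // Incremental cooperative rebalancing (Kafka 2.4+): only the partitions that actually move
    // are revoked, so deploying a new consumer no longer stops the whole group.
    public static Properties withCooperativeRebalancing(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return props;
    }
}
```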

Managed vs. Self-Hosted

Confluent Cloud, AWS MSK, Azure Event Hubs (Kafka-compatible) — managed services reduce operational burden, but at a cost. For a client with 50 GB/day, Confluent Cloud is significantly more expensive than self-hosted. For a startup with one DevOps engineer, a managed service is a no-brainer.

Kafka as the Backbone of Event-Driven Architecture

Apache Kafka in 2020 isn’t experimental technology — it’s a production-proven foundation for real-time data platforms. The key to success isn’t Kafka itself, but thoughtful event design, schema management, and monitoring. Without them, it’s just fast chaos.

apache kafka · streaming · data platform · event-driven