Apache Kafka — Real-Time Data Streaming in Practice

18. 05. 2020 · 4 min read · CORE SYSTEMS · data

Your batch ETL runs once a day and your dashboard shows yesterday’s data? In 2020, that’s too slow for many businesses. Apache Kafka enables processing millions of events per second in real time. Let’s look at how and when to deploy it.

What Is Kafka and Why You Need It

Apache Kafka is a distributed streaming platform. Originally created at LinkedIn for processing activity logs, it’s now the de facto standard for event-driven architectures. Kafka is not a message queue (even though it’s often used as one) — it’s a distributed commit log with ordering guarantees, persistence, and replay capability.

Key features that differentiate it from RabbitMQ or ActiveMQ:

  • Retention: messages remain in the topic even after being read; consumer groups read independently and can replay at any time (see the consumer sketch after this list)
  • Horizontal scaling: partitioning enables parallel processing — add brokers, increase throughput
  • Ordering guarantee: messages within a single partition are strictly ordered
  • Throughput: millions of messages per second on commodity hardware — LinkedIn processes 7 trillion messages per day
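
To make the replay point concrete, here is a minimal Java consumer sketch. The broker address and the customer-events topic are made-up examples; because the consumer uses a fresh group id and auto.offset.reset=earliest, it starts from the oldest message still within retention, i.e. a full replay, without disturbing any other consumer group.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address and topic name, for illustration only
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-" + System.currentTimeMillis()); // fresh group = independent offsets
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // no committed offset -> start at the oldest retained message

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                // poll returns whatever is available within 500 ms; processing here is at-least-once
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```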

Architecture: Brokers, Topics, and Partitions

A Kafka cluster consists of brokers (servers). Data is organized into topics (logical channels) and each topic is divided into partitions (physical logs). The replication factor determines how many copies of each partition exist — typically 3 for production.
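
As a sketch of how these knobs are set in practice, a topic with an explicit partition count and replication factor can be created with the AdminClient. Broker addresses and the topic name are illustrative assumptions:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateCustomerEventsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Illustrative broker list; replace with your own cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions (the starting point recommended later in this article), 3 replicas
            NewTopic topic = new NewTopic("customer-events", 12, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // blocks until the brokers confirm
        }
    }
}
```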

For our deployment in the banking sector, we chose 5 brokers, replication factor 3, and partitioning by customer ID. This guarantees that all events for a single customer are in one partition — strict ordering without compromise.
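
A minimal producer sketch of that idea: with the customer ID as the message key, the default partitioner hashes the key, so every event for a given customer lands in the same partition and keeps its order. Topic name, key, and payload are invented for the example:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> strict ordering of this customer's events
            String customerId = "customer-42"; // hypothetical key
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"ADDRESS_CHANGED\",\"customerId\":\"customer-42\"}"));
            producer.flush();
        }
    }
}
```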

Kafka Connect — Integration Without Code

Kafka Connect is a framework for streaming data between Kafka and external systems. Source connectors read data into Kafka, sink connectors write them out. Ready-made connectors exist for most databases:

  • Debezium: CDC (Change Data Capture) for PostgreSQL, MySQL, SQL Server, Oracle — captures every database change as an event
  • JDBC Source/Sink: polling-based, simpler but without real-time guarantees
  • Elasticsearch Sink: automatic indexing into Elasticsearch for full-text search
  • S3 Sink: archival to S3 for long-term storage and analytics

On a project for an insurance company, we used Debezium CDC to capture changes in an Oracle database of insurance policies and stream them into a data lake on Azure Blob Storage. From daily batch ETL to real-time — latency under 5 seconds.
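
For illustration, a Debezium Oracle source connector registered through the Kafka Connect REST API looks roughly like the JSON below. Hostnames, credentials, and table names are placeholders, and exact property names differ between Debezium versions (database.server.name, for instance, was later renamed to topic.prefix), so take it as a sketch rather than a drop-in config:

```json
{
  "name": "policies-oracle-cdc",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "tasks.max": "1",
    "database.hostname": "oracle.example.internal",
    "database.port": "1521",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "POLICIES",
    "database.server.name": "insurance",
    "table.include.list": "POLICY_SCHEMA.POLICY,POLICY_SCHEMA.POLICY_EVENT",
    "database.history.kafka.bootstrap.servers": "broker1:9092",
    "database.history.kafka.topic": "schema-changes.policies"
  }
}
```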

Schema Registry — Evolution Without Chaos

Without schema management, Kafka is just a glorified byte pipe. Confluent Schema Registry stores Avro/JSON/Protobuf schemas and enforces compatibility rules. Under the default BACKWARD mode, adding an optional field with a default value passes, while adding a required field without a default is rejected; stricter modes (FORWARD, FULL) also catch changes that would break existing consumers, such as removing a field they still rely on.

In practice, this means the producer team can add a new field to an event without coordinating with all consumer teams. Old consumers simply ignore it, new ones can leverage it. Schema evolution instead of big-bang migration.
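
As an illustration (the event and field names are invented), version 2 of an Avro schema adds one optional field with a default value. The registry's compatibility check lets it through, and consumers still on version 1 simply never see the new channel field:

```json
{
  "type": "record",
  "name": "PolicyChanged",
  "namespace": "com.example.insurance",
  "fields": [
    {"name": "policyId", "type": "string"},
    {"name": "changedAt", "type": "long"},
    {"name": "channel", "type": ["null", "string"], "default": null}
  ]
}
```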

Stream Processing: Kafka Streams vs. Apache Flink

For stream processing in 2020, you have two main choices:

  • Kafka Streams: a Java library, no extra cluster needed, ideal for simpler transformations, filtering, aggregations. Runs as a regular Java application
  • Apache Flink: a full-featured streaming engine, its own cluster, significantly more powerful for complex event processing, windowing, pattern detection

Our recommendation: start with Kafka Streams. If you need complex event processing (CEP), sliding windows over millions of events, or exactly-once processing across systems — move to Flink.
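
To show what the "simpler transformations" end of that spectrum looks like, here is a minimal Kafka Streams sketch that filters a hypothetical payments topic and counts events per customer key. Topic names and the filter condition are assumptions for the example:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PaymentsPerCustomerApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-per-customer"); // also serves as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read payment events keyed by customer id, drop empty payloads, count per customer
        KStream<String, String> payments = builder.stream("payments");
        KTable<String, Long> perCustomer = payments
                .filter((customerId, payload) -> payload != null && !payload.isEmpty())
                .groupByKey()
                .count();
        perCustomer.toStream()
                .to("payments-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because it is just a Java application, you scale it by starting more instances with the same application.id; Kafka distributes the input partitions across them automatically.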

Production Operations — Lessons Learned

After two years of running Kafka clusters in production, we have several takeaways:

  • Monitoring is critical: Kafka JMX metrics into Prometheus + Grafana. Watch consumer lag, under-replicated partitions, request latency
  • Don’t set partition count too high: more partitions = more file handles, slower leader election. 12 partitions per topic is a good starting point
  • Plan retention ahead: 7 days default retention × 100 MB/s ≈ 60 TB per replica, and with replication factor 3 that is roughly 180 TB of raw disk. Plan accordingly
  • ZooKeeper is a critical dependency: run a dedicated ZooKeeper ensemble, not on the same nodes as the brokers (in 2020 you still need ZK; KRaft comes later)
  • Consumer group rebalancing: deploying new consumers can trigger a rebalancing storm; use cooperative rebalancing (Kafka 2.4+, see the config sketch after this list)
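
The last point boils down to one consumer setting. A minimal sketch, assuming the rest of the consumer configuration lives elsewhere:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

import java.util.Properties;

public class CooperativeConsumerConfig {
    // Incremental cooperative rebalancing (Kafka 2.4+): only the partitions that actually move
    // are revoked, so deploying a new consumer no longer stops the whole group.
    public static Properties withCooperativeRebalancing(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return props;
    }
}
```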

Managed vs. Self-Hosted

Confluent Cloud, AWS MSK, Azure Event Hubs (Kafka-compatible) — managed services reduce operational burden, but at a cost. For a client with 50 GB/day, Confluent Cloud is significantly more expensive than self-hosted. For a startup with one DevOps engineer, a managed service is a no-brainer.

Kafka as the Backbone of Event-Driven Architecture

Apache Kafka in 2020 isn’t experimental technology — it’s a production-proven foundation for real-time data platforms. The key to success isn’t Kafka itself, but thoughtful event design, schema management, and monitoring. Without them, it’s just fast chaos.

apache kafka · streaming · data platform · event-driven