Why Kafka?
Traditional message queues (like RabbitMQ) push messages to consumers and delete them after acknowledgment. Kafka is fundamentally different: it is a Distributed Commit Log. Messages are persisted on disk for days or weeks. Consumers "pull" messages at their own pace, and multiple consumers can independently read the same data.
This design makes Kafka ideal for scenarios where multiple services need to process the same events, or where you need the ability to replay past events (e.g., rebuilding a derived view after a bug fix).
1. Core Concepts
A. Topic
A category of messages (e.g., user-clicks, payments, order-updates). Producers write to topics, and consumers read from topics.
- Topics are logical containers. Physically, they are split into Partitions — the actual unit of storage and parallelism.
B. Partition (The Unit of Scalability)
A Topic is divided into ordered, immutable sequences called partitions (P0, P1, P2).
- Ordering: Messages are ordered only within a single partition, not across partitions.
- Parallelism: With 3 partitions, up to 3 consumers in the same consumer group can read in parallel, one per partition.
- Storage: Each partition is an append-only log stored as segment files on disk.
- Typical sizing: 10-30 partitions per topic for most workloads. High-throughput topics may use 100+.
C. Offset
Every message in a partition gets a sequential ID called an Offset (0, 1, 2, 3...).
- Kafka doesn't delete messages after consumption. Instead, each consumer tracks its own "offset pointer" — how far it has read.
- This enables Replayability: A consumer can rewind to offset 0 and re-process all events from the beginning.
- Offsets are stored in a special internal topic called `__consumer_offsets`.
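The offset and replay ideas above can be sketched with a toy in-memory log. This is only an illustration: `PartitionLog` and its methods are hypothetical names, not part of any Kafka client, and a real partition lives in segment files on disk.

```python
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset. Nothing is deleted."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for event in ["click:/home", "click:/cart", "click:/checkout"]:
    log.append(event)  # offsets 0, 1, 2

# The consumer, not the log, remembers how far it has read.
position = 0
batch = log.read(position)
position += len(batch)  # position is now 3

# Replay: rewind the pointer to offset 0 and read the same records again.
replayed = log.read(0)
```

Because reading never removes anything, rewinding is just a matter of changing the consumer's own pointer.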
D. Broker
A single Kafka server is called a Broker. A Kafka Cluster consists of multiple brokers:
- Each broker stores one or more partitions.
- One broker in the cluster acts as the Controller (manages partition assignments and leader elections).
- A typical production cluster has 3-30+ brokers.
2. Producers & Keys
How does a producer know which partition to send to?
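```python
# Assumes the kafka-python client and a broker at localhost:9092 (both illustrative)
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send a message with a key; the key determines the target partition
producer.send(
    topic="user-clicks",
    key=b"user:123",  # Partition key
    value=b'{"page": "/home", "time": "2026-01-15"}',
)
```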
`partition = hash(key) % num_partitions`

- Why keys matter: All events for `user:123` are guaranteed to go to the same partition (as long as the number of partitions does not change), ensuring they are processed in order. Without keys, messages are spread across partitions, and ordering between them is lost.
- Null keys: If no key is provided, the producer uses a round-robin or sticky partitioner for even distribution.
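The key-to-partition formula can be tried out with a small sketch. Note the assumptions: `choose_partition` is a hypothetical helper, and CRC32 is used here only as a convenient stable hash from the standard library. Kafka's default partitioner uses a different hash (murmur2), so the partition numbers will not match a real cluster. Python's built-in `hash()` is avoided because it is randomized per process for bytes.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key) % num_partitions


# The same key always lands on the same partition for a fixed partition count.
first = choose_partition(b"user:123", 6)
second = choose_partition(b"user:123", 6)
same_partition = first == second  # True

# Changing the partition count can move keys to different partitions,
# which is why per-key ordering only holds while the count stays fixed.
keys = [f"user:{i}".encode() for i in range(1000)]
moved = sum(1 for k in keys if choose_partition(k, 6) != choose_partition(k, 8))
```

In this sketch `moved` counts how many of the 1,000 sample keys change partition when the topic grows from 6 to 8 partitions.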
3. Consumer Groups
This is the key to Kafka's scalability for consumption.
- Problem: We need to process 1 million messages per second.
- Solution: Create a Consumer Group with 10 consumer instances. Kafka automatically assigns partitions to consumers within the group.
How It Works
- Topic with 6 partitions + Consumer Group with 3 consumers:
  - Consumer 1 → P0, P1
  - Consumer 2 → P2, P3
  - Consumer 3 → P4, P5
- Rebalancing: If Consumer 2 dies, Kafka re-assigns P2 and P3 to the surviving consumers:
  - Consumer 1 → P0, P1, P2
  - Consumer 3 → P3, P4, P5
- Scaling rule: Adding consumers beyond the number of partitions is wasteful because the extra consumers sit idle.
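The assignment and rebalancing example above can be reproduced with a simplified sketch. `assign_partitions` is a hypothetical function that hands out contiguous chunks of partitions. Real Kafka assignors (range, round-robin, sticky, cooperative) are more involved and may produce different layouts.

```python
def assign_partitions(partitions, consumers):
    """Split partitions into contiguous chunks, one chunk per consumer."""
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        size = base + (1 if i < extra else 0)  # first `extra` consumers get one more
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment


partitions = ["P0", "P1", "P2", "P3", "P4", "P5"]

before = assign_partitions(partitions, ["consumer-1", "consumer-2", "consumer-3"])
# {'consumer-1': ['P0', 'P1'], 'consumer-2': ['P2', 'P3'], 'consumer-3': ['P4', 'P5']}

# consumer-2 dies: rerun the assignment with the survivors.
after = assign_partitions(partitions, ["consumer-1", "consumer-3"])
# {'consumer-1': ['P0', 'P1', 'P2'], 'consumer-3': ['P3', 'P4', 'P5']}

# More consumers than partitions: the extra consumer gets nothing.
crowded = assign_partitions(["P0", "P1"], ["c1", "c2", "c3"])
# {'c1': ['P0'], 'c2': ['P1'], 'c3': []}
```

The last call illustrates the scaling rule: with 2 partitions and 3 consumers, one consumer ends up with an empty assignment.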
Multiple Consumer Groups
Multiple consumer groups can read the same topic independently. Each group maintains its own offsets:
- Group A (Analytics): Counting clicks per page.
- Group B (Recommendations): Building user profiles.
- Group C (Alerting): Detecting anomalies.
All three groups read every message, but at their own pace.
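A minimal sketch of independent group offsets, assuming a single partition held in a Python list. The `committed` dictionary and `poll` function are illustrative stand-ins for what Kafka tracks per group. They are not a client API.

```python
# One partition's records, shared by every group.
log = ["e0", "e1", "e2", "e3", "e4"]

# Each group has its own committed offset for this partition.
committed = {"analytics": 0, "recommendations": 0, "alerting": 0}


def poll(group, max_records):
    """Read from the group's offset and advance only that group's offset."""
    start = committed[group]
    batch = log[start:start + max_records]
    committed[group] = start + len(batch)
    return batch


fast = poll("analytics", 5)   # reads all five records
slow = poll("alerting", 2)    # reads only the first two
# "recommendations" has not polled yet, so its offset is still 0.
```

Each group eventually sees every record, and one group falling behind does not affect the others.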
4. Reliability (Replication & ISR)
How do we ensure zero data loss?
Replication
- Replication Factor: Typically 3 (1 Leader, 2 Followers).
- The Leader handles all writes and, by default, all reads for a partition. Followers replicate data from the Leader.
- If the Leader broker fails, one of the Followers is promoted to Leader automatically.
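Leader failover can be sketched as choosing a surviving, caught-up follower. This is a deliberately simplified model: `elect_leader` is a hypothetical function, broker IDs are plain integers, and the set of caught-up replicas is passed in directly (the next subsection calls this set the ISR). The real controller logic handles many more cases.

```python
def elect_leader(replicas, in_sync, failed_broker):
    """Return the first surviving in-sync replica, in replica-list order.

    Returns None when no in-sync replica survives, meaning the partition
    has no safe candidate to promote.
    """
    for broker in replicas:
        if broker != failed_broker and broker in in_sync:
            return broker
    return None


# Replication factor 3: broker 1 leads, brokers 2 and 3 follow and are caught up.
new_leader = elect_leader(replicas=[1, 2, 3], in_sync={1, 2, 3}, failed_broker=1)
# new_leader == 2

# If only the failed leader was in sync, there is nobody safe to promote.
no_leader = elect_leader(replicas=[1, 2, 3], in_sync={1}, failed_broker=1)
# no_leader is None
```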
In-Sync Replicas (ISR)
- ISR: The set of replicas that are fully caught up with the Leader's latest data.
- The producer's `acks` setting controls durability:
  - `acks=0`: Don't wait for any acknowledgment (fastest, data loss possible).
  - `acks=1`: Wait for Leader to acknowledge (balanced).
  - `acks=all`: Wait for ALL replicas in the ISR to acknowledge (slowest, strongest guarantee).
```properties
# Producer config for maximum durability
acks=all
retries=3

# Topic (or broker) config, not a producer setting
min.insync.replicas=2
```
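How `acks` and `min.insync.replicas` interact can be modeled in a few lines. This is a sketch of the decision only, with `required_acks` as a hypothetical helper. It assumes `min.insync.replicas` is enforced only for `acks=all` writes and that a write is rejected when the ISR is smaller than that minimum.

```python
def required_acks(acks: str, isr_size: int, min_insync_replicas: int = 1) -> int:
    """Return how many replicas must have the record before the producer is acked.

    Raises RuntimeError when acks=all is requested but the ISR is smaller
    than min_insync_replicas, modeling a rejected write.
    """
    if acks == "0":
        return 0                      # fire and forget
    if acks == "1":
        return 1                      # leader only
    if acks == "all":
        if isr_size < min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas")
        return isr_size               # every replica currently in the ISR
    raise ValueError(f"unknown acks value: {acks!r}")


healthy = required_acks("all", isr_size=3, min_insync_replicas=2)   # 3
degraded = required_acks("all", isr_size=2, min_insync_replicas=2)  # 2

try:
    required_acks("all", isr_size=1, min_insync_replicas=2)
    rejected = False
except RuntimeError:
    rejected = True                   # write refused rather than stored on one replica
```

The degraded case shows why `min.insync.replicas=2` matters: an acknowledged write always exists on at least two replicas, so it survives the loss of either one.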
5. ZooKeeper vs KRaft
The Old Way (Pre-2.8): ZooKeeper
- Kafka relied on Apache ZooKeeper for metadata management: tracking which brokers are alive, which broker is the Controller, and partition leadership.
- Problems: ZooKeeper was a separate system to deploy, monitor, and scale. It became a bottleneck as clusters grew beyond ~200,000 partitions.
The New Way: KRaft (Kafka Raft)
- KRaft was introduced as early access in Kafka 2.8 and declared production-ready in 3.3. In this mode, Kafka manages its own metadata using an internal Raft consensus quorum.
- Benefits: Simpler operations (no ZooKeeper), faster Controller failover, support for millions of partitions per cluster.
- ZooKeeper mode was deprecated in Kafka 3.5 and removed in Kafka 4.0.
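A Raft quorum makes progress only while a majority of its voters are alive. The arithmetic below is generic majority-quorum math, not Kafka code, and `quorum_size` and `tolerated_failures` are hypothetical helper names.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority remains."""
    return voters - quorum_size(voters)


sizes = {n: (quorum_size(n), tolerated_failures(n)) for n in (1, 3, 4, 5)}
# {1: (1, 0), 3: (2, 1), 4: (3, 1), 5: (3, 2)}
```

Going from 3 to 4 voters does not increase the number of tolerated failures, which is why odd-sized quorums are the usual choice.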
6. Real-World Use Cases
- Activity Tracking (LinkedIn, Netflix): Track every user click, page view, and search query in real-time. Kafka was originally built at LinkedIn for this purpose.
- Log Aggregation: Collect logs from thousands of microservices into a central system (ELK stack, Splunk) via Kafka.
- Event Sourcing / CQRS: Store every state change as an immutable event. Replay events to rebuild system state.
- Stream Processing: Process events in real-time with Kafka Streams or Apache Flink.
- Change Data Capture (CDC): Capture row-level changes from databases (MySQL, PostgreSQL) and stream them to downstream systems via Debezium + Kafka.
Summary
| Concept | Purpose |
|---|---|
| Topic | Logical category for messages |
| Partition | Unit of parallelism and ordering |
| Offset | Position in the log (enables replay) |
| Consumer Group | Mechanism for scalable, parallel consumption |
| Replication (ISR) | Ensures durability even if brokers fail |
| KRaft | Self-managed metadata (replaces ZooKeeper) |
Interview Tips 💡
- Start with the problem: "Traditional queues delete messages after reading. Kafka retains them — enabling replay, multiple consumers, and event sourcing."
- Ordering guarantee: "Kafka guarantees ordering within a partition, not across partitions. Choose partition keys carefully."
- Scaling consumers: "The max parallelism is limited by the number of partitions. More consumers than partitions = idle consumers."
- Durability vs. latency: "`acks=all` with `min.insync.replicas=2` protects acknowledged writes against the loss of a single broker, at the cost of higher write latency."
- KRaft: "Modern Kafka (3.3+) no longer needs ZooKeeper. It uses an internal Raft quorum for metadata management."
Related Concepts
- Message Queues — Kafka vs. RabbitMQ vs. SQS
- Event Sourcing & CQRS — Pattern that leverages Kafka's log
- Distributed Transactions — Kafka as the event backbone
- Backpressure — Handling producer/consumer speed mismatches