Why Kafka?

Traditional message queues (like RabbitMQ) push messages to consumers and delete them after acknowledgment. Kafka is fundamentally different: it is a distributed commit log. Messages are persisted on disk for a configurable retention period, commonly days or weeks. Consumers "pull" messages at their own pace, and multiple consumers can independently read the same data.

This design makes Kafka ideal for scenarios where multiple services need to process the same events, or where you need the ability to replay past events (e.g., rebuilding a derived view after a bug fix).

[Interactive diagram: "The Commit Log". Kafka pub/sub flow in which a producer appends events to the topic "orders", each message is stored at an offset (starting at 0), and a consumer tracks its own offset.]

1. Core Concepts

A. Topic

A category of messages (e.g., user-clicks, payments, order-updates). Producers write to topics, and consumers read from topics.

Topics are logical containers. Physically, they are split into Partitions — the actual unit of storage and parallelism.

B. Partition (The Unit of Scalability)

A Topic is divided into ordered, immutable sequences called partitions (P0, P1, P2).

Ordering: Messages are ordered only within a single partition, not across partitions.
Parallelism: With 3 partitions, up to 3 consumers in the same consumer group can read in parallel, one per partition.
Storage: Each partition is an append-only log stored as segment files on disk.
Typical sizing: 10-30 partitions per topic for most workloads. High-throughput topics may use 100+.

C. Offset

Every message in a partition gets a sequential ID called an Offset (0, 1, 2, 3...).

Kafka doesn't delete messages after consumption; they are removed only when the topic's retention policy (time- or size-based) expires them. Each consumer tracks its own "offset pointer", which records how far it has read.
This enables Replayability: A consumer can rewind to the earliest retained offset and re-process every event still in the log.
Committed offsets are stored per consumer group in a special internal topic called __consumer_offsets.
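The offset model described above can be sketched with a small in-memory log. This is a toy illustration rather than Kafka's storage engine; the `PartitionLog` class and its method names are hypothetical.

```python
class PartitionLog:
    """Toy append-only log for a single partition (hypothetical, in-memory only)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records (offset, value) pairs starting at offset."""
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# The consumer, not the log, remembers how far it has read.
position = 0
first_pass = log.read(position)
position += len(first_pass)

# Reading removed nothing, so rewinding replays the same records.
position = 0
replay = log.read(position)
```

Because reads never mutate the log, `first_pass` and `replay` contain the same records; the only state that changed between them is the consumer's own position.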

D. Broker

A single Kafka server is called a Broker. A Kafka Cluster consists of multiple brokers:

Each broker stores one or more partitions.
One broker in the cluster acts as the Controller (manages partition assignments and leader elections).
A typical production cluster has 3-30+ brokers.

2. Producers & Keys

How does a producer know which partition to send to?

python

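# Assumes the kafka-python client and a broker at localhost:9092
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Producer sends a message with a key
producer.send(
    topic="user-clicks",
    key=b"user:123",      # Partition key
    value=b'{"page": "/home", "time": "2026-01-15"}'
)
producer.flush()  # Block until buffered messages are delivered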


Partition = hash(key) % num_partitions (the default Java partitioner hashes the key bytes with murmur2).
Why keys matter: All events for user:123 go to the same partition as long as the partition count does not change, so they are processed in order. Without keys, messages are spread across partitions, and ordering between related events is lost.
Null keys: If no key is provided, the producer uses a round-robin or sticky partitioner for even distribution.
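A minimal sketch of key-based partitioning follows. It assumes CRC32 as a stand-in hash, so the partition numbers it produces will not match those chosen by a real Kafka client; `choose_partition` is a hypothetical helper, not a client API.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Assumption: CRC32 stands in for the client's real hash function.
    """
    return zlib.crc32(key) % num_partitions


num_partitions = 6
first = choose_partition(b"user:123", num_partitions)
second = choose_partition(b"user:123", num_partitions)

# The result depends on the partition count, so resizing a topic
# can move a key to a different partition.
partition_after_resize = choose_partition(b"user:123", 12)
```

The sketch uses `zlib.crc32` rather than Python's built-in `hash()` because the built-in is salted per process for `bytes` and `str`, which would send the same key to different partitions across restarts.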

3. Consumer Groups

This is the key to Kafka's scalability for consumption.

Problem: We need to process 1 million messages per second.
Solution: Create a Consumer Group with 10 consumer instances reading a topic that has at least 10 partitions. Kafka automatically assigns partitions to consumers within the group.

How It Works

Topic with 6 partitions + Consumer Group with 3 consumers:
- Consumer 1 → P0, P1
- Consumer 2 → P2, P3
- Consumer 3 → P4, P5
Rebalancing: If Consumer 2 dies, Kafka re-assigns P2 and P3 to the surviving consumers:
- Consumer 1 → P0, P1, P2
- Consumer 3 → P3, P4, P5
Scaling rule: Adding consumers beyond the number of partitions is wasteful — the extra consumers sit idle.
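The assignment and rebalancing behavior above can be sketched as a simplified, range-style split. This is an illustration under assumptions: real assignors handle multiple topics and try to limit partition movement, and `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(partitions, consumers):
    """Split partitions into contiguous blocks, one block per consumer."""
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers take one additional partition.
        size = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment


partitions = ["P0", "P1", "P2", "P3", "P4", "P5"]

before = assign_partitions(partitions, ["consumer-1", "consumer-2", "consumer-3"])

# consumer-2 dies; the group rebalances across the survivors.
after = assign_partitions(partitions, ["consumer-1", "consumer-3"])

# More consumers than partitions: the extra consumer receives nothing.
oversized = assign_partitions(["P0", "P1", "P2"], ["c1", "c2", "c3", "c4"])
```

In `oversized`, consumer `c4` ends up with an empty list, which is the idle-consumer case the scaling rule describes.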

Multiple Consumer Groups

Multiple consumer groups can read the same topic independently. Each group maintains its own offsets:

Group A (Analytics): Counting clicks per page.
Group B (Recommendations): Building user profiles.
Group C (Alerting): Detecting anomalies.

All three groups read every message, but at their own pace.
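One way to picture independent group progress is a table of committed offsets keyed by group, topic, and partition. The numbers and group names below are made up for illustration, and `lag` is a hypothetical helper.

```python
log_end_offset = 100  # hypothetical: the partition holds offsets 0..99

# Each value is the offset of the next record that group will read.
committed = {
    ("analytics", "user-clicks", 0): 100,
    ("recommendations", "user-clicks", 0): 60,
    ("alerting", "user-clicks", 0): 95,
}


def lag(group, topic, partition):
    """Number of records the group has not yet read in this partition."""
    return log_end_offset - committed[(group, topic, partition)]


# Advancing one group's offset leaves the other groups untouched.
committed[("recommendations", "user-clicks", 0)] += 10

lags = {group: lag(group, topic, p) for (group, topic, p) in committed}
```

A slow group (here, recommendations) simply shows a larger lag; it does not hold back the other groups or affect what they read.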

4. Reliability (Replication & ISR)

How do we ensure zero data loss?

Replication

Replication Factor: Typically 3 (1 Leader, 2 Followers).
The Leader handles all writes and, by default, all reads for a partition. Followers replicate data from the Leader.
If the Leader broker fails, one of the Followers is promoted to Leader automatically.
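Leader failover can be sketched as picking the first assigned replica that is still in sync and alive. This is a simplified model with made-up broker IDs; `elect_leader` is a hypothetical helper, not a Kafka API.

```python
def elect_leader(replicas, isr, live_brokers):
    """Pick the first assigned replica that is both in sync and alive.

    Returns None when no eligible replica remains, in which case the
    partition cannot accept writes.
    """
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None


replicas = [1, 2, 3]   # broker IDs hosting this partition (hypothetical)
isr = {1, 2, 3}
live = {1, 2, 3}

leader = elect_leader(replicas, isr, live)

# Broker 1 fails: it leaves both the live set and the in-sync set.
live.discard(1)
isr.discard(1)
new_leader = elect_leader(replicas, isr, live)
```

Restricting candidates to in-sync replicas is what keeps acknowledged writes from being lost during failover.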

In-Sync Replicas (ISR)

ISR: The set of replicas (including the Leader) that have kept up with the Leader's log within the replica.lag.time.max.ms window.
The producer's acks setting controls durability:
- acks=0: Don't wait for any acknowledgment (fastest, data loss possible).
- acks=1: Wait for the Leader to acknowledge (balanced; data can still be lost if the Leader fails before followers replicate the write).
- acks=all: Wait for ALL replicas in the ISR to acknowledge (slowest, strongest guarantee).

properties

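# Producer config for maximum durability
acks=all
retries=3

# Topic- or broker-level setting (not a producer config):
# with acks=all, writes fail unless at least 2 replicas are in sync
min.insync.replicas=2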

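How acks and min.insync.replicas interact can be sketched as a small decision function. This is a simplified model of the leader's behavior, not broker code; `write_outcome` and its return strings are hypothetical.

```python
def write_outcome(acks, isr_size, min_insync_replicas):
    """Describe how a leader treats one produce request (simplified).

    acks is "0", "1", or "all"; isr_size counts the leader itself.
    """
    if acks == "0":
        return "sent, no acknowledgment awaited"
    if acks == "1":
        return "acknowledged after leader write"
    if acks == "all":
        # min.insync.replicas is only enforced for acks=all.
        if isr_size < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return f"acknowledged after {isr_size} in-sync replicas have the record"
    raise ValueError(f"unknown acks value: {acks!r}")


healthy = write_outcome("all", isr_size=3, min_insync_replicas=2)
degraded = write_outcome("all", isr_size=2, min_insync_replicas=2)
unsafe = write_outcome("all", isr_size=1, min_insync_replicas=2)
leader_only = write_outcome("1", isr_size=1, min_insync_replicas=2)
```

With a replication factor of 3 and min.insync.replicas=2, one replica can fall behind without blocking writes, while a second failure makes acks=all writes fail rather than risk an under-replicated record.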

5. ZooKeeper vs KRaft

The Old Way (Pre-2.8): ZooKeeper

Kafka relied on Apache ZooKeeper for metadata management: tracking which brokers are alive, which broker is the Controller, and partition leadership.
Problems: ZooKeeper was a separate system to deploy, monitor, and scale. It became a bottleneck as clusters grew beyond ~200,000 partitions.

The New Way: KRaft (Kafka Raft)

Introduced as early access in Kafka 2.8 and production-ready as of 3.3, KRaft lets Kafka manage its own metadata using an internal Raft consensus quorum.
Benefits: Simpler operations (no ZooKeeper), faster Controller failover, support for millions of partitions per cluster.
ZooKeeper mode was deprecated in Kafka 3.5 and removed in Kafka 4.0.
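A Raft-based quorum stays available only while a majority of its voters are reachable, which is why controller quorums are usually sized with an odd number of nodes. The arithmetic can be sketched as follows; the function names are hypothetical.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority remains."""
    return voters - majority(voters)


# Maps quorum size to (majority needed, failures tolerated).
sizes = {n: (majority(n), tolerated_failures(n)) for n in (1, 3, 4, 5)}
```

Going from 3 to 4 voters raises the required majority without tolerating any additional failures, so an even-sized quorum adds cost but no resilience.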

6. Real-World Use Cases

Activity Tracking (LinkedIn, Netflix): Track every user click, page view, and search query in real-time. Kafka was originally built at LinkedIn for this purpose.
Log Aggregation: Collect logs from thousands of microservices into a central system (ELK stack, Splunk) via Kafka.
Event Sourcing / CQRS: Store every state change as an immutable event. Replay events to rebuild system state.
Stream Processing: Process events in real-time with Kafka Streams or Apache Flink.
Change Data Capture (CDC): Capture row-level changes from databases (MySQL, PostgreSQL) and stream them to downstream systems via Debezium + Kafka.

Summary

Concept	Purpose
Topic	Logical category for messages
Partition	Unit of parallelism and ordering
Offset	Position in the log (enables replay)
Consumer Group	Mechanism for scalable, parallel consumption
Replication (ISR)	Ensures durability even if brokers fail
KRaft	Self-managed metadata (replaces ZooKeeper)

Interview Tips 💡

Start with the problem: "Traditional queues delete messages after reading. Kafka retains them — enabling replay, multiple consumers, and event sourcing."
Ordering guarantee: "Kafka guarantees ordering within a partition, not across partitions. Choose partition keys carefully."
Scaling consumers: "The max parallelism is limited by the number of partitions. More consumers than partitions = idle consumers."
Durability vs. latency: "acks=all with min.insync.replicas=2 (and a replication factor of 3) means an acknowledged write survives the loss of a single broker, at the cost of higher write latency."
KRaft: "Modern Kafka (3.3+) no longer needs ZooKeeper. It uses an internal Raft quorum for metadata management."

Related Concepts

Message Queues — Kafka vs. RabbitMQ vs. SQS
Event Sourcing & CQRS — Pattern that leverages Kafka's log
Distributed Transactions — Kafka as the event backbone
Backpressure — Handling producer/consumer speed mismatches

