Raft: Understandable Distributed Consensus
Raft is a consensus algorithm designed as an alternative to Paxos. It was specifically designed to be understandable. It is the foundation of modern distributed systems like Etcd (Kubernetes), Consul, and CockroachDB.
1. The Problem: State Machine Replication
Imagine you have a single server storing x=5. If it crashes, data is lost. To survive crashes, we need multiple servers (Replicas). If we have 3 servers (S1, S2, S3), they must all execute the same commands in the same order.
Consensus is the process of agreeing on one result among a group of participants.
2. Node States
At any given time, each server is in one of three states:
- Follower: Passive. Issues no requests but responds to requests from leaders and candidates.
- Candidate: Active. Used to elect a new leader.
- Leader: Active. Handles all client requests and replicates entries to followers.
3. Leader Election
Raft uses a heartbeat mechanism to trigger elections.
- Heartbeats: The Leader sends empty AppendEntries RPCs (heartbeats) to all Followers periodically (e.g., every 50ms).
- Election Timeout: Each Follower has a randomized timeout (e.g., 150ms - 300ms). If it doesn't hear from a Leader within this time, it assumes the Leader is dead.
The Election Process
- Follower increments its currentTerm.
- Transitions to Candidate state.
- Votes for itself.
- Sends RequestVote RPCs to all other servers.
Winning
A Candidate wins if it receives votes from a majority of the cluster (N/2 + 1).
Why Randomization Works
Randomized timeouts (150-300ms) prevent split votes:
Bad Scenario (Fixed Timeout):
- All 5 nodes timeout at exactly 150ms
- All 5 become candidates simultaneously
- Each votes for itself (1 vote each)
- Nobody wins → Retry forever
Good Scenario (Random Timeout):
- Node A times out at 153ms (becomes candidate first)
- Nodes B, C, D, E still followers
- A sends RequestVote to everyone
- B, C, D vote for A (A wins with 4/5 votes)
- E's timeout is interrupted by A becoming leader
Election Algorithm Details
class RaftNode:
def __init__(self):
self.state = "FOLLOWER"
self.current_term = 0
self.voted_for = None
self.election_timeout = random.randint(150, 300) # milliseconds
def start_election(self):
# 1. Increment term
self.current_term += 1
# 2. Transition to candidate
self.state = "CANDIDATE"
# 3. Vote for self
self.voted_for = self.id
votes_received = 1
# 4. Send RequestVote to all peers
for peer in self.peers:
response = peer.request_vote({
"term": self.current_term,
"candidate_id": self.id,
"last_log_index": len(self.log),
"last_log_term": self.log[-1].term if self.log else 0
})
if response.vote_granted:
votes_received += 1
# 5. Check if won
if votes_received > len(self.peers) / 2:
self.become_leader()
4. Log Replication
Once a Leader is elected, it services client requests.
Client sends command SET x=5.
- Append: Leader appends command to its local log (not yet executed).
- Replicate: Leader sends AppendEntries RPC to all Followers.
- Wait for Majority: Leader waits for majority to acknowledge receipt.
- Commit: When majority confirms, Leader marks entry as committed.
- Apply: Leader applies the command to its state machine.
- Notify: On next heartbeat, Leader tells Followers which entries are committed.
- Followers Apply: Followers apply committed entries to their state machines.
Log Entry Structure
class LogEntry:
def __init__(self, term, index, command):
self.term = term # Term when entry was received by leader
self.index = index # Position in log
self.command = command # State machine command
# Example log
log = [
LogEntry(term=1, index=1, command="SET x=1"),
LogEntry(term=1, index=2, command="SET y=2"),
LogEntry(term=2, index=3, command="SET x=5"), # New leader in term 2
]
Commit Index vs Applied Index
- commitIndex: Highest log entry known to be committed (replicated on majority)
- lastApplied: Highest log entry applied to state machine
Invariant: lastApplied ≤ commitIndex ≤ lastLogIndex
def apply_committed_entries(self):
# Apply all entries between lastApplied and commitIndex
while self.last_applied < self.commit_index:
self.last_applied += 1
entry = self.log[self.last_applied]
self.state_machine.apply(entry.command)
Interactive: Replication Flow Diagram
Raft Log Replication
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: SET x=5
L->>L: Append to log[3]
L->>F1: AppendEntries RPC
L->>F2: AppendEntries RPC
F1-->>L: ACK
F2-->>L: ACK
Note over L: 3/3 nodes have entry<br/>Commit it!
L->>L: commitIndex = 3
L->>L: Apply command
L-->>C: Success
L->>F1: commitIndex = 3 (next heartbeat)
L->>F2: commitIndex = 3
F1->>F1: Apply command
F2->>F2: Apply command
5. Safety Guarantees
Election Restriction: Log Completeness
A candidate can only win election if its log is "at least as up-to-date" as a majority.
Rule: When voting, a node rejects a candidate if:
def should_grant_vote(self, candidate):
# Reject if candidate's log is behind
last_log_term = self.log[-1].term if self.log else 0
last_log_index = len(self.log)
if candidate.last_log_term < last_log_term:
return False
if candidate.last_log_term == last_log_term and candidate.last_log_index < last_log_index:
return False
return True
Why? Ensures new leader has all committed entries. Otherwise, committed data could be lost.
Leader Completeness Property
If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.
This is Raft's key safety property.
Stale Leaders (Network Partitions)
Imagine a 5-node cluster (A, B, C, D, E) with Leader A. Network cuts A, B from C, D, E.
- Partition 1 (A, B): A is still leader, but cannot replicate to majority (needs 3). Writes will hang/fail (cannot commit).
- Partition 2 (C, D, E): C times out, wins election with term 2 (gets 3 votes: C, D, E). C accepts new writes.
When network heals:
- A sends AppendEntries with term 1
- C rejects ("Your term is stale")
- A sees term 2 > term 1, steps down to follower
- A's uncommitted entries are discarded
- A syncs with C's log
graph TB
subgraph Partition 1
A[Leader A<br/>Term 1]
B[Follower B]
end
subgraph Partition 2
C[New Leader C<br/>Term 2]
D[Follower D]
E[Follower E]
end
A -.X Network Cut X..- C
6. Configuration Changes
How do you add/remove nodes without downtime?
The Problem: Disjoint Majorities
If you instantly switch from 3 nodes to 5 nodes:
- Old majority: 2/3
- New majority: 3/5
- Danger: Could elect 2 leaders simultaneously
Solution: Joint Consensus
Raft uses a two-phase approach:
- Phase 1: Replicate configuration
C_old,new(requires majority in BOTH configs) - Phase 2: Replicate configuration
C_new(requires majority in new config only)
Old config: {A, B, C}
New config: {A, B, C, D, E}
Phase 1: C_old,new requires:
- 2/3 from {A, B, C}
- 3/5 from {A, B, C, D, E}
Phase 2: C_new requires:
- 3/5 from {A, B, C, D, E}
Real-World Usage
etcd (Kubernetes)
- Stores Kubernetes cluster state
- 3 or 5 node cluster (odd numbers for majority)
- Typical deployment: 5 etcd nodes across 3 availability zones
- Heartbeat interval: 100ms
- Election timeout: 1000ms
Consul (Service Discovery)
- Service mesh configuration
- Health checks and service registry
- Uses Raft for consistency
CockroachDB (Database)
- Uses Raft for replicating table ranges
- Each table range has its own Raft group
- Scales to thousands of Raft groups per node
Performance Tuning
Election Timeout Tuning
Network latency: 10ms Election timeout should be: 10-20x network latency Recommended: - Same datacenter: 150-300ms - Cross-region: 1000-3000ms
Too short: Frequent unnecessary elections Too long: Slow failure detection
Batch AppendEntries
Don't send 1 RPC per client request. Batch multiple log entries:
def replicate_logs(self):
# Wait for 10ms or 100 entries, whichever comes first
batch = self.collect_batch(max_wait_ms=10, max_entries=100)
self.send_append_entries(batch)
Interview Tips 💡
When discussing Raft in system design interviews:
- Explain the problem: "We need multiple servers to agree on the same log of operations"
- Three key components: Leader election, log replication, safety
- Majority quorum: Explain why 2/3 or 3/5, not 3/3
- Real examples: "etcd uses Raft to maintain Kubernetes state"
- Handle failures: Explain what happens when leader crashes or network partitions
- Compare to Paxos: "Raft is easier to understand and implement"
Raft vs Paxos
| Feature | Raft | Paxos |
|---|---|---|
| Understandability | High (designed for clarity) | Low (notoriously complex) |
| Leader-based | Yes (strong leader) | No (leader optional) |
| Log structure | Committed entries never change | More flexible |
| Production use | etcd, Consul, CockroachDB | Chubby (Google) |
| Learning curve | Moderate | Steep |
Common Pitfalls
⚠️ Clock Skew: Raft doesn't rely on wall-clock time, only logical terms. But heartbeat timers still need reasonable accuracy.
⚠️ Disk I/O: currentTerm and votedFor must be persisted before sending votes. Otherwise, node could vote twice in same term.
⚠️ Snapshot Timing: Large logs must be snapshot. Don't wait until log is 100GB!
Related Concepts
- Leader Election — General leader election patterns
- CAP Theorem — Raft chooses consistency over availability during partition
- Paxos — Alternative consensus algorithm
- Distributed Transactions — Achieving atomicity across services
- Kubernetes Architecture — etcd's role in Kubernetes
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Leader Election
Comprehensive guide to distributed leader election algorithms including Raft, Paxos, and Bully algorithm, covering consensus, split-brain prevention, and real-world implementations in Kubernetes, ZooKeeper, and etcd.
System Design: Dropbox (Google Drive)
Designing a file synchronization service like Dropbox or Google Drive. Key concepts: Block-level Deduplication, Delta Sync, and Strong Consistency.
Geohashing (Location Encoding)
A geocoding system that encodes latitude/longitude coordinates into short alphanumeric strings for efficient proximity searches and spatial indexing.