Raft: Understandable Distributed Consensus

Raft is a consensus algorithm designed as an alternative to Paxos. It was specifically designed to be understandable. It is the foundation of modern distributed systems like Etcd (Kubernetes), Consul, and CockroachDB.

1. The Problem: State Machine Replication

Imagine you have a single server storing x=5. If it crashes, data is lost. To survive crashes, we need multiple servers (Replicas). If we have 3 servers (S1, S2, S3), they must all execute the same commands in the same order.

Consensus is the process of agreeing on one result among a group of participants.

2. Node States

At any given time, each server is in one of three states:

Follower: Passive. Issues no requests but responds to requests from leaders and candidates.
Candidate: Active. Used to elect a new leader.
Leader: Active. Handles all client requests and replicates entries to followers.

3. Leader Election

Raft uses a heartbeat mechanism to trigger elections.

Heartbeats: The Leader sends empty AppendEntries RPCs (heartbeats) to all Followers periodically (e.g., every 50ms).
Election Timeout: Each Follower has a randomized timeout (e.g., 150ms - 300ms). If it doesn't hear from a Leader within this time, it assumes the Leader is dead.

The Election Process

Follower increments its currentTerm.
Transitions to Candidate state.
Votes for itself.
Sends RequestVote RPCs to all other servers.

Winning

A Candidate wins if it receives votes from a majority of the cluster (N/2 + 1).

Why Randomization Works

Randomized timeouts (150-300ms) prevent split votes:

Bad Scenario (Fixed Timeout):

All 5 nodes timeout at exactly 150ms
All 5 become candidates simultaneously
Each votes for itself (1 vote each)
Nobody wins → Retry forever

Good Scenario (Random Timeout):

Node A times out at 153ms (becomes candidate first)
Nodes B, C, D, E still followers
A sends RequestVote to everyone
B, C, D vote for A (A wins with 4/5 votes)
E's timeout is interrupted by A becoming leader

Election Algorithm Details

python

class RaftNode:
    def __init__(self):
        self.state = "FOLLOWER"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = random.randint(150, 300)  # milliseconds
    
    def start_election(self):
        # 1. Increment term
        self.current_term += 1
        
        # 2. Transition to candidate
        self.state = "CANDIDATE"
        
        # 3. Vote for self
        self.voted_for = self.id
        votes_received = 1
        
        # 4. Send RequestVote to all peers
        for peer in self.peers:
            response = peer.request_vote({
                "term": self.current_term,
                "candidate_id": self.id,
                "last_log_index": len(self.log),
                "last_log_term": self.log[-1].term if self.log else 0
            })
            
            if response.vote_granted:
                votes_received += 1
        
        # 5. Check if won
        if votes_received > len(self.peers) / 2:
            self.become_leader()

Click to expand code...

4. Log Replication

Once a Leader is elected, it services client requests. Client sends command SET x=5.

Append: Leader appends command to its local log (not yet executed).
Replicate: Leader sends AppendEntries RPC to all Followers.
Wait for Majority: Leader waits for majority to acknowledge receipt.
Commit: When majority confirms, Leader marks entry as committed.
Apply: Leader applies the command to its state machine.
Notify: On next heartbeat, Leader tells Followers which entries are committed.
Followers Apply: Followers apply committed entries to their state machines.

Log Entry Structure

python

class LogEntry:
    def __init__(self, term, index, command):
        self.term = term      # Term when entry was received by leader
        self.index = index    # Position in log
        self.command = command  # State machine command

# Example log
log = [
    LogEntry(term=1, index=1, command="SET x=1"),
    LogEntry(term=1, index=2, command="SET y=2"),
    LogEntry(term=2, index=3, command="SET x=5"),  # New leader in term 2
]

Click to expand code...

Commit Index vs Applied Index

commitIndex: Highest log entry known to be committed (replicated on majority)
lastApplied: Highest log entry applied to state machine

Invariant: lastApplied ≤ commitIndex ≤ lastLogIndex

python

def apply_committed_entries(self):
    # Apply all entries between lastApplied and commitIndex
    while self.last_applied < self.commit_index:
        self.last_applied += 1
        entry = self.log[self.last_applied]
        self.state_machine.apply(entry.command)

Click to expand code...

Interactive: Replication Flow Diagram

Raft Log Replication

Waiting for Client Request...

LEADER

FOLLOWER 1

FOLLOWER 2

mermaid

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
   participant F2 as Follower 2
    
    C->>L: SET x=5
    L->>L: Append to log[3]
    L->>F1: AppendEntries RPC
    L->>F2: AppendEntries RPC
    F1-->>L: ACK
    F2-->>L: ACK
    Note over L: 3/3 nodes have entry<br/>Commit it!
    L->>L: commitIndex = 3
    L->>L: Apply command
    L-->>C: Success
    L->>F1: commitIndex = 3 (next heartbeat)
    L->>F2: commitIndex = 3
    F1->>F1: Apply command
    F2->>F2: Apply command

Click to expand code...

5. Safety Guarantees

Election Restriction: Log Completeness

A candidate can only win election if its log is "at least as up-to-date" as a majority.

Rule: When voting, a node rejects a candidate if:

python

def should_grant_vote(self, candidate):
    # Reject if candidate's log is behind
    last_log_term = self.log[-1].term if self.log else 0
    last_log_index = len(self.log)
    
    if candidate.last_log_term < last_log_term:
        return False
    if candidate.last_log_term == last_log_term and candidate.last_log_index < last_log_index:
        return False
    
    return True

Click to expand code...

Why? Ensures new leader has all committed entries. Otherwise, committed data could be lost.

Leader Completeness Property

If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.

This is Raft's key safety property.

Stale Leaders (Network Partitions)

Imagine a 5-node cluster (A, B, C, D, E) with Leader A. Network cuts A, B from C, D, E.

Partition 1 (A, B): A is still leader, but cannot replicate to majority (needs 3). Writes will hang/fail (cannot commit).
Partition 2 (C, D, E): C times out, wins election with term 2 (gets 3 votes: C, D, E). C accepts new writes.

When network heals:

A sends AppendEntries with term 1
C rejects ("Your term is stale")
A sees term 2 > term 1, steps down to follower
A's uncommitted entries are discarded
A syncs with C's log

mermaid

graph TB
    subgraph Partition 1
        A[Leader A<br/>Term 1]
        B[Follower B]
    end
    
    subgraph Partition 2
        C[New Leader C<br/>Term 2]
        D[Follower D]
        E[Follower E]
    end
    
    A -.X Network Cut X..- C

Click to expand code...

6. Configuration Changes

How do you add/remove nodes without downtime?

The Problem: Disjoint Majorities

If you instantly switch from 3 nodes to 5 nodes:

Old majority: 2/3
New majority: 3/5
Danger: Could elect 2 leaders simultaneously

Solution: Joint Consensus

Raft uses a two-phase approach:

Phase 1: Replicate configuration C_old,new (requires majority in BOTH configs)
Phase 2: Replicate configuration C_new (requires majority in new config only)

Old config: {A, B, C}
New config: {A, B, C, D, E}

Phase 1: C_old,new requires:
  - 2/3 from {A, B, C}
  - 3/5 from {A, B, C, D, E}
  
Phase 2: C_new requires:
  - 3/5 from {A, B, C, D, E}

Click to expand code...

Real-World Usage

etcd (Kubernetes)

Stores Kubernetes cluster state
3 or 5 node cluster (odd numbers for majority)
Typical deployment: 5 etcd nodes across 3 availability zones
Heartbeat interval: 100ms
Election timeout: 1000ms

Consul (Service Discovery)

Service mesh configuration
Health checks and service registry
Uses Raft for consistency

CockroachDB (Database)

Uses Raft for replicating table ranges
Each table range has its own Raft group
Scales to thousands of Raft groups per node

Performance Tuning

Election Timeout Tuning

Network latency: 10ms
Election timeout should be: 10-20x network latency

Recommended:
- Same datacenter: 150-300ms
- Cross-region: 1000-3000ms

Click to expand code...

Too short: Frequent unnecessary elections Too long: Slow failure detection

Batch AppendEntries

Don't send 1 RPC per client request. Batch multiple log entries:

python

def replicate_logs(self):
    # Wait for 10ms or 100 entries, whichever comes first
    batch = self.collect_batch(max_wait_ms=10, max_entries=100)
    self.send_append_entries(batch)

Click to expand code...

Interview Tips 💡

When discussing Raft in system design interviews:

Explain the problem: "We need multiple servers to agree on the same log of operations"
Three key components: Leader election, log replication, safety
Majority quorum: Explain why 2/3 or 3/5, not 3/3
Real examples: "etcd uses Raft to maintain Kubernetes state"
Handle failures: Explain what happens when leader crashes or network partitions
Compare to Paxos: "Raft is easier to understand and implement"

Raft vs Paxos

Feature	Raft	Paxos
Understandability	High (designed for clarity)	Low (notoriously complex)
Leader-based	Yes (strong leader)	No (leader optional)
Log structure	Committed entries never change	More flexible
Production use	etcd, Consul, CockroachDB	Chubby (Google)
Learning curve	Moderate	Steep

Common Pitfalls

⚠️ Clock Skew: Raft doesn't rely on wall-clock time, only logical terms. But heartbeat timers still need reasonable accuracy.

⚠️ Disk I/O: currentTerm and votedFor must be persisted before sending votes. Otherwise, node could vote twice in same term.

⚠️ Snapshot Timing: Large logs must be snapshot. Don't wait until log is 100GB!

Related Concepts

Leader Election — General leader election patterns
CAP Theorem — Raft chooses consistency over availability during partition
Paxos — Alternative consensus algorithm
Distributed Transactions — Achieving atomicity across services
Kubernetes Architecture — etcd's role in Kubernetes

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Advanced

Leader Election

Comprehensive guide to distributed leader election algorithms including Raft, Paxos, and Bully algorithm, covering consensus, split-brain prevention, and real-world implementations in Kubernetes, ZooKeeper, and etcd.

Distributed SystemsConsensusAlgorithms

Advanced

System Design: Dropbox (Google Drive)

Designing a file synchronization service like Dropbox or Google Drive. Key concepts: Block-level Deduplication, Delta Sync, and Strong Consistency.

System DesignStorageDistributed Systems

Intermediate

Geohashing (Location Encoding)

A geocoding system that encodes latitude/longitude coordinates into short alphanumeric strings for efficient proximity searches and spatial indexing.

System DesignGeospatialAlgorithms

Raft Consensus Algorithm

Raft: Understandable Distributed Consensus

1. The Problem: State Machine Replication

2. Node States

3. Leader Election

The Election Process

Winning

Why Randomization Works

Election Algorithm Details

4. Log Replication

Log Entry Structure

Commit Index vs Applied Index

Interactive: Replication Flow Diagram

Raft Log Replication

5. Safety Guarantees

Election Restriction: Log Completeness

Leader Completeness Property

Stale Leaders (Network Partitions)

6. Configuration Changes

The Problem: Disjoint Majorities

Solution: Joint Consensus

Real-World Usage

etcd (Kubernetes)

Consul (Service Discovery)

CockroachDB (Database)

Performance Tuning

Election Timeout Tuning

Batch AppendEntries

Interview Tips 💡

Raft vs Paxos

Common Pitfalls

Related Concepts

About ScaleWiki

Related Articles

Leader Election

System Design: Dropbox (Google Drive)

Geohashing (Location Encoding)