Back to All Concepts
Distributed SystemsAlgorithmsConsensusSystem DesignPro

Raft Consensus Algorithm

A comprehensive guide to Raft, the consensus algorithm powering Etcd, Consul, and Kubernetes. Leader election, log replication, safety guarantees, and production deployment patterns.

Raft: Understandable Distributed Consensus

Raft is a consensus algorithm designed as an alternative to Paxos. It was specifically designed to be understandable. It is the foundation of modern distributed systems like Etcd (Kubernetes), Consul, and CockroachDB.

1. The Problem: State Machine Replication

Imagine you have a single server storing x=5. If it crashes, data is lost. To survive crashes, we need multiple servers (Replicas). If we have 3 servers (S1, S2, S3), they must all execute the same commands in the same order.

Consensus is the process of agreeing on one result among a group of participants.

2. Node States

At any given time, each server is in one of three states:

  1. Follower: Passive. Issues no requests but responds to requests from leaders and candidates.
  2. Candidate: Active. Used to elect a new leader.
  3. Leader: Active. Handles all client requests and replicates entries to followers.

3. Leader Election

Raft uses a heartbeat mechanism to trigger elections.

  1. Heartbeats: The Leader sends empty AppendEntries RPCs (heartbeats) to all Followers periodically (e.g., every 50ms).
  2. Election Timeout: Each Follower has a randomized timeout (e.g., 150ms - 300ms). If it doesn't hear from a Leader within this time, it assumes the Leader is dead.

The Election Process

  1. Follower increments its currentTerm.
  2. Transitions to Candidate state.
  3. Votes for itself.
  4. Sends RequestVote RPCs to all other servers.

Winning

A Candidate wins if it receives votes from a majority of the cluster (N/2 + 1).

Why Randomization Works

Randomized timeouts (150-300ms) prevent split votes:

Bad Scenario (Fixed Timeout):

  • All 5 nodes timeout at exactly 150ms
  • All 5 become candidates simultaneously
  • Each votes for itself (1 vote each)
  • Nobody wins → Retry forever

Good Scenario (Random Timeout):

  • Node A times out at 153ms (becomes candidate first)
  • Nodes B, C, D, E still followers
  • A sends RequestVote to everyone
  • B, C, D vote for A (A wins with 4/5 votes)
  • E's timeout is interrupted by A becoming leader

Election Algorithm Details

python
class RaftNode:
    def __init__(self):
        self.state = "FOLLOWER"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = random.randint(150, 300)  # milliseconds
    
    def start_election(self):
        # 1. Increment term
        self.current_term += 1
        
        # 2. Transition to candidate
        self.state = "CANDIDATE"
        
        # 3. Vote for self
        self.voted_for = self.id
        votes_received = 1
        
        # 4. Send RequestVote to all peers
        for peer in self.peers:
            response = peer.request_vote({
                "term": self.current_term,
                "candidate_id": self.id,
                "last_log_index": len(self.log),
                "last_log_term": self.log[-1].term if self.log else 0
            })
            
            if response.vote_granted:
                votes_received += 1
        
        # 5. Check if won
        if votes_received > len(self.peers) / 2:
            self.become_leader()
Click to expand code...

4. Log Replication

Once a Leader is elected, it services client requests. Client sends command SET x=5.

  1. Append: Leader appends command to its local log (not yet executed).
  2. Replicate: Leader sends AppendEntries RPC to all Followers.
  3. Wait for Majority: Leader waits for majority to acknowledge receipt.
  4. Commit: When majority confirms, Leader marks entry as committed.
  5. Apply: Leader applies the command to its state machine.
  6. Notify: On next heartbeat, Leader tells Followers which entries are committed.
  7. Followers Apply: Followers apply committed entries to their state machines.

Log Entry Structure

python
class LogEntry:
    def __init__(self, term, index, command):
        self.term = term      # Term when entry was received by leader
        self.index = index    # Position in log
        self.command = command  # State machine command

# Example log
log = [
    LogEntry(term=1, index=1, command="SET x=1"),
    LogEntry(term=1, index=2, command="SET y=2"),
    LogEntry(term=2, index=3, command="SET x=5"),  # New leader in term 2
]
Click to expand code...

Commit Index vs Applied Index

  • commitIndex: Highest log entry known to be committed (replicated on majority)
  • lastApplied: Highest log entry applied to state machine

Invariant: lastApplied ≤ commitIndex ≤ lastLogIndex

python
def apply_committed_entries(self):
    # Apply all entries between lastApplied and commitIndex
    while self.last_applied < self.commit_index:
        self.last_applied += 1
        entry = self.log[self.last_applied]
        self.state_machine.apply(entry.command)
Click to expand code...

Interactive: Replication Flow Diagram

Raft Log Replication

Waiting for Client Request...
LEADER
FOLLOWER 1
FOLLOWER 2
mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
   participant F2 as Follower 2
    
    C->>L: SET x=5
    L->>L: Append to log[3]
    L->>F1: AppendEntries RPC
    L->>F2: AppendEntries RPC
    F1-->>L: ACK
    F2-->>L: ACK
    Note over L: 3/3 nodes have entry<br/>Commit it!
    L->>L: commitIndex = 3
    L->>L: Apply command
    L-->>C: Success
    L->>F1: commitIndex = 3 (next heartbeat)
    L->>F2: commitIndex = 3
    F1->>F1: Apply command
    F2->>F2: Apply command
Click to expand code...

5. Safety Guarantees

Election Restriction: Log Completeness

A candidate can only win election if its log is "at least as up-to-date" as a majority.

Rule: When voting, a node rejects a candidate if:

python
def should_grant_vote(self, candidate):
    # Reject if candidate's log is behind
    last_log_term = self.log[-1].term if self.log else 0
    last_log_index = len(self.log)
    
    if candidate.last_log_term < last_log_term:
        return False
    if candidate.last_log_term == last_log_term and candidate.last_log_index < last_log_index:
        return False
    
    return True
Click to expand code...

Why? Ensures new leader has all committed entries. Otherwise, committed data could be lost.

Leader Completeness Property

If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.

This is Raft's key safety property.

Stale Leaders (Network Partitions)

Imagine a 5-node cluster (A, B, C, D, E) with Leader A. Network cuts A, B from C, D, E.

  • Partition 1 (A, B): A is still leader, but cannot replicate to majority (needs 3). Writes will hang/fail (cannot commit).
  • Partition 2 (C, D, E): C times out, wins election with term 2 (gets 3 votes: C, D, E). C accepts new writes.

When network heals:

  • A sends AppendEntries with term 1
  • C rejects ("Your term is stale")
  • A sees term 2 > term 1, steps down to follower
  • A's uncommitted entries are discarded
  • A syncs with C's log
mermaid
graph TB
    subgraph Partition 1
        A[Leader A<br/>Term 1]
        B[Follower B]
    end
    
    subgraph Partition 2
        C[New Leader C<br/>Term 2]
        D[Follower D]
        E[Follower E]
    end
    
    A -.X Network Cut X..- C
Click to expand code...

6. Configuration Changes

How do you add/remove nodes without downtime?

The Problem: Disjoint Majorities

If you instantly switch from 3 nodes to 5 nodes:

  • Old majority: 2/3
  • New majority: 3/5
  • Danger: Could elect 2 leaders simultaneously

Solution: Joint Consensus

Raft uses a two-phase approach:

  1. Phase 1: Replicate configuration C_old,new (requires majority in BOTH configs)
  2. Phase 2: Replicate configuration C_new (requires majority in new config only)
Old config: {A, B, C}
New config: {A, B, C, D, E}

Phase 1: C_old,new requires:
  - 2/3 from {A, B, C}
  - 3/5 from {A, B, C, D, E}
  
Phase 2: C_new requires:
  - 3/5 from {A, B, C, D, E}
Click to expand code...

Real-World Usage

etcd (Kubernetes)

  • Stores Kubernetes cluster state
  • 3 or 5 node cluster (odd numbers for majority)
  • Typical deployment: 5 etcd nodes across 3 availability zones
  • Heartbeat interval: 100ms
  • Election timeout: 1000ms

Consul (Service Discovery)

  • Service mesh configuration
  • Health checks and service registry
  • Uses Raft for consistency

CockroachDB (Database)

  • Uses Raft for replicating table ranges
  • Each table range has its own Raft group
  • Scales to thousands of Raft groups per node

Performance Tuning

Election Timeout Tuning

Network latency: 10ms
Election timeout should be: 10-20x network latency

Recommended:
- Same datacenter: 150-300ms
- Cross-region: 1000-3000ms
Click to expand code...

Too short: Frequent unnecessary elections Too long: Slow failure detection

Batch AppendEntries

Don't send 1 RPC per client request. Batch multiple log entries:

python
def replicate_logs(self):
    # Wait for 10ms or 100 entries, whichever comes first
    batch = self.collect_batch(max_wait_ms=10, max_entries=100)
    self.send_append_entries(batch)
Click to expand code...

Interview Tips 💡

When discussing Raft in system design interviews:

  1. Explain the problem: "We need multiple servers to agree on the same log of operations"
  2. Three key components: Leader election, log replication, safety
  3. Majority quorum: Explain why 2/3 or 3/5, not 3/3
  4. Real examples: "etcd uses Raft to maintain Kubernetes state"
  5. Handle failures: Explain what happens when leader crashes or network partitions
  6. Compare to Paxos: "Raft is easier to understand and implement"

Raft vs Paxos

FeatureRaftPaxos
UnderstandabilityHigh (designed for clarity)Low (notoriously complex)
Leader-basedYes (strong leader)No (leader optional)
Log structureCommitted entries never changeMore flexible
Production useetcd, Consul, CockroachDBChubby (Google)
Learning curveModerateSteep

Common Pitfalls

⚠️ Clock Skew: Raft doesn't rely on wall-clock time, only logical terms. But heartbeat timers still need reasonable accuracy.

⚠️ Disk I/O: currentTerm and votedFor must be persisted before sending votes. Otherwise, node could vote twice in same term.

⚠️ Snapshot Timing: Large logs must be snapshot. Don't wait until log is 100GB!

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles