Back to All Concepts
Distributed SystemsConsensusAlgorithmsAdvanced

Leader Election

Comprehensive guide to distributed leader election algorithms including Raft, Paxos, and Bully algorithm, covering consensus, split-brain prevention, and real-world implementations in Kubernetes, ZooKeeper, and etcd.

What is Leader Election?

In a distributed system, leader election is the process of designating a single node as the coordinator responsible for managing tasks across multiple machines.

Key question: When you have 5 database replicas, which one is the master? If the master dies, how do you pick a new one?

Why Need a Leader?

Scenario: 3 database replicas
Problem: All 3 accept writes → Data diverges!

Solution: Elect 1 leader
- Leader accepts all writes
- Followers replicate from leader
- Consistent view of data
Click to expand code...

Use cases:

  • Database replication (Postgres master)
  • Distributed locks (who runs the cron job?)
  • Task coordination (MapReduce job tracker)
  • Configuration management (etcd leader)

The Split-Brain Problem

Initial state:
┌──────┐  ┌──────┐  ┌──────┐
│Node A│  │Node B│  │Node C│
│LEADER│  │follower│ │follower│
└──────┘  └──────┘  └──────┘

Network partition:
┌──────┐     │     ┌──────┐  ┌──────┐
│Node A│     │     │Node B│  │Node C│
│LEADER│     │     │follower│ │follower│
└──────┘     │     └──────┘  └──────┘
 Partition 1 │  Partition 2

Both sides elect leaders:
┌──────┐     │     ┌──────┐  ┌──────┐
│Node A│     │     │Node B│  │Node C│
│LEADER│     │     │LEADER│ │follower│
└──────┘     │     └──────┘  └──────┘

PROBLEM: Two leaders accepting writes!
Click to expand code...

Solution: Quorum - Need majority (N/2 + 1) votes to become leader.

3 nodes: Need 2 votes (quorum)
5 nodes: Need 3 votes (quorum)

Partition 1: Node A (1 vote) → Can't become leader
Partition 2: Nodes B, C (2 votes) → Can elect leader

Only one partition can have quorum!
Click to expand code...

Consensus Algorithms

1. Bully Algorithm

Simple ID-based election: Highest ID wins.

Process:

  1. Node notices leader is dead
  2. Sends election message to all higher-ID nodes
  3. If no response from higher-ID nodes, declares itself leader
  4. If higher-ID node responds, that node starts election
mermaid
sequenceDiagram
    participant N1 as Node 1 (ID=1)
    participant N2 as Node 2 (ID=2)
    participant N3 as Node 3 (ID=3)
    participant N4 as Node 4 (ID=4 DEAD)
    
    Note over N4: Leader dies
    N2->>N3: Election message (to higher IDs)
    N2->>N4: Election message
    N3-->>N2: I'm alive! (bully)
    Note over N3: Node 3 starts election
    N3->>N4: Election message
    Note over N3: No response from N4
    N3->>N1: I am the leader!
    N3->>N2: I am the leader!
Click to expand code...

Implementation:

python
class BullyElection:
    def __init__(self, node_id, all_nodes):
        self.node_id = node_id
        self.all_nodes = sorted(all_nodes)
        self.leader = None
        self.in_election = False
    
    def start_election(self):
        """Start election process"""
        print(f"Node {self.node_id}: Starting election")
        self.in_election = True
        
        # Find nodes with higher IDs
        higher_nodes = [n for n in self.all_nodes if n > self.node_id]
        
        if not higher_nodes:
            # I have highest ID - become leader
            self.become_leader()
            return
        
        # Send election messages to higher nodes
        responses = []
        for node in higher_nodes:
            response = self.send_election_message(node)
            if response:
                responses.append(response)
        
        if responses:
            # Higher node responded - wait for them to win
            print(f"Node {self.node_id}: Higher node responded, waiting...")
        else:
            # No response from higher nodes - I win!
            self.become_leader()
    
    def become_leader(self):
        """Declare self as leader"""
        self.leader = self.node_id
        self.in_election = False
        print(f"Node {self.node_id}: I AM THE LEADER!")
        
        # Announce to all lower nodes
        for node in self.all_nodes:
            if node < self.node_id:
                self.send_coordinator_message(node)
    
    def send_election_message(self, target_node):
        """Send election message to target node"""
        # Simulate message - returns True if node alive
        return is_node_alive(target_node)
    
    def send_coordinator_message(self, target_node):
        """Announce leadership to node"""
        # Notify node of new leader
        pass
    
    def receive_election_message(self, from_node):
        """Handle incoming election message"""
        if from_node < self.node_id:
            # I have higher ID - respond and start my own election
            print(f"Node {self.node_id}: Received election from {from_node}, bullying...")
            self.start_election()
            return True
        return False

# Example usage
nodes = [1, 2, 3, 4, 5]
node3 = BullyElection(node_id=3, all_nodes=nodes)

# Node 5 (leader) dies
# Node 3 notices and starts election
node3.start_election()
# Output: Node 3 becomes leader (highest alive ID)
Click to expand code...

Pros:

  • Simple to implement
  • Always elects highest-ID node

Cons:

  • Many messages (O(n²) in worst case)
  • Not fault-tolerant during election
  • Can have multiple elections simultaneously

2. Raft Consensus

Modern, understandable consensus algorithm. Used by etcd (Kubernetes), Consul, and CockroachDB.

Key concepts:

  • Leader: Handles all client requests
  • Followers: Replicate leader's log
  • Candidates: Nodes trying to become leader

States:

┌──────────┐  timeout   ┌───────────┐  wins      ┌────────┐
│ Follower │───────────▶│ Candidate │───────────▶│ Leader │
└──────────┘            └───────────┘            └────────┘
     ▲                        │                       │
     │                        │ loses                 │
     │                        ▼                       │
     └────────────────────────────────────────────────┘
Click to expand code...

Election process:

mermaid
sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant C as Candidate
    participant F3 as Follower 3
    
    Note over C: Election timeout!
    C->>C: Become candidate
    C->>C: Increment term
    C->>C: Vote for self
    C->>F1: Request vote (term 2)
    C->>F2: Request vote (term 2)
    C->>F3: Request vote (term 2)
    
    F1-->>C: Vote granted
    F2-->>C: Vote granted
    F3-->>C: Vote granted
    
    Note over C: Received majority (4/5)
    C->>C: Become leader
    C->>F1: Heartbeat (I'm leader)
    C->>F2: Heartbeat (I'm leader)
    C->>F3: Heartbeat (I'm leader)
Click to expand code...

Implementation (simplified):

javascript
class RaftNode {
  constructor(id, peers) {
    this.id = id;
    this.peers = peers;
    this.state = 'follower';
    this.currentTerm = 0;
    this.votedFor = null;
    this.leader = null;
    
    // Election timeout (random 150-300ms)
    this.electionTimeout = 150 + Math.random() * 150;
    this.lastHeartbeat = Date.now();
    
    this.startElectionTimer();
  }
  
  startElectionTimer() {
    setInterval(() => {
      if (this.state === 'leader') return;
      
      const timeSinceHeartbeat = Date.now() - this.lastHeartbeat;
      
      if (timeSinceHeartbeat > this.electionTimeout) {
        this.startElection();
      }
    }, 50);
  }
  
  startElection() {
    console.log(`Node ${this.id}: Starting election for term ${this.currentTerm + 1}`);
    
    // Become candidate
    this.state = 'candidate';
    this.currentTerm++;
    this.votedFor = this.id;  // Vote for self
    this.lastHeartbeat = Date.now();
    
    let votesReceived = 1;  // Self vote
    const votesNeeded = Math.floor(this.peers.length / 2) + 1;
    
    // Request votes from peers
    for (const peer of this.peers) {
      if (peer === this.id) continue;
      
      this.requestVote(peer, (granted) => {
        if (granted) {
          votesReceived++;
          
          if (votesReceived >= votesNeeded && this.state === 'candidate') {
            this.becomeLeader();
          }
        }
      });
    }
  }
  
  requestVote(peerId, callback) {
    // Send RequestVote RPC
    const request = {
      term: this.currentTerm,
      candidateId: this.id
    };
    
    // Simulate network call
    sendRPC(peerId, 'requestVote', request, (response) => {
      if (response.term > this.currentTerm) {
        // Discovered higher term - revert to follower
        this.currentTerm = response.term;
        this.state = 'follower';
        this.votedFor = null;
      }
      callback(response.voteGranted);
    });
  }
  
  handleRequestVote(request) {
    // Received vote request from candidate
    if (request.term < this.currentTerm) {
      return { voteGranted: false, term: this.currentTerm };
    }
    
    if (request.term > this.currentTerm) {
      this.currentTerm = request.term;
      this.state = 'follower';
      this.votedFor = null;
    }
    
    // Grant vote if haven't voted yet
    if (this.votedFor === null || this.votedFor === request.candidateId) {
      this.votedFor = request.candidateId;
      this.lastHeartbeat = Date.now();
      return { voteGranted: true, term: this.currentTerm };
    }
    
    return { voteGranted: false, term: this.currentTerm };
  }
  
  becomeLeader() {
    console.log(`Node ${this.id}: I AM THE LEADER for term ${this.currentTerm}!`);
    this.state = 'leader';
    this.leader = this.id;
    
    // Send heartbeats to maintain leadership
    this.startHeartbeats();
  }
  
  startHeartbeats() {
    this.heartbeatInterval = setInterval(() => {
      if (this.state !== 'leader') {
        clearInterval(this.heartbeatInterval);
        return;
      }
      
      // Send AppendEntries (heartbeat) to all peers
      for (const peer of this.peers) {
        if (peer === this.id) continue;
        this.sendHeartbeat(peer);
      }
    }, 50);  // Send every 50ms
  }
  
  sendHeartbeat(peerId) {
    const request = {
      term: this.currentTerm,
      leaderId: this.id
    };
    
    sendRPC(peerId, 'appendEntries', request, (response) => {
      if (response.term > this.currentTerm) {
        // Higher term discovered - step down
        this.currentTerm = response.term;
        this.state = 'follower';
        this.leader = null;
        clearInterval(this.heartbeatInterval);
      }
    });
  }
  
  handleAppendEntries(request) {
    // Received heartbeat from leader
    if (request.term >= this.currentTerm) {
      this.currentTerm = request.term;
      this.state = 'follower';
      this.leader = request.leaderId;
      this.lastHeartbeat = Date.now();
      return { success: true, term: this.currentTerm };
    }
    
    return { success: false, term: this.currentTerm };
  }
}

// Example
const node1 = new RaftNode(1, [1, 2, 3, 4, 5]);
const node2 = new RaftNode(2, [1, 2, 3, 4, 5]);
const node3 = new RaftNode(3, [1, 2, 3, 4, 5]);
Click to expand code...

Raft guarantees:

  • Safety: At most one leader per term
  • Liveness: Eventually elects a leader
  • Log replication: Leader's log is source of truth

Pros:

  • Understandable (simpler than Paxos)
  • Production-ready implementations
  • Strong consistency guarantees

Cons:

  • Complex to implement correctly
  • Network partition sensitive
  • Leader bottleneck for writes

3. Paxos

Classic consensus algorithm (Leslie Lamport, 1989). Very difficult to understand and implement.

Roles:

  • Proposers: Propose values
  • Acceptors: Vote on proposals
  • Learners: Learn chosen value

Phases:

Phase 1: Prepare
- Proposer sends prepare(n) to acceptors
- Acceptors promise not to accept proposals < n

Phase 2: Accept
- Proposer sends accept(n, value)
- Acceptors accept if >= highest promise
Click to expand code...

Why it's hard: Subtle edge cases, liveness not guaranteed, difficult to implement.

Modern variants:

  • Multi-Paxos: Optimized for multiple decisions
  • Fast Paxos: Faster under low contention
  • Raft: Easier-to-understand alternative

Quorum Systems

Definition: Subset of nodes that can make progress.

3 nodes: Quorum = 2
5 nodes: Quorum = 3
7 nodes: Quorum = 4

Formula: Quorum = ⌊N/2⌋ + 1
Click to expand code...

Why quorum works:

5 nodes split into 2 partitions:
Partition A: 3 nodes  (has quorum) ✓
Partition B: 2 nodes  (no quorum)  ✗

Only Partition A can elect leader!
Click to expand code...

Trade-offs:

More nodes:
+ Higher availability (tolerate more failures)
- Slower consensus (need more votes)
- More expensive

Fewer nodes:
+ Faster consensus
+ Cheaper
- Lower availability
Click to expand code...

Real-World Systems

Kubernetes (etcd)

etcd uses Raft for leader election

3 or 5 etcd nodes (always odd number)
Leader handles all writes
Followers replicate

If leader dies:
- Followers timeout (150-300ms)
- New election starts
- New leader elected in ~1 second
Click to expand code...

Apache ZooKeeper

ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast)
Similar to Paxos/Raft

Used by:
- Kafka (broker coordination)
- HBase (master election)
- Hadoop (NameNode HA)
Click to expand code...

Consul

Uses Raft for consensus

Deploy 3 or 5 server nodes
Leader handles writes
Followers forward to leader
Used for service discovery + KV store
Click to expand code...

Implementation Patterns

Using etcd for Leader Election

go
// Go client for etcd leader election
package main

import (
    "context"
    "fmt"
    "go.etcd.io/etcd/client/v3/concurrency"
    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    client, _ := clientv3.New(clientv3.Config{
        Endpoints: []string{"localhost:2379"},
    })
    defer client.Close()
    
    session, _ := concurrency.NewSession(client)
    defer session.Close()
    
    election := concurrency.NewElection(session, "/my-service/leader")
    
    // Campaign to become leader (blocks until elected)
    ctx := context.Background()
    if err := election.Campaign(ctx, "node-1"); err != nil {
        fmt.Println("Failed to campaign:", err)
        return
    }
    
    fmt.Println("I AM THE LEADER!")
    
    // Do leader work...
    doLeaderWork()
    
    // Resign leadership
    election.Resign(ctx)
}

func doLeaderWork() {
    // Only one node executes this at a time
    fmt.Println("Doing critical work as leader...")
}
Click to expand code...

Using Redis for Leader Election

python
import redis
import time

class RedisLeaderElection:
    def __init__(self, redis_client, key, node_id, ttl=10):
        self.redis = redis_client
        self.key = key
        self.node_id = node_id
        self.ttl = ttl
        self.is_leader = False
    
    def try_become_leader(self):
        """Try to acquire leadership"""
        # SET key value NX EX ttl
        # NX: only set if key doesn't exist
        # EX: expire after ttl seconds
        result = self.redis.set(
            self.key,
            self.node_id,
            nx=True,
            ex=self.ttl
        )
        
        self.is_leader = result is not None
        return self.is_leader
    
    def maintain_leadership(self):
        """Renew leadership lock"""
        if not self.is_leader:
            return False
        
        # Renew lock before expiry
        result = self.redis.set(
            self.key,
            self.node_id,
            xx=True,  # Only set if key exists
            ex=self.ttl
        )
        
        self.is_leader = result is not None
        return self.is_leader
    
    def resign(self):
        """Voluntarily give up leadership"""
        if self.is_leader:
            self.redis.delete(self.key)
            self.is_leader = False

# Usage
redis_client = redis.Redis(host='localhost', port=6379)
election = RedisLeaderElection(redis_client, 'my-service-leader', 'node-1')

# Try to become leader
if election.try_become_leader():
    print("I AM THE LEADER!")
    
    while True:
        # Do leader work
        do_leader_work()
        
        # Renew leadership every 5 seconds
        time.sleep(5)
        if not election.maintain_leadership():
            print("Lost leadership!")
            break
else:
    print("Another node is the leader")
Click to expand code...

Interview Tips 💡

When discussing leader election in system design interviews:

  1. Explain the need: "Multiple database replicas need one master for consistent writes..."
  2. Quorum math: "With 5 nodes, need 3 votes to prevent split-brain..."
  3. Choose algorithm: "Use Raft (via etcd) for simplicity and proven implementation..."
  4. Handle failures: "If leader dies, followers timeout after 150-300ms and start election..."
  5. Discuss trade-offs: "3 nodes cheaper but 5 nodes more fault-tolerant..."
  6. Real examples: "Kubernetes uses etcd with Raft, Kafka uses ZooKeeper..."

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles