What is Leader Election?
In a distributed system, leader election is the process of designating a single node as the coordinator responsible for managing tasks across multiple machines.
Key question: When you have 5 database replicas, which one is the master? If the master dies, how do you pick a new one?
Why Need a Leader?
Scenario: 3 database replicas Problem: All 3 accept writes → Data diverges! Solution: Elect 1 leader - Leader accepts all writes - Followers replicate from leader - Consistent view of data
Use cases:
- Database replication (Postgres master)
- Distributed locks (who runs the cron job?)
- Task coordination (MapReduce job tracker)
- Configuration management (etcd leader)
The Split-Brain Problem
Initial state: ┌──────┐ ┌──────┐ ┌──────┐ │Node A│ │Node B│ │Node C│ │LEADER│ │follower│ │follower│ └──────┘ └──────┘ └──────┘ Network partition: ┌──────┐ │ ┌──────┐ ┌──────┐ │Node A│ │ │Node B│ │Node C│ │LEADER│ │ │follower│ │follower│ └──────┘ │ └──────┘ └──────┘ Partition 1 │ Partition 2 Both sides elect leaders: ┌──────┐ │ ┌──────┐ ┌──────┐ │Node A│ │ │Node B│ │Node C│ │LEADER│ │ │LEADER│ │follower│ └──────┘ │ └──────┘ └──────┘ PROBLEM: Two leaders accepting writes!
Solution: Quorum - Need majority (N/2 + 1) votes to become leader.
3 nodes: Need 2 votes (quorum) 5 nodes: Need 3 votes (quorum) Partition 1: Node A (1 vote) → Can't become leader Partition 2: Nodes B, C (2 votes) → Can elect leader Only one partition can have quorum!
Consensus Algorithms
1. Bully Algorithm
Simple ID-based election: Highest ID wins.
Process:
- Node notices leader is dead
- Sends election message to all higher-ID nodes
- If no response from higher-ID nodes, declares itself leader
- If higher-ID node responds, that node starts election
sequenceDiagram
participant N1 as Node 1 (ID=1)
participant N2 as Node 2 (ID=2)
participant N3 as Node 3 (ID=3)
participant N4 as Node 4 (ID=4 DEAD)
Note over N4: Leader dies
N2->>N3: Election message (to higher IDs)
N2->>N4: Election message
N3-->>N2: I'm alive! (bully)
Note over N3: Node 3 starts election
N3->>N4: Election message
Note over N3: No response from N4
N3->>N1: I am the leader!
N3->>N2: I am the leader!
Implementation:
class BullyElection:
def __init__(self, node_id, all_nodes):
self.node_id = node_id
self.all_nodes = sorted(all_nodes)
self.leader = None
self.in_election = False
def start_election(self):
"""Start election process"""
print(f"Node {self.node_id}: Starting election")
self.in_election = True
# Find nodes with higher IDs
higher_nodes = [n for n in self.all_nodes if n > self.node_id]
if not higher_nodes:
# I have highest ID - become leader
self.become_leader()
return
# Send election messages to higher nodes
responses = []
for node in higher_nodes:
response = self.send_election_message(node)
if response:
responses.append(response)
if responses:
# Higher node responded - wait for them to win
print(f"Node {self.node_id}: Higher node responded, waiting...")
else:
# No response from higher nodes - I win!
self.become_leader()
def become_leader(self):
"""Declare self as leader"""
self.leader = self.node_id
self.in_election = False
print(f"Node {self.node_id}: I AM THE LEADER!")
# Announce to all lower nodes
for node in self.all_nodes:
if node < self.node_id:
self.send_coordinator_message(node)
def send_election_message(self, target_node):
"""Send election message to target node"""
# Simulate message - returns True if node alive
return is_node_alive(target_node)
def send_coordinator_message(self, target_node):
"""Announce leadership to node"""
# Notify node of new leader
pass
def receive_election_message(self, from_node):
"""Handle incoming election message"""
if from_node < self.node_id:
# I have higher ID - respond and start my own election
print(f"Node {self.node_id}: Received election from {from_node}, bullying...")
self.start_election()
return True
return False
# Example usage
nodes = [1, 2, 3, 4, 5]
node3 = BullyElection(node_id=3, all_nodes=nodes)
# Node 5 (leader) dies
# Node 3 notices and starts election
node3.start_election()
# Output: Node 3 becomes leader (highest alive ID)
Pros:
- Simple to implement
- Always elects highest-ID node
Cons:
- Many messages (O(n²) in worst case)
- Not fault-tolerant during election
- Can have multiple elections simultaneously
2. Raft Consensus
Modern, understandable consensus algorithm. Used by etcd (Kubernetes), Consul, and CockroachDB.
Key concepts:
- Leader: Handles all client requests
- Followers: Replicate leader's log
- Candidates: Nodes trying to become leader
States:
┌──────────┐ timeout ┌───────────┐ wins ┌────────┐
│ Follower │───────────▶│ Candidate │───────────▶│ Leader │
└──────────┘ └───────────┘ └────────┘
▲ │ │
│ │ loses │
│ ▼ │
└────────────────────────────────────────────────┘
Election process:
sequenceDiagram
participant F1 as Follower 1
participant F2 as Follower 2
participant C as Candidate
participant F3 as Follower 3
Note over C: Election timeout!
C->>C: Become candidate
C->>C: Increment term
C->>C: Vote for self
C->>F1: Request vote (term 2)
C->>F2: Request vote (term 2)
C->>F3: Request vote (term 2)
F1-->>C: Vote granted
F2-->>C: Vote granted
F3-->>C: Vote granted
Note over C: Received majority (4/5)
C->>C: Become leader
C->>F1: Heartbeat (I'm leader)
C->>F2: Heartbeat (I'm leader)
C->>F3: Heartbeat (I'm leader)
Implementation (simplified):
class RaftNode {
constructor(id, peers) {
this.id = id;
this.peers = peers;
this.state = 'follower';
this.currentTerm = 0;
this.votedFor = null;
this.leader = null;
// Election timeout (random 150-300ms)
this.electionTimeout = 150 + Math.random() * 150;
this.lastHeartbeat = Date.now();
this.startElectionTimer();
}
startElectionTimer() {
setInterval(() => {
if (this.state === 'leader') return;
const timeSinceHeartbeat = Date.now() - this.lastHeartbeat;
if (timeSinceHeartbeat > this.electionTimeout) {
this.startElection();
}
}, 50);
}
startElection() {
console.log(`Node ${this.id}: Starting election for term ${this.currentTerm + 1}`);
// Become candidate
this.state = 'candidate';
this.currentTerm++;
this.votedFor = this.id; // Vote for self
this.lastHeartbeat = Date.now();
let votesReceived = 1; // Self vote
const votesNeeded = Math.floor(this.peers.length / 2) + 1;
// Request votes from peers
for (const peer of this.peers) {
if (peer === this.id) continue;
this.requestVote(peer, (granted) => {
if (granted) {
votesReceived++;
if (votesReceived >= votesNeeded && this.state === 'candidate') {
this.becomeLeader();
}
}
});
}
}
requestVote(peerId, callback) {
// Send RequestVote RPC
const request = {
term: this.currentTerm,
candidateId: this.id
};
// Simulate network call
sendRPC(peerId, 'requestVote', request, (response) => {
if (response.term > this.currentTerm) {
// Discovered higher term - revert to follower
this.currentTerm = response.term;
this.state = 'follower';
this.votedFor = null;
}
callback(response.voteGranted);
});
}
handleRequestVote(request) {
// Received vote request from candidate
if (request.term < this.currentTerm) {
return { voteGranted: false, term: this.currentTerm };
}
if (request.term > this.currentTerm) {
this.currentTerm = request.term;
this.state = 'follower';
this.votedFor = null;
}
// Grant vote if haven't voted yet
if (this.votedFor === null || this.votedFor === request.candidateId) {
this.votedFor = request.candidateId;
this.lastHeartbeat = Date.now();
return { voteGranted: true, term: this.currentTerm };
}
return { voteGranted: false, term: this.currentTerm };
}
becomeLeader() {
console.log(`Node ${this.id}: I AM THE LEADER for term ${this.currentTerm}!`);
this.state = 'leader';
this.leader = this.id;
// Send heartbeats to maintain leadership
this.startHeartbeats();
}
startHeartbeats() {
this.heartbeatInterval = setInterval(() => {
if (this.state !== 'leader') {
clearInterval(this.heartbeatInterval);
return;
}
// Send AppendEntries (heartbeat) to all peers
for (const peer of this.peers) {
if (peer === this.id) continue;
this.sendHeartbeat(peer);
}
}, 50); // Send every 50ms
}
sendHeartbeat(peerId) {
const request = {
term: this.currentTerm,
leaderId: this.id
};
sendRPC(peerId, 'appendEntries', request, (response) => {
if (response.term > this.currentTerm) {
// Higher term discovered - step down
this.currentTerm = response.term;
this.state = 'follower';
this.leader = null;
clearInterval(this.heartbeatInterval);
}
});
}
handleAppendEntries(request) {
// Received heartbeat from leader
if (request.term >= this.currentTerm) {
this.currentTerm = request.term;
this.state = 'follower';
this.leader = request.leaderId;
this.lastHeartbeat = Date.now();
return { success: true, term: this.currentTerm };
}
return { success: false, term: this.currentTerm };
}
}
// Example
const node1 = new RaftNode(1, [1, 2, 3, 4, 5]);
const node2 = new RaftNode(2, [1, 2, 3, 4, 5]);
const node3 = new RaftNode(3, [1, 2, 3, 4, 5]);
Raft guarantees:
- Safety: At most one leader per term
- Liveness: Eventually elects a leader
- Log replication: Leader's log is source of truth
Pros:
- Understandable (simpler than Paxos)
- Production-ready implementations
- Strong consistency guarantees
Cons:
- Complex to implement correctly
- Network partition sensitive
- Leader bottleneck for writes
3. Paxos
Classic consensus algorithm (Leslie Lamport, 1989). Very difficult to understand and implement.
Roles:
- Proposers: Propose values
- Acceptors: Vote on proposals
- Learners: Learn chosen value
Phases:
Phase 1: Prepare - Proposer sends prepare(n) to acceptors - Acceptors promise not to accept proposals < n Phase 2: Accept - Proposer sends accept(n, value) - Acceptors accept if >= highest promise
Why it's hard: Subtle edge cases, liveness not guaranteed, difficult to implement.
Modern variants:
- Multi-Paxos: Optimized for multiple decisions
- Fast Paxos: Faster under low contention
- Raft: Easier-to-understand alternative
Quorum Systems
Definition: Subset of nodes that can make progress.
3 nodes: Quorum = 2 5 nodes: Quorum = 3 7 nodes: Quorum = 4 Formula: Quorum = ⌊N/2⌋ + 1
Why quorum works:
5 nodes split into 2 partitions: Partition A: 3 nodes (has quorum) ✓ Partition B: 2 nodes (no quorum) ✗ Only Partition A can elect leader!
Trade-offs:
More nodes: + Higher availability (tolerate more failures) - Slower consensus (need more votes) - More expensive Fewer nodes: + Faster consensus + Cheaper - Lower availability
Real-World Systems
Kubernetes (etcd)
etcd uses Raft for leader election 3 or 5 etcd nodes (always odd number) Leader handles all writes Followers replicate If leader dies: - Followers timeout (150-300ms) - New election starts - New leader elected in ~1 second
Apache ZooKeeper
ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast) Similar to Paxos/Raft Used by: - Kafka (broker coordination) - HBase (master election) - Hadoop (NameNode HA)
Consul
Uses Raft for consensus Deploy 3 or 5 server nodes Leader handles writes Followers forward to leader Used for service discovery + KV store
Implementation Patterns
Using etcd for Leader Election
// Go client for etcd leader election
package main
import (
"context"
"fmt"
"go.etcd.io/etcd/client/v3/concurrency"
clientv3 "go.etcd.io/etcd/client/v3"
)
func main() {
client, _ := clientv3.New(clientv3.Config{
Endpoints: []string{"localhost:2379"},
})
defer client.Close()
session, _ := concurrency.NewSession(client)
defer session.Close()
election := concurrency.NewElection(session, "/my-service/leader")
// Campaign to become leader (blocks until elected)
ctx := context.Background()
if err := election.Campaign(ctx, "node-1"); err != nil {
fmt.Println("Failed to campaign:", err)
return
}
fmt.Println("I AM THE LEADER!")
// Do leader work...
doLeaderWork()
// Resign leadership
election.Resign(ctx)
}
func doLeaderWork() {
// Only one node executes this at a time
fmt.Println("Doing critical work as leader...")
}
Using Redis for Leader Election
import redis
import time
class RedisLeaderElection:
def __init__(self, redis_client, key, node_id, ttl=10):
self.redis = redis_client
self.key = key
self.node_id = node_id
self.ttl = ttl
self.is_leader = False
def try_become_leader(self):
"""Try to acquire leadership"""
# SET key value NX EX ttl
# NX: only set if key doesn't exist
# EX: expire after ttl seconds
result = self.redis.set(
self.key,
self.node_id,
nx=True,
ex=self.ttl
)
self.is_leader = result is not None
return self.is_leader
def maintain_leadership(self):
"""Renew leadership lock"""
if not self.is_leader:
return False
# Renew lock before expiry
result = self.redis.set(
self.key,
self.node_id,
xx=True, # Only set if key exists
ex=self.ttl
)
self.is_leader = result is not None
return self.is_leader
def resign(self):
"""Voluntarily give up leadership"""
if self.is_leader:
self.redis.delete(self.key)
self.is_leader = False
# Usage
redis_client = redis.Redis(host='localhost', port=6379)
election = RedisLeaderElection(redis_client, 'my-service-leader', 'node-1')
# Try to become leader
if election.try_become_leader():
print("I AM THE LEADER!")
while True:
# Do leader work
do_leader_work()
# Renew leadership every 5 seconds
time.sleep(5)
if not election.maintain_leadership():
print("Lost leadership!")
break
else:
print("Another node is the leader")
Interview Tips 💡
When discussing leader election in system design interviews:
- Explain the need: "Multiple database replicas need one master for consistent writes..."
- Quorum math: "With 5 nodes, need 3 votes to prevent split-brain..."
- Choose algorithm: "Use Raft (via etcd) for simplicity and proven implementation..."
- Handle failures: "If leader dies, followers timeout after 150-300ms and start election..."
- Discuss trade-offs: "3 nodes cheaper but 5 nodes more fault-tolerant..."
- Real examples: "Kubernetes uses etcd with Raft, Kafka uses ZooKeeper..."
Related Concepts
- Distributed Consensus — Agreement across nodes
- CAP Theorem — Consistency vs availability trade-offs
- Database Replication — Master-slave patterns
- Quorum — Majority voting systems
- Split-Brain Problem — Partition handling
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Raft Consensus Algorithm
A comprehensive guide to Raft, the consensus algorithm powering Etcd, Consul, and Kubernetes. Leader election, log replication, safety guarantees, and production deployment patterns.
Merkle Trees (Data Synchronization)
Hash tree data structure enabling efficient verification of data integrity and synchronization across distributed systems, used in Git, Bitcoin, Cassandra, and IPFS for tamper detection and incremental sync.
CAP Theorem
Consistency, Availability, Partition Tolerance. Why you can only pick two in distributed systems, and how real databases like MongoDB, Cassandra, and DynamoDB make the trade-off.