The Problem
In distributed systems, servers crash. Networks fail. How do you know if a server is still alive or has failed?
Solution: The Heartbeat Protocol.
- The Beat: Every T seconds (e.g., 1s), Server A sends a "Heartbeat" signal to a monitor.
- The Timeout: If the monitor receives no signal within a threshold (e.g., 3 consecutive missed beats, or 3T seconds), it assumes Server A is dead.
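The monitor side of this protocol can be sketched in a few lines. This is a minimal illustration, not a production design: the class name `HeartbeatTracker`, its methods, and the injectable `clock` parameter are all assumptions introduced here so the timeout logic can be run without waiting in real time.

```python
import time


class HeartbeatTracker:
    """Tracks the last heartbeat per server and flags servers that went silent."""

    def __init__(self, interval=1.0, missed_beats=3, clock=time.monotonic):
        # e.g. 1s interval * 3 missed beats = 3s of allowed silence
        self.timeout = interval * missed_beats
        self.clock = clock  # injectable (hypothetical) so the logic is testable
        self.last_seen = {}

    def record(self, server_id):
        """Call whenever a heartbeat arrives from server_id."""
        self.last_seen[server_id] = self.clock()

    def is_alive(self, server_id):
        """A server is presumed alive if its last beat is within the timeout."""
        last = self.last_seen.get(server_id)
        return last is not None and (self.clock() - last) <= self.timeout


# Usage with a fake clock so the example runs instantly.
now = [0.0]
tracker = HeartbeatTracker(interval=1.0, missed_beats=3, clock=lambda: now[0])
tracker.record("server-a")

now[0] = 2.5
alive_at_2_5s = tracker.is_alive("server-a")  # still within the 3s window

now[0] = 3.5
alive_at_3_5s = tracker.is_alive("server-a")  # silent for longer than 3s
```

Using a monotonic clock rather than wall-clock time avoids false timeouts when the system clock is adjusted.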
Implementation Strategies
Push Model
Each server actively sends heartbeats to the monitor.
import threading
import time

class HeartbeatSender:
    def __init__(self, server_id, interval=1.0):
        self.server_id = server_id
        self.interval = interval
        self.running = False

    def start(self):
        self.running = True
        # Daemon thread: it will not keep the process alive on shutdown.
        threading.Thread(target=self._send_loop, daemon=True).start()

    def stop(self):
        self.running = False  # the loop exits after its current sleep

    def _send_loop(self):
        while self.running:
            # Placeholder for the real network send to the monitor.
            print(f"[{self.server_id}] Heartbeat sent")
            time.sleep(self.interval)
Pull Model
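The monitor actively pings each server.
class HeartbeatMonitor {
  async checkServer(serverUrl) {
    try {
      // Abort the request if no response arrives within 5 seconds.
      const response = await fetch(`${serverUrl}/health`, {
        signal: AbortSignal.timeout(5000)
      });
      return response.ok; // true for any 2xx status
    } catch {
      return false; // Unreachable or timed out: treat the server as down
    }
  }
}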
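For readers following the Python examples, the same pull check can be sketched with the standard library. The function names `http_probe` and `poll_servers`, and the `/health` path, are assumptions for illustration; the `probe` argument is injectable so the polling loop can be exercised without a network.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(server_url, timeout=5.0):
    """Return True if GET <server_url>/health answers with a 2xx status."""
    try:
        with urllib.request.urlopen(f"{server_url}/health", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures, HTTP error
        # statuses (urllib raises for 4xx/5xx), and malformed URLs.
        return False


def poll_servers(server_urls, probe=http_probe):
    """Probe every server once and return {url: is_healthy}."""
    return {url: probe(url) for url in server_urls}


# Usage with a stand-in probe (hypothetical) instead of real HTTP calls.
status = poll_servers(
    ["http://a.internal", "http://b.internal"],
    probe=lambda url: url.startswith("http://a"),
)
```

In a real monitor, `poll_servers` would run on a schedule and feed its results into whatever failure-counting logic decides when a server is actually dead.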
Advanced: Adaptive Timeout
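import statistics

class AdaptiveMonitor:
    def __init__(self):
        self.latencies = []  # observed heartbeat latencies, in seconds

    def record_latency(self, latency):
        self.latencies.append(latency)

    def calculate_timeout(self):
        # Too few samples to estimate spread: fall back to a fixed timeout.
        if len(self.latencies) < 5:
            return 3.0
        mean = statistics.mean(self.latencies)
        stddev = statistics.stdev(self.latencies)
        # Covers about 99.7% of samples if latencies are roughly normal.
        return mean + (3 * stddev)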
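A worked example makes the formula concrete. The sketch below is a variation on the idea, not the article's class: the name `WindowedAdaptiveTimeout`, the sliding window, and the `floor`/`ceiling` clamps are assumptions added here so that old samples age out and a burst of outliers cannot push the timeout to an extreme value.

```python
import statistics
from collections import deque


class WindowedAdaptiveTimeout:
    """mean + 3*stddev over a sliding window, clamped to sane bounds."""

    def __init__(self, window=100, min_samples=5, default=3.0,
                 floor=0.1, ceiling=10.0):
        self.latencies = deque(maxlen=window)  # oldest samples drop off
        self.min_samples = min_samples
        self.default = default
        self.floor = floor
        self.ceiling = ceiling

    def record(self, latency):
        self.latencies.append(latency)

    def timeout(self):
        if len(self.latencies) < self.min_samples:
            return self.default  # not enough data to estimate spread
        mean = statistics.mean(self.latencies)
        stddev = statistics.stdev(self.latencies)
        return min(self.ceiling, max(self.floor, mean + 3 * stddev))


adaptive = WindowedAdaptiveTimeout()
timeout_before = adaptive.timeout()  # fixed default: fewer than 5 samples

for sample in (0.10, 0.12, 0.11, 0.13, 0.09):
    adaptive.record(sample)

# mean = 0.11, sample stddev is about 0.0158, so the timeout is about 0.157s
timeout_after = adaptive.timeout()
```

With steady latencies around 110 ms the timeout tightens to roughly 157 ms; if latencies start to vary more, the standard deviation grows and the timeout widens with it.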
Real-World Examples
Kubernetes
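- Kubelet reports node status (its heartbeat) every 10s by default
- The node controller in the control plane checks node status every 5s
- Grace period: 40s without a heartbeat (a long-standing, configurable default) before marking the node NotReady
- Pods are evicted after about 5 minutes of NotReady by default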
Redis Sentinel
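- Each Sentinel pings the master every 1s
- No valid reply within the configured down-after-milliseconds window (e.g., 5s) → Subjectively Down (SDOWN)
- A quorum of Sentinels agree → Objectively Down (ODOWN)
- ODOWN triggers automatic failover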
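The two-stage decision above can be sketched as follows. This is an illustration of the quorum idea only, not Sentinel's actual implementation; the function names, the `observer_reply_ages` mapping, and the 5s window are assumptions.

```python
def subjectively_down(last_reply_age, down_after=5.0):
    """One observer's local view: no valid reply for longer than down_after seconds."""
    return last_reply_age > down_after


def objectively_down(observer_reply_ages, quorum, down_after=5.0):
    """Declare the node down only if at least `quorum` observers agree."""
    votes = sum(
        subjectively_down(age, down_after)
        for age in observer_reply_ages.values()
    )
    return votes >= quorum


# Seconds since each (hypothetical) observer last heard from the master.
ages = {"sentinel-1": 7.2, "sentinel-2": 0.4, "sentinel-3": 6.1}

odown_with_quorum_2 = objectively_down(ages, quorum=2)  # two observers agree
odown_with_quorum_3 = objectively_down(ages, quorum=3)  # one still hears it
```

Requiring agreement means a single observer with a bad network link cannot trigger a failover on its own.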
Cassandra
- Uses the Phi Accrual Failure Detector
- Gossip rounds run every 1s
- Phi threshold: 8 by default (the suspicion level above which a node is convicted)
- Adapts to network conditions instead of relying on a fixed timeout
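To show how a suspicion level differs from a fixed timeout, here is a simplified phi calculation. It assumes heartbeat gaps follow an exponential distribution, which keeps the math to one line; the original Phi Accrual formulation models gaps with a normal distribution, and this sketch is not Cassandra's code. The class name `SimplePhiDetector` and its methods are hypothetical.

```python
import math
from collections import deque


class SimplePhiDetector:
    """Phi = -log10(probability that a live node would be this quiet)."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent gaps between beats
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        if not self.intervals:
            return 0.0  # no history yet, so no suspicion
        mean_interval = sum(self.intervals) / len(self.intervals)
        if mean_interval <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Exponential model: P(gap >= elapsed) = exp(-elapsed / mean),
        # so phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean_interval * math.log(10))

    def is_suspect(self, now):
        return self.phi(now) > self.threshold


detector = SimplePhiDetector(threshold=8.0)
for t in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0):  # steady 1s heartbeats
    detector.heartbeat(t)

phi_after_1s = detector.phi(6.0)          # one normal gap: low suspicion
suspect_after_20s = detector.is_suspect(25.0)
```

Under this model, a threshold of 8 with 1s heartbeats convicts a node after roughly 18 seconds of silence (8 × ln 10), and that window stretches automatically if the observed gaps get longer.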
Common Patterns
Flapping Detection
Require N consecutive failures before marking dead:
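class FlappingDetector:
    def __init__(self, required_failures=3):
        self.required_failures = required_failures
        self.consecutive_failures = 0

    def heartbeat_received(self):
        # Any successful beat resets the failure streak.
        self.consecutive_failures = 0

    def heartbeat_missed(self):
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.required_failures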
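Counting consecutive failures prevents a healthy server from being marked dead too eagerly, but a server can still bounce back to "alive" on a single lucky beat. One common extension is hysteresis: require a streak in both directions. The sketch below is an assumption about how that might look; `HysteresisDetector` and its thresholds are illustrative names and values.

```python
class HysteresisDetector:
    """Flip state only after a streak of results that contradict it."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.healthy = True
        self.streak = 0  # consecutive results contradicting current state

    def observe(self, heartbeat_ok):
        """Feed one heartbeat result; returns the (possibly updated) state."""
        if heartbeat_ok == self.healthy:
            self.streak = 0  # result agrees with current state
        else:
            self.streak += 1
            needed = (self.fail_threshold if self.healthy
                      else self.recover_threshold)
            if self.streak >= needed:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy


hysteresis = HysteresisDetector(fail_threshold=3, recover_threshold=2)
results = [True, False, True, False, False, False, True, True]
states = [hysteresis.observe(ok) for ok in results]
# The isolated miss at index 1 is ignored; three misses in a row mark the
# server down, and it takes two good beats in a row to mark it up again.
```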
Key Metrics
- Heartbeat success rate
- Average latency
- Timeout frequency
- False positive rate
Interview Tips
- Timeout tradeoff: Short = fast detection, more false positives. Long = fewer false positives, slower detection.
- Adaptive timeouts: Use mean + 3*stddev for dynamic networks; Cassandra's Phi Accrual detector applies a similar statistical idea with a continuous suspicion level
- Network partitions: Require quorum to avoid split-brain scenarios
- Push vs Pull: Push scales better for many servers; Pull better for dashboards
- Real examples: Kubernetes uses 40s grace period, Cassandra uses Phi Accrual Detector
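The timeout tradeoff from the first tip can be put into numbers. The sketch below assumes each heartbeat is lost independently with a fixed probability, which is a simplification (real packet loss tends to come in bursts), and the function name `detection_tradeoff` is hypothetical.

```python
def detection_tradeoff(interval, missed_beats, loss_probability):
    """Return (seconds to detect a real failure, chance of a false alarm).

    Assumes independent heartbeat loss: a healthy server is wrongly
    declared dead only if `missed_beats` beats in a row are all lost.
    """
    detection_time = interval * missed_beats
    false_positive = loss_probability ** missed_beats
    return detection_time, false_positive


# 1s heartbeats with 1% independent loss, for different thresholds.
for beats in (1, 3, 5):
    seconds, false_alarm = detection_tradeoff(1.0, beats, 0.01)
    print(f"{beats} missed beats: detect in {seconds:.0f}s, "
          f"false alarm chance {false_alarm:.0e}")
```

Each extra required miss adds one interval to detection time while cutting the false alarm probability by another factor of the loss rate, which is why small thresholds such as 3 are a common compromise.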