Reliability · Failure Detection · Distributed Systems · System Design · Beginner

Heartbeat Protocol

A mechanism for failure detection in distributed systems in which nodes send periodic signals to indicate that they are still alive. This article covers implementation strategies, timeout tuning, and production patterns.

The Problem

In distributed systems, servers crash. Networks fail. How do you know if a server is still alive or has failed?

Solution: The Heartbeat Protocol.

  • The Beat: Every T seconds (e.g., 1s), Server A sends a "Heartbeat" signal to a monitor.
  • The Timeout: If the monitor receives no signal within a threshold (e.g., 3 consecutive missed beats, or 3T seconds), it assumes Server A is dead.
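
The beat-and-timeout rule above can be sketched from the monitor's side. This is a minimal illustration under stated assumptions, not a production implementation: the class name `TimeoutMonitor`, its method names, and the injectable `clock` parameter are hypothetical and exist only for this example.

```python
import time


class TimeoutMonitor:
    """Tracks the last heartbeat per server and flags servers that go quiet."""

    def __init__(self, interval=1.0, missed_beats=3, clock=time.monotonic):
        # A server is presumed dead after `missed_beats` intervals of silence.
        self.timeout = interval * missed_beats
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.last_seen = {}

    def record_heartbeat(self, server_id):
        self.last_seen[server_id] = self.clock()

    def is_alive(self, server_id):
        last = self.last_seen.get(server_id)
        if last is None:
            return False  # never heard from this server
        return (self.clock() - last) <= self.timeout

    def dead_servers(self):
        return sorted(s for s in self.last_seen if not self.is_alive(s))
```

The sketch defaults to `time.monotonic` rather than wall-clock time so that system clock adjustments do not trigger spurious timeouts. Note that a silent server may be crashed or merely unreachable; the monitor cannot tell the two apart.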

Implementation Strategies

Push Model

Each server actively sends heartbeats to the monitor.

python
import time
import threading

class HeartbeatSender:
    def __init__(self, server_id, interval=1.0):
        self.server_id = server_id
        self.interval = interval
        self.running = False
    
    def start(self):
        self.running = True
        threading.Thread(target=self._send_loop, daemon=True).start()
    
    def _send_loop(self):
        while self.running:
            # Placeholder: a real sender would transmit to the monitor here (e.g., over UDP or HTTP).
            print(f"[{self.server_id}] Heartbeat sent")
            time.sleep(self.interval)

Pull Model

The monitor actively pings each server.

javascript
class HeartbeatMonitor {
    async checkServer(serverUrl) {
        try {
            const response = await fetch(`${serverUrl}/health`, {
                signal: AbortSignal.timeout(5000)
            });
            return response.ok;
        } catch {
            return false; // Server is down
        }
    }
}
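
For comparison with the Python examples elsewhere in this article, the same pull check can be sketched in Python using only the standard library. The function names `http_probe` and `poll_once` are hypothetical; the `/health` path and 5-second timeout simply mirror the JavaScript version.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(server_url, timeout=5.0):
    """Return True if GET {server_url}/health answers with a 2xx status."""
    try:
        with urllib.request.urlopen(f"{server_url}/health", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return False  # unreachable, timed out, error status, or malformed URL


def poll_once(server_urls, probe=http_probe):
    """Probe every server once and return {url: is_healthy}."""
    return {url: probe(url) for url in server_urls}
```

Passing the probe as a parameter keeps the polling logic separate from the transport, so the same loop could be driven by a TCP check or a stub during testing.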

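Advanced: Adaptive Timeout

Instead of using a fixed threshold, the monitor can derive the timeout from recently observed heartbeat latencies so that it tracks current network conditions.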

python
import statistics

class AdaptiveMonitor:
    def __init__(self):
        self.latencies = []
    
    def calculate_timeout(self):
        if len(self.latencies) < 5:
            return 3.0
        
        mean = statistics.mean(self.latencies)
        stddev = statistics.stdev(self.latencies)
        return mean + (3 * stddev)  # covers ~99.7% of samples if latencies are roughly normal
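
The class above does not show how `latencies` gets populated. One possible approach, sketched here under assumptions, is to record each observed latency in a bounded window so that old samples age out. The names `WindowedAdaptiveMonitor` and `record`, the window size of 100, and the `min_timeout` floor are all hypothetical choices for illustration.

```python
import statistics
from collections import deque


class WindowedAdaptiveMonitor:
    def __init__(self, window=100, default_timeout=3.0, min_timeout=0.5):
        self.latencies = deque(maxlen=window)  # oldest samples drop off automatically
        self.default_timeout = default_timeout
        self.min_timeout = min_timeout

    def record(self, latency_seconds):
        self.latencies.append(latency_seconds)

    def calculate_timeout(self):
        if len(self.latencies) < 5:
            return self.default_timeout  # not enough data yet
        mean = statistics.mean(self.latencies)
        stddev = statistics.stdev(self.latencies)
        # Floor the result so a very stable network does not yield a near-zero timeout.
        return max(self.min_timeout, mean + 3 * stddev)
```

Without a bounded window the list grows forever and the timeout reacts ever more slowly to changes in the network.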

Real-World Examples

Kubernetes

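  • Kubelet sends a heartbeat every 10s by default
  • Control plane (node controller) checks node status every 5s
  • Grace period: 40s without a heartbeat before marking a node NotReady
  • Pod eviction after the node has been unreachable for 5 minutes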

Redis Sentinel

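  • Each Sentinel pings the master every 1s
  • No valid reply within the configured timeout (`down-after-milliseconds`, e.g., 5s) → Subjectively Down
  • A quorum of Sentinels agrees → Objectively Down
  • Objectively Down triggers automatic failover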
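
The step from Subjectively Down to Objectively Down can be illustrated with a simple vote count. This is a conceptual sketch only and does not reproduce Redis Sentinel's actual implementation; the function name and the shape of `sdown_votes` are hypothetical.

```python
def is_objectively_down(sdown_votes, quorum):
    """sdown_votes maps a sentinel id to True if that sentinel considers the master down."""
    votes = sum(1 for down in sdown_votes.values() if down)
    return votes >= quorum


votes = {"sentinel-1": True, "sentinel-2": True, "sentinel-3": False}
print(is_objectively_down(votes, quorum=2))  # True: two of three sentinels agree
```

Requiring agreement from several observers means that one Sentinel with a bad network link cannot trigger a failover on its own.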

Cassandra

  • Uses Phi Accrual Failure Detector
  • Gossip protocol every 1s
  • Phi threshold: 8 (suspicion level)
  • Adaptive to network conditions
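
The idea behind an accrual detector is to output a continuous suspicion value (phi) rather than a binary alive/dead verdict. The sketch below is a simplified illustration, not Cassandra's code: it assumes heartbeat inter-arrival times follow an exponential distribution, and the class name `SimplePhiDetector` and its methods are hypothetical.

```python
import math
from collections import deque


class SimplePhiDetector:
    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        if not self.intervals:
            return 0.0  # no history yet, so no suspicion
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Exponential model: P(next beat arrives later than `elapsed`) = exp(-elapsed / mean)
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now):
        return self.phi(now) >= self.threshold
```

Under this model, with heartbeats arriving 1s apart and a threshold of 8, a node becomes suspected after roughly 8 × ln 10 ≈ 18.4 seconds of silence. If heartbeats normally arrive more slowly, the same threshold automatically tolerates a longer gap.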

Common Patterns

Flapping Detection

Require N consecutive failures before marking dead:

python
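class FlappingDetector:
    def __init__(self, required_failures=3):
        self.required_failures = required_failures  # misses needed before declaring death
        self.consecutive_failures = 0

    def heartbeat_received(self):
        # Any successful heartbeat resets the failure streak.
        self.consecutive_failures = 0

    def heartbeat_missed(self):
        self.consecutive_failures += 1
        # True once the streak reaches the threshold, i.e., the server should be marked dead.
        return self.consecutive_failures >= self.required_failures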

Key Metrics

  • Heartbeat success rate
  • Average latency
  • Timeout frequency
  • False positive rate
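
As a rough illustration, these metrics could be computed from a log of probe outcomes. The `ProbeResult` record and `summarize` function are hypothetical, and the false positive rate is defined here as the share of "dead" verdicts that turned out to be wrong; other definitions exist, and ground truth is assumed to be available after the fact.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    succeeded: bool
    latency: float        # seconds; ignored when the probe failed
    declared_dead: bool   # the monitor's verdict after this probe
    actually_dead: bool   # ground truth, e.g., established in a later review


def summarize(results):
    total = len(results)
    ok = [r for r in results if r.succeeded]
    declared = [r for r in results if r.declared_dead]
    false_pos = [r for r in declared if not r.actually_dead]
    return {
        "success_rate": len(ok) / total if total else 0.0,
        "avg_latency": sum(r.latency for r in ok) / len(ok) if ok else 0.0,
        "timeout_frequency": (total - len(ok)) / total if total else 0.0,
        "false_positive_rate": len(false_pos) / len(declared) if declared else 0.0,
    }
```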

Interview Tips

  • Timeout tradeoff: Short = fast detection, more false positives. Long = fewer false positives, slower detection.
  • Adaptive timeouts: Use mean + 3*stddev as a simple adaptive rule for dynamic networks; Cassandra's Phi Accrual detector is a more sophisticated adaptive approach
  • Network partitions: Require quorum to avoid split-brain scenarios
  • Push vs Pull: Push scales better for many servers because each server initiates its own signal; Pull keeps control in the monitor and suits dashboards and on-demand checks
  • Real examples: Kubernetes uses 40s grace period, Cassandra uses Phi Accrual Detector
