Beyond the Binary Timeout
Traditional failure detection uses a fixed timeout (e.g., "If no heartbeat for 5 seconds, mark dead").
- Problem: In an unstable network (cloud), latency spikes. A fixed timeout is either too aggressive (false positives) or too slow (slow detection).
The Phi () Accrual Failure Detector, introduced by Hayashibara et al. in 2004, solves this by making the timeout adaptive.
How it Works
Instead of a binary state (Up/Down), the detector outputs a continuous value, (Phi), representing the suspicion level.
Where is the probability that a heartbeat will arrive later than the current time, given historical mean and variance.
The Scale
| Phi Value | Probability Heartbeat will arrive | Interpretation |
|---|---|---|
| 10% (0.1) | Minor delay, likely just jitter. | |
| 1% (0.01) | Moderate suspicion. | |
| 0.1% (0.001) | High suspicion. (Typical threshold) | |
| 0.000001% | Extremely high confidence it's dead. |
The Algorithm
- Sampling: Every time a heartbeat arrives, store the arrival time in a sliding window (e.g., last 1000 samples).
- Statistics: Calculate the Mean () and Variance () of the arrival intervals.
- Distribution: Assume arrival times follow a Normal Distribution (Gaussian).
- Calculation: When checking for failure at time :
- Time since last heartbeat:
- Calculate probability that a value exists in the distribution.
Real World Example: Cassandra
Apache Cassandra uses Phi Accrual to detect down nodes.
- By default,
phi_convict_thresholdis set to 8. - This means Cassandra waits until the chance that "it's just slow" is less than before declaring a node dead.
- Benefit: If EC2 network gets laggy, the mean and variance increase. The algorithm automatically relaxes the timeout. When network stabilizes, it tightens up.
Visualizing the Windows
Imagine two scenarios:
Scenario A: LAN (Stable)
- Heartbeats arrive every 100ms 2ms.
- Statistical variance is low.
- If a heartbeat is missing for 200ms, shoots up to 10+ instantly. Fast detection.
Scenario B: WAN (Jittery)
- Heartbeats arrive every 100ms 50ms.
- Statistical variance is high.
- If a heartbeat is missing for 200ms, might only be 1.5. The system "knows" this network is noisy and waits longer. Robutness.
Implementation Note
While the original paper assumes Normal Distribution, some implementations (like Akka) simplified this or use exponential distribution approximations to avoid expensive integral calculations in the hot loop.
import math
class PhiDetector:
def __init__(self, threshold=8.0):
self.threshold = threshold
self.intervals = [] # sliding window
def heartbeat(self):
now = time.time()
if self.last_time:
self.intervals.append(now - self.last_time)
self.last_time = now
def is_available(self):
phi = self._calculate_phi()
return phi < self.threshold
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Cache Eviction Policies
When the cache is full, something has to go. A comprehensive guide to LRU, LFU, ARC, and other replacement algorithms with implementation details.
Geohashing (Location Encoding)
A geocoding system that encodes latitude/longitude coordinates into short alphanumeric strings for efficient proximity searches and spatial indexing.
HyperLogLog (Cardinality Estimation)
A probabilistic algorithm for counting unique items in massive datasets using minimal memory, with less than 1% error using just kilobytes of space.