Back to All Concepts
AlgorithmsFailure DetectionMathematicsExpert

Phi Accrual Failure Detector

An advanced algorithm that uses historical heartbeat data to calculate the probability of a node failure, adapting to network conditions dynamically.

Beyond the Binary Timeout

Traditional failure detection uses a fixed timeout (e.g., "If no heartbeat for 5 seconds, mark dead").

  • Problem: In an unstable network (cloud), latency spikes. A fixed timeout is either too aggressive (false positives) or too slow (slow detection).

The Phi (φ\varphi) Accrual Failure Detector, introduced by Hayashibara et al. in 2004, solves this by making the timeout adaptive.

How it Works

Instead of a binary state (Up/Down), the detector outputs a continuous value, φ\varphi (Phi), representing the suspicion level.

φ=log10(Plater)\varphi = -\log_{10}(P_{later})

Where PlaterP_{later} is the probability that a heartbeat will arrive later than the current time, given historical mean and variance.

The Scale

Phi ValueProbability Heartbeat will arriveInterpretation
φ=1\varphi = 110% (0.1)Minor delay, likely just jitter.
φ=2\varphi = 21% (0.01)Moderate suspicion.
φ=3\varphi = 30.1% (0.001)High suspicion. (Typical threshold)
φ=8\varphi = 80.000001%Extremely high confidence it's dead.

The Algorithm

  1. Sampling: Every time a heartbeat arrives, store the arrival time in a sliding window (e.g., last 1000 samples).
  2. Statistics: Calculate the Mean (μ\mu) and Variance (σ2\sigma^2) of the arrival intervals.
  3. Distribution: Assume arrival times follow a Normal Distribution (Gaussian).
  4. Calculation: When checking for failure at time tnowt_{now}:
    • Time since last heartbeat: tdiff=tnowtlastt_{diff} = t_{now} - t_{last}
    • Calculate probability PP that a value x>tdiffx > t_{diff} exists in the distribution.
    • φ=log10(P)\varphi = -\log_{10}(P)

Real World Example: Cassandra

Apache Cassandra uses Phi Accrual to detect down nodes.

  • By default, phi_convict_threshold is set to 8.
  • This means Cassandra waits until the chance that "it's just slow" is less than 10810^{-8} before declaring a node dead.
  • Benefit: If EC2 network gets laggy, the mean and variance increase. The algorithm automatically relaxes the timeout. When network stabilizes, it tightens up.

Visualizing the Windows

Imagine two scenarios:

Scenario A: LAN (Stable)

  • Heartbeats arrive every 100ms ±\pm 2ms.
  • Statistical variance is low.
  • If a heartbeat is missing for 200ms, φ\varphi shoots up to 10+ instantly. Fast detection.

Scenario B: WAN (Jittery)

  • Heartbeats arrive every 100ms ±\pm 50ms.
  • Statistical variance is high.
  • If a heartbeat is missing for 200ms, φ\varphi might only be 1.5. The system "knows" this network is noisy and waits longer. Robutness.

Implementation Note

While the original paper assumes Normal Distribution, some implementations (like Akka) simplified this or use exponential distribution approximations to avoid expensive integral calculations in the hot loop.

python
import math

class PhiDetector:
    def __init__(self, threshold=8.0):
        self.threshold = threshold
        self.intervals = [] # sliding window
    
    def heartbeat(self):
        now = time.time()
        if self.last_time:
            self.intervals.append(now - self.last_time)
        self.last_time = now
        
    def is_available(self):
        phi = self._calculate_phi()
        return phi < self.threshold
Click to expand code...

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles