Back to All Concepts
AlgorithmsBig DataProbabilisticExpert

HyperLogLog (Cardinality Estimation)

A probabilistic algorithm for counting unique items in massive datasets using minimal memory, with less than 1% error using just kilobytes of space.

What is HyperLogLog?

HyperLogLog (HLL) is a probabilistic data structure that estimates the number of distinct elements (cardinality) in a dataset with remarkable space efficiency. It can count billions of unique items using just a few kilobytes of memory, with typical error rates under 2%.

Think of it as answering "How many unique visitors?" without actually remembering every visitor—just clever mathematical tricks.

The Problem: Counting Unique Items

You have 100 million log entries and need to answer: "How many unique IP addresses visited today?"

Naive Approach: HashSet

python
unique_ips = set()
for line in log_file:
    ip = extract_ip(line)
    unique_ips.add(ip)
    
print(len(unique_ips))  # Exact count
Click to expand code...

Cost:

  • 100M IPs × 16 bytes (IPv6) = 1.6 GB RAM
  • Requires storing every single unique value
  • Not feasible for billions of items or streaming data

HyperLogLog Approach

python
hll = HyperLogLog()
for line in log_file:
    ip = extract_ip(line)
    hll.add(ip)
    
print(hll.count())  # Estimate: 100,234,567 ± 1%
Click to expand code...

Cost:

  • 12 KB RAM (fixed size, regardless of input)
  • ~0.81% standard error
  • Works with infinite streams

Trade-off: You sacrifice perfect accuracy for massive space savings.

The Intuition: Coin Flips

The core insight comes from probability theory:

Simple Example

Imagine flipping a coin repeatedly until you get heads:

  • If you see THHHH (1 tail), you probably flipped ~2 times total
  • If you see TTTTH (4 tails), you probably flipped ~16 times (242^4)
  • If you see TTTTTTTH (7 tails), you probably flipped ~128 times (272^7)

Insight: The rarest event you observe approximates the total number of trials.

Applying to Cardinality

  1. Hash each item to a random binary string
  2. Count leading zeros in each hash (like consecutive coin flips)
  3. Track the maximum number of leading zeros seen
  4. Estimate cardinality as 2max_zeros2^{max\_zeros}

Example:

Hash("192.168.1.1")  = 0b00101110... → 2 leading zeros
Hash("10.0.0.1")     = 0b10010111... → 0 leading zeros
Hash("172.16.0.1")   = 0b00000110... → 5 leading zeros ← MAX

Estimate: 2^5 = 32 unique IPs seen so far
Click to expand code...

How HyperLogLog Works

Step 1: Hash and Partition

Hash each element and split the hash into two parts:

hash(item) = [bucket_index | bit_pattern]
             └─ first 14 bits─┘└─ remaining bits ─┘
Click to expand code...

With 14 bits for buckets → 2142^{14} = 16,384 buckets (standard HLL)

Step 2: Count Leading Zeros

For each bucket, track the maximum number of leading zeros in bit_pattern:

python
def add(item):
    hash_value = hash(item)
    bucket = hash_value & 0x3FFF  # First 14 bits
    remaining = hash_value >> 14
    
    leading_zeros = count_leading_zeros(remaining) + 1
    buckets[bucket] = max(buckets[bucket], leading_zeros)
Click to expand code...

Step 3: Harmonic Mean Aggregation

Estimate cardinality using the harmonic mean of bucket values:

python
def count():
    m = 16384  # Number of buckets
    raw_estimate = alpha * m^2 / sum(2^(-bucket[i]) for i in range(m))
    
    # Apply bias correction for small/large ranges
    return apply_corrections(raw_estimate)
Click to expand code...

Why harmonic mean? It's robust to outliers (a few buckets with extreme values won't skew the estimate).

Step 4: Bias Correction

Apply empirically-derived corrections for:

  • Small cardinalities (< 1000): Use linear counting
  • Large cardinalities (near 2322^{32}): Adjust for hash collisions

Mathematical Foundation

Standard Error

HyperLogLog has a proven error bound:

Standard Error=1.04m\text{Standard Error} = \frac{1.04}{\sqrt{m}}

Where mm = number of buckets.

Buckets (mm)MemoryStandard Error
1616 bytes±26%
6464 bytes±13%
256256 bytes±6.5%
1,0241 KB±3.2%
16,38412 KB±0.81%
65,53648 KB±0.41%

Key insight: Doubling memory halves the error, but you still use far less than a HashSet.

Why It Works

The algorithm exploits three properties:

  1. Hash uniformity: Good hash functions produce evenly distributed bits
  2. Independence: Each bucket operates independently (reduces variance)
  3. Extreme value theory: Maximum of random variables concentrates around log2(n)\log_2(n)

Implementation Example

Python (Simplified)

python
import mmh3  # MurmurHash3
import math

class HyperLogLog:
    def __init__(self, precision=14):
        self.precision = precision
        self.m = 1 << precision  # 2^precision buckets
        self.buckets = [0] * self.m
        self.alpha = self._get_alpha(self.m)
    
    def add(self, item):
        # Hash to 32 bits
        hash_value = mmh3.hash(str(item), signed=False)
        
        # Split into bucket index and bit pattern
        bucket = hash_value & ((1 << self.precision) - 1)
        remaining = hash_value >> self.precision
        
        # Count leading zeros + 1
        leading_zeros = self._leading_zeros(remaining, 32 - self.precision) + 1
        
        # Track maximum for this bucket
        self.buckets[bucket] = max(self.buckets[bucket], leading_zeros)
    
    def count(self):
        # Harmonic mean
        raw = self.alpha * (self.m ** 2) / sum(2 ** (-x) for x in self.buckets)
        
        # Small range correction
        if raw <= 2.5 * self.m:
            zeros = self.buckets.count(0)
            if zeros != 0:
                return self.m * math.log(self.m / zeros)
        
        # No correction needed
        if raw <= (1/30) * (1 << 32):
            return raw
        
        # Large range correction  
        return -(1 << 32) * math.log(1 - raw / (1 << 32))
    
    def _leading_zeros(self, value, max_bits):
        if value == 0:
            return max_bits
        return max_bits - value.bit_length()
    
    def _get_alpha(self, m):
        if m >= 128:
            return 0.7213 / (1 + 1.079 / m)
        elif m >= 64:
            return 0.709
        elif m >= 32:
            return 0.697
        else:
            return 0.673

# Usage
hll = HyperLogLog(precision=14)
for ip in ["192.168.1.1", "10.0.0.1", "192.168.1.1", "172.16.0.1"]:
    hll.add(ip)

print(f"Estimated unique IPs: {hll.count():.0f}")  # Output: ~3
Click to expand code...

Redis Implementation

Redis has built-in HyperLogLog commands:

bash
# Add elements
PFADD daily_visitors "user123" "user456" "user123" "user789"

# Get count
PFCOUNT daily_visitors
# Output: 3 (unique users)

# Merge multiple HLLs (e.g., hourly counters into daily)
PFMERGE daily_total hour_00 hour_01 hour_02 ... hour_23
PFCOUNT daily_total
Click to expand code...

Memory: Redis HLL uses 12 KB per key.

HyperLogLog vs. Alternatives

ApproachMemory (1B items)AccuracyUse Case
HashSet~8-16 GBExactSmall datasets
Bitmap~125 MBExactDense integer IDs only
HyperLogLog12 KB±0.81%Large-scale cardinality
Count-Min Sketch~100 KB±2-5%Frequency counts (top-K)
Bloom Filter~1 MBFalse positivesMembership testing

When to use HLL:

  • Large datasets (millions to billions of items)
  • Approximate counts are acceptable
  • Memory is constrained
  • Streaming data (can't store all values)

When NOT to use HLL:

  • Need exact counts
  • Small datasets (< 10K items, just use a set)
  • Need to retrieve actual unique items (HLL only counts)

Real-World Use Cases

Reddit: Daily Active Users

python
# Track DAU across sharded databases
for shard in shards:
    shard.execute("SELECT hll_union_agg(user_id_hll) FROM daily_events")

# Estimate: 52,341,234 unique users (±420K)
# Memory: 12 KB per shard instead of GBs
Click to expand code...

Google Analytics: Unique Visitors

Google uses HLL++ (an optimized variant) to count unique pageviews:

  • Combine HLLs from different time ranges (hourly → daily → monthly)
  • Merge HLLs from different geographic regions
  • Memory efficient even with billions of visitors

Twitter: Hashtag Popularity

sql
-- PostgreSQL with HLL extension
CREATE TABLE hashtag_stats (
    tag text,
    unique_users hll
);

-- Add users to HLL
UPDATE hashtag_stats 
SET unique_users = hll_add(unique_users, hll_hash_text('user_id_123'))
WHERE tag = '#ai';

-- Count unique users
SELECT tag, hll_cardinality(unique_users)::bigint 
FROM hashtag_stats 
ORDER BY hll_cardinality(unique_users) DESC;
Click to expand code...

Facebook: Video View Counts

Estimate unique video views across billions of users:

  • Each video maintains an HLL
  • Views from the same user don't double-count
  • Aggregations (views per region, age group) merge HLLs

Advanced: Merging HyperLogLogs

HLLs can be merged to combine statistics:

python
def merge(hll1, hll2):
    merged = HyperLogLog(precision=hll1.precision)
    for i in range(hll1.m):
        merged.buckets[i] = max(hll1.buckets[i], hll2.buckets[i])
    return merged

# Example: Combine hourly stats
daily_hll = merge(hour_00_hll, hour_01_hll)
daily_hll = merge(daily_hll, hour_02_hll)
# ...
Click to expand code...

Applications:

  • Roll up time-series data (hourly → daily → monthly)
  • Distributed systems (merge HLLs from different servers)
  • A/B testing (compare unique users across experiments)

Interview Tips 💡

When discussing HyperLogLog in system design interviews:

  1. Start with the problem: "We need to count billions of unique users, but can't store them all in memory..."
  2. Explain the trade-off: "HyperLogLog gives us ~1% error but uses 12 KB instead of gigabytes..."
  3. Mention use cases: "Redis uses HLL for unique visitor counts, Google Analytics for pageviews..."
  4. Discuss mergeability: "We can combine HLLs from different time periods or servers..."
  5. Know the limitations: "HLL only estimates count—if we need actual user IDs, we'd need a different approach..."
  6. Compare alternatives: "For exact counts on smaller datasets, we'd use a HashSet; for top-K queries, Count-Min Sketch..."

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles