What is HyperLogLog?

HyperLogLog (HLL) is a probabilistic data structure that estimates the number of distinct elements (cardinality) in a dataset with remarkable space efficiency. It can count billions of unique items using just a few kilobytes of memory, with typical error rates under 2%.

Think of it as answering "How many unique visitors?" without actually remembering every visitor—just clever mathematical tricks.

The Problem: Counting Unique Items

You have 100 million log entries and need to answer: "How many unique IP addresses visited today?"

Naive Approach: HashSet

python

unique_ips = set()
for line in log_file:
    ip = extract_ip(line)
    unique_ips.add(ip)
    
print(len(unique_ips))  # Exact count

Click to expand code...

Cost:

100M IPs × 16 bytes (IPv6) = 1.6 GB RAM
Requires storing every single unique value
Not feasible for billions of items or streaming data

HyperLogLog Approach

python

hll = HyperLogLog()
for line in log_file:
    ip = extract_ip(line)
    hll.add(ip)
    
print(hll.count())  # Estimate: 100,234,567 ± 1%

Click to expand code...

Cost:

12 KB RAM (fixed size, regardless of input)
~0.81% standard error
Works with infinite streams

Trade-off: You sacrifice perfect accuracy for massive space savings.

The Intuition: Coin Flips

The core insight comes from probability theory:

Simple Example

Imagine flipping a coin repeatedly until you get heads:

If you see THHHH (1 tail), you probably flipped ~2 times total
If you see TTTTH (4 tails), you probably flipped ~16 times ( $2^4$ )
If you see TTTTTTTH (7 tails), you probably flipped ~128 times ( $2^7$ )

Insight: The rarest event you observe approximates the total number of trials.

Applying to Cardinality

Hash each item to a random binary string
Count leading zeros in each hash (like consecutive coin flips)
Track the maximum number of leading zeros seen
Estimate cardinality as $2^{max\_zeros}$

Example:

Hash("192.168.1.1")  = 0b00101110... → 2 leading zeros
Hash("10.0.0.1")     = 0b10010111... → 0 leading zeros
Hash("172.16.0.1")   = 0b00000110... → 5 leading zeros ← MAX

Estimate: 2^5 = 32 unique IPs seen so far

Click to expand code...

How HyperLogLog Works

Step 1: Hash and Partition

Hash each element and split the hash into two parts:

hash(item) = [bucket_index | bit_pattern]
             └─ first 14 bits─┘└─ remaining bits ─┘

Click to expand code...

With 14 bits for buckets → $2^{14}$ = 16,384 buckets (standard HLL)

Step 2: Count Leading Zeros

For each bucket, track the maximum number of leading zeros in bit_pattern:

python

def add(item):
    hash_value = hash(item)
    bucket = hash_value & 0x3FFF  # First 14 bits
    remaining = hash_value >> 14
    
    leading_zeros = count_leading_zeros(remaining) + 1
    buckets[bucket] = max(buckets[bucket], leading_zeros)

Click to expand code...

Step 3: Harmonic Mean Aggregation

Estimate cardinality using the harmonic mean of bucket values:

python

def count():
    m = 16384  # Number of buckets
    raw_estimate = alpha * m^2 / sum(2^(-bucket[i]) for i in range(m))
    
    # Apply bias correction for small/large ranges
    return apply_corrections(raw_estimate)

Click to expand code...

Why harmonic mean? It's robust to outliers (a few buckets with extreme values won't skew the estimate).

Step 4: Bias Correction

Apply empirically-derived corrections for:

Small cardinalities (< 1000): Use linear counting
Large cardinalities (near $2^{32}$ ): Adjust for hash collisions

Mathematical Foundation

Standard Error

HyperLogLog has a proven error bound:

\text{Standard Error} = \frac{1.04}{\sqrt{m}}

Where $m$ = number of buckets.

Buckets ( $m$ )	Memory	Standard Error
16	16 bytes	±26%
64	64 bytes	±13%
256	256 bytes	±6.5%
1,024	1 KB	±3.2%
16,384	12 KB	±0.81%
65,536	48 KB	±0.41%

Key insight: Doubling memory halves the error, but you still use far less than a HashSet.

Why It Works

The algorithm exploits three properties:

Hash uniformity: Good hash functions produce evenly distributed bits
Independence: Each bucket operates independently (reduces variance)
Extreme value theory: Maximum of random variables concentrates around $\log_2(n)$

Implementation Example

Python (Simplified)

python

import mmh3  # MurmurHash3
import math

class HyperLogLog:
    def __init__(self, precision=14):
        self.precision = precision
        self.m = 1 << precision  # 2^precision buckets
        self.buckets = [0] * self.m
        self.alpha = self._get_alpha(self.m)
    
    def add(self, item):
        # Hash to 32 bits
        hash_value = mmh3.hash(str(item), signed=False)
        
        # Split into bucket index and bit pattern
        bucket = hash_value & ((1 << self.precision) - 1)
        remaining = hash_value >> self.precision
        
        # Count leading zeros + 1
        leading_zeros = self._leading_zeros(remaining, 32 - self.precision) + 1
        
        # Track maximum for this bucket
        self.buckets[bucket] = max(self.buckets[bucket], leading_zeros)
    
    def count(self):
        # Harmonic mean
        raw = self.alpha * (self.m ** 2) / sum(2 ** (-x) for x in self.buckets)
        
        # Small range correction
        if raw <= 2.5 * self.m:
            zeros = self.buckets.count(0)
            if zeros != 0:
                return self.m * math.log(self.m / zeros)
        
        # No correction needed
        if raw <= (1/30) * (1 << 32):
            return raw
        
        # Large range correction  
        return -(1 << 32) * math.log(1 - raw / (1 << 32))
    
    def _leading_zeros(self, value, max_bits):
        if value == 0:
            return max_bits
        return max_bits - value.bit_length()
    
    def _get_alpha(self, m):
        if m >= 128:
            return 0.7213 / (1 + 1.079 / m)
        elif m >= 64:
            return 0.709
        elif m >= 32:
            return 0.697
        else:
            return 0.673

# Usage
hll = HyperLogLog(precision=14)
for ip in ["192.168.1.1", "10.0.0.1", "192.168.1.1", "172.16.0.1"]:
    hll.add(ip)

print(f"Estimated unique IPs: {hll.count():.0f}")  # Output: ~3

Click to expand code...

Redis Implementation

Redis has built-in HyperLogLog commands:

bash

# Add elements
PFADD daily_visitors "user123" "user456" "user123" "user789"

# Get count
PFCOUNT daily_visitors
# Output: 3 (unique users)

# Merge multiple HLLs (e.g., hourly counters into daily)
PFMERGE daily_total hour_00 hour_01 hour_02 ... hour_23
PFCOUNT daily_total

Click to expand code...

Memory: Redis HLL uses 12 KB per key.

HyperLogLog vs. Alternatives

Approach	Memory (1B items)	Accuracy	Use Case
HashSet	~8-16 GB	Exact	Small datasets
Bitmap	~125 MB	Exact	Dense integer IDs only
HyperLogLog	12 KB	±0.81%	Large-scale cardinality
Count-Min Sketch	~100 KB	±2-5%	Frequency counts (top-K)
Bloom Filter	~1 MB	False positives	Membership testing

When to use HLL:

Large datasets (millions to billions of items)
Approximate counts are acceptable
Memory is constrained
Streaming data (can't store all values)

When NOT to use HLL:

Need exact counts
Small datasets (< 10K items, just use a set)
Need to retrieve actual unique items (HLL only counts)

Real-World Use Cases

Reddit: Daily Active Users

python

# Track DAU across sharded databases
for shard in shards:
    shard.execute("SELECT hll_union_agg(user_id_hll) FROM daily_events")

# Estimate: 52,341,234 unique users (±420K)
# Memory: 12 KB per shard instead of GBs

Click to expand code...

Google Analytics: Unique Visitors

Google uses HLL++ (an optimized variant) to count unique pageviews:

Combine HLLs from different time ranges (hourly → daily → monthly)
Merge HLLs from different geographic regions
Memory efficient even with billions of visitors

Twitter: Hashtag Popularity

sql

-- PostgreSQL with HLL extension
CREATE TABLE hashtag_stats (
    tag text,
    unique_users hll
);

-- Add users to HLL
UPDATE hashtag_stats 
SET unique_users = hll_add(unique_users, hll_hash_text('user_id_123'))
WHERE tag = '#ai';

-- Count unique users
SELECT tag, hll_cardinality(unique_users)::bigint 
FROM hashtag_stats 
ORDER BY hll_cardinality(unique_users) DESC;

Click to expand code...

Facebook: Video View Counts

Estimate unique video views across billions of users:

Each video maintains an HLL
Views from the same user don't double-count
Aggregations (views per region, age group) merge HLLs

Advanced: Merging HyperLogLogs

HLLs can be merged to combine statistics:

python

def merge(hll1, hll2):
    merged = HyperLogLog(precision=hll1.precision)
    for i in range(hll1.m):
        merged.buckets[i] = max(hll1.buckets[i], hll2.buckets[i])
    return merged

# Example: Combine hourly stats
daily_hll = merge(hour_00_hll, hour_01_hll)
daily_hll = merge(daily_hll, hour_02_hll)
# ...

Click to expand code...

Applications:

Roll up time-series data (hourly → daily → monthly)
Distributed systems (merge HLLs from different servers)
A/B testing (compare unique users across experiments)

Interview Tips 💡

When discussing HyperLogLog in system design interviews:

Start with the problem: "We need to count billions of unique users, but can't store them all in memory..."
Explain the trade-off: "HyperLogLog gives us ~1% error but uses 12 KB instead of gigabytes..."
Mention use cases: "Redis uses HLL for unique visitor counts, Google Analytics for pageviews..."
Discuss mergeability: "We can combine HLLs from different time periods or servers..."
Know the limitations: "HLL only estimates count—if we need actual user IDs, we'd need a different approach..."
Compare alternatives: "For exact counts on smaller datasets, we'd use a HashSet; for top-K queries, Count-Min Sketch..."

Related Concepts

Bloom Filters — Probabilistic membership testing
Count-Min Sketch — Frequency estimation
Redis Internals — Redis PFADD/PFCOUNT commands
Caching — HLL can track cache hit rates efficiently
Distributed Systems — Merging HLLs across services

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Advanced

MapReduce

A programming model for processing massive datasets in parallel across distributed clusters. Understanding Map, Shuffle, Reduce with real-world use cases from Google, Hadoop, and Spark.

Big DataAlgorithmsDistributed Computing

Pro

Top-K Heavy Hitters (Streaming)

How to find 'Trending Topics' in massive real-time streams. MapReduce, Count-Min Sketches, and Sliding Windows.

AlgorithmsBig DataStreaming

Advanced

Bloom Filters

Space-efficient probabilistic data structure for membership testing that allows false positives but guarantees no false negatives, using minimal memory compared to hash sets.

Data StructuresProbabilisticOptimization

HyperLogLog (Cardinality Estimation)

What is HyperLogLog?

The Problem: Counting Unique Items

Naive Approach: HashSet

HyperLogLog Approach

The Intuition: Coin Flips

Simple Example

Applying to Cardinality

How HyperLogLog Works

Step 1: Hash and Partition

Step 2: Count Leading Zeros

Step 3: Harmonic Mean Aggregation

Step 4: Bias Correction

Mathematical Foundation

Standard Error

Why It Works

Implementation Example

Python (Simplified)

Redis Implementation

HyperLogLog vs. Alternatives

Real-World Use Cases

Reddit: Daily Active Users

Google Analytics: Unique Visitors

Twitter: Hashtag Popularity

Facebook: Video View Counts

Advanced: Merging HyperLogLogs

Interview Tips 💡

Related Concepts

About ScaleWiki

Related Articles

MapReduce

Top-K Heavy Hitters (Streaming)

Bloom Filters