What is HyperLogLog?
HyperLogLog (HLL) is a probabilistic data structure that estimates the number of distinct elements (cardinality) in a dataset with remarkable space efficiency. It can count billions of unique items using just a few kilobytes of memory, with typical error rates under 2%.
Think of it as answering "How many unique visitors?" without actually remembering every visitor—just clever mathematical tricks.
The Problem: Counting Unique Items
You have 100 million log entries and need to answer: "How many unique IP addresses visited today?"
Naive Approach: HashSet
unique_ips = set()
for line in log_file:
ip = extract_ip(line)
unique_ips.add(ip)
print(len(unique_ips)) # Exact count
Cost:
- 100M IPs × 16 bytes (IPv6) = 1.6 GB RAM
- Requires storing every single unique value
- Not feasible for billions of items or streaming data
HyperLogLog Approach
hll = HyperLogLog()
for line in log_file:
ip = extract_ip(line)
hll.add(ip)
print(hll.count()) # Estimate: 100,234,567 ± 1%
Cost:
- 12 KB RAM (fixed size, regardless of input)
- ~0.81% standard error
- Works with infinite streams
Trade-off: You sacrifice perfect accuracy for massive space savings.
The Intuition: Coin Flips
The core insight comes from probability theory:
Simple Example
Imagine flipping a coin repeatedly until you get heads:
- If you see THHHH (1 tail), you probably flipped ~2 times total
- If you see TTTTH (4 tails), you probably flipped ~16 times ()
- If you see TTTTTTTH (7 tails), you probably flipped ~128 times ()
Insight: The rarest event you observe approximates the total number of trials.
Applying to Cardinality
- Hash each item to a random binary string
- Count leading zeros in each hash (like consecutive coin flips)
- Track the maximum number of leading zeros seen
- Estimate cardinality as
Example:
Hash("192.168.1.1") = 0b00101110... → 2 leading zeros
Hash("10.0.0.1") = 0b10010111... → 0 leading zeros
Hash("172.16.0.1") = 0b00000110... → 5 leading zeros ← MAX
Estimate: 2^5 = 32 unique IPs seen so far
How HyperLogLog Works
Step 1: Hash and Partition
Hash each element and split the hash into two parts:
hash(item) = [bucket_index | bit_pattern]
└─ first 14 bits─┘└─ remaining bits ─┘
With 14 bits for buckets → = 16,384 buckets (standard HLL)
Step 2: Count Leading Zeros
For each bucket, track the maximum number of leading zeros in bit_pattern:
def add(item):
hash_value = hash(item)
bucket = hash_value & 0x3FFF # First 14 bits
remaining = hash_value >> 14
leading_zeros = count_leading_zeros(remaining) + 1
buckets[bucket] = max(buckets[bucket], leading_zeros)
Step 3: Harmonic Mean Aggregation
Estimate cardinality using the harmonic mean of bucket values:
def count():
m = 16384 # Number of buckets
raw_estimate = alpha * m^2 / sum(2^(-bucket[i]) for i in range(m))
# Apply bias correction for small/large ranges
return apply_corrections(raw_estimate)
Why harmonic mean? It's robust to outliers (a few buckets with extreme values won't skew the estimate).
Step 4: Bias Correction
Apply empirically-derived corrections for:
- Small cardinalities (< 1000): Use linear counting
- Large cardinalities (near ): Adjust for hash collisions
Mathematical Foundation
Standard Error
HyperLogLog has a proven error bound:
Where = number of buckets.
| Buckets () | Memory | Standard Error |
|---|---|---|
| 16 | 16 bytes | ±26% |
| 64 | 64 bytes | ±13% |
| 256 | 256 bytes | ±6.5% |
| 1,024 | 1 KB | ±3.2% |
| 16,384 | 12 KB | ±0.81% |
| 65,536 | 48 KB | ±0.41% |
Key insight: Doubling memory halves the error, but you still use far less than a HashSet.
Why It Works
The algorithm exploits three properties:
- Hash uniformity: Good hash functions produce evenly distributed bits
- Independence: Each bucket operates independently (reduces variance)
- Extreme value theory: Maximum of random variables concentrates around
Implementation Example
Python (Simplified)
import mmh3 # MurmurHash3
import math
class HyperLogLog:
def __init__(self, precision=14):
self.precision = precision
self.m = 1 << precision # 2^precision buckets
self.buckets = [0] * self.m
self.alpha = self._get_alpha(self.m)
def add(self, item):
# Hash to 32 bits
hash_value = mmh3.hash(str(item), signed=False)
# Split into bucket index and bit pattern
bucket = hash_value & ((1 << self.precision) - 1)
remaining = hash_value >> self.precision
# Count leading zeros + 1
leading_zeros = self._leading_zeros(remaining, 32 - self.precision) + 1
# Track maximum for this bucket
self.buckets[bucket] = max(self.buckets[bucket], leading_zeros)
def count(self):
# Harmonic mean
raw = self.alpha * (self.m ** 2) / sum(2 ** (-x) for x in self.buckets)
# Small range correction
if raw <= 2.5 * self.m:
zeros = self.buckets.count(0)
if zeros != 0:
return self.m * math.log(self.m / zeros)
# No correction needed
if raw <= (1/30) * (1 << 32):
return raw
# Large range correction
return -(1 << 32) * math.log(1 - raw / (1 << 32))
def _leading_zeros(self, value, max_bits):
if value == 0:
return max_bits
return max_bits - value.bit_length()
def _get_alpha(self, m):
if m >= 128:
return 0.7213 / (1 + 1.079 / m)
elif m >= 64:
return 0.709
elif m >= 32:
return 0.697
else:
return 0.673
# Usage
hll = HyperLogLog(precision=14)
for ip in ["192.168.1.1", "10.0.0.1", "192.168.1.1", "172.16.0.1"]:
hll.add(ip)
print(f"Estimated unique IPs: {hll.count():.0f}") # Output: ~3
Redis Implementation
Redis has built-in HyperLogLog commands:
# Add elements PFADD daily_visitors "user123" "user456" "user123" "user789" # Get count PFCOUNT daily_visitors # Output: 3 (unique users) # Merge multiple HLLs (e.g., hourly counters into daily) PFMERGE daily_total hour_00 hour_01 hour_02 ... hour_23 PFCOUNT daily_total
Memory: Redis HLL uses 12 KB per key.
HyperLogLog vs. Alternatives
| Approach | Memory (1B items) | Accuracy | Use Case |
|---|---|---|---|
| HashSet | ~8-16 GB | Exact | Small datasets |
| Bitmap | ~125 MB | Exact | Dense integer IDs only |
| HyperLogLog | 12 KB | ±0.81% | Large-scale cardinality |
| Count-Min Sketch | ~100 KB | ±2-5% | Frequency counts (top-K) |
| Bloom Filter | ~1 MB | False positives | Membership testing |
When to use HLL:
- Large datasets (millions to billions of items)
- Approximate counts are acceptable
- Memory is constrained
- Streaming data (can't store all values)
When NOT to use HLL:
- Need exact counts
- Small datasets (< 10K items, just use a set)
- Need to retrieve actual unique items (HLL only counts)
Real-World Use Cases
Reddit: Daily Active Users
# Track DAU across sharded databases
for shard in shards:
shard.execute("SELECT hll_union_agg(user_id_hll) FROM daily_events")
# Estimate: 52,341,234 unique users (±420K)
# Memory: 12 KB per shard instead of GBs
Google Analytics: Unique Visitors
Google uses HLL++ (an optimized variant) to count unique pageviews:
- Combine HLLs from different time ranges (hourly → daily → monthly)
- Merge HLLs from different geographic regions
- Memory efficient even with billions of visitors
Twitter: Hashtag Popularity
-- PostgreSQL with HLL extension
CREATE TABLE hashtag_stats (
tag text,
unique_users hll
);
-- Add users to HLL
UPDATE hashtag_stats
SET unique_users = hll_add(unique_users, hll_hash_text('user_id_123'))
WHERE tag = '#ai';
-- Count unique users
SELECT tag, hll_cardinality(unique_users)::bigint
FROM hashtag_stats
ORDER BY hll_cardinality(unique_users) DESC;
Facebook: Video View Counts
Estimate unique video views across billions of users:
- Each video maintains an HLL
- Views from the same user don't double-count
- Aggregations (views per region, age group) merge HLLs
Advanced: Merging HyperLogLogs
HLLs can be merged to combine statistics:
def merge(hll1, hll2):
merged = HyperLogLog(precision=hll1.precision)
for i in range(hll1.m):
merged.buckets[i] = max(hll1.buckets[i], hll2.buckets[i])
return merged
# Example: Combine hourly stats
daily_hll = merge(hour_00_hll, hour_01_hll)
daily_hll = merge(daily_hll, hour_02_hll)
# ...
Applications:
- Roll up time-series data (hourly → daily → monthly)
- Distributed systems (merge HLLs from different servers)
- A/B testing (compare unique users across experiments)
Interview Tips 💡
When discussing HyperLogLog in system design interviews:
- Start with the problem: "We need to count billions of unique users, but can't store them all in memory..."
- Explain the trade-off: "HyperLogLog gives us ~1% error but uses 12 KB instead of gigabytes..."
- Mention use cases: "Redis uses HLL for unique visitor counts, Google Analytics for pageviews..."
- Discuss mergeability: "We can combine HLLs from different time periods or servers..."
- Know the limitations: "HLL only estimates count—if we need actual user IDs, we'd need a different approach..."
- Compare alternatives: "For exact counts on smaller datasets, we'd use a HashSet; for top-K queries, Count-Min Sketch..."
Related Concepts
- Bloom Filters — Probabilistic membership testing
- Count-Min Sketch — Frequency estimation
- Redis Internals — Redis PFADD/PFCOUNT commands
- Caching — HLL can track cache hit rates efficiently
- Distributed Systems — Merging HLLs across services
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
MapReduce
A programming model for processing massive datasets in parallel across distributed clusters. Understanding Map, Shuffle, Reduce with real-world use cases from Google, Hadoop, and Spark.
Top-K Heavy Hitters (Streaming)
How to find 'Trending Topics' in massive real-time streams. MapReduce, Count-Min Sketches, and Sliding Windows.
Bloom Filters
Space-efficient probabilistic data structure for membership testing that allows false positives but guarantees no false negatives, using minimal memory compared to hash sets.