What is a Bloom Filter?
A Bloom Filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can give false positives but never false negatives.
Trade-off: 100MB memory instead of 100GB, in exchange for occasional "maybe" answers.
The Problem: Expensive Membership Tests
Task: Check if username "alice123" exists in 1 billion user database Naive approaches: 1. Database query: SELECT * FROM users WHERE username = 'alice123' → 100ms latency, expensive 2. HashSet in memory: Set<String> usernames → 1B usernames × 100 bytes = 100GB RAM Bloom Filter approach: → 1.2GB RAM, O(1) lookups, <1μs latency → Trade-off: 1% false positive rate
How It Works
A Bloom Filter is a bit array of size m with k hash functions.
graph TB
Item[Item: alice123] -->|hash1| H1[Hash Function 1]
Item -->|hash2| H2[Hash Function 2]
Item -->|hash3| H3[Hash Function 3]
H1 -->|index 5| BitArray[Bit Array: 0010010100...]
H2 -->|index 12| BitArray
H3 -->|index 23| BitArray
BitArray -->|set bits to 1| Result[Bits 5, 12, 23 = 1]
Adding an Element
- Pass element through
khash functions - Get
karray indices - Set bits at those indices to
1
Checking an Element
- Pass element through same
khash functions - Check if all bits at those indices are
1 - Result:
- All bits = 1: Element probably in set (might be false positive)
- Any bit = 0: Element definitely not in set (guaranteed accurate)
Example
Bit array size: 10
Hash functions: 2
Add "alice":
hash1("alice") = 2 → set bit 2
hash2("alice") = 7 → set bit 7
Bits: 0010000100
Add "bob":
hash1("bob") = 3 → set bit 3
hash2("bob") = 7 → set bit 7 (already set)
Bits: 0011000100
Check "alice":
hash1("alice") = 2 → bit 2 is 1 ✓
hash2("alice") = 7 → bit 7 is 1 ✓
Result: PROBABLY in set ✅
Check "charlie":
hash1("charlie") = 2 → bit 2 is 1 ✓
hash2("charlie") = 5 → bit 5 is 0 ✗
Result: DEFINITELY NOT in set ❌
Check "eve":
hash1("eve") = 2 → bit 2 is 1 ✓
hash2("eve") = 7 → bit 7 is 1 ✓
Result: PROBABLY in set ✅ (FALSE POSITIVE!)
Implementation
Python Implementation
import hashlib
import math
class BloomFilter:
def __init__(self, expected_elements, false_positive_rate):
# Calculate optimal size and hash count
self.size = self._optimal_size(expected_elements, false_positive_rate)
self.hash_count = self._optimal_hash_count(self.size, expected_elements)
self.bit_array = [0] * self.size
self.count = 0
def _optimal_size(self, n, p):
"""
m = -(n * ln(p)) / (ln(2)^2)
n = expected elements
p = false positive rate
"""
return int(-(n * math.log(p)) / (math.log(2) ** 2))
def _optimal_hash_count(self, m, n):
"""
k = (m / n) * ln(2)
"""
return int((m / n) * math.log(2))
def _hashes(self, item):
"""Generate k hash values"""
# Use two hash functions to simulate k hashes (double hashing)
h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
for i in range(self.hash_count):
yield (h1 + i * h2) % self.size
def add(self, item):
"""Add item to bloom filter"""
for index in self._hashes(item):
self.bit_array[index] = 1
self.count += 1
def contains(self, item):
"""Check if item might be in set"""
return all(self.bit_array[index] == 1 for index in self._hashes(item))
def __contains__(self, item):
"""Support 'in' operator"""
return self.contains(item)
# Usage
bloom = BloomFilter(expected_elements=1000000, false_positive_rate=0.01)
# Add elements
bloom.add("alice123")
bloom.add("bob456")
bloom.add("charlie789")
# Check membership
print("alice123" in bloom) # True (correct)
print("bob456" in bloom) # True (correct)
print("unknown" in bloom) # False (correct)
print("xyz" in bloom) # Maybe True (false positive)
JavaScript Implementation
class BloomFilter {
constructor(expectedElements, falsePositiveRate) {
this.size = this._optimalSize(expectedElements, falsePositiveRate);
this.hashCount = this._optimalHashCount(this.size, expectedElements);
this.bitArray = new Array(this.size).fill(0);
}
_optimalSize(n, p) {
return Math.ceil(-(n * Math.log(p)) / (Math.log(2) ** 2));
}
_optimalHashCount(m, n) {
return Math.ceil((m / n) * Math.log(2));
}
_hash(str, seed = 0) {
let h = seed;
for (let i = 0; i < str.length; i++) {
h = Math.imul(h ^ str.charCodeAt(i), 2654435761);
}
return (h ^ (h >>> 16)) >>> 0;
}
*_hashes(item) {
const h1 = this._hash(item, 0);
const h2 = this._hash(item, 1);
for (let i = 0; i < this.hashCount; i++) {
yield (h1 + i * h2) % this.size;
}
}
add(item) {
for (const index of this._hashes(item)) {
this.bitArray[index] = 1;
}
}
contains(item) {
for (const index of this._hashes(item)) {
if (this.bitArray[index] === 0) return false;
}
return true;
}
}
// Usage
const bloom = new BloomFilter(1000000, 0.01);
bloom.add("user123");
console.log(bloom.contains("user123")); // true
console.log(bloom.contains("unknown")); // false
Mathematical Foundation
False Positive Probability
After inserting n elements:
P(false positive) ≈ (1 - e^(-kn/m))^k Where: - k = number of hash functions - n = number of inserted elements - m = bit array size
Optimal Parameters
Optimal bit array size:
m = -(n * ln(p)) / (ln(2)^2) m ≈ 1.44 * n * log₂(1/p) Example: n = 1,000,000 p = 0.01 (1% false positive) m ≈ 9,585,059 bits ≈ 1.2 MB
Optimal hash function count:
k = (m/n) * ln(2) k ≈ 0.693 * (m/n) For above example: k ≈ 7 hash functions
Space Savings
HashSet vs Bloom Filter (1M usernames, 1% FPR): HashSet: - 1M × 100 bytes/username = 100 MB Bloom Filter: - 1.2 MB (83x smaller!)
Real-World Use Cases
1. Database Query Optimization
Cassandra & HBase:
# Before querying disk, check bloom filter
def get_user(user_id):
sstable_bloom = load_bloom_filter("users_sstable_5")
if user_id not in sstable_bloom:
# Definitely not in this SSTable, skip disk read
return None
# Might be in SSTable, do expensive disk read
return read_from_disk(user_id)
# Saves 99% of unnecessary disk reads!
2. Web Crawler (Avoid Duplicate URLs)
class WebCrawler:
def __init__(self):
self.visited = BloomFilter(100_000_000, 0.001) # 100M URLs
def crawl(self, url):
if url in self.visited:
return # Skip (probably already crawled)
self.visited.add(url)
page = fetch(url)
# Process page...
vs HashSet: 100M URLs × 100 bytes = 10GB vs 120MB (83x savings)
3. Malicious URL Detection (Chrome Safe Browsing)
// Client-side bloom filter (downloaded with browser)
const maliciousURLBloom = new BloomFilter(1000000, 0.001);
function checkURL(url) {
if (!maliciousURLBloom.contains(url)) {
return "SAFE"; // Definitely safe
}
// Might be malicious, check with server
return await checkWithServer(url);
}
// 99.9% of URLs filtered locally (no network call)
// 0.1% false positives require server check
4. CDN Cache Decision
# Only cache items seen multiple times (avoid "one-hit wonders")
class SmartCDN:
def __init__(self):
self.seen_once = BloomFilter(10_000_000, 0.01)
self.cache = {}
def get(self, url):
if url in self.cache:
return self.cache[url]
content = fetch_from_origin(url)
if url in self.seen_once:
# Seen twice, cache it
self.cache[url] = content
else:
# First time seeing, mark as seen
self.seen_once.add(url)
return content
5. Bitcoin SPV Wallets
# Lightweight Bitcoin client
# Only download blocks containing my transactions
class SPVWallet:
def __init__(self):
self.my_addresses_bloom = BloomFilter(1000, 0.001)
for addr in my_addresses:
self.my_addresses_bloom.add(addr)
def filter_block(self, block):
for tx in block.transactions:
if any(addr in self.my_addresses_bloom for addr in tx.outputs):
return True # Download this block
return False # Skip this block
Variants & Extensions
1. Counting Bloom Filter
Allows deletions by using counters instead of bits.
class CountingBloomFilter:
def __init__(self, size, hash_count):
self.counters = [0] * size
self.hash_count = hash_count
def add(self, item):
for index in self._hashes(item):
self.counters[index] += 1
def remove(self, item):
for index in self._hashes(item):
if self.counters[index] > 0:
self.counters[index] -= 1
def contains(self, item):
return all(self.counters[index] > 0 for index in self._hashes(item))
Trade-off: 4-8x more memory (counters vs bits)
2. Scalable Bloom Filter
Dynamically grows as more elements added.
class ScalableBloomFilter:
def __init__(self, initial_capacity, fp_rate):
self.filters = [BloomFilter(initial_capacity, fp_rate * 0.9)]
self.capacity = initial_capacity
self.fp_rate = fp_rate
def add(self, item):
if self.filters[-1].count >= self.capacity:
# Create new filter with doubled capacity
self.capacity *= 2
self.filters.append(BloomFilter(self.capacity, self.fp_rate * 0.9))
self.filters[-1].add(item)
def contains(self, item):
return any(f.contains(item) for f in self.filters)
Bloom Filter vs Alternatives
| Data Structure | Space | False Positives | False Negatives | Deletions |
|---|---|---|---|---|
| Bloom Filter | Very Low | Yes | No | ❌ No |
| Counting Bloom | Low | Yes | No | ✅ Yes |
| HashSet | High | No | No | ✅ Yes |
| Cuckoo Filter | Low | Yes | No | ✅ Yes |
| Quotient Filter | Low | Yes | No | ✅ Yes |
Choose Bloom Filter when:
- Space is critical
- No deletions needed
- False positives acceptable (with low rate)
- High throughput required
Interview Tips 💡
When discussing Bloom Filters in system design interviews:
- Explain the problem: "We have 1B users but can't fit a HashSet in memory..."
- Trade-off: "Bloom filter uses 1.2MB instead of 100MB, with 1% false positive rate..."
- Why false positives OK: "For cache check, false positive means extra DB query, not wrong result..."
- Real examples: "Cassandra uses bloom filters to avoid disk reads for non-existent keys..."
- Parameters: "With 1M items and 1% FPR, we need ~1.2MB and 7 hash functions..."
- Alternatives: "If we need deletions, we'd use Counting Bloom Filter or Cuckoo Filter..."
Related Concepts
- HyperLogLog — Probabilistic cardinality estimation
- Consistent Hashing — Another hash-based optimization
- Caching Strategies — Bloom filters optimize cache decisions
- Database Indexing — SSTable optimization with bloom filters
- Count-Min Sketch — Probabilistic frequency counting
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Caching Overview
High-speed data storage to reduce latency. The single most effective way to scale read-heavy systems.
HyperLogLog (Cardinality Estimation)
A probabilistic algorithm for counting unique items in massive datasets using minimal memory, with less than 1% error using just kilobytes of space.
Trie (Prefix Tree & Autocomplete)
A tree data structure optimized for storing and searching strings by prefix, enabling efficient autocomplete, spell checking, and dictionary operations in O(L) time.