Back to All Concepts
Data StructuresProbabilisticOptimizationAdvanced

Bloom Filters

Space-efficient probabilistic data structure for membership testing that allows false positives but guarantees no false negatives, using minimal memory compared to hash sets.

What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can give false positives but never false negatives.

Trade-off: 100MB memory instead of 100GB, in exchange for occasional "maybe" answers.

The Problem: Expensive Membership Tests

Task: Check if username "alice123" exists in 1 billion user database

Naive approaches:
1. Database query: SELECT * FROM users WHERE username = 'alice123'
   → 100ms latency, expensive

2. HashSet in memory: Set<String> usernames
   → 1B usernames × 100 bytes = 100GB RAM

Bloom Filter approach:
→ 1.2GB RAM, O(1) lookups, <1μs latency
→ Trade-off: 1% false positive rate
Click to expand code...

How It Works

A Bloom Filter is a bit array of size m with k hash functions.

mermaid
graph TB
    Item[Item: alice123] -->|hash1| H1[Hash Function 1]
    Item -->|hash2| H2[Hash Function 2]
    Item -->|hash3| H3[Hash Function 3]
    
    H1 -->|index 5| BitArray[Bit Array: 0010010100...]
    H2 -->|index 12| BitArray
    H3 -->|index 23| BitArray
    
    BitArray -->|set bits to 1| Result[Bits 5, 12, 23 = 1]
Click to expand code...

Adding an Element

  1. Pass element through k hash functions
  2. Get k array indices
  3. Set bits at those indices to 1

Checking an Element

  1. Pass element through same k hash functions
  2. Check if all bits at those indices are 1
  3. Result:
    • All bits = 1: Element probably in set (might be false positive)
    • Any bit = 0: Element definitely not in set (guaranteed accurate)

Example

Bit array size: 10
Hash functions: 2

Add "alice":
  hash1("alice") = 2 → set bit 2
  hash2("alice") = 7 → set bit 7
  Bits: 0010000100

Add "bob":
  hash1("bob") = 3 → set bit 3
  hash2("bob") = 7 → set bit 7 (already set)
  Bits: 0011000100

Check "alice":
  hash1("alice") = 2 → bit 2 is 1 ✓
  hash2("alice") = 7 → bit 7 is 1 ✓
  Result: PROBABLY in set ✅

Check "charlie":
  hash1("charlie") = 2 → bit 2 is 1 ✓
  hash2("charlie") = 5 → bit 5 is 0 ✗
  Result: DEFINITELY NOT in set ❌

Check "eve":
  hash1("eve") = 2 → bit 2 is 1 ✓
  hash2("eve") = 7 → bit 7 is 1 ✓
  Result: PROBABLY in set ✅ (FALSE POSITIVE!)
Click to expand code...

Implementation

Python Implementation

python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_elements, false_positive_rate):
        # Calculate optimal size and hash count
        self.size = self._optimal_size(expected_elements, false_positive_rate)
        self.hash_count = self._optimal_hash_count(self.size, expected_elements)
        self.bit_array = [0] * self.size
        self.count = 0
    
    def _optimal_size(self, n, p):
        """
        m = -(n * ln(p)) / (ln(2)^2)
        n = expected elements
        p = false positive rate
        """
        return int(-(n * math.log(p)) / (math.log(2) ** 2))
    
    def _optimal_hash_count(self, m, n):
        """
        k = (m / n) * ln(2)
        """
        return int((m / n) * math.log(2))
    
    def _hashes(self, item):
        """Generate k hash values"""
        # Use two hash functions to simulate k hashes (double hashing)
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size
    
    def add(self, item):
        """Add item to bloom filter"""
        for index in self._hashes(item):
            self.bit_array[index] = 1
        self.count += 1
    
    def contains(self, item):
        """Check if item might be in set"""
        return all(self.bit_array[index] == 1 for index in self._hashes(item))
    
    def __contains__(self, item):
        """Support 'in' operator"""
        return self.contains(item)

# Usage
bloom = BloomFilter(expected_elements=1000000, false_positive_rate=0.01)

# Add elements
bloom.add("alice123")
bloom.add("bob456")
bloom.add("charlie789")

# Check membership
print("alice123" in bloom)    # True (correct)
print("bob456" in bloom)      # True (correct)
print("unknown" in bloom)     # False (correct)
print("xyz" in bloom)         # Maybe True (false positive)
Click to expand code...

JavaScript Implementation

javascript
class BloomFilter {
    constructor(expectedElements, falsePositiveRate) {
        this.size = this._optimalSize(expectedElements, falsePositiveRate);
        this.hashCount = this._optimalHashCount(this.size, expectedElements);
        this.bitArray = new Array(this.size).fill(0);
    }
    
    _optimalSize(n, p) {
        return Math.ceil(-(n * Math.log(p)) / (Math.log(2) ** 2));
    }
    
    _optimalHashCount(m, n) {
        return Math.ceil((m / n) * Math.log(2));
    }
    
    _hash(str, seed = 0) {
        let h = seed;
        for (let i = 0; i < str.length; i++) {
            h = Math.imul(h ^ str.charCodeAt(i), 2654435761);
        }
        return (h ^ (h >>> 16)) >>> 0;
    }
    
    *_hashes(item) {
        const h1 = this._hash(item, 0);
        const h2 = this._hash(item, 1);
        
        for (let i = 0; i < this.hashCount; i++) {
            yield (h1 + i * h2) % this.size;
        }
    }
    
    add(item) {
        for (const index of this._hashes(item)) {
            this.bitArray[index] = 1;
        }
    }
    
    contains(item) {
        for (const index of this._hashes(item)) {
            if (this.bitArray[index] === 0) return false;
        }
        return true;
    }
}

// Usage
const bloom = new BloomFilter(1000000, 0.01);
bloom.add("user123");
console.log(bloom.contains("user123"));  // true
console.log(bloom.contains("unknown"));  // false
Click to expand code...

Mathematical Foundation

False Positive Probability

After inserting n elements:

P(false positive) ≈ (1 - e^(-kn/m))^k

Where:
- k = number of hash functions
- n = number of inserted elements
- m = bit array size
Click to expand code...

Optimal Parameters

Optimal bit array size:

m = -(n * ln(p)) / (ln(2)^2)
m ≈ 1.44 * n * log₂(1/p)

Example:
n = 1,000,000
p = 0.01 (1% false positive)
m ≈ 9,585,059 bits ≈ 1.2 MB
Click to expand code...

Optimal hash function count:

k = (m/n) * ln(2)
k ≈ 0.693 * (m/n)

For above example:
k ≈ 7 hash functions
Click to expand code...

Space Savings

HashSet vs Bloom Filter (1M usernames, 1% FPR):

HashSet:
- 1M × 100 bytes/username = 100 MB

Bloom Filter:
- 1.2 MB (83x smaller!)
Click to expand code...

Real-World Use Cases

1. Database Query Optimization

Cassandra & HBase:

python
# Before querying disk, check bloom filter
def get_user(user_id):
    sstable_bloom = load_bloom_filter("users_sstable_5")
    
    if user_id not in sstable_bloom:
        # Definitely not in this SSTable, skip disk read
        return None
    
    # Might be in SSTable, do expensive disk read
    return read_from_disk(user_id)

# Saves 99% of unnecessary disk reads!
Click to expand code...

2. Web Crawler (Avoid Duplicate URLs)

python
class WebCrawler:
    def __init__(self):
        self.visited = BloomFilter(100_000_000, 0.001)  # 100M URLs
    
    def crawl(self, url):
        if url in self.visited:
            return  # Skip (probably already crawled)
        
        self.visited.add(url)
        page = fetch(url)
        # Process page...
Click to expand code...

vs HashSet: 100M URLs × 100 bytes = 10GB vs 120MB (83x savings)

3. Malicious URL Detection (Chrome Safe Browsing)

javascript
// Client-side bloom filter (downloaded with browser)
const maliciousURLBloom = new BloomFilter(1000000, 0.001);

function checkURL(url) {
    if (!maliciousURLBloom.contains(url)) {
        return "SAFE";  // Definitely safe
    }
    
    // Might be malicious, check with server
    return await checkWithServer(url);
}

// 99.9% of URLs filtered locally (no network call)
// 0.1% false positives require server check
Click to expand code...

4. CDN Cache Decision

python
# Only cache items seen multiple times (avoid "one-hit wonders")
class SmartCDN:
    def __init__(self):
        self.seen_once = BloomFilter(10_000_000, 0.01)
        self.cache = {}
    
    def get(self, url):
        if url in self.cache:
            return self.cache[url]
        
        content = fetch_from_origin(url)
        
        if url in self.seen_once:
            # Seen twice, cache it
            self.cache[url] = content
        else:
            # First time seeing, mark as seen
            self.seen_once.add(url)
        
        return content
Click to expand code...

5. Bitcoin SPV Wallets

python
# Lightweight Bitcoin client
# Only download blocks containing my transactions
class SPVWallet:
    def __init__(self):
        self.my_addresses_bloom = BloomFilter(1000, 0.001)
        
        for addr in my_addresses:
            self.my_addresses_bloom.add(addr)
    
    def filter_block(self, block):
        for tx in block.transactions:
            if any(addr in self.my_addresses_bloom for addr in tx.outputs):
                return True  # Download this block
        return False  # Skip this block
Click to expand code...

Variants & Extensions

1. Counting Bloom Filter

Allows deletions by using counters instead of bits.

python
class CountingBloomFilter:
    def __init__(self, size, hash_count):
        self.counters = [0] * size
        self.hash_count = hash_count
    
    def add(self, item):
        for index in self._hashes(item):
            self.counters[index] += 1
    
    def remove(self, item):
        for index in self._hashes(item):
            if self.counters[index] > 0:
                self.counters[index] -= 1
    
    def contains(self, item):
        return all(self.counters[index] > 0 for index in self._hashes(item))
Click to expand code...

Trade-off: 4-8x more memory (counters vs bits)

2. Scalable Bloom Filter

Dynamically grows as more elements added.

python
class ScalableBloomFilter:
    def __init__(self, initial_capacity, fp_rate):
        self.filters = [BloomFilter(initial_capacity, fp_rate * 0.9)]
        self.capacity = initial_capacity
        self.fp_rate = fp_rate
    
    def add(self, item):
        if self.filters[-1].count >= self.capacity:
            # Create new filter with doubled capacity
            self.capacity *= 2
            self.filters.append(BloomFilter(self.capacity, self.fp_rate * 0.9))
        
        self.filters[-1].add(item)
    
    def contains(self, item):
        return any(f.contains(item) for f in self.filters)
Click to expand code...

Bloom Filter vs Alternatives

Data StructureSpaceFalse PositivesFalse NegativesDeletions
Bloom FilterVery LowYesNo❌ No
Counting BloomLowYesNo✅ Yes
HashSetHighNoNo✅ Yes
Cuckoo FilterLowYesNo✅ Yes
Quotient FilterLowYesNo✅ Yes

Choose Bloom Filter when:

  • Space is critical
  • No deletions needed
  • False positives acceptable (with low rate)
  • High throughput required

Interview Tips 💡

When discussing Bloom Filters in system design interviews:

  1. Explain the problem: "We have 1B users but can't fit a HashSet in memory..."
  2. Trade-off: "Bloom filter uses 1.2MB instead of 100MB, with 1% false positive rate..."
  3. Why false positives OK: "For cache check, false positive means extra DB query, not wrong result..."
  4. Real examples: "Cassandra uses bloom filters to avoid disk reads for non-existent keys..."
  5. Parameters: "With 1M items and 1% FPR, we need ~1.2MB and 7 hash functions..."
  6. Alternatives: "If we need deletions, we'd use Counting Bloom Filter or Cuckoo Filter..."

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles