What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can give false positives but never false negatives.

Trade-off: 100MB memory instead of 100GB, in exchange for occasional "maybe" answers.

The Problem: Expensive Membership Tests

Task: Check if username "alice123" exists in 1 billion user database

Naive approaches:
1. Database query: SELECT * FROM users WHERE username = 'alice123'
   → 100ms latency, expensive

2. HashSet in memory: Set<String> usernames
   → 1B usernames × 100 bytes = 100GB RAM

Bloom Filter approach:
→ 1.2GB RAM, O(1) lookups, <1μs latency
→ Trade-off: 1% false positive rate

Click to expand code...

How It Works

A Bloom Filter is a bit array of size m with k hash functions.

mermaid

graph TB
    Item[Item: alice123] -->|hash1| H1[Hash Function 1]
    Item -->|hash2| H2[Hash Function 2]
    Item -->|hash3| H3[Hash Function 3]
    
    H1 -->|index 5| BitArray[Bit Array: 0010010100...]
    H2 -->|index 12| BitArray
    H3 -->|index 23| BitArray
    
    BitArray -->|set bits to 1| Result[Bits 5, 12, 23 = 1]

Click to expand code...

Adding an Element

Pass element through k hash functions
Get k array indices
Set bits at those indices to 1

Checking an Element

Pass element through same k hash functions
Check if all bits at those indices are 1
Result:
- All bits = 1: Element probably in set (might be false positive)
- Any bit = 0: Element definitely not in set (guaranteed accurate)

Example

Bit array size: 10
Hash functions: 2

Add "alice":
  hash1("alice") = 2 → set bit 2
  hash2("alice") = 7 → set bit 7
  Bits: 0010000100

Add "bob":
  hash1("bob") = 3 → set bit 3
  hash2("bob") = 7 → set bit 7 (already set)
  Bits: 0011000100

Check "alice":
  hash1("alice") = 2 → bit 2 is 1 ✓
  hash2("alice") = 7 → bit 7 is 1 ✓
  Result: PROBABLY in set ✅

Check "charlie":
  hash1("charlie") = 2 → bit 2 is 1 ✓
  hash2("charlie") = 5 → bit 5 is 0 ✗
  Result: DEFINITELY NOT in set ❌

Check "eve":
  hash1("eve") = 2 → bit 2 is 1 ✓
  hash2("eve") = 7 → bit 7 is 1 ✓
  Result: PROBABLY in set ✅ (FALSE POSITIVE!)

Click to expand code...

Implementation

Python Implementation

python

import hashlib
import math

class BloomFilter:
    def __init__(self, expected_elements, false_positive_rate):
        # Calculate optimal size and hash count
        self.size = self._optimal_size(expected_elements, false_positive_rate)
        self.hash_count = self._optimal_hash_count(self.size, expected_elements)
        self.bit_array = [0] * self.size
        self.count = 0
    
    def _optimal_size(self, n, p):
        """
        m = -(n * ln(p)) / (ln(2)^2)
        n = expected elements
        p = false positive rate
        """
        return int(-(n * math.log(p)) / (math.log(2) ** 2))
    
    def _optimal_hash_count(self, m, n):
        """
        k = (m / n) * ln(2)
        """
        return int((m / n) * math.log(2))
    
    def _hashes(self, item):
        """Generate k hash values"""
        # Use two hash functions to simulate k hashes (double hashing)
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size
    
    def add(self, item):
        """Add item to bloom filter"""
        for index in self._hashes(item):
            self.bit_array[index] = 1
        self.count += 1
    
    def contains(self, item):
        """Check if item might be in set"""
        return all(self.bit_array[index] == 1 for index in self._hashes(item))
    
    def __contains__(self, item):
        """Support 'in' operator"""
        return self.contains(item)

# Usage
bloom = BloomFilter(expected_elements=1000000, false_positive_rate=0.01)

# Add elements
bloom.add("alice123")
bloom.add("bob456")
bloom.add("charlie789")

# Check membership
print("alice123" in bloom)    # True (correct)
print("bob456" in bloom)      # True (correct)
print("unknown" in bloom)     # False (correct)
print("xyz" in bloom)         # Maybe True (false positive)

Click to expand code...

JavaScript Implementation

javascript

class BloomFilter {
    constructor(expectedElements, falsePositiveRate) {
        this.size = this._optimalSize(expectedElements, falsePositiveRate);
        this.hashCount = this._optimalHashCount(this.size, expectedElements);
        this.bitArray = new Array(this.size).fill(0);
    }
    
    _optimalSize(n, p) {
        return Math.ceil(-(n * Math.log(p)) / (Math.log(2) ** 2));
    }
    
    _optimalHashCount(m, n) {
        return Math.ceil((m / n) * Math.log(2));
    }
    
    _hash(str, seed = 0) {
        let h = seed;
        for (let i = 0; i < str.length; i++) {
            h = Math.imul(h ^ str.charCodeAt(i), 2654435761);
        }
        return (h ^ (h >>> 16)) >>> 0;
    }
    
    *_hashes(item) {
        const h1 = this._hash(item, 0);
        const h2 = this._hash(item, 1);
        
        for (let i = 0; i < this.hashCount; i++) {
            yield (h1 + i * h2) % this.size;
        }
    }
    
    add(item) {
        for (const index of this._hashes(item)) {
            this.bitArray[index] = 1;
        }
    }
    
    contains(item) {
        for (const index of this._hashes(item)) {
            if (this.bitArray[index] === 0) return false;
        }
        return true;
    }
}

// Usage
const bloom = new BloomFilter(1000000, 0.01);
bloom.add("user123");
console.log(bloom.contains("user123"));  // true
console.log(bloom.contains("unknown"));  // false

Click to expand code...

Mathematical Foundation

False Positive Probability

After inserting n elements:

P(false positive) ≈ (1 - e^(-kn/m))^k

Where:
- k = number of hash functions
- n = number of inserted elements
- m = bit array size

Click to expand code...

Optimal Parameters

Optimal bit array size:

m = -(n * ln(p)) / (ln(2)^2)
m ≈ 1.44 * n * log₂(1/p)

Example:
n = 1,000,000
p = 0.01 (1% false positive)
m ≈ 9,585,059 bits ≈ 1.2 MB

Click to expand code...

Optimal hash function count:

k = (m/n) * ln(2)
k ≈ 0.693 * (m/n)

For above example:
k ≈ 7 hash functions

Click to expand code...

Space Savings

HashSet vs Bloom Filter (1M usernames, 1% FPR):

HashSet:
- 1M × 100 bytes/username = 100 MB

Bloom Filter:
- 1.2 MB (83x smaller!)

Click to expand code...

Real-World Use Cases

1. Database Query Optimization

Cassandra & HBase:

python

# Before querying disk, check bloom filter
def get_user(user_id):
    sstable_bloom = load_bloom_filter("users_sstable_5")
    
    if user_id not in sstable_bloom:
        # Definitely not in this SSTable, skip disk read
        return None
    
    # Might be in SSTable, do expensive disk read
    return read_from_disk(user_id)

# Saves 99% of unnecessary disk reads!

Click to expand code...

2. Web Crawler (Avoid Duplicate URLs)

python

class WebCrawler:
    def __init__(self):
        self.visited = BloomFilter(100_000_000, 0.001)  # 100M URLs
    
    def crawl(self, url):
        if url in self.visited:
            return  # Skip (probably already crawled)
        
        self.visited.add(url)
        page = fetch(url)
        # Process page...

Click to expand code...

vs HashSet: 100M URLs × 100 bytes = 10GB vs 120MB (83x savings)

3. Malicious URL Detection (Chrome Safe Browsing)

javascript

// Client-side bloom filter (downloaded with browser)
const maliciousURLBloom = new BloomFilter(1000000, 0.001);

function checkURL(url) {
    if (!maliciousURLBloom.contains(url)) {
        return "SAFE";  // Definitely safe
    }
    
    // Might be malicious, check with server
    return await checkWithServer(url);
}

// 99.9% of URLs filtered locally (no network call)
// 0.1% false positives require server check

Click to expand code...

4. CDN Cache Decision

python

# Only cache items seen multiple times (avoid "one-hit wonders")
class SmartCDN:
    def __init__(self):
        self.seen_once = BloomFilter(10_000_000, 0.01)
        self.cache = {}
    
    def get(self, url):
        if url in self.cache:
            return self.cache[url]
        
        content = fetch_from_origin(url)
        
        if url in self.seen_once:
            # Seen twice, cache it
            self.cache[url] = content
        else:
            # First time seeing, mark as seen
            self.seen_once.add(url)
        
        return content

Click to expand code...

5. Bitcoin SPV Wallets

python

# Lightweight Bitcoin client
# Only download blocks containing my transactions
class SPVWallet:
    def __init__(self):
        self.my_addresses_bloom = BloomFilter(1000, 0.001)
        
        for addr in my_addresses:
            self.my_addresses_bloom.add(addr)
    
    def filter_block(self, block):
        for tx in block.transactions:
            if any(addr in self.my_addresses_bloom for addr in tx.outputs):
                return True  # Download this block
        return False  # Skip this block

Click to expand code...

Variants & Extensions

1. Counting Bloom Filter

Allows deletions by using counters instead of bits.

python

class CountingBloomFilter:
    def __init__(self, size, hash_count):
        self.counters = [0] * size
        self.hash_count = hash_count
    
    def add(self, item):
        for index in self._hashes(item):
            self.counters[index] += 1
    
    def remove(self, item):
        for index in self._hashes(item):
            if self.counters[index] > 0:
                self.counters[index] -= 1
    
    def contains(self, item):
        return all(self.counters[index] > 0 for index in self._hashes(item))

Click to expand code...

Trade-off: 4-8x more memory (counters vs bits)

2. Scalable Bloom Filter

Dynamically grows as more elements added.

python

class ScalableBloomFilter:
    def __init__(self, initial_capacity, fp_rate):
        self.filters = [BloomFilter(initial_capacity, fp_rate * 0.9)]
        self.capacity = initial_capacity
        self.fp_rate = fp_rate
    
    def add(self, item):
        if self.filters[-1].count >= self.capacity:
            # Create new filter with doubled capacity
            self.capacity *= 2
            self.filters.append(BloomFilter(self.capacity, self.fp_rate * 0.9))
        
        self.filters[-1].add(item)
    
    def contains(self, item):
        return any(f.contains(item) for f in self.filters)

Click to expand code...

Bloom Filter vs Alternatives

Data Structure	Space	False Positives	False Negatives	Deletions
Bloom Filter	Very Low	Yes	No	❌ No
Counting Bloom	Low	Yes	No	✅ Yes
HashSet	High	No	No	✅ Yes
Cuckoo Filter	Low	Yes	No	✅ Yes
Quotient Filter	Low	Yes	No	✅ Yes

Choose Bloom Filter when:

Space is critical
No deletions needed
False positives acceptable (with low rate)
High throughput required

Interview Tips 💡

When discussing Bloom Filters in system design interviews:

Explain the problem: "We have 1B users but can't fit a HashSet in memory..."
Trade-off: "Bloom filter uses 1.2MB instead of 100MB, with 1% false positive rate..."
Why false positives OK: "For cache check, false positive means extra DB query, not wrong result..."
Real examples: "Cassandra uses bloom filters to avoid disk reads for non-existent keys..."
Parameters: "With 1M items and 1% FPR, we need ~1.2MB and 7 hash functions..."
Alternatives: "If we need deletions, we'd use Counting Bloom Filter or Cuckoo Filter..."

Related Concepts

HyperLogLog — Probabilistic cardinality estimation
Consistent Hashing — Another hash-based optimization
Caching Strategies — Bloom filters optimize cache decisions
Database Indexing — SSTable optimization with bloom filters
Count-Min Sketch — Probabilistic frequency counting

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Beginner

Caching Overview

High-speed data storage to reduce latency. The single most effective way to scale read-heavy systems.

PerformanceOptimizationDatabase

Expert

HyperLogLog (Cardinality Estimation)

A probabilistic algorithm for counting unique items in massive datasets using minimal memory, with less than 1% error using just kilobytes of space.

AlgorithmsBig DataProbabilistic

Intermediate

Trie (Prefix Tree & Autocomplete)

A tree data structure optimized for storing and searching strings by prefix, enabling efficient autocomplete, spell checking, and dictionary operations in O(L) time.

AlgorithmsData StructuresSearch

What is a Bloom Filter?

The Problem: Expensive Membership Tests

How It Works

Adding an Element

Checking an Element

Example

Implementation

Python Implementation

JavaScript Implementation

Mathematical Foundation

False Positive Probability

Optimal Parameters

Space Savings

Real-World Use Cases

1. Database Query Optimization

2. Web Crawler (Avoid Duplicate URLs)

3. Malicious URL Detection (Chrome Safe Browsing)

4. CDN Cache Decision

5. Bitcoin SPV Wallets

Variants & Extensions

1. Counting Bloom Filter

2. Scalable Bloom Filter

Bloom Filter vs Alternatives

Interview Tips 💡

Related Concepts

About ScaleWiki

Related Articles

Caching Overview

HyperLogLog (Cardinality Estimation)

Trie (Prefix Tree & Autocomplete)