Designing a Distributed Web Crawler

Designing a web crawler that can index the entire internet (billions of pages) is a classic system design interview question. It tests your ability to handle massive scale, concurrency, and politeness.

1. Requirements

Functional Requirements

Scalability: Must handle billions of pages.
Robustness: Must handle malformed HTML, unresponsive servers, and crashes.
Politeness: Must not overload any single server (DDoS).
Extensibility: Easy to add new content types (Images, PDFs).

Non-Functional Requirements

Performance: Crawl 1 billion pages per month (approx 400 pages/sec).
Storage: Store metadata and content (Petabytes).

2. High-Level Architecture

The system is composed of several independent workers coordinated by message queues.

Components

URL Frontier: The brain of the crawler. It prioritizes which URLs to crawl next.
HTML Fetcher: The workers that physically download the web pages.
DNS Resolver: A dedicated service (or cache) to resolve hostnames to IPs to reduce latency.
Content Parser: Validates HTML, extracts links, and parses content.
Dedup Service: Checks if the content or URL has already been seen to avoid infinite loops and redundancy.
Storage: BigTable (metadata) + GFS/S3 (content blobs).

3. Deep Dive: The URL Frontier

The URL Frontier is not just a simple FIFO queue. It must handle Priority and Politeness.

Architecture: The Split-Queue Approach

We split the logic into two layers of queues:

A. Front queues (Prioritization)

The input flows into F1 to Fn queues based on priority.
Priority can be determined by PageRank, update frequency, or payment.
A Prioritizer component selects which queue to pull from.

B. Back queues (Politeness)

The goal: One queue per domain.
Queue B1 only contains URLs for wikipedia.org. Queue B2 only for cnn.com.
A Mapping Table maps a host to a specific Back Queue.
A heap manages when a queue is ready to run.

This ensures we never fire 100 concurrent requests to a single small website, which would be an attack.

4. HTML Fetcher & DNS Resolution

The Fetcher is a distributed worker module.

Multithreading: Each fetcher runs thousands of threads. Blocking I/O is the bottleneck.
DNS Caching: DNS lookups are slow (10ms - 500ms). The crawler maintains a custom DNS cache with a large TTL.
Robots.txt: Before crawling a domain, the fetcher must download and cache robots.txt to respect User-Agent rules.

5. Deduplication (Dedup)

The web is full of duplicates.

URL Deduplication

Canonicalization: Convert www.google.com and google.com/ to the same string.
Bloom Filter: A probabilistic data structure. Essential for checking billions of URLs in milliseconds.

Content Deduplication

Two pages can have different URLs but identical content.
SimHash (Locality Sensitive Hashing): Generates a fingerprint where similar documents have similar hashes. Allows detection of "near-duplicates".

6. Fault Tolerance & Spider Traps

Spider Traps

A spider trap is a web page that causes a crawler to cycle infinitely.

Infinite URL generation: example.com/calendar.php?year=2024, .../year=2025
Solution: Max Depth Limit. Only crawl up to depth 10 from the seed.
Solution: URL Length Limit.

Data Safety

Consistent Hashing: Used to distribute the load of the URL Frontier and Fetchers.
Checkpoints: Store the state of the crawl queue on disk periodically.

7. Storage Design

We deal with two types of data:

Metadata (URL, last crawled, frequency, priority): BigTable or Cassandra.
Blob Content (HTML, Images, Video): GFS or S3.

Summary

Frontier: Kafka + Redis
Politeness: Leaky Bucket / Delay Queues
Dedup: Bloom Filter + SimHash
DNS: Custom LRU Cache
Storage: NoSQL (Cassandra/HBase)

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Advanced

System Design: Dropbox (Google Drive)

Designing a file synchronization service like Dropbox or Google Drive. Key concepts: Block-level Deduplication, Delta Sync, and Strong Consistency.

System DesignStorageDistributed Systems

Expert

System Design: Instagram News Feed

Designing a scalable social feed. Fan-out on Write vs Fan-out on Read, and solving the Justin Bieber problem.

System DesignSocialDatabases

Pro

Raft Consensus Algorithm

A comprehensive guide to Raft, the consensus algorithm powering Etcd, Consul, and Kubernetes. Leader election, log replication, safety guarantees, and production deployment patterns.

Distributed SystemsAlgorithmsConsensus

System Design: Web Crawler (Google)