Designing a Distributed Web Crawler
Designing a web crawler that can index the entire internet (billions of pages) is a classic system design interview question. It tests your ability to handle massive scale, concurrency, and politeness.
1. Requirements
Functional Requirements
- Scalability: Must handle billions of pages.
- Robustness: Must handle malformed HTML, unresponsive servers, and crashes.
- Politeness: Must not overload any single server (DDoS).
- Extensibility: Easy to add new content types (Images, PDFs).
Non-Functional Requirements
- Performance: Crawl 1 billion pages per month (approx 400 pages/sec).
- Storage: Store metadata and content (Petabytes).
2. High-Level Architecture
The system is composed of several independent workers coordinated by message queues.
Components
- URL Frontier: The brain of the crawler. It prioritizes which URLs to crawl next.
- HTML Fetcher: The workers that physically download the web pages.
- DNS Resolver: A dedicated service (or cache) to resolve hostnames to IPs to reduce latency.
- Content Parser: Validates HTML, extracts links, and parses content.
- Dedup Service: Checks if the content or URL has already been seen to avoid infinite loops and redundancy.
- Storage: BigTable (metadata) + GFS/S3 (content blobs).
3. Deep Dive: The URL Frontier
The URL Frontier is not just a simple FIFO queue. It must handle Priority and Politeness.
Architecture: The Split-Queue Approach
We split the logic into two layers of queues:
A. Front queues (Prioritization)
- The input flows into F1 to Fn queues based on priority.
- Priority can be determined by PageRank, update frequency, or payment.
- A Prioritizer component selects which queue to pull from.
B. Back queues (Politeness)
- The goal: One queue per domain.
- Queue B1 only contains URLs for wikipedia.org. Queue B2 only for cnn.com.
- A Mapping Table maps a host to a specific Back Queue.
- A heap manages when a queue is ready to run.
This ensures we never fire 100 concurrent requests to a single small website, which would be an attack.
4. HTML Fetcher & DNS Resolution
The Fetcher is a distributed worker module.
- Multithreading: Each fetcher runs thousands of threads. Blocking I/O is the bottleneck.
- DNS Caching: DNS lookups are slow (10ms - 500ms). The crawler maintains a custom DNS cache with a large TTL.
- Robots.txt: Before crawling a domain, the fetcher must download and cache robots.txt to respect User-Agent rules.
5. Deduplication (Dedup)
The web is full of duplicates.
URL Deduplication
- Canonicalization: Convert www.google.com and google.com/ to the same string.
- Bloom Filter: A probabilistic data structure. Essential for checking billions of URLs in milliseconds.
Content Deduplication
- Two pages can have different URLs but identical content.
- SimHash (Locality Sensitive Hashing): Generates a fingerprint where similar documents have similar hashes. Allows detection of "near-duplicates".
6. Fault Tolerance & Spider Traps
Spider Traps
A spider trap is a web page that causes a crawler to cycle infinitely.
- Infinite URL generation: example.com/calendar.php?year=2024, .../year=2025
- Solution: Max Depth Limit. Only crawl up to depth 10 from the seed.
- Solution: URL Length Limit.
Data Safety
- Consistent Hashing: Used to distribute the load of the URL Frontier and Fetchers.
- Checkpoints: Store the state of the crawl queue on disk periodically.
7. Storage Design
We deal with two types of data:
- Metadata (URL, last crawled, frequency, priority): BigTable or Cassandra.
- Blob Content (HTML, Images, Video): GFS or S3.
Summary
- Frontier: Kafka + Redis
- Politeness: Leaky Bucket / Delay Queues
- Dedup: Bloom Filter + SimHash
- DNS: Custom LRU Cache
- Storage: NoSQL (Cassandra/HBase)
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
System Design: Dropbox (Google Drive)
Designing a file synchronization service like Dropbox or Google Drive. Key concepts: Block-level Deduplication, Delta Sync, and Strong Consistency.
System Design: Instagram News Feed
Designing a scalable social feed. Fan-out on Write vs Fan-out on Read, and solving the Justin Bieber problem.
Raft Consensus Algorithm
A comprehensive guide to Raft, the consensus algorithm powering Etcd, Consul, and Kubernetes. Leader election, log replication, safety guarantees, and production deployment patterns.