Back to All Concepts
System DesignSearchDistributed SystemsAdvanced

System Design: Web Crawler (Google)

A deep dive into designing a scalable, distributed web crawler capable of indexing the entire internet. Covers URL Frontiers, Politeness policies, and robust deduplication.

Designing a Distributed Web Crawler

Designing a web crawler that can index the entire internet (billions of pages) is a classic system design interview question. It tests your ability to handle massive scale, concurrency, and politeness.

1. Requirements

Functional Requirements

  • Scalability: Must handle billions of pages.
  • Robustness: Must handle malformed HTML, unresponsive servers, and crashes.
  • Politeness: Must not overload any single server (DDoS).
  • Extensibility: Easy to add new content types (Images, PDFs).

Non-Functional Requirements

  • Performance: Crawl 1 billion pages per month (approx 400 pages/sec).
  • Storage: Store metadata and content (Petabytes).

2. High-Level Architecture

The system is composed of several independent workers coordinated by message queues.

Components

  1. URL Frontier: The brain of the crawler. It prioritizes which URLs to crawl next.
  2. HTML Fetcher: The workers that physically download the web pages.
  3. DNS Resolver: A dedicated service (or cache) to resolve hostnames to IPs to reduce latency.
  4. Content Parser: Validates HTML, extracts links, and parses content.
  5. Dedup Service: Checks if the content or URL has already been seen to avoid infinite loops and redundancy.
  6. Storage: BigTable (metadata) + GFS/S3 (content blobs).

3. Deep Dive: The URL Frontier

The URL Frontier is not just a simple FIFO queue. It must handle Priority and Politeness.

Architecture: The Split-Queue Approach

We split the logic into two layers of queues:

A. Front queues (Prioritization)

  • The input flows into F1 to Fn queues based on priority.
  • Priority can be determined by PageRank, update frequency, or payment.
  • A Prioritizer component selects which queue to pull from.

B. Back queues (Politeness)

  • The goal: One queue per domain.
  • Queue B1 only contains URLs for wikipedia.org. Queue B2 only for cnn.com.
  • A Mapping Table maps a host to a specific Back Queue.
  • A heap manages when a queue is ready to run.

This ensures we never fire 100 concurrent requests to a single small website, which would be an attack.

4. HTML Fetcher & DNS Resolution

The Fetcher is a distributed worker module.

  • Multithreading: Each fetcher runs thousands of threads. Blocking I/O is the bottleneck.
  • DNS Caching: DNS lookups are slow (10ms - 500ms). The crawler maintains a custom DNS cache with a large TTL.
  • Robots.txt: Before crawling a domain, the fetcher must download and cache robots.txt to respect User-Agent rules.

5. Deduplication (Dedup)

The web is full of duplicates.

URL Deduplication

  • Canonicalization: Convert www.google.com and google.com/ to the same string.
  • Bloom Filter: A probabilistic data structure. Essential for checking billions of URLs in milliseconds.

Content Deduplication

  • Two pages can have different URLs but identical content.
  • SimHash (Locality Sensitive Hashing): Generates a fingerprint where similar documents have similar hashes. Allows detection of "near-duplicates".

6. Fault Tolerance & Spider Traps

Spider Traps

A spider trap is a web page that causes a crawler to cycle infinitely.

  • Infinite URL generation: example.com/calendar.php?year=2024, .../year=2025
  • Solution: Max Depth Limit. Only crawl up to depth 10 from the seed.
  • Solution: URL Length Limit.

Data Safety

  • Consistent Hashing: Used to distribute the load of the URL Frontier and Fetchers.
  • Checkpoints: Store the state of the crawl queue on disk periodically.

7. Storage Design

We deal with two types of data:

  1. Metadata (URL, last crawled, frequency, priority): BigTable or Cassandra.
  2. Blob Content (HTML, Images, Video): GFS or S3.

Summary

  • Frontier: Kafka + Redis
  • Politeness: Leaky Bucket / Delay Queues
  • Dedup: Bloom Filter + SimHash
  • DNS: Custom LRU Cache
  • Storage: NoSQL (Cassandra/HBase)

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles