Chapter 9: Design a Web Crawler

volume1 web-crawler distributed-systems search

Status: 🟩 Interview ready
Difficulty: Medium-Hard
Time to complete: 50 min read + practice

Overview

A web crawler (also called a spider or bot) systematically browses the web, downloading web pages and following links to discover new pages. Large-scale crawlers are the backbone of search engines like Google and Bing.

Common use cases:

Search engine indexing: Discover and index web pages (Google, Bing, DuckDuckGo)
Web archiving: Store web pages over time (Wayback Machine / archive.org)
Data mining: Collect research data (price comparison, market research)
Link checking: Detect broken links on websites
Copyright monitoring: Find unauthorized copies of content
Web monitoring: Track changes to websites (news aggregators, change detection services)

Why this matters:

Medium-Hard interview question — tests distributed systems depth
Covers BFS/DFS trade-offs, queue design, deduplication, politeness
Core to search engine infrastructure

Problem Statement

Design a web crawler that:

Discovers and downloads billions of web pages
Stores downloaded pages for indexing
Avoids downloading the same page twice (deduplication)
Is polite (respects robots.txt and crawl delays)
Is scalable, robust, and extensible

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

What is the scale? → 1 billion web pages per month
What content types? → HTML only (not images, video, PDFs)
How long to store pages? → 5 years
What about duplicate content? → Ignore duplicates (save storage)
Do we respect robots.txt? → Yes (politeness)
How to handle newly added/updated pages? → Freshness: re-crawl periodically

Scope:

Crawl 1 billion HTML pages per month
Store content for 5 years
Detect and skip duplicate content
Respect robots.txt and per-host crawl delays
Scalable, robust, and extensible architecture

Non-Functional Requirements

Scalability: Crawl 1 billion pages/month (can be scaled further)
Robustness: Handle bad HTML, network failures, server crashes
Politeness: One host should not be hammered with too many requests
Extensibility: Easy to add support for new content types (PDF, images)
Freshness: Periodically re-crawl pages to catch updates

Back-of-the-Envelope Estimation

Download rate:

1 billion pages / month
= 1,000,000,000 / (30 days × 24 hours × 3,600 sec)
= 1,000,000,000 / 2,592,000
≈ 400 pages/second

Storage:

Average HTML page size: ~500 KB (including assets, raw HTML)
1 billion pages/month × 500 KB = 500 TB/month
5 years storage: 500 TB × 12 months × 5 years = 30 PB total

Bandwidth:

400 pages/sec × 500 KB = 200 MB/sec inbound bandwidth
≈ 1.6 Gbps — requires multiple network connections and distributed crawlers

Step 2: High-Level Design (10 min)

Core Components

Seed URLs
    │
    ▼
┌──────────────────────────────────────────────────────────┐
│                     URL Frontier                          │
│   (Priority Queue + Politeness Queue — URLs to crawl)    │
└──────────────────────────┬───────────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │ DNS Resolver│  (IP address cache)
                    └──────┬──────┘
                           │
                    ┌──────▼──────────┐
                    │ HTML Downloader │  (Fetches page)
                    └──────┬──────────┘
                           │
                    ┌──────▼──────────┐
                    │ Content Parser  │  (Extract text, validate HTML)
                    └──────┬──────────┘
                           │
                    ┌──────▼──────────────┐
                    │ Content Seen?       │  (Dedup check on page content)
                    │ (Hash comparison)   │
                    └──────┬──────────────┘
                           │ Not seen
                    ┌──────▼──────────┐
                    │  Content Store  │  (S3, Bigtable — save raw HTML)
                    └─────────────────┘
                           │
                    ┌──────▼──────────┐
                    │ Link Extractor  │  (Find all <a href> links)
                    └──────┬──────────┘
                           │
                    ┌──────▼──────────┐
                    │   URL Filter    │  (Remove invalid, blacklisted URLs)
                    └──────┬──────────┘
                           │
                    ┌──────▼──────────────┐
                    │    URL Seen?        │  (Dedup: Bloom filter)
                    │  (Bloom Filter)     │
                    └──────┬──────────────┘
                           │ Not seen
                           ▼
                     URL Frontier  (loop — add new URLs to crawl)

Component Descriptions

Seed URLs: Initial set of URLs to start crawling from. Often selected to maximize coverage (e.g., top-ranked pages, DMOZ directory, sitemaps). Different strategies: locality-based (crawl by country), topic-based (crawl by category).

URL Frontier: A queue of URLs to be crawled. Must support politeness (per-host delays) and priority (crawl important pages first). Core of the crawler’s scheduling logic.

DNS Resolver: Converts hostnames to IP addresses. Cached aggressively to avoid DNS bottleneck (DNS lookup: 10-200ms, can bottleneck throughput without caching).

HTML Downloader: Fetches the actual HTML content from the web server. Must respect robots.txt, support HTTP/HTTPS, handle redirects, timeouts, and retry logic.

Content Parser: Validates and parses downloaded HTML. Rejects malformed pages. Extracts links and metadata. Runs quickly (CPU-bound, scale with more workers).

Content Seen: Detects duplicate pages (same content, different URLs). Computes a hash of page content and checks if that hash exists in a hash set. Prevents storing the same page twice.

Content Store: Durable storage for raw HTML pages. Use object storage (S3) for raw content; structured storage (Bigtable/HBase/Cassandra) for metadata (URL, crawl time, content hash).

Link Extractor: Parses <a href> tags to extract all outgoing links from the page. Converts relative URLs to absolute. Passes links to URL Filter.

URL Filter: Removes unwanted URLs: invalid format, excluded file types (.jpg, .pdf if HTML-only), blacklisted domains, URLs matching exclusion rules.

URL Seen: Checks if a URL has already been crawled or is already in the frontier. Uses a Bloom filter (fast, memory-efficient) to avoid re-crawling known URLs.

Step 3: Deep Dive (20 min)

DFS vs BFS for Crawling

Depth-First Search (DFS):

Start: site.com/home
→ site.com/page1
  → site.com/page1/sub1
    → site.com/page1/sub1/deep1
      → site.com/page1/sub1/deep1/deeper1  (can go infinitely deep!)

Problems with DFS:

❌ Can get stuck in infinitely deep paths (spider traps, infinite pagination)
❌ May crawl one site completely before visiting others
❌ Low coverage early on — misses important pages from other domains

Breadth-First Search (BFS):

Level 0: site.com/home
Level 1: site.com/page1, site.com/page2, site.com/page3
Level 2: site.com/page1/sub1, site.com/page2/sub1, other-site.com/page1
Level 3: ...

BFS is preferred because:

✅ Better coverage across many hosts (more polite, spreads load)
✅ Discovers important pages faster (important pages have more inbound links)
✅ Bounded depth (avoids spider traps naturally with queue limit)
✅ Standard approach in production crawlers (Google, Bing)

BFS with Priority:
Standard BFS treats all URLs equally. Production crawlers use prioritized BFS:

High PageRank pages crawled first
Fresh pages recrawled more often
New pages given higher priority than stale pages

URL Frontier Architecture (Deep Dive)

The URL frontier is the most complex component. It must handle three concerns:

1. Politeness (Don’t Hammer One Host)

Problem: Naive BFS might send 1000 requests/second to the same host. This is impolite and may cause the host to block the crawler.

Solution: Per-host queues with configurable delay

URL Frontier Structure:
                                            Politeness
Front Queues (Priority)    Back Queues     Queue Selector
┌─────────────────┐       ┌──────────────┐
│  Queue F1 (High │       │ Queue B1     │  → site-a.com (next at t+5s)
│  Priority URLs) │──┐    │ (site-a.com) │
└─────────────────┘  │    └──────────────┘
┌─────────────────┐  │    ┌──────────────┐
│  Queue F2 (Med) │──┼──→ │ Queue B2     │  → site-b.com (next at t+2s)
└─────────────────┘  │    │ (site-b.com) │
┌─────────────────┐  │    └──────────────┘
│  Queue F3 (Low) │──┘    ┌──────────────┐
└─────────────────┘       │ Queue B3     │  → site-c.com (next at t+30s)
                          │ (site-c.com) │
                          └──────────────┘

Politeness rules:

One back queue per host (or per domain)
Minimum delay between requests to same host (e.g., 5 seconds)
Read Crawl-delay from robots.txt and honor it
Queue selector picks the back queue whose host is ready (delay elapsed)

2. Priority (Crawl Important Pages First)

Priority based on:

PageRank (more inbound links = higher priority)
Web traffic (popular URLs crawled more frequently)
Update frequency (news sites re-crawled hourly vs static pages weekly)
Freshness (URLs not crawled recently get bumped up in priority)

Implementation:

Prioritizer → assigns priority score → routes to priority queue F1/F2/F3
Queue router → picks from front queues proportional to priority
              (e.g., pick from F1 60%, F2 30%, F3 10%)

3. Freshness (Re-Crawl Updated Pages)

Challenge: Pages change over time. Need to re-crawl to stay current.

Strategies:

Time-based: Recrawl every N days/hours based on page type (news = hourly, product pages = daily, static docs = monthly)
Change detection: If page hash changed on last crawl, increase recrawl frequency
Sitemap-driven: Use sitemap.xml which lists pages and their <lastmod> dates
Feed-based: Subscribe to RSS/Atom feeds for change notifications

Recrawl priority formula:
Priority = base_priority × recency_factor × importance_factor
Where:
  recency_factor  = time since last crawl / expected update interval
  importance_factor = log(PageRank + 1) + traffic_score

DNS Resolver — Hidden Bottleneck

Why DNS is a bottleneck:

Each new hostname requires a DNS lookup
DNS lookup takes 10–200ms (varies by server, geography)
At 400 pages/sec with many unique hosts → thousands of DNS lookups/sec
DNS servers have rate limits and can return stale results

Optimization: DNS Cache

class CachedDNSResolver:
    def __init__(self):
        self.cache = {}  # hostname → (ip, expiry_timestamp)
 
    def resolve(self, hostname):
        if hostname in self.cache:
            ip, expiry = self.cache[hostname]
            if time.now() < expiry:
                return ip  # Cache hit!
 
        # Cache miss: query DNS
        ip = dns.resolve(hostname)
        self.cache[hostname] = (ip, time.now() + TTL)
        return ip

Cache TTL: Respect DNS TTL from response (typically 300–3600 seconds). Short TTL = more accurate but more DNS queries. Long TTL = faster but may use stale IPs.

At scale: Use a distributed DNS cache (e.g., Redis) shared across all crawlers in the cluster.

HTML Downloader — Performance Optimization

Single-threaded is too slow:

400 pages/sec × 100ms average download time = 40,000 concurrent connections!
Need: distributed crawlers + async I/O

Performance strategies:

1. Distributed crawl (multiple crawler machines):

┌──────────────────────────────────────────────┐
│              URL Frontier (Redis)             │
└────────┬──────────┬──────────────┬────────────┘
         ↓          ↓              ↓
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Crawler 1│ │ Crawler 2│ │ Crawler N│
  │(40 pps)  │ │(40 pps)  │ │(40 pps)  │
  └──────────┘ └──────────┘ └──────────┘
  Total: 400 pages/sec with 10 crawlers at 40 pps each

2. Locality: Crawl pages from geographically close servers (US servers crawl US sites, EU servers crawl EU sites). Reduces round-trip latency.

3. Short timeout: Don’t wait forever. Set timeout (e.g., 5 seconds). If no response → mark as failed, retry later.

4. Keep-alive connections: Reuse HTTP connections to the same host (HTTP/1.1 keep-alive, HTTP/2 multiplexing) instead of opening new TCP connections per page.

5. HTTPS handling: Validate SSL certificates, handle certificate errors (configurable: strict vs lenient).

Robustness Techniques

1. Consistent Hashing for Crawler Assignment:

URL → hash → consistent hash ring → assign to Crawler N

Benefits:
- When adding/removing crawlers, only re-assign ~1/N of URLs
- Same URL always assigned to same crawler (avoids duplicate crawl)
- Handles crawler failures gracefully (URLs redistribute)

2. Checkpointing:

Periodically save crawler state:
- Current URL Frontier (serialized to disk/DB)
- Visited URL set (Bloom filter snapshot)
- In-progress downloads (for retry on restart)

On restart: load from checkpoint, not from scratch.

3. Exception Handling:

For each download, handle:
- HTTP errors (404 → mark URL as dead, 503 → retry with backoff)
- Network timeouts → retry up to N times, then skip
- Malformed HTML → log and skip, do not crash crawler
- Infinite redirect loops → detect and break after N redirects
- Oversized pages → skip pages > 10 MB (likely binary content)

4. Data validation:

Before storing a page:
- Verify content-type is text/html (not PDF, image, etc.)
- Verify response is not empty
- Verify page is not a soft 404 (redirect to error page)
- Verify content hash is not in "seen" set (dedup)

5. Monitoring and alerting:

Track metrics:
- Pages crawled per second (should be ~400)
- Error rate (4xx, 5xx responses)
- DNS resolution failures
- Queue size (should not grow unbounded)
- Deduplication rate (how many pages skipped as duplicates)

Content Deduplication

Problem: Many pages have identical or near-identical content at different URLs:

http://site.com/page1
https://site.com/page1        (same, different scheme)
http://www.site.com/page1     (same, www prefix)
http://site.com/page1?utm=abc (same, tracking parameter)

Approach 1: Exact Deduplication with Hash

def is_duplicate(page_content):
    content_hash = sha256(page_content)
    if bloom_filter.contains(content_hash):
        return True  # Likely duplicate (may be false positive)
    bloom_filter.add(content_hash)
    return False

Hash the page content (SHA-256 or MD5)
Check hash in a hash set or Bloom filter
✅ Fast, exact
❌ Cannot detect near-duplicates (same content, minor differences)

Approach 2: Near-Duplicate Detection with SimHash

SimHash algorithm:

1. Extract features from page content (words, n-grams)
2. Hash each feature → 64-bit integer
3. For each bit position: count +1 if bit is 1, -1 if bit is 0
4. Final SimHash: bit i = 1 if count[i] > 0, else 0

Similar pages → SimHashes with small Hamming distance (few bits different)
Threshold: pages with ≤ 3 bits different are near-duplicates

Page A SimHash: 10110101...  (64 bits)
Page B SimHash: 10110111...  (64 bits)
Hamming distance: 1 bit → Near-duplicate!

Page A SimHash: 10110101...
Page C SimHash: 01001010...
Hamming distance: 32 bits → Not a duplicate

When to use SimHash: When exact hash misses near-duplicates (e.g., same news article with slightly different ads or navigation). Used by Google to detect duplicate content.

URL Deduplication with Bloom Filter

Why Bloom filter for URL deduplication?

At 1 billion URLs/month over 5 years = 60 billion URLs.

Storing 60 billion URLs in a hash set:

Average URL length: ~100 bytes
60 billion × 100 bytes = 6 TB just for URL storage!

Bloom filter for 60 billion URLs:

Optimal bits per element for 1% false positive rate: ~10 bits
60 billion × 10 bits = 600 billion bits = 75 GB

75 GB vs 6 TB → 80x more memory efficient!

Bloom filter properties:

Space-efficient: ~10 bits per element for 1% false positive rate
No false negatives: If URL is in the set, Bloom filter always returns true
False positives: ~1% chance a new URL is incorrectly flagged as seen (acceptable — just skip it)
Cannot delete elements (use Counting Bloom Filter if deletes needed)

URL → hash1, hash2, hash3 → set bits in bit array
                ↓
Lookup: Check if all 3 bits are set
  → All set: probably seen (might be false positive)
  → Not all set: definitely NOT seen

Why false positives are acceptable here: The worst case is that a crawler skips a URL it hasn’t actually visited. A small fraction of pages being skipped is acceptable in a 1-billion-page crawl. The benefit (80x memory savings) far outweighs the cost.

Robots.txt Compliance

What is robots.txt?:

# File at https://www.example.com/robots.txt
User-agent: *              # Applies to all bots
Disallow: /private/        # Don't crawl /private/ and subdirectories
Disallow: /admin/          # Don't crawl /admin/
Allow: /public/            # Explicitly allow (overrides parent Disallow)
Crawl-delay: 10            # Wait 10 seconds between requests

User-agent: Googlebot      # Google-specific rules
Disallow: /tmp/

Compliance rules:

Fetch robots.txt before crawling any page from a new host
Cache robots.txt per host (re-fetch periodically, e.g., every 24 hours)
Respect Disallow rules (never crawl disallowed paths)
Respect Crawl-delay (add delay to per-host politeness queue)
Honor Sitemap directive (use sitemap for efficient URL discovery)

Implementation:

class RobotsCache:
    def __init__(self):
        self.cache = {}  # host → (rules, expiry)
 
    def can_crawl(self, url):
        host = extract_host(url)
        if host not in self.cache or expired(self.cache[host]):
            robots_url = f"https://{host}/robots.txt"
            rules = fetch_and_parse_robots(robots_url)
            self.cache[host] = (rules, time.now() + 86400)  # 24h TTL
 
        rules, _ = self.cache[host]
        return not rules.is_disallowed(url, user_agent="MyCrawler")

Spider Trap Detection

What is a spider trap?: A URL pattern that generates an infinite number of unique URLs:

/calendar/2024/01/01/
/calendar/2024/01/02/
/calendar/2024/01/03/
... (generates dates infinitely)

/search?q=a
/search?q=aa
/search?q=aaa
... (parameter permutations)

/page/1/page/1/page/1/
/page/1/page/1/page/1/page/1/
... (cyclically generated paths)

Detection strategies:

URL length limit: Skip URLs longer than a threshold (e.g., 2048 chars)

if len(url) > 2048:
    skip(url)  # Likely a trap

Max path depth: Skip URLs with too many path segments

if url.count('/') > 10:
    skip(url)  # Too deep, likely a trap

Max URLs per domain: Cap the number of URLs crawled per domain

if urls_crawled[domain] > 10000:
    skip(url)  # Domain cap reached

Parameter deduplication: Normalize URLs (remove tracking params, sort query params)

# Treat these as the same URL:
# /page?utm_source=twitter&id=123
# /page?id=123&utm_source=google
normalized = normalize_url(url)  # Remove utm_*, sort params

Cycle detection: If URL Seen check catches it (Bloom filter), it’s already in the queue

Extensibility

Plugin architecture for new content types:

HTML Downloader → Content Type Detector
                        │
           ┌────────────┼────────────┐
           ↓            ↓            ↓
    HTML Parser    PDF Parser   Image Parser
    (default)      (plugin)     (plugin)
           │            │            │
           └────────────┼────────────┘
                        ↓
                  Content Store

To add PDF crawling:

Register PDF MIME type with Content Type Detector
Implement PDFParser plugin (extract text, links from PDF)
Deploy without changing core crawler logic

Design Summary

Final Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Seed URLs                              │
└────────────────────────────┬────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────┐
│                     URL Frontier                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Front Queues (Priority)  │  Back Queues (Politeness)│   │
│  │  F1: High priority  ──────┤  B1: site-a.com          │   │
│  │  F2: Med priority   ──────┤  B2: site-b.com          │   │
│  │  F3: Low priority   ──────┤  B3: site-c.com          │   │
│  └──────────────────────────────────────────────────────┘   │
└────────────────────────────┬────────────────────────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ↓                 ↓                  ↓
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │  Crawler 1   │  │  Crawler 2   │  │  Crawler N   │
   │  DNS Cache   │  │  DNS Cache   │  │  DNS Cache   │
   │  HTML Fetcher│  │  HTML Fetcher│  │  HTML Fetcher│
   └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
          │                 │                  │
          └─────────────────┼──────────────────┘
                            │
          ┌─────────────────▼──────────────────┐
          │          Content Parser             │
          │   (HTML validation, link extract)   │
          └─────────────────┬──────────────────┘
                            │
          ┌─────────────────▼──────────────────┐
          │        Content Dedup Check          │
          │   (SHA-256 hash / SimHash)           │
          └──────────┬──────────────────────────┘
         Duplicate   │  New
         → Discard   ↓
          ┌─────────────────────────────────────┐
          │          Content Store              │
          │  (S3 for raw HTML + Bigtable/HBase  │
          │   for metadata — 30 PB over 5 years)│
          └─────────────────┬───────────────────┘
                            │
          ┌─────────────────▼──────────────────┐
          │          Link Extractor             │
          └─────────────────┬──────────────────┘
                            │
          ┌─────────────────▼──────────────────┐
          │       URL Filter + Normalize        │
          │   (blacklist, file type, length)    │
          └─────────────────┬──────────────────┘
                            │
          ┌─────────────────▼──────────────────┐
          │         URL Seen? (Bloom Filter)    │
          │   (75 GB Bloom filter for 60B URLs) │
          └──────────┬──────────────────────────┘
         Seen        │  New
         → Discard   ↓
                URL Frontier (loop)

Key Decisions Summary

Decision	Choice	Reasoning
Traversal	BFS with priority	Better coverage, avoids deep traps, polite
URL Frontier	Priority + Politeness queues	Balance importance vs crawl delay
DNS	Distributed DNS cache (Redis)	DNS is bottleneck without caching
Content dedup	SHA-256 hash + SimHash	Exact dedup + near-duplicate detection
URL dedup	Bloom filter (75 GB)	80x more memory efficient than hash set
Content store	S3 + Bigtable/HBase	Object store for blobs, structured for metadata
Robustness	Consistent hashing + checkpointing	Handle crawler failures gracefully
Politeness	Per-host queues + robots.txt	Don’t overload any single host
Spider traps	URL length + depth + domain cap	Detect and escape infinite loops
Extensibility	Plugin architecture	Add PDF/image parsers without changing core

Interview Questions & Answers

Q: Why BFS over DFS for web crawling?
A: DFS can descend infinitely deep into a site (spider traps, infinite pagination), starves other domains of crawl time, and provides poor coverage early on. BFS spreads crawl across many hosts, discovers important pages sooner (important pages have more inbound links at shallow depth), and is naturally polite (many hosts get requests, not one host getting slammed). Production crawlers (Google, Bing) use prioritized BFS.

Q: How do you ensure the crawler is polite and doesn’t overload a single host?
A: (1) Per-host back queues in the URL Frontier — each host gets its own queue. (2) Queue selector enforces minimum delay between requests to the same host (e.g., 5 seconds). (3) Respect Crawl-delay directive in robots.txt. (4) Respect Disallow rules in robots.txt. (5) Use exponential backoff on 5xx errors. (6) Honor HTTP Retry-After response headers.

Q: How do you handle duplicate URLs and duplicate content separately?
A: URL deduplication: Bloom filter on URL strings before adding to the frontier. Prevents re-crawling the same URL. Near-zero memory (75 GB for 60B URLs). Content deduplication: SHA-256 hash of page content after downloading. Catches pages with different URLs but same content (mirrors, CDN copies, UTM parameters). Near-duplicate detection: SimHash with Hamming distance threshold for near-identical pages (same article, different ads).

Q: What happens if a crawler machine fails mid-crawl?
A: (1) Consistent hashing: URLs assigned to crawlers by hash. If Crawler 3 fails, its URLs are redistributed to remaining crawlers. (2) Checkpointing: Crawler state (frontier, visited set, in-progress downloads) is periodically saved to durable storage (HDFS, S3). On restart, load from checkpoint rather than starting over. (3) Idempotent: Downloads are idempotent — if a page is re-downloaded, the content dedup check catches the duplicate before storing.

Q: How do you detect and escape spider traps?
A: Multiple defenses: (1) URL length limit: skip URLs longer than 2048 characters. (2) Max path depth: skip URLs with more than 10 path segments. (3) Per-domain cap: stop crawling a domain after 10,000 URLs. (4) URL normalization: remove tracking parameters, sort query params — reduces unique URL explosion. (5) Bloom filter: detects if URL already visited. (6) Manual blacklist: known spider trap domains added to exclusion list.

Q: How would you extend the crawler to support PDFs?
A: Use a plugin/content-type architecture. The HTML Downloader already downloads any HTTP response. After download, a Content Type Detector reads the Content-Type header. Register a PDF parser plugin for application/pdf. The PDF parser extracts text and links (if any). The rest of the pipeline (content store, link extractor) remains unchanged. New content types can be added without modifying the core crawler code.

Key Takeaways

BFS over DFS — BFS provides better coverage, politeness, and avoids infinite loops; use priority BFS in production
URL Frontier is the core — must handle both politeness (per-host queues) and priority (PageRank, freshness) simultaneously
DNS is a hidden bottleneck — without caching, DNS lookups (10–200ms each) become the throughput ceiling; cache aggressively
Bloom filter for URL dedup — 80x more memory-efficient than a hash set; false positives are acceptable (a few pages skipped is fine)
SimHash for near-duplicate content — exact hash misses near-duplicates; SimHash with Hamming distance catches them
Robots.txt compliance is non-negotiable — cache per-host, respect Disallow and Crawl-delay, honor Sitemap directives
Consistent hashing + checkpointing for robustness — crawler failures are inevitable at scale; design for recovery
Spider trap defenses are layered — no single defense is sufficient; use URL length, depth limit, domain cap, and normalization together

distributed-system-components > Message Queues - URL Frontier can be backed by Kafka/Redis
key-patterns > Bloom Filter - Space-efficient probabilistic set membership
key-patterns > Consistent Hashing - URL-to-crawler assignment
ch03-framework-for-system-design - Apply framework to this problem

Practice this design! Medium-Hard — be ready to go deep on:

URL Frontier politeness vs priority design
Bloom filter math (why 75 GB is acceptable)
BFS vs DFS argument (common follow-up)
Content dedup (hash) vs near-dup detection (SimHash)
Robustness: what happens when a crawler crashes

Last Updated: 2026-04-13
Status: Solid distributed systems question — depth matters here!

Study Notes by Niladri & AI

Explorer

ch09-web-crawler

Chapter 9: Design a Web Crawler

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Step 2: High-Level Design (10 min)

Core Components

Component Descriptions

Step 3: Deep Dive (20 min)

DFS vs BFS for Crawling

URL Frontier Architecture (Deep Dive)

1. Politeness (Don’t Hammer One Host)

2. Priority (Crawl Important Pages First)

3. Freshness (Re-Crawl Updated Pages)

DNS Resolver — Hidden Bottleneck

HTML Downloader — Performance Optimization

Robustness Techniques

Content Deduplication

URL Deduplication with Bloom Filter

Robots.txt Compliance

Spider Trap Detection

Extensibility

Design Summary

Final Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch09-web-crawler

Chapter 9: Design a Web Crawler

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Step 2: High-Level Design (10 min)

Core Components

Component Descriptions

Step 3: Deep Dive (20 min)

DFS vs BFS for Crawling

URL Frontier Architecture (Deep Dive)

1. Politeness (Don’t Hammer One Host)

2. Priority (Crawl Important Pages First)

3. Freshness (Re-Crawl Updated Pages)

DNS Resolver — Hidden Bottleneck

HTML Downloader — Performance Optimization

Robustness Techniques

Content Deduplication

URL Deduplication with Bloom Filter

Robots.txt Compliance

Spider Trap Detection

Extensibility

Design Summary

Final Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Related Resources

Graph View

Table of Contents

Backlinks