Chapter 9: Design a Web Crawler
volume1 web-crawler distributed-systems search
Status: π© Interview ready
Difficulty: Medium-Hard
Time to complete: 50 min read + practice
Overview
A web crawler (also called a spider or bot) systematically browses the web, downloading web pages and following links to discover new pages. Large-scale crawlers are the backbone of search engines like Google and Bing.
Common use cases:
- Search engine indexing: Discover and index web pages (Google, Bing, DuckDuckGo)
- Web archiving: Store web pages over time (Wayback Machine / archive.org)
- Data mining: Collect research data (price comparison, market research)
- Link checking: Detect broken links on websites
- Copyright monitoring: Find unauthorized copies of content
- Web monitoring: Track changes to websites (news aggregators, change detection services)
Why this matters:
- Medium-Hard interview question β tests distributed systems depth
- Covers BFS/DFS trade-offs, queue design, deduplication, politeness
- Core to search engine infrastructure
Problem Statement
Design a web crawler that:
- Discovers and downloads billions of web pages
- Stores downloaded pages for indexing
- Avoids downloading the same page twice (deduplication)
- Is polite (respects robots.txt and crawl delays)
- Is scalable, robust, and extensible
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- What is the scale? β 1 billion web pages per month
- What content types? β HTML only (not images, video, PDFs)
- How long to store pages? β 5 years
- What about duplicate content? β Ignore duplicates (save storage)
- Do we respect robots.txt? β Yes (politeness)
- How to handle newly added/updated pages? β Freshness: re-crawl periodically
Scope:
- Crawl 1 billion HTML pages per month
- Store content for 5 years
- Detect and skip duplicate content
- Respect robots.txt and per-host crawl delays
- Scalable, robust, and extensible architecture
Non-Functional Requirements
- Scalability: Crawl 1 billion pages/month (can be scaled further)
- Robustness: Handle bad HTML, network failures, server crashes
- Politeness: One host should not be hammered with too many requests
- Extensibility: Easy to add support for new content types (PDF, images)
- Freshness: Periodically re-crawl pages to catch updates
Back-of-the-Envelope Estimation
Download rate:
1 billion pages / month
= 1,000,000,000 / (30 days Γ 24 hours Γ 3,600 sec)
= 1,000,000,000 / 2,592,000
β 400 pages/second
Storage:
Average HTML page size: ~500 KB (including assets, raw HTML)
1 billion pages/month Γ 500 KB = 500 TB/month
5 years storage: 500 TB Γ 12 months Γ 5 years = 30 PB total
Bandwidth:
400 pages/sec Γ 500 KB = 200 MB/sec inbound bandwidth
β 1.6 Gbps β requires multiple network connections and distributed crawlers
Step 2: High-Level Design (10 min)
Core Components
Seed URLs
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β URL Frontier β
β (Priority Queue + Politeness Queue β URLs to crawl) β
ββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββ
β
ββββββββΌβββββββ
β DNS Resolverβ (IP address cache)
ββββββββ¬βββββββ
β
ββββββββΌβββββββββββ
β HTML Downloader β (Fetches page)
ββββββββ¬βββββββββββ
β
ββββββββΌβββββββββββ
β Content Parser β (Extract text, validate HTML)
ββββββββ¬βββββββββββ
β
ββββββββΌβββββββββββββββ
β Content Seen? β (Dedup check on page content)
β (Hash comparison) β
ββββββββ¬βββββββββββββββ
β Not seen
ββββββββΌβββββββββββ
β Content Store β (S3, Bigtable β save raw HTML)
βββββββββββββββββββ
β
ββββββββΌβββββββββββ
β Link Extractor β (Find all <a href> links)
ββββββββ¬βββββββββββ
β
ββββββββΌβββββββββββ
β URL Filter β (Remove invalid, blacklisted URLs)
ββββββββ¬βββββββββββ
β
ββββββββΌβββββββββββββββ
β URL Seen? β (Dedup: Bloom filter)
β (Bloom Filter) β
ββββββββ¬βββββββββββββββ
β Not seen
βΌ
URL Frontier (loop β add new URLs to crawl)
Component Descriptions
Seed URLs: Initial set of URLs to start crawling from. Often selected to maximize coverage (e.g., top-ranked pages, DMOZ directory, sitemaps). Different strategies: locality-based (crawl by country), topic-based (crawl by category).
URL Frontier: A queue of URLs to be crawled. Must support politeness (per-host delays) and priority (crawl important pages first). Core of the crawlerβs scheduling logic.
DNS Resolver: Converts hostnames to IP addresses. Cached aggressively to avoid DNS bottleneck (DNS lookup: 10-200ms, can bottleneck throughput without caching).
HTML Downloader: Fetches the actual HTML content from the web server. Must respect robots.txt, support HTTP/HTTPS, handle redirects, timeouts, and retry logic.
Content Parser: Validates and parses downloaded HTML. Rejects malformed pages. Extracts links and metadata. Runs quickly (CPU-bound, scale with more workers).
Content Seen: Detects duplicate pages (same content, different URLs). Computes a hash of page content and checks if that hash exists in a hash set. Prevents storing the same page twice.
Content Store: Durable storage for raw HTML pages. Use object storage (S3) for raw content; structured storage (Bigtable/HBase/Cassandra) for metadata (URL, crawl time, content hash).
Link Extractor: Parses <a href> tags to extract all outgoing links from the page. Converts relative URLs to absolute. Passes links to URL Filter.
URL Filter: Removes unwanted URLs: invalid format, excluded file types (.jpg, .pdf if HTML-only), blacklisted domains, URLs matching exclusion rules.
URL Seen: Checks if a URL has already been crawled or is already in the frontier. Uses a Bloom filter (fast, memory-efficient) to avoid re-crawling known URLs.
Step 3: Deep Dive (20 min)
DFS vs BFS for Crawling
Depth-First Search (DFS):
Start: site.com/home
β site.com/page1
β site.com/page1/sub1
β site.com/page1/sub1/deep1
β site.com/page1/sub1/deep1/deeper1 (can go infinitely deep!)
Problems with DFS:
- β Can get stuck in infinitely deep paths (spider traps, infinite pagination)
- β May crawl one site completely before visiting others
- β Low coverage early on β misses important pages from other domains
Breadth-First Search (BFS):
Level 0: site.com/home
Level 1: site.com/page1, site.com/page2, site.com/page3
Level 2: site.com/page1/sub1, site.com/page2/sub1, other-site.com/page1
Level 3: ...
BFS is preferred because:
- β Better coverage across many hosts (more polite, spreads load)
- β Discovers important pages faster (important pages have more inbound links)
- β Bounded depth (avoids spider traps naturally with queue limit)
- β Standard approach in production crawlers (Google, Bing)
BFS with Priority:
Standard BFS treats all URLs equally. Production crawlers use prioritized BFS:
- High PageRank pages crawled first
- Fresh pages recrawled more often
- New pages given higher priority than stale pages
URL Frontier Architecture (Deep Dive)
The URL frontier is the most complex component. It must handle three concerns:
1. Politeness (Donβt Hammer One Host)
Problem: Naive BFS might send 1000 requests/second to the same host. This is impolite and may cause the host to block the crawler.
Solution: Per-host queues with configurable delay
URL Frontier Structure:
Politeness
Front Queues (Priority) Back Queues Queue Selector
βββββββββββββββββββ ββββββββββββββββ
β Queue F1 (High β β Queue B1 β β site-a.com (next at t+5s)
β Priority URLs) ββββ β (site-a.com) β
βββββββββββββββββββ β ββββββββββββββββ
βββββββββββββββββββ β ββββββββββββββββ
β Queue F2 (Med) ββββΌβββ β Queue B2 β β site-b.com (next at t+2s)
βββββββββββββββββββ β β (site-b.com) β
βββββββββββββββββββ β ββββββββββββββββ
β Queue F3 (Low) ββββ ββββββββββββββββ
βββββββββββββββββββ β Queue B3 β β site-c.com (next at t+30s)
β (site-c.com) β
ββββββββββββββββ
Politeness rules:
- One back queue per host (or per domain)
- Minimum delay between requests to same host (e.g., 5 seconds)
- Read
Crawl-delayfromrobots.txtand honor it - Queue selector picks the back queue whose host is ready (delay elapsed)
2. Priority (Crawl Important Pages First)
Priority based on:
- PageRank (more inbound links = higher priority)
- Web traffic (popular URLs crawled more frequently)
- Update frequency (news sites re-crawled hourly vs static pages weekly)
- Freshness (URLs not crawled recently get bumped up in priority)
Implementation:
Prioritizer β assigns priority score β routes to priority queue F1/F2/F3
Queue router β picks from front queues proportional to priority
(e.g., pick from F1 60%, F2 30%, F3 10%)
3. Freshness (Re-Crawl Updated Pages)
Challenge: Pages change over time. Need to re-crawl to stay current.
Strategies:
- Time-based: Recrawl every N days/hours based on page type (news = hourly, product pages = daily, static docs = monthly)
- Change detection: If page hash changed on last crawl, increase recrawl frequency
- Sitemap-driven: Use
sitemap.xmlwhich lists pages and their<lastmod>dates - Feed-based: Subscribe to RSS/Atom feeds for change notifications
Recrawl priority formula:
Priority = base_priority Γ recency_factor Γ importance_factor
Where:
recency_factor = time since last crawl / expected update interval
importance_factor = log(PageRank + 1) + traffic_score
DNS Resolver β Hidden Bottleneck
Why DNS is a bottleneck:
- Each new hostname requires a DNS lookup
- DNS lookup takes 10β200ms (varies by server, geography)
- At 400 pages/sec with many unique hosts β thousands of DNS lookups/sec
- DNS servers have rate limits and can return stale results
Optimization: DNS Cache
class CachedDNSResolver:
def __init__(self):
self.cache = {} # hostname β (ip, expiry_timestamp)
def resolve(self, hostname):
if hostname in self.cache:
ip, expiry = self.cache[hostname]
if time.now() < expiry:
return ip # Cache hit!
# Cache miss: query DNS
ip = dns.resolve(hostname)
self.cache[hostname] = (ip, time.now() + TTL)
return ipCache TTL: Respect DNS TTL from response (typically 300β3600 seconds). Short TTL = more accurate but more DNS queries. Long TTL = faster but may use stale IPs.
At scale: Use a distributed DNS cache (e.g., Redis) shared across all crawlers in the cluster.
HTML Downloader β Performance Optimization
Single-threaded is too slow:
400 pages/sec Γ 100ms average download time = 40,000 concurrent connections!
Need: distributed crawlers + async I/O
Performance strategies:
1. Distributed crawl (multiple crawler machines):
ββββββββββββββββββββββββββββββββββββββββββββββββ
β URL Frontier (Redis) β
ββββββββββ¬βββββββββββ¬βββββββββββββββ¬βββββββββββββ
β β β
ββββββββββββ ββββββββββββ ββββββββββββ
β Crawler 1β β Crawler 2β β Crawler Nβ
β(40 pps) β β(40 pps) β β(40 pps) β
ββββββββββββ ββββββββββββ ββββββββββββ
Total: 400 pages/sec with 10 crawlers at 40 pps each
2. Locality: Crawl pages from geographically close servers (US servers crawl US sites, EU servers crawl EU sites). Reduces round-trip latency.
3. Short timeout: Donβt wait forever. Set timeout (e.g., 5 seconds). If no response β mark as failed, retry later.
4. Keep-alive connections: Reuse HTTP connections to the same host (HTTP/1.1 keep-alive, HTTP/2 multiplexing) instead of opening new TCP connections per page.
5. HTTPS handling: Validate SSL certificates, handle certificate errors (configurable: strict vs lenient).
Robustness Techniques
1. Consistent Hashing for Crawler Assignment:
URL β hash β consistent hash ring β assign to Crawler N
Benefits:
- When adding/removing crawlers, only re-assign ~1/N of URLs
- Same URL always assigned to same crawler (avoids duplicate crawl)
- Handles crawler failures gracefully (URLs redistribute)
2. Checkpointing:
Periodically save crawler state:
- Current URL Frontier (serialized to disk/DB)
- Visited URL set (Bloom filter snapshot)
- In-progress downloads (for retry on restart)
On restart: load from checkpoint, not from scratch.
3. Exception Handling:
For each download, handle:
- HTTP errors (404 β mark URL as dead, 503 β retry with backoff)
- Network timeouts β retry up to N times, then skip
- Malformed HTML β log and skip, do not crash crawler
- Infinite redirect loops β detect and break after N redirects
- Oversized pages β skip pages > 10 MB (likely binary content)
4. Data validation:
Before storing a page:
- Verify content-type is text/html (not PDF, image, etc.)
- Verify response is not empty
- Verify page is not a soft 404 (redirect to error page)
- Verify content hash is not in "seen" set (dedup)
5. Monitoring and alerting:
Track metrics:
- Pages crawled per second (should be ~400)
- Error rate (4xx, 5xx responses)
- DNS resolution failures
- Queue size (should not grow unbounded)
- Deduplication rate (how many pages skipped as duplicates)
Content Deduplication
Problem: Many pages have identical or near-identical content at different URLs:
http://site.com/page1
https://site.com/page1 (same, different scheme)
http://www.site.com/page1 (same, www prefix)
http://site.com/page1?utm=abc (same, tracking parameter)
Approach 1: Exact Deduplication with Hash
def is_duplicate(page_content):
content_hash = sha256(page_content)
if bloom_filter.contains(content_hash):
return True # Likely duplicate (may be false positive)
bloom_filter.add(content_hash)
return False- Hash the page content (SHA-256 or MD5)
- Check hash in a hash set or Bloom filter
- β Fast, exact
- β Cannot detect near-duplicates (same content, minor differences)
Approach 2: Near-Duplicate Detection with SimHash
SimHash algorithm:
1. Extract features from page content (words, n-grams)
2. Hash each feature β 64-bit integer
3. For each bit position: count +1 if bit is 1, -1 if bit is 0
4. Final SimHash: bit i = 1 if count[i] > 0, else 0
Similar pages β SimHashes with small Hamming distance (few bits different)
Threshold: pages with β€ 3 bits different are near-duplicates
Page A SimHash: 10110101... (64 bits)
Page B SimHash: 10110111... (64 bits)
Hamming distance: 1 bit β Near-duplicate!
Page A SimHash: 10110101...
Page C SimHash: 01001010...
Hamming distance: 32 bits β Not a duplicate
When to use SimHash: When exact hash misses near-duplicates (e.g., same news article with slightly different ads or navigation). Used by Google to detect duplicate content.
URL Deduplication with Bloom Filter
Why Bloom filter for URL deduplication?
At 1 billion URLs/month over 5 years = 60 billion URLs.
Storing 60 billion URLs in a hash set:
Average URL length: ~100 bytes
60 billion Γ 100 bytes = 6 TB just for URL storage!
Bloom filter for 60 billion URLs:
Optimal bits per element for 1% false positive rate: ~10 bits
60 billion Γ 10 bits = 600 billion bits = 75 GB
75 GB vs 6 TB β 80x more memory efficient!
Bloom filter properties:
- Space-efficient: ~10 bits per element for 1% false positive rate
- No false negatives: If URL is in the set, Bloom filter always returns true
- False positives: ~1% chance a new URL is incorrectly flagged as seen (acceptable β just skip it)
- Cannot delete elements (use Counting Bloom Filter if deletes needed)
URL β hash1, hash2, hash3 β set bits in bit array
β
Lookup: Check if all 3 bits are set
β All set: probably seen (might be false positive)
β Not all set: definitely NOT seen
Why false positives are acceptable here: The worst case is that a crawler skips a URL it hasnβt actually visited. A small fraction of pages being skipped is acceptable in a 1-billion-page crawl. The benefit (80x memory savings) far outweighs the cost.
Robots.txt Compliance
What is robots.txt?:
# File at https://www.example.com/robots.txt
User-agent: * # Applies to all bots
Disallow: /private/ # Don't crawl /private/ and subdirectories
Disallow: /admin/ # Don't crawl /admin/
Allow: /public/ # Explicitly allow (overrides parent Disallow)
Crawl-delay: 10 # Wait 10 seconds between requests
User-agent: Googlebot # Google-specific rules
Disallow: /tmp/
Compliance rules:
- Fetch
robots.txtbefore crawling any page from a new host - Cache robots.txt per host (re-fetch periodically, e.g., every 24 hours)
- Respect
Disallowrules (never crawl disallowed paths) - Respect
Crawl-delay(add delay to per-host politeness queue) - Honor
Sitemapdirective (use sitemap for efficient URL discovery)
Implementation:
class RobotsCache:
def __init__(self):
self.cache = {} # host β (rules, expiry)
def can_crawl(self, url):
host = extract_host(url)
if host not in self.cache or expired(self.cache[host]):
robots_url = f"https://{host}/robots.txt"
rules = fetch_and_parse_robots(robots_url)
self.cache[host] = (rules, time.now() + 86400) # 24h TTL
rules, _ = self.cache[host]
return not rules.is_disallowed(url, user_agent="MyCrawler")Spider Trap Detection
What is a spider trap?: A URL pattern that generates an infinite number of unique URLs:
/calendar/2024/01/01/
/calendar/2024/01/02/
/calendar/2024/01/03/
... (generates dates infinitely)
/search?q=a
/search?q=aa
/search?q=aaa
... (parameter permutations)
/page/1/page/1/page/1/
/page/1/page/1/page/1/page/1/
... (cyclically generated paths)
Detection strategies:
- URL length limit: Skip URLs longer than a threshold (e.g., 2048 chars)
if len(url) > 2048:
skip(url) # Likely a trap- Max path depth: Skip URLs with too many path segments
if url.count('/') > 10:
skip(url) # Too deep, likely a trap- Max URLs per domain: Cap the number of URLs crawled per domain
if urls_crawled[domain] > 10000:
skip(url) # Domain cap reached- Parameter deduplication: Normalize URLs (remove tracking params, sort query params)
# Treat these as the same URL:
# /page?utm_source=twitter&id=123
# /page?id=123&utm_source=google
normalized = normalize_url(url) # Remove utm_*, sort params- Cycle detection: If URL Seen check catches it (Bloom filter), itβs already in the queue
Extensibility
Plugin architecture for new content types:
HTML Downloader β Content Type Detector
β
ββββββββββββββΌβββββββββββββ
β β β
HTML Parser PDF Parser Image Parser
(default) (plugin) (plugin)
β β β
ββββββββββββββΌβββββββββββββ
β
Content Store
To add PDF crawling:
- Register PDF MIME type with Content Type Detector
- Implement PDFParser plugin (extract text, links from PDF)
- Deploy without changing core crawler logic
Design Summary
Final Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Seed URLs β
ββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
ββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββ
β URL Frontier β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Front Queues (Priority) β Back Queues (Politeness)β β
β β F1: High priority βββββββ€ B1: site-a.com β β
β β F2: Med priority βββββββ€ B2: site-b.com β β
β β F3: Low priority βββββββ€ B3: site-c.com β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββΌββββββββββββββββββ
β β β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β Crawler 1 β β Crawler 2 β β Crawler N β
β DNS Cache β β DNS Cache β β DNS Cache β
β HTML Fetcherβ β HTML Fetcherβ β HTML Fetcherβ
ββββββββ¬ββββββββ ββββββββ¬ββββββββ ββββββββ¬ββββββββ
β β β
βββββββββββββββββββΌβββββββββββββββββββ
β
βββββββββββββββββββΌβββββββββββββββββββ
β Content Parser β
β (HTML validation, link extract) β
βββββββββββββββββββ¬βββββββββββββββββββ
β
βββββββββββββββββββΌβββββββββββββββββββ
β Content Dedup Check β
β (SHA-256 hash / SimHash) β
ββββββββββββ¬βββββββββββββββββββββββββββ
Duplicate β New
β Discard β
βββββββββββββββββββββββββββββββββββββββ
β Content Store β
β (S3 for raw HTML + Bigtable/HBase β
β for metadata β 30 PB over 5 years)β
βββββββββββββββββββ¬ββββββββββββββββββββ
β
βββββββββββββββββββΌβββββββββββββββββββ
β Link Extractor β
βββββββββββββββββββ¬βββββββββββββββββββ
β
βββββββββββββββββββΌβββββββββββββββββββ
β URL Filter + Normalize β
β (blacklist, file type, length) β
βββββββββββββββββββ¬βββββββββββββββββββ
β
βββββββββββββββββββΌβββββββββββββββββββ
β URL Seen? (Bloom Filter) β
β (75 GB Bloom filter for 60B URLs) β
ββββββββββββ¬βββββββββββββββββββββββββββ
Seen β New
β Discard β
URL Frontier (loop)
Key Decisions Summary
| Decision | Choice | Reasoning |
|---|---|---|
| Traversal | BFS with priority | Better coverage, avoids deep traps, polite |
| URL Frontier | Priority + Politeness queues | Balance importance vs crawl delay |
| DNS | Distributed DNS cache (Redis) | DNS is bottleneck without caching |
| Content dedup | SHA-256 hash + SimHash | Exact dedup + near-duplicate detection |
| URL dedup | Bloom filter (75 GB) | 80x more memory efficient than hash set |
| Content store | S3 + Bigtable/HBase | Object store for blobs, structured for metadata |
| Robustness | Consistent hashing + checkpointing | Handle crawler failures gracefully |
| Politeness | Per-host queues + robots.txt | Donβt overload any single host |
| Spider traps | URL length + depth + domain cap | Detect and escape infinite loops |
| Extensibility | Plugin architecture | Add PDF/image parsers without changing core |
Interview Questions & Answers
Q: Why BFS over DFS for web crawling?
A: DFS can descend infinitely deep into a site (spider traps, infinite pagination), starves other domains of crawl time, and provides poor coverage early on. BFS spreads crawl across many hosts, discovers important pages sooner (important pages have more inbound links at shallow depth), and is naturally polite (many hosts get requests, not one host getting slammed). Production crawlers (Google, Bing) use prioritized BFS.
Q: How do you ensure the crawler is polite and doesnβt overload a single host?
A: (1) Per-host back queues in the URL Frontier β each host gets its own queue. (2) Queue selector enforces minimum delay between requests to the same host (e.g., 5 seconds). (3) Respect Crawl-delay directive in robots.txt. (4) Respect Disallow rules in robots.txt. (5) Use exponential backoff on 5xx errors. (6) Honor HTTP Retry-After response headers.
Q: How do you handle duplicate URLs and duplicate content separately?
A: URL deduplication: Bloom filter on URL strings before adding to the frontier. Prevents re-crawling the same URL. Near-zero memory (75 GB for 60B URLs). Content deduplication: SHA-256 hash of page content after downloading. Catches pages with different URLs but same content (mirrors, CDN copies, UTM parameters). Near-duplicate detection: SimHash with Hamming distance threshold for near-identical pages (same article, different ads).
Q: What happens if a crawler machine fails mid-crawl?
A: (1) Consistent hashing: URLs assigned to crawlers by hash. If Crawler 3 fails, its URLs are redistributed to remaining crawlers. (2) Checkpointing: Crawler state (frontier, visited set, in-progress downloads) is periodically saved to durable storage (HDFS, S3). On restart, load from checkpoint rather than starting over. (3) Idempotent: Downloads are idempotent β if a page is re-downloaded, the content dedup check catches the duplicate before storing.
Q: How do you detect and escape spider traps?
A: Multiple defenses: (1) URL length limit: skip URLs longer than 2048 characters. (2) Max path depth: skip URLs with more than 10 path segments. (3) Per-domain cap: stop crawling a domain after 10,000 URLs. (4) URL normalization: remove tracking parameters, sort query params β reduces unique URL explosion. (5) Bloom filter: detects if URL already visited. (6) Manual blacklist: known spider trap domains added to exclusion list.
Q: How would you extend the crawler to support PDFs?
A: Use a plugin/content-type architecture. The HTML Downloader already downloads any HTTP response. After download, a Content Type Detector reads the Content-Type header. Register a PDF parser plugin for application/pdf. The PDF parser extracts text and links (if any). The rest of the pipeline (content store, link extractor) remains unchanged. New content types can be added without modifying the core crawler code.
Key Takeaways
- BFS over DFS β BFS provides better coverage, politeness, and avoids infinite loops; use priority BFS in production
- URL Frontier is the core β must handle both politeness (per-host queues) and priority (PageRank, freshness) simultaneously
- DNS is a hidden bottleneck β without caching, DNS lookups (10β200ms each) become the throughput ceiling; cache aggressively
- Bloom filter for URL dedup β 80x more memory-efficient than a hash set; false positives are acceptable (a few pages skipped is fine)
- SimHash for near-duplicate content β exact hash misses near-duplicates; SimHash with Hamming distance catches them
- Robots.txt compliance is non-negotiable β cache per-host, respect Disallow and Crawl-delay, honor Sitemap directives
- Consistent hashing + checkpointing for robustness β crawler failures are inevitable at scale; design for recovery
- Spider trap defenses are layered β no single defense is sufficient; use URL length, depth limit, domain cap, and normalization together
Related Resources
- distributed-system-components > Message Queues - URL Frontier can be backed by Kafka/Redis
- key-patterns > Bloom Filter - Space-efficient probabilistic set membership
- key-patterns > Consistent Hashing - URL-to-crawler assignment
- ch03-framework-for-system-design - Apply framework to this problem
Practice this design! Medium-Hard β be ready to go deep on:
- URL Frontier politeness vs priority design
- Bloom filter math (why 75 GB is acceptable)
- BFS vs DFS argument (common follow-up)
- Content dedup (hash) vs near-dup detection (SimHash)
- Robustness: what happens when a crawler crashes
Last Updated: 2026-04-13
Status: Solid distributed systems question β depth matters here!