Chapter 9 Flashcards - Web Crawler

flashcards volume1 web-crawler distributed-systems

What is a web crawler and what are its main use cases?
?
A web crawler (spider/bot) systematically browses the web, downloading pages and following links. Use cases: (1) Search engine indexing (Google, Bing — most important), (2) Web archiving (Wayback Machine), (3) Data mining (price comparison, market research), (4) Link checking (detect broken links), (5) Copyright monitoring (find unauthorized copies), (6) Web monitoring (news aggregators, change detection). Core feature: it FOLLOWS links to discover new pages automatically.

What are the key scale estimates for a web crawler handling 1 billion pages/month?
?
Download rate: 1B pages / (30 days × 86,400 sec) ≈ 400 pages/second. Average page size: ~500 KB → bandwidth: 400 × 500 KB = 200 MB/sec ≈ 1.6 Gbps. Storage: 1B pages × 500 KB = 500 TB/month. Over 5 years: 500 TB × 60 months = 30 PB total storage. DNS: 400 pages/sec × many unique hosts → thousands of DNS lookups/sec without caching (DNS bottleneck!). Requires distributed crawlers (10 machines × 40 pps each).

What are the main components of a web crawler (the pipeline)?
?
Seed URLs → URL Frontier (priority + politeness queue) → DNS Resolver (cached) → HTML Downloader (fetches page) → Content Parser (validates HTML) → Content Seen? (hash dedup) → Content Store (S3 + Bigtable) → Link Extractor (finds blacklisted) → URL Seen? (Bloom filter dedup) → URL Frontier (loop). Two dedup steps: content dedup (same page body) and URL dedup (same URL already crawled).

Why is BFS preferred over DFS for web crawling?
?
DFS problems: can descend infinitely deep (spider traps, infinite pagination), starves other domains (one site crawled completely before others), poor coverage early on. BFS advantages: (1) Spreads crawl across many hosts (more polite), (2) Discovers important pages sooner (important pages have more inbound links at shallow depth), (3) Naturally bounded depth, (4) Better coverage overall. Production crawlers (Google, Bing) use prioritized BFS. DFS is only used for very specific, bounded crawling tasks.

What is the URL Frontier and what three concerns must it balance?
?
The URL Frontier is a queue of URLs to be crawled — the core scheduling component of a crawler. It must balance three concerns: (1) Politeness: don’t hammer any single host; use per-host queues with enforced delays (respect robots.txt Crawl-delay). (2) Priority: crawl important pages first (PageRank, traffic, freshness score). (3) Freshness: re-crawl updated pages; news sites every hour, static docs monthly. Implementation: Front queues for priority (F1/F2/F3), Back queues for politeness (one per host).

How does the URL Frontier handle politeness (per-host rate limiting)?
?
Use per-host back queues: one queue per domain (e.g., B1 = site-a.com, B2 = site-b.com). A queue selector only picks from a back queue if the host’s minimum delay has elapsed (e.g., 5 seconds since last request). Read Crawl-delay from robots.txt and honor it as the minimum delay. Also: fetch and cache robots.txt per host, respect Disallow rules. This ensures no single host receives more than 1 request per Crawl-delay seconds.

How does the URL Frontier handle URL priority?
?
A Prioritizer component assigns a priority score to each URL. Score factors: PageRank (more inbound links = higher priority), web traffic, update frequency, freshness (time since last crawl vs expected update interval). URLs are routed to front queues (F1 = high, F2 = medium, F3 = low priority). Queue router picks from front queues with weighted probability: e.g., 60% from F1, 30% from F2, 10% from F3. Then routes to the appropriate back queue for politeness.

Why is DNS a hidden bottleneck in web crawling and how do you fix it?
?
Each new hostname requires a DNS lookup taking 10–200ms. At 400 pages/sec with many unique domains → thousands of DNS lookups/sec. Without caching, DNS becomes the throughput ceiling. Fix: maintain a DNS cache (hostname → IP + expiry). Check cache first; only call DNS if cache miss or expired. At scale: use a distributed DNS cache (Redis) shared across all crawler machines. Respect DNS TTL from responses (300–3600 seconds typically). This can reduce effective DNS call frequency by 90%+.

What are the two types of content deduplication in a web crawler?
?
(1) Exact deduplication: Compute SHA-256 hash of page content. Check if hash exists in a hash set. If yes, it’s a duplicate — discard. Catches identical pages at different URLs (mirrors, CDN copies, UTM parameter variants). (2) Near-duplicate detection with SimHash: Compute a 64-bit fingerprint of the page. Compare Hamming distance to other fingerprints. Pages with ≤ 3 bits different are near-duplicates (same article, slightly different ads/navigation). SimHash is used by Google to detect duplicate content across the web.

How does SimHash detect near-duplicate pages?
?
SimHash algorithm: (1) Extract features from page content (words, n-grams). (2) Hash each feature to a 64-bit integer. (3) For each bit position, count +1 if bit is 1 across all feature hashes, -1 if bit is 0. (4) Final SimHash bit i = 1 if count[i] > 0, else 0. Similar pages produce SimHashes with small Hamming distance (few bits differ). Threshold: pages with Hamming distance ≤ 3 are treated as near-duplicates. Example: “10110101…” vs “10110111…” → distance 1 → near-duplicate!

Why is Bloom filter ideal for URL deduplication in a web crawler?
?
At 1B pages/month × 5 years = 60 billion URLs. Hash set storage: 60B × 100 bytes/URL = 6 TB. Too expensive! Bloom filter: ~10 bits per element at 1% false positive rate → 60B × 10 bits = 600 billion bits = 75 GB. That’s 80x more memory-efficient. Bloom filter properties: no false negatives (if URL is seen, always returns true), small false positive rate (~1%) — acceptable because skipping a URL occasionally is fine. Cannot delete entries (use Counting Bloom Filter if needed).

Why are false positives acceptable in a Bloom filter for URL deduplication?
?
False positive = Bloom filter says URL is “already seen” when it hasn’t been. Worst case: crawler skips a URL it has never actually visited. In a 1-billion-page crawl, skipping ~1% of never-crawled URLs (10 million pages) is an acceptable trade-off for 80x memory savings (75 GB vs 6 TB). No false negatives: if a URL WAS seen, the Bloom filter will always correctly say “seen”. The guarantee that matters is: we never re-crawl seen URLs. The occasional missed new URL is fine.

What is robots.txt and how must a crawler handle it?
?
robots.txt is a file at the root of a website (e.g.,
https://site.com/robots.txt) that specifies crawl rules for bots. Key directives: User-agent (which bot), Disallow (paths not to crawl), Allow (override disallow), Crawl-delay (minimum seconds between requests), Sitemap (URL of the sitemap file). Crawler must: (1) Fetch robots.txt before crawling any page from a new host, (2) Cache it per host (refresh every 24 hours), (3) Never crawl Disallowed paths, (4) Respect Crawl-delay in per-host queue, (5) Honor Sitemap directive for efficient URL discovery.

What is a spider trap and how do you detect and avoid them?
?
A spider trap is a URL pattern generating infinitely many unique URLs: e.g., /calendar/2024/01/01/, /calendar/2024/01/02/… or /search?q=a, /search?q=aa, /search?q=aaa. Defenses: (1) URL length limit: skip URLs > 2048 characters. (2) Max path depth: skip URLs with > 10 path segments. (3) Per-domain cap: stop crawling after 10,000 URLs per domain. (4) URL normalization: remove tracking parameters (utm_*), sort query params — reduces URL explosion. (5) Bloom filter: catches re-queuing the same URL pattern. (6) Blacklist: known spider trap domains.

How does the HTML Downloader achieve 400 pages/second throughput?
?
Single-threaded download at 100ms/page = 10 pages/sec — way too slow. Solutions: (1) Distributed crawl: 10+ crawler machines each doing 40 pps. (2) Async I/O: non-blocking HTTP requests, handle thousands of concurrent connections per machine. (3) Locality: geographically co-located crawlers (US crawlers for US sites). (4) Short timeouts: 5-second timeout per request — don’t wait forever. (5) Keep-alive connections: reuse TCP/HTTP connections to the same host (HTTP/1.1 keep-alive, HTTP/2 multiplexing). (6) URL to crawler assignment via consistent hashing.

How do you make a distributed web crawler robust against machine failures?
?
(1) Consistent hashing: assign URLs to crawlers by hash → if a crawler fails, its URLs redistribute to remaining crawlers automatically. (2) Checkpointing: periodically serialize crawler state (URL Frontier, Bloom filter snapshot, in-progress downloads) to durable storage (S3/HDFS). On restart, load from checkpoint. (3) Idempotent downloads: re-downloading a page is safe; content dedup check catches duplicates before storing. (4) Retry with backoff: failed HTTP requests retried with exponential backoff. (5) Heartbeat monitoring: detect failed crawlers and trigger reassignment.

What storage systems are used for a web crawler and why?
?
Raw HTML pages: Object storage (S3, GCS) — cheap, durable, handles large blobs (500 KB × 1B pages = 500 TB/month). Metadata (URL, crawl time, content hash, status): Wide-column store (Bigtable, HBase, Cassandra) — handles high write throughput, efficient for key-based lookups by URL. URL Frontier: Redis (in-memory, priority queue support, fast dequeue). Bloom filter: In-memory (75 GB per node) or Redis with Bloom filter module. DNS cache: Redis (shared across all crawlers).

What are the key robustness design choices for a web crawler?
?
(1) Consistent hashing: distribute URLs across crawlers, auto-rebalance on failure. (2) Checkpointing: save state to S3/HDFS periodically so crash doesn’t lose progress. (3) Exception handling: handle HTTP errors (404 = mark dead, 503 = retry with backoff), timeouts, malformed HTML (log and skip, don’t crash), oversized pages (skip > 10 MB), infinite redirects (limit to 5 hops). (4) Data validation before storing: verify Content-Type is text/html, page is not empty, not a soft-404. (5) Monitoring: pages/sec, error rate, queue size, dedup rate.

How does freshness work in a web crawler (re-crawling strategy)?
?
Pages change over time and must be re-crawled. Strategies: (1) Time-based: schedule recrawl by page type — news sites hourly, product pages daily, static docs monthly. (2) Change detection: if content hash changed on last crawl, increase recrawl frequency; if unchanged, decrease. (3) Sitemap-driven: parse sitemap.xml which lists dates for each URL. (4) Feed-based: subscribe to RSS/Atom feeds for real-time change notification. Implementation: re-insert URLs into URL Frontier with a scheduled recrawl time. Priority score includes recency factor.

How would you make a web crawler extensible to support new content types (e.g., PDF)?
?
Use a plugin/content-type architecture. After the HTML Downloader fetches a response, a Content Type Detector reads the Content-Type HTTP header. Register parsers for each MIME type: text/html → HTML Parser, application/pdf → PDF Parser (plugin), image/jpeg → Image Parser (plugin). Each parser extracts text and links in a standardized format. The rest of the pipeline (content store, link extractor) is unchanged. New content types are added by implementing the parser interface and registering the MIME type — no core crawler changes needed.

What is the difference between URL deduplication and content deduplication?
?
URL deduplication: Check if the URL string has already been crawled or is already in the queue. Implemented with a Bloom filter. Prevents re-crawling the same URL. Happens BEFORE downloading. Content deduplication: Check if the downloaded page content is the same as a previously stored page. Implemented with SHA-256 hash. Prevents storing the same content twice (e.g., mirrors, CDN copies, redirect targets). Happens AFTER downloading. Need both: same content can appear at different URLs, and same URL should not be crawled twice.

Why should you use BFS + priority over simple BFS in a web crawler?
?
Simple BFS treats all URLs equally. Problems: (1) Low-quality pages and high-quality pages crawled at the same rate — wastes resources. (2) Updated pages not re-crawled promptly. Priority BFS adds: (1) PageRank-based priority: pages with more inbound links (more important) crawled first. (2) Traffic-based priority: popular pages crawled more frequently. (3) Freshness priority: pages not crawled recently bumped up. (4) News vs static priority: time-sensitive content re-crawled hourly. Result: in the same time window, prioritized BFS indexes more relevant content.

What is the role of the URL Filter component in a web crawler?
?
URL Filter removes unwanted URLs after link extraction, before adding to the URL Frontier. Filters applied: (1) Invalid format: not a well-formed URL (missing scheme, malformed). (2) Excluded file types: if crawling HTML only, remove .jpg, .pdf, .mp4, .zip URLs. (3) Blacklisted domains: known spam, malware, or adult content sites. (4) URL normalization: convert relative URLs to absolute, lowercase scheme/host, remove default ports, remove tracking parameters. (5) Length and depth limits: remove URLs that would hit spider trap defenses. Only URLs passing all filters reach the URL Seen check.

What monitoring metrics are important for a web crawler?
?
Throughput metrics: pages crawled per second (target: ~400), successful downloads per second, failed downloads per second. Queue health: URL Frontier size (should not grow unbounded), per-host queue depths. Deduplication: content dedup rate (% pages skipped as duplicate), URL dedup rate (% URLs already seen). Error rates: 4xx responses (broken links), 5xx responses (server errors), DNS failures, timeouts. Resource utilization: bandwidth, CPU per crawler, memory for Bloom filter/DNS cache. Alerting: if pages/sec drops 50%, if error rate spikes, if frontier queue grows unboundedly.

What is the full data flow for discovering and storing a new web page?
?

  1. URL dequeued from URL Frontier. 2. DNS cache lookup for host IP (fetch DNS if miss). 3. HTML Downloader fetches page (HTTP GET with timeout). 4. Content Type check (skip if not text/html). 5. Content Parser validates HTML and extracts metadata. 6. Content Seen check: SHA-256 hash of body → check hash set. If duplicate → discard. 7. Content Store: save raw HTML to S3, save metadata to Bigtable. 8. Link Extractor: find all wrong-type URLs. 10. URL Seen check (Bloom filter): skip already-seen URLs. 11. Add new URLs to URL Frontier with priority score.

Total Cards: 25
Review Time: 25-30 minutes
Priority: HIGH - Medium-Hard distributed systems question!
Last Updated: 2026-04-13