Chapter 8 Flashcards - URL Shortener

flashcards volume1 url-shortener hashing

What is a URL shortener and name two real-world examples?
?
A URL shortener converts a long URL into a short alias. Visiting the short URL redirects to the original. Real examples: TinyURL (tinyurl.com), bit.ly, t.co (Twitter), goo.gl (Google). Core operations: POST to shorten (write), GET short URL to redirect (read). It is a read-heavy system (100:1 read:write ratio).

What are the two API endpoints for a URL shortener?
?

  1. POST /api/v1/data/shorten — accepts { “longUrl”: ”…” } in body, returns { “shortUrl”: “https://tinyurl.com/abc1234” }. 2. GET /{shortURL} — returns 301 or 302 redirect with Location header set to the original long URL. The GET path is the hot path (100x more traffic than POST).

What is the character math for a 7-character short URL?
?
Characters allowed: [0-9] + [a-z] + [A-Z] = 10 + 26 + 26 = 62 characters. Unique combinations for 7 characters: 62^7 = 3,521,614,606,208 ≈ 3.5 trillion unique URLs. At 100M URLs/day: 3.5 trillion / 100M = 35,000 days ≈ 95+ years. 7 characters is the minimum length to support the required scale.

What are the scale estimates for a URL shortener serving 100M URLs/day?
?
Writes: 100M/day ÷ 86,400 sec ≈ 1,160 writes/sec. Reads (100:1 ratio): ~116,000 reads/sec. Storage: 100M × 365 × 10 years = 365 billion records. At ~500 bytes/record → ~182 TB over 10 years. Cache sizing: top 20% of URLs serve 80% of traffic (Pareto principle). One MySQL primary handles writes; read replicas handle reads.

What is the difference between 301 and 302 redirects in a URL shortener?
?
301 Permanent: browser caches the redirect, future visits go directly to destination (no server hit). Lower server load but cannot track repeat visits. 302 Temporary: browser always asks the server, every click hits our server. Higher load but enables full click tracking (analytics). Use 301 to reduce server load; use 302 when analytics/tracking every click matters.

Which redirect type should you use for a URL shortener that needs click analytics?
?
Use 302 Temporary Redirect. With 302, every client request goes through the short URL server, letting you log each click (timestamp, IP, user agent, referrer, country). With 301, the browser caches the redirect and future visits skip the server entirely — you only see the first click, not repeat visits. Marketing and analytics use cases require 302.

What is the data model / schema for a URL shortener?
?
Table: url_mapping with columns: id (BIGINT, primary key, auto-increment), short_url (VARCHAR(7), unique, indexed), long_url (VARCHAR(2048)), created_at (TIMESTAMP), user_id (BIGINT, nullable). Index on short_url for fast redirect lookups. Index on long_url for deduplication checks. Store in relational DB (MySQL/PostgreSQL) for ACID guarantees and unique constraints.

What are the two main approaches to generating a short URL?
?
Approach 1 — Hash + Collision Detection: Hash the long URL (MD5/SHA-256), take first 7 chars. If collision, append salt and rehash. Pros: deterministic (same long URL → same short URL). Cons: needs collision checking, multiple DB queries per write. Approach 2 — Base62 Encoding: Insert long URL into DB, get auto-increment ID, convert to base62. Pros: no collisions, one write per URL. Cons: sequential IDs are guessable.

Explain the Base62 encoding approach step by step.
?

  1. Client POSTs long URL. 2. Check if long URL already in DB (dedup). 3. If not found, INSERT into DB → get auto-increment id (e.g., 11157). 4. Convert id to base62: divide by 62 repeatedly, map remainders to alphabet [0-9a-zA-Z]. 5. base62(11157) = “2TX”. 6. Short URL = domain + “2TX”. 7. Write shortURL → longURL to Redis cache. 8. Return short URL to client. No collision check needed — IDs are globally unique.

How does base62 encoding work? Walk through an example.
?
Alphabet: “0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ” (0→‘0’, 10→‘a’, 36→‘A’). Algorithm: while num > 0, append alphabet[num % 62], then num = num // 62, reverse. Example: base62(11157) → 11157 % 62 = 21 (‘L’), 11157 // 62 = 179 → 179 % 62 = 55 (‘T’), 179 // 62 = 2 → 2 % 62 = 2 (‘2’) → reverse → “2TL”. Larger IDs produce longer strings (7 chars for IDs up to 62^7 = 3.5 trillion).

How does the hash + collision detection approach work?
?

  1. Compute MD5 or SHA-256 of the long URL. 2. Take first 7 characters of the hex hash string. 3. Check if this 7-char string exists in DB. 4. If no collision: save and return as short URL. 5. If collision (same 7 chars, different long URL): append random salt to long URL, rehash, take first 7 chars, repeat. Use Bloom filter to speed up collision checks (avoid DB query for non-existent keys).

What are the pros and cons of Hash+Collision vs Base62 encoding?
?
Hash+Collision pros: deterministic (same long URL always produces same short URL), works in distributed systems without coordination. Cons: collision handling needed (multiple DB queries), same URL + salt → different short URL. Base62 pros: guaranteed uniqueness (auto-increment), one DB write per URL, simple. Cons: sequential IDs guessable (enumerate all URLs), needs a single coordinated ID source. For interviews: Base62 is preferred.

Why is the URL shortener read-heavy and what is the caching strategy?
?
Read:write ratio is 100:1 (100 redirects per 1 new URL). At 100M writes/day → ~10B reads/day. Caching strategy: Redis cache with shortURL as key, longURL as value. Cache the top 20% of URLs (Pareto: 20% URLs handle 80% of traffic). Cache size: ~20 GB to store 200M entries at 100 bytes each. TTL: 24 hours or no TTL if URLs never expire. Eviction: LRU policy.

What database should you use for a URL shortener and why?
?
MySQL or PostgreSQL (relational). Reasons: structured schema (fixed columns: id, short_url, long_url, created_at), needs ACID (unique constraint on short_url prevents duplicates), strong consistency (two concurrent writes must not produce same short URL). Scale reads with read replicas (11,600 reads/sec). One primary handles 1,160 writes/sec comfortably. Shard by shortURL prefix if needed at extreme scale.

How does the read/redirect path work in a URL shortener?
?

  1. Client sends GET /abc1234. 2. Load balancer routes to API server. 3. API server checks Redis cache for key “abc1234”. 4. Cache hit: get longURL, return 301/302 redirect with Location header. 5. Cache miss: query MySQL DB for short_url = “abc1234”. 6. Write longURL to Redis cache (write-through). 7. Return 301/302 redirect. Cache hit ratio should be >80% in steady state due to hot URL concentration.

How do you handle deduplication — two users submitting the same long URL?
?
Before inserting a new record, query: SELECT short_url FROM url_mapping WHERE long_url = ?. If found, return the existing short URL (same long URL → same short URL, saves space). If not found, insert new record. Use a DB index on long_url for fast lookup. For concurrent inserts, use a unique index on long_url and handle duplicate key errors gracefully. Alternative: allow duplicate mappings (simpler, slightly wasteful).

How would you generate unique IDs in a distributed URL shortener?
?
Problem: Multiple API servers cannot all use auto-increment from a single table without coordination bottleneck. Solutions: 1. Single MySQL auto-increment with connection pooling (works at moderate scale). 2. Snowflake ID generator: 64-bit ID with timestamp + datacenter + machine + sequence (Twitter’s approach). 3. ID generation service: dedicated microservice that hands out ID ranges to API servers. 4. UUID to base62 (128-bit UUIDs are too long; hash the UUID to 7 chars).

What are the rate limiting rules for a URL shortener?
?
Unauthenticated users: 10 URL shortenings per hour per IP. Free tier: 100 URL shortenings per day per user. Premium tier: 10,000 URL shortenings per day per user. Redirect reads: 10,000 per minute per IP (prevent scraping all short URLs). Enforce at API Gateway. Return HTTP 429 Too Many Requests with Retry-After header. Also validates that submitted URLs are valid and reachable.

How do you prevent URL shortener abuse (spam, malware)?
?

  1. Rate limiting per IP and user account (429 if exceeded). 2. URL validation: verify submitted URL is well-formed, optionally check if reachable. 3. Domain blacklist: reject known spam/malware domains. 4. Integrate Google Safe Browsing API to scan URLs for malware/phishing. 5. CAPTCHA for anonymous or suspicious submissions. 6. Flag and review user-reported URLs. 7. Deactivate short URLs that redirect to reported malicious content.

What is the full final architecture of a URL shortener?
?
Client → Load Balancer → API Servers (stateless, horizontally scaled) → Redis Cache (shortURL → longURL, LRU, ~20GB) → MySQL (Primary for writes + Read Replicas for reads). Write path: API server → MySQL primary → cache write. Read path: API server → Redis cache → (cache miss) → MySQL read replica. Rate limiting at Load Balancer/API Gateway. Analytics: 302 redirects → async click events → Kafka → analytics service → data warehouse.

Why are API servers stateless in a URL shortener design?
?
API servers hold no session state — they look up shortURL → longURL from Redis/DB on every request. Stateless servers can be horizontally scaled freely (add/remove servers without rebalancing state). Load balancer can route any request to any server (round-robin, least-connections). Enables zero-downtime deployments (roll servers one by one). All shared state lives in Redis (fast) and MySQL (durable).

What storage capacity is needed over 10 years for 100M URLs/day?
?
Records: 100M/day × 365 days × 10 years = 365 billion records. Per-record size: ~500 bytes (short_url 7B + long_url up to 2048B + metadata). Total: 365 billion × 500 bytes ≈ 182 TB. This is manageable with modern databases and sharding. For read replicas, multiply storage by the number of replicas. Use data archiving (cold storage) for URLs older than a threshold to reduce hot DB size.

How do you add analytics (click tracking) to a URL shortener?
?
Use 302 redirects (not 301) so every click hits the server. On each redirect: 1. Log click event asynchronously (non-blocking): { short_url, timestamp, ip_address, country, user_agent, referrer }. 2. Publish event to Kafka/message queue. 3. Analytics service consumes events and aggregates: total clicks, unique clicks, geographic breakdown, referrer sources, time-of-day patterns. 4. Store aggregates in a data warehouse (Redshift, BigQuery). Dashboard shows real-time and historical stats.

What is the key design trade-off in URL shortener between 301 and 302?
?
301 Permanent: Pro — browser caches redirect, drastically reduces server load (after first visit, no server call). Con — cannot track repeat visits, cannot update redirect destination later. 302 Temporary: Pro — every visit goes through server (full analytics, can change destination). Con — higher server load, higher latency (extra network hop). The right choice depends on use case: performance → 301, analytics/flexibility → 302. Always clarify with interviewer.

What makes a URL shortener a good system design problem?
?
It covers many fundamental concepts in one clean problem: (1) Read-heavy systems and caching strategies, (2) Hash functions vs encoding (base62), (3) Collision detection and deduplication, (4) HTTP redirect types (301 vs 302) and their implications, (5) Database schema design and indexing, (6) Horizontal scaling of stateless API servers, (7) Rate limiting and abuse prevention, (8) Back-of-envelope estimation (62^7, scale numbers). Easy to extend with analytics, custom aliases, expiration.


Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH - Very common interview question (Easy-Medium)!
Last Updated: 2026-04-13