Scalability Patterns: Cross-Chapter Comparison Reference

comparisons scalability patterns architecture

Type: Cross-chapter reference — NOT a chapter notes file
Covers: Vol 1 Ch01, Ch04, Ch05, Ch08, Ch11, Ch12, Ch14, Ch15 · Vol 2 Ch01, Ch05, Ch06, Ch09, Ch10, Ch13
Last Updated: 2026-04-13

1. Horizontal vs Vertical Scaling

Aspect	Vertical (Scale Up)	Horizontal (Scale Out)
Mechanism	Replace with a bigger machine (more CPU, RAM, disk)	Add more machines to the pool
Limit	Hard ceiling (largest instance: ~96 CPU, ~384 GB RAM on AWS)	Virtually unlimited (add nodes)
Failure mode	Single point of failure — one machine goes down, system is down	Can lose individual nodes without outage
Consistency	Simple — no distributed state problem	Complex — must handle distributed state, eventual consistency
Cost	Expensive per unit at large scale	Cheaper per unit, pay-as-you-grow
Speed of scaling	Fast (resize existing machine)	Slow (provision, configure, load balance new nodes)
When to use first	Always — exhaust vertical before horizontal	When vertical ceiling hit or cost prohibitive

SDI progression (Vol 1 Ch01 — Scale from Zero to Millions):

0 → 1K users: Single server, vertical scale
1K → 100K users: Add replica DB, CDN, basic load balancer
100K → 1M users: Multiple app servers (horizontal), sharded DB
1M+ users: Full microservices, consistent hashing, multi-region

Real examples from SDI chapters:

URL shortener (Vol 1 Ch08): Start with single MySQL; at 100M URLs/day → Redis cache absorbs reads, vertical scale DB, add replicas
Chat system (Vol 1 Ch12): WebSocket servers scale horizontally; each server handles ~N concurrent connections; load balancer distributes
Metrics monitoring (Vol 2 Ch05): Ingestion layer scales horizontally (stateless collectors); time-series DB scales via sharding

2. Load Balancing Strategies

Algorithm	How it works	Pros	Cons	When to use	SDI examples
Round Robin	Requests distributed sequentially across servers	Simple, fair for uniform requests	Ignores server load; slow servers pile up	Stateless API servers with similar request sizes	Default for most app server pools
Weighted Round Robin	Servers get proportional traffic based on weight	Handles heterogeneous hardware	Weights need manual tuning	Mixed server capacities	When some servers are more powerful
Least Connections	Send to server with fewest active connections	Better for long-lived connections	Overhead tracking connection count	WebSocket servers, long-poll connections	Chat system (Ch12) — persistent connections
Least Response Time	Send to server with lowest current latency	Best real-time performance	Complex measurement	Latency-sensitive services	Stock exchange market data (V2 Ch13)
IP Hash	Hash client IP to a consistent server	Sticky sessions (same client → same server)	Uneven if clients cluster by IP; fails if server removed	Session affinity without shared session store	Legacy apps with server-side sessions
Consistent Hashing	Hash ring; client/key → nearest server on ring	Minimal remapping when servers added/removed	Need virtual nodes for balance	Distributed caches, sharded data services	Redis Cluster, Cassandra routing

Layer 4 vs Layer 7 load balancers:

L4 (TCP/UDP): Routes by IP + port only. Faster, lower overhead. Can’t inspect HTTP content. Good for raw throughput.
L7 (HTTP): Routes by URL path, headers, cookies. Can do content-based routing (/api → API servers, /static → CDN). Good for microservices.

3. Data Partitioning Patterns

Strategy	Logic	Pros	Cons	SDI chapters using it
Range-based	Assign key ranges to shards (A–F → shard 1)	Easy range queries; natural ordering	Hotspots when data is skewed	Web crawler (Ch09) — crawl by URL prefix; Time-series by time range (V2 Ch05)
Hash-based	`shard = hash(key) % N`	Even distribution; simple	No range queries; full reshard when N changes	URL shortener (Ch08) — hash of shortCode; chat messages by conversation ID
Consistent hashing	Hash ring; each node owns an arc; add/remove moves minimal data	Minimal resharding disruption; handles node churn gracefully	Need virtual nodes to avoid uneven distribution; more complex routing	Key-value store (Ch06); Redis Cluster; Cassandra; distributed cache (Vol 2 Ch04)
Geographic / directory	Route based on user location or explicit lookup table	Low latency; data residency compliance	Cross-region queries complex; uneven shard sizes	Google Maps (V2 Ch03) — tile shards by lat/lon quadrant; Hotel reservation (V2 Ch07) — by region

Resharding problem: The main reason consistent hashing exists. When you add shard N+1 with simple hash modulo, all keys remapping costs enormous I/O. With consistent hashing, only 1/N of keys move.

Celebrity problem (hotspot): A single partition receives disproportionate traffic (one celebrity’s posts, one trending URL). Solutions:

Add suffix/prefix to shard key to spread across shards (user:celebrity:shard_0, user:celebrity:shard_1)
Cache hot keys in every app server’s local memory (1–5s TTL)
Read replicas for the hot partition

4. Replication Patterns

Pattern	Writes	Reads	Consistency	Failure handling	SDI chapters
Leader-Follower (Primary-Replica)	Primary only	Primary + any replica	Eventual (replication lag)	Promote replica on primary failure	Most SQL DBs (MySQL replicas in URL shortener, news feed, notifications)
Multi-Leader (Active-Active)	Any node	Any node	Conflict resolution required (last-write-wins, CRDTs, custom)	Any node can take over; write conflicts possible	Google Docs-style collaborative editing; Multi-region active-active (Google Drive conflicts)
Leaderless (Quorum)	All nodes accept writes; quorum W of N confirm	Quorum R of N nodes respond; merge conflicts	Tunable: `W + R > N` = strong; `W + R ≤ N` = eventual	No leader election needed; Sloppy quorum during partitions	Cassandra (chat Ch12, metrics V2 Ch05); DynamoDB; Key-value store (Ch06 — Dynamo-style)

Trade-offs table:

	Availability	Consistency	Complexity	Best for
Leader-Follower	Medium (failover needed)	Strong (sync) or Eventual (async)	Low	Most OLTP systems
Multi-Leader	High (no single failure point)	Low (conflict-prone)	High	Multi-region writes, collaboration
Leaderless	High (no leader election)	Tunable via quorum	Medium	High write availability, Cassandra-style

Quorum formula (from Vol 1 Ch06 — Key-Value Store):

N = replication factor, W = write quorum, R = read quorum
W + R > N → strong consistency (reads always see latest write)
Common: N=3, W=2, R=2 (majority quorum)
Latency optimized: W=1, R=3 (fast writes, slow reads)
Durability optimized: W=3, R=1 (all nodes confirm write)

5. Fanout Patterns

Fanout = distributing a single event to multiple destinations. The core trade-off is: do work at write time or read time?

Pattern	When work is done	Latency for writer	Latency for reader	Storage cost	SDI chapter
Fanout on Write (Push model)	At post/event creation time	High (must write to all N followers)	Low (pre-built feed)	High (N copies per post)	Vol 1 Ch11 News Feed — write post → push to all followers’ feed caches
Fanout on Read (Pull model)	At feed retrieval time	Low (just store post)	High (must query all N followees)	Low (one copy)	Vol 1 Ch11 — older Twitter approach; simpler but slow for high-fan users
Hybrid	Write for regular users; read for celebrities	Medium	Medium	Medium	Vol 1 Ch11 — preferred production approach; push to non-celebrity followers, pull for celebrities on read

When to use which:

Write fanout: Best when followers are few (<500), feed access is frequent, read latency must be low
Read fanout: Best when user has millions of followers (celebrities), or feed is rarely accessed
Hybrid: Real production choice at scale — write fanout for normal users, read fanout (or separate celebrity handling) for high-follower accounts

Generalized fanout beyond news feed:

Notification system (Vol 1 Ch10): Fanout on write to per-channel queues (one queue per notification type)
YouTube upload (Vol 1 Ch14): Fanout on write to transcoding jobs (one Kafka event → N parallel workers)
Google Drive sync (Vol 1 Ch15): Fanout to all of user’s devices on file change

6. Rate Limiting Patterns

Summary reference — full detail in Vol 1 Ch04.

Algorithm	Memory	Accuracy	Allows burst	Complexity	Production use
Token Bucket	Low (bucket state per user)	Good	Yes (consume all tokens at once)	Low	AWS API Gateway, Stripe — industry default
Leaky Bucket	Low (queue size per user)	Good	No (constant drain rate)	Medium	Traffic shaping, smooth downstream load
Fixed Window Counter	Very low (one counter per window)	Poor (burst at boundary)	Partial	Very low	Prototypes only
Sliding Window Log	High (all timestamps)	Excellent	No	Medium	Low-volume, accuracy-critical
Sliding Window Counter	Low (two counters)	Very good (weighted approximation)	No	Medium	Cloudflare, Azure — best production balance

Where rate limiting lives in the architecture:

API Gateway: Per-user, per-IP, per-endpoint limits
Redis: Stores counters (atomic INCR, automatic TTL expiry)
Distributed: Redis Cluster shards counters by user ID; race conditions solved with Lua scripts

7. CDN Patterns

Aspect	Push CDN	Pull CDN
How content gets to edge	You upload/push content to CDN nodes in advance	CDN fetches from origin on first cache miss, then caches
First request	Fast (already at edge)	Slow (origin fetch + cache)
Content freshness	Manual invalidation or re-push	TTL-based auto-expiry
Best for	Rarely-changing content you know will be popular (software releases, JS bundles)	Frequently-changing or unpredictable content
Storage cost	High (pre-populate all edges)	Lower (only cache what’s requested)
Complexity	Higher (you manage distribution)	Lower (CDN manages it)

What to put on CDN:

Static assets: JS, CSS, fonts, images (always)
Video segments: HLS/DASH chunks (YouTube Ch14, Netflix-style)
Map tiles: Pre-rendered PNG/vector tiles (Google Maps V2 Ch03)
File downloads: User files from Google Drive (Vol 1 Ch15)
API responses: Only if safe to cache (no user-specific data without cache-key on user ID)

What NOT to put on CDN:

Authenticated API responses (unless you include auth token in cache key)
Real-time data (order book, live prices)
User-specific personalized content (news feed, notifications)

SDI chapters using CDN heavily:

YouTube (Vol 1 Ch14): CDN is mandatory — video at 5M DAU without CDN = origin server melts down. CDN serves ~95% of video traffic.
Google Drive (Vol 1 Ch15): Pull CDN for file downloads; cached at edge per file version.
Google Maps (Vol 2 Ch03): Map tiles are static; Push CDN for popular zoom levels/regions.
URL shortener (Vol 1 Ch08): Can cache redirect responses at CDN edge (301 cached by browser + CDN).

8. Microservices vs Monolith

Factor	Monolith	Microservices
Team size	< 10–20 engineers	50+ engineers
Deployment	One deployable unit	Independent per service
Scaling	Scale the whole app	Scale specific bottleneck services
Technology	Single stack	Polyglot (best tool per service)
Data	Shared single DB	DB per service (no shared DB!)
Latency	In-process calls (sub-ms)	Network calls (1–10ms per hop)
Debugging	Simple stack traces	Requires distributed tracing
When to start	Always (premature microservices = pain)	When monolith becomes deployment bottleneck

API Gateway pattern (mandatory for microservices):

Single entry point for all client requests
Handles auth, rate limiting, request routing, protocol translation
Without it, clients must know about every service
SDI reference: Every chapter with microservices uses an API Gateway

Service mesh (optional, for large-scale microservices):

Sidecars (Envoy) on every pod handle: retries, circuit breaking, mTLS, observability
Use when > ~20 services or when network policy / tracing is critical

9. Scalability Decision Framework

Given a set of constraints, which patterns apply?

Constraint / Symptom	Pattern to apply	SDI example
Read traffic >> write traffic (>5:1)	Read replicas + caching (Redis)	URL shortener, news feed, leaderboard
Write throughput > single DB capacity	Sharding (consistent hash or range)	Chat messages, metrics ingestion
Single server is SPOF	Horizontal scale + load balancer	All production systems
Global users with latency concerns	CDN + GeoDNS + multi-region	YouTube, Google Maps, Google Drive
Traffic is spiky / bursty	Message queue as buffer	Notification system, video transcoding
Real-time fan-out to many subscribers	Pub/Sub or Kafka topics	News feed fanout, notification fan-out
Hotspot / celebrity problem	Hybrid fanout + local cache + key splitting	News feed celebrity handling
Need to replay events	Kafka with retention	Ad click aggregation, metrics backfill
Near-millisecond response time	In-memory store (Redis) + connection pooling	Leaderboard, rate limiter, order book
Ultra-low latency (<1ms)	Co-location, LMAX Disruptor, in-process queue	Stock exchange matching engine
Data consistency across services	Outbox pattern + Saga	Payment system, hotel reservation
Schema flexibility needed	NoSQL (document DB or wide-column)	User profiles, activity events

10. Numbers Every Engineer Should Know for Scalability

Latency Table (approximate, 2024)

Operation	Latency
L1 cache read	0.5 ns
L2 cache read	7 ns
Main memory (RAM) read	100 ns
Redis GET (local network)	~0.5–1 ms
SSD random read	~0.1 ms
Network round-trip (same datacenter)	~0.5 ms
Network round-trip (cross-region, US East ↔ West)	~70 ms
Network round-trip (US ↔ Europe)	~100–120 ms
HDD seek	~10 ms
MySQL query (simple, indexed)	~1–5 ms
MySQL query (complex join)	~10–100 ms
Cassandra read (quorum)	~1–5 ms
S3 GET request	~10–50 ms
DNS lookup	~1–100 ms

Throughput Table

Component	Typical throughput
Single web server (nginx)	~50K–100K req/sec
Single app server (Java Spring)	~1K–10K req/sec (CPU bound)
Redis (single node)	~100K–1M ops/sec
MySQL (reads, indexed)	~10K–100K QPS per server
MySQL (writes)	~1K–10K writes/sec
Cassandra (writes)	~50K–100K writes/sec per node
Kafka (single partition)	~10 MB/sec or ~100K msgs/sec
CDN (global, Cloudflare)	Terabits/sec aggregate
AWS S3	Thousands of req/sec per prefix

Storage Scale Reference

Scale	Users	Storage need / year	Typical architecture
Startup	< 10K	< 100 GB	Single server + single DB
Growing	10K–1M	100 GB – 10 TB	Load balancer + DB replicas + CDN
Large	1M–100M	10 TB – 1 PB	Sharded DB + distributed cache + object storage
FAANG	100M+	1 PB+	Custom storage systems + multi-region

Useful multiples:

1 million seconds ≈ 11.5 days
2.5 million seconds ≈ 1 month
1 billion seconds ≈ 31.7 years
QPS = DAU × actions_per_day / 86,400
Peak QPS ≈ 2–3× average QPS
Storage = daily_writes × size_per_entry × retention_days × replication_factor

See also:

key-patterns > 2. Database Sharding — Sharding strategies
key-patterns > 3. Consistent Hashing — Hash ring in detail
key-patterns > 4. Load Balancing Algorithms — Algorithm comparison
key-patterns > 5. Fanout Patterns — Fanout on write vs read
distributed-system-components — Component-by-component reference
estimation-cheatsheet — Full numbers reference
ch01-scale-from-zero-to-millions — Progression of adding components
ch04-rate-limiter — Rate limiting algorithms in full
ch05-consistent-hashing — Consistent hashing deep dive
ch11-news-feed — Fanout patterns in production
cache-strategies — Caching across chapters
storage-systems — Storage system selection

Study Notes by Niladri & AI

Explorer

scalability-patterns