Chapter 7 Cheat Sheet — Sharding
Quick Revision Time: 5 minutes | Interview Prep: 15 minutes
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Sharding | Split data across nodes; each node holds a subset |
| Key-range sharding | Adjacent keys co-located; enables range queries; risks sequential hot spots |
| Hash sharding | Hash key for uniform spread; range queries require scatter/gather |
| Consistent hashing | Ring-based; adding nodes only moves adjacent keys (~1/N of data) |
| Hot spot | One shard gets disproportionate load (celebrity posts, viral content) |
| Local secondary index | Per-shard index; fast writes, scatter/gather reads |
| Global secondary index | Single term-partitioned index; fast reads, async/eventual updates |
| Scatter/gather | Fan out query to ALL shards; tail latency = slowest shard |
| Rebalancing | Moving shards between nodes as topology changes |
| Multitenancy sharding | Shard key includes tenant_id; row/schema/database-level isolation |
Sharding Strategy Decision Tree
What are your access patterns?
│
├─ Need range queries (all events Jan-Feb, all sensors in region X)?
│ ├─ Use key-range sharding
│ └─ Watch for: sequential key hot spots (timestamps → prefix with entity ID)
│
├─ Need uniform distribution, no range queries?
│ └─ Use hash sharding (consistent hashing for cluster changes)
│
├─ Need BOTH range and uniform distribution?
│ └─ Compound key: hash(partition_key) + sorted clustering_key
│ Example: (user_id hashed) + (timestamp sorted within shard)
│
├─ SaaS / multi-tenant platform?
│ ├─ Small tenants → row-level isolation, tenant_id as shard key prefix
│ ├─ Large tenants → dedicated shards or dedicated DB
│ └─ Mix → tiered: shared pool + reserved shards for top N tenants
│
└─ Celebrity / viral content hot spot?
├─ Write hot spot → random suffix (100 variants per key)
├─ Read hot spot → cache layer (Redis/CDN in front)
└─ Both → write suffix + read-aside cache
Key-Range vs Hash Sharding
Key-Range Sharding Hash Sharding
─────────────────────────────── ─────────────────────────────
[key_001─key_500] → Shard 0 hash(key) → bucket → Shard N
[key_501─key_999] → Shard 1
[key_1000─key_1999] → Shard 2
✅ Range queries span one shard ✅ Uniform distribution
✅ Sorted data → prefix scans work ✅ No hot spots from sequential keys
✅ Co-locate related data (same user) ✅ Simple routing logic
❌ Sequential key hot spots ❌ Range queries = scatter/gather
❌ Uneven shard sizes possible ❌ Cannot infer adjacent keys
❌ Manual boundary management ❌ Skewed access still causes hot spots
Consistent Hashing Ring
Hash space: 0 ──────────────────────────── 2^64
Nodes at random ring positions:
key_X (hash=25)
↓
0 [A@10] •────── [B@45] ──────── [C@80] ──── 2^64
key_X (hash=25) → Node B (first clockwise from 25)
Add Node D at position 35:
0 [A@10] [D@35]key_X [B@45] [C@80] ──── 2^64
key_X now routes to D — only keys in (10,35] need to move!
Multitenancy Isolation Models
┌─────────────────────┬───────────────────────┬────────────────────────┐
│ Row-Level │ Schema-Level │ Database-Level │
├─────────────────────┼───────────────────────┼────────────────────────┤
│ All tenants share │ Each tenant has its │ Each tenant has a │
│ tables; tenant_id │ own schema in shared │ dedicated database │
│ column distinguishes│ database instance │ (or cluster) │
├─────────────────────┼───────────────────────┼────────────────────────┤
│ ✅ Low overhead │ ✅ Moderate isolation │ ✅ Full isolation │
│ ✅ Simple schema │ ✅ Per-tenant backups │ ✅ Easy compliance │
│ ❌ Noisy neighbors │ ❌ Schema count limits │ ❌ High ops overhead │
│ ❌ Bug = data leak │ ❌ Harder aggregation │ ❌ Hard to aggregate │
├─────────────────────┼───────────────────────┼────────────────────────┤
│ Shared SaaS apps │ Shopify, mid-tier SaaS │ Enterprise contracts │
└─────────────────────┴───────────────────────┴────────────────────────┘
Local vs Global Secondary Indexes
LOCAL (Document-Partitioned): GLOBAL (Term-Partitioned):
──────────────────────────────── ──────────────────────────────────
Shard 0: data + idx for its rows Index shard "red": ptrs to all shards
Shard 1: data + idx for its rows Index shard "blue": ptrs to all shards
Shard 2: data + idx for its rows
WRITE: update 1 shard → fast WRITE: update 2 shards → slower
READ: scatter to ALL shards → slow READ: lookup 1 index shard → fast
tail latency problem (may be slightly stale)
Best: write-heavy + tolerates read cost Best: read-heavy secondary access
Used: MongoDB, Elasticsearch, Cassandra Used: DynamoDB GSI, Riak Search
Rebalancing Strategies
1. Fixed shard count (Elasticsearch, Riak, Couchbase)
───────────────────────────────────────────────────
Create 1000 shards for 10 nodes → 100/node
Add node: steal ~91 shards, no splitting needed
Con: shard count set at creation; hard to change later
2. Dynamic splitting (HBase, MongoDB, RethinkDB)
───────────────────────────────────────────────────
Shard > 10 GB → split into two 5 GB shards
Shard < 1 GB → merge with neighbor
Pro: auto-adapts to data volume
Con: cold-start problem (single shard handles everything at launch)
3. Vnodes / proportional (Cassandra)
───────────────────────────────────────────────────
Each node owns N virtual nodes (default: 256)
Add node: randomly steal ~256/(total_nodes+1) vnodes each from existing nodes
Pro: natural scaling; finer granularity with more nodes
Con: gossip overhead grows with vnode count
Request Routing Approaches
Approach 1: Any-node + forward Approach 2: Routing tier
────────────────────────────── ──────────────────────────────
Client → Node A (random) Client → ZooKeeper-backed Router
Node A: "I don't own K" ↓ (knows shard map)
Node A → Node B (owns K) Correct Shard Node
Node B → Node A → Client ↓
Result → Client
Extra hop in data path Extra hop before data path
Used: Cassandra (gossip) Used: Kafka, some MongoDB
Approach 3: Smart client
────────────────────────────────────────
Client subscribes to ZooKeeper/etcd for shard map
Client → Correct Node directly (no extra hops)
Lowest latency. Used: Kafka producer/consumer API, Cassandra drivers
Hot Spot Relief Reference
| Scenario | Problem | Fix | Trade-off |
|---|---|---|---|
| Timestamp shard key | All writes to “now” shard | Prefix with entity ID (sensor_id, user_id) | Key design change required |
| Celebrity post reads | Millions of reads to one shard | Redis/CDN cache in front | Cache invalidation complexity |
| Viral write burst | Write spike to single shard | Random suffix (100 variants) | Reads must aggregate 100 keys |
| Large tenant in shared pool | One tenant saturates shared shard | Dedicated shard for top-N tenants | Tenant routing layer needed |
Auto vs Manual Rebalancing
DANGER SCENARIO — Auto rebalancing during failure:
──────────────────────────────────────────────────
1. Node C slows down (GC pause, network blip)
2. Auto-rebalancer: "Node C seems down, move its shards!"
3. Heavy data transfer begins on already-stressed Node A and Node B
4. A and B, now overloaded, also slow down
5. Rebalancer: "A and B are slow too!" → triggers more moves
6. Cascade failure: cluster completely down
PREVENTION:
- Add delay threshold before triggering rebalancing
- Require quorum of nodes to agree a node is truly down
- Require human operator approval for large rebalancing operations
- Separate rebalancing monitoring from regular load alerts
Key Trade-offs Table
| Decision | Pros | Cons | Use When |
|---|---|---|---|
| Key-range sharding | Range queries, co-location | Sequential hot spots | Time-series + compound prefix key |
| Hash sharding | Uniform distribution, no sequential hot spots | No range queries | Random-access workloads |
| Local secondary index | Fast writes (single shard) | Scatter/gather reads | Write-heavy + occasional secondary search |
| Global secondary index | Fast reads (single index shard) | Eventual consistency, write cost | Read-heavy secondary access |
| Auto rebalancing | No operator intervention | Dangerous during failures | Dev/test, managed cloud services |
| Manual rebalancing | Safe, controlled timing | Operator burden | Production SLA-critical clusters |
| Row-level tenancy | Low overhead, simple | Noisy neighbor, leak risk | Small-scale SaaS, homogeneous tenants |
| DB-level tenancy | Full isolation, compliance easy | High overhead | Enterprise contracts, regulated data |
Modern Additions (2026)
Serverless auto-sharding:
├─ DynamoDB on-demand: shards scale automatically; no shard key tuning
├─ PlanetScale (Vitess): MySQL sharding transparent via VTGate
└─ Neon PostgreSQL: storage-layer sharding; app sees single DB
NewSQL transparent sharding:
├─ CockroachDB: auto range splits + rebalancing + Raft per range
├─ Google Spanner: TrueTime + auto shards + global ACID transactions
└─ YugabyteDB: PostgreSQL-compatible, auto Raft-based sharding
Kafka KRaft (no ZooKeeper):
├─ Partition metadata stored in Kafka's own Raft log
├─ Simpler operations, faster partition reassignment
└─ One fewer system to operate and monitor
Vector DB sharding:
├─ Pinecone, Weaviate, Milvus: shard by cluster assignment (k-means)
├─ Approximate nearest neighbor requires replicated indexes per shard
└─ Not key-range or hash; purpose-built for high-dimensional space
Interview Response Templates
”How would you shard a social media platform?”
“I’d first analyze access patterns: user profile reads are by user_id (hash shard on user_id for uniform distribution), but feed queries are time-range-based (compound key: hash user_id, range by timestamp within shard). For the celebrity hot-spot problem, I’d use a caching layer (Redis) for public profiles and posts of high-follower accounts, backed by CDN for truly public content. For secondary indexes like ‘all posts with hashtag X’, I’d use global term-partitioned indexes for read efficiency, accepting slight eventual consistency."
"Local vs global secondary index — when do you use each?”
“Local secondary indexes are write-cheap (single shard) but read-expensive (scatter/gather = tail latency amplification). I’d use them when writes are frequent and secondary reads are rare or latency-tolerant. Global secondary indexes reverse this: writes touch two shards and the index is eventually consistent, but secondary reads hit one partition directly. I’d use them for read-heavy secondary access patterns where index freshness can be slightly stale — like product catalog searches, not financial ledger reads.”
Last Updated: 2026-05-29