Chapter 7 Cheat Sheet — Sharding

ddia-2e sharding cheatsheet

Quick Revision Time: 5 minutes | Interview Prep: 15 minutes


One-Line Summaries

ConceptOne-Liner
ShardingSplit data across nodes; each node holds a subset
Key-range shardingAdjacent keys co-located; enables range queries; risks sequential hot spots
Hash shardingHash key for uniform spread; range queries require scatter/gather
Consistent hashingRing-based; adding nodes only moves adjacent keys (~1/N of data)
Hot spotOne shard gets disproportionate load (celebrity posts, viral content)
Local secondary indexPer-shard index; fast writes, scatter/gather reads
Global secondary indexSingle term-partitioned index; fast reads, async/eventual updates
Scatter/gatherFan out query to ALL shards; tail latency = slowest shard
RebalancingMoving shards between nodes as topology changes
Multitenancy shardingShard key includes tenant_id; row/schema/database-level isolation

Sharding Strategy Decision Tree

What are your access patterns?
│
├─ Need range queries (all events Jan-Feb, all sensors in region X)?
│  ├─ Use key-range sharding
│  └─ Watch for: sequential key hot spots (timestamps → prefix with entity ID)
│
├─ Need uniform distribution, no range queries?
│  └─ Use hash sharding (consistent hashing for cluster changes)
│
├─ Need BOTH range and uniform distribution?
│  └─ Compound key: hash(partition_key) + sorted clustering_key
│     Example: (user_id hashed) + (timestamp sorted within shard)
│
├─ SaaS / multi-tenant platform?
│  ├─ Small tenants → row-level isolation, tenant_id as shard key prefix
│  ├─ Large tenants → dedicated shards or dedicated DB
│  └─ Mix → tiered: shared pool + reserved shards for top N tenants
│
└─ Celebrity / viral content hot spot?
   ├─ Write hot spot → random suffix (100 variants per key)
   ├─ Read hot spot → cache layer (Redis/CDN in front)
   └─ Both → write suffix + read-aside cache

Key-Range vs Hash Sharding

Key-Range Sharding                     Hash Sharding
───────────────────────────────        ─────────────────────────────
[key_001─key_500] → Shard 0            hash(key) → bucket → Shard N
[key_501─key_999] → Shard 1
[key_1000─key_1999] → Shard 2

✅ Range queries span one shard         ✅ Uniform distribution
✅ Sorted data → prefix scans work      ✅ No hot spots from sequential keys
✅ Co-locate related data (same user)   ✅ Simple routing logic
❌ Sequential key hot spots             ❌ Range queries = scatter/gather
❌ Uneven shard sizes possible          ❌ Cannot infer adjacent keys
❌ Manual boundary management           ❌ Skewed access still causes hot spots

Consistent Hashing Ring

Hash space: 0 ──────────────────────────── 2^64

Nodes at random ring positions:

      key_X (hash=25)
           ↓
0  [A@10]  •──────  [B@45]  ────────  [C@80]  ──── 2^64
   
key_X (hash=25) → Node B (first clockwise from 25)

Add Node D at position 35:
0  [A@10]  [D@35]key_X [B@45]       [C@80]  ──── 2^64

key_X now routes to D — only keys in (10,35] need to move!

Multitenancy Isolation Models

┌─────────────────────┬───────────────────────┬────────────────────────┐
│ Row-Level           │ Schema-Level           │ Database-Level         │
├─────────────────────┼───────────────────────┼────────────────────────┤
│ All tenants share   │ Each tenant has its    │ Each tenant has a      │
│ tables; tenant_id   │ own schema in shared   │ dedicated database     │
│ column distinguishes│ database instance      │ (or cluster)           │
├─────────────────────┼───────────────────────┼────────────────────────┤
│ ✅ Low overhead      │ ✅ Moderate isolation  │ ✅ Full isolation       │
│ ✅ Simple schema     │ ✅ Per-tenant backups   │ ✅ Easy compliance      │
│ ❌ Noisy neighbors   │ ❌ Schema count limits  │ ❌ High ops overhead    │
│ ❌ Bug = data leak   │ ❌ Harder aggregation   │ ❌ Hard to aggregate    │
├─────────────────────┼───────────────────────┼────────────────────────┤
│ Shared SaaS apps    │ Shopify, mid-tier SaaS │ Enterprise contracts   │
└─────────────────────┴───────────────────────┴────────────────────────┘

Local vs Global Secondary Indexes

LOCAL (Document-Partitioned):              GLOBAL (Term-Partitioned):
────────────────────────────────           ──────────────────────────────────
Shard 0: data + idx for its rows           Index shard "red": ptrs to all shards
Shard 1: data + idx for its rows           Index shard "blue": ptrs to all shards
Shard 2: data + idx for its rows

WRITE: update 1 shard → fast               WRITE: update 2 shards → slower
READ:  scatter to ALL shards → slow        READ: lookup 1 index shard → fast
       tail latency problem                      (may be slightly stale)

Best: write-heavy + tolerates read cost    Best: read-heavy secondary access
Used: MongoDB, Elasticsearch, Cassandra    Used: DynamoDB GSI, Riak Search

Rebalancing Strategies

1. Fixed shard count (Elasticsearch, Riak, Couchbase)
   ───────────────────────────────────────────────────
   Create 1000 shards for 10 nodes → 100/node
   Add node: steal ~91 shards, no splitting needed
   Con: shard count set at creation; hard to change later

2. Dynamic splitting (HBase, MongoDB, RethinkDB)
   ───────────────────────────────────────────────────
   Shard > 10 GB → split into two 5 GB shards
   Shard < 1 GB  → merge with neighbor
   Pro: auto-adapts to data volume
   Con: cold-start problem (single shard handles everything at launch)

3. Vnodes / proportional (Cassandra)
   ───────────────────────────────────────────────────
   Each node owns N virtual nodes (default: 256)
   Add node: randomly steal ~256/(total_nodes+1) vnodes each from existing nodes
   Pro: natural scaling; finer granularity with more nodes
   Con: gossip overhead grows with vnode count

Request Routing Approaches

Approach 1: Any-node + forward         Approach 2: Routing tier
──────────────────────────────         ──────────────────────────────
Client → Node A (random)               Client → ZooKeeper-backed Router
  Node A: "I don't own K"                          ↓ (knows shard map)
  Node A → Node B (owns K)              Correct Shard Node
  Node B → Node A → Client                         ↓
                                        Result → Client
Extra hop in data path                  Extra hop before data path
Used: Cassandra (gossip)               Used: Kafka, some MongoDB

Approach 3: Smart client
────────────────────────────────────────
Client subscribes to ZooKeeper/etcd for shard map
Client → Correct Node directly (no extra hops)
Lowest latency. Used: Kafka producer/consumer API, Cassandra drivers

Hot Spot Relief Reference

ScenarioProblemFixTrade-off
Timestamp shard keyAll writes to “now” shardPrefix with entity ID (sensor_id, user_id)Key design change required
Celebrity post readsMillions of reads to one shardRedis/CDN cache in frontCache invalidation complexity
Viral write burstWrite spike to single shardRandom suffix (100 variants)Reads must aggregate 100 keys
Large tenant in shared poolOne tenant saturates shared shardDedicated shard for top-N tenantsTenant routing layer needed

Auto vs Manual Rebalancing

DANGER SCENARIO — Auto rebalancing during failure:
──────────────────────────────────────────────────
1. Node C slows down (GC pause, network blip)
2. Auto-rebalancer: "Node C seems down, move its shards!"
3. Heavy data transfer begins on already-stressed Node A and Node B
4. A and B, now overloaded, also slow down
5. Rebalancer: "A and B are slow too!" → triggers more moves
6. Cascade failure: cluster completely down

PREVENTION:
- Add delay threshold before triggering rebalancing
- Require quorum of nodes to agree a node is truly down
- Require human operator approval for large rebalancing operations
- Separate rebalancing monitoring from regular load alerts

Key Trade-offs Table

DecisionProsConsUse When
Key-range shardingRange queries, co-locationSequential hot spotsTime-series + compound prefix key
Hash shardingUniform distribution, no sequential hot spotsNo range queriesRandom-access workloads
Local secondary indexFast writes (single shard)Scatter/gather readsWrite-heavy + occasional secondary search
Global secondary indexFast reads (single index shard)Eventual consistency, write costRead-heavy secondary access
Auto rebalancingNo operator interventionDangerous during failuresDev/test, managed cloud services
Manual rebalancingSafe, controlled timingOperator burdenProduction SLA-critical clusters
Row-level tenancyLow overhead, simpleNoisy neighbor, leak riskSmall-scale SaaS, homogeneous tenants
DB-level tenancyFull isolation, compliance easyHigh overheadEnterprise contracts, regulated data

Modern Additions (2026)

Serverless auto-sharding:
├─ DynamoDB on-demand: shards scale automatically; no shard key tuning
├─ PlanetScale (Vitess): MySQL sharding transparent via VTGate
└─ Neon PostgreSQL: storage-layer sharding; app sees single DB

NewSQL transparent sharding:
├─ CockroachDB: auto range splits + rebalancing + Raft per range
├─ Google Spanner: TrueTime + auto shards + global ACID transactions
└─ YugabyteDB: PostgreSQL-compatible, auto Raft-based sharding

Kafka KRaft (no ZooKeeper):
├─ Partition metadata stored in Kafka's own Raft log
├─ Simpler operations, faster partition reassignment
└─ One fewer system to operate and monitor

Vector DB sharding:
├─ Pinecone, Weaviate, Milvus: shard by cluster assignment (k-means)
├─ Approximate nearest neighbor requires replicated indexes per shard
└─ Not key-range or hash; purpose-built for high-dimensional space

Interview Response Templates

”How would you shard a social media platform?”

“I’d first analyze access patterns: user profile reads are by user_id (hash shard on user_id for uniform distribution), but feed queries are time-range-based (compound key: hash user_id, range by timestamp within shard). For the celebrity hot-spot problem, I’d use a caching layer (Redis) for public profiles and posts of high-follower accounts, backed by CDN for truly public content. For secondary indexes like ‘all posts with hashtag X’, I’d use global term-partitioned indexes for read efficiency, accepting slight eventual consistency."

"Local vs global secondary index — when do you use each?”

“Local secondary indexes are write-cheap (single shard) but read-expensive (scatter/gather = tail latency amplification). I’d use them when writes are frequent and secondary reads are rare or latency-tolerant. Global secondary indexes reverse this: writes touch two shards and the index is eventually consistent, but secondary reads hit one partition directly. I’d use them for read-heavy secondary access patterns where index freshness can be slightly stale — like product catalog searches, not financial ledger reads.”


Last Updated: 2026-05-29