Chapter 7: Sharding
ddia-2e sharding partitioning distributed-systems
Status: Notes complete
Overview
Sharding (the 2nd edition’s renamed version of “Partitioning”) is the technique of splitting a large dataset across multiple nodes so each node holds only a subset of the data. When a single machine cannot hold or serve all the data, sharding enables horizontal scaling: adding more nodes increases both storage capacity and query throughput. The central engineering challenge is choosing a sharding strategy that distributes data and load evenly, while supporting the query patterns the application needs — and handling the operational realities of nodes joining and leaving the cluster over time.
2nd Edition change: The chapter is renamed from “Partitioning” to “Sharding” to match modern industry terminology used by Elasticsearch, MongoDB, Vitess, and most cloud databases. The content is largely equivalent to 1st edition Ch6 but adds a substantial new section on Sharding for Multitenancy — critical for SaaS platform design. The term “shard” is used throughout where the 1st edition used “partition.”
Key insight: Sharding and replication are orthogonal and usually combined — each shard is replicated to multiple nodes. Sharding solves scale; replication solves availability.
Key Concepts
Pros and Cons of Sharding
Why shard?
- Dataset exceeds single-node storage capacity
- Read/write throughput exceeds single-node limits
- Latency requirements demand data be geographically close to users
- Regulatory requirements mandate data residency in specific regions
Why NOT shard (yet)?
- Operational complexity increases dramatically: rebalancing, schema changes, cross-shard queries all become harder
- Cross-shard transactions require distributed coordination (2PC or saga pattern)
- Secondary indexes become either scatter/gather (local) or eventually consistent (global)
- Debugging, monitoring, and tooling are all more complex
Decision rule: Start without sharding. Shard when you have a concrete bottleneck (storage, throughput, or latency) that cannot be addressed by vertical scaling, caching, or read replicas alone.
Sharding for Multitenancy
Problem context: SaaS platforms serve many customers (tenants) from a shared infrastructure. The key question is how to isolate tenant data — both for security/compliance and to prevent one tenant’s heavy load from degrading another’s (the “noisy neighbor” problem).
Three isolation models:
| Model | Description | Pros | Cons | Examples |
|---|---|---|---|---|
| Row-level isolation | All tenants share tables; tenant_id column distinguishes rows | Low overhead, simple schema | Noisy neighbor risk; accidental data leakage if bugs exist | Shared SaaS apps |
| Schema-level isolation | Each tenant gets their own schema within a shared database | Moderate isolation; easier backups per tenant | Schema count limits (PostgreSQL: ~100s practical) | Shopify, multi-tenant PostgreSQL |
| Database-level isolation | Each tenant gets a separate database (or even cluster) | Full isolation, easy compliance, simple backups | Highest overhead; hard to aggregate cross-tenant | Enterprise SaaS with large customers |
Sharding by tenant: For row-level isolation, the tenant_id is typically the shard key (or first component of a compound shard key). This guarantees all data for one tenant lands on the same shard, enabling efficient tenant-scoped queries without cross-shard joins.
Hybrid approach (common in practice):
- Small/free-tier tenants: row-level isolation on shared shards
- Large/enterprise tenants: dedicated shards or dedicated databases
- “Tenant routing” layer maps tenant_id → shard dynamically
Noisy neighbor mitigation:
- Rate limiting per tenant at the application layer
- Tenant-aware load balancing (route heavy tenants to dedicated nodes)
- Burst absorption via auto-scaling (serverless databases)
Compliance considerations (GDPR, HIPAA):
- Database-level isolation simplifies data deletion (drop database for a tenant)
- Row-level isolation requires careful soft-delete or purge logic
- Schema-level isolation can work if schema-level backup/restore is automated
Sharding of Key-Value Data
Sharding by Key Range
How it works: Assign a continuous range of keys to each shard. Keys within each shard are kept sorted (typically in a B-tree).
Lexicographic key-range sharding:
─────────────────────────────────────────────────────────────────
Key: [A───────D] [E──────────M] [N──────────R] [S────────Z]
Shard: Shard 0 Shard 1 Shard 2 Shard 3
─────────────────────────────────────────────────────────────────
Advantages:
- Range queries are efficient: scan one contiguous shard
- Keys are stored sorted → range scans, prefix queries, “next N” queries all work
- Data locality: related keys (e.g., same user’s time-ordered events) land on the same shard
Disadvantages:
- Hot spots from access patterns: If the key is a timestamp, all writes go to “today’s” shard
- Fix: Prefix timestamp with a sensor/user ID so different entities go to different shards
- Uneven distribution: Some key ranges may contain far more data than others
- Split boundaries must be manually set or dynamically managed
Used by: HBase, Google Bigtable, MongoDB (range-based), CockroachDB (automatic range splits), Spanner
Sharding by Hash of Key
How it works: Apply a hash function to the key; the hash value determines which shard receives the data.
Hash-based sharding:
─────────────────────────────────────────────────────────────────
hash("user_001") = 0x3A → Shard 2
hash("user_002") = 0x91 → Shard 5
hash("user_003") = 0x17 → Shard 1
(hash values evenly spread across shard space)
─────────────────────────────────────────────────────────────────
Hash function properties:
- Must be deterministic: same key always produces same hash
- Should distribute uniformly across the hash space
- Not cryptographic (performance matters): MD5, Murmur3, FNV are common
- NOT Python’s
hash()— it’s randomized per process; use a stable hash
Consistent hashing (ring-based):
Consistent hashing ring (hash space 0 → 2^64):
key_X (hash=25)
↓
0 [Node A @ 10] [Node B @ 45] [Node C @ 80] 2^64
────●────────────────────●──────────────────●────────────●
key_X (hash=25) → Node B (first node clockwise from 25)
Add Node D at position 35:
────●─────────●──────●──────────────────●────────────────●
A@10 D@35 key_X B@45 C@80
key_X (hash=25) → now goes to Node D
Only keys in range [10, 35] need to move!
All other keys unchanged → minimal rebalancing
Advantages:
- Even distribution of data and load (for uniformly-distributed keys)
- No hot spots from sequential key patterns (timestamps, auto-increment IDs)
- Consistent hashing: adding/removing nodes only requires moving ~1/N of the data
Disadvantages:
- Range queries impossible across shard boundaries: adjacent logical keys are now scattered
- Cassandra compromise: Compound primary key — hash the first part (partition key), sort the rest (clustering key) within one shard
- Skewed workloads: Popular keys still cause hot spots regardless of hash distribution
Used by: Cassandra (partition key), DynamoDB, MongoDB (hash-based sharding), Redis Cluster
Skewed Workloads and Relieving Hot Spots
Hot spot definition: A single shard receiving disproportionately high read or write load compared to others.
Celebrity/viral content problem: When a celebrity posts (Twitter/Instagram/YouTube), millions of users simultaneously read that one item → all requests to a single shard.
Hash partitioning does NOT solve this: even if celebrity user IDs are evenly distributed, the access frequency for one key vastly exceeds others.
Mitigation strategies:
Strategy 1: Random suffix on hot writes
─────────────────────────────────────────
Write: celebrity_123_00, celebrity_123_01, ..., celebrity_123_99
(writes spread across 100 shards)
Read: query all 100 keys, merge results
(read complexity increases, but load spreads)
Trade-off: Consistent aggregation requires reading from all 100 shards.
Only worth it if writes are the bottleneck.
Strategy 2: Read replicas for hot shards
─────────────────────────────────────────
Increase replication factor for hot shard's primary
Route reads to replicas → horizontal read scaling
Write still goes to single shard primary
Strategy 3: Application-layer caching
───────────────────────────────────────
Celebrity's content → Redis/Memcached in front of DB
90%+ of reads served from cache, DB shard not hit
Works well for content that is immutable or rarely changes
Strategy 4: CDN offload
────────────────────────
Celebrity's public posts → CDN edge cache
Read requests never reach the database at all
Only feasible for public, static or semi-static content
Modern databases (Cassandra 4.x, DynamoDB): token-aware load balancing and automatic read routing reduce hot-spot impact but don’t eliminate them.
Operations: Automatic Versus Manual Rebalancing
Why rebalancing is needed: Load changes over time — data grows, access patterns shift, nodes are added for capacity or removed for failures. Shards must be redistributed to maintain balance.
Strategy 1: Fixed number of shards (Riak, Elasticsearch, Couchbase)
Initial: 1000 shards across 10 nodes → 100 shards/node
Add 11th node: move ~91 shards from existing nodes → ~91 shards/node
Shard count fixed at creation — changing it requires full data migration
- Pro: Simple; no shard splitting logic needed
- Con: Must choose shard count at creation; too few = can’t scale; too many = overhead per shard
Strategy 2: Dynamic sharding (HBase, RethinkDB, MongoDB)
Shard split trigger: shard size > threshold (e.g., 10 GB)
→ Split into two 5 GB shards
→ Migrate one half to another node
Shard merge trigger: shard size < threshold (e.g., 1 GB)
→ Merge with adjacent shard
- Pro: Adapts automatically to data growth
- Con: Single shard at startup (all data hits one node until first split); split storms possible
Strategy 3: Shards proportional to nodes (Cassandra)
Each node gets N virtual nodes (vnodes), e.g., 256
10 nodes → 2560 vnodes
Add 11th node: randomly steal ~232 vnodes (one from each existing node)
More nodes = more total vnodes = finer load granularity
- Pro: Scales naturally; random stealing distributes load evenly
- Con: Many vnodes = more gossip overhead; higher coordination cost
Automatic vs manual rebalancing:
| Automatic | Manual | |
|---|---|---|
| Convenience | High — no operator action needed | Low — requires operator judgment |
| Safety | Risk: rebalancing during a failure amplifies stress on surviving nodes | Safe — operator can choose a quiet time |
| Speed | Immediate reaction to imbalance | Delayed by human availability |
| Best for | Dev environments, managed cloud services | Production clusters, SLA-sensitive systems |
The danger of automatic rebalancing during failures:
- One node slows down or times out
- Auto-rebalancer interprets this as node failure → begins moving its shards
- Rebalancing = heavy network + disk I/O on the surviving nodes
- Already-stressed nodes get more load → cascading failure cascade
- Best practice: Add delays, require quorum agreement, or require human approval for large rebalancing operations
Request Routing
Problem: When a client wants to read or write key K, which of the N nodes should it contact?
Three approaches:
Approach 1: Any-node forwarding (gossip protocol)
──────────────────────────────────────────────────
Client → Node A (random)
↓ (if A doesn't own key K)
Node A → forwards to Node B (which owns K)
Node B returns result to Node A
Node A returns result to Client
Extra hop overhead. Used by: Cassandra (gossip), Riak
Approach 2: Routing tier (proxy layer)
────────────────────────────────────────
Client → Routing Tier (ZooKeeper-backed)
↓ (tier knows current shard→node mapping)
Correct Node → result → Client
Single extra network hop but no forwarding between data nodes.
Used by: Kafka (via ZooKeeper/KRaft), some MongoDB configs
Approach 3: Partition-aware clients
──────────────────────────────────────
Client maintains shard map (subscribed to ZooKeeper/etcd updates)
Client → Correct Node directly (no extra hops)
Lowest latency. Used by: Kafka producer/consumer API, some Cassandra drivers
Metadata management: ZooKeeper, etcd, or KRaft store the authoritative shard → node mapping. All nodes, routing tiers, and smart clients subscribe to updates. When a shard moves (rebalancing), ZooKeeper notifies all subscribers.
ZooKeeper-based routing in Kafka:
- Topic partitions assigned to brokers
- Partition metadata stored in ZooKeeper (Kafka <3.x) or KRaft log (Kafka 3.x+)
- Producer/consumer clients fetch metadata on startup, cache it, refresh on errors
- Partition leadership changes propagate quickly via metadata updates
Sharding and Secondary Indexes
Primary key sharding is conceptually clean: hash or range on the primary key, done. But secondary indexes (indexes on non-primary-key columns) are fundamentally harder because the secondary index attribute doesn’t determine which shard holds the data.
Local Secondary Indexes (Document-Partitioned)
How it works: Each shard maintains its own secondary index for its local data only.
Local Secondary Index (document-partitioned):
──────────────────────────────────────────────
Shard 0: Shard 1: Shard 2:
data: car_01 {color:red} data: car_07 {color:red} data: car_15 {color:blue}
data: car_02 {color:blue} data: car_08 {color:blue} data: car_16 {color:red}
local_idx: local_idx: local_idx:
red → [car_01] red → [car_07, car_09] red → [car_16]
blue → [car_02] blue → [car_08] blue → [car_15]
WRITE: update one shard's local index → fast, single-shard write
READ for color=red:
→ Scatter: query ALL shards for their local red index
→ Gather: merge [car_01] + [car_07, car_09] + [car_16] = [car_01, car_07, car_09, car_16]
→ Tail latency = slowest responding shard
Used by: MongoDB, Riak, Cassandra (with ALLOW FILTERING), Elasticsearch, VoltDB
Global Secondary Indexes (Term-Partitioned)
How it works: A single global index, itself sharded by the index term (not by the primary key).
Global Secondary Index (term-partitioned):
───────────────────────────────────────────
Index shard for "red": Index shard for "blue":
red → [shard0/car_01, blue → [shard0/car_02,
shard1/car_07, shard1/car_08,
shard1/car_09, shard2/car_15]
shard2/car_16]
WRITE car_01 {color:red}:
→ write to data shard (shard 0) + write to index shard for "red"
→ TWO shard writes; often async → eventual consistency
READ for color=red:
→ query ONE index shard for "red"
→ get list of pointers → fetch data from individual shards
→ fast index lookup, but index may be slightly stale
Used by: DynamoDB Global Secondary Indexes, Riak Search
Local vs Global Secondary Index Comparison
| Dimension | Local (Document-Partitioned) | Global (Term-Partitioned) |
|---|---|---|
| Write cost | Single shard write | Cross-shard write (2+ shards) |
| Write latency | Low | Higher (especially with sync updates) |
| Read cost | Scatter/gather across all shards | Single index shard lookup |
| Read latency | High (tail latency = slowest shard) | Low |
| Consistency | Immediately consistent with data | Often eventually consistent (async) |
| Maintenance | Each shard maintains its own | Global index must be kept in sync |
| Best for | Write-heavy workloads; data is large | Read-heavy secondary access; small # of indexed values |
| Examples | MongoDB, Elasticsearch, Cassandra | DynamoDB GSI, Riak Search |
Comparison Tables
Sharding Strategy Comparison
| Strategy | Range Queries | Distribution | Hot Spots | Rebalancing Ease | Used By |
|---|---|---|---|---|---|
| Key-range | Excellent | Uneven (by design) | Risk with sequential keys | Dynamic splitting | HBase, Bigtable, Spanner |
| Hash | Poor (scatter) | Excellent | Still possible (access skew) | Consistent hashing | Cassandra, DynamoDB, Redis |
| Compound (hash+range) | Within-shard only | Good | Low | Moderate | Cassandra, DynamoDB |
| Geographic/geohash | Nearby cells | Good for spatial | Dense areas | Range-based | PostGIS, Uber H3 |
Rebalancing Strategy Comparison
| Strategy | Auto-adapts | Min Movement | Startup Hotspot | Complexity | Used By |
|---|---|---|---|---|---|
| Fixed shard count | No (manual redistribution) | Low | No | Low | Elasticsearch, Riak |
| Dynamic splitting | Yes (by size) | Low | Yes (one shard start) | Medium | HBase, MongoDB |
| Vnodes/proportional | Yes (by node count) | Medium | No | Medium | Cassandra |
Important Points Summary
- Sharding is the renamed term for partitioning: The 2nd edition adopts industry terminology (Elasticsearch, MongoDB) over the academic “partitioning.”
- Shard key choice is the most important sharding decision: It determines data distribution, query patterns, and hot-spot risk. Wrong choices are hard to fix without full data migration.
- Hash sharding and range sharding are not compatible: Choosing one forces you to give up the other unless you use compound keys.
- Multitenancy adds a second dimension to sharding: The tenant boundary must be respected for both isolation and performance. Row-level, schema-level, and database-level isolation each have different operational trade-offs.
- Hot spots require application-level solutions: No sharding strategy automatically solves for access-pattern skew (celebrity posts, viral content). Cache + CDN + random suffix are the main tools.
- Local secondary indexes are write-fast, read-expensive: Scatter/gather across all shards produces tail-latency amplification proportional to shard count.
- Global secondary indexes are read-fast but eventually consistent: Writes must asynchronously update the global index; reads may see stale data.
- Automatic rebalancing during failures is dangerous: Combining heavy data movement with a degraded cluster can cascade a partial failure into a total outage.
- ZooKeeper/etcd/KRaft provide the shard metadata layer: All routing approaches ultimately depend on a consistent, fault-tolerant metadata store.
- NewSQL databases (CockroachDB, Spanner) make sharding transparent: Automatic range splitting and rebalancing with full ACID transactions — the “no shard key needed” promise.
Modern Context (2026)
The shift from “partition” to “shard”:
The 2nd edition rename reflects a genuine industry shift. When Elasticsearch, MongoDB, and Redis all call them “shards,” using “partition” creates unnecessary vocabulary friction. The 2nd edition aligns with the practitioner’s lexicon.
Serverless auto-sharding:
- DynamoDB on-demand: shards scale automatically with traffic, invisible to the application
- Neon (PostgreSQL): sharding at the storage layer; application queries a logical single database
- PlanetScale (Vitess): horizontal sharding for MySQL with transparent VTGate routing
- Applications no longer need to design shard keys for many use cases
NewSQL transparent sharding:
- CockroachDB and YugabyteDB: automatic range-based sharding with Raft replication per shard
- Spanner: TrueTime + Paxos + automatic shard splits; globally consistent
- Cross-shard transactions are first-class (not bolted on); latency is higher but correctness is guaranteed
Kafka KRaft (metadata without ZooKeeper):
- Kafka 3.x+ removes ZooKeeper dependency for partition metadata
- KRaft protocol: metadata stored in a Kafka-internal Raft log
- Simpler ops, faster partition reassignment, single fewer system to operate
Vector database sharding:
- ANN (approximate nearest neighbor) search requires specialized sharding: index replicated or partitioned by cluster assignment (k-means), not by key range or hash
- Pinecone, Weaviate, Milvus: purpose-built sharding strategies for high-dimensional vector search
Questions for Reflection
- Why does the 2nd edition rename “partitioning” to “sharding,” and what does this reflect about how the field has evolved?
- A SaaS company has 50,000 small tenants and 5 enterprise tenants who each generate 40% of the load. How would you design the sharding strategy to isolate noisy neighbors while serving small tenants cheaply?
- A write-heavy IoT application stores sensor readings keyed by
(sensor_id, timestamp). The team proposes hash-partitioning bysensor_id. What hot-spot risk does this introduce, and how would you mitigate it? - Explain why scatter/gather reads against a local secondary index produce worse tail latency than a direct shard lookup. What is the formal relationship between P99 latency and shard count?
- When is manual rebalancing safer than automatic rebalancing? Describe a real failure scenario where automatic rebalancing would have amplified an incident.
- How does CockroachDB’s approach to sharding differ from Cassandra’s? In what scenarios would you prefer each?
Related Resources
- ch06-replication — Replication is combined with sharding; each shard has replicas
- ch08-transactions — Cross-shard transactions require distributed coordination
- ch10-consistency-and-consensus — Raft/Paxos used for shard metadata in CockroachDB/Spanner
- ch06-partitioning — 1st edition version (Partitioning) for comparison
- ch09-trouble-with-distributed-systems — Network failures affect shard routing and rebalancing
Last Updated: 2026-05-29