Chapter 7: Sharding

ddia-2e sharding partitioning distributed-systems

Status: Notes complete

Overview

Sharding (the 2nd edition’s renamed version of “Partitioning”) is the technique of splitting a large dataset across multiple nodes so each node holds only a subset of the data. When a single machine cannot hold or serve all the data, sharding enables horizontal scaling: adding more nodes increases both storage capacity and query throughput. The central engineering challenge is choosing a sharding strategy that distributes data and load evenly, while supporting the query patterns the application needs — and handling the operational realities of nodes joining and leaving the cluster over time.

2nd Edition change: The chapter is renamed from “Partitioning” to “Sharding” to match modern industry terminology used by Elasticsearch, MongoDB, Vitess, and most cloud databases. The content is largely equivalent to 1st edition Ch6 but adds a substantial new section on Sharding for Multitenancy — critical for SaaS platform design. The term “shard” is used throughout where the 1st edition used “partition.”

Key insight: Sharding and replication are orthogonal and usually combined — each shard is replicated to multiple nodes. Sharding solves scale; replication solves availability.

Key Concepts

Pros and Cons of Sharding

Why shard?

Dataset exceeds single-node storage capacity
Read/write throughput exceeds single-node limits
Latency requirements demand data be geographically close to users
Regulatory requirements mandate data residency in specific regions

Why NOT shard (yet)?

Operational complexity increases dramatically: rebalancing, schema changes, cross-shard queries all become harder
Cross-shard transactions require distributed coordination (2PC or saga pattern)
Secondary indexes become either scatter/gather (local) or eventually consistent (global)
Debugging, monitoring, and tooling are all more complex

Decision rule: Start without sharding. Shard when you have a concrete bottleneck (storage, throughput, or latency) that cannot be addressed by vertical scaling, caching, or read replicas alone.

Sharding for Multitenancy

Problem context: SaaS platforms serve many customers (tenants) from a shared infrastructure. The key question is how to isolate tenant data — both for security/compliance and to prevent one tenant’s heavy load from degrading another’s (the “noisy neighbor” problem).

Three isolation models:

Model	Description	Pros	Cons	Examples
Row-level isolation	All tenants share tables; `tenant_id` column distinguishes rows	Low overhead, simple schema	Noisy neighbor risk; accidental data leakage if bugs exist	Shared SaaS apps
Schema-level isolation	Each tenant gets their own schema within a shared database	Moderate isolation; easier backups per tenant	Schema count limits (PostgreSQL: ~100s practical)	Shopify, multi-tenant PostgreSQL
Database-level isolation	Each tenant gets a separate database (or even cluster)	Full isolation, easy compliance, simple backups	Highest overhead; hard to aggregate cross-tenant	Enterprise SaaS with large customers

Sharding by tenant: For row-level isolation, the tenant_id is typically the shard key (or first component of a compound shard key). This guarantees all data for one tenant lands on the same shard, enabling efficient tenant-scoped queries without cross-shard joins.

Hybrid approach (common in practice):

Small/free-tier tenants: row-level isolation on shared shards
Large/enterprise tenants: dedicated shards or dedicated databases
“Tenant routing” layer maps tenant_id → shard dynamically

Noisy neighbor mitigation:

Rate limiting per tenant at the application layer
Tenant-aware load balancing (route heavy tenants to dedicated nodes)
Burst absorption via auto-scaling (serverless databases)

Compliance considerations (GDPR, HIPAA):

Database-level isolation simplifies data deletion (drop database for a tenant)
Row-level isolation requires careful soft-delete or purge logic
Schema-level isolation can work if schema-level backup/restore is automated

Sharding of Key-Value Data

Sharding by Key Range

How it works: Assign a continuous range of keys to each shard. Keys within each shard are kept sorted (typically in a B-tree).

Lexicographic key-range sharding:
─────────────────────────────────────────────────────────────────
Key:    [A───────D]  [E──────────M]  [N──────────R]  [S────────Z]
Shard:    Shard 0       Shard 1         Shard 2         Shard 3
─────────────────────────────────────────────────────────────────

Advantages:

Range queries are efficient: scan one contiguous shard
Keys are stored sorted → range scans, prefix queries, “next N” queries all work
Data locality: related keys (e.g., same user’s time-ordered events) land on the same shard

Disadvantages:

Hot spots from access patterns: If the key is a timestamp, all writes go to “today’s” shard
- Fix: Prefix timestamp with a sensor/user ID so different entities go to different shards
Uneven distribution: Some key ranges may contain far more data than others
Split boundaries must be manually set or dynamically managed

Used by: HBase, Google Bigtable, MongoDB (range-based), CockroachDB (automatic range splits), Spanner

Sharding by Hash of Key

How it works: Apply a hash function to the key; the hash value determines which shard receives the data.

Hash-based sharding:
─────────────────────────────────────────────────────────────────
hash("user_001") = 0x3A → Shard 2
hash("user_002") = 0x91 → Shard 5
hash("user_003") = 0x17 → Shard 1
(hash values evenly spread across shard space)
─────────────────────────────────────────────────────────────────

Hash function properties:

Must be deterministic: same key always produces same hash
Should distribute uniformly across the hash space
Not cryptographic (performance matters): MD5, Murmur3, FNV are common
NOT Python’s hash() — it’s randomized per process; use a stable hash

Consistent hashing (ring-based):

Consistent hashing ring (hash space 0 → 2^64):

          key_X (hash=25)
              ↓
  0   [Node A @ 10]   [Node B @ 45]   [Node C @ 80]   2^64
  ────●────────────────────●──────────────────●────────────●
  
  key_X (hash=25) → Node B (first node clockwise from 25)
  
  Add Node D at position 35:
  ────●─────────●──────●──────────────────●────────────────●
      A@10    D@35  key_X B@45            C@80
  
  key_X (hash=25) → now goes to Node D
  Only keys in range [10, 35] need to move!
  All other keys unchanged → minimal rebalancing

Advantages:

Even distribution of data and load (for uniformly-distributed keys)
No hot spots from sequential key patterns (timestamps, auto-increment IDs)
Consistent hashing: adding/removing nodes only requires moving ~1/N of the data

Disadvantages:

Range queries impossible across shard boundaries: adjacent logical keys are now scattered
- Cassandra compromise: Compound primary key — hash the first part (partition key), sort the rest (clustering key) within one shard
Skewed workloads: Popular keys still cause hot spots regardless of hash distribution

Used by: Cassandra (partition key), DynamoDB, MongoDB (hash-based sharding), Redis Cluster

Skewed Workloads and Relieving Hot Spots

Hot spot definition: A single shard receiving disproportionately high read or write load compared to others.

Celebrity/viral content problem: When a celebrity posts (Twitter/Instagram/YouTube), millions of users simultaneously read that one item → all requests to a single shard.

Hash partitioning does NOT solve this: even if celebrity user IDs are evenly distributed, the access frequency for one key vastly exceeds others.

Mitigation strategies:

Strategy 1: Random suffix on hot writes
─────────────────────────────────────────
Write: celebrity_123_00, celebrity_123_01, ..., celebrity_123_99
       (writes spread across 100 shards)
Read:  query all 100 keys, merge results
       (read complexity increases, but load spreads)

Trade-off: Consistent aggregation requires reading from all 100 shards.
           Only worth it if writes are the bottleneck.

Strategy 2: Read replicas for hot shards
─────────────────────────────────────────
Increase replication factor for hot shard's primary
Route reads to replicas → horizontal read scaling
Write still goes to single shard primary

Strategy 3: Application-layer caching
───────────────────────────────────────
Celebrity's content → Redis/Memcached in front of DB
90%+ of reads served from cache, DB shard not hit
Works well for content that is immutable or rarely changes

Strategy 4: CDN offload
────────────────────────
Celebrity's public posts → CDN edge cache
Read requests never reach the database at all
Only feasible for public, static or semi-static content

Modern databases (Cassandra 4.x, DynamoDB): token-aware load balancing and automatic read routing reduce hot-spot impact but don’t eliminate them.

Operations: Automatic Versus Manual Rebalancing

Why rebalancing is needed: Load changes over time — data grows, access patterns shift, nodes are added for capacity or removed for failures. Shards must be redistributed to maintain balance.

Strategy 1: Fixed number of shards (Riak, Elasticsearch, Couchbase)

Initial: 1000 shards across 10 nodes → 100 shards/node
Add 11th node: move ~91 shards from existing nodes → ~91 shards/node
Shard count fixed at creation — changing it requires full data migration

Pro: Simple; no shard splitting logic needed
Con: Must choose shard count at creation; too few = can’t scale; too many = overhead per shard

Strategy 2: Dynamic sharding (HBase, RethinkDB, MongoDB)

Shard split trigger: shard size > threshold (e.g., 10 GB)
  → Split into two 5 GB shards
  → Migrate one half to another node

Shard merge trigger: shard size < threshold (e.g., 1 GB)  
  → Merge with adjacent shard

Pro: Adapts automatically to data growth
Con: Single shard at startup (all data hits one node until first split); split storms possible

Strategy 3: Shards proportional to nodes (Cassandra)

Each node gets N virtual nodes (vnodes), e.g., 256
10 nodes → 2560 vnodes
Add 11th node: randomly steal ~232 vnodes (one from each existing node)
More nodes = more total vnodes = finer load granularity

Pro: Scales naturally; random stealing distributes load evenly
Con: Many vnodes = more gossip overhead; higher coordination cost

Automatic vs manual rebalancing:

	Automatic	Manual
Convenience	High — no operator action needed	Low — requires operator judgment
Safety	Risk: rebalancing during a failure amplifies stress on surviving nodes	Safe — operator can choose a quiet time
Speed	Immediate reaction to imbalance	Delayed by human availability
Best for	Dev environments, managed cloud services	Production clusters, SLA-sensitive systems

The danger of automatic rebalancing during failures:

One node slows down or times out
Auto-rebalancer interprets this as node failure → begins moving its shards
Rebalancing = heavy network + disk I/O on the surviving nodes
Already-stressed nodes get more load → cascading failure cascade
Best practice: Add delays, require quorum agreement, or require human approval for large rebalancing operations

Request Routing

Problem: When a client wants to read or write key K, which of the N nodes should it contact?

Three approaches:

Approach 1: Any-node forwarding (gossip protocol)
──────────────────────────────────────────────────
Client → Node A (random)
          ↓ (if A doesn't own key K)
          Node A → forwards to Node B (which owns K)
          Node B returns result to Node A
          Node A returns result to Client

Extra hop overhead. Used by: Cassandra (gossip), Riak

Approach 2: Routing tier (proxy layer)
────────────────────────────────────────
Client → Routing Tier (ZooKeeper-backed)
              ↓ (tier knows current shard→node mapping)
         Correct Node → result → Client

Single extra network hop but no forwarding between data nodes.
Used by: Kafka (via ZooKeeper/KRaft), some MongoDB configs

Approach 3: Partition-aware clients
──────────────────────────────────────
Client maintains shard map (subscribed to ZooKeeper/etcd updates)
Client → Correct Node directly (no extra hops)

Lowest latency. Used by: Kafka producer/consumer API, some Cassandra drivers

Metadata management: ZooKeeper, etcd, or KRaft store the authoritative shard → node mapping. All nodes, routing tiers, and smart clients subscribe to updates. When a shard moves (rebalancing), ZooKeeper notifies all subscribers.

ZooKeeper-based routing in Kafka:

Topic partitions assigned to brokers
Partition metadata stored in ZooKeeper (Kafka <3.x) or KRaft log (Kafka 3.x+)
Producer/consumer clients fetch metadata on startup, cache it, refresh on errors
Partition leadership changes propagate quickly via metadata updates

Sharding and Secondary Indexes

Primary key sharding is conceptually clean: hash or range on the primary key, done. But secondary indexes (indexes on non-primary-key columns) are fundamentally harder because the secondary index attribute doesn’t determine which shard holds the data.

Local Secondary Indexes (Document-Partitioned)

How it works: Each shard maintains its own secondary index for its local data only.

Local Secondary Index (document-partitioned):
──────────────────────────────────────────────
Shard 0:                    Shard 1:                    Shard 2:
  data: car_01 {color:red}    data: car_07 {color:red}    data: car_15 {color:blue}
  data: car_02 {color:blue}   data: car_08 {color:blue}   data: car_16 {color:red}
  local_idx:                  local_idx:                  local_idx:
    red → [car_01]              red → [car_07, car_09]      red → [car_16]
    blue → [car_02]             blue → [car_08]             blue → [car_15]

WRITE: update one shard's local index → fast, single-shard write

READ for color=red:
  → Scatter: query ALL shards for their local red index
  → Gather: merge [car_01] + [car_07, car_09] + [car_16] = [car_01, car_07, car_09, car_16]
  → Tail latency = slowest responding shard

Used by: MongoDB, Riak, Cassandra (with ALLOW FILTERING), Elasticsearch, VoltDB

Global Secondary Indexes (Term-Partitioned)

How it works: A single global index, itself sharded by the index term (not by the primary key).

Global Secondary Index (term-partitioned):
───────────────────────────────────────────
Index shard for "red":         Index shard for "blue":
  red → [shard0/car_01,          blue → [shard0/car_02,
         shard1/car_07,                  shard1/car_08,
         shard1/car_09,                  shard2/car_15]
         shard2/car_16]

WRITE car_01 {color:red}:
  → write to data shard (shard 0) + write to index shard for "red"
  → TWO shard writes; often async → eventual consistency

READ for color=red:
  → query ONE index shard for "red"
  → get list of pointers → fetch data from individual shards
  → fast index lookup, but index may be slightly stale

Used by: DynamoDB Global Secondary Indexes, Riak Search

Local vs Global Secondary Index Comparison

Dimension	Local (Document-Partitioned)	Global (Term-Partitioned)
Write cost	Single shard write	Cross-shard write (2+ shards)
Write latency	Low	Higher (especially with sync updates)
Read cost	Scatter/gather across all shards	Single index shard lookup
Read latency	High (tail latency = slowest shard)	Low
Consistency	Immediately consistent with data	Often eventually consistent (async)
Maintenance	Each shard maintains its own	Global index must be kept in sync
Best for	Write-heavy workloads; data is large	Read-heavy secondary access; small # of indexed values
Examples	MongoDB, Elasticsearch, Cassandra	DynamoDB GSI, Riak Search

Comparison Tables

Sharding Strategy Comparison

Strategy	Range Queries	Distribution	Hot Spots	Rebalancing Ease	Used By
Key-range	Excellent	Uneven (by design)	Risk with sequential keys	Dynamic splitting	HBase, Bigtable, Spanner
Hash	Poor (scatter)	Excellent	Still possible (access skew)	Consistent hashing	Cassandra, DynamoDB, Redis
Compound (hash+range)	Within-shard only	Good	Low	Moderate	Cassandra, DynamoDB
Geographic/geohash	Nearby cells	Good for spatial	Dense areas	Range-based	PostGIS, Uber H3

Rebalancing Strategy Comparison

Strategy	Auto-adapts	Min Movement	Startup Hotspot	Complexity	Used By
Fixed shard count	No (manual redistribution)	Low	No	Low	Elasticsearch, Riak
Dynamic splitting	Yes (by size)	Low	Yes (one shard start)	Medium	HBase, MongoDB
Vnodes/proportional	Yes (by node count)	Medium	No	Medium	Cassandra

Important Points Summary

Sharding is the renamed term for partitioning: The 2nd edition adopts industry terminology (Elasticsearch, MongoDB) over the academic “partitioning.”
Shard key choice is the most important sharding decision: It determines data distribution, query patterns, and hot-spot risk. Wrong choices are hard to fix without full data migration.
Hash sharding and range sharding are not compatible: Choosing one forces you to give up the other unless you use compound keys.
Multitenancy adds a second dimension to sharding: The tenant boundary must be respected for both isolation and performance. Row-level, schema-level, and database-level isolation each have different operational trade-offs.
Hot spots require application-level solutions: No sharding strategy automatically solves for access-pattern skew (celebrity posts, viral content). Cache + CDN + random suffix are the main tools.
Local secondary indexes are write-fast, read-expensive: Scatter/gather across all shards produces tail-latency amplification proportional to shard count.
Global secondary indexes are read-fast but eventually consistent: Writes must asynchronously update the global index; reads may see stale data.
Automatic rebalancing during failures is dangerous: Combining heavy data movement with a degraded cluster can cascade a partial failure into a total outage.
ZooKeeper/etcd/KRaft provide the shard metadata layer: All routing approaches ultimately depend on a consistent, fault-tolerant metadata store.
NewSQL databases (CockroachDB, Spanner) make sharding transparent: Automatic range splitting and rebalancing with full ACID transactions — the “no shard key needed” promise.

Modern Context (2026)

The shift from “partition” to “shard”:
The 2nd edition rename reflects a genuine industry shift. When Elasticsearch, MongoDB, and Redis all call them “shards,” using “partition” creates unnecessary vocabulary friction. The 2nd edition aligns with the practitioner’s lexicon.

Serverless auto-sharding:

DynamoDB on-demand: shards scale automatically with traffic, invisible to the application
Neon (PostgreSQL): sharding at the storage layer; application queries a logical single database
PlanetScale (Vitess): horizontal sharding for MySQL with transparent VTGate routing
Applications no longer need to design shard keys for many use cases

NewSQL transparent sharding:

CockroachDB and YugabyteDB: automatic range-based sharding with Raft replication per shard
Spanner: TrueTime + Paxos + automatic shard splits; globally consistent
Cross-shard transactions are first-class (not bolted on); latency is higher but correctness is guaranteed

Kafka KRaft (metadata without ZooKeeper):

Kafka 3.x+ removes ZooKeeper dependency for partition metadata
KRaft protocol: metadata stored in a Kafka-internal Raft log
Simpler ops, faster partition reassignment, single fewer system to operate

Vector database sharding:

ANN (approximate nearest neighbor) search requires specialized sharding: index replicated or partitioned by cluster assignment (k-means), not by key range or hash
Pinecone, Weaviate, Milvus: purpose-built sharding strategies for high-dimensional vector search

Questions for Reflection

Why does the 2nd edition rename “partitioning” to “sharding,” and what does this reflect about how the field has evolved?
A SaaS company has 50,000 small tenants and 5 enterprise tenants who each generate 40% of the load. How would you design the sharding strategy to isolate noisy neighbors while serving small tenants cheaply?
A write-heavy IoT application stores sensor readings keyed by (sensor_id, timestamp). The team proposes hash-partitioning by sensor_id. What hot-spot risk does this introduce, and how would you mitigate it?
Explain why scatter/gather reads against a local secondary index produce worse tail latency than a direct shard lookup. What is the formal relationship between P99 latency and shard count?
When is manual rebalancing safer than automatic rebalancing? Describe a real failure scenario where automatic rebalancing would have amplified an incident.
How does CockroachDB’s approach to sharding differ from Cassandra’s? In what scenarios would you prefer each?

ch06-replication — Replication is combined with sharding; each shard has replicas
ch08-transactions — Cross-shard transactions require distributed coordination
ch10-consistency-and-consensus — Raft/Paxos used for shard metadata in CockroachDB/Spanner
ch06-partitioning — 1st edition version (Partitioning) for comparison
ch09-trouble-with-distributed-systems — Network failures affect shard routing and rebalancing

Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch07-sharding

Chapter 7: Sharding

Overview

Key Concepts

Pros and Cons of Sharding

Sharding for Multitenancy

Sharding of Key-Value Data

Sharding by Key Range

Sharding by Hash of Key

Skewed Workloads and Relieving Hot Spots

Operations: Automatic Versus Manual Rebalancing

Request Routing

Sharding and Secondary Indexes

Local Secondary Indexes (Document-Partitioned)

Global Secondary Indexes (Term-Partitioned)

Local vs Global Secondary Index Comparison

Comparison Tables

Sharding Strategy Comparison

Rebalancing Strategy Comparison

Important Points Summary

Modern Context (2026)

Questions for Reflection

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch07-sharding

Chapter 7: Sharding

Overview

Key Concepts

Pros and Cons of Sharding

Sharding for Multitenancy

Sharding of Key-Value Data

Sharding by Key Range

Sharding by Hash of Key

Skewed Workloads and Relieving Hot Spots

Operations: Automatic Versus Manual Rebalancing

Request Routing

Sharding and Secondary Indexes

Local Secondary Indexes (Document-Partitioned)

Global Secondary Indexes (Term-Partitioned)

Local vs Global Secondary Index Comparison

Comparison Tables

Sharding Strategy Comparison

Rebalancing Strategy Comparison

Important Points Summary

Modern Context (2026)

Questions for Reflection

Related Resources

Graph View

Table of Contents

Backlinks