Chapter 7: Sharding

ddia-2e sharding partitioning distributed-systems

Status: Notes complete


Overview

Sharding (the 2nd edition’s renamed version of “Partitioning”) is the technique of splitting a large dataset across multiple nodes so each node holds only a subset of the data. When a single machine cannot hold or serve all the data, sharding enables horizontal scaling: adding more nodes increases both storage capacity and query throughput. The central engineering challenge is choosing a sharding strategy that distributes data and load evenly, while supporting the query patterns the application needs — and handling the operational realities of nodes joining and leaving the cluster over time.

2nd Edition change: The chapter is renamed from “Partitioning” to “Sharding” to match modern industry terminology used by Elasticsearch, MongoDB, Vitess, and most cloud databases. The content is largely equivalent to 1st edition Ch6 but adds a substantial new section on Sharding for Multitenancy — critical for SaaS platform design. The term “shard” is used throughout where the 1st edition used “partition.”

Key insight: Sharding and replication are orthogonal and usually combined — each shard is replicated to multiple nodes. Sharding solves scale; replication solves availability.


Key Concepts

Pros and Cons of Sharding

Why shard?

  • Dataset exceeds single-node storage capacity
  • Read/write throughput exceeds single-node limits
  • Latency requirements demand data be geographically close to users
  • Regulatory requirements mandate data residency in specific regions

Why NOT shard (yet)?

  • Operational complexity increases dramatically: rebalancing, schema changes, cross-shard queries all become harder
  • Cross-shard transactions require distributed coordination (2PC or saga pattern)
  • Secondary indexes become either scatter/gather (local) or eventually consistent (global)
  • Debugging, monitoring, and tooling are all more complex

Decision rule: Start without sharding. Shard when you have a concrete bottleneck (storage, throughput, or latency) that cannot be addressed by vertical scaling, caching, or read replicas alone.


Sharding for Multitenancy

Problem context: SaaS platforms serve many customers (tenants) from a shared infrastructure. The key question is how to isolate tenant data — both for security/compliance and to prevent one tenant’s heavy load from degrading another’s (the “noisy neighbor” problem).

Three isolation models:

ModelDescriptionProsConsExamples
Row-level isolationAll tenants share tables; tenant_id column distinguishes rowsLow overhead, simple schemaNoisy neighbor risk; accidental data leakage if bugs existShared SaaS apps
Schema-level isolationEach tenant gets their own schema within a shared databaseModerate isolation; easier backups per tenantSchema count limits (PostgreSQL: ~100s practical)Shopify, multi-tenant PostgreSQL
Database-level isolationEach tenant gets a separate database (or even cluster)Full isolation, easy compliance, simple backupsHighest overhead; hard to aggregate cross-tenantEnterprise SaaS with large customers

Sharding by tenant: For row-level isolation, the tenant_id is typically the shard key (or first component of a compound shard key). This guarantees all data for one tenant lands on the same shard, enabling efficient tenant-scoped queries without cross-shard joins.

Hybrid approach (common in practice):

  • Small/free-tier tenants: row-level isolation on shared shards
  • Large/enterprise tenants: dedicated shards or dedicated databases
  • “Tenant routing” layer maps tenant_id → shard dynamically

Noisy neighbor mitigation:

  • Rate limiting per tenant at the application layer
  • Tenant-aware load balancing (route heavy tenants to dedicated nodes)
  • Burst absorption via auto-scaling (serverless databases)

Compliance considerations (GDPR, HIPAA):

  • Database-level isolation simplifies data deletion (drop database for a tenant)
  • Row-level isolation requires careful soft-delete or purge logic
  • Schema-level isolation can work if schema-level backup/restore is automated

Sharding of Key-Value Data

Sharding by Key Range

How it works: Assign a continuous range of keys to each shard. Keys within each shard are kept sorted (typically in a B-tree).

Lexicographic key-range sharding:
─────────────────────────────────────────────────────────────────
Key:    [A───────D]  [E──────────M]  [N──────────R]  [S────────Z]
Shard:    Shard 0       Shard 1         Shard 2         Shard 3
─────────────────────────────────────────────────────────────────

Advantages:

  • Range queries are efficient: scan one contiguous shard
  • Keys are stored sorted → range scans, prefix queries, “next N” queries all work
  • Data locality: related keys (e.g., same user’s time-ordered events) land on the same shard

Disadvantages:

  • Hot spots from access patterns: If the key is a timestamp, all writes go to “today’s” shard
    • Fix: Prefix timestamp with a sensor/user ID so different entities go to different shards
  • Uneven distribution: Some key ranges may contain far more data than others
  • Split boundaries must be manually set or dynamically managed

Used by: HBase, Google Bigtable, MongoDB (range-based), CockroachDB (automatic range splits), Spanner

Sharding by Hash of Key

How it works: Apply a hash function to the key; the hash value determines which shard receives the data.

Hash-based sharding:
─────────────────────────────────────────────────────────────────
hash("user_001") = 0x3A → Shard 2
hash("user_002") = 0x91 → Shard 5
hash("user_003") = 0x17 → Shard 1
(hash values evenly spread across shard space)
─────────────────────────────────────────────────────────────────

Hash function properties:

  • Must be deterministic: same key always produces same hash
  • Should distribute uniformly across the hash space
  • Not cryptographic (performance matters): MD5, Murmur3, FNV are common
  • NOT Python’s hash() — it’s randomized per process; use a stable hash

Consistent hashing (ring-based):

Consistent hashing ring (hash space 0 → 2^64):

          key_X (hash=25)
              ↓
  0   [Node A @ 10]   [Node B @ 45]   [Node C @ 80]   2^64
  ────●────────────────────●──────────────────●────────────●
  
  key_X (hash=25) → Node B (first node clockwise from 25)
  
  Add Node D at position 35:
  ────●─────────●──────●──────────────────●────────────────●
      A@10    D@35  key_X B@45            C@80
  
  key_X (hash=25) → now goes to Node D
  Only keys in range [10, 35] need to move!
  All other keys unchanged → minimal rebalancing

Advantages:

  • Even distribution of data and load (for uniformly-distributed keys)
  • No hot spots from sequential key patterns (timestamps, auto-increment IDs)
  • Consistent hashing: adding/removing nodes only requires moving ~1/N of the data

Disadvantages:

  • Range queries impossible across shard boundaries: adjacent logical keys are now scattered
    • Cassandra compromise: Compound primary key — hash the first part (partition key), sort the rest (clustering key) within one shard
  • Skewed workloads: Popular keys still cause hot spots regardless of hash distribution

Used by: Cassandra (partition key), DynamoDB, MongoDB (hash-based sharding), Redis Cluster

Skewed Workloads and Relieving Hot Spots

Hot spot definition: A single shard receiving disproportionately high read or write load compared to others.

Celebrity/viral content problem: When a celebrity posts (Twitter/Instagram/YouTube), millions of users simultaneously read that one item → all requests to a single shard.

Hash partitioning does NOT solve this: even if celebrity user IDs are evenly distributed, the access frequency for one key vastly exceeds others.

Mitigation strategies:

Strategy 1: Random suffix on hot writes
─────────────────────────────────────────
Write: celebrity_123_00, celebrity_123_01, ..., celebrity_123_99
       (writes spread across 100 shards)
Read:  query all 100 keys, merge results
       (read complexity increases, but load spreads)

Trade-off: Consistent aggregation requires reading from all 100 shards.
           Only worth it if writes are the bottleneck.

Strategy 2: Read replicas for hot shards
─────────────────────────────────────────
Increase replication factor for hot shard's primary
Route reads to replicas → horizontal read scaling
Write still goes to single shard primary

Strategy 3: Application-layer caching
───────────────────────────────────────
Celebrity's content → Redis/Memcached in front of DB
90%+ of reads served from cache, DB shard not hit
Works well for content that is immutable or rarely changes

Strategy 4: CDN offload
────────────────────────
Celebrity's public posts → CDN edge cache
Read requests never reach the database at all
Only feasible for public, static or semi-static content

Modern databases (Cassandra 4.x, DynamoDB): token-aware load balancing and automatic read routing reduce hot-spot impact but don’t eliminate them.

Operations: Automatic Versus Manual Rebalancing

Why rebalancing is needed: Load changes over time — data grows, access patterns shift, nodes are added for capacity or removed for failures. Shards must be redistributed to maintain balance.

Strategy 1: Fixed number of shards (Riak, Elasticsearch, Couchbase)

Initial: 1000 shards across 10 nodes → 100 shards/node
Add 11th node: move ~91 shards from existing nodes → ~91 shards/node
Shard count fixed at creation — changing it requires full data migration
  • Pro: Simple; no shard splitting logic needed
  • Con: Must choose shard count at creation; too few = can’t scale; too many = overhead per shard

Strategy 2: Dynamic sharding (HBase, RethinkDB, MongoDB)

Shard split trigger: shard size > threshold (e.g., 10 GB)
  → Split into two 5 GB shards
  → Migrate one half to another node

Shard merge trigger: shard size < threshold (e.g., 1 GB)  
  → Merge with adjacent shard
  • Pro: Adapts automatically to data growth
  • Con: Single shard at startup (all data hits one node until first split); split storms possible

Strategy 3: Shards proportional to nodes (Cassandra)

Each node gets N virtual nodes (vnodes), e.g., 256
10 nodes → 2560 vnodes
Add 11th node: randomly steal ~232 vnodes (one from each existing node)
More nodes = more total vnodes = finer load granularity
  • Pro: Scales naturally; random stealing distributes load evenly
  • Con: Many vnodes = more gossip overhead; higher coordination cost

Automatic vs manual rebalancing:

AutomaticManual
ConvenienceHigh — no operator action neededLow — requires operator judgment
SafetyRisk: rebalancing during a failure amplifies stress on surviving nodesSafe — operator can choose a quiet time
SpeedImmediate reaction to imbalanceDelayed by human availability
Best forDev environments, managed cloud servicesProduction clusters, SLA-sensitive systems

The danger of automatic rebalancing during failures:

  • One node slows down or times out
  • Auto-rebalancer interprets this as node failure → begins moving its shards
  • Rebalancing = heavy network + disk I/O on the surviving nodes
  • Already-stressed nodes get more load → cascading failure cascade
  • Best practice: Add delays, require quorum agreement, or require human approval for large rebalancing operations

Request Routing

Problem: When a client wants to read or write key K, which of the N nodes should it contact?

Three approaches:

Approach 1: Any-node forwarding (gossip protocol)
──────────────────────────────────────────────────
Client → Node A (random)
          ↓ (if A doesn't own key K)
          Node A → forwards to Node B (which owns K)
          Node B returns result to Node A
          Node A returns result to Client

Extra hop overhead. Used by: Cassandra (gossip), Riak

Approach 2: Routing tier (proxy layer)
────────────────────────────────────────
Client → Routing Tier (ZooKeeper-backed)
              ↓ (tier knows current shard→node mapping)
         Correct Node → result → Client

Single extra network hop but no forwarding between data nodes.
Used by: Kafka (via ZooKeeper/KRaft), some MongoDB configs

Approach 3: Partition-aware clients
──────────────────────────────────────
Client maintains shard map (subscribed to ZooKeeper/etcd updates)
Client → Correct Node directly (no extra hops)

Lowest latency. Used by: Kafka producer/consumer API, some Cassandra drivers

Metadata management: ZooKeeper, etcd, or KRaft store the authoritative shard → node mapping. All nodes, routing tiers, and smart clients subscribe to updates. When a shard moves (rebalancing), ZooKeeper notifies all subscribers.

ZooKeeper-based routing in Kafka:

  • Topic partitions assigned to brokers
  • Partition metadata stored in ZooKeeper (Kafka <3.x) or KRaft log (Kafka 3.x+)
  • Producer/consumer clients fetch metadata on startup, cache it, refresh on errors
  • Partition leadership changes propagate quickly via metadata updates

Sharding and Secondary Indexes

Primary key sharding is conceptually clean: hash or range on the primary key, done. But secondary indexes (indexes on non-primary-key columns) are fundamentally harder because the secondary index attribute doesn’t determine which shard holds the data.

Local Secondary Indexes (Document-Partitioned)

How it works: Each shard maintains its own secondary index for its local data only.

Local Secondary Index (document-partitioned):
──────────────────────────────────────────────
Shard 0:                    Shard 1:                    Shard 2:
  data: car_01 {color:red}    data: car_07 {color:red}    data: car_15 {color:blue}
  data: car_02 {color:blue}   data: car_08 {color:blue}   data: car_16 {color:red}
  local_idx:                  local_idx:                  local_idx:
    red → [car_01]              red → [car_07, car_09]      red → [car_16]
    blue → [car_02]             blue → [car_08]             blue → [car_15]

WRITE: update one shard's local index → fast, single-shard write

READ for color=red:
  → Scatter: query ALL shards for their local red index
  → Gather: merge [car_01] + [car_07, car_09] + [car_16] = [car_01, car_07, car_09, car_16]
  → Tail latency = slowest responding shard

Used by: MongoDB, Riak, Cassandra (with ALLOW FILTERING), Elasticsearch, VoltDB

Global Secondary Indexes (Term-Partitioned)

How it works: A single global index, itself sharded by the index term (not by the primary key).

Global Secondary Index (term-partitioned):
───────────────────────────────────────────
Index shard for "red":         Index shard for "blue":
  red → [shard0/car_01,          blue → [shard0/car_02,
         shard1/car_07,                  shard1/car_08,
         shard1/car_09,                  shard2/car_15]
         shard2/car_16]

WRITE car_01 {color:red}:
  → write to data shard (shard 0) + write to index shard for "red"
  → TWO shard writes; often async → eventual consistency

READ for color=red:
  → query ONE index shard for "red"
  → get list of pointers → fetch data from individual shards
  → fast index lookup, but index may be slightly stale

Used by: DynamoDB Global Secondary Indexes, Riak Search

Local vs Global Secondary Index Comparison

DimensionLocal (Document-Partitioned)Global (Term-Partitioned)
Write costSingle shard writeCross-shard write (2+ shards)
Write latencyLowHigher (especially with sync updates)
Read costScatter/gather across all shardsSingle index shard lookup
Read latencyHigh (tail latency = slowest shard)Low
ConsistencyImmediately consistent with dataOften eventually consistent (async)
MaintenanceEach shard maintains its ownGlobal index must be kept in sync
Best forWrite-heavy workloads; data is largeRead-heavy secondary access; small # of indexed values
ExamplesMongoDB, Elasticsearch, CassandraDynamoDB GSI, Riak Search

Comparison Tables

Sharding Strategy Comparison

StrategyRange QueriesDistributionHot SpotsRebalancing EaseUsed By
Key-rangeExcellentUneven (by design)Risk with sequential keysDynamic splittingHBase, Bigtable, Spanner
HashPoor (scatter)ExcellentStill possible (access skew)Consistent hashingCassandra, DynamoDB, Redis
Compound (hash+range)Within-shard onlyGoodLowModerateCassandra, DynamoDB
Geographic/geohashNearby cellsGood for spatialDense areasRange-basedPostGIS, Uber H3

Rebalancing Strategy Comparison

StrategyAuto-adaptsMin MovementStartup HotspotComplexityUsed By
Fixed shard countNo (manual redistribution)LowNoLowElasticsearch, Riak
Dynamic splittingYes (by size)LowYes (one shard start)MediumHBase, MongoDB
Vnodes/proportionalYes (by node count)MediumNoMediumCassandra

Important Points Summary

  • Sharding is the renamed term for partitioning: The 2nd edition adopts industry terminology (Elasticsearch, MongoDB) over the academic “partitioning.”
  • Shard key choice is the most important sharding decision: It determines data distribution, query patterns, and hot-spot risk. Wrong choices are hard to fix without full data migration.
  • Hash sharding and range sharding are not compatible: Choosing one forces you to give up the other unless you use compound keys.
  • Multitenancy adds a second dimension to sharding: The tenant boundary must be respected for both isolation and performance. Row-level, schema-level, and database-level isolation each have different operational trade-offs.
  • Hot spots require application-level solutions: No sharding strategy automatically solves for access-pattern skew (celebrity posts, viral content). Cache + CDN + random suffix are the main tools.
  • Local secondary indexes are write-fast, read-expensive: Scatter/gather across all shards produces tail-latency amplification proportional to shard count.
  • Global secondary indexes are read-fast but eventually consistent: Writes must asynchronously update the global index; reads may see stale data.
  • Automatic rebalancing during failures is dangerous: Combining heavy data movement with a degraded cluster can cascade a partial failure into a total outage.
  • ZooKeeper/etcd/KRaft provide the shard metadata layer: All routing approaches ultimately depend on a consistent, fault-tolerant metadata store.
  • NewSQL databases (CockroachDB, Spanner) make sharding transparent: Automatic range splitting and rebalancing with full ACID transactions — the “no shard key needed” promise.

Modern Context (2026)

The shift from “partition” to “shard”:
The 2nd edition rename reflects a genuine industry shift. When Elasticsearch, MongoDB, and Redis all call them “shards,” using “partition” creates unnecessary vocabulary friction. The 2nd edition aligns with the practitioner’s lexicon.

Serverless auto-sharding:

  • DynamoDB on-demand: shards scale automatically with traffic, invisible to the application
  • Neon (PostgreSQL): sharding at the storage layer; application queries a logical single database
  • PlanetScale (Vitess): horizontal sharding for MySQL with transparent VTGate routing
  • Applications no longer need to design shard keys for many use cases

NewSQL transparent sharding:

  • CockroachDB and YugabyteDB: automatic range-based sharding with Raft replication per shard
  • Spanner: TrueTime + Paxos + automatic shard splits; globally consistent
  • Cross-shard transactions are first-class (not bolted on); latency is higher but correctness is guaranteed

Kafka KRaft (metadata without ZooKeeper):

  • Kafka 3.x+ removes ZooKeeper dependency for partition metadata
  • KRaft protocol: metadata stored in a Kafka-internal Raft log
  • Simpler ops, faster partition reassignment, single fewer system to operate

Vector database sharding:

  • ANN (approximate nearest neighbor) search requires specialized sharding: index replicated or partitioned by cluster assignment (k-means), not by key range or hash
  • Pinecone, Weaviate, Milvus: purpose-built sharding strategies for high-dimensional vector search

Questions for Reflection

  1. Why does the 2nd edition rename “partitioning” to “sharding,” and what does this reflect about how the field has evolved?
  2. A SaaS company has 50,000 small tenants and 5 enterprise tenants who each generate 40% of the load. How would you design the sharding strategy to isolate noisy neighbors while serving small tenants cheaply?
  3. A write-heavy IoT application stores sensor readings keyed by (sensor_id, timestamp). The team proposes hash-partitioning by sensor_id. What hot-spot risk does this introduce, and how would you mitigate it?
  4. Explain why scatter/gather reads against a local secondary index produce worse tail latency than a direct shard lookup. What is the formal relationship between P99 latency and shard count?
  5. When is manual rebalancing safer than automatic rebalancing? Describe a real failure scenario where automatic rebalancing would have amplified an incident.
  6. How does CockroachDB’s approach to sharding differ from Cassandra’s? In what scenarios would you prefer each?

Last Updated: 2026-05-29