Chapter 6: Partitioning
Overview
Replication copies data to multiple nodes; partitioning (sharding) splits data so each node holds only a subset. Partitioning is necessary when data is too large or too high-throughput for a single machine. The primary challenge is choosing a partitioning scheme that distributes data and query load evenly—avoiding hot spots (where one partition receives disproportionate traffic).
Key insight: Partitioning and replication are usually combined—each partition is replicated to multiple nodes. The two concepts are orthogonal.
Key Concepts
Partitioning by Key Range
How it works:
- Assign a continuous range of keys to each partition
- Keys within each partition kept sorted (like a B-tree)
- Example: Encyclopedia volumes A-D, E-H, I-M, N-R, S-Z
Used by: HBase, Google Bigtable, MongoDB (range-based), early RethinkDB
Advantages:
- Range queries are efficient (scan contiguous partition)
- Keys stored in sorted order → range scans work
Disadvantages:
- Hot spots from access patterns: If keys are timestamps, all writes go to “today’s” partition
- Fix: Prefix timestamp with sensor ID (so different sensors go to different partitions)
- Uneven distribution: Some ranges may have much more data than others
- Manual or automatic rebalancing needed
Partitioning by Hash of Key
How it works:
- Apply hash function to key → uniform distribution across partitions
- Each partition gets a range of hash values (consistent hashing or simple modulo)
- Example: Cassandra, MongoDB (hash-based), DynamoDB
Hash function properties:
- Must be deterministic (same key → same hash)
- Should distribute uniformly
- Not a cryptographic hash (performance matters); MD5, Murmur3, FNV
Consistent hashing:
- Ring of hash values 0..2
- Nodes placed at random positions on ring
- Key goes to first node clockwise from key’s hash
- Adding/removing node: only affects adjacent keys (minimal rebalancing)
Advantages:
- Even distribution of data and load
- Works well when keys are random/uniform
Disadvantages:
- Range queries impossible: Adjacent keys now scattered across partitions
- Cassandra compromise: Compound primary key; hash first part, range on rest
- Skewed workloads: Popular keys (celebrity user) still cause hot spots
Skewed Workloads and Hot Spots
Hot spot: One partition receives much more load than others
Celebrity problem (Twitter/Instagram): When a celebrity posts, all their followers read that post → millions of reads to one partition
Application-level workarounds:
- Append random suffix to hot key (splits writes across partitions, but reads must combine from all)
- Detect hot keys, route to special handling
- Modern DBs (Cassandra 4.x): Token-aware load balancing
When skew is unavoidable: Application must be designed to accept it, or use caching/CDN in front
Partitioning Secondary Indexes
Problem: Primary key partitioning is clear. But secondary indexes (on non-PK columns) are harder—they don’t map cleanly to one partition.
Approach 1: Document-Partitioned (Local) Secondary Indexes:
- Each partition maintains its own secondary index for its local data
- Example: Partition 1 has index of red cars owned by users in that partition
- Write: update index in the local partition (fast, single partition)
- Read: must query all partitions and combine results (scatter/gather)
- Scatter/gather is expensive (tail latency = slowest partition)
- Used by: MongoDB, Riak, Cassandra, Elasticsearch, VoltDB
Approach 2: Term-Partitioned (Global) Secondary Indexes:
- One global index partitioned by the index term
- Example: All red cars across all partitions stored in one “red” index partition
- Write: must update global index (which may be on a different partition) → cross-partition write
- Read: efficient (index directs you to the exact partition)
- Used by: DynamoDB Global Secondary Indexes, Riak Search
| Document-Partitioned | Term-Partitioned | |
|---|---|---|
| Writes | Single partition | Cross-partition |
| Reads | Scatter/gather | Single index partition |
| Consistency | Immediately consistent | Eventually consistent (async) |
Rebalancing Partitions
Why rebalancing: Load changes over time; nodes added/removed; data grows
Strategy 1: Fixed number of partitions:
- Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes)
- When adding node, move some partitions from existing nodes to new node
- Number of partitions fixed at DB creation time
- Used by: Riak, Elasticsearch, Couchbase, Voldemort
Strategy 2: Dynamic partitioning:
- Partition splits when it grows beyond a threshold (e.g., HBase: 10GB default)
- Partition merges when it shrinks below a threshold
- Adapts to data distribution automatically
- Used by: HBase, RethinkDB, MongoDB
Strategy 3: Fixed partitions per node:
- Number of partitions proportional to number of nodes
- Each node has fixed number of partitions (e.g., 256 per node in Cassandra)
- When adding node: new node steals random partitions from each existing node
- Used by: Cassandra, Ketama
Automatic vs manual rebalancing:
- Automatic: Convenient but risky (rebalancing during failure can cascade)
- Rebalancing moves lots of data; if done during incident, makes things worse
- Manual: Admin decides when to rebalance; safer for operational control
Request Routing
Problem: Client needs to know which node holds a given partition for a given key
Approaches:
- Allow clients to contact any node: Node either handles or forwards (Cassandra gossip protocol)
- Routing tier: Clients contact a routing service that knows current partition assignment
- Routing tier forwards to correct node (Kafka uses ZooKeeper / KRaft for this)
- Partition-aware clients: Clients track partition assignment directly
- Client library knows which node to contact
Partition metadata management:
- ZooKeeper (or etcd, KRaft): Store authoritative partition → node mapping
- Nodes and routing tier subscribe to ZooKeeper for updates
- Used by: Kafka, HBase, Solr
Important Points
- Partitioning solves scale; replication solves availability: Use both together.
- Hash partitioning trades range queries for uniform distribution: Can’t have both.
- Hot spots require application-level workarounds: Databases can’t automatically fix skewed access patterns.
- Secondary indexes complicate partitioning: Document-partitioned = scatter/gather reads; term-partitioned = cross-partition writes.
- Rebalancing should be manual or carefully controlled: Automatic rebalancing during failures can amplify the incident.
- Consistent hashing minimizes data movement: Only nearby keys affected when nodes added/removed.
Examples & Case Studies
-
Cassandra Compound Primary Key
- Partition key: hashed (uniform distribution)
- Clustering key: sorted within partition (enables range scans within partition)
- Example:
(user_id, timestamp)— all events for user_id in one partition, sorted by time
-
DynamoDB GSI (Global Secondary Index)
- Write: async propagation to GSI partition (eventually consistent)
- Read: direct lookup on GSI partition (efficient)
- Trade-off: GSI reads may be slightly stale
-
Elasticsearch
- Document-partitioned (called “shards”)
- Each shard has its own Lucene index
- Search = scatter to all shards, merge results
- Routing key can be set per document to co-locate related documents
-
Kafka Partitions
- Topic divided into partitions
- Message key determines partition (via hash)
- Messages with same key always go to same partition (ordering guarantee)
- Consumer group: each partition consumed by one consumer
Questions
- What is the difference between key-range and hash partitioning?
- Why do range queries become inefficient with hash partitioning?
- What is a hot spot and how do you mitigate it?
- What are the trade-offs between document-partitioned and term-partitioned secondary indexes?
- When would you choose dynamic partitioning over fixed partitioning?
- How does consistent hashing minimize rebalancing overhead?
- What is the role of ZooKeeper in partition-aware routing?
- How does Cassandra’s compound primary key combine hash and range partitioning?
Modern Context (2026)
Distributed SQL (NewSQL):
- CockroachDB, Spanner, YugabyteDB: automatic range partitioning with cross-partition transactions
- Range partitions (ranges) automatically split and rebalance
- Transparent to application — no manual shard key design
Serverless auto-partitioning:
- DynamoDB on-demand: partitions scale automatically with traffic
- Neon (PostgreSQL): automatic partitioning at storage layer
- Application doesn’t choose or manage partitions
Kafka KRaft (no ZooKeeper):
- Kafka 3.x+: KRaft protocol replaces ZooKeeper for partition metadata
- Metadata stored in Kafka itself (replicated log)
- Eliminates ZooKeeper operational complexity
Vitess (PlanetScale):
- Horizontal sharding layer for MySQL
- Transparent to application: VTGate routes queries to correct shard
- Enables resharding without downtime
TimeSeries partitioning:
- Time-based partitioning now universal in modern time-series DBs
- InfluxDB, TimescaleDB, Prometheus: automatic time-based partitioning
- Old time ranges can be compressed/archived independently
Status: Notes complete
Last Updated: 2026-04-13