Chapter 6: Partitioning

Overview

Replication copies data to multiple nodes; partitioning (sharding) splits data so each node holds only a subset. Partitioning is necessary when data is too large or too high-throughput for a single machine. The primary challenge is choosing a partitioning scheme that distributes data and query load evenly—avoiding hot spots (where one partition receives disproportionate traffic).

Key insight: Partitioning and replication are usually combined—each partition is replicated to multiple nodes. The two concepts are orthogonal.

Key Concepts

Partitioning by Key Range

How it works:

Assign a continuous range of keys to each partition
Keys within each partition kept sorted (like a B-tree)
Example: Encyclopedia volumes A-D, E-H, I-M, N-R, S-Z

Used by: HBase, Google Bigtable, MongoDB (range-based), early RethinkDB

Advantages:

Range queries are efficient (scan contiguous partition)
Keys stored in sorted order → range scans work

Disadvantages:

Hot spots from access patterns: If keys are timestamps, all writes go to “today’s” partition
- Fix: Prefix timestamp with sensor ID (so different sensors go to different partitions)
Uneven distribution: Some ranges may have much more data than others
Manual or automatic rebalancing needed

Partitioning by Hash of Key

How it works:

Apply hash function to key → uniform distribution across partitions
Each partition gets a range of hash values (consistent hashing or simple modulo)
Example: Cassandra, MongoDB (hash-based), DynamoDB

Hash function properties:

Must be deterministic (same key → same hash)
Should distribute uniformly
Not a cryptographic hash (performance matters); MD5, Murmur3, FNV

Consistent hashing:

Ring of hash values 0..2
Nodes placed at random positions on ring
Key goes to first node clockwise from key’s hash
Adding/removing node: only affects adjacent keys (minimal rebalancing)

Advantages:

Even distribution of data and load
Works well when keys are random/uniform

Disadvantages:

Range queries impossible: Adjacent keys now scattered across partitions
- Cassandra compromise: Compound primary key; hash first part, range on rest
Skewed workloads: Popular keys (celebrity user) still cause hot spots

Skewed Workloads and Hot Spots

Hot spot: One partition receives much more load than others

Celebrity problem (Twitter/Instagram): When a celebrity posts, all their followers read that post → millions of reads to one partition

Application-level workarounds:

Append random suffix to hot key (splits writes across partitions, but reads must combine from all)
Detect hot keys, route to special handling
Modern DBs (Cassandra 4.x): Token-aware load balancing

When skew is unavoidable: Application must be designed to accept it, or use caching/CDN in front

Partitioning Secondary Indexes

Problem: Primary key partitioning is clear. But secondary indexes (on non-PK columns) are harder—they don’t map cleanly to one partition.

Approach 1: Document-Partitioned (Local) Secondary Indexes:

Each partition maintains its own secondary index for its local data
Example: Partition 1 has index of red cars owned by users in that partition
Write: update index in the local partition (fast, single partition)
Read: must query all partitions and combine results (scatter/gather)
- Scatter/gather is expensive (tail latency = slowest partition)
Used by: MongoDB, Riak, Cassandra, Elasticsearch, VoltDB

Approach 2: Term-Partitioned (Global) Secondary Indexes:

One global index partitioned by the index term
Example: All red cars across all partitions stored in one “red” index partition
Write: must update global index (which may be on a different partition) → cross-partition write
Read: efficient (index directs you to the exact partition)
Used by: DynamoDB Global Secondary Indexes, Riak Search

	Document-Partitioned	Term-Partitioned
Writes	Single partition	Cross-partition
Reads	Scatter/gather	Single index partition
Consistency	Immediately consistent	Eventually consistent (async)

Rebalancing Partitions

Why rebalancing: Load changes over time; nodes added/removed; data grows

Strategy 1: Fixed number of partitions:

Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes)
When adding node, move some partitions from existing nodes to new node
Number of partitions fixed at DB creation time
Used by: Riak, Elasticsearch, Couchbase, Voldemort

Strategy 2: Dynamic partitioning:

Partition splits when it grows beyond a threshold (e.g., HBase: 10GB default)
Partition merges when it shrinks below a threshold
Adapts to data distribution automatically
Used by: HBase, RethinkDB, MongoDB

Strategy 3: Fixed partitions per node:

Number of partitions proportional to number of nodes
Each node has fixed number of partitions (e.g., 256 per node in Cassandra)
When adding node: new node steals random partitions from each existing node
Used by: Cassandra, Ketama

Automatic vs manual rebalancing:

Automatic: Convenient but risky (rebalancing during failure can cascade)
- Rebalancing moves lots of data; if done during incident, makes things worse
Manual: Admin decides when to rebalance; safer for operational control

Request Routing

Problem: Client needs to know which node holds a given partition for a given key

Approaches:

Allow clients to contact any node: Node either handles or forwards (Cassandra gossip protocol)
Routing tier: Clients contact a routing service that knows current partition assignment
- Routing tier forwards to correct node (Kafka uses ZooKeeper / KRaft for this)
Partition-aware clients: Clients track partition assignment directly
- Client library knows which node to contact

Partition metadata management:

ZooKeeper (or etcd, KRaft): Store authoritative partition → node mapping
Nodes and routing tier subscribe to ZooKeeper for updates
Used by: Kafka, HBase, Solr

Important Points

Partitioning solves scale; replication solves availability: Use both together.
Hash partitioning trades range queries for uniform distribution: Can’t have both.
Hot spots require application-level workarounds: Databases can’t automatically fix skewed access patterns.
Secondary indexes complicate partitioning: Document-partitioned = scatter/gather reads; term-partitioned = cross-partition writes.
Rebalancing should be manual or carefully controlled: Automatic rebalancing during failures can amplify the incident.
Consistent hashing minimizes data movement: Only nearby keys affected when nodes added/removed.

Examples & Case Studies

Cassandra Compound Primary Key
- Partition key: hashed (uniform distribution)
- Clustering key: sorted within partition (enables range scans within partition)
- Example: (user_id, timestamp) — all events for user_id in one partition, sorted by time
DynamoDB GSI (Global Secondary Index)
- Write: async propagation to GSI partition (eventually consistent)
- Read: direct lookup on GSI partition (efficient)
- Trade-off: GSI reads may be slightly stale
Elasticsearch
- Document-partitioned (called “shards”)
- Each shard has its own Lucene index
- Search = scatter to all shards, merge results
- Routing key can be set per document to co-locate related documents
Kafka Partitions
- Topic divided into partitions
- Message key determines partition (via hash)
- Messages with same key always go to same partition (ordering guarantee)
- Consumer group: each partition consumed by one consumer

Questions

What is the difference between key-range and hash partitioning?
Why do range queries become inefficient with hash partitioning?
What is a hot spot and how do you mitigate it?
What are the trade-offs between document-partitioned and term-partitioned secondary indexes?
When would you choose dynamic partitioning over fixed partitioning?
How does consistent hashing minimize rebalancing overhead?
What is the role of ZooKeeper in partition-aware routing?
How does Cassandra’s compound primary key combine hash and range partitioning?

Modern Context (2026)

Distributed SQL (NewSQL):

CockroachDB, Spanner, YugabyteDB: automatic range partitioning with cross-partition transactions
Range partitions (ranges) automatically split and rebalance
Transparent to application — no manual shard key design

Serverless auto-partitioning:

DynamoDB on-demand: partitions scale automatically with traffic
Neon (PostgreSQL): automatic partitioning at storage layer
Application doesn’t choose or manage partitions

Kafka KRaft (no ZooKeeper):

Kafka 3.x+: KRaft protocol replaces ZooKeeper for partition metadata
Metadata stored in Kafka itself (replicated log)
Eliminates ZooKeeper operational complexity

Vitess (PlanetScale):

Horizontal sharding layer for MySQL
Transparent to application: VTGate routes queries to correct shard
Enables resharding without downtime

TimeSeries partitioning:

Time-based partitioning now universal in modern time-series DBs
InfluxDB, TimescaleDB, Prometheus: automatic time-based partitioning
Old time ranges can be compressed/archived independently

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch06-partitioning