Chapter 6: Partitioning

Overview

Replication copies data to multiple nodes; partitioning (sharding) splits data so each node holds only a subset. Partitioning is necessary when data is too large or too high-throughput for a single machine. The primary challenge is choosing a partitioning scheme that distributes data and query load evenly—avoiding hot spots (where one partition receives disproportionate traffic).

Key insight: Partitioning and replication are usually combined—each partition is replicated to multiple nodes. The two concepts are orthogonal.

Key Concepts

Partitioning by Key Range

How it works:

  • Assign a continuous range of keys to each partition
  • Keys within each partition kept sorted (like a B-tree)
  • Example: Encyclopedia volumes A-D, E-H, I-M, N-R, S-Z

Used by: HBase, Google Bigtable, MongoDB (range-based), early RethinkDB

Advantages:

  • Range queries are efficient (scan contiguous partition)
  • Keys stored in sorted order → range scans work

Disadvantages:

  • Hot spots from access patterns: If keys are timestamps, all writes go to “today’s” partition
    • Fix: Prefix timestamp with sensor ID (so different sensors go to different partitions)
  • Uneven distribution: Some ranges may have much more data than others
  • Manual or automatic rebalancing needed

Partitioning by Hash of Key

How it works:

  • Apply hash function to key → uniform distribution across partitions
  • Each partition gets a range of hash values (consistent hashing or simple modulo)
  • Example: Cassandra, MongoDB (hash-based), DynamoDB

Hash function properties:

  • Must be deterministic (same key → same hash)
  • Should distribute uniformly
  • Not a cryptographic hash (performance matters); MD5, Murmur3, FNV

Consistent hashing:

  • Ring of hash values 0..2
  • Nodes placed at random positions on ring
  • Key goes to first node clockwise from key’s hash
  • Adding/removing node: only affects adjacent keys (minimal rebalancing)

Advantages:

  • Even distribution of data and load
  • Works well when keys are random/uniform

Disadvantages:

  • Range queries impossible: Adjacent keys now scattered across partitions
    • Cassandra compromise: Compound primary key; hash first part, range on rest
  • Skewed workloads: Popular keys (celebrity user) still cause hot spots

Skewed Workloads and Hot Spots

Hot spot: One partition receives much more load than others

Celebrity problem (Twitter/Instagram): When a celebrity posts, all their followers read that post → millions of reads to one partition

Application-level workarounds:

  • Append random suffix to hot key (splits writes across partitions, but reads must combine from all)
  • Detect hot keys, route to special handling
  • Modern DBs (Cassandra 4.x): Token-aware load balancing

When skew is unavoidable: Application must be designed to accept it, or use caching/CDN in front

Partitioning Secondary Indexes

Problem: Primary key partitioning is clear. But secondary indexes (on non-PK columns) are harder—they don’t map cleanly to one partition.

Approach 1: Document-Partitioned (Local) Secondary Indexes:

  • Each partition maintains its own secondary index for its local data
  • Example: Partition 1 has index of red cars owned by users in that partition
  • Write: update index in the local partition (fast, single partition)
  • Read: must query all partitions and combine results (scatter/gather)
    • Scatter/gather is expensive (tail latency = slowest partition)
  • Used by: MongoDB, Riak, Cassandra, Elasticsearch, VoltDB

Approach 2: Term-Partitioned (Global) Secondary Indexes:

  • One global index partitioned by the index term
  • Example: All red cars across all partitions stored in one “red” index partition
  • Write: must update global index (which may be on a different partition) → cross-partition write
  • Read: efficient (index directs you to the exact partition)
  • Used by: DynamoDB Global Secondary Indexes, Riak Search
Document-PartitionedTerm-Partitioned
WritesSingle partitionCross-partition
ReadsScatter/gatherSingle index partition
ConsistencyImmediately consistentEventually consistent (async)

Rebalancing Partitions

Why rebalancing: Load changes over time; nodes added/removed; data grows

Strategy 1: Fixed number of partitions:

  • Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes)
  • When adding node, move some partitions from existing nodes to new node
  • Number of partitions fixed at DB creation time
  • Used by: Riak, Elasticsearch, Couchbase, Voldemort

Strategy 2: Dynamic partitioning:

  • Partition splits when it grows beyond a threshold (e.g., HBase: 10GB default)
  • Partition merges when it shrinks below a threshold
  • Adapts to data distribution automatically
  • Used by: HBase, RethinkDB, MongoDB

Strategy 3: Fixed partitions per node:

  • Number of partitions proportional to number of nodes
  • Each node has fixed number of partitions (e.g., 256 per node in Cassandra)
  • When adding node: new node steals random partitions from each existing node
  • Used by: Cassandra, Ketama

Automatic vs manual rebalancing:

  • Automatic: Convenient but risky (rebalancing during failure can cascade)
    • Rebalancing moves lots of data; if done during incident, makes things worse
  • Manual: Admin decides when to rebalance; safer for operational control

Request Routing

Problem: Client needs to know which node holds a given partition for a given key

Approaches:

  1. Allow clients to contact any node: Node either handles or forwards (Cassandra gossip protocol)
  2. Routing tier: Clients contact a routing service that knows current partition assignment
    • Routing tier forwards to correct node (Kafka uses ZooKeeper / KRaft for this)
  3. Partition-aware clients: Clients track partition assignment directly
    • Client library knows which node to contact

Partition metadata management:

  • ZooKeeper (or etcd, KRaft): Store authoritative partition → node mapping
  • Nodes and routing tier subscribe to ZooKeeper for updates
  • Used by: Kafka, HBase, Solr

Important Points

  • Partitioning solves scale; replication solves availability: Use both together.
  • Hash partitioning trades range queries for uniform distribution: Can’t have both.
  • Hot spots require application-level workarounds: Databases can’t automatically fix skewed access patterns.
  • Secondary indexes complicate partitioning: Document-partitioned = scatter/gather reads; term-partitioned = cross-partition writes.
  • Rebalancing should be manual or carefully controlled: Automatic rebalancing during failures can amplify the incident.
  • Consistent hashing minimizes data movement: Only nearby keys affected when nodes added/removed.

Examples & Case Studies

  1. Cassandra Compound Primary Key

    • Partition key: hashed (uniform distribution)
    • Clustering key: sorted within partition (enables range scans within partition)
    • Example: (user_id, timestamp) — all events for user_id in one partition, sorted by time
  2. DynamoDB GSI (Global Secondary Index)

    • Write: async propagation to GSI partition (eventually consistent)
    • Read: direct lookup on GSI partition (efficient)
    • Trade-off: GSI reads may be slightly stale
  3. Elasticsearch

    • Document-partitioned (called “shards”)
    • Each shard has its own Lucene index
    • Search = scatter to all shards, merge results
    • Routing key can be set per document to co-locate related documents
  4. Kafka Partitions

    • Topic divided into partitions
    • Message key determines partition (via hash)
    • Messages with same key always go to same partition (ordering guarantee)
    • Consumer group: each partition consumed by one consumer

Questions

  1. What is the difference between key-range and hash partitioning?
  2. Why do range queries become inefficient with hash partitioning?
  3. What is a hot spot and how do you mitigate it?
  4. What are the trade-offs between document-partitioned and term-partitioned secondary indexes?
  5. When would you choose dynamic partitioning over fixed partitioning?
  6. How does consistent hashing minimize rebalancing overhead?
  7. What is the role of ZooKeeper in partition-aware routing?
  8. How does Cassandra’s compound primary key combine hash and range partitioning?

Modern Context (2026)

Distributed SQL (NewSQL):

  • CockroachDB, Spanner, YugabyteDB: automatic range partitioning with cross-partition transactions
  • Range partitions (ranges) automatically split and rebalance
  • Transparent to application — no manual shard key design

Serverless auto-partitioning:

  • DynamoDB on-demand: partitions scale automatically with traffic
  • Neon (PostgreSQL): automatic partitioning at storage layer
  • Application doesn’t choose or manage partitions

Kafka KRaft (no ZooKeeper):

  • Kafka 3.x+: KRaft protocol replaces ZooKeeper for partition metadata
  • Metadata stored in Kafka itself (replicated log)
  • Eliminates ZooKeeper operational complexity

Vitess (PlanetScale):

  • Horizontal sharding layer for MySQL
  • Transparent to application: VTGate routes queries to correct shard
  • Enables resharding without downtime

TimeSeries partitioning:

  • Time-based partitioning now universal in modern time-series DBs
  • InfluxDB, TimescaleDB, Prometheus: automatic time-based partitioning
  • Old time ranges can be compressed/archived independently

Status: Notes complete
Last Updated: 2026-04-13