Chapter 6 Flashcards - Partitioning
Basic Concepts
What is partitioning (sharding) and why is it needed?
?
- Partitioning: Splitting a large dataset across multiple nodes, so each node holds only a subset
- Also called: Sharding (MongoDB, Elasticsearch), Tablets (Bigtable), Regions (HBase), VNodes (Cassandra)
- Why needed:
- Data too large for single machine
- Query load too high for single machine
- Enables horizontal scaling (add nodes = add capacity)
- Combined with replication: Each partition is replicated to multiple nodes for availability
- Orthogonal: Replication and partitioning are independent concerns
What is a hot spot and what causes it?
?
- Hot spot: One partition receives disproportionately high read or write load
- Causes:
- Sequential keys (timestamps): All new writes go to the “latest” partition
- Popular keys (celebrities): Celeb’s posts read by millions → same partition overwhelmed
- Non-uniform key distribution with range partitioning: Some ranges have more data/traffic
- Mitigation:
- Prepend random/varied prefix to keys (distributes sequential writes)
- Caching layer for popular keys (reduces reads to hot partition)
- Application-level sharding of hot keys (e.g., append
_0through_99)
Partitioning Strategies
What are the two main partitioning strategies and their key trade-off?
?
Key-range partitioning:
- Assign continuous ranges of keys to partitions (like encyclopedia volumes)
- Keys sorted within partitions → range queries efficient
- ✅ Range scans fast; ✅ Sorted data
- ❌ Hot spots from sequential key access patterns (e.g., timestamps)
- Used by: HBase, Bigtable, MongoDB (range)
Hash partitioning:
- Hash the key → hash value determines partition
- Even distribution regardless of key patterns
- ✅ Uniform distribution; ✅ No sequential hot spots
- ❌ Range queries require scatter/gather across all partitions
- Used by: Cassandra (partition key), DynamoDB, MongoDB (hash index)
Trade-off: Range queries vs uniform distribution — you can’t easily have both
What is consistent hashing and how does it minimize rebalancing?
?
- Concept: Hash space mapped to a ring (0 to 2^64); nodes placed at random positions
- Key routing: A key goes to the first node clockwise from the key’s hash
- Adding a node:
- New node placed at a random position on the ring
- Only the keys between new node and its predecessor need to move
- All other keys remain on their current nodes
- Removing a node: Only that node’s keys move to successor
- Result: Minimal data movement on cluster topology changes
- Used by: Amazon Dynamo, Cassandra, Chord DHT
How does Cassandra’s compound primary key achieve both even distribution and range queries?
?
- Compound key:
(partition_key, clustering_key) - Partition key: Hashed to determine which node/partition receives the data
- Even distribution across all nodes
- Cannot do range queries across different partition keys
- Clustering key: Sorted within the partition
- Range queries work within a single partition
- E.g., all tweets for
user_id=123sorted bytimestamp
Example: PRIMARY KEY (user_id, created_at)
- All events for user_id=123 are in one partition (hashed)
- Within that partition, events sorted by created_at
- Query:
SELECT * WHERE user_id=123 AND created_at BETWEEN t1 AND t2→ efficient! - Cannot query all users’ events in time range across all partitions efficiently
Secondary Indexes
What are the two approaches to secondary indexes in partitioned databases?
?
1. Document-partitioned (local) secondary indexes:
- Each partition maintains its own secondary index for its local data
- Write: update only the local partition’s index (fast, single-partition)
- Read: must scatter to ALL partitions and gather results
- ✅ Fast writes; ❌ Expensive reads (scatter/gather = tail latency = slowest partition)
- Used by: MongoDB, Riak, Cassandra, Elasticsearch
2. Term-partitioned (global) secondary indexes:
- One global index, itself partitioned by the index term
- Write: must update global index (which lives on another partition) → cross-partition write
- Read: look up the global index partition → get exact locations → fast
- ✅ Efficient reads; ❌ Writes touch multiple partitions; index may be slightly stale (async updates)
- Used by: DynamoDB Global Secondary Indexes, Riak Search
What is scatter/gather and why is it problematic?
?
-
Scatter/gather: Send query to all partitions (scatter), collect and merge all responses (gather)
-
Required when: Using document-partitioned secondary indexes; range queries across partitions
-
Problems:
- Tail latency: Response time = slowest partition (worst-case latency)
- Resource amplification: N partitions = N-fold read amplification
- Coordination overhead: Must wait for all responses before returning result
- Failure sensitivity: If one partition is slow/unavailable, whole query is affected
-
Mitigation: Use global secondary indexes where possible; limit scatter/gather in SLA-sensitive paths
Rebalancing
What are the three strategies for rebalancing partitions?
?
1. Fixed number of partitions (Riak, Elasticsearch, Couchbase):
- Create many more partitions than nodes at creation (e.g., 1000 for 10 nodes)
- Add node: steal whole partitions from existing nodes; no partition splitting
- ✅ Simple; ❌ Count fixed at DB creation; hard to change
2. Dynamic partitioning (HBase, RethinkDB, MongoDB):
- Split partition when too large; merge when too small
- ✅ Adapts to data volume automatically
- ❌ New empty DB → one partition, one node handles all traffic
3. Partitions proportional to nodes (Cassandra):
- Each node has a fixed number of partitions (e.g., 256)
- Add node: new node randomly steals partitions from existing nodes
- ✅ Scales naturally with cluster size
Why is automatic rebalancing potentially dangerous during failures?
?
-
Rebalancing = moves large amounts of data between nodes (network + disk intensive)
-
During a failure:
- One node is slow or down
- Automatic rebalancer kicks in: “Node seems down, let me move its partitions”
- Rebalancing adds heavy I/O load to remaining nodes
- Already-stressed nodes get more stressed → cascading failure
-
Real example: Riak/Cassandra cluster under heavy load; one node GC pauses; auto-rebalancer triggers; surviving nodes overwhelmed → multi-node failure
-
Best practice:
- Use automatic rebalancing conservatively (add delays, thresholds)
- Require human approval for large rebalancing operations
- Monitor rebalancing operations separately from regular load
Request Routing
What are the three approaches to request routing in partitioned systems?
?
1. Any-node forwarding (gossip protocol):
- Client contacts any node
- If that node owns the partition: handles request
- If not: forwards to correct node (one extra hop)
- Used by: Cassandra (gossip protocol for partition metadata)
2. Routing tier (dedicated router):
- Clients contact a partition-aware routing service (proxy)
- Router has current partition → node mapping
- Forwards to correct node directly
- Used by: Kafka (via ZooKeeper/KRaft), some MongoDB deployments
3. Partition-aware clients:
- Client library maintains current partition map
- Contacts correct node directly (no extra hops)
- Client subscribes to ZooKeeper/etcd for partition map updates
- Used by: Kafka producer/consumer APIs
Coordination service: ZooKeeper, etcd, or KRaft used as source of truth for partition assignments
Modern Context (2026)
How do NewSQL databases handle partitioning differently from traditional NoSQL?
?
- Traditional NoSQL (Cassandra, DynamoDB): Application must choose shard key; fixed partitioning
- NewSQL (CockroachDB, Spanner, YugabyteDB): Automatic range-based partitioning
CockroachDB approach:
- Data automatically split into ranges (~64MB each)
- Ranges automatically split when too large, merged when too small
- Load-based splitting: hot ranges split based on QPS, not just size
- Application sees a single logical table; partitioning is invisible
- Transactions work across ranges (via Raft consensus)
Benefits: No shard key design decisions; automatic rebalancing; cross-shard transactions
Cost: Distributed transactions have higher latency than single-shard operations
How does Kafka use partitioning and why does the partition key matter?
?
- Kafka topic = ordered, append-only log, divided into partitions
- Each partition is an ordered sequence of messages
Partition key:
- Producers specify a key per message
- Key is hashed to determine partition
- Messages with same key always go to same partition → ordering guaranteed per key
- No key: round-robin distribution
Why partition key matters:
- Ordering: Need order for a user’s events? Use
user_idas key - Colocating related events: All events for same entity in one partition → simpler stream processing
- Hot spots: Popular user_id as key → hot partition → same problems as DB hot spots
Consumer groups:
- Each partition consumed by exactly one consumer in a group
- Parallelism = number of partitions (more partitions = more consumers possible)
Interview Scenarios
Design the partitioning scheme for a ride-sharing app (like Uber) with drivers and riders.
?
Entities to partition: Users, Drivers, Trips, Location updates
Location updates (highest volume):
- Write: millions of location updates per second from drivers
- Key:
driver_id→ hash partition - ✅ Uniform distribution; drivers are independent
- Range queries: “Find all drivers near location X” → geospatial, not key-based
Geospatial queries (nearest drivers):
- Partition by geohash (geographic grid cells)
- Range query = nearby geohash cells → range scan
- Hot spots: dense city centers → may need finer granularity
Trips table:
- Partition by
trip_id(hash) for lookups - Secondary index by
user_id(document-partitioned) for “my trips” queries - Time-range queries → compound key
(user_id, created_at)
Real approach (Uber): Separate storage systems by access pattern; Redis for real-time location, Cassandra for trips history
A social media company’s “trending topics” query is slow because it requires scatter/gather across 50 partitions. How do you fix it?
?
Problem: Trending requires counting tags across all posts → all 50 partitions → high latency
Solution 1: Global secondary index (term-partitioned)
- Build a global index partitioned by hashtag
- All mentions of
#WorldCup→ single index partition - Read is fast; write must update global index (cross-partition)
- Trade-off: slight write latency increase + eventual consistency on index
Solution 2: Pre-aggregation with stream processing
- Kafka stream of all new posts
- Flink/Spark computes trending window (e.g., last 1 hour sliding window)
- Write aggregate counts to dedicated “trending” table
- Read trending → single table scan on small pre-computed data
Solution 3: Materialized view
- Periodically (every 5 min) batch-compute trending topics
- Store result in fast cache (Redis)
- Acceptable staleness for trending topics (5 min is fine)
Best answer for interview: Stream processing (Solution 2) for real-time; materialized cache (Solution 3) for serving
Quick Facts
What is the typical partition count for Elasticsearch and Cassandra clusters?
?
Elasticsearch:
- Default: 1 primary shard per index (configurable at creation, cannot change after)
- Common: 5 shards for medium indices (can search in parallel)
- Rule of thumb: 10-50GB per shard optimal
- Over-sharding: each shard has overhead; too many shards = memory pressure
Cassandra:
- VNodes: Each physical node hosts multiple virtual nodes (default: 256 VNodes per node)
- 10-node cluster with 256 VNodes each = 2560 virtual nodes
- Each VNode owns a token range
- More VNodes = smoother rebalancing when adding/removing nodes
- Trade-off: Many VNodes = more gossip overhead
DynamoDB:
- Partitions scaled automatically; ~10GB or ~3,000 WCU/10,000 RCU per partition
- Cannot see or control partition count (managed service)
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13