Chapter 6 Flashcards - Partitioning

flashcards chapter-6 ddia

Basic Concepts

What is partitioning (sharding) and why is it needed?
?

Partitioning: Splitting a large dataset across multiple nodes, so each node holds only a subset
Also called: Sharding (MongoDB, Elasticsearch), Tablets (Bigtable), Regions (HBase), VNodes (Cassandra)
Why needed:
- Data too large for single machine
- Query load too high for single machine
- Enables horizontal scaling (add nodes = add capacity)
Combined with replication: Each partition is replicated to multiple nodes for availability
Orthogonal: Replication and partitioning are independent concerns

What is a hot spot and what causes it?
?

Hot spot: One partition receives disproportionately high read or write load
Causes:
1. Sequential keys (timestamps): All new writes go to the “latest” partition
2. Popular keys (celebrities): Celeb’s posts read by millions → same partition overwhelmed
3. Non-uniform key distribution with range partitioning: Some ranges have more data/traffic
Mitigation:
- Prepend random/varied prefix to keys (distributes sequential writes)
- Caching layer for popular keys (reduces reads to hot partition)
- Application-level sharding of hot keys (e.g., append _0 through _99)

Partitioning Strategies

What are the two main partitioning strategies and their key trade-off?
?
Key-range partitioning:

Assign continuous ranges of keys to partitions (like encyclopedia volumes)
Keys sorted within partitions → range queries efficient
✅ Range scans fast; ✅ Sorted data
❌ Hot spots from sequential key access patterns (e.g., timestamps)
Used by: HBase, Bigtable, MongoDB (range)

Hash partitioning:

Hash the key → hash value determines partition
Even distribution regardless of key patterns
✅ Uniform distribution; ✅ No sequential hot spots
❌ Range queries require scatter/gather across all partitions
Used by: Cassandra (partition key), DynamoDB, MongoDB (hash index)

Trade-off: Range queries vs uniform distribution — you can’t easily have both

What is consistent hashing and how does it minimize rebalancing?
?

Concept: Hash space mapped to a ring (0 to 2^64); nodes placed at random positions
Key routing: A key goes to the first node clockwise from the key’s hash
Adding a node:
- New node placed at a random position on the ring
- Only the keys between new node and its predecessor need to move
- All other keys remain on their current nodes
Removing a node: Only that node’s keys move to successor
Result: Minimal data movement on cluster topology changes
Used by: Amazon Dynamo, Cassandra, Chord DHT

How does Cassandra’s compound primary key achieve both even distribution and range queries?
?

Compound key: (partition_key, clustering_key)
Partition key: Hashed to determine which node/partition receives the data
- Even distribution across all nodes
- Cannot do range queries across different partition keys
Clustering key: Sorted within the partition
- Range queries work within a single partition
- E.g., all tweets for user_id=123 sorted by timestamp

Example: PRIMARY KEY (user_id, created_at)

All events for user_id=123 are in one partition (hashed)
Within that partition, events sorted by created_at
Query: SELECT * WHERE user_id=123 AND created_at BETWEEN t1 AND t2 → efficient!
Cannot query all users’ events in time range across all partitions efficiently

Secondary Indexes

What are the two approaches to secondary indexes in partitioned databases?
?
1. Document-partitioned (local) secondary indexes:

Each partition maintains its own secondary index for its local data
Write: update only the local partition’s index (fast, single-partition)
Read: must scatter to ALL partitions and gather results
✅ Fast writes; ❌ Expensive reads (scatter/gather = tail latency = slowest partition)
Used by: MongoDB, Riak, Cassandra, Elasticsearch

2. Term-partitioned (global) secondary indexes:

One global index, itself partitioned by the index term
Write: must update global index (which lives on another partition) → cross-partition write
Read: look up the global index partition → get exact locations → fast
✅ Efficient reads; ❌ Writes touch multiple partitions; index may be slightly stale (async updates)
Used by: DynamoDB Global Secondary Indexes, Riak Search

What is scatter/gather and why is it problematic?
?

Scatter/gather: Send query to all partitions (scatter), collect and merge all responses (gather)
Required when: Using document-partitioned secondary indexes; range queries across partitions
Problems:
- Tail latency: Response time = slowest partition (worst-case latency)
- Resource amplification: N partitions = N-fold read amplification
- Coordination overhead: Must wait for all responses before returning result
- Failure sensitivity: If one partition is slow/unavailable, whole query is affected
Mitigation: Use global secondary indexes where possible; limit scatter/gather in SLA-sensitive paths

Rebalancing

What are the three strategies for rebalancing partitions?
?
1. Fixed number of partitions (Riak, Elasticsearch, Couchbase):

Create many more partitions than nodes at creation (e.g., 1000 for 10 nodes)
Add node: steal whole partitions from existing nodes; no partition splitting
✅ Simple; ❌ Count fixed at DB creation; hard to change

2. Dynamic partitioning (HBase, RethinkDB, MongoDB):

Split partition when too large; merge when too small
✅ Adapts to data volume automatically
❌ New empty DB → one partition, one node handles all traffic

3. Partitions proportional to nodes (Cassandra):

Each node has a fixed number of partitions (e.g., 256)
Add node: new node randomly steals partitions from existing nodes
✅ Scales naturally with cluster size

Why is automatic rebalancing potentially dangerous during failures?
?

Rebalancing = moves large amounts of data between nodes (network + disk intensive)
During a failure:
- One node is slow or down
- Automatic rebalancer kicks in: “Node seems down, let me move its partitions”
- Rebalancing adds heavy I/O load to remaining nodes
- Already-stressed nodes get more stressed → cascading failure
Real example: Riak/Cassandra cluster under heavy load; one node GC pauses; auto-rebalancer triggers; surviving nodes overwhelmed → multi-node failure
Best practice:
- Use automatic rebalancing conservatively (add delays, thresholds)
- Require human approval for large rebalancing operations
- Monitor rebalancing operations separately from regular load

Request Routing

What are the three approaches to request routing in partitioned systems?
?
1. Any-node forwarding (gossip protocol):

Client contacts any node
If that node owns the partition: handles request
If not: forwards to correct node (one extra hop)
Used by: Cassandra (gossip protocol for partition metadata)

2. Routing tier (dedicated router):

Clients contact a partition-aware routing service (proxy)
Router has current partition → node mapping
Forwards to correct node directly
Used by: Kafka (via ZooKeeper/KRaft), some MongoDB deployments

3. Partition-aware clients:

Client library maintains current partition map
Contacts correct node directly (no extra hops)
Client subscribes to ZooKeeper/etcd for partition map updates
Used by: Kafka producer/consumer APIs

Coordination service: ZooKeeper, etcd, or KRaft used as source of truth for partition assignments

Modern Context (2026)

How do NewSQL databases handle partitioning differently from traditional NoSQL?
?

Traditional NoSQL (Cassandra, DynamoDB): Application must choose shard key; fixed partitioning
NewSQL (CockroachDB, Spanner, YugabyteDB): Automatic range-based partitioning

CockroachDB approach:

Data automatically split into ranges (~64MB each)
Ranges automatically split when too large, merged when too small
Load-based splitting: hot ranges split based on QPS, not just size
Application sees a single logical table; partitioning is invisible
Transactions work across ranges (via Raft consensus)

Benefits: No shard key design decisions; automatic rebalancing; cross-shard transactions
Cost: Distributed transactions have higher latency than single-shard operations

How does Kafka use partitioning and why does the partition key matter?
?

Kafka topic = ordered, append-only log, divided into partitions
Each partition is an ordered sequence of messages

Partition key:

Producers specify a key per message
Key is hashed to determine partition
Messages with same key always go to same partition → ordering guaranteed per key
No key: round-robin distribution

Why partition key matters:

Ordering: Need order for a user’s events? Use user_id as key
Colocating related events: All events for same entity in one partition → simpler stream processing
Hot spots: Popular user_id as key → hot partition → same problems as DB hot spots

Consumer groups:

Each partition consumed by exactly one consumer in a group
Parallelism = number of partitions (more partitions = more consumers possible)

Interview Scenarios

Design the partitioning scheme for a ride-sharing app (like Uber) with drivers and riders.
?
Entities to partition: Users, Drivers, Trips, Location updates

Location updates (highest volume):

Write: millions of location updates per second from drivers
Key: driver_id → hash partition
✅ Uniform distribution; drivers are independent
Range queries: “Find all drivers near location X” → geospatial, not key-based

Geospatial queries (nearest drivers):

Partition by geohash (geographic grid cells)
Range query = nearby geohash cells → range scan
Hot spots: dense city centers → may need finer granularity

Trips table:

Partition by trip_id (hash) for lookups
Secondary index by user_id (document-partitioned) for “my trips” queries
Time-range queries → compound key (user_id, created_at)

Real approach (Uber): Separate storage systems by access pattern; Redis for real-time location, Cassandra for trips history

A social media company’s “trending topics” query is slow because it requires scatter/gather across 50 partitions. How do you fix it?
?
Problem: Trending requires counting tags across all posts → all 50 partitions → high latency

Solution 1: Global secondary index (term-partitioned)

Build a global index partitioned by hashtag
All mentions of #WorldCup → single index partition
Read is fast; write must update global index (cross-partition)
Trade-off: slight write latency increase + eventual consistency on index

Solution 2: Pre-aggregation with stream processing

Kafka stream of all new posts
Flink/Spark computes trending window (e.g., last 1 hour sliding window)
Write aggregate counts to dedicated “trending” table
Read trending → single table scan on small pre-computed data

Solution 3: Materialized view

Periodically (every 5 min) batch-compute trending topics
Store result in fast cache (Redis)
Acceptable staleness for trending topics (5 min is fine)

Best answer for interview: Stream processing (Solution 2) for real-time; materialized cache (Solution 3) for serving

Quick Facts

What is the typical partition count for Elasticsearch and Cassandra clusters?
?
Elasticsearch:

Default: 1 primary shard per index (configurable at creation, cannot change after)
Common: 5 shards for medium indices (can search in parallel)
Rule of thumb: 10-50GB per shard optimal
Over-sharding: each shard has overhead; too many shards = memory pressure

Cassandra:

VNodes: Each physical node hosts multiple virtual nodes (default: 256 VNodes per node)
10-node cluster with 256 VNodes each = 2560 virtual nodes
Each VNode owns a token range
More VNodes = smoother rebalancing when adding/removing nodes
Trade-off: Many VNodes = more gossip overhead

DynamoDB:

Partitions scaled automatically; ~10GB or ~3,000 WCU/10,000 RCU per partition
Cannot see or control partition count (managed service)

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch06-flashcards