Chapter 6 Flashcards - Partitioning

flashcards chapter-6 ddia


Basic Concepts

What is partitioning (sharding) and why is it needed?
?

  • Partitioning: Splitting a large dataset across multiple nodes, so each node holds only a subset
  • Also called: Sharding (MongoDB, Elasticsearch), Tablets (Bigtable), Regions (HBase), VNodes (Cassandra)
  • Why needed:
    • Data too large for single machine
    • Query load too high for single machine
    • Enables horizontal scaling (add nodes = add capacity)
  • Combined with replication: Each partition is replicated to multiple nodes for availability
  • Orthogonal: Replication and partitioning are independent concerns

What is a hot spot and what causes it?
?

  • Hot spot: One partition receives disproportionately high read or write load
  • Causes:
    1. Sequential keys (timestamps): All new writes go to the “latest” partition
    2. Popular keys (celebrities): Celeb’s posts read by millions → same partition overwhelmed
    3. Non-uniform key distribution with range partitioning: Some ranges have more data/traffic
  • Mitigation:
    • Prepend random/varied prefix to keys (distributes sequential writes)
    • Caching layer for popular keys (reduces reads to hot partition)
    • Application-level sharding of hot keys (e.g., append _0 through _99)

Partitioning Strategies

What are the two main partitioning strategies and their key trade-off?
?
Key-range partitioning:

  • Assign continuous ranges of keys to partitions (like encyclopedia volumes)
  • Keys sorted within partitions → range queries efficient
  • ✅ Range scans fast; ✅ Sorted data
  • ❌ Hot spots from sequential key access patterns (e.g., timestamps)
  • Used by: HBase, Bigtable, MongoDB (range)

Hash partitioning:

  • Hash the key → hash value determines partition
  • Even distribution regardless of key patterns
  • ✅ Uniform distribution; ✅ No sequential hot spots
  • ❌ Range queries require scatter/gather across all partitions
  • Used by: Cassandra (partition key), DynamoDB, MongoDB (hash index)

Trade-off: Range queries vs uniform distribution — you can’t easily have both

What is consistent hashing and how does it minimize rebalancing?
?

  • Concept: Hash space mapped to a ring (0 to 2^64); nodes placed at random positions
  • Key routing: A key goes to the first node clockwise from the key’s hash
  • Adding a node:
    • New node placed at a random position on the ring
    • Only the keys between new node and its predecessor need to move
    • All other keys remain on their current nodes
  • Removing a node: Only that node’s keys move to successor
  • Result: Minimal data movement on cluster topology changes
  • Used by: Amazon Dynamo, Cassandra, Chord DHT

How does Cassandra’s compound primary key achieve both even distribution and range queries?
?

  • Compound key: (partition_key, clustering_key)
  • Partition key: Hashed to determine which node/partition receives the data
    • Even distribution across all nodes
    • Cannot do range queries across different partition keys
  • Clustering key: Sorted within the partition
    • Range queries work within a single partition
    • E.g., all tweets for user_id=123 sorted by timestamp

Example: PRIMARY KEY (user_id, created_at)

  • All events for user_id=123 are in one partition (hashed)
  • Within that partition, events sorted by created_at
  • Query: SELECT * WHERE user_id=123 AND created_at BETWEEN t1 AND t2 → efficient!
  • Cannot query all users’ events in time range across all partitions efficiently

Secondary Indexes

What are the two approaches to secondary indexes in partitioned databases?
?
1. Document-partitioned (local) secondary indexes:

  • Each partition maintains its own secondary index for its local data
  • Write: update only the local partition’s index (fast, single-partition)
  • Read: must scatter to ALL partitions and gather results
  • ✅ Fast writes; ❌ Expensive reads (scatter/gather = tail latency = slowest partition)
  • Used by: MongoDB, Riak, Cassandra, Elasticsearch

2. Term-partitioned (global) secondary indexes:

  • One global index, itself partitioned by the index term
  • Write: must update global index (which lives on another partition) → cross-partition write
  • Read: look up the global index partition → get exact locations → fast
  • ✅ Efficient reads; ❌ Writes touch multiple partitions; index may be slightly stale (async updates)
  • Used by: DynamoDB Global Secondary Indexes, Riak Search

What is scatter/gather and why is it problematic?
?

  • Scatter/gather: Send query to all partitions (scatter), collect and merge all responses (gather)

  • Required when: Using document-partitioned secondary indexes; range queries across partitions

  • Problems:

    • Tail latency: Response time = slowest partition (worst-case latency)
    • Resource amplification: N partitions = N-fold read amplification
    • Coordination overhead: Must wait for all responses before returning result
    • Failure sensitivity: If one partition is slow/unavailable, whole query is affected
  • Mitigation: Use global secondary indexes where possible; limit scatter/gather in SLA-sensitive paths

Rebalancing

What are the three strategies for rebalancing partitions?
?
1. Fixed number of partitions (Riak, Elasticsearch, Couchbase):

  • Create many more partitions than nodes at creation (e.g., 1000 for 10 nodes)
  • Add node: steal whole partitions from existing nodes; no partition splitting
  • ✅ Simple; ❌ Count fixed at DB creation; hard to change

2. Dynamic partitioning (HBase, RethinkDB, MongoDB):

  • Split partition when too large; merge when too small
  • ✅ Adapts to data volume automatically
  • ❌ New empty DB → one partition, one node handles all traffic

3. Partitions proportional to nodes (Cassandra):

  • Each node has a fixed number of partitions (e.g., 256)
  • Add node: new node randomly steals partitions from existing nodes
  • ✅ Scales naturally with cluster size

Why is automatic rebalancing potentially dangerous during failures?
?

  • Rebalancing = moves large amounts of data between nodes (network + disk intensive)

  • During a failure:

    • One node is slow or down
    • Automatic rebalancer kicks in: “Node seems down, let me move its partitions”
    • Rebalancing adds heavy I/O load to remaining nodes
    • Already-stressed nodes get more stressed → cascading failure
  • Real example: Riak/Cassandra cluster under heavy load; one node GC pauses; auto-rebalancer triggers; surviving nodes overwhelmed → multi-node failure

  • Best practice:

    • Use automatic rebalancing conservatively (add delays, thresholds)
    • Require human approval for large rebalancing operations
    • Monitor rebalancing operations separately from regular load

Request Routing

What are the three approaches to request routing in partitioned systems?
?
1. Any-node forwarding (gossip protocol):

  • Client contacts any node
  • If that node owns the partition: handles request
  • If not: forwards to correct node (one extra hop)
  • Used by: Cassandra (gossip protocol for partition metadata)

2. Routing tier (dedicated router):

  • Clients contact a partition-aware routing service (proxy)
  • Router has current partition → node mapping
  • Forwards to correct node directly
  • Used by: Kafka (via ZooKeeper/KRaft), some MongoDB deployments

3. Partition-aware clients:

  • Client library maintains current partition map
  • Contacts correct node directly (no extra hops)
  • Client subscribes to ZooKeeper/etcd for partition map updates
  • Used by: Kafka producer/consumer APIs

Coordination service: ZooKeeper, etcd, or KRaft used as source of truth for partition assignments

Modern Context (2026)

How do NewSQL databases handle partitioning differently from traditional NoSQL?
?

  • Traditional NoSQL (Cassandra, DynamoDB): Application must choose shard key; fixed partitioning
  • NewSQL (CockroachDB, Spanner, YugabyteDB): Automatic range-based partitioning

CockroachDB approach:

  • Data automatically split into ranges (~64MB each)
  • Ranges automatically split when too large, merged when too small
  • Load-based splitting: hot ranges split based on QPS, not just size
  • Application sees a single logical table; partitioning is invisible
  • Transactions work across ranges (via Raft consensus)

Benefits: No shard key design decisions; automatic rebalancing; cross-shard transactions
Cost: Distributed transactions have higher latency than single-shard operations

How does Kafka use partitioning and why does the partition key matter?
?

  • Kafka topic = ordered, append-only log, divided into partitions
  • Each partition is an ordered sequence of messages

Partition key:

  • Producers specify a key per message
  • Key is hashed to determine partition
  • Messages with same key always go to same partition → ordering guaranteed per key
  • No key: round-robin distribution

Why partition key matters:

  • Ordering: Need order for a user’s events? Use user_id as key
  • Colocating related events: All events for same entity in one partition → simpler stream processing
  • Hot spots: Popular user_id as key → hot partition → same problems as DB hot spots

Consumer groups:

  • Each partition consumed by exactly one consumer in a group
  • Parallelism = number of partitions (more partitions = more consumers possible)

Interview Scenarios

Design the partitioning scheme for a ride-sharing app (like Uber) with drivers and riders.
?
Entities to partition: Users, Drivers, Trips, Location updates

Location updates (highest volume):

  • Write: millions of location updates per second from drivers
  • Key: driver_id → hash partition
  • ✅ Uniform distribution; drivers are independent
  • Range queries: “Find all drivers near location X” → geospatial, not key-based

Geospatial queries (nearest drivers):

  • Partition by geohash (geographic grid cells)
  • Range query = nearby geohash cells → range scan
  • Hot spots: dense city centers → may need finer granularity

Trips table:

  • Partition by trip_id (hash) for lookups
  • Secondary index by user_id (document-partitioned) for “my trips” queries
  • Time-range queries → compound key (user_id, created_at)

Real approach (Uber): Separate storage systems by access pattern; Redis for real-time location, Cassandra for trips history

A social media company’s “trending topics” query is slow because it requires scatter/gather across 50 partitions. How do you fix it?
?
Problem: Trending requires counting tags across all posts → all 50 partitions → high latency

Solution 1: Global secondary index (term-partitioned)

  • Build a global index partitioned by hashtag
  • All mentions of #WorldCup → single index partition
  • Read is fast; write must update global index (cross-partition)
  • Trade-off: slight write latency increase + eventual consistency on index

Solution 2: Pre-aggregation with stream processing

  • Kafka stream of all new posts
  • Flink/Spark computes trending window (e.g., last 1 hour sliding window)
  • Write aggregate counts to dedicated “trending” table
  • Read trending → single table scan on small pre-computed data

Solution 3: Materialized view

  • Periodically (every 5 min) batch-compute trending topics
  • Store result in fast cache (Redis)
  • Acceptable staleness for trending topics (5 min is fine)

Best answer for interview: Stream processing (Solution 2) for real-time; materialized cache (Solution 3) for serving

Quick Facts

What is the typical partition count for Elasticsearch and Cassandra clusters?
?
Elasticsearch:

  • Default: 1 primary shard per index (configurable at creation, cannot change after)
  • Common: 5 shards for medium indices (can search in parallel)
  • Rule of thumb: 10-50GB per shard optimal
  • Over-sharding: each shard has overhead; too many shards = memory pressure

Cassandra:

  • VNodes: Each physical node hosts multiple virtual nodes (default: 256 VNodes per node)
  • 10-node cluster with 256 VNodes each = 2560 virtual nodes
  • Each VNode owns a token range
  • More VNodes = smoother rebalancing when adding/removing nodes
  • Trade-off: Many VNodes = more gossip overhead

DynamoDB:

  • Partitions scaled automatically; ~10GB or ~3,000 WCU/10,000 RCU per partition
  • Cannot see or control partition count (managed service)

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13