Chapter 6 Flashcards - Key-Value Store
flashcards volume1 key-value-store distributed-systems storage
What is a key-value store and what are 4 real-world examples?
?
A non-relational database storing data as unique key-value pairs. Key = unique identifier (string/hash), value = anything (string, blob, JSON). Examples: Redis (in-memory, caching), DynamoDB (AWS managed, fully distributed), Cassandra (wide-column, tunable consistency), Memcached (simple in-memory caching). Fundamental building block for caches, session stores, and feature flags.
What are the two operations in a key-value store API?
?
get(key) — retrieve the value for a given key. put(key, value) — insert or update a key-value pair. Simple interface, but the distributed implementation of these two operations contains all the complexity: consistent hashing, quorum, replication, conflict resolution, and failure handling.
What does CAP theorem state and why does it matter for key-value stores?
?
A distributed system can only guarantee 2 of 3: Consistency (every read gets latest write), Availability (every request gets a response), Partition tolerance (system works despite network partition). Since network partitions WILL happen in real systems, P is always required. So you choose: CP (sacrifice availability during partition) or AP (sacrifice consistency). Most KV stores choose AP (Cassandra, DynamoDB) for higher availability.
Why does naive hashing (hash(key) % N) fail in distributed systems?
?
When N changes (server added/removed), almost all keys remap to different servers. This causes massive data movement — potentially rehashing all keys. Example: 1M keys across 10 servers → add 1 server → ~909K keys need to move. This causes thundering herd (cache misses), high I/O, and unavailability. Solution: consistent hashing, where adding/removing a server only moves K/N keys.
How does consistent hashing work?
?
Servers and keys are both hashed onto a circular ring (0 to 2^32-1). A key maps to the first server found by walking clockwise on the ring. When a server is added, it takes over a portion of keys from its clockwise neighbor. When removed, its keys go to the next clockwise server. Only K/N keys are remapped (K=total keys, N=servers). Core property: minimal disruption on topology changes.
What are virtual nodes in consistent hashing and why are they needed?
?
Each physical server is assigned multiple positions (virtual nodes/vnodes) on the consistent hash ring instead of just one. Example: Server A gets positions 10, 110, 210, 310; Server B gets 50, 150, 250, 350. Benefits: (1) More even data distribution — real positions are random and uneven without vnodes. (2) Better load balancing — each server handles a proportional share. (3) Easy scaling — new server takes a few vnodes from each existing server.
What is the quorum consensus rule for strong consistency?
?
W + R > N guarantees strong consistency. W = write quorum (min replicas that must ACK a write), R = read quorum (min replicas that must respond to a read), N = total replicas. The rule ensures overlap: at least one replica that was written will also be read, so you never miss the latest write. Example: N=3, W=2, R=2 → 2+2=4 > 3, overlap guaranteed.
What are the 3 common quorum configurations for N=3? When to use each?
?
(1) N=3, W=1, R=3: Fast writes, slow reads. Optimized for write-heavy workloads. (2) N=3, W=3, R=1: Slow writes (all must ACK), fast reads. Optimized for read-heavy workloads. (3) N=3, W=2, R=2: Balanced — most common default. Good for general use cases. All three satisfy W+R>N → strong consistency. For eventual consistency: use W=1, R=1 (W+R=2, not > 3).
What happens when W + R <= N?
?
Eventual consistency — no overlap is guaranteed between nodes written to and nodes read from. The system may return stale data. Trade-off: lower latency, higher availability (fewer nodes need to agree). Example: N=3, W=1, R=1 → write to 1 node, read from 1 node, might read from a different node that hasn’t received the write yet. Suitable for social feeds, view counters, shopping carts.
What is a coordinator node in a key-value store?
?
The node responsible for routing a client’s request to the correct N replica nodes for a given key. Selected by consistent hashing — the key maps to a position, and the node at that position (or the first clockwise node) becomes coordinator. It acts as a proxy: sends writes to N replicas and waits for W ACKs, or sends reads to N replicas and waits for R responses. Any node can be a coordinator; there is no single central coordinator.
What is a vector clock and what problem does it solve?
?
A vector clock tracks causality in distributed writes as a list of [server, version] pairs: D([s1,v1], [s2,v2], …). Problem: During a network partition, two clients write concurrently to different replicas → conflict. Vector clocks detect conflicts by comparing versions. If neither clock dominates (no version is greater in all positions), a conflict exists and must be resolved. Without vector clocks, you’d silently overwrite data or return the wrong version.
How do you detect a conflict using vector clocks?
?
Compare two version vectors: Version X dominates Y if every counter in X >= every counter in Y. If X dominates Y, then X is the causally later write — use X. If neither X dominates Y nor Y dominates X (some counters are higher on each side), then the writes are concurrent — conflict! This conflict must be resolved, either by sending both versions to the client for merge, or by applying a policy like last-write-wins.
What is Last-Write-Wins (LWW) conflict resolution? What are the trade-offs?
?
Attach a timestamp to each write. On conflict, keep the version with the latest timestamp, discard the older. Used by Cassandra by default. Pro: Simple — no complex merge logic, no client involvement. Con: Can lose data — two concurrent writes during a partition, one is silently discarded regardless of which had “more important” data. Safe for use cases where occasional data loss is acceptable (user preferences, counters). Not safe for financial data.
How does the gossip protocol work for failure detection?
?
Each node maintains a membership list: [NodeID, IP, HeartbeatCounter, Timestamp]. Algorithm: (1) Each node increments own heartbeat counter periodically. (2) Randomly picks K peers and exchanges membership list. (3) Recipients merge lists, taking max heartbeat per node. (4) If heartbeat for a node hasn’t updated in T seconds → suspect failure. (5) After T’ seconds, mark offline. Information propagates in O(log N) rounds — exponentially fast. No central authority, no O(N²) all-to-all pings.
What is a sloppy quorum and when is it used?
?
Instead of requiring the exact N designated replica nodes for a key, use the first W/R available healthy nodes encountered on the ring. Used when one of the designated N nodes is temporarily down. Example: Key maps to [A, B, C], A is down → write goes to [B, C, D] instead. Node D is a “stand-in” that stores the data temporarily. After A recovers, hinted handoff transfers the data back to A. Trade-off: improves availability but temporarily relaxes consistency guarantees.
What is hinted handoff?
?
When a node (say D) receives a write as a stand-in for a down node (say A) via sloppy quorum, it stores a “hint”: metadata saying “this data belongs to Node A.” When A comes back online, D detects this and pushes the data to A, then deletes its own copy. This ensures the original replica set eventually holds all data even after temporary failures. Used by DynamoDB and Cassandra to maintain high availability during short outages.
What is anti-entropy and how do Merkle trees enable it?
?
Anti-entropy is the process of synchronizing diverged replicas after a server was down for a long time. Merkle trees make this efficient: build a binary hash tree over key ranges (leaves = hashes of individual keys, parents = hash of children). To sync: compare root hashes. If equal → replicas match. If different → recursively compare subtrees to find the diverged key range. Only sync the keys that actually differ. O(log N) comparisons vs O(N) naive comparison.
Describe the full write path in a distributed key-value store.
?
- Write to Commit Log (WAL) first — sequential append-only disk write. Fast, durable. Replayed on crash to recover memtable. 2. Write to Memtable — in-memory sorted structure (red-black tree or skip list). Fast reads and writes. 3. When memtable exceeds threshold (e.g., 32 MB) → flush to SSTable on disk. SSTable is immutable and sorted by key. 4. Background compaction merges multiple SSTables, resolves duplicate keys (newer wins), removes tombstones (deleted keys). Commit log deleted after memtable flush.
What is an SSTable and what are its key properties?
?
SSTable (Sorted String Table) is an immutable, on-disk file of key-value pairs sorted by key. Produced when a memtable is flushed to disk. Properties: immutable (never modified in place), sorted (enables binary search), contains sparse index (maps keys to file offsets). Multiple SSTables can exist for the same key range; compaction merges them. Used in LevelDB, RocksDB, Cassandra, HBase. Immutability enables crash safety and simplifies concurrent access.
Describe the full read path in a distributed key-value store.
?
- Check memtable — if key exists in memory, return immediately (most recent writes are here). 2. If not found, check Bloom filter for each SSTable — “is this key possibly in this SSTable?” If Bloom says NO, skip that SSTable. 3. For SSTables where Bloom says MAYBE, use sparse index to find the file offset for the key’s range. 4. Read the data block from disk, decompress, return value. 5. Most recent version wins across multiple SSTables. Bloom filters are critical — avoid expensive disk seeks for missing keys.
What is a Bloom filter and how does it help reads?
?
A Bloom filter is a probabilistic data structure that answers “is this key possibly in this set?” It can say “definitely NOT in set” (never false negatives — safe to skip) or “maybe in set” (may be a false positive — read the SSTable anyway). Built per SSTable. Space-efficient: ~10 bits/key for ~1% false positive rate. Critical for read performance: without Bloom filters, every read would need to scan all SSTables. False positives waste one disk read; false negatives are impossible.
What is compaction in a key-value store?
?
Background process that merges multiple SSTables into a single larger SSTable. Purposes: (1) Remove duplicate keys — multiple SSTables may have different versions of the same key; compaction keeps only the latest. (2) Remove tombstones — deleted keys are marked with tombstones during writes; compaction removes them. (3) Reduce read amplification — fewer SSTables to scan on reads. (4) Reclaim disk space. Trade-off: compaction uses CPU and I/O. Two strategies: size-tiered (merge similar-sized) and leveled (maintain size levels).
Compare Cassandra, DynamoDB, and Redis on key dimensions.
?
Redis: CAP=CP (cluster mode), sub-millisecond latency, optional persistence (RDB/AOF), best for caching/sessions/leaderboards. Cassandra: CAP=AP (tunable to CP), wide-column data model, multi-master replication, tunable quorum (N/W/R), best for IoT/time-series/high write throughput. DynamoDB: CAP=AP (default) / CP (strong reads at +50% cost), fully managed, auto-scales, best for e-commerce/gaming/serverless. Key difference: Redis is primarily a cache; Cassandra/DynamoDB are primary databases.
When would you choose CP vs AP for a key-value store?
?
Choose CP (consistency over availability) when: stale data causes real harm — bank balances, inventory counts, booking systems (double-booking is unacceptable), authentication tokens. Choose AP (availability over consistency) when: stale data is acceptable — social media likes/views, shopping carts (merge on checkout), user preferences, feature flags, recommendation scores. Rule of thumb: “What is worse for the user — seeing stale data, or getting an error?” If error is worse, use AP.
How does multi-datacenter replication work in a key-value store?
?
Replication within a datacenter is synchronous (W=2 or W=3 to meet quorum locally). Replication across datacenters is asynchronous — the write is ACK’d once the local quorum is satisfied, then propagated to remote DCs in the background. This avoids cross-DC latency on the critical write path. The remote DC will eventually have consistent data (eventual consistency across DCs). If an entire DC fails, the other DC(s) can serve traffic from their replica copies.
What is the commit log (WAL) and why is it written before the memtable?
?
The commit log (Write-Ahead Log) is an append-only sequential file on disk. It is written FIRST before updating the memtable. Reason: If the server crashes after writing to the commit log but before flushing the memtable to disk, the commit log can be replayed on restart to reconstruct the in-memory state — no data loss. The commit log is deleted only after the memtable is successfully flushed to an SSTable. Sequential append is very fast (no random I/O), adding minimal latency to writes.
What are the downsides of vector clocks?
?
(1) Vector size grows with number of servers — in a large cluster, each version tag carries many [server, version] pairs. (2) Conflict resolution logic is pushed to the client — the application must implement merge logic, which is complex. (3) In practice, vector clocks are bounded by dropping oldest entries after a limit, which can cause false conflict detection. Solutions: Use last-write-wins (LWW) instead (simpler, but can lose data), or limit vector clock size and accept occasional false conflicts.
What are the key differences between strong and eventual consistency in terms of W, R, N?
?
Strong consistency: W + R > N. The overlap ensures every read sees the latest write. Higher latency (more nodes must ACK) but no stale reads. Example: N=3, W=2, R=2. Eventual consistency: W + R <= N. No overlap guaranteed; may read from a node that hasn’t received the write yet. Lower latency, higher availability. Example: N=3, W=1, R=1. Tunable systems (Cassandra, DynamoDB) let you configure W and R per request or table based on the use case.
Why can’t a single server key-value store scale to internet traffic?
?
Memory is expensive and physically limited — a single server can hold only a few hundred GB to a few TB at most. Single point of failure — if the server crashes, all data is inaccessible. No horizontal scalability — can’t add more machines to handle more requests. Solution: Distributed key-value store using consistent hashing (partition data), replication (fault tolerance), and quorum (tunable consistency). This is why Cassandra/DynamoDB are designed as distributed systems from the ground up.
Total Cards: 29
Review Time: 25-30 minutes
Priority: HIGH - Hard interview question, core distributed systems!
Last Updated: 2026-04-13