Chapter 6: Design a Key-Value Store
volume1 key-value-store distributed-systems storage
Status: π© Interview ready
Difficulty: Hard (Core distributed systems)
Time to complete: 60 min read + practice
Overview
A key-value store is a non-relational database that stores data as unique key-value pairs. The key is a unique identifier (string or hash), and the value can be anything (string, blob, JSON, binary).
Real-world systems:
- Redis: In-memory, single-server + cluster mode, widely used for caching
- DynamoDB: AWS managed, fully distributed, high availability
- Cassandra: Wide-column, highly scalable, tunable consistency
- Memcached: In-memory caching, simpler than Redis
Why this matters:
- Fundamental distributed systems design problem
- Tests knowledge of CAP theorem, consistent hashing, quorum, and conflict resolution
- Core building block behind many large-scale systems (caches, session stores, feature flags)
Problem Statement
Design a key-value store that supports:
get(key)β retrieve the value for a given keyput(key, value)β insert or update a key-value pair
Context: Must handle millions of requests per second at internet scale, with high availability and partition tolerance.
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- Single-server or distributed? β Distributed (internet scale)
- Strong or eventual consistency? β Tunable (trade-off, discussable)
- High availability or strong consistency? β High availability preferred
- Data size? β Small objects (< 10 KB per value)
- How much data? β Store big data (hundreds of TB)
- What operations? β
get(key)andput(key, value)only
Scope:
- Key-value pair size: < 10 KB
- Store big data sets (hundreds of TB)
- High availability (low downtime)
- High scalability (supports large data set)
- Automatic scaling (add/remove servers based on traffic)
- Tunable consistency
- Low latency
Non-Functional Requirements
- High availability: System should always be up even during failures
- High scalability: Scale to support large datasets
- Automatic scaling: Servers added/removed automatically
- Tunable consistency: Configurable consistency level
- Low latency: Reads and writes should be fast
- Partition tolerance: System functions even with network partitions
Step 2: High-Level Design (10 min)
Single Server Key-Value Store
If all data fits on one server (simple approach):
- Use an in-memory hash map
- Full data fits in memory: Fast but limited to one machine
- Compress data + store frequently used in memory, rest on disk
Limitations:
- Memory is expensive and limited
- Single point of failure
- Not scalable
CAP Theorem β The Core Trade-off
Consistency
/\
/ \
/ \
/ CA \
/--------\
/ CP AP \
/____________\
Availability Partition Tolerance
CAP Theorem: A distributed system can only guarantee 2 of 3:
- Consistency β every read gets the most recent write
- Availability β every request gets a response (not necessarily the latest data)
- Partition tolerance β system continues despite network partition
In practice: Network partitions WILL happen. So you must choose between CP or AP.
| System | Choice | Reasoning |
|---|---|---|
| Zookeeper, HBase | CP | Financial data β consistency critical |
| Cassandra, DynamoDB | AP | High availability preferred |
| Redis (cluster) | CP or AP | Configurable per use case |
Key insight for interviews: Since P is always required, you choose between CP (sacrifice availability during partition) or AP (sacrifice consistency during partition).
Data Partition β Consistent Hashing
Problem: How to distribute data across multiple servers?
Naive hashing: hash(key) % N
- Problem: When N changes (add/remove server), almost all keys remap β massive data movement
Consistent Hashing (the solution):
0
/ \
330 30
/ \
300 60
\ /
270 90
\ /
240β120
|
180
Servers placed at positions on ring.
Keys map clockwise to next server.
Adding/removing server only moves ~K/N keys.
Virtual nodes (vnodes): Each server has multiple positions on the ring
Server A: positions 10, 110, 210, 310
Server B: positions 50, 150, 250, 350
Server C: positions 90, 190, 290, 30
Benefits:
- More balanced distribution
- Non-uniform ring when using real positions
- Easy to scale: New server gets subset of virtual nodes
Consistent hashing formula:
K= total keys,N= number of servers- When server added/removed: Only
K/Nkeys need to be remapped - Without consistent hashing:
K Γ (N-1)/Nkeys need remapping
Data Replication
Why replicate? Fault tolerance and availability.
N-replica strategy:
- After hashing a key to a position, walk clockwise and store on next
Nunique physical servers - Example with N=3:
Ring (clockwise):
Key "user123" β hash position 100
Walk clockwise β Server A (pos 110) β Primary
β Server B (pos 150) β Replica 1
β Server C (pos 190) β Replica 2
Note: Virtual nodes on the same physical server are skipped β must be N physical servers.
Consistency Models
| Model | Description | Example |
|---|---|---|
| Strong | Read always returns latest write | Bank account balance |
| Eventual | Given no new updates, all replicas converge | Social media likes |
| Weak | No guarantee when data becomes consistent | Video views counter |
Step 3: Deep Dive (30 min)
Quorum Consensus β The Core Mechanism
Variables:
N= number of replicasW= write quorum (minimum replicas that must acknowledge a write)R= read quorum (minimum replicas that must respond to a read)
Rule: W + R > N guarantees strong consistency
Why? If W + R > N, at least one replica that was written must also be read. You canβt miss the latest write.
Example: N=3 (3 replicas)
W=1, R=3 (Fast writes, slow reads)
ββββββββββββββββββββββββββββββββββββ
β Write to 1 replica β
β Read from all 3 (find latest) β
β β Optimized for write-heavy β
ββββββββββββββββββββββββββββββββββββ
W=3, R=1 (Slow writes, fast reads)
ββββββββββββββββββββββββββββββββββββ
β Write to all 3 replicas β
β Read from 1 (guaranteed latest) β
β β Optimized for read-heavy β
ββββββββββββββββββββββββββββββββββββ
W=2, R=2 (Balanced β most common)
ββββββββββββββββββββββββββββββββββββ
β Write to 2 replicas β
β Read from 2 (overlap guaranteed)β
β β Strong consistency balanced β
ββββββββββββββββββββββββββββββββββββ
Configurations for interview:
| N | W | R | W+R | Type | Use Case |
|---|---|---|---|---|---|
| 3 | 1 | 3 | 4 > 3 | Strong | Read-heavy, slow writes OK |
| 3 | 3 | 1 | 4 > 3 | Strong | Write-heavy, slow writes |
| 3 | 2 | 2 | 4 > 3 | Strong | Balanced, default |
| 3 | 1 | 1 | 2 < 3 | Eventual | High availability, low latency |
If W + R <= N β eventual consistency (no overlap guaranteed)
Coordinator Node
Client
|
v
Coordinator (one of the N nodes, selected by consistent hashing)
|
|βββ Replica 1 (write/read)
|βββ Replica 2 (write/read)
|βββ Replica 3 (write/read)
|
v
Return result to client
(After W writes or R reads acknowledged)
The coordinator acts as a proxy between the client and the N replicas. It does not need to store the data itself.
Versioning and Vector Clocks β Conflict Resolution
The problem: When network partition happens, different replicas may accept conflicting writes.
Normal: Client β Server A β replicates to B and C
Partition: Client 1 β Server A (version X)
Client 2 β Server B (version Y) β conflict!
Vector Clock β tracks causality:
Format: D([server, version], [server, version], ...)
D([s1, v1]) β D([s1, v1], [s2, v1]) β D([s1, v2], [s2, v1])
Example walkthrough:
1. Client writes to server Sx:
D([Sx, 1])
2. Client updates via Sy:
D([Sx, 1], [Sy, 1])
3. Client updates via Sz:
D([Sx, 1], [Sy, 1], [Sz, 1])
4. Network partition, two concurrent updates:
Client 1 via Sy: D([Sx, 1], [Sy, 2], [Sz, 1]) β version A
Client 2 via Sz: D([Sx, 1], [Sy, 1], [Sz, 2]) β version B
5. Conflict detected (neither A nor B dominates)
β Sent to client to resolve
β Client merges β D([Sx, 1], [Sy, 2], [Sz, 2])
Conflict detection rules:
- Version X dominates Y if every counter in X >= every counter in Y
- If neither dominates β conflict β needs resolution
Downsides of vector clocks:
- Vector clock grows with number of servers
- Conflict resolution logic pushed to client
- Solution: Limit clock size, evict oldest entries
Last-Write-Wins (LWW) β simpler alternative:
- Attach timestamp to each write
- On conflict, keep most recent timestamp
- Pro: Simple, no client-side merge logic
- Con: Can lose data (concurrent writes, one discarded)
- Used by: Cassandra (by default)
Failure Detection β Gossip Protocol
Problem: How does a node know if another node is down?
Naive approach: All-to-all heartbeats
- N nodes Γ each pings every other = O(NΒ²) messages β not scalable
Gossip Protocol (the solution):
Each node maintains membership list:
NodeID | IP | Heartbeat Counter | Timestamp
-------|--------|-------------------|-----------
Node1 | 1.1.1.1| 45 | 12:00:05
Node2 | 1.1.1.2| 43 | 12:00:03
Node3 | 1.1.1.3| 40 | 12:00:00 β stale!
Algorithm:
1. Each node increments own heartbeat counter periodically
2. Randomly picks a few nodes and sends membership list
3. Recipients merge lists (take max heartbeat for each node)
4. If a node's heartbeat not updated for > T seconds β suspect failure
5. If still not updated for > T' seconds β mark offline
Key property: Information propagates exponentially fast (like gossip!)
- After k rounds: ~2^k nodes are informed
- Scales well: O(log N) rounds to reach all nodes
Handling Temporary Failures β Sloppy Quorum + Hinted Handoff
Problem: A server in the quorum is temporarily down. Do we fail the request?
Sloppy Quorum: Instead of strict quorum (must use the N designated nodes), use the first W/R healthy nodes encountered on the ring.
Hinted Handoff: Write goes to a temporary node, which stores a βhintβ that data belongs to the downed server.
Normal quorum: Key X β [Server A, Server B, Server C]
(Server A is down)
Sloppy quorum: Key X β [Server B, Server C, Server D]
β
Server D stores hint: "This data belongs to Server A"
When Server A recovers:
Server D β pushes data back to Server A
Server D β deletes its copy
Trade-off:
- Improves availability (W=1 or R=1 in effect)
- Reduces consistency during failure window
- Used by: DynamoDB, Cassandra
Handling Permanent Failures β Anti-Entropy with Merkle Trees
Problem: When a server is down for a long time, how do we sync it efficiently?
Naive approach: Compare all key-value pairs between replicas β too slow
Merkle Tree (the solution):
Root hash
/ \
Hash(1-4) Hash(5-8)
/ \ / \
Hash(1-2) Hash(3-4) Hash(5-6) Hash(7-8)
/ \ / \ / \ / \
H1 H2 H3 H4 H5 H6 H7 H8
| | | | | | | |
k1 k2 k3 k4 k5 k6 k7 k8
Sync algorithm:
- Compare root hashes of two replicas
- If equal β in sync, done!
- If different β drill down into subtrees
- Only sync keys in differing leaf buckets
- Amount of data transferred proportional to differences, not total data size
Benefit: O(log N) comparisons instead of O(N)
Handling Data Center Outage β Multi-Datacenter Replication
US-West DC US-East DC
ββββββββββββββββββββ ββββββββββββββββββββ
β [Node A] βββββββββββΊβ [Node D] β
β [Node B] β β [Node E] β
β [Node C] β β [Node F] β
ββββββββββββββββββββ ββββββββββββββββββββ
β² β²
βββββββββββββ β β β ββββββββββββ
Async replication
across datacenters
Within DC: Synchronous replication (W=2 or W=3)
Across DC: Asynchronous replication (eventual consistency)
Key point: Cross-datacenter replication is almost always asynchronous to avoid latency. Writes ackβd locally, then propagated to other DCs.
Write Path
Client Write Request
|
v
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Write Node β
β β
β 1. Write to commit log (WAL β write-ahead log) β
β (sequential disk write, very fast, durability) β
β β
β 2. Write to memtable (in-memory sorted data structure) β
β (fast, all new reads served from here first) β
β β
β 3. When memtable reaches threshold: β
β β Flush to disk as SSTable (Sorted String Table) β
β β Immutable, sorted by key β
β β
β 4. SSTable files accumulate on disk β
β β Background compaction merges SSTables β
β β Removes deleted/stale entries (tombstones) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Commit Log (WAL):
- Append-only sequential write β very fast
- Survives crashes: replay log on restart to rebuild memtable
- Deleted once the memtable is flushed to SSTable
Memtable:
- In-memory sorted structure (e.g., red-black tree or skip list)
- All recent writes live here
- Flushed to disk when it exceeds a size threshold (e.g., 32 MB)
SSTable (Sorted String Table):
- Immutable, sorted by key, stored on disk
- Multiple SSTables may exist for the same key range
- Compaction merges SSTables, resolving duplicate keys (newer wins)
Read Path
Client Read Request (key = "user123")
|
v
1. Check memtable (in-memory) βββ FOUND? β Return value
|
| (not found)
v
2. Check Bloom Filter for each SSTable
"Is key user123 possibly in this SSTable?"
- NO β Skip this SSTable (save disk I/O)
- MAYBE β Read SSTable index + data block
|
v
3. Read from SSTable on disk
- Use SSTable index (sparse index) to find block
- Decompress and read data block
|
v
4. Return value to client
Bloom Filter:
- Probabilistic data structure
- Can say βdefinitely not in setβ or βmaybe in setβ
- False positives possible (wastes one disk read) but no false negatives
- Space-efficient: ~10 bits per key for 1% false-positive rate
- Avoids expensive disk seeks for non-existent keys
Compaction:
Before compaction: After compaction:
SSTable 1: [a=1, b=2, c=3]
SSTable 2: [b=5, d=4] β Merged: [a=1, b=5, c=3, d=4]
SSTable 3: [c=deleted] (c tombstone removed if old enough)
Step 4: CAP Theorem Trade-offs & System Comparison
CAP Trade-offs in Key-Value Stores
CP Systems (sacrifice availability):
- Refuse requests during partition to preserve consistency
- Use case: Financial data, inventory counts
- Examples: HBase, Zookeeper
AP Systems (sacrifice consistency):
- Serve potentially stale data during partition
- Use case: Shopping carts, social feeds, user preferences
- Examples: Cassandra, DynamoDB (default), CouchDB
Key interview insight:
Network partition WILL happen in distributed systems.
You must choose: CP or AP.
"What's worse for your use case?"
- Show wrong balance: Use AP (show eventual)
- Show old shopping cart: Use AP (merge on checkout)
- Show wrong bank balance: Use CP (must be correct)
System Comparison
| Feature | Redis | Cassandra | DynamoDB |
|---|---|---|---|
| CAP | CP (cluster) | AP (tunable) | AP (tunable) |
| Consistency | Strong (single) / Eventual (cluster) | Tunable (quorum) | Eventual (default) / Strong (+50% cost) |
| Data model | Key-value, sorted sets, lists | Wide-column (key β map of columns) | Key-value + document |
| Replication | Primary-replica | Multi-master (N replicas) | Multi-region |
| Scale | Vertical + cluster | Horizontal (add nodes) | Fully managed, auto-scale |
| Latency | Sub-millisecond | Low (single-digit ms) | Low (single-digit ms) |
| Persistence | Optional (RDB/AOF) | Always | Always |
| Use case | Caching, sessions, leaderboards | IoT, time-series, high write | E-commerce, gaming, serverless |
Design Summary
Full Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Client β
βββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β get(key) / put(key, value)
v
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Coordinator Node β
β (Selected by consistent hashing β any node can be coordinator) β
β β
β Consistent Hash Ring β determine N replica nodes β
β Quorum logic: wait for W writes or R reads β
ββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β replicate to N nodes
v
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Distributed Node Ring (Consistent Hashing) β
β β
β [Node A] βββGossipβββ [Node B] βββGossipβββ [Node C] β
β vnode1 vnode2 vnode3 β
β vnode4 vnode5 vnode6 β
β β
β Each node handles: β
β - Write path: commit log β memtable β SSTable β
β - Read path: memtable β bloom filter β SSTable β
β - Failure detection: gossip protocol β
β - Conflict resolution: vector clocks or LWW β
β - Anti-entropy: Merkle trees β
β - Temp failure: sloppy quorum + hinted handoff β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β async replication across DCs
v
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Remote Datacenter (same ring structure) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Key Decisions Summary
| Component | Decision | Reasoning |
|---|---|---|
| Partitioning | Consistent hashing with virtual nodes | Even distribution, minimal remapping on add/remove |
| Replication | N replicas, coordinator + ring walk | Fault tolerance, availability |
| Consistency | Quorum (W + R > N) | Tunable: trade latency for consistency |
| Failure detection | Gossip protocol | Scalable, no single point of failure |
| Conflict resolution | Vector clocks or LWW | Causality tracking or simplicity |
| Temp failures | Sloppy quorum + hinted handoff | Maintain availability during short outages |
| Perm failures | Anti-entropy with Merkle trees | Efficient sync of diverged replicas |
| Write path | Commit log + memtable + SSTable | Durability + performance |
| Read path | Bloom filter + SSTable index | Minimize expensive disk I/O |
| Multi-DC | Async cross-DC replication | Tolerate DC-level failures |
Interview Questions & Answers
Q: Explain the difference between strong and eventual consistency. When would you choose each?
A: Strong consistency: every read returns the most recent write (W + R > N, e.g., W=2, R=2, N=3). No stale reads but higher latency. Choose for financial data, inventory, booking systems. Eventual consistency: replicas converge over time (W + R <= N). Lower latency, higher availability. Choose for social feeds, view counts, user preferences where stale data is acceptable.
Q: How do vector clocks work and what problem do they solve?
A: Vector clocks track causality to detect conflicts in distributed writes. Each write is tagged with pairs of [server, version]. When two writes happen concurrently (network partition), the vector clocks donβt have a dominance relationship, revealing a conflict. The system can then resolve it (client merge or last-write-wins). Without vector clocks, you might silently lose data or use the wrong version.
Q: Why do we use virtual nodes in consistent hashing?
A: Without virtual nodes, servers may end up with very uneven portions of the ring due to random hash positions. Virtual nodes give each physical server multiple positions on the ring, resulting in a more even data distribution. They also make it easy to scale: when adding a new server, it takes a share of virtual nodes from each existing server proportionally.
Q: What is the gossip protocol and why is it used for failure detection?
A: Each node maintains a membership list with heartbeat counters. Periodically, each node picks a few random peers and exchanges its list. Recipients take the max heartbeat for each node. If a nodeβs heartbeat hasnβt updated in T seconds, itβs suspected as failed. This avoids O(NΒ²) all-to-all heartbeats, scales well, and has no single point of failure. The cost is it takes a few seconds to detect failures.
Q: Explain the write path in a distributed key-value store.
A: 1) Write to commit log (WAL) first β sequential disk append, fast and durable. 2) Write to memtable β in-memory sorted structure, fast reads/writes. 3) When memtable exceeds threshold, flush to SSTable on disk β immutable, sorted. 4) Background compaction merges SSTables β removes stale/deleted data, improves read performance. The commit log ensures durability even if the server crashes before the memtable is flushed.
Q: What is a Bloom filter and how does it improve read performance?
A: A Bloom filter is a probabilistic data structure that answers βis this key possibly in this SSTable?β It can say βdefinitely NOT in setβ (no disk read needed) or βmaybe in setβ (read the SSTable). False positives are possible but false negatives are not, so we never skip a key thatβs actually there. With many SSTables on disk, Bloom filters avoid expensive seeks for missing keys, dramatically improving read latency.
Q: How does sloppy quorum differ from strict quorum? Whatβs the trade-off?
A: Strict quorum requires writes/reads to go to the N designated replica nodes for a key. If one is down, the operation may fail. Sloppy quorum uses the first W/R available healthy nodes instead, which may not be in the designated N. This improves availability during failures. However, a hinted handoff is required to propagate data back to the designated node once it recovers. The trade-off is: better availability but temporarily relaxed consistency guarantees.
Key Takeaways
- CAP theorem forces a choice: Since partition tolerance is required, every distributed KV store chooses CP (strong consistency) or AP (high availability). Most production KV stores (Cassandra, DynamoDB) choose AP with tunable consistency.
- Consistent hashing solves data distribution: Virtual nodes ensure even load distribution and minimize key remapping when servers join or leave.
- W + R > N guarantees strong consistency: The quorum overlap ensures at least one node read was also written. W=2, R=2, N=3 is the most common balanced choice.
- Vector clocks detect but donβt resolve conflicts: They reveal concurrent writes; resolution is up to the client or a policy (LWW). Vector clocks preserve causality.
- Gossip protocol for failure detection scales: O(log N) rounds to propagate status across the cluster. No central authority needed.
- Sloppy quorum + hinted handoff maintain availability: During temporary node failures, writes go to available nodes with a hint. Data is synced back when the node recovers.
- Merkle trees enable efficient anti-entropy: Instead of comparing all data, only diverged subtrees are synced. Proportional to differences, not total data.
- Write path optimization: Commit log (durability) + memtable (fast writes) + SSTable (persistent). Bloom filters + sparse indexes make reads fast across many SSTables.
Related Resources
- ch05-consistent-hashing β Deep dive on the ring and virtual nodes
- ch04-rate-limiter β Redis as a distributed state store
- ch07-unique-id-generator β Another distributed systems design
- distributed-system-components > CAP-Theorem β Theory reference
- key-patterns > Consistent-Hashing β Pattern overview
Practice this design! Very common hard interview question. Be ready to:
- Draw the consistent hashing ring with virtual nodes
- Explain quorum with N, W, R values and their trade-offs
- Walk through vector clock conflict detection step by step
- Describe the full write and read paths
- Discuss CAP theorem choice for your system
Last Updated: 2026-04-13
Status: Hard β Core distributed systems β Must know!