Chapter 6: Design a Key-Value Store

volume1 key-value-store distributed-systems storage

Status: 🟩 Interview ready
Difficulty: Hard (Core distributed systems)
Time to complete: 60 min read + practice

Overview

A key-value store is a non-relational database that stores data as unique key-value pairs. The key is a unique identifier (string or hash), and the value can be anything (string, blob, JSON, binary).

Real-world systems:

Redis: In-memory, single-server + cluster mode, widely used for caching
DynamoDB: AWS managed, fully distributed, high availability
Cassandra: Wide-column, highly scalable, tunable consistency
Memcached: In-memory caching, simpler than Redis

Why this matters:

Fundamental distributed systems design problem
Tests knowledge of CAP theorem, consistent hashing, quorum, and conflict resolution
Core building block behind many large-scale systems (caches, session stores, feature flags)

Problem Statement

Design a key-value store that supports:

get(key) — retrieve the value for a given key
put(key, value) — insert or update a key-value pair

Context: Must handle millions of requests per second at internet scale, with high availability and partition tolerance.

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

Single-server or distributed? → Distributed (internet scale)
Strong or eventual consistency? → Tunable (trade-off, discussable)
High availability or strong consistency? → High availability preferred
Data size? → Small objects (< 10 KB per value)
How much data? → Store big data (hundreds of TB)
What operations? → get(key) and put(key, value) only

Scope:

Key-value pair size: < 10 KB
Store big data sets (hundreds of TB)
High availability (low downtime)
High scalability (supports large data set)
Automatic scaling (add/remove servers based on traffic)
Tunable consistency
Low latency

Non-Functional Requirements

High availability: System should always be up even during failures
High scalability: Scale to support large datasets
Automatic scaling: Servers added/removed automatically
Tunable consistency: Configurable consistency level
Low latency: Reads and writes should be fast
Partition tolerance: System functions even with network partitions

Step 2: High-Level Design (10 min)

Single Server Key-Value Store

If all data fits on one server (simple approach):

Use an in-memory hash map
Full data fits in memory: Fast but limited to one machine
Compress data + store frequently used in memory, rest on disk

Limitations:

Memory is expensive and limited
Single point of failure
Not scalable

CAP Theorem — The Core Trade-off

            Consistency
                /\
               /  \
              /    \
             / CA   \
            /--------\
           /  CP  AP  \
          /____________\
   Availability   Partition Tolerance

CAP Theorem: A distributed system can only guarantee 2 of 3:

Consistency — every read gets the most recent write
Availability — every request gets a response (not necessarily the latest data)
Partition tolerance — system continues despite network partition

In practice: Network partitions WILL happen. So you must choose between CP or AP.

System	Choice	Reasoning
Zookeeper, HBase	CP	Financial data — consistency critical
Cassandra, DynamoDB	AP	High availability preferred
Redis (cluster)	CP or AP	Configurable per use case

Key insight for interviews: Since P is always required, you choose between CP (sacrifice availability during partition) or AP (sacrifice consistency during partition).

Data Partition — Consistent Hashing

Problem: How to distribute data across multiple servers?

Naive hashing: hash(key) % N

Problem: When N changes (add/remove server), almost all keys remap → massive data movement

Consistent Hashing (the solution):

         0
       /   \
     330    30
    /          \
  300            60
    \          /
     270    90
       \   /
        240—120
           |
          180

Servers placed at positions on ring.
Keys map clockwise to next server.
Adding/removing server only moves ~K/N keys.

Virtual nodes (vnodes): Each server has multiple positions on the ring

Server A: positions 10, 110, 210, 310
Server B: positions 50, 150, 250, 350
Server C: positions 90, 190, 290, 30

Benefits:
- More balanced distribution
- Non-uniform ring when using real positions
- Easy to scale: New server gets subset of virtual nodes

Consistent hashing formula:

K = total keys, N = number of servers
When server added/removed: Only K/N keys need to be remapped
Without consistent hashing: K × (N-1)/N keys need remapping

Data Replication

Why replicate? Fault tolerance and availability.

N-replica strategy:

After hashing a key to a position, walk clockwise and store on next N unique physical servers
Example with N=3:

Ring (clockwise):
  Key "user123" → hash position 100
  Walk clockwise → Server A (pos 110) ← Primary
                → Server B (pos 150) ← Replica 1
                → Server C (pos 190) ← Replica 2

Note: Virtual nodes on the same physical server are skipped — must be N physical servers.

Consistency Models

Model	Description	Example
Strong	Read always returns latest write	Bank account balance
Eventual	Given no new updates, all replicas converge	Social media likes
Weak	No guarantee when data becomes consistent	Video views counter

Step 3: Deep Dive (30 min)

Quorum Consensus — The Core Mechanism

Variables:

N = number of replicas
W = write quorum (minimum replicas that must acknowledge a write)
R = read quorum (minimum replicas that must respond to a read)

Rule: W + R > N guarantees strong consistency

Why? If W + R > N, at least one replica that was written must also be read. You can’t miss the latest write.

Example: N=3 (3 replicas)

                  W=1, R=3 (Fast writes, slow reads)
                  ┌──────────────────────────────────┐
                  │  Write to 1 replica              │
                  │  Read from all 3 (find latest)   │
                  │  → Optimized for write-heavy     │
                  └──────────────────────────────────┘

                  W=3, R=1 (Slow writes, fast reads)
                  ┌──────────────────────────────────┐
                  │  Write to all 3 replicas         │
                  │  Read from 1 (guaranteed latest) │
                  │  → Optimized for read-heavy      │
                  └──────────────────────────────────┘

                  W=2, R=2 (Balanced — most common)
                  ┌──────────────────────────────────┐
                  │  Write to 2 replicas             │
                  │  Read from 2 (overlap guaranteed)│
                  │  → Strong consistency balanced   │
                  └──────────────────────────────────┘

Configurations for interview:

N	W	R	W+R	Type	Use Case
3	1	3	4 > 3	Strong	Read-heavy, slow writes OK
3	3	1	4 > 3	Strong	Write-heavy, slow writes
3	2	2	4 > 3	Strong	Balanced, default
3	1	1	2 < 3	Eventual	High availability, low latency

If W + R <= N → eventual consistency (no overlap guaranteed)

Coordinator Node

Client
  |
  v
Coordinator (one of the N nodes, selected by consistent hashing)
  |
  |──→ Replica 1 (write/read)
  |──→ Replica 2 (write/read)
  |──→ Replica 3 (write/read)
  |
  v
Return result to client
(After W writes or R reads acknowledged)

The coordinator acts as a proxy between the client and the N replicas. It does not need to store the data itself.

Versioning and Vector Clocks — Conflict Resolution

The problem: When network partition happens, different replicas may accept conflicting writes.

Normal:    Client → Server A → replicates to B and C
Partition: Client 1 → Server A (version X)
           Client 2 → Server B (version Y)  ← conflict!

Vector Clock — tracks causality:

Format: D([server, version], [server, version], ...)

D([s1, v1]) → D([s1, v1], [s2, v1]) → D([s1, v2], [s2, v1])

Example walkthrough:

1. Client writes to server Sx:
   D([Sx, 1])

2. Client updates via Sy:
   D([Sx, 1], [Sy, 1])

3. Client updates via Sz:
   D([Sx, 1], [Sy, 1], [Sz, 1])

4. Network partition, two concurrent updates:
   Client 1 via Sy: D([Sx, 1], [Sy, 2], [Sz, 1])   ← version A
   Client 2 via Sz: D([Sx, 1], [Sy, 1], [Sz, 2])   ← version B

5. Conflict detected (neither A nor B dominates)
   → Sent to client to resolve
   → Client merges → D([Sx, 1], [Sy, 2], [Sz, 2])

Conflict detection rules:

Version X dominates Y if every counter in X >= every counter in Y
If neither dominates → conflict → needs resolution

Downsides of vector clocks:

Vector clock grows with number of servers
Conflict resolution logic pushed to client
Solution: Limit clock size, evict oldest entries

Last-Write-Wins (LWW) — simpler alternative:

Attach timestamp to each write
On conflict, keep most recent timestamp
Pro: Simple, no client-side merge logic
Con: Can lose data (concurrent writes, one discarded)
Used by: Cassandra (by default)

Failure Detection — Gossip Protocol

Problem: How does a node know if another node is down?

Naive approach: All-to-all heartbeats

N nodes × each pings every other = O(N²) messages → not scalable

Gossip Protocol (the solution):

Each node maintains membership list:
  NodeID | IP     | Heartbeat Counter | Timestamp
  -------|--------|-------------------|-----------
  Node1  | 1.1.1.1|        45         | 12:00:05
  Node2  | 1.1.1.2|        43         | 12:00:03
  Node3  | 1.1.1.3|        40         | 12:00:00  ← stale!

Algorithm:
1. Each node increments own heartbeat counter periodically
2. Randomly picks a few nodes and sends membership list
3. Recipients merge lists (take max heartbeat for each node)
4. If a node's heartbeat not updated for > T seconds → suspect failure
5. If still not updated for > T' seconds → mark offline

Key property: Information propagates exponentially fast (like gossip!)

After k rounds: ~2^k nodes are informed
Scales well: O(log N) rounds to reach all nodes

Handling Temporary Failures — Sloppy Quorum + Hinted Handoff

Problem: A server in the quorum is temporarily down. Do we fail the request?

Sloppy Quorum: Instead of strict quorum (must use the N designated nodes), use the first W/R healthy nodes encountered on the ring.

Hinted Handoff: Write goes to a temporary node, which stores a “hint” that data belongs to the downed server.

Normal quorum: Key X → [Server A, Server B, Server C]
                          (Server A is down)

Sloppy quorum: Key X → [Server B, Server C, Server D]
                               ↑
                     Server D stores hint: "This data belongs to Server A"

When Server A recovers:
  Server D → pushes data back to Server A
  Server D → deletes its copy

Trade-off:

Improves availability (W=1 or R=1 in effect)
Reduces consistency during failure window
Used by: DynamoDB, Cassandra

Handling Permanent Failures — Anti-Entropy with Merkle Trees

Problem: When a server is down for a long time, how do we sync it efficiently?

Naive approach: Compare all key-value pairs between replicas → too slow

Merkle Tree (the solution):

         Root hash
        /         \
    Hash(1-4)   Hash(5-8)
    /     \      /     \
Hash(1-2) Hash(3-4) Hash(5-6) Hash(7-8)
  / \     / \     / \     / \
H1  H2  H3  H4  H5  H6  H7  H8
 |   |   |   |   |   |   |   |
k1  k2  k3  k4  k5  k6  k7  k8

Sync algorithm:

Compare root hashes of two replicas
If equal → in sync, done!
If different → drill down into subtrees
Only sync keys in differing leaf buckets
Amount of data transferred proportional to differences, not total data size

Benefit: O(log N) comparisons instead of O(N)

Handling Data Center Outage — Multi-Datacenter Replication

         US-West DC                    US-East DC
    ┌──────────────────┐          ┌──────────────────┐
    │   [Node A]       │◄────────►│   [Node D]       │
    │   [Node B]       │          │   [Node E]       │
    │   [Node C]       │          │   [Node F]       │
    └──────────────────┘          └──────────────────┘
            ▲                               ▲
            └──────────── ─ ─ ─ ───────────┘
                    Async replication
                    across datacenters

Within DC: Synchronous replication (W=2 or W=3)
Across DC: Asynchronous replication (eventual consistency)

Key point: Cross-datacenter replication is almost always asynchronous to avoid latency. Writes ack’d locally, then propagated to other DCs.

Write Path

Client Write Request
        |
        v
┌──────────────────────────────────────────────────────────────┐
│                    Write Node                                │
│                                                              │
│  1. Write to commit log (WAL — write-ahead log)              │
│     (sequential disk write, very fast, durability)           │
│                                                              │
│  2. Write to memtable (in-memory sorted data structure)      │
│     (fast, all new reads served from here first)             │
│                                                              │
│  3. When memtable reaches threshold:                         │
│     → Flush to disk as SSTable (Sorted String Table)         │
│     → Immutable, sorted by key                               │
│                                                              │
│  4. SSTable files accumulate on disk                         │
│     → Background compaction merges SSTables                  │
│     → Removes deleted/stale entries (tombstones)             │
└──────────────────────────────────────────────────────────────┘

Commit Log (WAL):

Append-only sequential write → very fast
Survives crashes: replay log on restart to rebuild memtable
Deleted once the memtable is flushed to SSTable

Memtable:

In-memory sorted structure (e.g., red-black tree or skip list)
All recent writes live here
Flushed to disk when it exceeds a size threshold (e.g., 32 MB)

SSTable (Sorted String Table):

Immutable, sorted by key, stored on disk
Multiple SSTables may exist for the same key range
Compaction merges SSTables, resolving duplicate keys (newer wins)

Read Path

Client Read Request (key = "user123")
        |
        v
1. Check memtable (in-memory) ─── FOUND? → Return value
        |
        | (not found)
        v
2. Check Bloom Filter for each SSTable
   "Is key user123 possibly in this SSTable?"
   - NO  → Skip this SSTable (save disk I/O)
   - MAYBE → Read SSTable index + data block
        |
        v
3. Read from SSTable on disk
   - Use SSTable index (sparse index) to find block
   - Decompress and read data block
        |
        v
4. Return value to client

Bloom Filter:

Probabilistic data structure
Can say “definitely not in set” or “maybe in set”
False positives possible (wastes one disk read) but no false negatives
Space-efficient: ~10 bits per key for 1% false-positive rate
Avoids expensive disk seeks for non-existent keys

Compaction:

Before compaction:              After compaction:
SSTable 1: [a=1, b=2, c=3]     
SSTable 2: [b=5, d=4]      →   Merged: [a=1, b=5, c=3, d=4]
SSTable 3: [c=deleted]          (c tombstone removed if old enough)

Step 4: CAP Theorem Trade-offs & System Comparison

CAP Trade-offs in Key-Value Stores

CP Systems (sacrifice availability):

Refuse requests during partition to preserve consistency
Use case: Financial data, inventory counts
Examples: HBase, Zookeeper

AP Systems (sacrifice consistency):

Serve potentially stale data during partition
Use case: Shopping carts, social feeds, user preferences
Examples: Cassandra, DynamoDB (default), CouchDB

Key interview insight:

Network partition WILL happen in distributed systems.
You must choose: CP or AP.

"What's worse for your use case?"
  - Show wrong balance: Use AP (show eventual)
  - Show old shopping cart: Use AP (merge on checkout)
  - Show wrong bank balance: Use CP (must be correct)

System Comparison

Feature	Redis	Cassandra	DynamoDB
CAP	CP (cluster)	AP (tunable)	AP (tunable)
Consistency	Strong (single) / Eventual (cluster)	Tunable (quorum)	Eventual (default) / Strong (+50% cost)
Data model	Key-value, sorted sets, lists	Wide-column (key → map of columns)	Key-value + document
Replication	Primary-replica	Multi-master (N replicas)	Multi-region
Scale	Vertical + cluster	Horizontal (add nodes)	Fully managed, auto-scale
Latency	Sub-millisecond	Low (single-digit ms)	Low (single-digit ms)
Persistence	Optional (RDB/AOF)	Always	Always
Use case	Caching, sessions, leaderboards	IoT, time-series, high write	E-commerce, gaming, serverless

Design Summary

Full Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         Client                                      │
└───────────────────────────────┬─────────────────────────────────────┘
                                │ get(key) / put(key, value)
                                v
┌─────────────────────────────────────────────────────────────────────┐
│                      Coordinator Node                               │
│  (Selected by consistent hashing — any node can be coordinator)    │
│                                                                     │
│  Consistent Hash Ring → determine N replica nodes                  │
│  Quorum logic: wait for W writes or R reads                        │
└──────┬─────────────────────────────────────────────────────────────┘
       │  replicate to N nodes
       v
┌──────────────────────────────────────────────────────────────────────┐
│               Distributed Node Ring (Consistent Hashing)            │
│                                                                      │
│   [Node A] ←──Gossip──→ [Node B] ←──Gossip──→ [Node C]              │
│     vnode1              vnode2                 vnode3               │
│     vnode4              vnode5                 vnode6               │
│                                                                      │
│   Each node handles:                                                 │
│   - Write path: commit log → memtable → SSTable                     │
│   - Read path: memtable → bloom filter → SSTable                    │
│   - Failure detection: gossip protocol                              │
│   - Conflict resolution: vector clocks or LWW                      │
│   - Anti-entropy: Merkle trees                                      │
│   - Temp failure: sloppy quorum + hinted handoff                    │
└──────────────────────────────────────────────────────────────────────┘
       │  async replication across DCs
       v
┌──────────────────────────────────────────────────────────────────────┐
│               Remote Datacenter (same ring structure)                │
└──────────────────────────────────────────────────────────────────────┘

Key Decisions Summary

Component	Decision	Reasoning
Partitioning	Consistent hashing with virtual nodes	Even distribution, minimal remapping on add/remove
Replication	N replicas, coordinator + ring walk	Fault tolerance, availability
Consistency	Quorum (W + R > N)	Tunable: trade latency for consistency
Failure detection	Gossip protocol	Scalable, no single point of failure
Conflict resolution	Vector clocks or LWW	Causality tracking or simplicity
Temp failures	Sloppy quorum + hinted handoff	Maintain availability during short outages
Perm failures	Anti-entropy with Merkle trees	Efficient sync of diverged replicas
Write path	Commit log + memtable + SSTable	Durability + performance
Read path	Bloom filter + SSTable index	Minimize expensive disk I/O
Multi-DC	Async cross-DC replication	Tolerate DC-level failures

Interview Questions & Answers

Q: Explain the difference between strong and eventual consistency. When would you choose each?
A: Strong consistency: every read returns the most recent write (W + R > N, e.g., W=2, R=2, N=3). No stale reads but higher latency. Choose for financial data, inventory, booking systems. Eventual consistency: replicas converge over time (W + R <= N). Lower latency, higher availability. Choose for social feeds, view counts, user preferences where stale data is acceptable.

Q: How do vector clocks work and what problem do they solve?
A: Vector clocks track causality to detect conflicts in distributed writes. Each write is tagged with pairs of [server, version]. When two writes happen concurrently (network partition), the vector clocks don’t have a dominance relationship, revealing a conflict. The system can then resolve it (client merge or last-write-wins). Without vector clocks, you might silently lose data or use the wrong version.

Q: Why do we use virtual nodes in consistent hashing?
A: Without virtual nodes, servers may end up with very uneven portions of the ring due to random hash positions. Virtual nodes give each physical server multiple positions on the ring, resulting in a more even data distribution. They also make it easy to scale: when adding a new server, it takes a share of virtual nodes from each existing server proportionally.

Q: What is the gossip protocol and why is it used for failure detection?
A: Each node maintains a membership list with heartbeat counters. Periodically, each node picks a few random peers and exchanges its list. Recipients take the max heartbeat for each node. If a node’s heartbeat hasn’t updated in T seconds, it’s suspected as failed. This avoids O(N²) all-to-all heartbeats, scales well, and has no single point of failure. The cost is it takes a few seconds to detect failures.

Q: Explain the write path in a distributed key-value store.
A: 1) Write to commit log (WAL) first — sequential disk append, fast and durable. 2) Write to memtable — in-memory sorted structure, fast reads/writes. 3) When memtable exceeds threshold, flush to SSTable on disk — immutable, sorted. 4) Background compaction merges SSTables — removes stale/deleted data, improves read performance. The commit log ensures durability even if the server crashes before the memtable is flushed.

Q: What is a Bloom filter and how does it improve read performance?
A: A Bloom filter is a probabilistic data structure that answers “is this key possibly in this SSTable?” It can say “definitely NOT in set” (no disk read needed) or “maybe in set” (read the SSTable). False positives are possible but false negatives are not, so we never skip a key that’s actually there. With many SSTables on disk, Bloom filters avoid expensive seeks for missing keys, dramatically improving read latency.

Q: How does sloppy quorum differ from strict quorum? What’s the trade-off?
A: Strict quorum requires writes/reads to go to the N designated replica nodes for a key. If one is down, the operation may fail. Sloppy quorum uses the first W/R available healthy nodes instead, which may not be in the designated N. This improves availability during failures. However, a hinted handoff is required to propagate data back to the designated node once it recovers. The trade-off is: better availability but temporarily relaxed consistency guarantees.

Key Takeaways

CAP theorem forces a choice: Since partition tolerance is required, every distributed KV store chooses CP (strong consistency) or AP (high availability). Most production KV stores (Cassandra, DynamoDB) choose AP with tunable consistency.
Consistent hashing solves data distribution: Virtual nodes ensure even load distribution and minimize key remapping when servers join or leave.
W + R > N guarantees strong consistency: The quorum overlap ensures at least one node read was also written. W=2, R=2, N=3 is the most common balanced choice.
Vector clocks detect but don’t resolve conflicts: They reveal concurrent writes; resolution is up to the client or a policy (LWW). Vector clocks preserve causality.
Gossip protocol for failure detection scales: O(log N) rounds to propagate status across the cluster. No central authority needed.
Sloppy quorum + hinted handoff maintain availability: During temporary node failures, writes go to available nodes with a hint. Data is synced back when the node recovers.
Merkle trees enable efficient anti-entropy: Instead of comparing all data, only diverged subtrees are synced. Proportional to differences, not total data.
Write path optimization: Commit log (durability) + memtable (fast writes) + SSTable (persistent). Bloom filters + sparse indexes make reads fast across many SSTables.

ch05-consistent-hashing — Deep dive on the ring and virtual nodes
ch04-rate-limiter — Redis as a distributed state store
ch07-unique-id-generator — Another distributed systems design
distributed-system-components > CAP-Theorem — Theory reference
key-patterns > Consistent-Hashing — Pattern overview

Practice this design! Very common hard interview question. Be ready to:

Draw the consistent hashing ring with virtual nodes
Explain quorum with N, W, R values and their trade-offs
Walk through vector clock conflict detection step by step
Describe the full write and read paths
Discuss CAP theorem choice for your system

Last Updated: 2026-04-13
Status: Hard — Core distributed systems — Must know!

Study Notes by Niladri & AI

Explorer

ch06-key-value-store

Chapter 6: Design a Key-Value Store

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Step 2: High-Level Design (10 min)

Single Server Key-Value Store

CAP Theorem — The Core Trade-off

Data Partition — Consistent Hashing

Data Replication

Consistency Models

Step 3: Deep Dive (30 min)

Quorum Consensus — The Core Mechanism

Coordinator Node

Versioning and Vector Clocks — Conflict Resolution

Failure Detection — Gossip Protocol

Handling Temporary Failures — Sloppy Quorum + Hinted Handoff

Handling Permanent Failures — Anti-Entropy with Merkle Trees

Handling Data Center Outage — Multi-Datacenter Replication

Write Path

Read Path

Step 4: CAP Theorem Trade-offs & System Comparison

CAP Trade-offs in Key-Value Stores

System Comparison

Design Summary

Full Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch06-key-value-store

Chapter 6: Design a Key-Value Store

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Step 2: High-Level Design (10 min)

Single Server Key-Value Store

CAP Theorem — The Core Trade-off

Data Partition — Consistent Hashing

Data Replication

Consistency Models

Step 3: Deep Dive (30 min)

Quorum Consensus — The Core Mechanism

Coordinator Node

Versioning and Vector Clocks — Conflict Resolution

Failure Detection — Gossip Protocol

Handling Temporary Failures — Sloppy Quorum + Hinted Handoff

Handling Permanent Failures — Anti-Entropy with Merkle Trees

Handling Data Center Outage — Multi-Datacenter Replication

Write Path

Read Path

Step 4: CAP Theorem Trade-offs & System Comparison

CAP Trade-offs in Key-Value Stores

System Comparison

Design Summary

Full Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Related Resources

Graph View

Table of Contents

Backlinks