Chapter 6: Design a Key-Value Store

volume1 key-value-store distributed-systems storage

Status: 🟩 Interview ready
Difficulty: Hard (Core distributed systems)
Time to complete: 60 min read + practice


Overview

A key-value store is a non-relational database that stores data as unique key-value pairs. The key is a unique identifier (string or hash), and the value can be anything (string, blob, JSON, binary).

Real-world systems:

  • Redis: In-memory, single-server + cluster mode, widely used for caching
  • DynamoDB: AWS managed, fully distributed, high availability
  • Cassandra: Wide-column, highly scalable, tunable consistency
  • Memcached: In-memory caching, simpler than Redis

Why this matters:

  • Fundamental distributed systems design problem
  • Tests knowledge of CAP theorem, consistent hashing, quorum, and conflict resolution
  • Core building block behind many large-scale systems (caches, session stores, feature flags)

Problem Statement

Design a key-value store that supports:

  • get(key) β€” retrieve the value for a given key
  • put(key, value) β€” insert or update a key-value pair

Context: Must handle millions of requests per second at internet scale, with high availability and partition tolerance.


Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

  • Single-server or distributed? β†’ Distributed (internet scale)
  • Strong or eventual consistency? β†’ Tunable (trade-off, discussable)
  • High availability or strong consistency? β†’ High availability preferred
  • Data size? β†’ Small objects (< 10 KB per value)
  • How much data? β†’ Store big data (hundreds of TB)
  • What operations? β†’ get(key) and put(key, value) only

Scope:

  • Key-value pair size: < 10 KB
  • Store big data sets (hundreds of TB)
  • High availability (low downtime)
  • High scalability (supports large data set)
  • Automatic scaling (add/remove servers based on traffic)
  • Tunable consistency
  • Low latency

Non-Functional Requirements

  • High availability: System should always be up even during failures
  • High scalability: Scale to support large datasets
  • Automatic scaling: Servers added/removed automatically
  • Tunable consistency: Configurable consistency level
  • Low latency: Reads and writes should be fast
  • Partition tolerance: System functions even with network partitions

Step 2: High-Level Design (10 min)

Single Server Key-Value Store

If all data fits on one server (simple approach):

  • Use an in-memory hash map
  • Full data fits in memory: Fast but limited to one machine
  • Compress data + store frequently used in memory, rest on disk

Limitations:

  • Memory is expensive and limited
  • Single point of failure
  • Not scalable

CAP Theorem β€” The Core Trade-off

            Consistency
                /\
               /  \
              /    \
             / CA   \
            /--------\
           /  CP  AP  \
          /____________\
   Availability   Partition Tolerance

CAP Theorem: A distributed system can only guarantee 2 of 3:

  • Consistency β€” every read gets the most recent write
  • Availability β€” every request gets a response (not necessarily the latest data)
  • Partition tolerance β€” system continues despite network partition

In practice: Network partitions WILL happen. So you must choose between CP or AP.

SystemChoiceReasoning
Zookeeper, HBaseCPFinancial data β€” consistency critical
Cassandra, DynamoDBAPHigh availability preferred
Redis (cluster)CP or APConfigurable per use case

Key insight for interviews: Since P is always required, you choose between CP (sacrifice availability during partition) or AP (sacrifice consistency during partition).

Data Partition β€” Consistent Hashing

Problem: How to distribute data across multiple servers?

Naive hashing: hash(key) % N

  • Problem: When N changes (add/remove server), almost all keys remap β†’ massive data movement

Consistent Hashing (the solution):

         0
       /   \
     330    30
    /          \
  300            60
    \          /
     270    90
       \   /
        240β€”120
           |
          180

Servers placed at positions on ring.
Keys map clockwise to next server.
Adding/removing server only moves ~K/N keys.

Virtual nodes (vnodes): Each server has multiple positions on the ring

Server A: positions 10, 110, 210, 310
Server B: positions 50, 150, 250, 350
Server C: positions 90, 190, 290, 30

Benefits:
- More balanced distribution
- Non-uniform ring when using real positions
- Easy to scale: New server gets subset of virtual nodes

Consistent hashing formula:

  • K = total keys, N = number of servers
  • When server added/removed: Only K/N keys need to be remapped
  • Without consistent hashing: K Γ— (N-1)/N keys need remapping

Data Replication

Why replicate? Fault tolerance and availability.

N-replica strategy:

  • After hashing a key to a position, walk clockwise and store on next N unique physical servers
  • Example with N=3:
Ring (clockwise):
  Key "user123" β†’ hash position 100
  Walk clockwise β†’ Server A (pos 110) ← Primary
                β†’ Server B (pos 150) ← Replica 1
                β†’ Server C (pos 190) ← Replica 2

Note: Virtual nodes on the same physical server are skipped β€” must be N physical servers.

Consistency Models

ModelDescriptionExample
StrongRead always returns latest writeBank account balance
EventualGiven no new updates, all replicas convergeSocial media likes
WeakNo guarantee when data becomes consistentVideo views counter

Step 3: Deep Dive (30 min)

Quorum Consensus β€” The Core Mechanism

Variables:

  • N = number of replicas
  • W = write quorum (minimum replicas that must acknowledge a write)
  • R = read quorum (minimum replicas that must respond to a read)

Rule: W + R > N guarantees strong consistency

Why? If W + R > N, at least one replica that was written must also be read. You can’t miss the latest write.

Example: N=3 (3 replicas)

                  W=1, R=3 (Fast writes, slow reads)
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  Write to 1 replica              β”‚
                  β”‚  Read from all 3 (find latest)   β”‚
                  β”‚  β†’ Optimized for write-heavy     β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                  W=3, R=1 (Slow writes, fast reads)
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  Write to all 3 replicas         β”‚
                  β”‚  Read from 1 (guaranteed latest) β”‚
                  β”‚  β†’ Optimized for read-heavy      β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                  W=2, R=2 (Balanced β€” most common)
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  Write to 2 replicas             β”‚
                  β”‚  Read from 2 (overlap guaranteed)β”‚
                  β”‚  β†’ Strong consistency balanced   β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Configurations for interview:

NWRW+RTypeUse Case
3134 > 3StrongRead-heavy, slow writes OK
3314 > 3StrongWrite-heavy, slow writes
3224 > 3StrongBalanced, default
3112 < 3EventualHigh availability, low latency

If W + R <= N β†’ eventual consistency (no overlap guaranteed)

Coordinator Node

Client
  |
  v
Coordinator (one of the N nodes, selected by consistent hashing)
  |
  |──→ Replica 1 (write/read)
  |──→ Replica 2 (write/read)
  |──→ Replica 3 (write/read)
  |
  v
Return result to client
(After W writes or R reads acknowledged)

The coordinator acts as a proxy between the client and the N replicas. It does not need to store the data itself.

Versioning and Vector Clocks β€” Conflict Resolution

The problem: When network partition happens, different replicas may accept conflicting writes.

Normal:    Client β†’ Server A β†’ replicates to B and C
Partition: Client 1 β†’ Server A (version X)
           Client 2 β†’ Server B (version Y)  ← conflict!

Vector Clock β€” tracks causality:

Format: D([server, version], [server, version], ...)

D([s1, v1]) β†’ D([s1, v1], [s2, v1]) β†’ D([s1, v2], [s2, v1])

Example walkthrough:

1. Client writes to server Sx:
   D([Sx, 1])

2. Client updates via Sy:
   D([Sx, 1], [Sy, 1])

3. Client updates via Sz:
   D([Sx, 1], [Sy, 1], [Sz, 1])

4. Network partition, two concurrent updates:
   Client 1 via Sy: D([Sx, 1], [Sy, 2], [Sz, 1])   ← version A
   Client 2 via Sz: D([Sx, 1], [Sy, 1], [Sz, 2])   ← version B

5. Conflict detected (neither A nor B dominates)
   β†’ Sent to client to resolve
   β†’ Client merges β†’ D([Sx, 1], [Sy, 2], [Sz, 2])

Conflict detection rules:

  • Version X dominates Y if every counter in X >= every counter in Y
  • If neither dominates β†’ conflict β†’ needs resolution

Downsides of vector clocks:

  • Vector clock grows with number of servers
  • Conflict resolution logic pushed to client
  • Solution: Limit clock size, evict oldest entries

Last-Write-Wins (LWW) β€” simpler alternative:

  • Attach timestamp to each write
  • On conflict, keep most recent timestamp
  • Pro: Simple, no client-side merge logic
  • Con: Can lose data (concurrent writes, one discarded)
  • Used by: Cassandra (by default)

Failure Detection β€” Gossip Protocol

Problem: How does a node know if another node is down?

Naive approach: All-to-all heartbeats

  • N nodes Γ— each pings every other = O(NΒ²) messages β†’ not scalable

Gossip Protocol (the solution):

Each node maintains membership list:
  NodeID | IP     | Heartbeat Counter | Timestamp
  -------|--------|-------------------|-----------
  Node1  | 1.1.1.1|        45         | 12:00:05
  Node2  | 1.1.1.2|        43         | 12:00:03
  Node3  | 1.1.1.3|        40         | 12:00:00  ← stale!

Algorithm:
1. Each node increments own heartbeat counter periodically
2. Randomly picks a few nodes and sends membership list
3. Recipients merge lists (take max heartbeat for each node)
4. If a node's heartbeat not updated for > T seconds β†’ suspect failure
5. If still not updated for > T' seconds β†’ mark offline

Key property: Information propagates exponentially fast (like gossip!)

  • After k rounds: ~2^k nodes are informed
  • Scales well: O(log N) rounds to reach all nodes

Handling Temporary Failures β€” Sloppy Quorum + Hinted Handoff

Problem: A server in the quorum is temporarily down. Do we fail the request?

Sloppy Quorum: Instead of strict quorum (must use the N designated nodes), use the first W/R healthy nodes encountered on the ring.

Hinted Handoff: Write goes to a temporary node, which stores a β€œhint” that data belongs to the downed server.

Normal quorum: Key X β†’ [Server A, Server B, Server C]
                          (Server A is down)

Sloppy quorum: Key X β†’ [Server B, Server C, Server D]
                               ↑
                     Server D stores hint: "This data belongs to Server A"

When Server A recovers:
  Server D β†’ pushes data back to Server A
  Server D β†’ deletes its copy

Trade-off:

  • Improves availability (W=1 or R=1 in effect)
  • Reduces consistency during failure window
  • Used by: DynamoDB, Cassandra

Handling Permanent Failures β€” Anti-Entropy with Merkle Trees

Problem: When a server is down for a long time, how do we sync it efficiently?

Naive approach: Compare all key-value pairs between replicas β†’ too slow

Merkle Tree (the solution):

         Root hash
        /         \
    Hash(1-4)   Hash(5-8)
    /     \      /     \
Hash(1-2) Hash(3-4) Hash(5-6) Hash(7-8)
  / \     / \     / \     / \
H1  H2  H3  H4  H5  H6  H7  H8
 |   |   |   |   |   |   |   |
k1  k2  k3  k4  k5  k6  k7  k8

Sync algorithm:

  1. Compare root hashes of two replicas
  2. If equal β†’ in sync, done!
  3. If different β†’ drill down into subtrees
  4. Only sync keys in differing leaf buckets
  5. Amount of data transferred proportional to differences, not total data size

Benefit: O(log N) comparisons instead of O(N)

Handling Data Center Outage β€” Multi-Datacenter Replication

         US-West DC                    US-East DC
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   [Node A]       │◄────────►│   [Node D]       β”‚
    β”‚   [Node B]       β”‚          β”‚   [Node E]       β”‚
    β”‚   [Node C]       β”‚          β”‚   [Node F]       β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β–²                               β–²
            └──────────── ─ ─ ─ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    Async replication
                    across datacenters

Within DC: Synchronous replication (W=2 or W=3)
Across DC: Asynchronous replication (eventual consistency)

Key point: Cross-datacenter replication is almost always asynchronous to avoid latency. Writes ack’d locally, then propagated to other DCs.

Write Path

Client Write Request
        |
        v
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Write Node                                β”‚
β”‚                                                              β”‚
β”‚  1. Write to commit log (WAL β€” write-ahead log)              β”‚
β”‚     (sequential disk write, very fast, durability)           β”‚
β”‚                                                              β”‚
β”‚  2. Write to memtable (in-memory sorted data structure)      β”‚
β”‚     (fast, all new reads served from here first)             β”‚
β”‚                                                              β”‚
β”‚  3. When memtable reaches threshold:                         β”‚
β”‚     β†’ Flush to disk as SSTable (Sorted String Table)         β”‚
β”‚     β†’ Immutable, sorted by key                               β”‚
β”‚                                                              β”‚
β”‚  4. SSTable files accumulate on disk                         β”‚
β”‚     β†’ Background compaction merges SSTables                  β”‚
β”‚     β†’ Removes deleted/stale entries (tombstones)             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Commit Log (WAL):

  • Append-only sequential write β†’ very fast
  • Survives crashes: replay log on restart to rebuild memtable
  • Deleted once the memtable is flushed to SSTable

Memtable:

  • In-memory sorted structure (e.g., red-black tree or skip list)
  • All recent writes live here
  • Flushed to disk when it exceeds a size threshold (e.g., 32 MB)

SSTable (Sorted String Table):

  • Immutable, sorted by key, stored on disk
  • Multiple SSTables may exist for the same key range
  • Compaction merges SSTables, resolving duplicate keys (newer wins)

Read Path

Client Read Request (key = "user123")
        |
        v
1. Check memtable (in-memory) ─── FOUND? β†’ Return value
        |
        | (not found)
        v
2. Check Bloom Filter for each SSTable
   "Is key user123 possibly in this SSTable?"
   - NO  β†’ Skip this SSTable (save disk I/O)
   - MAYBE β†’ Read SSTable index + data block
        |
        v
3. Read from SSTable on disk
   - Use SSTable index (sparse index) to find block
   - Decompress and read data block
        |
        v
4. Return value to client

Bloom Filter:

  • Probabilistic data structure
  • Can say β€œdefinitely not in set” or β€œmaybe in set”
  • False positives possible (wastes one disk read) but no false negatives
  • Space-efficient: ~10 bits per key for 1% false-positive rate
  • Avoids expensive disk seeks for non-existent keys

Compaction:

Before compaction:              After compaction:
SSTable 1: [a=1, b=2, c=3]     
SSTable 2: [b=5, d=4]      β†’   Merged: [a=1, b=5, c=3, d=4]
SSTable 3: [c=deleted]          (c tombstone removed if old enough)

Step 4: CAP Theorem Trade-offs & System Comparison

CAP Trade-offs in Key-Value Stores

CP Systems (sacrifice availability):

  • Refuse requests during partition to preserve consistency
  • Use case: Financial data, inventory counts
  • Examples: HBase, Zookeeper

AP Systems (sacrifice consistency):

  • Serve potentially stale data during partition
  • Use case: Shopping carts, social feeds, user preferences
  • Examples: Cassandra, DynamoDB (default), CouchDB

Key interview insight:

Network partition WILL happen in distributed systems.
You must choose: CP or AP.

"What's worse for your use case?"
  - Show wrong balance: Use AP (show eventual)
  - Show old shopping cart: Use AP (merge on checkout)
  - Show wrong bank balance: Use CP (must be correct)

System Comparison

FeatureRedisCassandraDynamoDB
CAPCP (cluster)AP (tunable)AP (tunable)
ConsistencyStrong (single) / Eventual (cluster)Tunable (quorum)Eventual (default) / Strong (+50% cost)
Data modelKey-value, sorted sets, listsWide-column (key β†’ map of columns)Key-value + document
ReplicationPrimary-replicaMulti-master (N replicas)Multi-region
ScaleVertical + clusterHorizontal (add nodes)Fully managed, auto-scale
LatencySub-millisecondLow (single-digit ms)Low (single-digit ms)
PersistenceOptional (RDB/AOF)AlwaysAlways
Use caseCaching, sessions, leaderboardsIoT, time-series, high writeE-commerce, gaming, serverless

Design Summary

Full Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Client                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚ get(key) / put(key, value)
                                v
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Coordinator Node                               β”‚
β”‚  (Selected by consistent hashing β€” any node can be coordinator)    β”‚
β”‚                                                                     β”‚
β”‚  Consistent Hash Ring β†’ determine N replica nodes                  β”‚
β”‚  Quorum logic: wait for W writes or R reads                        β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚  replicate to N nodes
       v
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Distributed Node Ring (Consistent Hashing)            β”‚
β”‚                                                                      β”‚
β”‚   [Node A] ←──Gossip──→ [Node B] ←──Gossip──→ [Node C]              β”‚
β”‚     vnode1              vnode2                 vnode3               β”‚
β”‚     vnode4              vnode5                 vnode6               β”‚
β”‚                                                                      β”‚
β”‚   Each node handles:                                                 β”‚
β”‚   - Write path: commit log β†’ memtable β†’ SSTable                     β”‚
β”‚   - Read path: memtable β†’ bloom filter β†’ SSTable                    β”‚
β”‚   - Failure detection: gossip protocol                              β”‚
β”‚   - Conflict resolution: vector clocks or LWW                      β”‚
β”‚   - Anti-entropy: Merkle trees                                      β”‚
β”‚   - Temp failure: sloppy quorum + hinted handoff                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚  async replication across DCs
       v
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Remote Datacenter (same ring structure)                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Decisions Summary

ComponentDecisionReasoning
PartitioningConsistent hashing with virtual nodesEven distribution, minimal remapping on add/remove
ReplicationN replicas, coordinator + ring walkFault tolerance, availability
ConsistencyQuorum (W + R > N)Tunable: trade latency for consistency
Failure detectionGossip protocolScalable, no single point of failure
Conflict resolutionVector clocks or LWWCausality tracking or simplicity
Temp failuresSloppy quorum + hinted handoffMaintain availability during short outages
Perm failuresAnti-entropy with Merkle treesEfficient sync of diverged replicas
Write pathCommit log + memtable + SSTableDurability + performance
Read pathBloom filter + SSTable indexMinimize expensive disk I/O
Multi-DCAsync cross-DC replicationTolerate DC-level failures

Interview Questions & Answers

Q: Explain the difference between strong and eventual consistency. When would you choose each?
A: Strong consistency: every read returns the most recent write (W + R > N, e.g., W=2, R=2, N=3). No stale reads but higher latency. Choose for financial data, inventory, booking systems. Eventual consistency: replicas converge over time (W + R <= N). Lower latency, higher availability. Choose for social feeds, view counts, user preferences where stale data is acceptable.

Q: How do vector clocks work and what problem do they solve?
A: Vector clocks track causality to detect conflicts in distributed writes. Each write is tagged with pairs of [server, version]. When two writes happen concurrently (network partition), the vector clocks don’t have a dominance relationship, revealing a conflict. The system can then resolve it (client merge or last-write-wins). Without vector clocks, you might silently lose data or use the wrong version.

Q: Why do we use virtual nodes in consistent hashing?
A: Without virtual nodes, servers may end up with very uneven portions of the ring due to random hash positions. Virtual nodes give each physical server multiple positions on the ring, resulting in a more even data distribution. They also make it easy to scale: when adding a new server, it takes a share of virtual nodes from each existing server proportionally.

Q: What is the gossip protocol and why is it used for failure detection?
A: Each node maintains a membership list with heartbeat counters. Periodically, each node picks a few random peers and exchanges its list. Recipients take the max heartbeat for each node. If a node’s heartbeat hasn’t updated in T seconds, it’s suspected as failed. This avoids O(NΒ²) all-to-all heartbeats, scales well, and has no single point of failure. The cost is it takes a few seconds to detect failures.

Q: Explain the write path in a distributed key-value store.
A: 1) Write to commit log (WAL) first β€” sequential disk append, fast and durable. 2) Write to memtable β€” in-memory sorted structure, fast reads/writes. 3) When memtable exceeds threshold, flush to SSTable on disk β€” immutable, sorted. 4) Background compaction merges SSTables β€” removes stale/deleted data, improves read performance. The commit log ensures durability even if the server crashes before the memtable is flushed.

Q: What is a Bloom filter and how does it improve read performance?
A: A Bloom filter is a probabilistic data structure that answers β€œis this key possibly in this SSTable?” It can say β€œdefinitely NOT in set” (no disk read needed) or β€œmaybe in set” (read the SSTable). False positives are possible but false negatives are not, so we never skip a key that’s actually there. With many SSTables on disk, Bloom filters avoid expensive seeks for missing keys, dramatically improving read latency.

Q: How does sloppy quorum differ from strict quorum? What’s the trade-off?
A: Strict quorum requires writes/reads to go to the N designated replica nodes for a key. If one is down, the operation may fail. Sloppy quorum uses the first W/R available healthy nodes instead, which may not be in the designated N. This improves availability during failures. However, a hinted handoff is required to propagate data back to the designated node once it recovers. The trade-off is: better availability but temporarily relaxed consistency guarantees.


Key Takeaways

  1. CAP theorem forces a choice: Since partition tolerance is required, every distributed KV store chooses CP (strong consistency) or AP (high availability). Most production KV stores (Cassandra, DynamoDB) choose AP with tunable consistency.
  2. Consistent hashing solves data distribution: Virtual nodes ensure even load distribution and minimize key remapping when servers join or leave.
  3. W + R > N guarantees strong consistency: The quorum overlap ensures at least one node read was also written. W=2, R=2, N=3 is the most common balanced choice.
  4. Vector clocks detect but don’t resolve conflicts: They reveal concurrent writes; resolution is up to the client or a policy (LWW). Vector clocks preserve causality.
  5. Gossip protocol for failure detection scales: O(log N) rounds to propagate status across the cluster. No central authority needed.
  6. Sloppy quorum + hinted handoff maintain availability: During temporary node failures, writes go to available nodes with a hint. Data is synced back when the node recovers.
  7. Merkle trees enable efficient anti-entropy: Instead of comparing all data, only diverged subtrees are synced. Proportional to differences, not total data.
  8. Write path optimization: Commit log (durability) + memtable (fast writes) + SSTable (persistent). Bloom filters + sparse indexes make reads fast across many SSTables.


Practice this design! Very common hard interview question. Be ready to:

  1. Draw the consistent hashing ring with virtual nodes
  2. Explain quorum with N, W, R values and their trade-offs
  3. Walk through vector clock conflict detection step by step
  4. Describe the full write and read paths
  5. Discuss CAP theorem choice for your system

Last Updated: 2026-04-13
Status: Hard β€” Core distributed systems β€” Must know!