Chapter 5 Flashcards - Replication
Basic Concepts
Why do systems use replication and what are the three main reasons?
?
- Latency: Place data geographically close to users (lower read latency)
- Availability: System continues working even when some nodes fail
- Throughput: Spread read load across multiple replicas (scale reads)
Trade-off: All three goals can conflict with consistency (ensuring all replicas show the same data at the same time)
What are the three replication architectures?
?
- Single-leader (master-slave): One node accepts writes; followers replicate and serve reads
- Simplest, no write conflicts
- Multi-leader (multi-master): Multiple nodes accept writes; sync async
- Better write latency across DCs; risk of write conflicts
- Leaderless (Dynamo-style): All nodes accept writes; quorum determines reads
- Highest availability; weakest consistency guarantees
What is the difference between synchronous and asynchronous replication?
?
Synchronous:
- Leader waits for follower to confirm write before acknowledging to client
- ✅ Follower always has up-to-date copy (no data loss on failover)
- ❌ Writes block if follower is slow or unavailable
- Semi-synchronous (common): One follower is sync, rest are async
Asynchronous:
- Leader acknowledges write immediately, sends to followers in background
- ✅ High write throughput, tolerates follower lag
- ❌ Can lose writes if leader fails before followers catch up (durability risk)
Typical choice: Semi-synchronous for durability without full blocking
Single-Leader Replication
What are the steps to add a new follower to a running single-leader system?
?
- Take a consistent snapshot of leader’s database (without locking, if DB supports it)
- Copy the snapshot to the new follower node
- Follower connects to leader and requests all changes since the snapshot (using snapshot’s replication log position)
- Follower processes the backlog until it has caught up to leader
- Follower can now serve reads
Key: The snapshot must be tagged with the exact replication log position (log sequence number) so the follower knows where to resume
What are the steps in a leader failover and what problems can occur?
?
Failover steps:
- Detect leader failure (timeout, e.g., no heartbeat for 30s)
- Elect new leader (usually: most up-to-date follower by replication offset)
- Reconfigure: clients and followers switch to new leader
Problems:
- Split brain: Old leader recovers and thinks it’s still leader → two leaders → data conflicts
- Solution: STONITH (fencing) — forcefully shut down old leader
- Lost writes: New leader may not have all writes from old leader (async replication lag)
- GitHub 2012 incident: MySQL follower promoted, recent data lost
- Wrong timeout: Too short → false positives under load; too long → long unavailability
- Cascading failures: Failover under high load can overwhelm new leader
Replication Lag
What is eventual consistency and what problems does replication lag cause?
?
- Eventual consistency: If writes stop, all replicas will eventually converge to the same state
- During lag, followers show stale data — three consistency problems arise:
-
Read-your-writes problem: User writes, immediately reads from stale follower — sees old data
- Fix: Route user’s own-profile reads to leader; or track last-write timestamp
-
Monotonic reads: User reads fresh data from replica A, then stale data from replica B — time goes backward
- Fix: Always route same user to same replica
-
Consistent prefix reads: Writes on different partitions appear out of order to observers
- Fix: Causally related writes to same partition
What is read-after-write (read-your-writes) consistency and how do you implement it?
?
- Problem: User writes a post, refreshes, reads from stale follower — post seems to disappear
- Guarantee: User always sees their own writes, even if replicas are stale
Implementation options:
- Read from leader for 1 minute after any write from that user
- Track last write timestamp per user; only read from replicas with offset ≥ that timestamp
- Route user’s reads of their own data to leader (e.g., reading your own profile = leader)
- Sticky session: Route user’s requests to same replica; replica receives updates first
Cross-device complication: User writes on phone, reads on laptop — different devices may hit different replicas
Multi-Leader Replication
When does multi-leader replication make sense and what is its main risk?
?
Use cases:
- Multiple datacenters: One leader per DC; writes go to local leader (low latency); async sync across DCs
- Offline clients: Each device has local DB (like CouchDB); syncs when online
- Collaborative editing: Google Docs, conflict-free editing
Main risk: Write conflicts
- Two users edit same record in different DCs simultaneously
- Both writes are accepted locally; conflict discovered during async sync
- No natural “latest” — must resolve conflict
Conflict resolution strategies:
- Last Write Wins (LWW): timestamp-based, risks data loss
- CRDTs: automatic merge (no data loss, but limited data types)
- Explicit tracking: store all versions, application/user resolves
What are CRDTs and when are they used for conflict resolution?
?
- CRDT = Conflict-free Replicated Data Type
- Definition: Data structures whose merge operation is commutative, associative, and idempotent
- Property: Any two replicas with the same set of updates will always converge to same state, regardless of order of operations
CRDT types:
- G-Counter: Grow-only counter (each replica has own counter; merge = max per replica)
- PN-Counter: Increment and decrement (two G-counters: P and N)
- G-Set: Grow-only set (add-only; merge = union)
- LWW-Element-Set: Set with timestamps; add/remove both tracked
- CRDT Map: Recursive composition of CRDTs
Used in: Riak 2.0, Redis Cluster, Figma, Linear (collaborative editing), Apple Notes sync
Leaderless Replication
What is a quorum in leaderless replication and what does w + r > n guarantee?
?
- n = total number of replicas
- w = number of nodes that must confirm a write
- r = number of nodes queried for a read
Rule: w + r > n
- The write set and read set must overlap by at least one node
- That overlapping node has the latest written value
- Therefore, at least one node in the read set has the up-to-date value
Example: n=3, w=2, r=2
- Write confirmed by 2 nodes
- Read from 2 nodes
- Overlap = at least 1 → guaranteed to read a node with the write
Not a perfect guarantee: Sloppy quorums, concurrent writes, and network partitions can still cause stale reads
What is a sloppy quorum and what is hinted handoff?
?
-
Scenario: Network partition isolates some nodes; fewer than w home nodes are reachable
-
Sloppy quorum: Accept writes on available nodes even if they’re not in the “home” node set
- Maintains write availability during partition
- Trades off strict consistency guarantees
-
Hinted handoff: Once network partition heals, temporary nodes send the written data back to the correct home nodes
- Eventually, data ends up where it belongs
- “Hint” = note attached to write saying “this belongs to node X”
-
Trade-off: Sloppy quorum makes writes always available but
w + r > nno longer guarantees reading the latest value (data might be on “outsider” nodes not in your read quorum)
How does anti-entropy work in leaderless systems?
?
- Problem: After a node fails and recovers, it may be missing writes that occurred while it was down
- Read repair: When a read detects a stale replica (via version comparison), the reader writes the up-to-date value back to that replica
- Works for frequently read data; stale rarely-read data may persist
- Anti-entropy process: Background process that constantly compares replicas and syncs differences
- Merkle tree (hash tree) used to efficiently identify which subtrees differ
- Cassandra, Dynamo, Riak all implement this
- Does not guarantee any ordering (unlike leader’s replication log)
Detecting Concurrent Writes
What is a version vector (vector clock) and how does it detect concurrent writes?
?
-
Version vector: A counter per replica, tracking how many writes each replica has seen
- Example:
{A: 3, B: 2}means “seen 3 writes from A, 2 from B”
- Example:
-
Happened-before: Event X happened before Y if Y’s version vector is component-wise ≥ X’s
{A:3, B:2}happened before{A:3, B:4}(B increased)
-
Concurrent: Neither version vector dominates the other
{A:3, B:2}vs{A:2, B:4}— A is ahead on A, B is ahead on B → CONCURRENT
-
On concurrent detection: Must store multiple versions (siblings); application or CRDT merges them
-
LWW alternative: Just use timestamp; simple but can silently discard writes
Modern Context (2026)
How has the Raft consensus algorithm changed replication in modern databases?
?
-
Raft (2014, Ongaro and Ousterhout): Consensus algorithm designed for understandability
-
Replaces: Bespoke Paxos implementations (complex, hard to implement correctly)
-
How it works: Leader election + log replication in one clean protocol
- Leader elected by majority vote
- All writes go through leader; committed when majority of nodes acknowledge
- Strongly consistent by construction (no split brain)
-
Used in (2026):
- etcd (Kubernetes config store)
- TiKV (TiDB’s storage layer)
- CockroachDB, YugabyteDB
- Consul, Nomad
- Apache Kafka (KRaft mode — replacing ZooKeeper)
-
Why matters: Near-zero data loss guarantees with automatic leader election
How do cloud databases like Aurora handle replication differently from traditional approaches?
?
-
Traditional: Leader ships WAL log to followers; each follower applies log to its local storage
-
Aurora (AWS):
- 6 copies across 3 AZs; quorum write (4/6), quorum read (3/6)
- Only WAL shipped — no page writes; storage nodes apply WAL independently
- Storage layer is shared (not per-node) → instant read replica catch-up
- Read replicas apply WAL from shared storage → single-digit ms lag typically
-
Key innovation: Separating compute from storage + consensus inside storage layer
-
AlloyDB (GCP): Similar approach with columnar cache layer for HTAP
-
Implication: Traditional replication concepts still apply at the logical level, but the physical implementation is radically different
Interview Scenarios
Design replication for a social media app with users in North America, Europe, and Asia.
?
Architecture: Multi-leader (one leader per region) + leaderless within region
Regional design:
- 3 regions (NA, EU, APAC) each with a leader
- Within each region: 3 replicas with Raft (strong consistency intra-region)
Cross-region sync:
- Async replication between regions (cross-region write latency is ~100ms)
- Accept eventual consistency for cross-region reads
Conflict resolution:
- User profile updates: Last Write Wins (LWW) with timestamp is acceptable
- Social interactions (likes, follows): CRDTs (G-Counter for likes)
- Messages: Causal ordering required → use vector clocks or single leader for DMs
Read-your-writes: Route user’s reads to their home region after any write
A distributed database is experiencing split-brain. How do you detect and resolve it?
?
Detection:
- Two nodes both accept writes for the same key
- Monitoring: detect two nodes claiming to be leader (heartbeat collision)
- Version vector or timestamp conflict in stored data
Prevention (better than cure):
- STONITH (Shoot The Other Node In The Head): When new leader elected, fence old leader
- Power off, network isolate, or storage-block the old leader
- Must happen before new leader accepts writes
- Majority quorum: Only proceed with operations when majority available
- If only 2/5 nodes reachable → refuse to be leader (minority can’t form quorum)
- Lease-based leadership: Leader holds a time-bounded lease; must renew; expired = step down
Resolution if it happened:
- Compare version vectors, identify conflicting writes
- Apply conflict resolution strategy (manual or CRDT)
- Recalibrate clocks and leader election
Quick Facts
What was the GitHub MySQL failover incident and what was its lesson?
?
- Year: 2012
- What happened: During MySQL leader failover, a follower was promoted that was behind on replication
- Result: Recently written data was lost (follower hadn’t yet received those writes)
- Lesson: Asynchronous replication has a real risk of data loss on failover; must account for replication lag when promoting followers
Best practices since:
- Measure replication lag before promoting follower
- Semi-synchronous replication for at-least-one-sync-follower guarantee
- Consider using synchronous replication for critical data
What quorum configuration does Cassandra use by default and what does it guarantee?
?
-
Default: Cassandra typically uses
n=3(3 replicas per key) -
Common config:
QUORUMconsistency level = majority (2 out of 3)- Write: confirmed by 2/3 nodes
- Read: queried from 2/3 nodes
- Guarantees:
w + r > n→ 2 + 2 > 3 → at least 1 overlap
-
Other Cassandra consistency levels:
ONE: 1 node (fastest, weakest)QUORUM: majority (balanced)ALL: all nodes (slowest, strongest)LOCAL_QUORUM: majority within local DC only (good for multi-DC)
-
Trade-off: Higher consistency = lower availability and higher latency
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13