Chapter 5 Flashcards - Replication

Basic Concepts

Why do systems use replication and what are the three main reasons?
?

Latency: Place data geographically close to users (lower read latency)
Availability: System continues working even when some nodes fail
Throughput: Spread read load across multiple replicas (scale reads)

Trade-off: All three goals can conflict with consistency (ensuring all replicas show the same data at the same time)

What are the three replication architectures?
?

Single-leader (master-slave): One node accepts writes; followers replicate and serve reads
- Simplest, no write conflicts
Multi-leader (multi-master): Multiple nodes accept writes; sync async
- Better write latency across DCs; risk of write conflicts
Leaderless (Dynamo-style): All nodes accept writes; quorum determines reads
- Highest availability; weakest consistency guarantees

What is the difference between synchronous and asynchronous replication?
?
Synchronous:

Leader waits for follower to confirm write before acknowledging to client
✅ Follower always has up-to-date copy (no data loss on failover)
❌ Writes block if follower is slow or unavailable
Semi-synchronous (common): One follower is sync, rest are async

Asynchronous:

Leader acknowledges write immediately, sends to followers in background
✅ High write throughput, tolerates follower lag
❌ Can lose writes if leader fails before followers catch up (durability risk)

Typical choice: Semi-synchronous for durability without full blocking

Single-Leader Replication

What are the steps to add a new follower to a running single-leader system?
?

Take a consistent snapshot of leader’s database (without locking, if DB supports it)
Copy the snapshot to the new follower node
Follower connects to leader and requests all changes since the snapshot (using snapshot’s replication log position)
Follower processes the backlog until it has caught up to leader
Follower can now serve reads

Key: The snapshot must be tagged with the exact replication log position (log sequence number) so the follower knows where to resume

What are the steps in a leader failover and what problems can occur?
?
Failover steps:

Detect leader failure (timeout, e.g., no heartbeat for 30s)
Elect new leader (usually: most up-to-date follower by replication offset)
Reconfigure: clients and followers switch to new leader

Problems:

Split brain: Old leader recovers and thinks it’s still leader → two leaders → data conflicts
- Solution: STONITH (fencing) — forcefully shut down old leader
Lost writes: New leader may not have all writes from old leader (async replication lag)
- GitHub 2012 incident: MySQL follower promoted, recent data lost
Wrong timeout: Too short → false positives under load; too long → long unavailability
Cascading failures: Failover under high load can overwhelm new leader

Replication Lag

What is eventual consistency and what problems does replication lag cause?
?

Eventual consistency: If writes stop, all replicas will eventually converge to the same state
During lag, followers show stale data — three consistency problems arise:

Read-your-writes problem: User writes, immediately reads from stale follower — sees old data
- Fix: Route user’s own-profile reads to leader; or track last-write timestamp
Monotonic reads: User reads fresh data from replica A, then stale data from replica B — time goes backward
- Fix: Always route same user to same replica
Consistent prefix reads: Writes on different partitions appear out of order to observers
- Fix: Causally related writes to same partition

What is read-after-write (read-your-writes) consistency and how do you implement it?
?

Problem: User writes a post, refreshes, reads from stale follower — post seems to disappear
Guarantee: User always sees their own writes, even if replicas are stale

Implementation options:

Read from leader for 1 minute after any write from that user
Track last write timestamp per user; only read from replicas with offset ≥ that timestamp
Route user’s reads of their own data to leader (e.g., reading your own profile = leader)
Sticky session: Route user’s requests to same replica; replica receives updates first

Cross-device complication: User writes on phone, reads on laptop — different devices may hit different replicas

Multi-Leader Replication

When does multi-leader replication make sense and what is its main risk?
?
Use cases:

Multiple datacenters: One leader per DC; writes go to local leader (low latency); async sync across DCs
Offline clients: Each device has local DB (like CouchDB); syncs when online
Collaborative editing: Google Docs, conflict-free editing

Main risk: Write conflicts

Two users edit same record in different DCs simultaneously
Both writes are accepted locally; conflict discovered during async sync
No natural “latest” — must resolve conflict

Conflict resolution strategies:

Last Write Wins (LWW): timestamp-based, risks data loss
CRDTs: automatic merge (no data loss, but limited data types)
Explicit tracking: store all versions, application/user resolves

What are CRDTs and when are they used for conflict resolution?
?

CRDT = Conflict-free Replicated Data Type
Definition: Data structures whose merge operation is commutative, associative, and idempotent
Property: Any two replicas with the same set of updates will always converge to same state, regardless of order of operations

CRDT types:

G-Counter: Grow-only counter (each replica has own counter; merge = max per replica)
PN-Counter: Increment and decrement (two G-counters: P and N)
G-Set: Grow-only set (add-only; merge = union)
LWW-Element-Set: Set with timestamps; add/remove both tracked
CRDT Map: Recursive composition of CRDTs

Used in: Riak 2.0, Redis Cluster, Figma, Linear (collaborative editing), Apple Notes sync

Leaderless Replication

What is a quorum in leaderless replication and what does w + r > n guarantee?
?

n = total number of replicas
w = number of nodes that must confirm a write
r = number of nodes queried for a read

Rule: w + r > n

The write set and read set must overlap by at least one node
That overlapping node has the latest written value
Therefore, at least one node in the read set has the up-to-date value

Example: n=3, w=2, r=2

Write confirmed by 2 nodes
Read from 2 nodes
Overlap = at least 1 → guaranteed to read a node with the write

Not a perfect guarantee: Sloppy quorums, concurrent writes, and network partitions can still cause stale reads

What is a sloppy quorum and what is hinted handoff?
?

Scenario: Network partition isolates some nodes; fewer than w home nodes are reachable
Sloppy quorum: Accept writes on available nodes even if they’re not in the “home” node set
- Maintains write availability during partition
- Trades off strict consistency guarantees
Hinted handoff: Once network partition heals, temporary nodes send the written data back to the correct home nodes
- Eventually, data ends up where it belongs
- “Hint” = note attached to write saying “this belongs to node X”
Trade-off: Sloppy quorum makes writes always available but w + r > n no longer guarantees reading the latest value (data might be on “outsider” nodes not in your read quorum)

How does anti-entropy work in leaderless systems?
?

Problem: After a node fails and recovers, it may be missing writes that occurred while it was down
Read repair: When a read detects a stale replica (via version comparison), the reader writes the up-to-date value back to that replica
- Works for frequently read data; stale rarely-read data may persist
Anti-entropy process: Background process that constantly compares replicas and syncs differences
- Merkle tree (hash tree) used to efficiently identify which subtrees differ
- Cassandra, Dynamo, Riak all implement this
- Does not guarantee any ordering (unlike leader’s replication log)

Detecting Concurrent Writes

What is a version vector (vector clock) and how does it detect concurrent writes?
?

Version vector: A counter per replica, tracking how many writes each replica has seen
- Example: {A: 3, B: 2} means “seen 3 writes from A, 2 from B”
Happened-before: Event X happened before Y if Y’s version vector is component-wise ≥ X’s
- {A:3, B:2} happened before {A:3, B:4} (B increased)
Concurrent: Neither version vector dominates the other
- {A:3, B:2} vs {A:2, B:4} — A is ahead on A, B is ahead on B → CONCURRENT
On concurrent detection: Must store multiple versions (siblings); application or CRDT merges them
LWW alternative: Just use timestamp; simple but can silently discard writes

Modern Context (2026)

How has the Raft consensus algorithm changed replication in modern databases?
?

Raft (2014, Ongaro and Ousterhout): Consensus algorithm designed for understandability
Replaces: Bespoke Paxos implementations (complex, hard to implement correctly)
How it works: Leader election + log replication in one clean protocol
- Leader elected by majority vote
- All writes go through leader; committed when majority of nodes acknowledge
- Strongly consistent by construction (no split brain)
Used in (2026):
- etcd (Kubernetes config store)
- TiKV (TiDB’s storage layer)
- CockroachDB, YugabyteDB
- Consul, Nomad
- Apache Kafka (KRaft mode — replacing ZooKeeper)
Why matters: Near-zero data loss guarantees with automatic leader election

How do cloud databases like Aurora handle replication differently from traditional approaches?
?

Traditional: Leader ships WAL log to followers; each follower applies log to its local storage
Aurora (AWS):
- 6 copies across 3 AZs; quorum write (4/6), quorum read (3/6)
- Only WAL shipped — no page writes; storage nodes apply WAL independently
- Storage layer is shared (not per-node) → instant read replica catch-up
- Read replicas apply WAL from shared storage → single-digit ms lag typically
Key innovation: Separating compute from storage + consensus inside storage layer
AlloyDB (GCP): Similar approach with columnar cache layer for HTAP
Implication: Traditional replication concepts still apply at the logical level, but the physical implementation is radically different

Interview Scenarios

Design replication for a social media app with users in North America, Europe, and Asia.
?
Architecture: Multi-leader (one leader per region) + leaderless within region

Regional design:

3 regions (NA, EU, APAC) each with a leader
Within each region: 3 replicas with Raft (strong consistency intra-region)

Cross-region sync:

Async replication between regions (cross-region write latency is ~100ms)
Accept eventual consistency for cross-region reads

Conflict resolution:

User profile updates: Last Write Wins (LWW) with timestamp is acceptable
Social interactions (likes, follows): CRDTs (G-Counter for likes)
Messages: Causal ordering required → use vector clocks or single leader for DMs

Read-your-writes: Route user’s reads to their home region after any write

A distributed database is experiencing split-brain. How do you detect and resolve it?
?
Detection:

Two nodes both accept writes for the same key
Monitoring: detect two nodes claiming to be leader (heartbeat collision)
Version vector or timestamp conflict in stored data

Prevention (better than cure):

STONITH (Shoot The Other Node In The Head): When new leader elected, fence old leader
- Power off, network isolate, or storage-block the old leader
- Must happen before new leader accepts writes
Majority quorum: Only proceed with operations when majority available
- If only 2/5 nodes reachable → refuse to be leader (minority can’t form quorum)
Lease-based leadership: Leader holds a time-bounded lease; must renew; expired = step down

Resolution if it happened:

Compare version vectors, identify conflicting writes
Apply conflict resolution strategy (manual or CRDT)
Recalibrate clocks and leader election

Quick Facts

What was the GitHub MySQL failover incident and what was its lesson?
?

Year: 2012
What happened: During MySQL leader failover, a follower was promoted that was behind on replication
Result: Recently written data was lost (follower hadn’t yet received those writes)
Lesson: Asynchronous replication has a real risk of data loss on failover; must account for replication lag when promoting followers

Best practices since:

Measure replication lag before promoting follower
Semi-synchronous replication for at-least-one-sync-follower guarantee
Consider using synchronous replication for critical data

What quorum configuration does Cassandra use by default and what does it guarantee?
?

Default: Cassandra typically uses n=3 (3 replicas per key)
Common config: QUORUM consistency level = majority (2 out of 3)
- Write: confirmed by 2/3 nodes
- Read: queried from 2/3 nodes
- Guarantees: w + r > n → 2 + 2 > 3 → at least 1 overlap
Other Cassandra consistency levels:
- ONE: 1 node (fastest, weakest)
- QUORUM: majority (balanced)
- ALL: all nodes (slowest, strongest)
- LOCAL_QUORUM: majority within local DC only (good for multi-DC)
Trade-off: Higher consistency = lower availability and higher latency

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch05-flashcards