Chapter 5: Replication

Overview

Replication means keeping a copy of the same data on multiple machines connected via a network. The reasons: latency (place data geographically close to users), availability (tolerate node failures), and throughput (scale out reads across multiple replicas). This chapter examines the challenges of replication—particularly replication lag and write conflict resolution—through three architectures: single-leader, multi-leader, and leaderless.

Core tension: All replication strategies must choose between consistency (all replicas show the same data) and availability/performance (always respond quickly even under failures or lag).

Key Concepts

Leaders and Followers (Single-Leader Replication)

Architecture:

One replica designated as leader (master/primary): handles all writes
All other replicas are followers (slaves/secondaries/read replicas): replicate from leader
Clients write to leader; can read from any replica

Replication log / change stream:

Leader writes changes to a replication log
Followers apply changes in the same order
Replication log forms: WAL shipping, row-based log, statement-based log, logical (row-based) log

Synchronous vs Asynchronous Replication:

Synchronous: Leader waits for follower to confirm before acknowledging write
- Pro: follower always has up-to-date data; durable if leader fails
- Con: if follower is slow/dead, writes must block
Asynchronous: Leader sends to follower but doesn’t wait
- Pro: leader continues even if all followers are behind
- Con: can lose writes if leader fails before followers catch up
Semi-synchronous: One follower is synchronous, rest are async (common practice)

Setting up new followers:

Take consistent snapshot of leader (without locking if possible)
Copy snapshot to new follower
Follower requests replication log changes since snapshot
Follower is now caught up (in replication)

Handling Node Outages

Follower failure (catch-up recovery):

Follower keeps log of received replication events
On restart, requests all changes since last processed event

Leader failure (failover):

Detect failure (timeout-based, typically 30s)
Elect new leader (consensus among followers; choose most up-to-date)
Reconfigure clients and followers to use new leader

Failover problems:

Split brain: Two nodes both believe they are leader → data corruption
- Solution: STONITH (Shoot The Other Node In The Head) or fencing
Replication lag at failover: New leader may not have received all writes → writes lost
- GitHub incident (2012): lost data from MySQL follower promotion
Wrong timeout: Too short = false positives (overloaded, not dead); too long = slow recovery

Replication Lag Problems

Eventual consistency: After writes stop, followers will eventually converge to same state as leader. But during replication lag, followers are stale.

Problem 1: Reading Your Own Writes (Read-after-write consistency):

User writes data, then reads immediately — may hit stale follower
Solution: Read own writes from leader (at least for 1 minute after write)
Or: track timestamp of last write; don’t read from replica behind that timestamp

Problem 2: Monotonic Reads:

User reads from follower A (recent data), then reads from follower B (stale data) → time goes backwards
Solution: Route same user’s reads to same replica consistently

Problem 3: Consistent Prefix Reads:

In partitioned DB, writes to different partitions may be observed out of order
User A writes question; User B writes answer; Observer sees answer before question
Solution: Causally related writes go to same partition, or track causal dependencies

Multi-Leader Replication

When to use:

Multiple datacenters (one leader per datacenter, each syncs to others)
Clients with offline operation (calendar app with local DB + sync)
Collaborative editing (Google Docs)

Advantages:

Writes accepted at local datacenter (low latency)
Continues working if one datacenter is down

Big problem: Write conflicts:

Two users edit the same record simultaneously in different datacenters
Both writes are “accepted” locally, then conflict detected during sync

Conflict resolution strategies:

Last Write Wins (LWW): Assign timestamp, higher timestamp wins — risk of data loss
Merge: Combine both values (e.g., CRDT data structures)
Explicit conflict tracking: Store all versions, let user (or app) resolve
Conflict-free Replicated Data Types (CRDTs): Data structures designed to merge automatically

Topologies (how leaders sync with each other):

All-to-all: Every leader sends to every other leader (most resilient)
Star: Central leader forwards to all others
Circular: Each node sends to next in ring

Leaderless Replication

Architecture (Dynamo-style: Amazon DynamoDB, Cassandra, Riak):

No single leader; writes sent to multiple replicas simultaneously
Reads also sent to multiple replicas; version numbers used to detect stale values

Quorum reads/writes:

n replicas, write acknowledged by w replicas, read from r replicas
As long as w + r > n, you’re guaranteed to read at least one up-to-date value
Typical: n=3, w=2, r=2 (majority for both)

Sloppy Quorum:

If fewer than w/r replicas available, accept writes on available nodes (even outside home nodes)
Once home nodes recover, hinted handoff transfers the data back
More available, but weaker guarantees

Anti-entropy:

Background process that compares replicas and fixes differences
Merkle tree used to efficiently compare large datasets

Replication repair mechanisms:

Read repair: When read detects stale replica, write back the up-to-date value
Anti-entropy process: Background compares and syncs all replicas

Limitations of leaderless:

Even with w + r > n, edge cases exist where stale reads are possible
Concurrent writes → need LWW or version vectors to resolve
Doesn’t support multi-row transactions natively

Detecting Concurrent Writes

Version vectors (vector clocks):

Track causal history: which writes have been seen by each replica
Can determine if two writes are concurrent (neither has seen the other)
If concurrent: conflict to resolve; if one “happened before” the other: safe to overwrite

Last Write Wins (LWW):

Each write has a timestamp; highest timestamp wins
Problem: clocks aren’t perfectly synchronized; writes can be lost
Acceptable only when losing writes is OK (e.g., caching)

CRDTs (Conflict-free Replicated Data Types):

Data structures that can always be merged automatically
Counters, sets, maps with special merge semantics
Used in Riak, Redis Cluster, collaborative editing

Important Points

Replication lag is invisible until it causes a bug: Design for it explicitly.
Failover is harder than it looks: Split brain, lost writes, wrong timeouts are all real problems.
Multi-leader is powerful but risky: Write conflicts require a conflict resolution strategy upfront.
Leaderless quorums don’t eliminate inconsistency: Even with w + r > n, edge cases exist (sloppy quorum, concurrent writes).
Eventual consistency is a spectrum: From “may lag by seconds” to “may lag indefinitely.”
Version vectors, not timestamps: For detecting concurrent writes, logical clocks are more reliable than wall clocks.

Examples & Case Studies

GitHub Failover Incident (2012)
- MySQL follower promoted to leader during failover
- Follower was behind on replication; some recent data lost
- Led to changes in failover procedures
Amazon Dynamo (Leaderless)
- Designed for “always writable” shopping cart
- Used LWW + vector clocks for conflict detection
- Sloppy quorum for high availability during network partitions
CouchDB / PouchDB (Multi-leader offline)
- CouchDB replication designed for multi-master sync
- PouchDB runs in browser, syncs to CouchDB when online
- Conflicts shown to users to resolve manually
Google Spanner (External Consistency)
- TrueTime API: GPS + atomic clocks give bounded time uncertainty
- Allows truly linearizable distributed transactions
- Solves “consistent prefix reads” across datacenters

Questions

What are the trade-offs between synchronous and asynchronous replication?
How does leader election work and what are its failure modes?
What are the three consistency guarantees for replication lag?
When does multi-leader replication make sense and how do you handle conflicts?
What is a quorum in leaderless replication and what does it guarantee?
What is split-brain and how do you prevent it?
How do version vectors help resolve concurrent writes?
When would you use CRDTs vs LWW vs explicit conflict tracking?

Modern Context (2026)

Cloud-native replication:

Aurora (AWS): single writer, up to 15 read replicas via shared storage layer
AlloyDB (GCP): pooled write buffers with instant read replica catch-up
PlanetScale: MySQL with Vitess sharding + online schema changes
Neon: PostgreSQL with copy-on-write storage, instant branching

Raft-everywhere:

Raft consensus algorithm now used in virtually all distributed databases
etcd (Kubernetes), TiKV, CockroachDB, Dgraph all use Raft
Replaces older Paxos implementations with cleaner leader election

CRDTs in production:

Redis Cluster: CRDT-based conflict resolution
Riak 2.0: CRDTs (Counters, Sets, Maps, Registers)
Figma, Linear, Notion: CRDT-based collaborative editing

Active-active multi-region:

Standard pattern for global applications (2026)
Requires careful conflict resolution strategy (usually CRDTs or application-level)
Latency cost: ~100ms cross-region write propagation

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch05-replication