Chapter 5: Replication
Overview
Replication means keeping a copy of the same data on multiple machines connected via a network. The reasons: latency (place data geographically close to users), availability (tolerate node failures), and throughput (scale out reads across multiple replicas). This chapter examines the challenges of replication—particularly replication lag and write conflict resolution—through three architectures: single-leader, multi-leader, and leaderless.
Core tension: All replication strategies must choose between consistency (all replicas show the same data) and availability/performance (always respond quickly even under failures or lag).
Key Concepts
Leaders and Followers (Single-Leader Replication)
Architecture:
- One replica designated as leader (master/primary): handles all writes
- All other replicas are followers (slaves/secondaries/read replicas): replicate from leader
- Clients write to leader; can read from any replica
Replication log / change stream:
- Leader writes changes to a replication log
- Followers apply changes in the same order
- Replication log forms: WAL shipping, row-based log, statement-based log, logical (row-based) log
Synchronous vs Asynchronous Replication:
- Synchronous: Leader waits for follower to confirm before acknowledging write
- Pro: follower always has up-to-date data; durable if leader fails
- Con: if follower is slow/dead, writes must block
- Asynchronous: Leader sends to follower but doesn’t wait
- Pro: leader continues even if all followers are behind
- Con: can lose writes if leader fails before followers catch up
- Semi-synchronous: One follower is synchronous, rest are async (common practice)
Setting up new followers:
- Take consistent snapshot of leader (without locking if possible)
- Copy snapshot to new follower
- Follower requests replication log changes since snapshot
- Follower is now caught up (in replication)
Handling Node Outages
Follower failure (catch-up recovery):
- Follower keeps log of received replication events
- On restart, requests all changes since last processed event
Leader failure (failover):
- Detect failure (timeout-based, typically 30s)
- Elect new leader (consensus among followers; choose most up-to-date)
- Reconfigure clients and followers to use new leader
Failover problems:
- Split brain: Two nodes both believe they are leader → data corruption
- Solution: STONITH (Shoot The Other Node In The Head) or fencing
- Replication lag at failover: New leader may not have received all writes → writes lost
- GitHub incident (2012): lost data from MySQL follower promotion
- Wrong timeout: Too short = false positives (overloaded, not dead); too long = slow recovery
Replication Lag Problems
Eventual consistency: After writes stop, followers will eventually converge to same state as leader. But during replication lag, followers are stale.
Problem 1: Reading Your Own Writes (Read-after-write consistency):
- User writes data, then reads immediately — may hit stale follower
- Solution: Read own writes from leader (at least for 1 minute after write)
- Or: track timestamp of last write; don’t read from replica behind that timestamp
Problem 2: Monotonic Reads:
- User reads from follower A (recent data), then reads from follower B (stale data) → time goes backwards
- Solution: Route same user’s reads to same replica consistently
Problem 3: Consistent Prefix Reads:
- In partitioned DB, writes to different partitions may be observed out of order
- User A writes question; User B writes answer; Observer sees answer before question
- Solution: Causally related writes go to same partition, or track causal dependencies
Multi-Leader Replication
When to use:
- Multiple datacenters (one leader per datacenter, each syncs to others)
- Clients with offline operation (calendar app with local DB + sync)
- Collaborative editing (Google Docs)
Advantages:
- Writes accepted at local datacenter (low latency)
- Continues working if one datacenter is down
Big problem: Write conflicts:
- Two users edit the same record simultaneously in different datacenters
- Both writes are “accepted” locally, then conflict detected during sync
Conflict resolution strategies:
- Last Write Wins (LWW): Assign timestamp, higher timestamp wins — risk of data loss
- Merge: Combine both values (e.g., CRDT data structures)
- Explicit conflict tracking: Store all versions, let user (or app) resolve
- Conflict-free Replicated Data Types (CRDTs): Data structures designed to merge automatically
Topologies (how leaders sync with each other):
- All-to-all: Every leader sends to every other leader (most resilient)
- Star: Central leader forwards to all others
- Circular: Each node sends to next in ring
Leaderless Replication
Architecture (Dynamo-style: Amazon DynamoDB, Cassandra, Riak):
- No single leader; writes sent to multiple replicas simultaneously
- Reads also sent to multiple replicas; version numbers used to detect stale values
Quorum reads/writes:
- n replicas, write acknowledged by w replicas, read from r replicas
- As long as
w + r > n, you’re guaranteed to read at least one up-to-date value - Typical: n=3, w=2, r=2 (majority for both)
Sloppy Quorum:
- If fewer than w/r replicas available, accept writes on available nodes (even outside home nodes)
- Once home nodes recover, hinted handoff transfers the data back
- More available, but weaker guarantees
Anti-entropy:
- Background process that compares replicas and fixes differences
- Merkle tree used to efficiently compare large datasets
Replication repair mechanisms:
- Read repair: When read detects stale replica, write back the up-to-date value
- Anti-entropy process: Background compares and syncs all replicas
Limitations of leaderless:
- Even with
w + r > n, edge cases exist where stale reads are possible - Concurrent writes → need LWW or version vectors to resolve
- Doesn’t support multi-row transactions natively
Detecting Concurrent Writes
Version vectors (vector clocks):
- Track causal history: which writes have been seen by each replica
- Can determine if two writes are concurrent (neither has seen the other)
- If concurrent: conflict to resolve; if one “happened before” the other: safe to overwrite
Last Write Wins (LWW):
- Each write has a timestamp; highest timestamp wins
- Problem: clocks aren’t perfectly synchronized; writes can be lost
- Acceptable only when losing writes is OK (e.g., caching)
CRDTs (Conflict-free Replicated Data Types):
- Data structures that can always be merged automatically
- Counters, sets, maps with special merge semantics
- Used in Riak, Redis Cluster, collaborative editing
Important Points
- Replication lag is invisible until it causes a bug: Design for it explicitly.
- Failover is harder than it looks: Split brain, lost writes, wrong timeouts are all real problems.
- Multi-leader is powerful but risky: Write conflicts require a conflict resolution strategy upfront.
- Leaderless quorums don’t eliminate inconsistency: Even with
w + r > n, edge cases exist (sloppy quorum, concurrent writes). - Eventual consistency is a spectrum: From “may lag by seconds” to “may lag indefinitely.”
- Version vectors, not timestamps: For detecting concurrent writes, logical clocks are more reliable than wall clocks.
Examples & Case Studies
-
GitHub Failover Incident (2012)
- MySQL follower promoted to leader during failover
- Follower was behind on replication; some recent data lost
- Led to changes in failover procedures
-
Amazon Dynamo (Leaderless)
- Designed for “always writable” shopping cart
- Used LWW + vector clocks for conflict detection
- Sloppy quorum for high availability during network partitions
-
CouchDB / PouchDB (Multi-leader offline)
- CouchDB replication designed for multi-master sync
- PouchDB runs in browser, syncs to CouchDB when online
- Conflicts shown to users to resolve manually
-
Google Spanner (External Consistency)
- TrueTime API: GPS + atomic clocks give bounded time uncertainty
- Allows truly linearizable distributed transactions
- Solves “consistent prefix reads” across datacenters
Questions
- What are the trade-offs between synchronous and asynchronous replication?
- How does leader election work and what are its failure modes?
- What are the three consistency guarantees for replication lag?
- When does multi-leader replication make sense and how do you handle conflicts?
- What is a quorum in leaderless replication and what does it guarantee?
- What is split-brain and how do you prevent it?
- How do version vectors help resolve concurrent writes?
- When would you use CRDTs vs LWW vs explicit conflict tracking?
Modern Context (2026)
Cloud-native replication:
- Aurora (AWS): single writer, up to 15 read replicas via shared storage layer
- AlloyDB (GCP): pooled write buffers with instant read replica catch-up
- PlanetScale: MySQL with Vitess sharding + online schema changes
- Neon: PostgreSQL with copy-on-write storage, instant branching
Raft-everywhere:
- Raft consensus algorithm now used in virtually all distributed databases
- etcd (Kubernetes), TiKV, CockroachDB, Dgraph all use Raft
- Replaces older Paxos implementations with cleaner leader election
CRDTs in production:
- Redis Cluster: CRDT-based conflict resolution
- Riak 2.0: CRDTs (Counters, Sets, Maps, Registers)
- Figma, Linear, Notion: CRDT-based collaborative editing
Active-active multi-region:
- Standard pattern for global applications (2026)
- Requires careful conflict resolution strategy (usually CRDTs or application-level)
- Latency cost: ~100ms cross-region write propagation
Status: Notes complete
Last Updated: 2026-04-13