Chapter 6: Replication

ddia-2e replication distributed-systems consistency leader-follower consensus

Status: Notes complete

Overview

Replication means keeping copies of the same data on multiple nodes. It serves three purposes: (1) keeping data geographically close to users to reduce read latency, (2) allowing the system to continue serving requests even when some nodes fail, and (3) scaling read throughput across multiple machines. These goals sound simple, but achieving them while maintaining consistency — ensuring that readers see the most recent writes — is the central difficulty in distributed data systems.

The chapter focuses on how distributed databases handle the constant tension between availability (always serve requests) and consistency (always serve correct data). The three main replication topologies (single-leader, multi-leader, and leaderless) represent different positions on this trade-off spectrum. The 2nd edition updates its coverage to reflect modern cloud database architectures (CockroachDB, DynamoDB, PlanetScale) and expands its treatment of Change Data Capture (CDC) as a first-class replication mechanism.

Key Concepts

Why Replication is Hard

The difficulty is not copying data; it is keeping copies consistent while the system continues to accept writes. Every write must propagate to all replicas, but networks are unreliable, nodes fail, and replication is not instantaneous. The delay between a write being accepted and that write appearing on a given replica is the replication lag. During that window replicas hold different data, so concurrent readers may see different versions of the truth.

The fundamental challenge: if the leader accepts a write and immediately crashes before replicating to any follower, that write is lost. If it waits for all replicas to confirm before acknowledging the write, it is unavailable if any replica is slow or offline. Every replication design navigates this tension.

Single-Leader Replication

The dominant replication architecture for relational and most NoSQL databases.

Architecture:

One replica is designated the leader (master/primary). All writes must go through the leader.
Other replicas are followers (slaves/replicas/read replicas). They receive changes from the leader and apply them in the same order.
Reads can be served from any replica (leader or followers).

Single-Leader Replication:

          ┌────────────┐
 Writes → │   LEADER   │ ← Only node that accepts writes
          └─────┬──────┘
          replication log
         ┌──────┴──────┐
         ↓             ↓
    ┌──────────┐  ┌──────────┐
    │ Follower │  │ Follower │  ← Serve reads; lag possible
    │  (sync)  │  │ (async)  │
    └──────────┘  └──────────┘

Used by: PostgreSQL (streaming replication), MySQL, Oracle Data Guard, MongoDB, Kafka, Elasticsearch.

Synchronous vs Asynchronous Replication

This is the most fundamental trade-off in single-leader replication.

Dimension	Synchronous	Asynchronous	Semi-Synchronous
Durability guarantee	Strong: write durable on all sync replicas	Weak: leader may fail before replicating	Medium: at least one follower confirms
Write latency	High: must wait for slowest sync replica	Low: leader acknowledges immediately	Medium: waits for one follower
Availability	Low: write blocked if any sync replica unavailable	High: writes proceed even if all followers down	High: only blocked if that one follower down
Read consistency	Sync followers always up-to-date	Followers may lag by seconds to hours	The sync follower is up-to-date; async followers may lag
When to use	Safety-critical data, banking	High-throughput writes, geo-distributed	Common production compromise: 1 sync + N async followers (not the PostgreSQL default, which is asynchronous)
Examples	Synchronous MySQL Group Replication	Standard MySQL async replication	PostgreSQL `synchronous_standby_names`

Fully synchronous replication is impractical if there are many followers — one slow or failed follower blocks all writes indefinitely.

Practical configuration: One synchronous follower (semi-synchronous) — the leader waits for exactly one follower to acknowledge before returning success. All other followers are async. This guarantees at least two up-to-date copies (leader + one sync follower) while limiting write latency impact.
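
The semi-synchronous policy can be sketched as a small in-memory simulation. The `Follower` class and `replicate` function are hypothetical names chosen for illustration, not any database's API, and the sketch reports failure where a real leader would typically block or promote another follower to synchronous.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    synchronous: bool
    reachable: bool = True
    log: list = field(default_factory=list)

    def apply(self, entry: str) -> bool:
        """Append the entry if reachable; the return value is the acknowledgment."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


def replicate(entry: str, leader_log: list, followers: list) -> bool:
    """Acknowledge the write only if every synchronous follower confirmed it.

    Async followers are sent the entry too, but their result is ignored.
    """
    leader_log.append(entry)
    acknowledged = True
    for follower in followers:
        ok = follower.apply(entry)
        if follower.synchronous and not ok:
            acknowledged = False
    return acknowledged


leader_log = []
sync_f = Follower("f1", synchronous=True)
async_f = Follower("f2", synchronous=False, reachable=False)

# An unreachable async follower does not block the write.
ok_with_async_down = replicate("x=1", leader_log, [sync_f, async_f])

# An unreachable sync follower does.
sync_f.reachable = False
ok_with_sync_down = replicate("x=2", leader_log, [sync_f, async_f])
```

After the first call the write exists on two nodes (the leader and `f1`), which is the "at least two up-to-date copies" property described above.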

Setting Up New Followers

To add a new follower without downtime:

Take a consistent snapshot of the leader (using DB’s snapshot mechanism — e.g., pg_basebackup for PostgreSQL, InnoDB hot backup for MySQL).
Copy snapshot to the new follower node.
Follower connects to leader and requests all changes since the snapshot was taken (using the replication log position embedded in the snapshot — PostgreSQL calls this the WAL LSN, MySQL calls it the binlog position).
Once follower has processed all the backlog and caught up, it’s operational.
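
The four steps above can be sketched with an in-memory leader whose log position is simply the number of entries written so far. `Leader`, `Follower`, `snapshot`, and `changes_since` are hypothetical names for this illustration; real systems use positions such as a WAL LSN or binlog coordinates.

```python
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) entries; position = entries so far

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a copy of the data and the log position it corresponds to."""
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot, position):
        self.data = snapshot
        self.position = position  # last log position applied

    def catch_up(self, leader):
        """Request and apply everything after the last applied position."""
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
leader.write("b", 2)

snap, pos = leader.snapshot()    # step 1: consistent snapshot + position
follower = Follower(snap, pos)   # step 2: copy snapshot to the new node

leader.write("a", 10)            # the leader keeps accepting writes
leader.write("c", 3)

follower.catch_up(leader)        # steps 3-4: replay the backlog
```

Because the follower only needs its last applied position, the same `catch_up` call also models recovery after a follower restart.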

Handling Node Outages

Follower Failure: Catch-Up Recovery

Each follower maintains a log of the changes it has received and applied from the leader. On restart, the follower knows the last transaction it processed and requests all subsequent changes from the leader.

Leader Failure: Failover

Leader failover is much more complex:

Detect leader failure: Usually via heartbeat timeout (e.g., 30 seconds of no response). False positives (network hiccups causing healthy leader to be considered dead) are common.
Elect a new leader: Via consensus (Raft, Paxos) or by choosing the follower with the most up-to-date replication log. The new leader must be agreed upon by a quorum of nodes.
Reconfigure system: Clients are redirected to the new leader; old leader (if it recovers) must become a follower and accept the new leader’s authority.

Failover failure modes (things that can go catastrophically wrong):

Lost writes: The new leader may not have received all writes from the old leader. If the old leader rejoins, its un-replicated writes conflict with the new leader’s state. The usual resolution: discard old leader’s un-replicated writes — but this violates durability guarantees that were given to the client.
Split brain: Both the old and new leader believe they are the current leader and accept writes simultaneously. Two nodes accepting conflicting writes with no coordination → data corruption. Solution: STONITH (Shoot The Other Node In The Head) — fencing mechanism to force the old leader offline.
Timeout misconfiguration: Too short → frequent unnecessary failovers during temporary overload. Too long → long outage window before failover.
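
The timeout trade-off can be made concrete with a toy failure detector. This is a sketch under simplified assumptions: `count_failovers` is a hypothetical helper, and the heartbeat timestamps are made-up values for a leader that is healthy throughout but stalls twice under load.

```python
def count_failovers(heartbeat_times: list, timeout: float) -> int:
    """Count gaps between consecutive heartbeats that exceed the timeout.

    Each such gap would make a timeout-based detector declare the leader dead.
    """
    return sum(
        1
        for earlier, later in zip(heartbeat_times, heartbeat_times[1:])
        if later - earlier > timeout
    )


# Arrival times in seconds; the leader stalls for 8 s and then for 12 s.
heartbeats = [0, 1, 2, 10, 11, 12, 24, 25]

unnecessary_with_5s = count_failovers(heartbeats, timeout=5)
unnecessary_with_10s = count_failovers(heartbeats, timeout=10)
unnecessary_with_30s = count_failovers(heartbeats, timeout=30)
```

The 30-second timeout avoids every false positive here, but after a real crash it would also leave the system without a leader for at least 30 seconds before failover begins.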

Because of these complexities, some operations teams prefer to perform failover manually.

Implementation of Replication Logs

How the leader communicates changes to followers:

Statement-Based Replication

The leader logs every write statement (INSERT, UPDATE, DELETE) as text and sends to followers who execute it.

Problems:

Non-deterministic functions: NOW(), RAND(), UUID() produce different results on followers → diverging state.
Auto-increment columns: statement must execute in the same order on all replicas.
Triggers and stored procedures may have side effects that behave differently.
Used by: MySQL (the only binlog format before version 5.1), some NoSQL systems.
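
A minimal sketch of the non-determinism problem, assuming each node has its own random source (modelled here by differently seeded `random.Random` instances). The function names are hypothetical. Re-executing the statement on each node diverges, while shipping the value the leader computed does not.

```python
import random


def apply_statement(db: dict, rng: random.Random) -> None:
    """Statement-based: every replica re-executes `SET token = RAND()`."""
    db["token"] = rng.random()


def apply_row_change(db: dict, change: tuple) -> None:
    """Value-based: every replica stores the value the leader computed."""
    key, value = change
    db[key] = value


leader, follower = {}, {}
# Separate servers do not share random state, so the seeds differ.
leader_rng, follower_rng = random.Random(1), random.Random(2)

apply_statement(leader, leader_rng)
apply_statement(follower, follower_rng)
statement_diverged = leader["token"] != follower["token"]

# Shipping the concrete value keeps the replicas identical.
follower_by_value = {}
apply_row_change(follower_by_value, ("token", leader["token"]))
value_matches = follower_by_value["token"] == leader["token"]
```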

Write-Ahead Log (WAL) Shipping

The leader ships its low-level write-ahead log (WAL) to followers. Followers apply the same physical byte-level changes.

Advantages: Exact byte-level replication; no determinism problems.
Disadvantages: WAL describes disk blocks and byte positions — tightly coupled to storage engine version. Follower must run the same storage engine version. Upgrading the database requires all replicas to be the same version simultaneously — zero-downtime upgrades are difficult.
Used by: PostgreSQL streaming replication.

Logical (Row-Based) Log Replication

A separate logical log format captures row-level changes (which rows were inserted/updated/deleted with their values) decoupled from the physical storage format.

Advantages:

Decoupled from storage engine internals — followers can run different versions.
Can be read by external systems (Change Data Capture, analytics pipelines, auditing).
Human-readable enough for debugging.
Disadvantages: Slightly more overhead than WAL shipping.
Used by: MySQL binlog (row-based format), PostgreSQL logical replication.

Trigger-Based Replication

Application-level replication: database triggers fire on each change and log to a separate table; a custom process reads from that table and replicates to followers.

Advantages: Flexible — can replicate a subset of data, transform data, replicate between different database types.
Disadvantages: High overhead, more error-prone, more moving parts.
Used by: Bucardo (PostgreSQL), Tungsten Replicator (MySQL).

Replication log method comparison:

Method	Determinism	Storage Coupling	CDC-Friendly	Used By
Statement-based	Risky (non-det functions)	None	No	MySQL legacy
WAL shipping	Exact	Tight (version-coupled)	No	PostgreSQL streaming
Logical/row-based	Safe	Loose	Yes	MySQL binlog (row), PG logical
Trigger-based	App-controlled	None	Yes (custom)	Bucardo, Tungsten

Replication Lag

In asynchronous replication, followers may be behind the leader. Reading from a lagging follower may return stale data. This is replication lag — usually milliseconds in a healthy system, but can be seconds to minutes under load or on slow network connections.

Read-After-Write Consistency

Problem: A user submits a profile update (write to leader), then immediately navigates to their profile page (read from a lagging follower). They see their old data and think the update was lost.

Solutions:

Always read from the leader for data the user may have just modified (e.g., the user’s own profile is always read from leader; other users’ profiles can be read from followers).
Track the timestamp of the user’s last write; for some period after a write, route their reads to the leader.
Client tracks the replication log position of its last write; followers only serve this client once they’ve applied up to that position.
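
The third solution (tracking the log position of the client's last write) might look like the following. This is a sketch: `Replica` and `choose_replica` are hypothetical names, and positions are plain integers standing in for a real log position.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_position: int  # highest log position this replica has applied


def choose_replica(last_write_position: int, leader: Replica, followers: list) -> Replica:
    """Pick a follower that has applied the client's last write, else the leader."""
    for follower in followers:
        if follower.applied_position >= last_write_position:
            return follower
    return leader


leader = Replica("leader", applied_position=120)
followers = [Replica("f1", applied_position=95), Replica("f2", applied_position=118)]

# The client's last write landed at position 118: f1 is too far behind, f2 is not.
caught_up = choose_replica(118, leader, followers).name

# A write at position 120 has reached no follower yet, so fall back to the leader.
fallback = choose_replica(120, leader, followers).name

# A client that has never written can read from any follower.
no_writes = choose_replica(0, leader, followers).name
```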

Monotonic Reads

Problem: A user makes two reads to different followers at different lag levels. The second read returns data from a follower that is further behind than the first. The user sees time go “backwards” — a newer-looking query returns older data.

Solution: Route each user to the same replica for all their reads within a session. If that replica fails, re-route and potentially observe a brief backwards jump.
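
One simple way to keep a user on the same replica is to derive the replica from a hash of the user ID. The sketch below assumes a fixed list of replica names; `replica_for_user` is a hypothetical helper, not part of any routing library.

```python
import hashlib


def replica_for_user(user_id: str, replicas: list) -> str:
    """Map a user ID to one replica deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["f1", "f2", "f3"]
first = replica_for_user("user-42", replicas)
second = replica_for_user("user-42", replicas)

# If that replica fails, the user is re-routed to one of the survivors.
remaining = [r for r in replicas if r != first]
rerouted = replica_for_user("user-42", remaining)
```

A built-in `hash()` is avoided here because Python randomizes string hashes per process, which would break stickiness across application servers.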

Consistent Prefix Reads

Problem: In a partitioned (sharded) database, writes to different partitions may be applied in different orders. A reader may observe the answer to a question before the question was asked.

Partition A (writes by Alice):  "What time is it?" written at T=1
Partition B (writes by Bob):    "3pm" written at T=2

Reader sees: "3pm" (from fast partition B) before "What time?" (from slow partition A)
→ Effect appears before cause

Solution: Always write causally related writes to the same partition. Use logical timestamps (Lamport clocks, version vectors) to ensure readers see writes in a causally consistent order.

The Five Consistency Levels (Hierarchy)

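Roughly from strongest to weakest (read-after-write, monotonic reads, and consistent prefix reads are independent guarantees rather than strictly ordered levels):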

Level	Guarantee	Cost
Linearizability (strong)	Single-copy semantics: reads always see most recent write	Highest latency (wait for quorum); unavailable during partitions
Read-after-write	User sees their own writes immediately	Route user’s reads to leader or track write position
Monotonic reads	User never sees time go backwards within a session	Sticky session routing to same replica
Consistent prefix reads	Causality preserved: if A happened before B, readers see A before B	Logical timestamps; causally related writes to same partition
Eventual consistency	Replicas converge eventually if writes stop; no ordering guarantees	Lowest latency; highest availability

Multi-Leader Replication

In single-leader replication, all writes must go through one datacenter — adding latency for geographically distributed users. Multi-leader replication allows writes to be accepted at multiple nodes, each acting as both a leader for local writes and a follower for remote writes.

Use Cases

Multi-datacenter operation:

Multi-Leader Multi-Datacenter:

  Datacenter 1              Datacenter 2
  ┌──────────────┐          ┌──────────────┐
  │    Leader A  │ ←──────→ │    Leader B  │
  │  (accepts    │  async   │  (accepts    │
  │   writes)    │  repl.   │   writes)    │
  └──────┬───────┘          └──────┬───────┘
         │ sync                    │ sync
    ┌────┴────┐               ┌────┴────┐
    │ Follower│               │ Follower│
    └─────────┘               └─────────┘

Benefits:

Lower write latency: each user writes to the nearest datacenter.
Tolerates datacenter outages: other datacenters continue accepting writes.

Offline clients: Mobile and desktop applications where the device itself acts as a leader for locally buffered writes (calendar app, note-taking app). Sync happens when connectivity is restored.

Collaborative editing: Google Docs-style real-time collaboration where each participant’s changes are accepted locally and merged asynchronously.

Write Conflict Handling

The fundamental problem: Two users write to the same record concurrently in different datacenters. Both writes succeed locally. When the replication propagates, the two leaders have conflicting versions.

Leader A: User 1 sets title = "Alice"  (T=1)
Leader B: User 2 sets title = "Bob"    (T=1, concurrent)

When A and B synchronize: CONFLICT — which value wins?

Unlike single-leader replication (where the second write simply overwrites the first on the single leader), multi-leader replication has no single point of truth.

Conflict avoidance (best strategy):

Route all writes for a specific record through the same leader. Eliminates conflicts for that record. Works until the designated leader must change (e.g., due to failure).

Conflict Resolution Strategies

Strategy	How It Works	Pros	Cons
Last Write Wins (LWW)	Each write gets a timestamp; highest timestamp wins	Simple, eventually convergent	Clock skew means earlier-timestamped writes can win over later ones; data loss
Highest replica ID wins	Assign numeric IDs to replicas; higher ID wins conflicts	Simple	Arbitrary: no semantic meaning
Merge values	Store both conflicting values; application resolves	Preserves all data	Application must implement merge logic
Custom conflict resolution logic	Application-defined callback runs on conflict	Full control	Developer burden; must be carefully tested
CRDTs	Data structures designed to be merged deterministically	Mathematically correct; automatic	Limited to specific data types; complex

Last Write Wins (LWW) problems:

Requires synchronized clocks — NTP is not precise enough to resolve millisecond-level concurrent writes.
A write with an “earlier” timestamp (due to clock skew) silently discards a “later” write — data loss.
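LWW is the only built-in conflict resolution method in Cassandra and is the method used by DynamoDB global tables. Fine for append-only data where each key is written once; dangerous for update-in-place.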
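
The data-loss problem can be reproduced in a few lines. This is a sketch with made-up timestamps: the node behind Leader A is assumed to have a clock running 50 ms fast, and `lww_merge` is a hypothetical helper.

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Each version is (timestamp_ms, value); the higher timestamp wins.

    Comparing whole tuples breaks timestamp ties by value, so the result
    does not depend on argument order.
    """
    return max(a, b)


# Real time 1000 ms: User 1 writes "Alice" on a node whose clock is 50 ms fast.
write_alice = (1000 + 50, "Alice")
# Real time 1020 ms: User 2 writes "Bob" on a node with an accurate clock.
write_bob = (1020, "Bob")

winner = lww_merge(write_alice, write_bob)
```

`winner` is the "Alice" write even though "Bob" was written 20 ms later in real time, and the "Bob" write is discarded without any error being reported.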

CRDTs (Conflict-Free Replicated Data Types):

Mathematical data structures that define a merge operation guaranteeing: commutativity (A⊕B = B⊕A), associativity ((A⊕B)⊕C = A⊕(B⊕C)), and idempotency (A⊕A = A).
Examples: G-Counter (grow-only counter), PN-Counter (increment/decrement), OR-Set (observed-remove set), LWW-Register, RGA (Replicated Growable Array for text).
Used in: Riak (Basho), Redis Enterprise Active-Active (CRDT-based counters and other types), Automerge (CRDT-based collaborative document editing).
Limitation: CRDTs only work naturally for specific data types. Arbitrary business logic usually requires custom conflict resolution.
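
As an illustration of the merge properties listed above, here is a minimal state-based G-Counter. It is a teaching sketch rather than a production CRDT library; the class and method names are chosen for this example.

```python
class GCounter:
    """Grow-only counter: one monotonically increasing count per replica."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica_id: str, amount: int = 1) -> None:
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        """Element-wise maximum: commutative, associative, and idempotent."""
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )


# Two replicas count independently while disconnected.
a = GCounter()
a.increment("A", 3)
b = GCounter()
b.increment("B", 2)

merged = a.merge(b)
total = merged.value()
```

Taking the per-replica maximum instead of adding the two states is what makes re-delivered or duplicated merges harmless.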

Operational transformation (OT) is used in Google Docs for collaborative text editing: concurrent operations are transformed against each other so that every replica converges to the same result regardless of the order in which it receives them.

Multi-Leader Replication Topologies

How leaders replicate to each other:

CIRCULAR TOPOLOGY:
  A → B → C → A
  (each node has one upstream and one downstream)
  Problem: a single node failure breaks the ring

STAR TOPOLOGY:
      B
      ↑
  A → C → D
      ↓
      E
  (central node = bottleneck)

ALL-TO-ALL TOPOLOGY:
  A ↔ B
  ↑ ↗
  C
  Every leader sends changes to every other leader
  Problem: writes may arrive out of order (version vectors needed)

All-to-all topology is most resilient but requires careful handling of causal ordering — a write from A that depends on an earlier write from B may arrive at C before B’s write, causing a causal violation. Version vectors (one counter per leader) track causal dependencies and ensure correct ordering.

Leaderless Replication

Popularized by Amazon Dynamo (2007 paper). Used in Cassandra, Riak, Voldemort. No leader: any replica can accept writes. Reads and writes go to multiple replicas simultaneously.

Quorum Reads and Writes

With n replicas:

w = number of replicas that must confirm a write
r = number of replicas that must respond to a read

For consistency guarantee: w + r > n

This ensures that the set of replicas confirming the write and the set responding to the read must overlap — at least one replica in the read set has the latest write.

Example: n=5, w=3, r=3 (w+r=6 > 5=n)
                                  
Write: must reach 3 of 5 replicas  Read: must query 3 of 5 replicas
        ┌─────┐                          ┌─────┐
        │  A  │ ✓ write confirmed        │  A  │ × not queried
        │  B  │ ✓ write confirmed        │  B  │ × not queried
        │  C  │ ✓ write confirmed        │  C  │ ✓ queried (has latest)
        │  D  │ × write didn't reach     │  D  │ ✓ queried (stale)
        │  E  │ × write didn't reach     │  E  │ ✓ queried (stale)
        └─────┘                          └─────┘
Any 3 of the 5 replicas must include at least one of {A,B,C} → the read returns the latest value

Typical configurations:

n=3, w=2, r=2 — tolerates 1 failed node
n=5, w=3, r=3 — tolerates 2 failed nodes
n=5, w=5, r=1 — very durable writes, fast reads (but write availability very low)
n=5, w=1, r=5 — fast writes, slow reads, high write availability
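
For small n, the overlap guarantee can be checked by brute force: enumerate every possible write set and read set and confirm they always share a replica. `quorums_always_overlap` is a hypothetical helper written for this check.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every w-subset of n replicas intersects every r-subset."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# The typical configurations listed above all satisfy w + r > n.
configs = [(3, 2, 2), (5, 3, 3), (5, 5, 1), (5, 1, 5)]
all_overlap = all(quorums_always_overlap(n, w, r) for n, w, r in configs)

# With w + r = n, a read set can miss every replica that took the write.
weak_overlap = quorums_always_overlap(5, 2, 3)
```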

Adjusting for use cases:

Read-heavy workload: lower r, higher w
Write-heavy: lower w, higher r
High availability over consistency: low w, low r (w+r may be ≤ n, giving eventual consistency only)

Sloppy Quorums and Hinted Handoff

Sloppy quorum: When the usual quorum nodes are unreachable (e.g., due to network partition), allow writes to land temporarily on reachable nodes that are not among the n designated “home” nodes for that key. The write is still accepted on w nodes, but they may not be the “correct” n nodes for that key.

Hinted handoff: A temporary node stores the write together with a hint naming the intended home node, and forwards the write to that home node once it is reachable again.

Trade-off: Increases write availability (writes succeed even during partition) but weakens durability guarantee — if the temporary nodes fail before handing off, the write is lost.

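Used by: Riak (sloppy quorums enabled by default). Cassandra supports hinted handoff, but stored hints do not count toward the requested consistency level (except ANY). Both ideas come from the original Dynamo design; the managed DynamoDB service is a different system from Dynamo.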

Detecting Concurrent Writes

In leaderless (and multi-leader) systems, multiple clients may write to the same key concurrently. The system must detect which writes are concurrent and which are causally ordered.

Version numbers (single replica):

Each write to a key increments a version counter.
When reading, the client gets all current values and version numbers.
When writing, the client sends the version number it read (to indicate “I’m based on this version”) plus the new value.
The server can then tell if the new write is based on a superseded version (concurrent) or the latest version (causally follows).
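
The steps above can be sketched for a single key on a single replica. `VersionedKey` is a hypothetical class for illustration; values written concurrently are kept side by side as siblings until a later write, based on a version that has seen them, replaces them.

```python
class VersionedKey:
    """Server-side state for one key on one replica."""

    def __init__(self):
        self.version = 0
        self.siblings = {}  # version number -> value written at that version

    def read(self):
        """Return the latest version number and every value not yet overwritten."""
        return self.version, list(self.siblings.values())

    def write(self, value, based_on: int) -> int:
        """Store the value, replacing only the values the writer had already seen."""
        self.version += 1
        # Versions above `based_on` were unknown to the writer: they are concurrent.
        self.siblings = {v: val for v, val in self.siblings.items() if v > based_on}
        self.siblings[self.version] = value
        return self.version


cart = VersionedKey()
v1 = cart.write({"milk"}, based_on=0)            # client 1
v2 = cart.write({"eggs"}, based_on=0)            # client 2, concurrent with client 1
v3 = cart.write({"milk", "flour"}, based_on=v1)  # client 1 replaces its own value

version, values = cart.read()

# A client that has read both siblings merges them and writes the union back.
cart.write(set().union(*values), based_on=version)
final_version, final_values = cart.read()
```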

Version vectors (multiple replicas):

Each replica maintains its own version counter per key.
A version vector is the collection of version numbers from all replicas (closely related to, but not the same as, a vector clock).
Enables detection of concurrent writes across replicas.
If version vector A dominates B (all A[i] ≥ B[i], with at least one A[i] > B[i]), A happened after B.
If neither dominates (some A[i] > B[i], some B[i] > A[i]), they are concurrent.
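
The dominance rule translates directly into a comparison function. In this sketch a version vector is a plain dict from replica ID to counter, with missing entries treated as 0; `compare` is a hypothetical helper name.

```python
def compare(a: dict, b: dict) -> str:
    """Classify version vector a relative to b."""
    keys = a.keys() | b.keys()
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"  # neither dominates: keep both or merge
    if a_ahead:
        return "after"       # a dominates b: a can safely overwrite b
    if b_ahead:
        return "before"      # b dominates a
    return "equal"


dominates = compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1})
conflict = compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3})
missing_entry = compare({"r1": 1}, {"r1": 1, "r2": 1})
same = compare({"r1": 1}, {"r1": 1})
```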

Merging concurrent writes:

For commutative operations (e.g., adding to a set): merge is straightforward (union).
For non-commutative (e.g., overwriting a value): must either use LWW (lossy), CRDT, or application-level merge.

Change Data Capture (CDC)

CDC is the process of capturing every change to a database as a stream of events. It is the logical/row-based replication mechanism turned into a general-purpose data integration tool.

How it works (PostgreSQL example):

PostgreSQL writes every change to the WAL (Write-Ahead Log).
A CDC tool (Debezium) connects to PostgreSQL as a logical replication client and receives the decoded WAL changes.
Debezium transforms WAL entries into structured events and publishes to Kafka.
Downstream consumers (search index, data warehouse, cache) consume the Kafka events and update their derived datasets.
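
The last step, a consumer updating a derived dataset, can be sketched as follows. The event shape is loosely modelled on Debezium's change-event envelope (`op`, `before`, `after`) but heavily simplified, and a plain dict stands in for the search index or cache. `apply_change` and the sample rows are assumptions for illustration; no Kafka client is involved.

```python
def apply_change(index: dict, event: dict) -> None:
    """Apply one change event to a derived key-value store.

    op codes: "c" = create, "u" = update, "r" = snapshot read, "d" = delete.
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        index[row["id"]] = row
    elif op == "d":
        index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]

search_index = {}
for event in events:
    apply_change(search_index, event)

# Replaying the same events leaves the index unchanged.
replayed = dict(search_index)
for event in events:
    apply_change(replayed, event)
```

Because each event carries the full row state, replaying the stream from the start is harmless, which matters when the log delivers an event more than once.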

CDC Architecture:

┌────────────┐  WAL   ┌────────────┐ events ┌───────────┐
│ PostgreSQL │───────→│  Debezium  │───────→│   Kafka   │──→ Elasticsearch (search index)
│  (source)  │ stream │ (CDC tool) │        │   topic   │──→ Data Warehouse
└────────────┘        └────────────┘        │  (events  │──→ Redis Cache
                                            │ retained) │
                                            └───────────┘

CDC use cases:

Keeping derived data systems (search indexes, caches, analytics databases) in sync with the source of truth.
Building event sourcing systems on top of existing relational databases (the DB is the system of record; events are derived).
Audit logging: every change captured with full before/after state.
Microservice data synchronization without tight coupling.

CDC tools: Debezium (open source, Kafka-based), AWS Database Migration Service (DMS), Fivetran, Airbyte.

Comparison Tables

Replication Topology Comparison

SINGLE-LEADER:                    MULTI-LEADER:                LEADERLESS:
     Leader                        Leader A ←→ Leader B         Any replica
     /     \                        /    \       /    \          accepts writes
  Follower Follower             Fol   Fol   Fol   Fol         ┌─────────────┐
                                                               │ w replicas  │
Writes: leader only              Writes: any leader           │ confirm write│
Reads: any replica               Reads: any replica           └─────────────┘
Conflicts: impossible            Conflicts: possible          Reads: r replicas
Availability: leader SPOF        Availability: high           Conflicts: possible

Full Replication Comparison

Dimension	Single-Leader	Multi-Leader	Leaderless
Write location	Leader only	Any leader	Any replica
Read location	Any replica	Any replica	Any replica (r replicas)
Write conflicts	Impossible (single ordering)	Possible	Possible
Conflict resolution	N/A	Required	Required
Geographic distribution	One “home” region for writes	Multi-region writes	Multi-region native
Write availability	Reduced if leader down	High (other leaders)	High (w replicas)
Consistency model	Strong (sync) to eventual (async)	Eventual	Tunable (w+r>n)
Failover complexity	High	Medium	None needed
Examples	PostgreSQL, MySQL, MongoDB	CouchDB, multi-DC MySQL	Cassandra, Riak, Voldemort

Important Points Summary

Replication solves availability, latency, and read scalability — but introduces the hard problem of keeping replicas consistent under concurrent writes and network failures.
Synchronous replication guarantees durability but kills availability: blocked if any sync replica is slow. Semi-synchronous (one sync follower) is the practical compromise.
Leader failover is the hardest part of single-leader replication: lost writes, split brain, and timeout misconfiguration are all common production failure modes.
WAL shipping is exact but storage-engine-coupled: makes zero-downtime upgrades difficult. Logical/row-based replication is more flexible and enables CDC.
Replication lag creates consistency anomalies: read-after-write, monotonic reads, and consistent prefix reads are three named guarantees that rule out the most common ones, each with known implementation techniques.
Multi-leader replication enables multi-datacenter writes but creates write conflicts — requiring conflict resolution strategies (LWW, CRDTs, application merge).
CRDTs provide mathematically provable merge semantics but only for specific data types. Arbitrary data requires application-level logic.
Quorum formula w + r > n guarantees overlap: at least one replica in the read set has seen the latest write — but only when nodes and network are healthy. Sloppy quorums weaken this.
Version vectors detect concurrent writes across multiple replicas by tracking each replica’s change count per key.
CDC transforms replication logs into integration streams: enables keeping search indexes, data warehouses, and caches in sync without tight coupling to the source database.

Modern Context (2026)

Cloud-native replication:

Amazon Aurora uses a novel shared-storage replication model: all replicas share the same distributed storage layer, with only redo log records shipped (not full pages). This reduces replication lag to typically under 100ms across 15 replicas.
CockroachDB and YugabyteDB use Raft consensus per shard for synchronous multi-region replication — achieving strong consistency across datacenters at the cost of cross-region write latency.
PlanetScale (Vitess-based) uses MySQL single-leader replication per shard with automated horizontal scaling and zero-downtime schema migrations via shadow tables.

Raft becoming standard:

Raft consensus algorithm (Ongaro & Ousterhout, 2014) has largely replaced Paxos in new implementations due to its understandability and completeness.
Used in: etcd (Kubernetes), CockroachDB, TiKV, YugabyteDB, Consul.
Raft provides strong consistency with automatic leader election: a write is committed once a majority of replicas have persisted it, so the cluster tolerates the loss of a minority of nodes.

CDC and data integration:

Debezium (Red Hat) has become the standard open-source CDC tool, supporting PostgreSQL, MySQL, MongoDB, Cassandra, SQL Server, and more.
CDC-to-Kafka pipelines are now standard practice for building data meshes and keeping microservice data in sync.
Materialize and RisingWave are streaming databases that incrementally maintain SQL views over CDC streams — essentially turning replication into a general query computation engine.

Conflict resolution maturity:

Automerge and Yjs are widely used open-source CRDT libraries for collaborative document editing and local-first applications.
ElectricSQL uses CRDT-based conflict resolution for offline-first PostgreSQL applications, enabling local-first data with eventual sync.
The academic field of convergent replicated data types has matured significantly; many newer collaborative tools use CRDTs, while others (such as Google Docs) continue to use operational transformation.

Globally distributed databases:

Google Spanner uses TrueTime (GPS + atomic clocks) to provide linearizability across global datacenters — accepting slightly higher latency in exchange for strong consistency guarantees at planetary scale.
Amazon DynamoDB Global Tables uses multi-leader replication with LWW conflict resolution across regions — prioritizing availability over consistency.
The choice between Spanner-style (strong) and DynamoDB-style (eventual) global replication is now a standard architectural decision for global applications.

Questions for Reflection

A user submits a tweet. It is written to the leader. One second later, they refresh their feed and hit a follower that is 2 seconds behind. They see the old feed without their new tweet. Which consistency anomaly is this? What solution would you apply?
A database uses asynchronous single-leader replication. The leader fails and a failover occurs. The new leader was 5 seconds behind. The client received a “success” acknowledgment for 3 writes that are now lost. How could semi-synchronous replication have prevented this? What are the trade-offs?
You are designing a global e-commerce platform where users in Europe and Asia must both be able to place orders. Orders must never be lost. Would you use single-leader, multi-leader, or leaderless replication for the orders table? Justify your choice.
Explain why LWW (Last Write Wins) is problematic for a shopping cart in a leaderless system. What alternative would you use?
Two developers are collaborating in a Google Docs-like editor. Both make edits simultaneously in different parts of the document. What replication mechanism enables this, and why can’t you simply use LWW or standard database replication?
A company wants to keep their PostgreSQL database, Elasticsearch search index, and Redis cache in sync. Explain how a CDC pipeline using Debezium and Kafka achieves this without tight coupling between services.

ch07-sharding — Sharding interacts with replication: each shard is replicated independently; secondary index sharding adds complexity
ch08-transactions — Distributed transactions across replicated nodes; 2PC and its interaction with replication
ch10-consistency-and-consensus — Linearizability, Raft, and Paxos — the theoretical foundation for strong consistency in replication
ch12-stream-processing — CDC streams as input to stream processing; Kafka as the backbone of CDC pipelines
ch05-replication — 1st edition Ch5 replication coverage for comparison

Last Updated: 2026-05-29

Study Notes by Niladri & AI
