Chapter 6 Flashcards — Replication

flashcards ddia-2e chapter6 replication consistency distributed-systems

Definitions and Mechanisms

What are the three reasons to replicate data, and what fundamental problem does replication introduce?
?
Three reasons to replicate:

Latency: Keep data geographically close to users to reduce read latency.
Availability / fault tolerance: Continue serving requests even if some nodes fail.
Read scalability: Spread read load across multiple replicas.

Fundamental problem introduced: Keeping replicas consistent. Every write must propagate to all replicas, but networks are unreliable and replication is not instantaneous. During the propagation window, different replicas have different data — this is replication lag — and readers may see stale or inconsistent data.
The core tension: waiting for all replicas before acknowledging a write maximizes consistency but reduces availability (blocked if any replica is slow/offline). Acknowledging immediately maximizes availability but risks data loss if the leader fails before replicating.

How does single-leader replication work, and what databases use it?
?
Architecture:

One replica is the leader (primary/master). All write requests must go to the leader.
All other replicas are followers (replicas/read replicas). They subscribe to the leader’s replication log and apply changes in the same order.
Read requests can go to any replica — the leader or any follower.

Replication flow: Leader receives write → applies locally → logs the change → sends log entry to followers → followers apply in order.

Used by: PostgreSQL (streaming replication), MySQL (binary log replication), Oracle Data Guard, MongoDB (replica set), Kafka (partition leader/follower), Elasticsearch.

Key limitation: Write throughput is bounded by the single leader. All writes must serialize through one node.

What is the difference between synchronous and asynchronous replication? What are the trade-offs?
?
Synchronous replication: The leader waits for the follower(s) to confirm the write was applied before returning “success” to the client.

Pro: Strong durability guarantee — at least two up-to-date copies exist.
Con: Write latency increases (must wait for slowest sync replica). Write availability decreases — if any sync replica is slow or offline, the leader is blocked.

Asynchronous replication: The leader returns “success” immediately after writing locally, without waiting for followers.

Pro: Low write latency; writes always succeed even if all followers are offline.
Con: Weak durability — if the leader fails before replicating, those writes are lost even though the client got “success.”

Semi-synchronous (practical middle ground): One follower is synchronous; all others are asynchronous. Guarantees at least two copies (leader + one sync follower) with limited latency impact.

PostgreSQL implementation: synchronous_standby_names = 'follower1'

What are the steps to add a new follower to a single-leader system without downtime?
?

Take a consistent snapshot of the leader’s data at a specific point in time (using the database’s snapshot mechanism, e.g., pg_basebackup for PostgreSQL, mysqldump --single-transaction for MySQL).
Copy the snapshot to the new follower node.
The follower connects to the leader and requests all changes since the snapshot’s replication log position (PostgreSQL: WAL LSN, MySQL: binlog filename + offset embedded in snapshot).
The follower processes the backlog of changes to catch up to the present.
Once caught up and applying changes in real-time, the follower is operational.

Key insight: The snapshot has an associated log position — this is what makes online add-follower possible without locking the database.

What are the failure modes during leader failover, and how does split-brain occur?
?
Three major failover failure modes:

Lost writes: The new leader may not have received all writes from the old leader. After failover, those un-replicated writes are typically discarded — the client received “success” but the data is gone. This violates durability guarantees.
Split brain: Both the old and new leader believe they are the current leader and accept writes simultaneously. Since there’s no coordination between them, conflicting writes are applied — leading to data corruption or inconsistency. Mitigation: STONITH (Shoot The Other Node In The Head) — a fencing mechanism that forces the old leader offline (e.g., via IPMI power off, cloud instance shutdown, revoke storage access) before the new leader accepts writes.
Timeout misconfiguration: Too short → false positives, unnecessary failovers under load. Too long → extended outage window. There is no universally correct timeout.

Conclusion: Failover is sufficiently complex that many teams prefer manual failover to automated failover for critical databases.

Replication Log Methods

What are the four replication log implementation methods, and which is best for CDC?
?

Statement-based replication: Leader logs SQL statements (INSERT/UPDATE/DELETE) and followers re-execute them.
- Problem: non-deterministic functions (NOW(), RAND(), UUID()) produce different results on followers → replica divergence.
WAL (Write-Ahead Log) shipping: Leader ships raw WAL bytes; followers apply byte-level changes.
- Pro: Exact replication, no determinism issues.
- Con: Tightly coupled to storage engine version — replicas must run identical software versions; zero-downtime upgrades are hard.
Logical (row-based) replication: A separate logical log captures row-level changes (which rows changed and their values), decoupled from physical storage.
- Pro: Followers can run different software versions; human-readable enough for debugging; best for CDC.
- Used by: MySQL binlog (row format), PostgreSQL logical replication.
Trigger-based replication: DB triggers fire on changes and write to a replication table; a custom process replicates.
- Pro: Very flexible (replicate subset, transform, different DB type).
- Con: High overhead, more complexity.

Best for CDC: Logical/row-based — tools like Debezium read this format and publish to Kafka.

Replication Lag and Consistency

What is the “read-after-write” consistency problem and how do you solve it?
?
Problem: A user submits a write (e.g., posts a comment) to the leader. They immediately refresh the page, but their read hits a lagging follower that hasn’t received the write yet. The user sees their old data and concludes their action failed — even though the write succeeded.

Solutions:

Always read the user’s own data from the leader: The user’s profile page is always served from leader; other users’ profiles can use followers.
Timestamp-based routing: After any write, for the next N seconds, route this user’s reads to the leader.
Replication log position tracking: Client tracks the log position (LSN) of its last write. Followers only serve this client once they’ve applied up to that position.
Cross-device: If the user may write on one device and read on another, the log position must be stored server-side (per user), not just in the client.

What is the “monotonic reads” problem and how do you solve it?
?
Problem: A user makes two reads in sequence. Read 1 goes to Follower A (slightly lagged). Read 2 goes to Follower B (more lagged than A). The user sees a state from Read 2 that appears to be earlier than what they saw in Read 1 — time appears to go backwards. Example: a social feed where a comment disappears between refreshes because the second read hit a more-lagged follower.

Solution: Sticky session routing — route each user’s reads to the same replica within a session. The user always sees the same (or later) version of the data. If that replica fails, re-route and accept a brief potential backwards jump.

Not the same as read-after-write: Monotonic reads is about ordering across multiple reads; read-after-write is specifically about a user seeing their own writes.

What is the “consistent prefix reads” problem and when does it occur?
?
Problem: In a partitioned (sharded) database, writes to different partitions are replicated independently. A reader may observe a more recent write on one partition while seeing a stale state on another — making effects appear before their causes.

Classic example:

Alice writes “What time is it?” to Partition A (T=1).
Bob writes “It is 3pm” to Partition B (T=2, in response to Alice).
A reader sees Bob’s answer from fast Partition B before Alice’s question from slow Partition A → reads “3pm” with no question visible.

Solution: Ensure causally related writes go to the same partition. Use logical timestamps (Lamport clocks, vector clocks) to track causality and ensure readers see writes in causal order before applying them.

What are the five consistency levels from weakest to strongest?
?

Level	Guarantee	How to Achieve
Eventual consistency	Replicas converge eventually if writes stop; no ordering guarantees	Default in async replication
Consistent prefix reads	Causality preserved: if A happened before B, readers always see A before B	Logical clocks; causal writes to same partition
Monotonic reads	Within a session, user never sees time go backwards	Sticky replica routing per user
Read-after-write	User always sees their own most recent writes	Leader reads for own data; write LSN tracking
Linearizability (strong)	Single-copy illusion: reads always return the most recently written value, globally	Synchronous quorum; Raft/Paxos consensus

Key insight: These are not binary (consistent vs not) — they form a spectrum. Most systems provide some guarantees weaker than linearizability but stronger than pure eventual consistency.

Multi-Leader Replication

What are the three main use cases for multi-leader replication?
?

Multi-datacenter operation: Each datacenter has its own leader. Users write to their local datacenter’s leader (low latency). Changes are replicated asynchronously to other datacenters’ leaders. Tolerates entire datacenter outages.
- Example: A global e-commerce platform with leaders in US-East and EU-West.
Offline clients: Mobile/desktop applications where the device acts as a “leader” for locally buffered writes. Sync happens when connectivity is restored.
- Example: Google Calendar — changes made offline are synced when back online. If two devices modified the same event offline, that’s a multi-leader conflict.
Collaborative editing: Multiple users edit the same document simultaneously. Each participant’s local copy acts as a leader; changes are merged in real-time.
- Example: Google Docs — uses Operational Transform (OT) or CRDTs for conflict resolution.

Trade-off: All these use cases gain write availability and low latency at the cost of write conflict complexity.

What are the main conflict resolution strategies for multi-leader replication, with their pros and cons?
?

Strategy	Mechanism	Pros	Cons
Conflict avoidance	Route all writes for a key to one leader	No conflicts; simple code	Fails during leader change; limits geographic distribution
LWW (Last Write Wins)	Each write timestamped; highest wins	Simple; eventually converges	Data loss; clock skew means wrong winner; not safe for important data
CRDT	Mathematically defined merge (commutative, associative, idempotent)	Automatic; provably correct; no data loss	Only works for specific types (counters, sets, text with OT)
Merge values	Keep all concurrent values; application merges on read	No automatic data loss	Application must implement merge logic
Custom conflict handler	Application callback on detected conflict	Full semantic control	Developer burden; must handle all edge cases

LWW problems specifically: Clock skew (NTP is not precise enough to resolve millisecond-level concurrent writes); a write with an “earlier” clock timestamp silently discards a “later” write. Cassandra uses LWW by default — acceptable for time-series but dangerous for account balances or inventory.

What are CRDTs, and give three concrete examples?
?
CRDT (Conflict-Free Replicated Data Type): A data structure with a merge operation that is:

Commutative: A merge B = B merge A (order of merging doesn’t matter)
Associative: (A merge B) merge C = A merge (B merge C) (grouping doesn’t matter)
Idempotent: A merge A = A (merging with yourself changes nothing)

These properties guarantee that any two replicas that have seen the same set of updates will converge to the same value, regardless of the order in which updates were applied.

Concrete examples:

G-Counter (grow-only counter): Each replica maintains its own counter; the merged value is the max of each replica’s counter. Used for: page views, like counts.
OR-Set (observed-remove set): Supports add and remove. Each element tagged with a unique ID; removals remove specific tagged instances, not all copies. Used for: shopping carts, collaborative to-do lists.
RGA (Replicated Growable Array): CRDT for sequences/text; each character has a unique position. Concurrent insertions are resolved deterministically. Used for: collaborative document editors (Automerge, Y.js).

Leaderless Replication

How does leaderless replication with quorums work? Explain with n=5, w=3, r=3.
?
Architecture: No designated leader. Clients write to multiple replicas in parallel; reads also query multiple replicas.

Consistency guarantee: w + r > n

With n=5, w=3, r=3:

A write must reach at least 3 of 5 replicas to be acknowledged.
A read must query at least 3 of 5 replicas.
w + r = 6 > n = 5 → the write set and read set must overlap by at least 1 replica.
The overlapping replica has seen the write → returns the latest version.

Why it works: If write reached replicas {A, B, C} and read queries {B, C, D}, then B and C are in both sets — at least one of them returns the latest value. The client takes the most recent version (by version number or timestamp).

Fault tolerance: Tolerates floor((n-1)/2) simultaneous failures. With n=5, tolerates 2 failures.

When guarantee breaks: Network partition that routes write to {A, B, C} and read to {C, D, E} — only C overlaps. If C is slow, the read may miss the latest write (returned by fast D and E with stale values).

What are sloppy quorums and hinted handoff?
?
Sloppy quorum: During a network partition or node failures, the “home” nodes for a key may be unreachable. A sloppy quorum allows the write to land on w any available nodes — not necessarily the nodes that normally own that key.

Hinted handoff: The temporary node that accepted the write on behalf of the home node stores the data with a “hint” indicating which node it belongs to. When the home node recovers and becomes reachable, the temporary node forwards (“hands off”) the data.

Trade-off:

Pro: Higher write availability — writes succeed even if home nodes are down.
Con: Weaker durability — if the temporary node fails before handing off, the write is lost. The usual quorum guarantee (w + r > n) no longer holds during the partition.

Used by: Cassandra (optional per table), DynamoDB (always uses sloppy quorums with hinted handoff).

How do version vectors detect concurrent writes in leaderless systems?
?
Problem: Multiple clients write to the same key on different replicas simultaneously. The system must distinguish: did write B happen after write A (supersedes it), or were they concurrent (both must be kept)?

Single-replica version numbers: Each write increments a version number. A client reads value + version number. When writing, it sends the version number it read. Server can tell: new write based on current version (causal follow) vs old version (concurrent).

Version vectors (multi-replica): A version vector is a set of version numbers, one per replica: {A:3, B:5, C:2}. Each replica increments its own counter per write.

If V1 dominates V2 (every V1[i] ≥ V2[i]): V1 happened after V2.
If neither dominates (some V1[i] > V2[i], some V2[i] > V1[i]): they are concurrent → conflict.

Concurrent write handling: For concurrent writes, either: (a) keep both values and return to the application to merge (Amazon Dynamo’s approach), (b) apply a CRDT merge operation, or (c) apply LWW (lossy).

Change Data Capture

What is Change Data Capture (CDC) and what are its main use cases?
?
CDC is the process of capturing every change made to a database (inserts, updates, deletes) as a stream of events, making those changes available to downstream consumers.

How it works: CDC tools (Debezium) connect to the database’s replication stream (PostgreSQL logical replication slot, MySQL binlog) and transform low-level replication entries into structured change events, typically published to Kafka.

Main use cases:

Search index sync: Keep Elasticsearch in sync with PostgreSQL without polling.
Cache invalidation: Update Redis cache entries exactly when the source record changes.
Data warehouse ETL: Stream OLTP changes to OLAP systems in near real-time.
Microservice data sync: Service B maintains a local copy of Service A’s data via CDC, without tight coupling or distributed transactions.
Audit logging: Every change captured with full before/after state.
Event sourcing layer: Add event sourcing semantics to a relational database without changing the application.

Key property: At-least-once delivery; CDC tools typically preserve transaction boundaries and schema change events.

Comparison Questions

Single-leader vs multi-leader vs leaderless — when would you choose each?
?

Scenario	Best Choice	Why
Standard web application, single region	Single-leader	Simple, strong consistency, no conflicts
Multi-datacenter with low write latency globally	Multi-leader	Each DC accepts writes locally; cross-DC async
High write availability during partitions	Leaderless	Sloppy quorums; no single point of write failure
Offline-capable mobile app	Multi-leader	Device is its own leader; sync on reconnect
Collaborative document editing	Multi-leader + CRDT	Concurrent writes on same document; CRDT merges
High read throughput, tolerate stale reads	Single-leader (async followers)	Many read replicas; low write overhead
Strong consistency required (banking)	Single-leader (sync) or Raft	Single serialization point or consensus

Key insight: Single-leader is the default choice — use multi-leader or leaderless only when specific requirements (multi-region writes, offline-first, extreme availability) justify the added conflict complexity.

Modern Context (2026)

How does Amazon Aurora’s replication differ from traditional single-leader replication?
?
Traditional single-leader: Leader writes data to local disk AND sends data pages or WAL to followers. Followers apply the changes to their own local disk copies. Each replica has its own complete copy of the data.

Aurora’s shared-storage model:

All replicas (up to 1 writer + 15 read replicas) share the same distributed storage layer.
The storage layer is itself replicated across 6 copies in 3 Availability Zones.
The leader only ships redo log records (not full pages) to the storage layer and to read replicas.
Read replicas apply redo log records to reconstruct the page from shared storage — much less data transmitted.

Result:

Replication lag typically under 100ms across all 15 replicas (vs seconds for traditional MySQL async replication).
Failover is fast (~30 seconds vs minutes) because replicas already have the data in shared storage.
No storage duplication overhead per replica.

Why has Raft replaced Paxos as the dominant consensus algorithm in new systems?
?
Paxos (Lamport, 1989) was the theoretical foundation for consensus in distributed systems, but it’s notoriously difficult to understand and implement correctly.

Raft (Ongaro & Ousterhout, 2014) was designed explicitly for understandability. It decomposes consensus into three relatively independent subproblems:

Leader election: Candidates start election if no heartbeat received.
Log replication: Leader appends entries to its log; replicates to followers; commits once majority confirms.
Safety: Strict rules about which candidates can become leader (must have most up-to-date log).

Why Raft won in practice:

Simpler to implement correctly (proven by the number of correct implementations).
More complete specification — Paxos left many practical details underspecified.
Better tooling and documentation.

Adopted by: etcd (Kubernetes), CockroachDB, TiKV, YugabyteDB, FoundationDB, Consul, Kafka (KRaft mode replacing ZooKeeper).

Numbers and Precision

What is the typical replication lag in healthy systems, and when does it become a problem?
?
Typical replication lag values:

Same datacenter: <1 millisecond to ~10ms under normal load
Cross-datacenter, same region: 10-50ms (limited by network distance)
Cross-continental: 50-150ms (limited by speed of light + network hops)

When lag becomes a problem:

Exceeds seconds: Under load, follower falls behind; read-after-write anomalies become visible to users.
Exceeds minutes: Follower is significantly behind; if leader fails, significant data loss. May indicate follower is overloaded or network is degraded.
Exceeds hours: Follower is probably misconfigured or in a loop; requires investigation.

Monitoring: Track seconds_behind_master (MySQL) or pg_stat_replication.write_lag (PostgreSQL). Alert if lag exceeds 30 seconds in production.

Recovery: If a follower falls far behind, it may need to be rebuilt from a fresh snapshot rather than catching up via log — the log retention window may have expired.

What common quorum configurations are used in production, and what do they tolerate?
?

Configuration	Failures Tolerated	Write Availability	Read Availability	Common Use
n=3, w=2, r=2	1 node	Medium	Medium	Standard Cassandra LOCAL_QUORUM
n=5, w=3, r=3	2 nodes	Medium	Medium	Critical data, balanced
n=3, w=3, r=1	0 nodes (all needed)	Low	High	High read throughput, max durability
n=3, w=1, r=3	0 nodes for reads	High	Low	High write throughput
n=5, w=1, r=1	— (no consistency)	Maximum	Maximum	Eventual consistency only

Rule of thumb: For production workloads requiring consistency, use w = quorum, r = quorum where quorum = floor(n/2) + 1. For n=3: quorum=2; for n=5: quorum=3.

Total Cards: 26
Review Time: ~25-30 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch06-flashcards