Chapter 9 Flashcards - Consistency and Consensus

flashcards chapter-9 ddia

Basic Concepts

What is linearizability and why is it the strongest consistency model?
?

Linearizability: Operations on a distributed system appear to execute atomically, at a single point in time, as if there’s only one copy of the data
Recency guarantee: After a write completes, every subsequent read by any client anywhere must return the written value (or a later write)
Total order: All operations have a global order consistent with real time
Strongest because: It makes a distributed system indistinguishable from a single-node system for any observer
Cost: Requires coordination (consensus) on every operation → high latency
Where needed: Distributed locks, leader election, unique constraint enforcement
Where NOT needed: Social media feeds, caches, approximate analytics — eventual/causal consistency suffices

What is the difference between linearizability and serializability?
?

Linearizability: Recency guarantee on individual objects
- After write completes, all reads see the new value
- Requires real-time ordering
- Single-object property
Serializability: Isolation guarantee for transactions (multi-operation)
- Transactions appear to have executed in some serial order
- Any serial order is acceptable — even “in the past” (no recency requirement)
- Multi-object property (transactions touch multiple rows)
Example difference: Serializability allows a transaction that reads before a write to appear to have run in a valid serial order, even if the real-time sequence differs. Linearizability doesn’t allow this — if write completed before your read, you must see it.
Strict Serializability = both properties = linearizable + serializable = strongest possible

CAP and PACELC

What does the CAP theorem say and what does it mean in practice?
?

CAP (Brewer 2000): In the presence of a network Partition, a distributed system cannot guarantee both Consistency (linearizability) and Availability (every request gets a response)
Practical meaning: Network partitions happen — you can’t opt out. So the real choice is:
- CP: During partition, refuse requests (no stale reads), wait for partition to heal
  - Examples: ZooKeeper, etcd, HBase, Spanner
- AP: During partition, serve stale data or accept writes (may conflict)
  - Examples: Cassandra, DynamoDB (multi-region), CouchDB
Common misunderstandings:
1. “C” in CAP = linearizability, NOT the C in ACID
2. “Availability” in CAP = every non-failing node responds, not just “system is up”
3. You can have both C and A during normal operation (no partition)

What is the PACELC theorem and why is it more useful than CAP?
?

PACELC (Abadi 2012): Extends CAP with a latency dimension
- During Partition: choose Availability vs Consistency (same as CAP)
- Else (no partition): choose Latency vs Consistency
Why more useful: Even when there’s no network partition, strong consistency requires coordination → higher latency. This trade-off exists always, not just during failures.
System labels (PA/EL, PC/EC, etc.):
- PA/EL: Cassandra — available during partition; low latency normally
- PC/EC: Spanner/etcd — consistent during partition; consistent (higher latency) normally
- PA/EC: Amazon DynamoDB — available during partition; consistent and higher latency normally
Interview tip: PACELC is more nuanced than CAP; shows you understand the ongoing latency cost of consistency, not just the partition behavior

Causal Consistency

What is causal consistency and how does it differ from linearizability and eventual consistency?
?

Causal consistency: If event A causally precedes event B (A happened-before B), then all nodes see A before B. Causally concurrent events may appear in any order.
Stronger than eventual consistency: Eventual has no ordering guarantees at all during lag; causal preserves happened-before relationships
Weaker than linearizability: Linearizability requires total real-time order; causal only requires causal order (concurrent events can differ)
Example:
- User posts question; then posts answer — anyone who sees the answer must also see the question (causal order)
- Two users post different comments simultaneously — concurrent, can be seen in any order
Key advantage: Can be achieved without consensus/coordination; doesn’t sacrifice availability during network partitions
Used in: CouchDB (MVCC with vector clocks), collaborative editing tools

Consensus

What is consensus and what properties must a consensus algorithm satisfy?
?
Consensus: An algorithm that allows multiple nodes to agree on a single value (or a sequence of values)

Three required properties:

Uniform agreement: No two nodes decide on different values
Integrity: No node decides twice (a decision is final)
Validity: If a node decides value v, then v was proposed by some node (can’t invent values)
Termination: Every non-faulty node eventually decides (progress property)

Why it’s hard: FLP impossibility says termination is impossible in a fully asynchronous system if any node can fail. Real algorithms work in practice by making timing assumptions (timeouts).

What consensus enables: Leader election, atomic broadcast (total order broadcast), atomic commit across nodes

How does the Raft consensus algorithm work?
?
Three roles: Follower, Candidate, Leader

Normal operation:

Leader receives all writes; appends to its log
Leader sends AppendEntries RPC to all followers
Followers acknowledge receipt
Once majority acknowledge: entry is committed
Leader tells followers it’s committed (next heartbeat)

Leader election:

Follower times out (no heartbeat from leader) → becomes Candidate
Candidate increments term, votes for itself, sends RequestVote to all nodes
Each node votes for at most one candidate per term (first valid request wins)
Candidate with majority votes → becomes leader for that term
New leader has most up-to-date log (guaranteed: only nodes with complete log can win)

Safety: Committed entries never overwritten (majority + term guarantees); at-most-once decision per term

What is total order broadcast and how does it relate to consensus?
?

Total order broadcast: Every node receives every message in the same order; no message is ever skipped
Properties:
1. Reliable delivery: No message lost (eventually delivered to all non-faulty nodes)
2. Total ordering: All nodes deliver messages in the same order
Equivalence to consensus:
- TOB from Consensus: Use one consensus round per message; position in order = consensus decision
- Consensus from TOB: Broadcast your proposal; first proposal in total order = consensus decision
- They are computationally equivalent problems
Why matters: Consensus is the fundamental primitive; TOB is how you use it:
- ZooKeeper (Zab protocol) implements TOB → applications use ZooKeeper for coordination
- Raft log replication IS total order broadcast (all nodes agree on log order)

Distributed Coordination

What does ZooKeeper provide and what problems does it solve?
?

ZooKeeper: Distributed coordination service using Zab (similar to Raft) for consensus
Data model: Hierarchical namespace (znodes), similar to a filesystem

Key primitive types:

Ephemeral znodes: Deleted when client session ends → used for locks, presence detection
Sequential znodes: Auto-numbered suffix → used for fencing tokens (monotonically increasing), queue ordering
Watches: Callbacks when node changes → used for detecting leader changes, config updates

What applications use ZooKeeper for:

Leader election: Create ephemeral sequential node; lowest number = leader
Distributed lock: Create ephemeral node; success = lock acquired; session end = lock released
Configuration management: Store config in znodes; services watch for changes
Service discovery: Services register ephemeral znodes; clients read to find services

2026: etcd (Raft-based) is often preferred over ZooKeeper (simpler, better-maintained)

What is the relationship between 2PC (two-phase commit) and consensus?
?

2PC IS a consensus algorithm (specifically, atomic commit = consensus on commit/abort)
Unanimous agreement: 2PC requires ALL participants to agree to commit (not just majority)
Why unanimous, not majority: A participant that voted yes must be prepared to commit; if we could commit without their agreement, they’d be in an inconsistent state
How it maps to consensus:
- Coordinator = leader of consensus
- Participants vote yes/no = propose commit/abort
- Coordinator’s phase 2 decision = consensus decision
Why 2PC is “blocking”: It’s a SINGLE-ROUND consensus — coordinator decides and tells everyone. If coordinator fails between prepare and commit, participants hold locks waiting for the decision → in-doubt transactions that block indefinitely
Raft vs 2PC: Raft can elect a new leader if current fails; 2PC coordinator has no built-in leader election → 2PC is not fault-tolerant alone

Modern Context (2026)

Why has Raft largely replaced Paxos in new distributed systems?
?

Paxos (Lamport, 1998): Proven correct, theoretically elegant, but famously complex
- Multi-Paxos (needed for practical log replication): Even harder to implement
- “There are two types of people: those who know nothing about Paxos, and those who are implementing a broken version”
Raft (Ongaro & Ousterhout, 2014): Explicitly designed for understandability
- Same performance/safety guarantees as Multi-Paxos
- Clear separation of concerns: leader election, log replication, safety
- Published with formal correctness proof + TLA+ specification
Why Raft won:
- Easier to implement correctly (fewer subtle cases)
- Easier to teach and reason about
- Widely available open-source implementations: etcd/Raft, tikv/Raft, hashicorp/raft
2026 status: Raft in etcd, CockroachDB, TiDB, YugabyteDB, Kafka KRaft, Consul — essentially ubiquitous in new distributed systems

What is Kafka KRaft and why did Kafka replace ZooKeeper?
?

Problem with ZooKeeper:
- External dependency: operators had to manage two separate systems (Kafka + ZooKeeper)
- ZooKeeper metadata scalability: limited to a few hundred thousand partitions
- Slow controller failover: new controller had to read all metadata from ZooKeeper (~minutes for large clusters)
KRaft (Kafka Raft Metadata mode):
- Kafka manages its own metadata using a built-in Raft consensus protocol
- Metadata stored in a special Kafka topic (__cluster_metadata)
- Controller quorum (3-5 nodes) uses Raft to agree on partition leadership
Advantages:
- Single system to operate
- Faster failover (metadata is local, no ZooKeeper read)
- Can scale to millions of partitions
- Simpler deployment (especially in Kubernetes)
Status: KRaft production-ready in Kafka 3.3+ (2022); ZooKeeper mode deprecated in 4.0 (2024)

Interview Scenarios

When would you use linearizability vs eventual consistency in a distributed system?
?
Use linearizability when:

Distributed locks/leader election: Two processes can’t both think they hold the lock
Unique constraint enforcement: “Only one account per email address” must be globally enforced
Cross-channel consistency: User uploads file, shares link, friend must be able to download it immediately
Financial transactions: Balance can’t temporarily show negative due to stale read

Use eventual consistency when:

Social media feeds: Seeing posts from 2 seconds ago is fine
Approximate analytics: Click counts, view counts — eventual is fine
Caches and CDN: Stale content is tolerable
Shopping cart: Eventual consistency + CRDT merge is fine (worst case: duplicate item)
Any scenario where temporary stale reads are acceptable

Rule of thumb: Ask “what’s the impact if a user sees stale data for 1-5 seconds?” If it’s a business/safety problem → linearizability. If users would shrug → eventual consistency is fine and cheaper.

Design a distributed leader election mechanism. What properties must it satisfy?
?
Requirements:

Safety: At most one leader at any time
Liveness: A leader is eventually elected (even after failures)
Robustness: Works with f node failures in a cluster of 2f+1 nodes

Implementation using etcd/ZooKeeper:

Each candidate tries to create an ephemeral key in etcd with a TTL (e.g., 10 seconds)
Only one can succeed (compare-and-swap semantics)
Winner = leader; renews TTL periodically (keepalive)
Other candidates watch the key; when it disappears (TTL expiry or crash) → retry

Fencing tokens (prevent stale leader actions):

etcd assigns a monotonically increasing revision number
Leader includes revision in all writes to protected resources
Resources reject writes with old revision numbers

Split-brain prevention:

Lease-based: leader can only act while its TTL is valid
If leader pauses (GC), lease expires → another node elected → new revision → old leader’s writes rejected

Real implementation: etcd’s election primitives handle this — use clientv3/concurrency.NewElection() in Go

Quick Facts

What quorum size does Raft require and why?
?

Quorum = majority = ⌊n/2⌋ + 1
- 3-node cluster: quorum = 2 (can tolerate 1 failure)
- 5-node cluster: quorum = 3 (can tolerate 2 failures)
- 7-node cluster: quorum = 4 (can tolerate 3 failures)
Why majority:
- Any two majorities overlap by at least 1 node
- Ensures the elected leader has seen all committed entries (it got a majority of votes)
- Ensures a commit is acknowledged by majority → survives f failures in (2f+1) cluster
Odd numbers preferred: With even nodes, risk of split vote → odd numbers guarantee majority is unambiguous
Minimum for fault tolerance: 3 nodes (can tolerate 1 failure); 2-node clusters are not fault-tolerant (majority = 2 = both nodes required)

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch09-flashcards