Chapter 9 Flashcards - Consistency and Consensus
Basic Concepts
What is linearizability and why is it the strongest consistency model?
?
-
Linearizability: Operations on a distributed system appear to execute atomically, at a single point in time, as if there’s only one copy of the data
-
Recency guarantee: After a write completes, every subsequent read by any client anywhere must return the written value (or a later write)
-
Total order: All operations have a global order consistent with real time
-
Strongest because: It makes a distributed system indistinguishable from a single-node system for any observer
-
Cost: Requires coordination (consensus) on every operation → high latency
-
Where needed: Distributed locks, leader election, unique constraint enforcement
-
Where NOT needed: Social media feeds, caches, approximate analytics — eventual/causal consistency suffices
What is the difference between linearizability and serializability?
?
-
Linearizability: Recency guarantee on individual objects
- After write completes, all reads see the new value
- Requires real-time ordering
- Single-object property
-
Serializability: Isolation guarantee for transactions (multi-operation)
- Transactions appear to have executed in some serial order
- Any serial order is acceptable — even “in the past” (no recency requirement)
- Multi-object property (transactions touch multiple rows)
-
Example difference: Serializability allows a transaction that reads before a write to appear to have run in a valid serial order, even if the real-time sequence differs. Linearizability doesn’t allow this — if write completed before your read, you must see it.
-
Strict Serializability = both properties = linearizable + serializable = strongest possible
CAP and PACELC
What does the CAP theorem say and what does it mean in practice?
?
-
CAP (Brewer 2000): In the presence of a network Partition, a distributed system cannot guarantee both Consistency (linearizability) and Availability (every request gets a response)
-
Practical meaning: Network partitions happen — you can’t opt out. So the real choice is:
- CP: During partition, refuse requests (no stale reads), wait for partition to heal
- Examples: ZooKeeper, etcd, HBase, Spanner
- AP: During partition, serve stale data or accept writes (may conflict)
- Examples: Cassandra, DynamoDB (multi-region), CouchDB
- CP: During partition, refuse requests (no stale reads), wait for partition to heal
-
Common misunderstandings:
- “C” in CAP = linearizability, NOT the C in ACID
- “Availability” in CAP = every non-failing node responds, not just “system is up”
- You can have both C and A during normal operation (no partition)
What is the PACELC theorem and why is it more useful than CAP?
?
-
PACELC (Abadi 2012): Extends CAP with a latency dimension
- During Partition: choose Availability vs Consistency (same as CAP)
- Else (no partition): choose Latency vs Consistency
-
Why more useful: Even when there’s no network partition, strong consistency requires coordination → higher latency. This trade-off exists always, not just during failures.
-
System labels (PA/EL, PC/EC, etc.):
- PA/EL: Cassandra — available during partition; low latency normally
- PC/EC: Spanner/etcd — consistent during partition; consistent (higher latency) normally
- PA/EC: Amazon DynamoDB — available during partition; consistent and higher latency normally
-
Interview tip: PACELC is more nuanced than CAP; shows you understand the ongoing latency cost of consistency, not just the partition behavior
Causal Consistency
What is causal consistency and how does it differ from linearizability and eventual consistency?
?
-
Causal consistency: If event A causally precedes event B (A happened-before B), then all nodes see A before B. Causally concurrent events may appear in any order.
-
Stronger than eventual consistency: Eventual has no ordering guarantees at all during lag; causal preserves happened-before relationships
-
Weaker than linearizability: Linearizability requires total real-time order; causal only requires causal order (concurrent events can differ)
-
Example:
- User posts question; then posts answer — anyone who sees the answer must also see the question (causal order)
- Two users post different comments simultaneously — concurrent, can be seen in any order
-
Key advantage: Can be achieved without consensus/coordination; doesn’t sacrifice availability during network partitions
-
Used in: CouchDB (MVCC with vector clocks), collaborative editing tools
Consensus
What is consensus and what properties must a consensus algorithm satisfy?
?
Consensus: An algorithm that allows multiple nodes to agree on a single value (or a sequence of values)
Three required properties:
- Uniform agreement: No two nodes decide on different values
- Integrity: No node decides twice (a decision is final)
- Validity: If a node decides value v, then v was proposed by some node (can’t invent values)
- Termination: Every non-faulty node eventually decides (progress property)
Why it’s hard: FLP impossibility says termination is impossible in a fully asynchronous system if any node can fail. Real algorithms work in practice by making timing assumptions (timeouts).
What consensus enables: Leader election, atomic broadcast (total order broadcast), atomic commit across nodes
How does the Raft consensus algorithm work?
?
Three roles: Follower, Candidate, Leader
Normal operation:
- Leader receives all writes; appends to its log
- Leader sends
AppendEntriesRPC to all followers - Followers acknowledge receipt
- Once majority acknowledge: entry is committed
- Leader tells followers it’s committed (next heartbeat)
Leader election:
- Follower times out (no heartbeat from leader) → becomes Candidate
- Candidate increments term, votes for itself, sends
RequestVoteto all nodes - Each node votes for at most one candidate per term (first valid request wins)
- Candidate with majority votes → becomes leader for that term
- New leader has most up-to-date log (guaranteed: only nodes with complete log can win)
Safety: Committed entries never overwritten (majority + term guarantees); at-most-once decision per term
What is total order broadcast and how does it relate to consensus?
?
-
Total order broadcast: Every node receives every message in the same order; no message is ever skipped
-
Properties:
- Reliable delivery: No message lost (eventually delivered to all non-faulty nodes)
- Total ordering: All nodes deliver messages in the same order
-
Equivalence to consensus:
- TOB from Consensus: Use one consensus round per message; position in order = consensus decision
- Consensus from TOB: Broadcast your proposal; first proposal in total order = consensus decision
- They are computationally equivalent problems
-
Why matters: Consensus is the fundamental primitive; TOB is how you use it:
- ZooKeeper (Zab protocol) implements TOB → applications use ZooKeeper for coordination
- Raft log replication IS total order broadcast (all nodes agree on log order)
Distributed Coordination
What does ZooKeeper provide and what problems does it solve?
?
- ZooKeeper: Distributed coordination service using Zab (similar to Raft) for consensus
- Data model: Hierarchical namespace (znodes), similar to a filesystem
Key primitive types:
- Ephemeral znodes: Deleted when client session ends → used for locks, presence detection
- Sequential znodes: Auto-numbered suffix → used for fencing tokens (monotonically increasing), queue ordering
- Watches: Callbacks when node changes → used for detecting leader changes, config updates
What applications use ZooKeeper for:
- Leader election: Create ephemeral sequential node; lowest number = leader
- Distributed lock: Create ephemeral node; success = lock acquired; session end = lock released
- Configuration management: Store config in znodes; services watch for changes
- Service discovery: Services register ephemeral znodes; clients read to find services
2026: etcd (Raft-based) is often preferred over ZooKeeper (simpler, better-maintained)
What is the relationship between 2PC (two-phase commit) and consensus?
?
-
2PC IS a consensus algorithm (specifically, atomic commit = consensus on commit/abort)
-
Unanimous agreement: 2PC requires ALL participants to agree to commit (not just majority)
-
Why unanimous, not majority: A participant that voted yes must be prepared to commit; if we could commit without their agreement, they’d be in an inconsistent state
-
How it maps to consensus:
- Coordinator = leader of consensus
- Participants vote yes/no = propose commit/abort
- Coordinator’s phase 2 decision = consensus decision
-
Why 2PC is “blocking”: It’s a SINGLE-ROUND consensus — coordinator decides and tells everyone. If coordinator fails between prepare and commit, participants hold locks waiting for the decision → in-doubt transactions that block indefinitely
-
Raft vs 2PC: Raft can elect a new leader if current fails; 2PC coordinator has no built-in leader election → 2PC is not fault-tolerant alone
Modern Context (2026)
Why has Raft largely replaced Paxos in new distributed systems?
?
-
Paxos (Lamport, 1998): Proven correct, theoretically elegant, but famously complex
- Multi-Paxos (needed for practical log replication): Even harder to implement
- “There are two types of people: those who know nothing about Paxos, and those who are implementing a broken version”
-
Raft (Ongaro & Ousterhout, 2014): Explicitly designed for understandability
- Same performance/safety guarantees as Multi-Paxos
- Clear separation of concerns: leader election, log replication, safety
- Published with formal correctness proof + TLA+ specification
-
Why Raft won:
- Easier to implement correctly (fewer subtle cases)
- Easier to teach and reason about
- Widely available open-source implementations: etcd/Raft, tikv/Raft, hashicorp/raft
-
2026 status: Raft in etcd, CockroachDB, TiDB, YugabyteDB, Kafka KRaft, Consul — essentially ubiquitous in new distributed systems
What is Kafka KRaft and why did Kafka replace ZooKeeper?
?
-
Problem with ZooKeeper:
- External dependency: operators had to manage two separate systems (Kafka + ZooKeeper)
- ZooKeeper metadata scalability: limited to a few hundred thousand partitions
- Slow controller failover: new controller had to read all metadata from ZooKeeper (~minutes for large clusters)
-
KRaft (Kafka Raft Metadata mode):
- Kafka manages its own metadata using a built-in Raft consensus protocol
- Metadata stored in a special Kafka topic (
__cluster_metadata) - Controller quorum (3-5 nodes) uses Raft to agree on partition leadership
-
Advantages:
- Single system to operate
- Faster failover (metadata is local, no ZooKeeper read)
- Can scale to millions of partitions
- Simpler deployment (especially in Kubernetes)
-
Status: KRaft production-ready in Kafka 3.3+ (2022); ZooKeeper mode deprecated in 4.0 (2024)
Interview Scenarios
When would you use linearizability vs eventual consistency in a distributed system?
?
Use linearizability when:
- Distributed locks/leader election: Two processes can’t both think they hold the lock
- Unique constraint enforcement: “Only one account per email address” must be globally enforced
- Cross-channel consistency: User uploads file, shares link, friend must be able to download it immediately
- Financial transactions: Balance can’t temporarily show negative due to stale read
Use eventual consistency when:
- Social media feeds: Seeing posts from 2 seconds ago is fine
- Approximate analytics: Click counts, view counts — eventual is fine
- Caches and CDN: Stale content is tolerable
- Shopping cart: Eventual consistency + CRDT merge is fine (worst case: duplicate item)
- Any scenario where temporary stale reads are acceptable
Rule of thumb: Ask “what’s the impact if a user sees stale data for 1-5 seconds?” If it’s a business/safety problem → linearizability. If users would shrug → eventual consistency is fine and cheaper.
Design a distributed leader election mechanism. What properties must it satisfy?
?
Requirements:
- Safety: At most one leader at any time
- Liveness: A leader is eventually elected (even after failures)
- Robustness: Works with f node failures in a cluster of 2f+1 nodes
Implementation using etcd/ZooKeeper:
- Each candidate tries to create an ephemeral key in etcd with a TTL (e.g., 10 seconds)
- Only one can succeed (
compare-and-swapsemantics) - Winner = leader; renews TTL periodically (keepalive)
- Other candidates watch the key; when it disappears (TTL expiry or crash) → retry
Fencing tokens (prevent stale leader actions):
- etcd assigns a monotonically increasing revision number
- Leader includes revision in all writes to protected resources
- Resources reject writes with old revision numbers
Split-brain prevention:
- Lease-based: leader can only act while its TTL is valid
- If leader pauses (GC), lease expires → another node elected → new revision → old leader’s writes rejected
Real implementation: etcd’s election primitives handle this — use clientv3/concurrency.NewElection() in Go
Quick Facts
What quorum size does Raft require and why?
?
-
Quorum = majority =
⌊n/2⌋ + 1- 3-node cluster: quorum = 2 (can tolerate 1 failure)
- 5-node cluster: quorum = 3 (can tolerate 2 failures)
- 7-node cluster: quorum = 4 (can tolerate 3 failures)
-
Why majority:
- Any two majorities overlap by at least 1 node
- Ensures the elected leader has seen all committed entries (it got a majority of votes)
- Ensures a commit is acknowledged by majority → survives f failures in (2f+1) cluster
-
Odd numbers preferred: With even nodes, risk of split vote → odd numbers guarantee majority is unambiguous
-
Minimum for fault tolerance: 3 nodes (can tolerate 1 failure); 2-node clusters are not fault-tolerant (majority = 2 = both nodes required)
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13