Chapter 9: Consistency and Consensus
Overview
Chapter 9 explores the most theoretically deep questions in distributed systems: what does it mean for a distributed system to behave “correctly,” and what is fundamentally achievable? The chapter covers consistency models from strongest (linearizability) to weaker (causal consistency), the CAP theorem and its implications, and consensus algorithms (Paxos, Raft, Zab) that underpin distributed coordination.
Core thesis: Many distributed system problems reduce to consensus. If you can solve consensus, you can solve leader election, atomic broadcast, and atomic commits. But consensus is expensive—not all problems require it.
Key Concepts
Linearizability (Strong Consistency)
Definition: The strongest consistency guarantee. Operations appear to execute instantaneously, atomically, at some point between their invocation and completion. Once an operation completes, all subsequent reads must see the new value.
Informal: Make a distributed system behave “as if” there’s a single copy of the data.
Key property: Recency guarantee—after a write completes, all subsequent reads by any client on any replica must return the new value.
Where linearizability is needed:
- Locking and leader election: All nodes must agree on who holds the lock
- Constraints and uniqueness: Enforcing a username is unique across all nodes
- Cross-channel timing: User uploads photo, sends link to friend; friend must be able to read photo
Linearizability vs Serializability:
- Serializability: Isolation property for transactions (can appear in any serial order, even a past one)
- Linearizability: Recency guarantee for individual objects (must appear in real-time order)
- Strict serializability = both = linearizable + serializable (strongest possible)
Cost of linearizability:
- Requires coordination on every read/write
- Performance penalty (round trip to consensus layer)
- If network partition: must give up either linearizability or availability (CAP theorem)
The CAP Theorem
CAP (Brewer 2000):
- Consistency (here meaning Linearizability)
- Availability (every request receives a response, no errors)
- Partition tolerance (system continues working during network partition)
The theorem: In the presence of a network partition, you must choose between Consistency and Availability.
Why partition tolerance is not optional: Network partitions happen in real distributed systems (you can’t opt out). So the choice is really: CP or AP?
-
CP systems (prefer consistency): Return error during partition (e.g., ZooKeeper, HBase, etcd)
- Clients blocked until partition heals
- Data never becomes stale or inconsistent
-
AP systems (prefer availability): Return stale data during partition (e.g., Cassandra, DynamoDB)
- Always return a response (maybe stale)
- May have conflicts to resolve later
CAP is often misunderstood:
- CAP is about network partitions specifically—not about general availability/performance
- CAP only applies to the partition scenario; during normal operation, you can have both
- “Consistency” in CAP = linearizability, not the C in ACID
- Modern perspective: PACELC theorem adds latency trade-off even without partitions
PACELC (Daniel Abadi, 2012):
- During Partitions: Availability vs Consistency (like CAP)
- Else (no partition): Latency vs Consistency
- Even without partitions, strong consistency requires coordination → latency cost
Causal Consistency
Causal consistency: Preserves causality—if event A causally precedes event B, then all nodes see A before B. Concurrent events may appear in any order.
Why weaker than linearizability: Doesn’t require a total ordering of all operations—only causal ordering.
Why stronger than eventual consistency: Preserves the “happened-before” relationship.
Causal dependency tracking:
- Vector clocks track which events a process has seen
- Lamport timestamps provide total order (but not necessarily causal)
- Version vectors (from Ch5) track causally concurrent writes
Consistent prefix reads (from Ch5) is a form of causal consistency: if you see answer B, you must have seen question A that caused it.
Practical importance: Many consistency requirements reduce to causal consistency, not full linearizability. Causal consistency can be achieved without coordination (unlike linearizability).
Consensus Algorithms
What is consensus?
- Nodes must agree on a single value
- Once decided, value cannot be changed
- Every node must eventually decide
Why it matters: Consensus underpins:
- Leader election (consensus on who is leader)
- Atomic broadcast (total order broadcast = consensus on message order)
- Atomic commit (2PC = consensus on whether to commit)
- Locks and leases
Total Order Broadcast:
- All messages delivered to all nodes in the same order
- No message skipped (reliable delivery)
- Equivalent to consensus: each message = one round of consensus
- Used by ZooKeeper (Zab), Kafka (Raft in KRaft mode)
Consensus algorithms:
Paxos (Leslie Lamport, 1998):
- First proven consensus algorithm
- Notoriously difficult to understand and implement
- Single-decree Paxos: agree on one value
- Multi-Paxos: extend to a log of values (requires leader)
- Variants: Fast Paxos, Byzantine Paxos
Raft (Ongaro and Ousterhout, 2014):
- Designed for understandability (explicitly)
- Leader-based: all writes go through leader
- Leader election: candidate needs majority of votes
- Log replication: leader sends entries to followers; committed when majority acknowledge
- More commonly implemented than Paxos in new systems (2026)
Zab (ZooKeeper Atomic Broadcast):
- Used internally by ZooKeeper
- Similar to Raft; leader-based total order broadcast
Multi-Paxos / Raft trade-offs:
- Both require a leader; leader can accept new proposals while followers replicate
- Both require majority quorum for every commit
- Both can make progress as long as majority of nodes are healthy
- Performance: ~100ms for cross-datacenter commit; ~1ms for single DC
Limitations of consensus:
- Requires strictly majority of nodes healthy to make progress
- Synchronous replication → latency increases with quorum latency
- Leader election is expensive (election takes several round trips)
ZooKeeper and etcd
ZooKeeper (Apache) and etcd (CoreOS/CNCF):
- Distributed coordination services built on consensus (Zab and Raft respectively)
- Provide: leader election, distributed locks, distributed configuration, service discovery
ZooKeeper primitives:
- znodes: Files in a tree structure; can be watched for changes
- Ephemeral znodes: Deleted when creating client session ends (for locks, leader election)
- Sequential znodes: Auto-incrementing number appended to name (for fencing tokens)
- Watches: Notify clients when a znode changes
Why application code shouldn’t implement Paxos/Raft directly:
- Extremely complex; subtle bugs in edge cases
- Use ZooKeeper/etcd as a coordination service; application code uses simple primitives
Distributed Transactions and Atomic Commit
Atomic commit across nodes: All nodes commit or all abort.
Why 2PC works as consensus:
- Phase 1 (Prepare): Get unanimous agreement to commit
- Phase 2 (Commit): Coordinator sends final decision
- If any node says “no” → abort (unanimous yes required)
Three-Phase Commit (3PC):
- Adds a “pre-commit” phase to avoid coordinator failure blocking participants
- Still not safe in partially synchronous networks (can’t distinguish slow coordinator from crashed one)
- Not widely used
XA Transactions (eXtended Architecture):
- Standard API for 2PC across heterogeneous data stores
- Support: PostgreSQL, MySQL, Oracle, IBM DB2, ActiveMQ
- Problem: same coordinator failure issues; ties application code to recovery process
Important Points
- Linearizability is expensive: Requires coordination on every operation; imposes real latency costs.
- CAP doesn’t mean “pick two of three”: Partition tolerance is not optional; choice is really CP or AP.
- Causal consistency is often sufficient: Most user-facing consistency requirements are causal, not linearizable.
- Consensus is equivalent to total order broadcast: Any consensus algorithm can implement total order broadcast.
- Raft is Paxos made understandable: Functionally equivalent; Raft is now the preferred implementation.
- ZooKeeper/etcd are your consensus tools: Don’t implement Paxos yourself; use battle-tested libraries.
- 2PC is consensus for commit decisions: But it’s “blocking” consensus (coordinator failure blocks participants).
Examples & Case Studies
-
ZooKeeper Leader Election
- Each node creates sequential ephemeral znode
- Node with lowest sequence number = leader
- Watch previous node for deletion; when it disappears, try to become leader
- Fencing token: sequence number used to order requests
-
Kafka KRaft (Raft-based metadata)
- Kafka 3.x replaced ZooKeeper with internal Raft implementation
- Partition metadata managed by Raft-based “controller quorum”
- Eliminates ZooKeeper as external dependency
-
CockroachDB Distributed Transactions
- Uses Raft within each range (partition) for replication
- Uses atomic commit across ranges (2PC-like) for multi-row transactions
- SSI for isolation; Raft for replication = serializable + linearizable
-
Google Spanner
- Paxos for replication within each tablet group
- TrueTime for clock synchronization
- Achieves external consistency (linearizable + serializable) globally
Questions
- What is linearizability and how does it differ from serializability?
- What does the CAP theorem actually mean in practice?
- When is causal consistency sufficient instead of linearizability?
- How does Raft achieve consensus and what happens during leader election?
- What is total order broadcast and how does it relate to consensus?
- Why is 2PC a “blocking” commit protocol?
- What does ZooKeeper/etcd provide that application code can’t easily implement?
- What is the PACELC theorem and why is it more practical than CAP?
Modern Context (2026)
Raft everywhere (2026 standard):
- etcd (Kubernetes): Raft-based, ~1000+ concurrent key writes/sec
- CockroachDB, TiDB, YugabyteDB: Raft per shard
- Kafka KRaft: replaced ZooKeeper dependency (Kafka 3.3+)
- Virtually all new distributed systems use Raft over Paxos
Consensus-as-a-service:
- etcd: standard Kubernetes coordination service
- Cloud Spanner: fully managed consensus (Google’s TrueTime + Paxos)
- CockroachDB Serverless: managed consensus/replication
Extended Paxos family:
- Flexible Paxos: quorum sizes can vary per request (read vs write)
- Epaxos: leaderless, parallel consensus (used in some geo-distributed systems)
- CASPaxos: single-decree Paxos with CAS (used in distributed state machines)
Consensus for Blockchain:
- Proof of Work (Bitcoin): implicit consensus via chain weight
- BFT consensus (Ethereum 2.0 PoS): validators use BFT-based Casper FFG
- Tendermint/PBFT: BFT consensus for permissioned blockchains (Hyperledger)
Status: Notes complete
Last Updated: 2026-04-13