Chapter 9: Consistency and Consensus

Overview

Chapter 9 explores the most theoretically deep questions in distributed systems: what does it mean for a distributed system to behave “correctly,” and what is fundamentally achievable? The chapter covers consistency models from strongest (linearizability) to weaker (causal consistency), the CAP theorem and its implications, and consensus algorithms (Paxos, Raft, Zab) that underpin distributed coordination.

Core thesis: Many distributed system problems reduce to consensus. If you can solve consensus, you can solve leader election, atomic broadcast, and atomic commits. But consensus is expensive—not all problems require it.

Key Concepts

Linearizability (Strong Consistency)

Definition: The strongest consistency guarantee. Operations appear to execute instantaneously, atomically, at some point between their invocation and completion. Once an operation completes, all subsequent reads must see the new value.

Informal: Make a distributed system behave “as if” there’s a single copy of the data.

Key property: Recency guarantee—after a write completes, all subsequent reads by any client on any replica must return the new value.

Where linearizability is needed:

Locking and leader election: All nodes must agree on who holds the lock
Constraints and uniqueness: Enforcing a username is unique across all nodes
Cross-channel timing: User uploads photo, sends link to friend; friend must be able to read photo

Linearizability vs Serializability:

Serializability: Isolation property for transactions (can appear in any serial order, even a past one)
Linearizability: Recency guarantee for individual objects (must appear in real-time order)
Strict serializability = both = linearizable + serializable (strongest possible)

Cost of linearizability:

Requires coordination on every read/write
Performance penalty (round trip to consensus layer)
If network partition: must give up either linearizability or availability (CAP theorem)

The CAP Theorem

CAP (Brewer 2000):

Consistency (here meaning Linearizability)
Availability (every request receives a response, no errors)
Partition tolerance (system continues working during network partition)

The theorem: In the presence of a network partition, you must choose between Consistency and Availability.

Why partition tolerance is not optional: Network partitions happen in real distributed systems (you can’t opt out). So the choice is really: CP or AP?

CP systems (prefer consistency): Return error during partition (e.g., ZooKeeper, HBase, etcd)
- Clients blocked until partition heals
- Data never becomes stale or inconsistent
AP systems (prefer availability): Return stale data during partition (e.g., Cassandra, DynamoDB)
- Always return a response (maybe stale)
- May have conflicts to resolve later

CAP is often misunderstood:

CAP is about network partitions specifically—not about general availability/performance
CAP only applies to the partition scenario; during normal operation, you can have both
“Consistency” in CAP = linearizability, not the C in ACID
Modern perspective: PACELC theorem adds latency trade-off even without partitions

PACELC (Daniel Abadi, 2012):

During Partitions: Availability vs Consistency (like CAP)
Else (no partition): Latency vs Consistency
Even without partitions, strong consistency requires coordination → latency cost

Causal Consistency

Causal consistency: Preserves causality—if event A causally precedes event B, then all nodes see A before B. Concurrent events may appear in any order.

Why weaker than linearizability: Doesn’t require a total ordering of all operations—only causal ordering.

Why stronger than eventual consistency: Preserves the “happened-before” relationship.

Causal dependency tracking:

Vector clocks track which events a process has seen
Lamport timestamps provide total order (but not necessarily causal)
Version vectors (from Ch5) track causally concurrent writes

Consistent prefix reads (from Ch5) is a form of causal consistency: if you see answer B, you must have seen question A that caused it.

Practical importance: Many consistency requirements reduce to causal consistency, not full linearizability. Causal consistency can be achieved without coordination (unlike linearizability).

Consensus Algorithms

What is consensus?

Nodes must agree on a single value
Once decided, value cannot be changed
Every node must eventually decide

Why it matters: Consensus underpins:

Leader election (consensus on who is leader)
Atomic broadcast (total order broadcast = consensus on message order)
Atomic commit (2PC = consensus on whether to commit)
Locks and leases

Total Order Broadcast:

All messages delivered to all nodes in the same order
No message skipped (reliable delivery)
Equivalent to consensus: each message = one round of consensus
Used by ZooKeeper (Zab), Kafka (Raft in KRaft mode)

Consensus algorithms:

Paxos (Leslie Lamport, 1998):

First proven consensus algorithm
Notoriously difficult to understand and implement
Single-decree Paxos: agree on one value
Multi-Paxos: extend to a log of values (requires leader)
Variants: Fast Paxos, Byzantine Paxos

Raft (Ongaro and Ousterhout, 2014):

Designed for understandability (explicitly)
Leader-based: all writes go through leader
Leader election: candidate needs majority of votes
Log replication: leader sends entries to followers; committed when majority acknowledge
More commonly implemented than Paxos in new systems (2026)

Zab (ZooKeeper Atomic Broadcast):

Used internally by ZooKeeper
Similar to Raft; leader-based total order broadcast

Multi-Paxos / Raft trade-offs:

Both require a leader; leader can accept new proposals while followers replicate
Both require majority quorum for every commit
Both can make progress as long as majority of nodes are healthy
Performance: ~100ms for cross-datacenter commit; ~1ms for single DC

Limitations of consensus:

Requires strictly majority of nodes healthy to make progress
Synchronous replication → latency increases with quorum latency
Leader election is expensive (election takes several round trips)

ZooKeeper and etcd

ZooKeeper (Apache) and etcd (CoreOS/CNCF):

Distributed coordination services built on consensus (Zab and Raft respectively)
Provide: leader election, distributed locks, distributed configuration, service discovery

ZooKeeper primitives:

znodes: Files in a tree structure; can be watched for changes
Ephemeral znodes: Deleted when creating client session ends (for locks, leader election)
Sequential znodes: Auto-incrementing number appended to name (for fencing tokens)
Watches: Notify clients when a znode changes

Why application code shouldn’t implement Paxos/Raft directly:

Extremely complex; subtle bugs in edge cases
Use ZooKeeper/etcd as a coordination service; application code uses simple primitives

Distributed Transactions and Atomic Commit

Atomic commit across nodes: All nodes commit or all abort.

Why 2PC works as consensus:

Phase 1 (Prepare): Get unanimous agreement to commit
Phase 2 (Commit): Coordinator sends final decision
If any node says “no” → abort (unanimous yes required)

Three-Phase Commit (3PC):

Adds a “pre-commit” phase to avoid coordinator failure blocking participants
Still not safe in partially synchronous networks (can’t distinguish slow coordinator from crashed one)
Not widely used

XA Transactions (eXtended Architecture):

Standard API for 2PC across heterogeneous data stores
Support: PostgreSQL, MySQL, Oracle, IBM DB2, ActiveMQ
Problem: same coordinator failure issues; ties application code to recovery process

Important Points

Linearizability is expensive: Requires coordination on every operation; imposes real latency costs.
CAP doesn’t mean “pick two of three”: Partition tolerance is not optional; choice is really CP or AP.
Causal consistency is often sufficient: Most user-facing consistency requirements are causal, not linearizable.
Consensus is equivalent to total order broadcast: Any consensus algorithm can implement total order broadcast.
Raft is Paxos made understandable: Functionally equivalent; Raft is now the preferred implementation.
ZooKeeper/etcd are your consensus tools: Don’t implement Paxos yourself; use battle-tested libraries.
2PC is consensus for commit decisions: But it’s “blocking” consensus (coordinator failure blocks participants).

Examples & Case Studies

ZooKeeper Leader Election
- Each node creates sequential ephemeral znode
- Node with lowest sequence number = leader
- Watch previous node for deletion; when it disappears, try to become leader
- Fencing token: sequence number used to order requests
Kafka KRaft (Raft-based metadata)
- Kafka 3.x replaced ZooKeeper with internal Raft implementation
- Partition metadata managed by Raft-based “controller quorum”
- Eliminates ZooKeeper as external dependency
CockroachDB Distributed Transactions
- Uses Raft within each range (partition) for replication
- Uses atomic commit across ranges (2PC-like) for multi-row transactions
- SSI for isolation; Raft for replication = serializable + linearizable
Google Spanner
- Paxos for replication within each tablet group
- TrueTime for clock synchronization
- Achieves external consistency (linearizable + serializable) globally

Questions

What is linearizability and how does it differ from serializability?
What does the CAP theorem actually mean in practice?
When is causal consistency sufficient instead of linearizability?
How does Raft achieve consensus and what happens during leader election?
What is total order broadcast and how does it relate to consensus?
Why is 2PC a “blocking” commit protocol?
What does ZooKeeper/etcd provide that application code can’t easily implement?
What is the PACELC theorem and why is it more practical than CAP?

Modern Context (2026)

Raft everywhere (2026 standard):

etcd (Kubernetes): Raft-based, ~1000+ concurrent key writes/sec
CockroachDB, TiDB, YugabyteDB: Raft per shard
Kafka KRaft: replaced ZooKeeper dependency (Kafka 3.3+)
Virtually all new distributed systems use Raft over Paxos

Consensus-as-a-service:

etcd: standard Kubernetes coordination service
Cloud Spanner: fully managed consensus (Google’s TrueTime + Paxos)
CockroachDB Serverless: managed consensus/replication

Extended Paxos family:

Flexible Paxos: quorum sizes can vary per request (read vs write)
Epaxos: leaderless, parallel consensus (used in some geo-distributed systems)
CASPaxos: single-decree Paxos with CAS (used in distributed state machines)

Consensus for Blockchain:

Proof of Work (Bitcoin): implicit consensus via chain weight
BFT consensus (Ethereum 2.0 PoS): validators use BFT-based Casper FFG
Tendermint/PBFT: BFT consensus for permissioned blockchains (Hyperledger)

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch09-consistency-and-consensus