Chapter 10 Flashcards — Consistency and Consensus
flashcards ddia-2e chapter-10 consistency consensus linearizability raft
Linearizability
What is linearizability and how does it differ from serializability?
?
Linearizability (atomic consistency):
- A guarantee for individual operations on individual objects
- Every operation appears to execute atomically at a single point in real time between its invocation and completion
- Once an operation completes, all subsequent reads from any node return the new value (recency guarantee)
- Informal: the system behaves as if there’s a single copy of the data on a single, infinitely fast machine
Serializability:
- A guarantee for transactions (groups of operations)
- Concurrent transactions appear to execute in some serial order
- That order can be in the “past” — transactions can be reordered even if they ran concurrently
- Does NOT require the serial order to match real-time order
Key difference:
- Serializability: correct ordering among transactions (can reorder to any serial schedule)
- Linearizability: real-time ordering of individual operations (must respect wall-clock order)
- Strict serializability = both = the strongest guarantee (Spanner, FoundationDB)
Example where they diverge:
Serializability allows: read transaction T1 sees T2’s writes even if T1 started before T2 committed (if TXs are serialized in T2→T1 order). Linearizability forbids this: if T1’s read started after T2 committed, T1 must see T2’s writes.
What are the main use cases that require linearizability?
?
1. Distributed locks and leader election:
- All nodes must agree on exactly one holder of the lock
- Without linearizability, two nodes could both see “lock is available” simultaneously and both acquire it
- Example: Raft-based etcd ensures only one leader exists per term
2. Uniqueness constraints:
- “Only one user can register username X” across all nodes
- Non-linearizable: two nodes check “is username taken?” simultaneously, both see NO, both create it
- Requires atomic read-modify-write (compare-and-swap) across all replicas
3. Cross-channel timing:
- User uploads a photo; sends link to friend via a different channel (email, SMS)
- Friend follows link before the photo has propagated to all replicas → 404 error
- Linearizability ensures: after upload completes, all reads return the photo
4. Balance / invariant enforcement:
- Bank account must never go negative
- Two concurrent withdrawals each check balance independently → both see sufficient funds → both succeed → balance goes negative
- Linearizability makes the check-and-debit atomic
Why does multi-leader replication break linearizability?
?
- Multi-leader: Multiple nodes accept writes concurrently; conflicts resolved later
- Why it breaks linearizability: Linearizability requires operations to appear to execute at a single point in time — as if there’s one copy of the data
- With two leaders, two writes can both complete “at the same time” on different leaders
- The system has no way to determine which write happened “before” the other in real time
- Conflict resolution (e.g., LWW, merge functions) happens asynchronously after both writes committed
Example:
Leader A (Europe): writes x=1 at "time" T1 → acknowledges to client
Leader B (Asia): writes x=2 at "time" T2 → acknowledges to client
Both clients believe their write succeeded (and it did — locally)
But x can only have one value globally — which one "wins"?
There is no linearizable answer if T1 ≈ T2.
Bottom line: Multi-leader replication is fundamentally incompatible with linearizability. Use single-leader or consensus-based replication for linearizable storage.
CAP and Consistency Trade-offs
What does the CAP theorem actually mean, and what is the real choice it presents?
?
CAP (Brewer, 2000):
- C = Consistency (actually means linearizability in CAP)
- A = Availability (every request gets a non-error response)
- P = Partition tolerance (system continues during network partition)
The real insight: Partition tolerance is not optional — network partitions happen in real distributed systems. You cannot opt out. So the real choice during a partition is:
- CP (consistency over availability): Return an error or block until partition heals. Data is never stale. ZooKeeper, etcd, CockroachDB, HBase.
- AP (availability over consistency): Return a response (potentially stale). Always available. Cassandra, DynamoDB (default), Riak.
What CAP does NOT say:
- It only applies during a partition — not normal operation (during normal operation you can have both C and A)
- “C” in CAP ≠ “C” in ACID (completely different concepts)
- CAP says nothing about latency during normal operation
PACELC extends CAP (Abadi, 2012):
- During Partition: A vs C (CAP)
- Else (no partition): Latency vs Consistency
- Even without partitions, strong consistency requires coordination → latency cost
- More practical for system design than CAP alone
What is causal consistency and when is it sufficient instead of linearizability?
?
Causal consistency: If event A causally precedes event B (A “happened before” B), then all nodes see A before B. Concurrent events (no causal relationship) may appear in any order.
Formal property: A → B (A causally precedes B) implies all nodes observe A before B.
Why weaker than linearizability: Doesn’t require a global total ordering of all events — only ordering of causally related events.
Why stronger than eventual consistency: Preserves the “happened-before” relationship; no reversal of causal order.
When causal consistency is sufficient (most cases!):
- Answering a forum post: you must see the question before the answer
- “Read your own writes”: you always see effects of your own previous actions
- Social feeds: new posts appear after older ones from the same user
- Database reads after writes: if you wrote a record, you can read it back
When you actually need linearizability (rare):
- Distributed lock (agree on one holder)
- Global uniqueness constraint (username, email must be unique across all nodes)
- Atomic increment (e.g., counter that multiple nodes increment)
- Cross-channel consistency where an external system observes your writes
Key insight: Most user-facing application consistency requirements reduce to causal consistency, not full linearizability. Implementing causal consistency does not require consensus-level coordination — it’s achievable with vector clocks and careful ordering.
Logical Clocks
What is a Lamport clock and what does it guarantee?
?
Lamport clock: A simple counter-based logical clock (Leslie Lamport, 1978).
Algorithm:
- Each node maintains an integer counter
L, initially 0 - Before any event or send:
L = L + 1 - Include current
Lvalue in every message sent - On receiving message with clock value
m:L = max(L, m) + 1
What it guarantees:
- If A → B (A happened before B, causally): then
L(A) < L(B)— always true - Contrapositive: If
L(A) ≥ L(B)then A did NOT happen before B (A is concurrent with or after B)
What it does NOT guarantee:
L(A) < L(B)does NOT imply A → B — A and B might be causally unrelated (concurrent)- Lamport clocks provide total order (all events can be compared) but not causal precision
Use cases:
- Breaking ties between concurrent events (combine with node ID for total order)
- Distributed mutex algorithms (Ricart-Agrawala)
- Event sequencing in distributed logs
Limitation: To know if two events are concurrent vs causally ordered, you need vector clocks.
What are vector clocks, how do they work, and when should you use them over Lamport clocks?
?
Vector clock: A vector of per-node counters that precisely tracks causal relationships (Fidge/Mattern, 1988).
Algorithm (N nodes, each with vector V[0..N-1]):
- Each node
imaintains vectorV, initially all zeros - Before local event at node
i:V[i]++ - Include current
Vin every message - On receive at node
ifrom nodejwith clockV_msg:V[k] = max(V[k], V_msg[k])for all kV[i]++
Causal comparison:
V(A) ≤ V(B)iffV(A)[k] ≤ V(B)[k]for all k (A happened-before B or equal)V(A) < V(B)iffV(A) ≤ V(B)andV(A) ≠ V(B)(A causally precedes B)- Concurrent: Neither
V(A) ≤ V(B)norV(B) ≤ V(A)(no causal relationship)
Use vector clocks when:
- You need to detect concurrent writes (to decide whether to merge or flag conflict)
- Conflict detection in distributed key-value stores (Amazon Dynamo uses version vectors — a variant)
- Debugging causal event ordering in distributed logs
Don’t use vector clocks when:
- You have many nodes (O(N) space and transmission cost per event)
- You only need total order, not causal precision (Lamport suffices)
Lamport vs Vector:
- Lamport: O(1) space; gives total order; cannot detect concurrency
- Vector: O(N) space; gives causal precision; detects concurrent events
What is a Hybrid Logical Clock (HLC) and why do databases like CockroachDB use it?
?
Hybrid Logical Clock (HLC): A clock that combines a physical wall clock timestamp with a logical counter.
Format: HLC = (physical_time, logical_counter)
Rules:
- On local event:
HLC.physical = max(HLC.physical, wall_clock); HLC.logical++ - On send: same as local event; include HLC in message
- On receive with
HLC_msg:HLC.physical = max(HLC.physical, HLC_msg.physical, wall_clock)HLC.logical = max(HLC.logical, HLC_msg.logical) + 1
Properties:
- Always monotonically increasing (never goes backward, unlike wall clocks)
- Stays within bounded drift of physical wall time
- Captures causal ordering (if A → B, then HLC(A) < HLC(B))
- O(1) space — just one tuple
Why CockroachDB uses HLC:
- Needs to assign transaction timestamps that are approximately physical time (for human-readable queries, time-travel queries)
- Wall clocks can go backward (NTP correction) → would violate MVCC ordering
- HLC gives monotonic timestamps that stay close to physical time → correct MVCC + usable timestamps
Comparison to TrueTime:
- TrueTime (Spanner): physical time with bounded uncertainty; requires GPS + atomic clock hardware
- HLC: software-only; no bounded uncertainty; just stays “close” to wall time
- HLC is the practical alternative for systems without TrueTime hardware
ID Generation
Compare Snowflake IDs, ULIDs, and UUID v4 for distributed ID generation.
?
UUID v4 (random):
550e8400-e29b-41d4-a716-446655440000
128 bits of randomness
- ✅ Globally unique; zero coordination; universally supported
- ❌ Not sortable; random inserts fragment B-tree indexes
- ❌ 36-character string (large storage overhead)
- Use when: simplicity matters more than performance or sortability
Snowflake ID (Twitter, 2010):
64-bit: [41-bit ms since epoch][10-bit node ID][12-bit sequence]
- ✅ Fits in SQL bigint; time-sorted (good B-tree locality)
- ✅ 4096 IDs/ms/node with no cross-node coordination
- ❌ Requires pre-assigned node IDs (coordination for node registration)
- ❌ Sensitive to clock skew (can produce IDs out of order across nodes)
- Use when: high throughput, need 64-bit integer, time-sorted inserts
ULID:
48-bit ms timestamp + 80-bit cryptographic random = 128 bits
Encoded as 26 Crockford base32 characters: 01ARZ3NDEKTSV4RRFFQ69G3BVZB
- ✅ Globally unique; sortable; URL-safe; no coordination
- ✅ Human-readable; lexicographic sort = time sort
- ❌ 128-bit (larger than Snowflake); not a native SQL integer type
- Use when: need sortable UUIDs without pre-assigned node IDs
UUID v7 (RFC 9562, 2024):
48-bit ms timestamp + 74 bits random/version = 128 bits
Standard UUID format: 0192f8a4-e7b5-7abc-9def-123456789abc
- ✅ Standardized; time-sorted; UUID-compatible; no coordination
- ✅ Good choice for new systems needing sortable IDs
- Use when: need a standard, sortable UUID for new systems (preferred over v4)
Consensus
What is consensus and what are its four formal properties?
?
Consensus: A group of nodes must agree on a single value. Once they decide, the decision is final and permanent.
Formal properties:
- Uniform agreement: All non-faulty nodes decide on the same value — no two nodes decide different values
- Validity: The decided value was actually proposed by some node — no value appears from nowhere
- Integrity: Each node decides at most once — once decided, the decision cannot change
- Termination (liveness): All non-faulty nodes eventually decide — the system doesn’t block forever
Practical equivalences (all reducible to consensus):
- Leader election → agree on which node is leader
- Total order broadcast → agree on the order of messages
- Atomic commit (2PC) → agree on whether to commit or abort
- Distributed lock → agree on which client holds the lock
The FLP impossibility: In a fully asynchronous network, consensus is impossible if even one node can fail. This is why all practical consensus algorithms (Raft, Paxos) assume partial synchrony (use timeouts).
How does Raft achieve consensus? Describe leader election and log replication.
?
Raft overview: Leader-based consensus algorithm designed for understandability. All writes go through the elected leader.
Leader election:
- Nodes start as Followers; transition to Candidate on election timeout (no heartbeat received)
- Candidate: increment term; vote for self; send
RequestVoteto all nodes - Becomes Leader if majority responds YES in this term
- If another leader with higher term is seen: step down to Follower
- If split vote (tie): timeout, increment term, re-elect
- Safety guarantee: Only ONE leader per term (two candidates can’t both get majority of N nodes)
Log replication:
- Client sends write to Leader
- Leader appends entry to its log (uncommitted)
- Leader sends
AppendEntriesRPC to all followers - Once majority acknowledges the entry: Leader marks it committed
- Leader applies committed entry to state machine; responds to client
- Followers apply committed entries on next
AppendEntries(via commit index)
Key safety property: A leader always has all committed entries. The election algorithm ensures a candidate can only win if its log is at least as up-to-date as any majority’s logs. This prevents a new leader from missing committed entries.
What is total order broadcast (TOB) and why is it equivalent to consensus?
?
Total Order Broadcast (TOB):
- All messages are delivered to all nodes in the same order
- No message is skipped (reliable delivery)
- Also called: atomic broadcast
Equivalence to consensus:
- TOB = one consensus decision per message
- Each position in the total order = one round of consensus
- If you can solve consensus → you can implement TOB (use consensus to agree on the next message at each position)
- If you can implement TOB → you can build consensus (broadcast a proposed value; the first delivered = the decision)
Practical use:
- Replicated state machine: Apply the same TOB-ordered messages to the same initial state on each replica → all replicas end up in the same state
- ZooKeeper uses Zab (ZooKeeper Atomic Broadcast) — a TOB protocol
- Kafka uses Raft in KRaft mode for its controller log — a TOB protocol
- etcd uses Raft — every key-value write = one TOB message
Why TOB is harder than regular broadcast: Regular broadcast (UDP multicast) doesn’t guarantee ordering or delivery. TOB requires consensus — all nodes must agree on the same ordering, even if some nodes are slow or partitioned.
Coordination Services
What does ZooKeeper provide and how does it implement leader election and fencing tokens?
?
ZooKeeper (Apache): Distributed coordination service built on Zab (ZooKeeper Atomic Broadcast — similar to Raft).
Core primitives:
- Znodes: Files in a hierarchical tree; can store small data (config, status)
- Ephemeral znodes: Automatically deleted when the creating client session ends → used for locks and presence detection
- Sequential znodes: ZooKeeper appends a monotonically increasing number to the name (e.g.,
/election/candidate-00000042) - Watches: Notify a client when a znode is created, deleted, or changed → reactive programming without polling
Leader election with ZooKeeper:
- All candidates create
/election/candidate-0000000N(sequential + ephemeral) - Read all candidates; sort by sequence number
- Node with LOWEST sequence number = current leader
- Each other node watches the node with the NEXT LOWER sequence number (not the leader — avoids thundering herd)
- When watched node is deleted (crash): re-check if now lowest → become leader; else watch new next-lower
Fencing tokens:
- The sequential znode number is a natural fencing token (monotonically increasing)
- Leader includes its sequence number in all writes to the protected resource
- Resource rejects writes with a lower sequence number than max-seen
etcd (CNCF): The Kubernetes standard. Similar to ZooKeeper but uses Raft and a simpler key-value API. Used by Kubernetes for all control plane state (pods, services, configmaps, secrets).
Why should you use etcd or ZooKeeper rather than implementing consensus yourself?
?
Consensus algorithms are deceptively hard to implement correctly:
- Raft has at least 8 distinct sub-cases in leader election alone
- Membership changes (adding/removing nodes) require special handling (joint consensus or single-server changes) — many bugs here
- Log compaction (snapshots) must handle concurrent log entries
- Clock skew and network delays create subtle edge cases
What “production consensus” requires beyond the Raft paper:
- Correct handling of all network failure modes (partition, delay, reorder)
- Pre-vote phase (optimization to avoid spurious elections)
- ReadIndex protocol for linearizable reads without writing to log
- Learner nodes (read-only replicas during join)
- Checkpointing and log truncation without losing committed entries
What etcd/ZooKeeper give you:
- Battle-tested implementation with billions of production hours
- Jepsen-verified (linearizability under fault injection)
- Client libraries in all major languages
- Operational tooling (metrics, snapshots, cluster membership)
- Automatic leader election, distributed locks, watches — all primitives ready to use
Rule: Never implement Paxos/Raft in application code. Use etcd, ZooKeeper, or a managed consensus service.
Modern Context (2026)
What consensus implementations are used by major distributed databases in 2026?
?
| System | Consensus | Notes |
|---|---|---|
| etcd | Raft | Kubernetes control plane; ~10,000 writes/sec |
| CockroachDB | Raft (per range) | Raft group per 64MB range |
| TiKV / TiDB | Raft (per region) | Similar architecture to CockroachDB |
| YugabyteDB | Raft (per tablet) | Same pattern |
| Kafka KRaft | Raft | Replaced ZooKeeper in Kafka 3.3+ |
| MongoDB | Raft-based oplog | Replaced older oplog replication in 4.0+ |
| Google Spanner | Paxos (per tablet group) | TrueTime for external consistency |
| FoundationDB | Paxos (internal) | Deterministic simulation testing |
Key trend: Raft has become the universal standard for new distributed systems. Paxos is used in legacy or Google-specific systems. New systems almost exclusively choose Raft.
Why Raft won: Better documentation, more reference implementations, clearer correctness arguments, and the original Raft thesis was specifically written to be teachable.
How are distributed ID generation standards evolving in 2026?
?
UUID v7 (RFC 9562, standardized 2024):
- Time-ordered UUIDs are now officially standardized
- Expected to become the default UUID type for new applications
- Replaces UUID v4 in cases where sortability matters (most database use cases)
ULID:
- Widely adopted in event-driven systems, event sourcing architectures
- Preferred in Node.js/TypeScript ecosystems
- Crockford base32 encoding is URL-safe and human-readable
Snowflake variants at scale:
- Discord: Incremented Snowflake epoch; uses 4096 shard IDs
- Instagram: Postgres sequence + shard ID (hybrid approach)
- LinkedIn: Uses LiSnowflake (custom variant optimized for their infrastructure)
TSID: Emerging as the preferred 64-bit sortable ID for systems that need to fit in SQL bigint without pre-assigned node IDs.
Industry shift: The days of UUID v4 as the default are ending. UUID v7 or ULID are the recommended defaults for new distributed systems where IDs need to be stored in a database or sorted.
Quick Facts
What is strict serializability and which systems provide it?
?
- Strict serializability = Serializability + Linearizability (the strongest possible guarantee)
- Serializability: Transactions execute in some serial order (for isolation between transactions)
- Linearizability: The serial order respects real-time order (if T1 completed before T2 started, T1 appears before T2)
What this means practically:
- All transactions are isolated (no dirty reads, no phantom reads — full serializability)
- All transactions are immediately visible after commit (linearizability — no stale reads from any replica)
- System behaves exactly like a single-threaded, single-node database
Systems that provide strict serializability:
- Google Spanner: Paxos replication + TrueTime → globally strictly serializable
- FoundationDB: Deterministic serialization + linearizable reads
- CockroachDB: SSI (Serializable Snapshot Isolation) + Raft → serializable + linearizable
Cost: Every read and write requires coordination (consensus-level); high latency, especially cross-datacenter. Appropriate for financial systems, globally consistent config stores, distributed counters.
What is the difference between a consensus algorithm blocking during a partition vs violating safety?
?
Blocking (liveness violation):
- If fewer than majority of nodes are reachable: consensus algorithm stops making progress
- No new writes are accepted; reads may be stale or blocked
- This is intentional — the algorithm refuses to proceed rather than risk split-brain
- Example: 5-node Raft cluster, 2 nodes unreachable → 3 remaining nodes form majority → continues
- Example: 5-node Raft cluster, 3 nodes unreachable → 2 remaining nodes cannot form majority → blocks
Safety violation (never happens in correct Raft/Paxos):
- Two different groups both commit conflicting values (split brain)
- Data corruption — two nodes think different values are “the truth”
- Correct consensus algorithms sacrifice liveness before ever violating safety
Trade-off:
- Raft/Paxos:
n ≥ 2f+1nodes required for progress (f = failed nodes); tolerates up ton/2 - 1failures - BFT (Tendermint): requires
n ≥ 3f+1; even more conservative - The algorithm prefers being unavailable over being incorrect
Practical implication: In a 5-node cluster, tolerate 2 node failures. Beyond that, the system blocks (writes rejected) until at least one node recovers. This is the correct behavior — it prevents data divergence.
Total Cards: 17
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29