Chapter 10 Flashcards — Consistency and Consensus

flashcards ddia-2e chapter-10 consistency consensus linearizability raft

Linearizability

What is linearizability and how does it differ from serializability?
?
Linearizability (atomic consistency):

A guarantee for individual operations on individual objects
Every operation appears to execute atomically at a single point in real time between its invocation and completion
Once an operation completes, all subsequent reads from any node return the new value (recency guarantee)
Informal: the system behaves as if there’s a single copy of the data on a single, infinitely fast machine

Serializability:

A guarantee for transactions (groups of operations)
Concurrent transactions appear to execute in some serial order
That order can be in the “past” — transactions can be reordered even if they ran concurrently
Does NOT require the serial order to match real-time order

Key difference:

Serializability: correct ordering among transactions (can reorder to any serial schedule)
Linearizability: real-time ordering of individual operations (must respect wall-clock order)
Strict serializability = both = the strongest guarantee (Spanner, FoundationDB)

Example where they diverge:
Serializability allows: read transaction T1 sees T2’s writes even if T1 started before T2 committed (if TXs are serialized in T2→T1 order). Linearizability forbids this: if T1’s read started after T2 committed, T1 must see T2’s writes.

What are the main use cases that require linearizability?
?
1. Distributed locks and leader election:

All nodes must agree on exactly one holder of the lock
Without linearizability, two nodes could both see “lock is available” simultaneously and both acquire it
Example: Raft-based etcd ensures only one leader exists per term

2. Uniqueness constraints:

“Only one user can register username X” across all nodes
Non-linearizable: two nodes check “is username taken?” simultaneously, both see NO, both create it
Requires atomic read-modify-write (compare-and-swap) across all replicas

3. Cross-channel timing:

User uploads a photo; sends link to friend via a different channel (email, SMS)
Friend follows link before the photo has propagated to all replicas → 404 error
Linearizability ensures: after upload completes, all reads return the photo

4. Balance / invariant enforcement:

Bank account must never go negative
Two concurrent withdrawals each check balance independently → both see sufficient funds → both succeed → balance goes negative
Linearizability makes the check-and-debit atomic

Why does multi-leader replication break linearizability?
?

Multi-leader: Multiple nodes accept writes concurrently; conflicts resolved later
Why it breaks linearizability: Linearizability requires operations to appear to execute at a single point in time — as if there’s one copy of the data
With two leaders, two writes can both complete “at the same time” on different leaders
The system has no way to determine which write happened “before” the other in real time
Conflict resolution (e.g., LWW, merge functions) happens asynchronously after both writes committed

Example:

Leader A (Europe): writes x=1 at "time" T1 → acknowledges to client
Leader B (Asia):   writes x=2 at "time" T2 → acknowledges to client
Both clients believe their write succeeded (and it did — locally)
But x can only have one value globally — which one "wins"?
There is no linearizable answer if T1 ≈ T2.

Bottom line: Multi-leader replication is fundamentally incompatible with linearizability. Use single-leader or consensus-based replication for linearizable storage.

CAP and Consistency Trade-offs

What does the CAP theorem actually mean, and what is the real choice it presents?
?
CAP (Brewer, 2000):

C = Consistency (actually means linearizability in CAP)
A = Availability (every request gets a non-error response)
P = Partition tolerance (system continues during network partition)

The real insight: Partition tolerance is not optional — network partitions happen in real distributed systems. You cannot opt out. So the real choice during a partition is:

CP (consistency over availability): Return an error or block until partition heals. Data is never stale. ZooKeeper, etcd, CockroachDB, HBase.
AP (availability over consistency): Return a response (potentially stale). Always available. Cassandra, DynamoDB (default), Riak.

What CAP does NOT say:

It only applies during a partition — not normal operation (during normal operation you can have both C and A)
“C” in CAP ≠ “C” in ACID (completely different concepts)
CAP says nothing about latency during normal operation

PACELC extends CAP (Abadi, 2012):

During Partition: A vs C (CAP)
Else (no partition): Latency vs Consistency
Even without partitions, strong consistency requires coordination → latency cost
More practical for system design than CAP alone

What is causal consistency and when is it sufficient instead of linearizability?
?
Causal consistency: If event A causally precedes event B (A “happened before” B), then all nodes see A before B. Concurrent events (no causal relationship) may appear in any order.

Formal property: A → B (A causally precedes B) implies all nodes observe A before B.

Why weaker than linearizability: Doesn’t require a global total ordering of all events — only ordering of causally related events.

Why stronger than eventual consistency: Preserves the “happened-before” relationship; no reversal of causal order.

When causal consistency is sufficient (most cases!):

Answering a forum post: you must see the question before the answer
“Read your own writes”: you always see effects of your own previous actions
Social feeds: new posts appear after older ones from the same user
Database reads after writes: if you wrote a record, you can read it back

When you actually need linearizability (rare):

Distributed lock (agree on one holder)
Global uniqueness constraint (username, email must be unique across all nodes)
Atomic increment (e.g., counter that multiple nodes increment)
Cross-channel consistency where an external system observes your writes

Key insight: Most user-facing application consistency requirements reduce to causal consistency, not full linearizability. Implementing causal consistency does not require consensus-level coordination — it’s achievable with vector clocks and careful ordering.

Logical Clocks

What is a Lamport clock and what does it guarantee?
?
Lamport clock: A simple counter-based logical clock (Leslie Lamport, 1978).

Algorithm:

Each node maintains an integer counter L, initially 0
Before any event or send: L = L + 1
Include current L value in every message sent
On receiving message with clock value m: L = max(L, m) + 1

What it guarantees:

If A → B (A happened before B, causally): then L(A) < L(B) — always true
Contrapositive: If L(A) ≥ L(B) then A did NOT happen before B (A is concurrent with or after B)

What it does NOT guarantee:

L(A) < L(B) does NOT imply A → B — A and B might be causally unrelated (concurrent)
Lamport clocks provide total order (all events can be compared) but not causal precision

Use cases:

Breaking ties between concurrent events (combine with node ID for total order)
Distributed mutex algorithms (Ricart-Agrawala)
Event sequencing in distributed logs

Limitation: To know if two events are concurrent vs causally ordered, you need vector clocks.

What are vector clocks, how do they work, and when should you use them over Lamport clocks?
?
Vector clock: A vector of per-node counters that precisely tracks causal relationships (Fidge/Mattern, 1988).

Algorithm (N nodes, each with vector V[0..N-1]):

Each node i maintains vector V, initially all zeros
Before local event at node i: V[i]++
Include current V in every message
On receive at node i from node j with clock V_msg:
- V[k] = max(V[k], V_msg[k]) for all k
- V[i]++

Causal comparison:

V(A) ≤ V(B) iff V(A)[k] ≤ V(B)[k] for all k (A happened-before B or equal)
V(A) < V(B) iff V(A) ≤ V(B) and V(A) ≠ V(B) (A causally precedes B)
Concurrent: Neither V(A) ≤ V(B) nor V(B) ≤ V(A) (no causal relationship)

Use vector clocks when:

You need to detect concurrent writes (to decide whether to merge or flag conflict)
Conflict detection in distributed key-value stores (Amazon Dynamo uses version vectors — a variant)
Debugging causal event ordering in distributed logs

Don’t use vector clocks when:

You have many nodes (O(N) space and transmission cost per event)
You only need total order, not causal precision (Lamport suffices)

Lamport vs Vector:

Lamport: O(1) space; gives total order; cannot detect concurrency
Vector: O(N) space; gives causal precision; detects concurrent events

What is a Hybrid Logical Clock (HLC) and why do databases like CockroachDB use it?
?
Hybrid Logical Clock (HLC): A clock that combines a physical wall clock timestamp with a logical counter.

Format: HLC = (physical_time, logical_counter)

Rules:

On local event: HLC.physical = max(HLC.physical, wall_clock); HLC.logical++
On send: same as local event; include HLC in message
On receive with HLC_msg:
- HLC.physical = max(HLC.physical, HLC_msg.physical, wall_clock)
- HLC.logical = max(HLC.logical, HLC_msg.logical) + 1

Properties:

Always monotonically increasing (never goes backward, unlike wall clocks)
Stays within bounded drift of physical wall time
Captures causal ordering (if A → B, then HLC(A) < HLC(B))
O(1) space — just one tuple

Why CockroachDB uses HLC:

Needs to assign transaction timestamps that are approximately physical time (for human-readable queries, time-travel queries)
Wall clocks can go backward (NTP correction) → would violate MVCC ordering
HLC gives monotonic timestamps that stay close to physical time → correct MVCC + usable timestamps

Comparison to TrueTime:

TrueTime (Spanner): physical time with bounded uncertainty; requires GPS + atomic clock hardware
HLC: software-only; no bounded uncertainty; just stays “close” to wall time
HLC is the practical alternative for systems without TrueTime hardware

ID Generation

Compare Snowflake IDs, ULIDs, and UUID v4 for distributed ID generation.
?
UUID v4 (random):

550e8400-e29b-41d4-a716-446655440000
128 bits of randomness

✅ Globally unique; zero coordination; universally supported
❌ Not sortable; random inserts fragment B-tree indexes
❌ 36-character string (large storage overhead)
Use when: simplicity matters more than performance or sortability

Snowflake ID (Twitter, 2010):

64-bit: [41-bit ms since epoch][10-bit node ID][12-bit sequence]

✅ Fits in SQL bigint; time-sorted (good B-tree locality)
✅ 4096 IDs/ms/node with no cross-node coordination
❌ Requires pre-assigned node IDs (coordination for node registration)
❌ Sensitive to clock skew (can produce IDs out of order across nodes)
Use when: high throughput, need 64-bit integer, time-sorted inserts

ULID:

48-bit ms timestamp + 80-bit cryptographic random = 128 bits
Encoded as 26 Crockford base32 characters: 01ARZ3NDEKTSV4RRFFQ69G3BVZB

✅ Globally unique; sortable; URL-safe; no coordination
✅ Human-readable; lexicographic sort = time sort
❌ 128-bit (larger than Snowflake); not a native SQL integer type
Use when: need sortable UUIDs without pre-assigned node IDs

UUID v7 (RFC 9562, 2024):

48-bit ms timestamp + 74 bits random/version = 128 bits
Standard UUID format: 0192f8a4-e7b5-7abc-9def-123456789abc

✅ Standardized; time-sorted; UUID-compatible; no coordination
✅ Good choice for new systems needing sortable IDs
Use when: need a standard, sortable UUID for new systems (preferred over v4)

Consensus

What is consensus and what are its four formal properties?
?
Consensus: A group of nodes must agree on a single value. Once they decide, the decision is final and permanent.

Formal properties:

Uniform agreement: All non-faulty nodes decide on the same value — no two nodes decide different values
Validity: The decided value was actually proposed by some node — no value appears from nowhere
Integrity: Each node decides at most once — once decided, the decision cannot change
Termination (liveness): All non-faulty nodes eventually decide — the system doesn’t block forever

Practical equivalences (all reducible to consensus):

Leader election → agree on which node is leader
Total order broadcast → agree on the order of messages
Atomic commit (2PC) → agree on whether to commit or abort
Distributed lock → agree on which client holds the lock

The FLP impossibility: In a fully asynchronous network, consensus is impossible if even one node can fail. This is why all practical consensus algorithms (Raft, Paxos) assume partial synchrony (use timeouts).

How does Raft achieve consensus? Describe leader election and log replication.
?
Raft overview: Leader-based consensus algorithm designed for understandability. All writes go through the elected leader.

Leader election:

Nodes start as Followers; transition to Candidate on election timeout (no heartbeat received)
Candidate: increment term; vote for self; send RequestVote to all nodes
Becomes Leader if majority responds YES in this term
If another leader with higher term is seen: step down to Follower
If split vote (tie): timeout, increment term, re-elect
Safety guarantee: Only ONE leader per term (two candidates can’t both get majority of N nodes)

Log replication:

Client sends write to Leader
Leader appends entry to its log (uncommitted)
Leader sends AppendEntries RPC to all followers
Once majority acknowledges the entry: Leader marks it committed
Leader applies committed entry to state machine; responds to client
Followers apply committed entries on next AppendEntries (via commit index)

Key safety property: A leader always has all committed entries. The election algorithm ensures a candidate can only win if its log is at least as up-to-date as any majority’s logs. This prevents a new leader from missing committed entries.

What is total order broadcast (TOB) and why is it equivalent to consensus?
?
Total Order Broadcast (TOB):

All messages are delivered to all nodes in the same order
No message is skipped (reliable delivery)
Also called: atomic broadcast

Equivalence to consensus:

TOB = one consensus decision per message
Each position in the total order = one round of consensus
If you can solve consensus → you can implement TOB (use consensus to agree on the next message at each position)
If you can implement TOB → you can build consensus (broadcast a proposed value; the first delivered = the decision)

Practical use:

Replicated state machine: Apply the same TOB-ordered messages to the same initial state on each replica → all replicas end up in the same state
ZooKeeper uses Zab (ZooKeeper Atomic Broadcast) — a TOB protocol
Kafka uses Raft in KRaft mode for its controller log — a TOB protocol
etcd uses Raft — every key-value write = one TOB message

Why TOB is harder than regular broadcast: Regular broadcast (UDP multicast) doesn’t guarantee ordering or delivery. TOB requires consensus — all nodes must agree on the same ordering, even if some nodes are slow or partitioned.

Coordination Services

What does ZooKeeper provide and how does it implement leader election and fencing tokens?
?
ZooKeeper (Apache): Distributed coordination service built on Zab (ZooKeeper Atomic Broadcast — similar to Raft).

Core primitives:

Znodes: Files in a hierarchical tree; can store small data (config, status)
Ephemeral znodes: Automatically deleted when the creating client session ends → used for locks and presence detection
Sequential znodes: ZooKeeper appends a monotonically increasing number to the name (e.g., /election/candidate-00000042)
Watches: Notify a client when a znode is created, deleted, or changed → reactive programming without polling

Leader election with ZooKeeper:

All candidates create /election/candidate-0000000N (sequential + ephemeral)
Read all candidates; sort by sequence number
Node with LOWEST sequence number = current leader
Each other node watches the node with the NEXT LOWER sequence number (not the leader — avoids thundering herd)
When watched node is deleted (crash): re-check if now lowest → become leader; else watch new next-lower

Fencing tokens:

The sequential znode number is a natural fencing token (monotonically increasing)
Leader includes its sequence number in all writes to the protected resource
Resource rejects writes with a lower sequence number than max-seen

etcd (CNCF): The Kubernetes standard. Similar to ZooKeeper but uses Raft and a simpler key-value API. Used by Kubernetes for all control plane state (pods, services, configmaps, secrets).

Why should you use etcd or ZooKeeper rather than implementing consensus yourself?
?
Consensus algorithms are deceptively hard to implement correctly:

Raft has at least 8 distinct sub-cases in leader election alone
Membership changes (adding/removing nodes) require special handling (joint consensus or single-server changes) — many bugs here
Log compaction (snapshots) must handle concurrent log entries
Clock skew and network delays create subtle edge cases

What “production consensus” requires beyond the Raft paper:

Correct handling of all network failure modes (partition, delay, reorder)
Pre-vote phase (optimization to avoid spurious elections)
ReadIndex protocol for linearizable reads without writing to log
Learner nodes (read-only replicas during join)
Checkpointing and log truncation without losing committed entries

What etcd/ZooKeeper give you:

Battle-tested implementation with billions of production hours
Jepsen-verified (linearizability under fault injection)
Client libraries in all major languages
Operational tooling (metrics, snapshots, cluster membership)
Automatic leader election, distributed locks, watches — all primitives ready to use

Rule: Never implement Paxos/Raft in application code. Use etcd, ZooKeeper, or a managed consensus service.

Modern Context (2026)

What consensus implementations are used by major distributed databases in 2026?
?

System	Consensus	Notes
etcd	Raft	Kubernetes control plane; ~10,000 writes/sec
CockroachDB	Raft (per range)	Raft group per 64MB range
TiKV / TiDB	Raft (per region)	Similar architecture to CockroachDB
YugabyteDB	Raft (per tablet)	Same pattern
Kafka KRaft	Raft	Replaced ZooKeeper in Kafka 3.3+
MongoDB	Raft-based oplog	Replaced older oplog replication in 4.0+
Google Spanner	Paxos (per tablet group)	TrueTime for external consistency
FoundationDB	Paxos (internal)	Deterministic simulation testing

Key trend: Raft has become the universal standard for new distributed systems. Paxos is used in legacy or Google-specific systems. New systems almost exclusively choose Raft.

Why Raft won: Better documentation, more reference implementations, clearer correctness arguments, and the original Raft thesis was specifically written to be teachable.

How are distributed ID generation standards evolving in 2026?
?
UUID v7 (RFC 9562, standardized 2024):

Time-ordered UUIDs are now officially standardized
Expected to become the default UUID type for new applications
Replaces UUID v4 in cases where sortability matters (most database use cases)

ULID:

Widely adopted in event-driven systems, event sourcing architectures
Preferred in Node.js/TypeScript ecosystems
Crockford base32 encoding is URL-safe and human-readable

Snowflake variants at scale:

Discord: Incremented Snowflake epoch; uses 4096 shard IDs
Instagram: Postgres sequence + shard ID (hybrid approach)
LinkedIn: Uses LiSnowflake (custom variant optimized for their infrastructure)

TSID: Emerging as the preferred 64-bit sortable ID for systems that need to fit in SQL bigint without pre-assigned node IDs.

Industry shift: The days of UUID v4 as the default are ending. UUID v7 or ULID are the recommended defaults for new distributed systems where IDs need to be stored in a database or sorted.

Quick Facts

What is strict serializability and which systems provide it?
?

Strict serializability = Serializability + Linearizability (the strongest possible guarantee)
Serializability: Transactions execute in some serial order (for isolation between transactions)
Linearizability: The serial order respects real-time order (if T1 completed before T2 started, T1 appears before T2)

What this means practically:

All transactions are isolated (no dirty reads, no phantom reads — full serializability)
All transactions are immediately visible after commit (linearizability — no stale reads from any replica)
System behaves exactly like a single-threaded, single-node database

Systems that provide strict serializability:

Google Spanner: Paxos replication + TrueTime → globally strictly serializable
FoundationDB: Deterministic serialization + linearizable reads
CockroachDB: SSI (Serializable Snapshot Isolation) + Raft → serializable + linearizable

Cost: Every read and write requires coordination (consensus-level); high latency, especially cross-datacenter. Appropriate for financial systems, globally consistent config stores, distributed counters.

What is the difference between a consensus algorithm blocking during a partition vs violating safety?
?
Blocking (liveness violation):

If fewer than majority of nodes are reachable: consensus algorithm stops making progress
No new writes are accepted; reads may be stale or blocked
This is intentional — the algorithm refuses to proceed rather than risk split-brain
Example: 5-node Raft cluster, 2 nodes unreachable → 3 remaining nodes form majority → continues
Example: 5-node Raft cluster, 3 nodes unreachable → 2 remaining nodes cannot form majority → blocks

Safety violation (never happens in correct Raft/Paxos):

Two different groups both commit conflicting values (split brain)
Data corruption — two nodes think different values are “the truth”
Correct consensus algorithms sacrifice liveness before ever violating safety

Trade-off:

Raft/Paxos: n ≥ 2f+1 nodes required for progress (f = failed nodes); tolerates up to n/2 - 1 failures
BFT (Tendermint): requires n ≥ 3f+1; even more conservative
The algorithm prefers being unavailable over being incorrect

Practical implication: In a 5-node cluster, tolerate 2 node failures. Beyond that, the system blocks (writes rejected) until at least one node recovers. This is the correct behavior — it prevents data divergence.

Total Cards: 17
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch10-flashcards