Chapter 10 Cheat Sheet — Consistency and Consensus

ddia-2e consistency consensus linearizability raft logical-clocks cheatsheet

One-Line Summaries

Concept	One-Liner
Linearizability	Operations appear instantaneous; once done, all readers see new value
Serializability	Transactions appear to execute in some serial order; that order can be “past”
Strict serializability	Both linearizable AND serializable — the strongest possible guarantee
CAP theorem	During a partition: choose linearizability OR availability, not both
PACELC	Even without partitions, consistency costs latency — PACELC measures both
Causal consistency	Causally related operations seen in order by all nodes; concurrent may reorder
Lamport clock	Per-node counter; guarantees if A→B then L(A)<L(B); not vice versa
Vector clock	Per-node vector; precisely tracks causal relationships; O(N) space
HLC	Hybrid of wall time + logical counter; monotonic, causal, near-physical
Snowflake ID	64-bit: 41ms + 10 node + 12 seq; time-sorted, no coordination
ULID	48ms + 80 random bits; sortable, human-readable, no coordination
Consensus	Nodes agree on one value; once decided, cannot change
Raft	Leader-based consensus; leader election + log replication; preferred in 2026
Total order broadcast	All nodes receive messages in same order; equivalent to consensus
etcd / ZooKeeper	Consensus-backed coordination services; use these, don’t implement Raft yourself

Linearizability Illustrated

What linearizability guarantees:

CLIENT 1: ├── write(x=1) ──┤
CLIENT 2:     ├────────────── read(x) ──┤     ← MUST return x=1
CLIENT 3:                    ├── read(x) ──┤   ← MUST return x=1

The write completes at one "linearization point" in real time.
All reads that START after that point MUST return the new value.

NON-LINEARIZABLE scenario:
CLIENT 1: ├── write(x=1) ──────────────────┤
CLIENT 2:                    ├── read(x) ──┤   ← Returns x=0 (VIOLATION!)
The read started WHILE the write was in progress — this is OK.
But if the write COMPLETED before the read started → violation.

Consistency Models (Weakest to Strongest)

WEAKEST                                                         STRONGEST
────────────────────────────────────────────────────────────────────────
Eventual         Monotonic    Read-Your-    Consistent  Causal   Linear-   Strict
Consistency      Reads        Writes        Prefix      Consist. izability Serial.
(Dynamo AP)      (per client) (per session) (ordered    (causal  (atomic   (Spanner)
                                            ops)        order)   ops)
────────────────────────────────────────────────────────────────────────
Coordination:    None ───────────────────────────────────────────────→ Maximum
Latency:         Lowest ──────────────────────────────────────────→ Highest
Availability:    Highest ────────────────────────────────────────→ Lower

CAP and PACELC

CAP (simplified):
  C = Linearizability  A = Availability  P = Partition happens
  
  During partition, choose:
  CP: Block / return error (never serve stale data)
  AP: Serve stale data (always return a response)

  CP systems: ZooKeeper, etcd, HBase, CockroachDB
  AP systems: Cassandra, DynamoDB (default), Riak

PACELC (better model):
  During Partition: A vs C (same as CAP)
  Else (normal): Latency vs Consistency
  
  System         P-choice  E-choice
  ─────────────────────────────────
  Spanner        C         C (pays latency for consistency)
  CockroachDB    C         C
  etcd           C         C
  DynamoDB       A         L (lower latency with eventual)
  Cassandra      A         L (tunable; default AP/EL)

Logical Clock Types

LAMPORT CLOCKS:
  - Each node has counter L
  - On send: L++; include L in message
  - On receive msg with m: L = max(L, m) + 1
  - Property: A → B implies L(A) < L(B)
  - Limitation: L(A) < L(B) does NOT imply A → B
  - Size: O(1) — just one integer
  - Gives: Total order (all events comparable)

VECTOR CLOCKS:
  - Each node has vector V[N] (N = number of nodes)
  - On event at node i: V[i]++
  - On receive from j: V[i] = max(V[i], V_received); V[i][i]++
  - Property: A → B iff V(A) < V(B) (precisely)
  - Can detect concurrent events: neither V(A)≤V(B) nor V(B)≤V(A)
  - Size: O(N) — grows with number of nodes
  - Gives: Precise causal tracking

HYBRID LOGICAL CLOCKS (HLC):
  - Format: (physical_time, logical_counter)
  - Always monotonically increasing
  - Stays within bounded drift of wall clock
  - Used by: CockroachDB, YugabyteDB
  - Gives: Total order + causal + approximate physical time

COMPARISON:
─────────────────────────────────────────────────────────
              Total Order  Causal Precise  Physical  Size
─────────────────────────────────────────────────────────
Lamport       YES          NO             NO        O(1)
Vector        NO           YES            NO        O(N)
Wall clock    YES          NO             YES (±10ms) O(1)
HLC           YES          PARTIAL        APPROX    O(1)
TrueTime      YES          YES            BOUNDED   O(1)
─────────────────────────────────────────────────────────

ID Generation Strategies

AUTO-INCREMENT (MySQL, PostgreSQL SERIAL):
  ┌──────────────────────────────────────┐
  │ 32 or 64-bit integer (per-node)      │
  └──────────────────────────────────────┘
  ✓ Simple, perfectly sequential
  ✗ Not globally unique (sharded systems)
  ✗ Single point of coordination

UUID v4:
  ┌──────────────────────────────────────────────────────────┐
  │ 550e8400-e29b-41d4-a716-446655440000  (128-bit random)   │
  └──────────────────────────────────────────────────────────┘
  ✓ Globally unique; zero coordination
  ✗ Not sortable; random DB index = B-tree fragmentation

UUID v7 (RFC 9562, 2024):
  ┌───────────────────────────┬──────────────────────────────┐
  │ 48-bit millisecond time   │ 74 bits random+version       │
  └───────────────────────────┴──────────────────────────────┘
  ✓ Globally unique; sortable; standardized
  ✓ Good DB index locality (time-ordered inserts)

SNOWFLAKE ID (Twitter, 2010):
  ┌──────────────────────────┬───────────┬──────────────────┐
  │ 41-bit ms since epoch    │ 10-bit    │ 12-bit sequence  │
  │ (Jan 1, 2010)            │ node ID   │ (per ms)         │
  └──────────────────────────┴───────────┴──────────────────┘
  ✓ 64-bit int (fits bigint); 4096 IDs/ms/node
  ✓ Time-sorted within same ms on same node
  ✓ No cross-node coordination
  ✗ Requires pre-assigned node IDs
  Used by: Twitter, Discord, Instagram

ULID:
  ┌──────────────────────┬──────────────────────────────────┐
  │ 48-bit ms timestamp  │ 80-bit cryptographic random      │
  └──────────────────────┴──────────────────────────────────┘
  ✓ Lexicographically sortable; URL-safe Crockford base32
  ✓ No coordination; 128-bit globally unique
  ✗ Larger than Snowflake; random within same ms (not ordered)

TSID (64-bit):
  ┌──────────────────────┬──────────────────────────────────┐
  │ 42-bit timestamp     │ 22 bits random or node+sequence  │
  └──────────────────────┴──────────────────────────────────┘
  ✓ Fits SQL bigint; sortable; no coordination
  ✓ Smaller storage than ULID (64 vs 128 bits)

Raft Leader Election Flow

FOLLOWER: Receives heartbeats from leader.
  No heartbeat within election_timeout (150–300ms)?
  → Transition to CANDIDATE

CANDIDATE: 
  1. Increment current term
  2. Vote for self
  3. Send RequestVote RPC to all other nodes
  4. Wait for votes...
  
  Majority says "yes"?  → Become LEADER
  See higher term?      → Back to FOLLOWER
  Timeout (split vote)? → Start new election (new term)

LEADER:
  1. Immediately send heartbeats to all followers
  2. Accept client writes; append to log
  3. Send AppendEntries RPC (replication)
  4. Once majority acknowledges: commit entry
  5. Apply committed entries to state machine
  
  See higher term from any message? → Step down to FOLLOWER

ELECTION SAFETY: 
  Only ONE leader per term possible.
  A candidate needs MAJORITY votes.
  Two candidates cannot both get majority of N nodes.

Raft Log Replication

                      2. AppendEntries RPC
Client → LEADER ──────────────────────────→ Follower 1  ──┐
request    │    ──────────────────────────→ Follower 2  ──┼── majority
           │    ──────────────────────────→ Follower 3  ──┘
           │
           │    ← Acknowledge from majority
           │
           ↓ 3. Mark entry COMMITTED; apply to state machine
           └────────────────────────────────────────────────→ Client response

COMMIT = "written to majority of nodes"
Even if leader crashes now, majority still has the entry.
A new leader will have this entry (Raft election safety guarantee).

IMPORTANT: Entry is APPLIED (executed) after being COMMITTED.
           Committed = durable + agreed. Applied = result visible to clients.

ZooKeeper Leader Election Pattern

ALL CANDIDATES:
  1. Create sequential ephemeral znode: /election/candidate-0000000N

READ ALL CANDIDATES, sort by sequence number:
  /election/candidate-00000001  ← LOWEST = current leader
  /election/candidate-00000002
  /election/candidate-00000003

EACH NON-LEADER NODE:
  Watch the node with the NEXT LOWER sequence number
  (NOT the leader — avoids "herd effect" of all nodes watching leader)

WHEN WATCHED NODE DELETED (crashed):
  Re-check if I'm now the lowest → become leader
  Or watch the new next-lower node

CRASH RECOVERY:
  Leader crashes → ephemeral znode deleted → watch fires on next node
  Next node discovers it's now lowest → becomes leader
  Fencing token = sequence number (monotonically increasing!)

The Consensus Trade-off

CONSENSUS (Raft, Paxos, Zab):

GUARANTEES:
  ✓ All non-faulty nodes agree on same value
  ✓ Value was proposed by some node (validity)
  ✓ Value decided at most once (integrity)
  ✓ Eventually decides (liveness — under partial synchrony)

COSTS:
  ✗ Requires MAJORITY of nodes healthy to make progress
  ✗ If < majority available: BLOCKS (no writes) — safety over liveness
  ✗ Every write waits for majority acknowledgment
  ✗ Cross-datacenter: ~100ms write latency
  ✗ Single-datacenter: ~1–5ms write latency (good)
  ✗ Leader election: ~1–3 round trips (150–500ms during election)

WHEN TO USE:
  ✓ Leader election (who is the primary?)
  ✓ Distributed lock / lease service
  ✓ Uniqueness constraints across nodes
  ✓ Atomic commit across shards
  ✓ Service discovery / configuration management

WHEN NOT TO USE:
  ✗ Every user-facing read (too expensive; use eventually consistent reads)
  ✗ Event ordering that only needs causal consistency (use HLC/vector clocks)
  ✗ Analytics queries (eventual consistency is fine)

Decision Tree: Choosing Consistency Level

Do you need operations to be ATOMIC across multiple keys/tables?
├── YES → Transactions (Ch8) + possibly linearizable storage
└── NO  → Continue...

Do you need ALL nodes to see ALL writes in REAL-TIME order?
├── YES → Linearizability (consensus; use etcd, CockroachDB, Spanner)
└── NO  → Continue...

Do you need to see CAUSALLY RELATED operations in order?
├── YES → Causal consistency (vector clocks, HLC; no full consensus needed)
└── NO  → Continue...

Do you need to see YOUR OWN previous writes?
├── YES → Read-your-writes / monotonic reads (sticky sessions, read from leader)
└── NO  → Eventual consistency is sufficient (Cassandra AP, DynamoDB default)

ORDERING EVENTS (not key-value ops)?
├── Need physical time:   HLC or TrueTime
├── Need causal order:    Vector clocks
└── Need total order:     Lamport clocks (or Snowflake/ULID timestamps)

Consensus Equivalences

These problems are ALL equivalent (each can implement the others):

CONSENSUS                TOTAL ORDER BROADCAST      ATOMIC COMMIT
─────────────────────────────────────────────────────────────────
Agree on one value    ≡  All nodes receive msgs   ≡  All nodes commit
                          in same order               or all abort
                          
LEADER ELECTION       ≡  REPLICATED STATE MACHINE ≡  DISTRIBUTED LOCK
─────────────────────────────────────────────────────────────────
Agree on one leader   ≡  Same commands, same       ≡  Agree on lock holder
                          order, same state
                          
If you can solve one, you can solve all of them.
If consensus is impossible → all of the above are impossible too.

Red Flags

✗ "Our read quorum (r+w>n) ensures linearizable reads" 
  → Not true; concurrent writes + clock skew can still violate linearizability

✗ "We use multi-leader replication with timestamps for LWW; it's consistent"
  → Multi-leader is never linearizable; LWW with timestamps silently discards writes

✗ "We'll implement our own Raft; it's just log replication"
  → Raft has many subtle correctness requirements; use etcd/ZooKeeper

✗ "UUID v4 is fine for all distributed IDs"
  → Random IDs fragment B-tree indexes; consider UUID v7 or ULID for high-insert workloads

✗ "We need 100% availability during partitions AND linearizability"
  → CAP: impossible. Choose which one to sacrifice during partitions.

✗ "Causal consistency is just weaker eventual consistency"
  → No: causal consistency provides real ordering guarantees (happens-before preserved)

Green Flags

✓ Use etcd or ZooKeeper for leader election / distributed locks (not homegrown)
✓ Document your consistency model per endpoint (which ops are linearizable, which are causal)
✓ Use Snowflake IDs or ULID for distributed ID generation (time-sorted, no coordination)
✓ Use HLC in distributed DBs that need transaction ordering without TrueTime hardware
✓ Run Jepsen tests before shipping new consistency-sensitive features
✓ Make the CAP choice explicit: which behavior during partitions?
✓ Use Raft (not Paxos) for new consensus implementations
✓ Audit which operations TRULY need linearizability vs. causal consistency

Interview Response Templates

”What is linearizability and when do you need it?”

“Linearizability means the system behaves as if there’s one copy of the data on a single, infinitely fast CPU — every operation appears to execute atomically at a single point in real time. Once an operation completes, all subsequent reads from any node return the new value. You need it for distributed locks (no split-brain), unique constraints (no two nodes can both claim the same username), and cross-channel consistency (uploading a photo and immediately sending a link). The cost is that every operation requires consensus-level coordination, and the system may block during a network partition."

"Explain CAP theorem and how you’d decide between CP and AP”

“CAP says that during a network partition, you must choose between linearizability (CP) and availability (AP) — you can’t have both. The choice depends on the use case. For a payment service or lock service, CP is correct — returning incorrect data is worse than returning an error. For a social feed or recommendation engine, AP is better — a slightly stale feed is better than a 500 error. In practice, PACELC is more useful: even without partitions, stronger consistency costs latency. I’d map each data type in the system to its minimum consistency requirement rather than applying one setting globally.”

Quick Revision Time: 7 minutes
Interview Prep: 20 minutes
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch10-cheatsheet