Chapter 9 Cheat Sheet — The Trouble with Distributed Systems

ddia-2e distributed-systems cheatsheet


One-Line Summaries

ConceptOne-Liner
Partial failureSome nodes work, some don’t — normal state in distributed systems
Timeout ambiguityNo response ≠ failure; the server may have processed your request and crashed after
Synchronous networkReserved bandwidth, bounded latency — only circuit-switched (phone) networks
Asynchronous networkNo bandwidth reservation, unbounded latency — all packet networks (internet, datacenters)
Clock skewDifferent machines drift by milliseconds even with NTP; wall clocks can go backward
Monotonic clockAlways moves forward; safe for measuring elapsed time; meaningless across nodes
GC pauseJVM stop-the-world can freeze a process for seconds; ZGC limits to < 10ms
Fencing tokenMonotonically increasing number with each lock grant; resource rejects stale tokens
Byzantine faultNode doesn’t just crash — it lies, sends contradictory info (malicious or corrupted)
Partially synchronousUsually bounded timing; occasionally violates bounds — the realistic model
TLA+Formal spec language to exhaustively check all possible system interleavings
JepsenFault-injection test framework; verifies linearizability under network partitions
Simulation testingFoundationDB approach: run entire system deterministically; replay any bug

The Network Ambiguity Problem

You sent a request. No response received after timeout. Could mean:

1. Request lost before reaching server    (server never saw it)
2. Server down / unreachable              (request never processed)
3. Server received; processing slowly     (response coming any moment)
4. Server processed; response was lost    (action already taken!)
5. Server processed; crashed immediately  (action taken, state uncertain)
6. Network partition between you and it   (server alive, can't reach you)

You CANNOT distinguish these with certainty.

IMPLICATION: Make all operations IDEMPOTENT.
             Use request IDs to deduplicate retries server-side.

Network Types

SYNCHRONOUS (Circuit-Switched):          ASYNCHRONOUS (Packet-Switched):
─────────────────────────────────        ─────────────────────────────────
Bandwidth: Reserved per call             Bandwidth: Shared, best-effort
Latency:   Bounded (known max)           Latency:   Unbounded (queuing)
Timeouts:  Reliable                      Timeouts:  Only "I gave up waiting"
Queuing:   None (slot reserved)          Queuing:   At every router/switch
Example:   PSTN telephone network        Example:   Internet, datacenter LAN
Algorithm: Can use timing assumptions    Algorithm: Must assume no timing

Clock Types and Their Uses

TIME-OF-DAY CLOCK (Wall Clock):          MONOTONIC CLOCK:
─────────────────────────────────        ─────────────────────────────────
System.currentTimeMillis() [Java]        System.nanoTime() [Java]
time.time() [Python]                     time.monotonic() [Python]
CLOCK_REALTIME [Linux]                   CLOCK_MONOTONIC [Linux]

Returns: current UTC time                Returns: elapsed since arb. epoch

Drift: up to 200ms/day (quartz)
NTP corrects to: ~10ms typical
                 ~100ms under load       Always forward; never backwards

SAFE FOR:                                SAFE FOR:
✓ Displaying time to users               ✓ Measuring timeout duration
✓ Log event timestamps                   ✓ Rate limiting (single process)
✓ Rough ordering (same machine)          ✓ Elapsed time measurement

NEVER USE FOR:                           NEVER USE FOR:
✗ LWW (last write wins) across nodes    ✗ Comparing timestamps across nodes
✗ Distributed lock/lease decisions       ✗ Event ordering in distributed sys
✗ Event ordering across nodes

Clock Accuracy Comparison

Method                  Accuracy        Notes
───────────────────────────────────────────────────────────────
Software NTP            ±10–100ms       Free; default on most Linux
HW-timestamped NTP      ±1–5ms          PTP-capable NIC required
PTP (IEEE 1588)         ±1–100µs        Available on AWS, GCP, Azure
Google TrueTime         ±7ms bounded    GPS + atomic clocks; expensive HW
Hybrid Logical Clocks   Near wall time  Software; monotonic + wall
Lamport / Vector clocks N/A             No physical time; causal only

The GC Pause / Lease Expiry Trap

Time 0:    Node A acquires lease (30s TTL), gets fencing token=10
Time 1:    Node A checks: "lease valid? ✓ 29s remaining"
Time 2:    ┌──── JVM GC PAUSE (process is FROZEN) ────────────────┐
           │ ...15 seconds pass...                                  │
           │ Lease expires at time 17                               │
           │ Node B acquires lease with token=11                    │
Time 18:   └──── GC pause ends; Node A RESUMES ──────────────────┘
Time 18:   Node A thinks it still holds the lease!
           Node A writes to storage.

WITHOUT FENCING TOKENS: ✗ Split brain — two nodes own the resource
WITH FENCING TOKENS:    ✓ Storage sees token=10 < max=11, rejects write

KEY INSIGHT: Safety must be enforced by the RESOURCE, not the lock holder.
             A paused process cannot correct its own mistake.

Other pause sources (beyond GC):

  • VM live migration: 10ms–1s (cloud hypervisor copies VM to new host)
  • OS memory swap: Seconds (process paged to disk during memory pressure)
  • Synchronous disk I/O: 5–100ms (fsync, large read)
  • Hardware interrupt: Microseconds (CPU handles IRQ)

System Model Matrix

                 CRASH-STOP      CRASH-RECOVERY    BYZANTINE
                 ──────────      ──────────────    ─────────
SYNCHRONOUS      Simple theory   Simpler DBs       Aerospace
                 (unrealistic)   (unrealistic)

PARTIALLY        Raft theory     RAFT IN PRACTICE  Tendermint
SYNCHRONOUS      Paxos theory    etcd, ZooKeeper   HotStuff

ASYNCHRONOUS     FLP proof       Complex design    Extremely hard
                 (too strict)    (too strict)      (near impossible)

Most real distributed DBs: PARTIALLY SYNCHRONOUS + CRASH-RECOVERY

Node Failure Models

CRASH-STOP:                    CRASH-RECOVERY:              BYZANTINE:
─────────────                  ────────────────             ──────────
Node halts; never              Node can crash,              Node can lie,
restarts                       restart with durable         send contradictory
                               state intact                 messages, act maliciously

n ≥ 2f+1 for                  n ≥ 2f+1 for                n ≥ 3f+1 for
quorum                         quorum                       quorum (3x overhead)

Example: Used in               Example: Practical           Example: Blockchain,
theory papers                  DBs (Raft, Paxos)           aerospace, BFT

Assumption: Dead               Assumption: Data             Assumption: Any
means gone forever             survives crashes             node could be enemy

Byzantine Faults: When You Need BFT

CRASH MODEL (most datacenter systems):
  Node either works or fails silently
  Trust all messages from live nodes
  n ≥ 2f+1 nodes to tolerate f failures
  Examples: Raft, Paxos, ZooKeeper, etcd

BYZANTINE MODEL (adversarial / untrusted environments):
  Nodes can send incorrect / malicious data
  Must verify all messages cryptographically
  n ≥ 3f+1 nodes to tolerate f failures
  ~3x message overhead per consensus round
  Examples: Tendermint, PBFT, HotStuff (blockchain)

DO YOU NEED BFT?
  YES: Blockchain, multi-organization systems, aerospace, nuclear
  NO:  Corporate datacenter, cloud-hosted DB, internal microservices

Formal Methods Overview

TLA+ / MODEL CHECKING:
  What: Formal spec language; enumerate all possible system states
  How:  TLC model checker checks safety + liveness properties
  Cost: Write abstract spec (not implementation code)
  Finds: Bugs reachable only via rare concurrent failure combinations
  Used: Amazon AWS, Microsoft, MongoDB, CockroachDB
  Famous: AWS found 10 bugs; 7 missed by code review + testing

JEPSEN TESTING:
  What: Fault-injection + linearizability checker
  How:  Inject partitions, kill nodes, check history with Knossos
  Cost: Tests the actual running system (not a model)
  Finds: Real bugs in real implementations under real faults
  Used: MongoDB, Redis, Cassandra, CockroachDB, etc.
  Famous: Found data loss bugs in almost every major distributed DB

SIMULATION TESTING (FoundationDB approach):
  What: Run entire system in deterministic event simulation
  How:  All I/O, time, and network are simulated; single-threaded
  Cost: Requires building simulation mode from the start
  Finds: Any bug that can be reproduced by replaying the same seed
  Used: FoundationDB, TigerBeetle, Antithesis (startup)
  Key: Perfect bug reproducibility — same seed = same execution

Decision Tree: Which Clock / Ordering Mechanism?

Need to ORDER events across nodes?
├── Physical time is not important → USE LAMPORT CLOCKS
│   (only need causal ordering; events may be concurrent)
│
├── Physical time matters; some drift OK → USE HYBRID LOGICAL CLOCKS (HLC)
│   (CockroachDB, YugabyteDB; stays close to wall time)
│
└── Strict physical time with bounded uncertainty → USE TRUETIME (Spanner)
    (requires GPS + atomic clock hardware; only Google Spanner provides this)

Need to MEASURE ELAPSED TIME (timeouts, durations)?
└── Always use MONOTONIC CLOCK (CLOCK_MONOTONIC, System.nanoTime())

Need to DISPLAY TIME to users or write to logs?
└── Use WALL CLOCK, but accept it may be wrong by ~10ms or more

Red Flags

✗ "If no response, request must have failed" → may have already executed
✗ Using wall clock for distributed lock expiry → 10ms drift = wrong expiry
✗ LWW with wall clock timestamps → silently discards writes with lower timestamps
✗ Assuming GC pauses < 10ms (ZGC) means fencing tokens aren't needed
✗ "Our datacenter network never partitions" → all networks partition eventually
✗ "We use NTP so our clocks are synchronized" → NTP is ±10ms, not perfect
✗ Building distributed consensus from scratch → use etcd/ZooKeeper instead

Green Flags

✓ All external operations are idempotent with unique request IDs
✓ Fencing tokens for all lease-based distributed locks
✓ Explicit timeouts on every network call (not zero, not infinite)
✓ Monotonic clock for elapsed time; wall clock only for display/logging
✓ Running Jepsen or equivalent fault-injection tests on releases
✓ TLA+ spec for critical protocol design decisions
✓ System model documented explicitly (timing model + failure model)
✓ Chaos engineering / fault injection in production (circuit breakers)

Network Partition Scenarios

Full connectivity:           All nodes communicate normally

Network partition:           Group A ←─────────────→ Group B
                             (cannot communicate)
                             Risk: both groups elect leaders → split brain

Asymmetric partition:        A ──→ B works, B ──→ A fails
                             B sends but never receives; A hears nothing

Partial partition:           Node C unreachable; A and B communicate fine
                             Looks like node failure, not partition

High latency (gray failure): All packets arrive, but in 5–10 seconds
                             Timeout-based detectors declare C dead
                             C is alive but slow → false positive failover

Interview Response Templates

”How do you handle network failures in your system?”

“All network calls are designed to be idempotent using unique request IDs — the server deduplicates if it sees the same ID twice. I set explicit timeouts on every call, knowing a timeout only means ‘I gave up waiting’ not ‘the operation failed.’ For critical operations, I track the request ID server-side to detect and deduplicate retries. For distributed locks, I use fencing tokens so the storage layer rejects stale writes from paused or slow processes."

"Why are distributed clocks unreliable?”

“Wall clocks are synchronized via NTP, which typically achieves ±10ms accuracy but can be much worse under load or network congestion. Quartz crystals drift ~200ms/day without correction. NTP can jump a clock backward during correction. This means two nodes can disagree about the current time by 10–100ms, which is enough to silently discard writes in a last-write-wins scheme. The safe approach is to use logical clocks (Lamport or vector clocks) for ordering, or Hybrid Logical Clocks for systems that need to stay close to wall time.”


Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-05-29