Chapter 9 Cheat Sheet — The Trouble with Distributed Systems
ddia-2e distributed-systems cheatsheet
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Partial failure | Some nodes work, some don’t — normal state in distributed systems |
| Timeout ambiguity | No response ≠ failure; the server may have processed your request and crashed after |
| Synchronous network | Reserved bandwidth, bounded latency — only circuit-switched (phone) networks |
| Asynchronous network | No bandwidth reservation, unbounded latency — all packet networks (internet, datacenters) |
| Clock skew | Different machines drift by milliseconds even with NTP; wall clocks can go backward |
| Monotonic clock | Always moves forward; safe for measuring elapsed time; meaningless across nodes |
| GC pause | JVM stop-the-world can freeze a process for seconds; ZGC limits to < 10ms |
| Fencing token | Monotonically increasing number with each lock grant; resource rejects stale tokens |
| Byzantine fault | Node doesn’t just crash — it lies, sends contradictory info (malicious or corrupted) |
| Partially synchronous | Usually bounded timing; occasionally violates bounds — the realistic model |
| TLA+ | Formal spec language to exhaustively check all possible system interleavings |
| Jepsen | Fault-injection test framework; verifies linearizability under network partitions |
| Simulation testing | FoundationDB approach: run entire system deterministically; replay any bug |
The Network Ambiguity Problem
You sent a request. No response received after timeout. Could mean:
1. Request lost before reaching server (server never saw it)
2. Server down / unreachable (request never processed)
3. Server received; processing slowly (response coming any moment)
4. Server processed; response was lost (action already taken!)
5. Server processed; crashed immediately (action taken, state uncertain)
6. Network partition between you and it (server alive, can't reach you)
You CANNOT distinguish these with certainty.
IMPLICATION: Make all operations IDEMPOTENT.
Use request IDs to deduplicate retries server-side.
Network Types
SYNCHRONOUS (Circuit-Switched): ASYNCHRONOUS (Packet-Switched):
───────────────────────────────── ─────────────────────────────────
Bandwidth: Reserved per call Bandwidth: Shared, best-effort
Latency: Bounded (known max) Latency: Unbounded (queuing)
Timeouts: Reliable Timeouts: Only "I gave up waiting"
Queuing: None (slot reserved) Queuing: At every router/switch
Example: PSTN telephone network Example: Internet, datacenter LAN
Algorithm: Can use timing assumptions Algorithm: Must assume no timing
Clock Types and Their Uses
TIME-OF-DAY CLOCK (Wall Clock): MONOTONIC CLOCK:
───────────────────────────────── ─────────────────────────────────
System.currentTimeMillis() [Java] System.nanoTime() [Java]
time.time() [Python] time.monotonic() [Python]
CLOCK_REALTIME [Linux] CLOCK_MONOTONIC [Linux]
Returns: current UTC time Returns: elapsed since arb. epoch
Drift: up to 200ms/day (quartz)
NTP corrects to: ~10ms typical
~100ms under load Always forward; never backwards
SAFE FOR: SAFE FOR:
✓ Displaying time to users ✓ Measuring timeout duration
✓ Log event timestamps ✓ Rate limiting (single process)
✓ Rough ordering (same machine) ✓ Elapsed time measurement
NEVER USE FOR: NEVER USE FOR:
✗ LWW (last write wins) across nodes ✗ Comparing timestamps across nodes
✗ Distributed lock/lease decisions ✗ Event ordering in distributed sys
✗ Event ordering across nodes
Clock Accuracy Comparison
Method Accuracy Notes
───────────────────────────────────────────────────────────────
Software NTP ±10–100ms Free; default on most Linux
HW-timestamped NTP ±1–5ms PTP-capable NIC required
PTP (IEEE 1588) ±1–100µs Available on AWS, GCP, Azure
Google TrueTime ±7ms bounded GPS + atomic clocks; expensive HW
Hybrid Logical Clocks Near wall time Software; monotonic + wall
Lamport / Vector clocks N/A No physical time; causal only
The GC Pause / Lease Expiry Trap
Time 0: Node A acquires lease (30s TTL), gets fencing token=10
Time 1: Node A checks: "lease valid? ✓ 29s remaining"
Time 2: ┌──── JVM GC PAUSE (process is FROZEN) ────────────────┐
│ ...15 seconds pass... │
│ Lease expires at time 17 │
│ Node B acquires lease with token=11 │
Time 18: └──── GC pause ends; Node A RESUMES ──────────────────┘
Time 18: Node A thinks it still holds the lease!
Node A writes to storage.
WITHOUT FENCING TOKENS: ✗ Split brain — two nodes own the resource
WITH FENCING TOKENS: ✓ Storage sees token=10 < max=11, rejects write
KEY INSIGHT: Safety must be enforced by the RESOURCE, not the lock holder.
A paused process cannot correct its own mistake.
Other pause sources (beyond GC):
- VM live migration: 10ms–1s (cloud hypervisor copies VM to new host)
- OS memory swap: Seconds (process paged to disk during memory pressure)
- Synchronous disk I/O: 5–100ms (fsync, large read)
- Hardware interrupt: Microseconds (CPU handles IRQ)
System Model Matrix
CRASH-STOP CRASH-RECOVERY BYZANTINE
────────── ────────────── ─────────
SYNCHRONOUS Simple theory Simpler DBs Aerospace
(unrealistic) (unrealistic)
PARTIALLY Raft theory RAFT IN PRACTICE Tendermint
SYNCHRONOUS Paxos theory etcd, ZooKeeper HotStuff
ASYNCHRONOUS FLP proof Complex design Extremely hard
(too strict) (too strict) (near impossible)
Most real distributed DBs: PARTIALLY SYNCHRONOUS + CRASH-RECOVERY
Node Failure Models
CRASH-STOP: CRASH-RECOVERY: BYZANTINE:
───────────── ──────────────── ──────────
Node halts; never Node can crash, Node can lie,
restarts restart with durable send contradictory
state intact messages, act maliciously
n ≥ 2f+1 for n ≥ 2f+1 for n ≥ 3f+1 for
quorum quorum quorum (3x overhead)
Example: Used in Example: Practical Example: Blockchain,
theory papers DBs (Raft, Paxos) aerospace, BFT
Assumption: Dead Assumption: Data Assumption: Any
means gone forever survives crashes node could be enemy
Byzantine Faults: When You Need BFT
CRASH MODEL (most datacenter systems):
Node either works or fails silently
Trust all messages from live nodes
n ≥ 2f+1 nodes to tolerate f failures
Examples: Raft, Paxos, ZooKeeper, etcd
BYZANTINE MODEL (adversarial / untrusted environments):
Nodes can send incorrect / malicious data
Must verify all messages cryptographically
n ≥ 3f+1 nodes to tolerate f failures
~3x message overhead per consensus round
Examples: Tendermint, PBFT, HotStuff (blockchain)
DO YOU NEED BFT?
YES: Blockchain, multi-organization systems, aerospace, nuclear
NO: Corporate datacenter, cloud-hosted DB, internal microservices
Formal Methods Overview
TLA+ / MODEL CHECKING:
What: Formal spec language; enumerate all possible system states
How: TLC model checker checks safety + liveness properties
Cost: Write abstract spec (not implementation code)
Finds: Bugs reachable only via rare concurrent failure combinations
Used: Amazon AWS, Microsoft, MongoDB, CockroachDB
Famous: AWS found 10 bugs; 7 missed by code review + testing
JEPSEN TESTING:
What: Fault-injection + linearizability checker
How: Inject partitions, kill nodes, check history with Knossos
Cost: Tests the actual running system (not a model)
Finds: Real bugs in real implementations under real faults
Used: MongoDB, Redis, Cassandra, CockroachDB, etc.
Famous: Found data loss bugs in almost every major distributed DB
SIMULATION TESTING (FoundationDB approach):
What: Run entire system in deterministic event simulation
How: All I/O, time, and network are simulated; single-threaded
Cost: Requires building simulation mode from the start
Finds: Any bug that can be reproduced by replaying the same seed
Used: FoundationDB, TigerBeetle, Antithesis (startup)
Key: Perfect bug reproducibility — same seed = same execution
Decision Tree: Which Clock / Ordering Mechanism?
Need to ORDER events across nodes?
├── Physical time is not important → USE LAMPORT CLOCKS
│ (only need causal ordering; events may be concurrent)
│
├── Physical time matters; some drift OK → USE HYBRID LOGICAL CLOCKS (HLC)
│ (CockroachDB, YugabyteDB; stays close to wall time)
│
└── Strict physical time with bounded uncertainty → USE TRUETIME (Spanner)
(requires GPS + atomic clock hardware; only Google Spanner provides this)
Need to MEASURE ELAPSED TIME (timeouts, durations)?
└── Always use MONOTONIC CLOCK (CLOCK_MONOTONIC, System.nanoTime())
Need to DISPLAY TIME to users or write to logs?
└── Use WALL CLOCK, but accept it may be wrong by ~10ms or more
Red Flags
✗ "If no response, request must have failed" → may have already executed
✗ Using wall clock for distributed lock expiry → 10ms drift = wrong expiry
✗ LWW with wall clock timestamps → silently discards writes with lower timestamps
✗ Assuming GC pauses < 10ms (ZGC) means fencing tokens aren't needed
✗ "Our datacenter network never partitions" → all networks partition eventually
✗ "We use NTP so our clocks are synchronized" → NTP is ±10ms, not perfect
✗ Building distributed consensus from scratch → use etcd/ZooKeeper instead
Green Flags
✓ All external operations are idempotent with unique request IDs
✓ Fencing tokens for all lease-based distributed locks
✓ Explicit timeouts on every network call (not zero, not infinite)
✓ Monotonic clock for elapsed time; wall clock only for display/logging
✓ Running Jepsen or equivalent fault-injection tests on releases
✓ TLA+ spec for critical protocol design decisions
✓ System model documented explicitly (timing model + failure model)
✓ Chaos engineering / fault injection in production (circuit breakers)
Network Partition Scenarios
Full connectivity: All nodes communicate normally
Network partition: Group A ←─────────────→ Group B
(cannot communicate)
Risk: both groups elect leaders → split brain
Asymmetric partition: A ──→ B works, B ──→ A fails
B sends but never receives; A hears nothing
Partial partition: Node C unreachable; A and B communicate fine
Looks like node failure, not partition
High latency (gray failure): All packets arrive, but in 5–10 seconds
Timeout-based detectors declare C dead
C is alive but slow → false positive failover
Interview Response Templates
”How do you handle network failures in your system?”
“All network calls are designed to be idempotent using unique request IDs — the server deduplicates if it sees the same ID twice. I set explicit timeouts on every call, knowing a timeout only means ‘I gave up waiting’ not ‘the operation failed.’ For critical operations, I track the request ID server-side to detect and deduplicate retries. For distributed locks, I use fencing tokens so the storage layer rejects stale writes from paused or slow processes."
"Why are distributed clocks unreliable?”
“Wall clocks are synchronized via NTP, which typically achieves ±10ms accuracy but can be much worse under load or network congestion. Quartz crystals drift ~200ms/day without correction. NTP can jump a clock backward during correction. This means two nodes can disagree about the current time by 10–100ms, which is enough to silently discard writes in a last-write-wins scheme. The safe approach is to use logical clocks (Lamport or vector clocks) for ordering, or Hybrid Logical Clocks for systems that need to stay close to wall time.”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-05-29