Chapter 8 Cheat Sheet - The Trouble with Distributed Systems
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Network unreliability | Packets can be lost, delayed, or duplicated — no response ≠ node dead |
| Timeout ambiguity | After timeout, you can’t know if request was received/processed |
| Clock skew | Clocks on different machines drift and disagree by milliseconds |
| Wall clock danger | Wall clocks can go backward (NTP sync); unsafe for ordering |
| Monotonic clock | Always forward; good for elapsed time; not comparable across nodes |
| GC pause | JVM stop-the-world can pause process for seconds; invalidates timing |
| Fencing token | Monotonically increasing number with each lock grant; rejects stale writes |
| Byzantine fault | Node behaves arbitrarily/maliciously (not just crashes) |
| FLP impossibility | Consensus impossible in asynchronous network if any node can fail |
| Partial failure | Some nodes work, some don’t — normal in distributed systems |
What You Can’t Know After a Timeout
You sent a request. No response received. This could mean:
1. ❓ Request lost before reaching server (server never saw it)
2. ❓ Server received request, is processing slowly, response coming
3. ❓ Server received request, processed it, response was lost
4. ❓ Server crashed before processing
5. ❓ Server crashed after processing (action already taken!)
6. ❓ Server alive but network between you and it is partitioned
You CANNOT distinguish these cases with certainty.
Implication: Retries must be idempotent (safe to repeat without side effects)
Clock Types and Their Uses
Time-of-Day Clock (Wall Clock): Monotonic Clock:
───────────────────────────────── ────────────────────────────────
System.currentTimeMillis() [Java] System.nanoTime() [Java]
time.time() [Python] time.monotonic() [Python]
Returns: current UTC time Returns: elapsed time since arbitrary point
Problems:
❌ Can jump backward (NTP correction) ✅ Always increases
❌ Can jump forward (large NTP sync) ✅ Good for measuring intervals
❌ Differs between machines ❌ Meaningless to compare across nodes
❌ Unreliable for event ordering ❌ Resets on process restart
USE FOR: USE FOR:
- Displaying time to users - Measuring timeout duration
- Rough time-ordering (with caution) - Measuring elapsed time
- Logging events - Timer/rate measurements
NEVER USE FOR:
- Determining which write happened last (LWW with timestamps = data loss risk)
- Distributed locking/leasing
The GC Pause Trap
DANGER SCENARIO:
─────────────────
Time 0: Node acquires lease (expires in 30s)
Time 1: Node checks: lease valid (29s remaining) ✅
Time 2: JVM GC pause begins [PROCESS FROZEN]
...15 seconds pass...
Time 17: Lease expires (another node takes over as leader)
Time 18: GC pause ends, node resumes
Time 18: Node still thinks it holds lease! → SPLIT BRAIN
FIX: Fencing tokens
──────────────────
Lock service issues fencing token (monotonically increasing number):
Node A gets lease + token=10
Node A pauses
Lease expires; Node B gets lease + token=11
Node A resumes, tries to write with token=10
Storage: "token 10 < 11, reject!" ← Stale write blocked
System Models Summary
Timing Model:
─────────────
Synchronous: Bounded delays; know exactly how long operations take (unrealistic)
Asynchronous: No timing guarantees at all (too pessimistic; consensus impossible)
Partially sync: Usually bounded; occasionally violates bounds (REALISTIC MODEL)
Node Failure Model:
──────────────────
Crash-stop: Node stops and never recovers (safe to assume it's dead)
Crash-recovery: Node can crash and restart; durable state survives (most practical)
Byzantine: Node can lie, be malicious, send contradictory messages (hardest)
Most distributed algorithms assume: Partially synchronous + crash-recovery
Clock Safety in Distributed Systems
UNSAFE: SAFE:
─────────────────────────────── ──────────────────────────────────
LWW with wall clock timestamps Logical clocks (Lamport timestamps)
Node A writes x=1 at t=100 Each event increments a counter
Node B writes x=2 at t=99 (1ms drift) Ordering: causal not temporal
LWW picks x=1 (WRONG! x=2 was later)
Using wall clock for timeout checks Monotonic clock for timeouts
"Was this write within last 100ms?"
(clocks can drift; answer unreliable)
Using wall clock for distributed lock Fencing tokens for locks
"My lease hasn't expired yet" Storage service rejects stale tokens
(your clock may disagree with server's)
Network Failure Modes
Full connectivity: All nodes can communicate — ideal
Network partition: Group A ←---→ Group B (cannot communicate)
Both groups may elect leaders → split brain risk
Asymmetric partition: A→B works, B→A fails (rare but happens)
B sends, never receives; A never hears from B
Switch failure: Subset of nodes unreachable
Often looks like packet loss / high latency first
NIC firmware bug: Intermittent packet drops on one interface
Very hard to diagnose
Congestion: High latency, not loss; TCP retries hide it
Appears as slow responses, not failures
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| Short timeout | Fast failure detection | False positives under load | Low-latency SLAs |
| Long timeout | Fewer false positives | Slow recovery from real failures | Stability over speed |
| Logical clocks | Safe ordering across nodes | No physical time | Event ordering |
| TrueTime (Spanner) | Bounded physical time | Expensive hardware | Global serializable DB |
| Fencing tokens | Prevents stale writes | Extra complexity in storage | Leader election with leases |
| Idempotent retries | Safe to retry | Requires idempotency design | All network operations |
Red Flags
❌ “If we don’t hear back, it means the request failed” (might have succeeded!)
❌ Using wall clock for distributed lock expiry decisions
❌ Thread.sleep() or time-based waits as coordination mechanism
❌ Assuming GC pauses don’t happen (they do, even on modern JVMs)
❌ “Our network never has partitions” (all networks partition eventually)
❌ Using timestamps from multiple machines for LWW without bounded clock sync
Green Flags
✅ Design all external operations to be idempotent (safe to retry)
✅ Use monotonic clocks for measuring elapsed time / timeouts
✅ Fencing tokens for any lease-based distributed locking
✅ Set explicit timeouts on all network calls
✅ Assume network partitions will happen; test with chaos engineering
✅ Use logical clocks for event ordering across distributed nodes
Modern Additions (2026)
Low-latency GC (Java ZGC, Shenandoah):
├─ Max GC pause: < 10ms (vs seconds for old GC)
├─ Reduces (but doesn't eliminate) GC pause risk
└─ JVM systems safer for short-lease distributed locking
Service mesh (Istio, Linkerd):
├─ Transparent retry, timeout, circuit breaking
├─ Network failures handled at infrastructure layer
└─ Application code freed from network retry logic
PTP (Precision Time Protocol):
├─ Microsecond accuracy (vs millisecond for NTP)
├─ Available in AWS, GCP for VMs
└─ Reduces clock skew but doesn't eliminate it
Interview Response Templates
When Asked About Handling Network Failures
“Network calls can fail in many ways — request lost, response lost, server slow, partition. Since you can’t distinguish these, I’d design all calls to be idempotent so retries are safe. I’d set explicit timeouts (not too short to avoid false positives under load). For critical operations, I’d track request IDs server-side to detect and deduplicate retries.”
When Asked About Distributed Clocks
“Wall clocks are unreliable for ordering distributed events — NTP drift means different nodes disagree by milliseconds, and clocks can jump backward. For event ordering, I’d use Lamport timestamps or vector clocks (logical clocks). For measuring timeouts, monotonic clocks are safe. The only exception is Google Spanner’s TrueTime, which uses GPS+atomic clocks to provide physical time with bounded uncertainty.”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13