Chapter 8 Cheat Sheet - The Trouble with Distributed Systems

One-Line Summaries

Concept	One-Liner
Network unreliability	Packets can be lost, delayed, or duplicated — no response ≠ node dead
Timeout ambiguity	After timeout, you can’t know if request was received/processed
Clock skew	Clocks on different machines drift and disagree by milliseconds
Wall clock danger	Wall clocks can go backward (NTP sync); unsafe for ordering
Monotonic clock	Always forward; good for elapsed time; not comparable across nodes
GC pause	JVM stop-the-world can pause process for seconds; invalidates timing
Fencing token	Monotonically increasing number with each lock grant; rejects stale writes
Byzantine fault	Node behaves arbitrarily/maliciously (not just crashes)
FLP impossibility	Consensus impossible in asynchronous network if any node can fail
Partial failure	Some nodes work, some don’t — normal in distributed systems

What You Can’t Know After a Timeout

You sent a request. No response received. This could mean:

1. ❓ Request lost before reaching server (server never saw it)
2. ❓ Server received request, is processing slowly, response coming
3. ❓ Server received request, processed it, response was lost
4. ❓ Server crashed before processing
5. ❓ Server crashed after processing (action already taken!)
6. ❓ Server alive but network between you and it is partitioned

You CANNOT distinguish these cases with certainty.
Implication: Retries must be idempotent (safe to repeat without side effects)

Clock Types and Their Uses

Time-of-Day Clock (Wall Clock):           Monotonic Clock:
─────────────────────────────────         ────────────────────────────────
System.currentTimeMillis() [Java]         System.nanoTime() [Java]
time.time() [Python]                      time.monotonic() [Python]

Returns: current UTC time                 Returns: elapsed time since arbitrary point

Problems:
❌ Can jump backward (NTP correction)     ✅ Always increases
❌ Can jump forward (large NTP sync)      ✅ Good for measuring intervals
❌ Differs between machines               ❌ Meaningless to compare across nodes
❌ Unreliable for event ordering          ❌ Resets on process restart

USE FOR:                                  USE FOR:
- Displaying time to users                - Measuring timeout duration
- Rough time-ordering (with caution)      - Measuring elapsed time
- Logging events                          - Timer/rate measurements

NEVER USE FOR:
- Determining which write happened last (LWW with timestamps = data loss risk)
- Distributed locking/leasing

The GC Pause Trap

DANGER SCENARIO:
─────────────────
Time 0: Node acquires lease (expires in 30s)
Time 1: Node checks: lease valid (29s remaining) ✅
Time 2: JVM GC pause begins [PROCESS FROZEN]
         ...15 seconds pass...
Time 17: Lease expires (another node takes over as leader)
Time 18: GC pause ends, node resumes
Time 18: Node still thinks it holds lease! → SPLIT BRAIN

FIX: Fencing tokens
──────────────────
Lock service issues fencing token (monotonically increasing number):
  Node A gets lease + token=10
  Node A pauses
  Lease expires; Node B gets lease + token=11
  Node A resumes, tries to write with token=10
  Storage: "token 10 < 11, reject!" ← Stale write blocked

System Models Summary

Timing Model:
─────────────
Synchronous:   Bounded delays; know exactly how long operations take (unrealistic)
Asynchronous:  No timing guarantees at all (too pessimistic; consensus impossible)
Partially sync: Usually bounded; occasionally violates bounds (REALISTIC MODEL)

Node Failure Model:
──────────────────
Crash-stop:   Node stops and never recovers (safe to assume it's dead)
Crash-recovery: Node can crash and restart; durable state survives (most practical)
Byzantine:    Node can lie, be malicious, send contradictory messages (hardest)

Most distributed algorithms assume: Partially synchronous + crash-recovery

Clock Safety in Distributed Systems

UNSAFE:                                   SAFE:
───────────────────────────────           ──────────────────────────────────
LWW with wall clock timestamps            Logical clocks (Lamport timestamps)
                                          
Node A writes x=1 at t=100               Each event increments a counter
Node B writes x=2 at t=99 (1ms drift)   Ordering: causal not temporal
LWW picks x=1 (WRONG! x=2 was later)    
                                          
Using wall clock for timeout checks       Monotonic clock for timeouts
"Was this write within last 100ms?"       
(clocks can drift; answer unreliable)     

Using wall clock for distributed lock     Fencing tokens for locks
"My lease hasn't expired yet"             Storage service rejects stale tokens
(your clock may disagree with server's)

Network Failure Modes

Full connectivity:    All nodes can communicate — ideal

Network partition:    Group A ←---→ Group B (cannot communicate)
                      Both groups may elect leaders → split brain risk

Asymmetric partition: A→B works, B→A fails (rare but happens)
                      B sends, never receives; A never hears from B

Switch failure:       Subset of nodes unreachable
                      Often looks like packet loss / high latency first

NIC firmware bug:     Intermittent packet drops on one interface
                      Very hard to diagnose

Congestion:           High latency, not loss; TCP retries hide it
                      Appears as slow responses, not failures

Key Trade-offs

Decision	Pro	Con	When to Use
Short timeout	Fast failure detection	False positives under load	Low-latency SLAs
Long timeout	Fewer false positives	Slow recovery from real failures	Stability over speed
Logical clocks	Safe ordering across nodes	No physical time	Event ordering
TrueTime (Spanner)	Bounded physical time	Expensive hardware	Global serializable DB
Fencing tokens	Prevents stale writes	Extra complexity in storage	Leader election with leases
Idempotent retries	Safe to retry	Requires idempotency design	All network operations

Red Flags

❌ “If we don’t hear back, it means the request failed” (might have succeeded!)
❌ Using wall clock for distributed lock expiry decisions
❌ Thread.sleep() or time-based waits as coordination mechanism
❌ Assuming GC pauses don’t happen (they do, even on modern JVMs)
❌ “Our network never has partitions” (all networks partition eventually)
❌ Using timestamps from multiple machines for LWW without bounded clock sync

Green Flags

✅ Design all external operations to be idempotent (safe to retry)
✅ Use monotonic clocks for measuring elapsed time / timeouts
✅ Fencing tokens for any lease-based distributed locking
✅ Set explicit timeouts on all network calls
✅ Assume network partitions will happen; test with chaos engineering
✅ Use logical clocks for event ordering across distributed nodes

Modern Additions (2026)

Low-latency GC (Java ZGC, Shenandoah):
├─ Max GC pause: < 10ms (vs seconds for old GC)
├─ Reduces (but doesn't eliminate) GC pause risk
└─ JVM systems safer for short-lease distributed locking

Service mesh (Istio, Linkerd):
├─ Transparent retry, timeout, circuit breaking
├─ Network failures handled at infrastructure layer
└─ Application code freed from network retry logic

PTP (Precision Time Protocol):
├─ Microsecond accuracy (vs millisecond for NTP)
├─ Available in AWS, GCP for VMs
└─ Reduces clock skew but doesn't eliminate it

Interview Response Templates

When Asked About Handling Network Failures

“Network calls can fail in many ways — request lost, response lost, server slow, partition. Since you can’t distinguish these, I’d design all calls to be idempotent so retries are safe. I’d set explicit timeouts (not too short to avoid false positives under load). For critical operations, I’d track request IDs server-side to detect and deduplicate retries.”

When Asked About Distributed Clocks

“Wall clocks are unreliable for ordering distributed events — NTP drift means different nodes disagree by milliseconds, and clocks can jump backward. For event ordering, I’d use Lamport timestamps or vector clocks (logical clocks). For measuring timeouts, monotonic clocks are safe. The only exception is Google Spanner’s TrueTime, which uses GPS+atomic clocks to provide physical time with bounded uncertainty.”

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch08-cheatsheet