Chapter 8 Cheat Sheet - The Trouble with Distributed Systems

One-Line Summaries

ConceptOne-Liner
Network unreliabilityPackets can be lost, delayed, or duplicated — no response ≠ node dead
Timeout ambiguityAfter timeout, you can’t know if request was received/processed
Clock skewClocks on different machines drift and disagree by milliseconds
Wall clock dangerWall clocks can go backward (NTP sync); unsafe for ordering
Monotonic clockAlways forward; good for elapsed time; not comparable across nodes
GC pauseJVM stop-the-world can pause process for seconds; invalidates timing
Fencing tokenMonotonically increasing number with each lock grant; rejects stale writes
Byzantine faultNode behaves arbitrarily/maliciously (not just crashes)
FLP impossibilityConsensus impossible in asynchronous network if any node can fail
Partial failureSome nodes work, some don’t — normal in distributed systems

What You Can’t Know After a Timeout

You sent a request. No response received. This could mean:

1. ❓ Request lost before reaching server (server never saw it)
2. ❓ Server received request, is processing slowly, response coming
3. ❓ Server received request, processed it, response was lost
4. ❓ Server crashed before processing
5. ❓ Server crashed after processing (action already taken!)
6. ❓ Server alive but network between you and it is partitioned

You CANNOT distinguish these cases with certainty.
Implication: Retries must be idempotent (safe to repeat without side effects)

Clock Types and Their Uses

Time-of-Day Clock (Wall Clock):           Monotonic Clock:
─────────────────────────────────         ────────────────────────────────
System.currentTimeMillis() [Java]         System.nanoTime() [Java]
time.time() [Python]                      time.monotonic() [Python]

Returns: current UTC time                 Returns: elapsed time since arbitrary point

Problems:
❌ Can jump backward (NTP correction)     ✅ Always increases
❌ Can jump forward (large NTP sync)      ✅ Good for measuring intervals
❌ Differs between machines               ❌ Meaningless to compare across nodes
❌ Unreliable for event ordering          ❌ Resets on process restart

USE FOR:                                  USE FOR:
- Displaying time to users                - Measuring timeout duration
- Rough time-ordering (with caution)      - Measuring elapsed time
- Logging events                          - Timer/rate measurements

NEVER USE FOR:
- Determining which write happened last (LWW with timestamps = data loss risk)
- Distributed locking/leasing

The GC Pause Trap

DANGER SCENARIO:
─────────────────
Time 0: Node acquires lease (expires in 30s)
Time 1: Node checks: lease valid (29s remaining) ✅
Time 2: JVM GC pause begins [PROCESS FROZEN]
         ...15 seconds pass...
Time 17: Lease expires (another node takes over as leader)
Time 18: GC pause ends, node resumes
Time 18: Node still thinks it holds lease! → SPLIT BRAIN

FIX: Fencing tokens
──────────────────
Lock service issues fencing token (monotonically increasing number):
  Node A gets lease + token=10
  Node A pauses
  Lease expires; Node B gets lease + token=11
  Node A resumes, tries to write with token=10
  Storage: "token 10 < 11, reject!" ← Stale write blocked

System Models Summary

Timing Model:
─────────────
Synchronous:   Bounded delays; know exactly how long operations take (unrealistic)
Asynchronous:  No timing guarantees at all (too pessimistic; consensus impossible)
Partially sync: Usually bounded; occasionally violates bounds (REALISTIC MODEL)

Node Failure Model:
──────────────────
Crash-stop:   Node stops and never recovers (safe to assume it's dead)
Crash-recovery: Node can crash and restart; durable state survives (most practical)
Byzantine:    Node can lie, be malicious, send contradictory messages (hardest)

Most distributed algorithms assume: Partially synchronous + crash-recovery

Clock Safety in Distributed Systems

UNSAFE:                                   SAFE:
───────────────────────────────           ──────────────────────────────────
LWW with wall clock timestamps            Logical clocks (Lamport timestamps)
                                          
Node A writes x=1 at t=100               Each event increments a counter
Node B writes x=2 at t=99 (1ms drift)   Ordering: causal not temporal
LWW picks x=1 (WRONG! x=2 was later)    
                                          
Using wall clock for timeout checks       Monotonic clock for timeouts
"Was this write within last 100ms?"       
(clocks can drift; answer unreliable)     

Using wall clock for distributed lock     Fencing tokens for locks
"My lease hasn't expired yet"             Storage service rejects stale tokens
(your clock may disagree with server's)

Network Failure Modes

Full connectivity:    All nodes can communicate — ideal

Network partition:    Group A ←---→ Group B (cannot communicate)
                      Both groups may elect leaders → split brain risk

Asymmetric partition: A→B works, B→A fails (rare but happens)
                      B sends, never receives; A never hears from B

Switch failure:       Subset of nodes unreachable
                      Often looks like packet loss / high latency first

NIC firmware bug:     Intermittent packet drops on one interface
                      Very hard to diagnose

Congestion:           High latency, not loss; TCP retries hide it
                      Appears as slow responses, not failures

Key Trade-offs

DecisionProConWhen to Use
Short timeoutFast failure detectionFalse positives under loadLow-latency SLAs
Long timeoutFewer false positivesSlow recovery from real failuresStability over speed
Logical clocksSafe ordering across nodesNo physical timeEvent ordering
TrueTime (Spanner)Bounded physical timeExpensive hardwareGlobal serializable DB
Fencing tokensPrevents stale writesExtra complexity in storageLeader election with leases
Idempotent retriesSafe to retryRequires idempotency designAll network operations

Red Flags

❌ “If we don’t hear back, it means the request failed” (might have succeeded!)
❌ Using wall clock for distributed lock expiry decisions
Thread.sleep() or time-based waits as coordination mechanism
❌ Assuming GC pauses don’t happen (they do, even on modern JVMs)
❌ “Our network never has partitions” (all networks partition eventually)
❌ Using timestamps from multiple machines for LWW without bounded clock sync

Green Flags

✅ Design all external operations to be idempotent (safe to retry)
✅ Use monotonic clocks for measuring elapsed time / timeouts
✅ Fencing tokens for any lease-based distributed locking
✅ Set explicit timeouts on all network calls
✅ Assume network partitions will happen; test with chaos engineering
✅ Use logical clocks for event ordering across distributed nodes

Modern Additions (2026)

Low-latency GC (Java ZGC, Shenandoah):
├─ Max GC pause: < 10ms (vs seconds for old GC)
├─ Reduces (but doesn't eliminate) GC pause risk
└─ JVM systems safer for short-lease distributed locking

Service mesh (Istio, Linkerd):
├─ Transparent retry, timeout, circuit breaking
├─ Network failures handled at infrastructure layer
└─ Application code freed from network retry logic

PTP (Precision Time Protocol):
├─ Microsecond accuracy (vs millisecond for NTP)
├─ Available in AWS, GCP for VMs
└─ Reduces clock skew but doesn't eliminate it

Interview Response Templates

When Asked About Handling Network Failures

“Network calls can fail in many ways — request lost, response lost, server slow, partition. Since you can’t distinguish these, I’d design all calls to be idempotent so retries are safe. I’d set explicit timeouts (not too short to avoid false positives under load). For critical operations, I’d track request IDs server-side to detect and deduplicate retries.”

When Asked About Distributed Clocks

“Wall clocks are unreliable for ordering distributed events — NTP drift means different nodes disagree by milliseconds, and clocks can jump backward. For event ordering, I’d use Lamport timestamps or vector clocks (logical clocks). For measuring timeouts, monotonic clocks are safe. The only exception is Google Spanner’s TrueTime, which uses GPS+atomic clocks to provide physical time with bounded uncertainty.”


Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13