Chapter 8: The Trouble with Distributed Systems
Overview
This chapter is a systematic catalog of things that go wrong in distributed systems. Unlike Chapter 7 (which covered database-specific failures), Chapter 8 covers the fundamental unreliability of the network, clocks, and processes in distributed environments. The key mindset shift: in distributed systems, partial failures are normal and must be handled explicitly, unlike a single machine where hardware failure usually causes total failure.
Core thesis: If you assume a distributed system behaves like a single machine—that the network is reliable, clocks are synchronized, and processes don’t pause—you will build systems with subtle but dangerous bugs.
Key Concepts
Unreliable Networks
The reality: The internet and datacenter networks are asynchronous packet networks where:
- Packets can be lost
- Packets can be delayed arbitrarily long
- Packets can arrive out of order
- Packets can be duplicated
When you send a request and get no response, you cannot distinguish:
- Request was lost
- Remote node is down
- Remote node processed request but response was lost
- Remote node is alive but processing is slow
- Response is still in transit (delayed)
You have no way to tell — this fundamental ambiguity is the core challenge.
Network partitions: Two groups of nodes can communicate within their group but not between groups. This is different from total failure.
Practical implication: Always set timeouts on network requests; never wait indefinitely. But even with timeouts, you can’t know if your request was processed before the timeout.
Approaches:
- Timeouts: Declare node dead after timeout; choose timeout carefully
- Too short: false positives (overloaded node declared dead)
- Too long: slow detection of real failures
- Failure detectors: Dedicated service to track node health (heartbeats)
- Phi Accrual: Probabilistic failure detection (used in Cassandra, Akka)
- Automatic retries: But must handle idempotency (retrying may cause duplicate actions)
Datacenter-level network issues:
- Network congestion → queuing delays → high latency
- NIC firmware bugs
- Misconfigured switches
- Accidental network partition by operator error
- Google’s 2012 chubby experiment: 30-second network partition caused chaos
Unreliable Clocks
Two types of clocks:
-
Time-of-day clock (wall clock):
- Returns current date and time (UTC)
- Subject to NTP synchronization (can jump forwards or backwards)
System.currentTimeMillis()in Java;time.time()in Python- Problem: Clock can go backward! Cannot use for measuring durations.
- Google Spanner TrueTime: GPS + atomic clocks; provides bounded uncertainty
-
Monotonic clock (logical time):
- Always moves forward
- Measures elapsed time (duration)
System.nanoTime()in Java;time.monotonic()in Python- Good for measuring timeouts and elapsed time
- Cannot compare across different nodes (each node’s monotonic clock is independent)
Clock skew in distributed systems:
- Clocks on different machines are never perfectly synchronized
- NTP can adjust up to milliseconds, but not microseconds
- Quartz clock drift: may lose/gain 200ms per day without correction
- Google Spanner TrueTime: GPS receivers + atomic clocks give uncertainty of ±7ms (maximum observed)
The Dangerous Clock Assumption:
- “Last write wins” with timestamps: assumes all clocks are synchronized
- If clocks drift, a later write may have a smaller timestamp → lost update
- Example: Node A at t=100 writes x=1; Node B at t=99 (1ms drift) writes x=2; LWW gives x=1 (wrong!)
Logical clocks (Lamport timestamps, vector clocks):
- Increment-based; don’t measure physical time
- Only capture ordering (happens-before), not actual time
- Safe for ordering events in distributed systems
Process Pauses
The problem: A process can pause for an arbitrarily long time in the middle of execution:
- GC (garbage collection) pause: JVM stop-the-world GC can pause for seconds
- Virtual machine migration: Cloud VM snapshot and migrate while running
- Context switches: OS preempts process to run another
- Disk I/O: Disk I/O causes process to block
- Memory swapping: Process swapped out to disk (seconds of pause)
Why this matters for distributed systems:
Scenario: Node has a lease (time-limited lock). Node checks lease is valid, then gets GC pause. Lease expires during pause. Node resumes, thinks it still has lease, proceeds with write → two nodes think they have the lock simultaneously.
Real-time garbage collection: Java G1GC, ZGC aim to reduce pause times but can’t eliminate them.
Solutions:
- Fencing tokens: Monotonically increasing counter issued with each lock grant; storage server rejects writes with old token numbers
- Lease renewal before time-sensitive operations: Renew lease immediately before critical section
- Real-time systems (RTOS): Hard timing guarantees, but impractical for most applications
- Choosing safe isolation level: Operations must be idempotent and handle race conditions
The Byzantine Generals Problem
Byzantine faults: Nodes that don’t just crash but send malicious or incorrect information to other nodes.
Byzantine fault tolerance (BFT): Algorithms that work correctly even if some nodes are malicious/faulty.
- Not required for most databases (assume trusted environment)
- Required for: blockchain, aerospace systems, nuclear systems
Distinction:
- Crash-stop fault model: Node stops working (simple); most DB algorithms assume this
- Crash-recovery fault model: Node can crash and recover with data from disk
- Byzantine fault model: Node can behave arbitrarily (malicious/corrupted); much harder
Most datacenter systems assume crash-stop or crash-recovery, not Byzantine.
Knowledge, Truth, and Lies
Truth is defined by majority:
- In distributed systems, what is “true” is defined by quorum agreement
- A node cannot trust its own local state if the network is unreliable
- Example: A node might believe it’s the leader, but the rest of the cluster has elected a new one
The impossibility of distributed consensus:
- FLP impossibility (Fischer, Lynch, Paterson 1985): In an asynchronous distributed system, consensus is impossible if even one node can fail
- Practical implication: Must make timing assumptions (timeouts) to achieve consensus; no fully asynchronous solution
System models:
- Synchronous model: Bounded message delay; bounded processing time (unrealistic for most real networks)
- Partially synchronous model: Usually synchronous but can sometimes violate bounds (realistic)
- Asynchronous model: No timing guarantees at all (most algorithms unsafe here)
- Most practical algorithms assume partially synchronous model
Node failure models:
- Crash-stop: Node stops and never resumes
- Crash-recovery: Node can restart and recover state from disk
- Byzantine: Node can lie, act maliciously, or behave arbitrarily
Important Points
- Network unreliability is fundamental: Timeouts don’t tell you what happened; they just let you give up waiting.
- Wall clocks are untrustworthy for ordering events: Use logical clocks (Lamport, vector) or Spanner’s TrueTime.
- GC pauses and process pauses invalidate timing assumptions: Always design for arbitrary delays.
- Fencing tokens are essential for leader leases: Without them, a paused process can take action after its lease expires.
- Truth in distributed systems = quorum consensus: A single node’s view is unreliable.
- FLP impossibility means consensus requires timing assumptions: Fully asynchronous consensus is impossible.
Examples & Case Studies
-
GitHub’s MySQL Failover Bug (mentioned in Ch5)
- Network partition isolated master; followers elected new master
- Old master recovered, thought it was still master → split brain → data loss
-
GC Pause Nightmare
- HBase: GC pause caused ZooKeeper session to expire
- ZooKeeper (correctly) declared HBase regionserver dead, reassigned regions
- Regionserver resumed, thought it still owned the regions → two regionservers serving same region
-
Amazon’s Dynamo
- Designed explicitly for unreliable networks and nodes
- Sloppy quorum + hinted handoff = available even during network partitions
- Chose availability over strong consistency
-
Google Spanner TrueTime
- Hardware solution to clock synchronization: GPS + atomic clocks in each datacenter
- Provides time with bounded uncertainty (TrueTime API: [earliest, latest])
- Transactions wait for uncertainty to pass before committing (ensures linearizability)
Questions
- Why can’t you know if a remote node processed your request after a timeout?
- What is the difference between time-of-day clocks and monotonic clocks?
- Why is using wall clock timestamps for event ordering dangerous?
- What is a fencing token and why is it needed?
- What are the three system models for timing assumptions?
- What is the Byzantine Generals Problem and when do you need BFT?
- How does FLP impossibility affect real distributed system design?
- What makes network partitions different from node failures?
Modern Context (2026)
NTP improvements:
- PTP (Precision Time Protocol) IEEE 1588: microsecond accuracy over hardware-timestamped networks
- Many cloud providers offer PTP (AWS EC2 offers it for VMs)
- Still not perfect, but much better than classic NTP (millisecond accuracy)
eBPF for network observability:
- Modern Linux kernel feature; enables deep network packet inspection with minimal overhead
- Cilium, Pixie: use eBPF for network observability in Kubernetes
- Helps diagnose subtle network partition issues in production
Service mesh for reliability:
- Istio, Linkerd: transparent retry, circuit breaking, timeout management
- Network failures handled at infrastructure layer rather than application layer
- Reduces burden of handling network unreliability in application code
ZGC and Shenandoah:
- Java’s low-latency GC algorithms (2020s+): max pause < 10ms (vs seconds for G1)
- Reduces (but doesn’t eliminate) GC pause risk for distributed leases
- Java/JVM systems are more reliable for lease-based leadership in 2026
AI-assisted failure prediction:
- ML models predict hardware failures (disk, NIC) before they happen
- Google’s Borg, Meta’s infrastructure: predictive maintenance reducing unexpected node failures
Status: Notes complete
Last Updated: 2026-04-13