Chapter 8: The Trouble with Distributed Systems

Overview

This chapter is a systematic catalog of things that go wrong in distributed systems. Unlike Chapter 7 (which covered database-specific failures), Chapter 8 covers the fundamental unreliability of the network, clocks, and processes in distributed environments. The key mindset shift: in distributed systems, partial failures are normal and must be handled explicitly, unlike a single machine where hardware failure usually causes total failure.

Core thesis: If you assume a distributed system behaves like a single machine—that the network is reliable, clocks are synchronized, and processes don’t pause—you will build systems with subtle but dangerous bugs.

Key Concepts

Unreliable Networks

The reality: The internet and datacenter networks are asynchronous packet networks where:

  • Packets can be lost
  • Packets can be delayed arbitrarily long
  • Packets can arrive out of order
  • Packets can be duplicated

When you send a request and get no response, you cannot distinguish:

  1. Request was lost
  2. Remote node is down
  3. Remote node processed request but response was lost
  4. Remote node is alive but processing is slow
  5. Response is still in transit (delayed)

You have no way to tell — this fundamental ambiguity is the core challenge.

Network partitions: Two groups of nodes can communicate within their group but not between groups. This is different from total failure.

Practical implication: Always set timeouts on network requests; never wait indefinitely. But even with timeouts, you can’t know if your request was processed before the timeout.

Approaches:

  • Timeouts: Declare node dead after timeout; choose timeout carefully
    • Too short: false positives (overloaded node declared dead)
    • Too long: slow detection of real failures
  • Failure detectors: Dedicated service to track node health (heartbeats)
  • Phi Accrual: Probabilistic failure detection (used in Cassandra, Akka)
  • Automatic retries: But must handle idempotency (retrying may cause duplicate actions)

Datacenter-level network issues:

  • Network congestion → queuing delays → high latency
  • NIC firmware bugs
  • Misconfigured switches
  • Accidental network partition by operator error
  • Google’s 2012 chubby experiment: 30-second network partition caused chaos

Unreliable Clocks

Two types of clocks:

  1. Time-of-day clock (wall clock):

    • Returns current date and time (UTC)
    • Subject to NTP synchronization (can jump forwards or backwards)
    • System.currentTimeMillis() in Java; time.time() in Python
    • Problem: Clock can go backward! Cannot use for measuring durations.
    • Google Spanner TrueTime: GPS + atomic clocks; provides bounded uncertainty
  2. Monotonic clock (logical time):

    • Always moves forward
    • Measures elapsed time (duration)
    • System.nanoTime() in Java; time.monotonic() in Python
    • Good for measuring timeouts and elapsed time
    • Cannot compare across different nodes (each node’s monotonic clock is independent)

Clock skew in distributed systems:

  • Clocks on different machines are never perfectly synchronized
  • NTP can adjust up to milliseconds, but not microseconds
  • Quartz clock drift: may lose/gain 200ms per day without correction
  • Google Spanner TrueTime: GPS receivers + atomic clocks give uncertainty of ±7ms (maximum observed)

The Dangerous Clock Assumption:

  • “Last write wins” with timestamps: assumes all clocks are synchronized
  • If clocks drift, a later write may have a smaller timestamp → lost update
  • Example: Node A at t=100 writes x=1; Node B at t=99 (1ms drift) writes x=2; LWW gives x=1 (wrong!)

Logical clocks (Lamport timestamps, vector clocks):

  • Increment-based; don’t measure physical time
  • Only capture ordering (happens-before), not actual time
  • Safe for ordering events in distributed systems

Process Pauses

The problem: A process can pause for an arbitrarily long time in the middle of execution:

  • GC (garbage collection) pause: JVM stop-the-world GC can pause for seconds
  • Virtual machine migration: Cloud VM snapshot and migrate while running
  • Context switches: OS preempts process to run another
  • Disk I/O: Disk I/O causes process to block
  • Memory swapping: Process swapped out to disk (seconds of pause)

Why this matters for distributed systems:

Scenario: Node has a lease (time-limited lock). Node checks lease is valid, then gets GC pause. Lease expires during pause. Node resumes, thinks it still has lease, proceeds with write → two nodes think they have the lock simultaneously.

Real-time garbage collection: Java G1GC, ZGC aim to reduce pause times but can’t eliminate them.

Solutions:

  • Fencing tokens: Monotonically increasing counter issued with each lock grant; storage server rejects writes with old token numbers
  • Lease renewal before time-sensitive operations: Renew lease immediately before critical section
  • Real-time systems (RTOS): Hard timing guarantees, but impractical for most applications
  • Choosing safe isolation level: Operations must be idempotent and handle race conditions

The Byzantine Generals Problem

Byzantine faults: Nodes that don’t just crash but send malicious or incorrect information to other nodes.

Byzantine fault tolerance (BFT): Algorithms that work correctly even if some nodes are malicious/faulty.

  • Not required for most databases (assume trusted environment)
  • Required for: blockchain, aerospace systems, nuclear systems

Distinction:

  • Crash-stop fault model: Node stops working (simple); most DB algorithms assume this
  • Crash-recovery fault model: Node can crash and recover with data from disk
  • Byzantine fault model: Node can behave arbitrarily (malicious/corrupted); much harder

Most datacenter systems assume crash-stop or crash-recovery, not Byzantine.

Knowledge, Truth, and Lies

Truth is defined by majority:

  • In distributed systems, what is “true” is defined by quorum agreement
  • A node cannot trust its own local state if the network is unreliable
  • Example: A node might believe it’s the leader, but the rest of the cluster has elected a new one

The impossibility of distributed consensus:

  • FLP impossibility (Fischer, Lynch, Paterson 1985): In an asynchronous distributed system, consensus is impossible if even one node can fail
  • Practical implication: Must make timing assumptions (timeouts) to achieve consensus; no fully asynchronous solution

System models:

  • Synchronous model: Bounded message delay; bounded processing time (unrealistic for most real networks)
  • Partially synchronous model: Usually synchronous but can sometimes violate bounds (realistic)
  • Asynchronous model: No timing guarantees at all (most algorithms unsafe here)
  • Most practical algorithms assume partially synchronous model

Node failure models:

  • Crash-stop: Node stops and never resumes
  • Crash-recovery: Node can restart and recover state from disk
  • Byzantine: Node can lie, act maliciously, or behave arbitrarily

Important Points

  • Network unreliability is fundamental: Timeouts don’t tell you what happened; they just let you give up waiting.
  • Wall clocks are untrustworthy for ordering events: Use logical clocks (Lamport, vector) or Spanner’s TrueTime.
  • GC pauses and process pauses invalidate timing assumptions: Always design for arbitrary delays.
  • Fencing tokens are essential for leader leases: Without them, a paused process can take action after its lease expires.
  • Truth in distributed systems = quorum consensus: A single node’s view is unreliable.
  • FLP impossibility means consensus requires timing assumptions: Fully asynchronous consensus is impossible.

Examples & Case Studies

  1. GitHub’s MySQL Failover Bug (mentioned in Ch5)

    • Network partition isolated master; followers elected new master
    • Old master recovered, thought it was still master → split brain → data loss
  2. GC Pause Nightmare

    • HBase: GC pause caused ZooKeeper session to expire
    • ZooKeeper (correctly) declared HBase regionserver dead, reassigned regions
    • Regionserver resumed, thought it still owned the regions → two regionservers serving same region
  3. Amazon’s Dynamo

    • Designed explicitly for unreliable networks and nodes
    • Sloppy quorum + hinted handoff = available even during network partitions
    • Chose availability over strong consistency
  4. Google Spanner TrueTime

    • Hardware solution to clock synchronization: GPS + atomic clocks in each datacenter
    • Provides time with bounded uncertainty (TrueTime API: [earliest, latest])
    • Transactions wait for uncertainty to pass before committing (ensures linearizability)

Questions

  1. Why can’t you know if a remote node processed your request after a timeout?
  2. What is the difference between time-of-day clocks and monotonic clocks?
  3. Why is using wall clock timestamps for event ordering dangerous?
  4. What is a fencing token and why is it needed?
  5. What are the three system models for timing assumptions?
  6. What is the Byzantine Generals Problem and when do you need BFT?
  7. How does FLP impossibility affect real distributed system design?
  8. What makes network partitions different from node failures?

Modern Context (2026)

NTP improvements:

  • PTP (Precision Time Protocol) IEEE 1588: microsecond accuracy over hardware-timestamped networks
  • Many cloud providers offer PTP (AWS EC2 offers it for VMs)
  • Still not perfect, but much better than classic NTP (millisecond accuracy)

eBPF for network observability:

  • Modern Linux kernel feature; enables deep network packet inspection with minimal overhead
  • Cilium, Pixie: use eBPF for network observability in Kubernetes
  • Helps diagnose subtle network partition issues in production

Service mesh for reliability:

  • Istio, Linkerd: transparent retry, circuit breaking, timeout management
  • Network failures handled at infrastructure layer rather than application layer
  • Reduces burden of handling network unreliability in application code

ZGC and Shenandoah:

  • Java’s low-latency GC algorithms (2020s+): max pause < 10ms (vs seconds for G1)
  • Reduces (but doesn’t eliminate) GC pause risk for distributed leases
  • Java/JVM systems are more reliable for lease-based leadership in 2026

AI-assisted failure prediction:

  • ML models predict hardware failures (disk, NIC) before they happen
  • Google’s Borg, Meta’s infrastructure: predictive maintenance reducing unexpected node failures

Status: Notes complete
Last Updated: 2026-04-13