Chapter 9: The Trouble with Distributed Systems

ddia-2e distributed-systems networks clocks byzantine-faults formal-methods

Status: Notes complete


Overview

Chapter 9 is a systematic catalog of everything that can go wrong when building distributed systems. Unlike a single machine — where failure is typically total and detectable — distributed systems experience partial failures: some components fail while others continue working, and the system cannot always tell the difference between a failed node and a slow one. The chapter examines three fundamental unreliability sources — networks, clocks, and processes — and then explores how to reason about correctness using formal models.

This is the 2nd edition’s equivalent of 1st edition Chapter 8, with two major additions: an expanded Formal Methods and Randomized Testing section (TLA+, Jepsen, fuzzing) and an updated treatment of Byzantine faults reflecting renewed relevance through blockchain and BFT consensus systems.

Core thesis: Building distributed systems requires abandoning the assumption that networks are reliable, clocks are synchronized, and processes don’t pause. Partial failure is the normal operating state, not an exceptional condition.


Key Concepts

Faults and Partial Failures

The fundamental difference from single-machine systems:

  • Single machine: hardware failure is usually total — the machine either works or doesn’t
  • Distributed system: components fail independently, unpredictably, and partially

Partial failure: A condition where some nodes in the system are working correctly while others are not. The critical challenge is that the rest of the system must handle this gracefully — and often cannot even reliably detect which components have failed.

Nondeterminism: Unlike a single CPU executing instructions sequentially, distributed systems are inherently nondeterministic. Messages may arrive in different orders on each run. A node may pause at any point. This nondeterminism is the source of most distributed system bugs.

The network ambiguity problem: When you send a request to a remote node and receive no response, you cannot determine which of these happened:

  1. Request was lost before reaching the server
  2. Remote node is down or unreachable
  3. Remote node received request but is processing slowly
  4. Remote node processed the request; response was lost
  5. Remote node processed the request; crashed immediately after
  6. Network between you and the node is partitioned

You have no way to distinguish these cases with certainty. This fundamental ambiguity is the core challenge of distributed networking.


Synchronous Versus Asynchronous Networks

Synchronous networks (e.g., circuit-switched telephone networks):

  • A fixed amount of bandwidth is reserved for each call end-to-end
  • Maximum end-to-end latency is bounded and known in advance
  • No queuing — packets never wait because their “slot” is always reserved
  • Consequence: Can make timing guarantees; timeouts are reliable

Asynchronous networks (e.g., packet-switched internet, datacenter networks):

  • No reserved bandwidth; packets compete for available capacity
  • Maximum latency is unbounded — a packet may be delayed arbitrarily
  • Queuing at routers introduces variable latency (jitter)
  • Consequence: Cannot make timing guarantees; timeouts are unreliable

Why the internet is packet-switched: Packet switching is more efficient for bursty traffic (web requests, file transfers). Circuit switching wastes bandwidth when no data is being transmitted. The internet was optimized for utilization, not latency predictability.

Why not make datacenter networks synchronous?: Even inside a single datacenter, networks are asynchronous. Switches buffer packets during congestion. VM migration adds latency. Hardware interrupts pause processes. The gains in determinism would require sacrificing flexibility and efficiency.

Comparison: Synchronous vs Asynchronous Networks
───────────────────────────────────────────────────────────────────────────
Property            Synchronous (Circuit)     Asynchronous (Packet)
───────────────────────────────────────────────────────────────────────────
Bandwidth           Reserved per connection   Shared; best-effort
Latency bound       Yes (known maximum)       No (unbounded in theory)
Queuing             None (slots reserved)     Yes (at each router/switch)
Failure detection   Reliable via timeout      Ambiguous; no guarantee
Utilization         Low (idle slots wasted)   High (multiplexed)
Use cases           Voice/video telephony     Internet, datacenters
Algorithm design    Can use timing            Must assume no timing
───────────────────────────────────────────────────────────────────────────

Network partition: Two groups of nodes can communicate within their group but not between groups. Both groups may continue operating, potentially electing their own leaders — this is the root cause of split-brain scenarios.

Practical implication: Always set explicit timeouts on network requests. But understand that a timeout only tells you “I gave up waiting” — not what the remote node actually did.


Unreliable Clocks

Monotonic Versus Time-of-Day Clocks

Time-of-day clock (wall clock):

  • Returns current date and time (UTC)
  • Synchronized via NTP — can jump forwards or backwards during sync
  • System.currentTimeMillis() in Java; time.time() in Python; CLOCK_REALTIME in Linux
  • Problem: Can go backward. Cannot use for measuring durations or ordering events across nodes.

Monotonic clock:

  • Always moves forward; never jumps backward
  • Measures elapsed time since an arbitrary epoch (usually process start)
  • System.nanoTime() in Java; time.monotonic() in Python; CLOCK_MONOTONIC in Linux
  • Good for: measuring timeouts and elapsed duration within a single process
  • Cannot compare across nodes: each node’s monotonic clock has an independent, arbitrary zero point
Clock Usage Guide:
─────────────────────────────────────────────────────────────────
                    Time-of-Day Clock    Monotonic Clock
─────────────────────────────────────────────────────────────────
Display to users        YES                  no
Log event timestamps    YES (cautiously)     no
Measure timeout         no                   YES
Measure elapsed time    no                   YES
Order events across     DANGEROUS            IMPOSSIBLE
 multiple nodes
Distributed lock        DANGEROUS            meaningless
 expiry check
─────────────────────────────────────────────────────────────────

Clock Synchronization and Accuracy

Clocks on different machines drift continuously. Without correction:

  • Quartz crystal oscillator drift: ~200ms per day (varies with temperature, age)
  • NTP correction: syncs to within ~10ms under good conditions; worse under load or network congestion
  • Google Spanner TrueTime: GPS receivers + atomic clocks in every datacenter; uncertainty ≤ 7ms

Concrete example of clock drift impact:

Scenario: Two nodes, 1ms clock drift, using LWW (Last Write Wins)
─────────────────────────────────────────────────────────────────
Node A clock: 100ms     Node A writes x=1 with timestamp=100
Node B clock:  99ms     Node B writes x=2 with timestamp=99
                        (Node B's clock is 1ms behind real time)

LWW comparison: timestamp 100 > 99, so x=1 wins.
RESULT: x=2 is silently discarded. A more recent write is LOST.
No error, no warning — the data is simply gone.
─────────────────────────────────────────────────────────────────

NTP accuracy factors (all of these affect synchronization quality):

  • Round-trip network latency to NTP server (asymmetry causes error)
  • Quartz crystal temperature (datacenter cooling varies)
  • NTP server load
  • Multiple hops to NTP stratum 1 sources
  • Leap second handling (some NTP implementations smear, others step)

Google Spanner TrueTime API:

  • Returns [earliest, latest] — an interval, not a point
  • Typical interval width: 1–7ms; never exceeds ~14ms in practice
  • Commits wait until TT.after(commit_timestamp) is true, ensuring the commit timestamp is safely in the past
  • Cost: Requires GPS antennas and atomic clock hardware in every datacenter

PTP (Precision Time Protocol, IEEE 1588):

  • Hardware timestamping at NIC level eliminates software jitter
  • Achieves microsecond-level accuracy (vs milliseconds for software NTP)
  • AWS EC2, GCP, and Azure offer PTP for VMs as of 2026
  • Still doesn’t provide the bounded-uncertainty guarantees of Spanner TrueTime

Relying on Synchronized Clocks

Dangerous uses of wall clock timestamps:

  1. Last Write Wins (LWW): As shown above, clock drift silently discards writes
  2. Distributed lock expiry: Your clock may disagree with the lock server’s clock about whether the lease has expired
  3. Causal ordering: A later write may receive an earlier timestamp due to drift

Safe alternatives:

  • Logical clocks (Lamport timestamps, vector clocks): Increment-based, capture causal ordering without physical time
  • Monotonic clocks: Safe for measuring elapsed time within a single process
  • Hybrid Logical Clocks (HLC): Combine physical time with a logical counter — monotonically increasing while staying close to wall time; used in CockroachDB and YugabyteDB

Process Pauses

A process can pause for an arbitrarily long time in the middle of execution:

Pause SourceTypical DurationWhy It Happens
JVM GC stop-the-world100ms — several secondsOld GC algorithms freeze all threads
ZGC / Shenandoah< 10msLow-pause GC (still nonzero!)
VM live migration10ms — 1 secondCloud hypervisor copies VM memory to new host
OS context switch< 1ms (usually)Kernel preempts process to run another
Memory pressure / swapSecondsOS pages memory to disk; process waits on page fault
Synchronous disk I/O5–100msProcess blocks waiting for fsync
Signal handlerMicrosecondsOS delivers a signal (SIGTERM, SIGUSR1)

Why process pauses invalidate distributed protocols:

The Lease Expiry Nightmare:
────────────────────────────────────────────────────────────────────
Time 0:    Node acquires lease (valid for 30s), gets fencing token=10
Time 1:    Node checks: "lease valid? yes, 29s remaining" ✓
Time 2:    JVM GC pause begins  ─────────────────────────────────┐
           (process is frozen)                                    │
           ...15 seconds pass...                               PAUSED
           Lease expires at Time 17.                              │
           New node B gets lease with token=11                    │
Time 18:   GC pause ends ─────────────────────────────────────────┘
Time 18:   Node resumes. Thinks it still has the lease.
Time 18:   Node writes to storage with token=10

WITHOUT FENCING TOKENS: Two nodes are now writing to the same resource.
WITH FENCING TOKENS:    Storage rejects token=10 (< current max=11). Safe.
────────────────────────────────────────────────────────────────────

Fencing tokens: The correct solution. The lock service issues a monotonically increasing counter with every lock grant. Every write to the protected resource must include the token. The resource server rejects any write where token < max_token_seen. Safety is enforced by the resource, not the lock holder — even a paused process can’t corrupt data because the resource will reject its stale token.


Knowledge, Truth, and Lies

The Majority Rules

In distributed systems, truth is defined by the majority (quorum). A single node cannot trust its own local perception of the world. Examples:

  • A node that was disconnected for 30 seconds may think it’s the leader; the rest of the cluster has moved on
  • A node that is slow may conclude “nobody else responded, so I am the sole survivor”; actually it’s the slow one that looks dead to others
  • A “dead” node may actually be alive and healthy — the detector is simply not receiving its heartbeats

Why quorum is the only safe abstraction: If truth required unanimous agreement, a single node failure would block all decisions. If it required only one node, any isolated node could declare itself master. Majority (n/2 + 1) is the minimum threshold that prevents two disjoint groups from both thinking they have authority.

Distributed Locks and Leases

The split-brain failure mode: Two nodes simultaneously believe they hold the same distributed lock. This can happen when:

  • A GC pause causes lease expiry (as shown above)
  • A network partition makes the lock service unreachable from one group
  • A slow lock service response leaves a client uncertain about whether its lease was granted

Lease vs lock:

  • Lock: Binary — you either hold it or don’t. Requires active release.
  • Lease: A lock with a time-to-live (TTL). Automatically expires if holder fails. Safer in distributed environments because a crashed holder won’t hold the lock forever.

Single leader with fencing token flow:

Lock Service (etcd/ZooKeeper) → Client A → Token=42
                              → Client A pauses (GC)
                              → Token=43 → Client B
                              ← Client A wakes, writes with Token=42
Storage Server: max_seen=43, reject Token=42 ← Safe!

Byzantine Faults

Byzantine fault: A node that does not simply crash or become unresponsive, but instead sends incorrect, contradictory, or malicious information to other nodes.

Name origin: The Byzantine Generals Problem (Lamport, Shostak, Pease, 1982) — Byzantine generals communicating by messenger, where some generals (or messengers) may be traitors sending false information.

Why most datacenter systems don’t need BFT:

  • Nodes in a corporate/cloud datacenter are trusted — hardware bugs send wrong data by accident, not malice
  • BFT algorithms require n ≥ 3f + 1 nodes to tolerate f Byzantine faults (vs n ≥ 2f + 1 for crash faults)
  • BFT message complexity is roughly O(n²) per consensus round — extremely expensive
  • Simpler protection: checksums, cryptographic MACs, TLS for authentication

When BFT is required:

  • Blockchain: Nodes are untrusted (arbitrary participants); Tendermint, PBFT, HotStuff
  • Aerospace / safety-critical systems: Hardware radiation errors can corrupt data in any node
  • Multi-organizational systems: Nodes operated by competing organizations with potential conflicts of interest

Byzantine Fault Tolerance comparison:

Crash-Stop Model:          Byzantine Model:
────────────────           ────────────────
Node fails silently        Node can lie actively
n ≥ 2f + 1 nodes          n ≥ 3f + 1 nodes
Simple quorum reads        Requires signed messages
Low overhead               ~3x message overhead
Used by: Raft, Paxos       Used by: PBFT, Tendermint, HotStuff

Weak forms of Byzantine defense (used in most datacenter systems):

  • Checksums / CRCs on data at rest and in transit
  • TLS with certificate validation prevents message injection
  • Input validation and type checking at API boundaries
  • “Trust but verify” with cryptographic signatures on critical config changes

System Model and Reality

System model: A formal set of assumptions about the behavior of nodes and networks that an algorithm is designed for.

Timing models:

ModelAssumptionRealistic?Used for
SynchronousBounded message delay + processing timeNo — real networks violate thisTheoretical analysis
Partially synchronousUsually bounded; occasionally violates boundsYes — real systemsMost practical algorithms
AsynchronousNo timing assumptions at allToo pessimisticFLP impossibility proofs

Node failure models:

ModelBehaviorTypical Use
Crash-stopNode fails by halting; never restartsSimplest; most theoretical proofs
Crash-recoveryNode can crash, restart with durable stateMost practical DB/distributed systems
ByzantineNode can behave arbitrarilyBlockchain, aerospace, adversarial environments

The combined system model matrix:

                    Crash-Stop    Crash-Recovery    Byzantine
                    ──────────    ──────────────    ─────────
Synchronous         Simple        Practical DB       Aerospace
Partially sync      Raft theory   Raft in practice   Tendermint
Asynchronous        FLP proof     Complex            Extremely hard

Most distributed databases assume: Partially synchronous + crash-recovery. This is why Raft and Paxos use timeouts (partially synchronous assumption) and write to disk before acknowledging (crash-recovery assumption).

Safety vs liveness:

  • Safety property: “Nothing bad happens” — system never enters an invalid state, regardless of timing
  • Liveness property: “Something good eventually happens” — system eventually makes progress
  • Most algorithms sacrifice liveness under adversarial timing (they may block) but never violate safety (they never return wrong results)

Formal Methods and Randomized Testing

(New in 2nd Edition) This section covers how the distributed systems community has moved from informal correctness arguments to rigorous mechanized verification.

Why informal reasoning fails:

  • Distributed system bugs often involve 3–4 concurrent failure conditions that are individually unlikely but jointly catastrophic
  • Human intuition handles 2-way race conditions but fails badly with 4+ concurrent conditions
  • A “correct-looking” algorithm may fail in a 1-in-a-billion scenario that only appears under production load

TLA+ (Temporal Logic of Actions):

  • Author: Leslie Lamport (also invented Paxos)
  • What it is: A formal specification language for concurrent and distributed systems
  • How it works: Write a formal model of the algorithm; the TLC model checker exhaustively checks all possible interleavings of events
  • What it finds: Safety violations (bad states reachable) and liveness violations (progress never made)
  • Used by: Amazon (AWS S3, DynamoDB, EBS, EC2), Microsoft Azure, MongoDB, CockroachDB
  • Amazon finding: TLA+ found 10 bugs in their distributed protocols — 7 of which would have been missed by code review and testing alone

PlusCal: A higher-level language that compiles to TLA+; easier to write than raw TLA+

Alloy: An alternative formal modeling tool with automatic counterexample generation; used for data model and schema verification

Model checking limitations:

  • State space explosion: checking all interleavings is exponential in number of concurrent agents
  • TLC uses bounded model checking (limits state space depth); may miss bugs in very deep scenarios
  • Models must be kept abstract — you model the algorithm’s logic, not the implementation code
  • Finding: a correct TLA+ spec does not guarantee a correct implementation (gap between spec and code)

Jepsen:

  • Author: Kyle Kingsbury (“aphyr”)
  • What it is: A framework for testing distributed systems by injecting network partitions, clock skews, and node failures while running workloads
  • How it works: Uses nemeses (fault injectors) + checkers (linearizability verifiers using Knossos model checker)
  • Famous results:
    • MongoDB (various versions): lost acknowledged writes during elections
    • Redis Cluster: lost data during failover
    • Cassandra: violated linearizability under clock skew
    • Elasticsearch: acknowledged writes not readable after node failure
    • CockroachDB: initially found linearizability violations; now passes (fixed by team)
  • Impact: Most major distributed databases now run Jepsen testing before releases

Randomized / property-based testing:

  • Fuzzing: Feed random or malformed inputs to find crashes and panics
  • Property-based testing (QuickCheck, Hypothesis): Generate random test cases; define properties that must hold; find counterexamples automatically
  • Simulation testing (FoundationDB approach): Run the entire distributed system in a deterministic simulation where all nondeterminism (clocks, network) is controlled; replay exact scenarios that trigger bugs
  • Chaos engineering (Netflix Chaos Monkey): Randomly kill processes in production to verify resilience

FoundationDB’s deterministic simulation:

  • FoundationDB runs its entire distributed system as a simulation: all network calls, disk I/O, and time are simulated with a single-threaded, deterministic event loop
  • Any test that fails can be reproduced exactly (same seed = same execution)
  • This approach found and eliminated entire classes of rare distributed bugs before production
  • Industry influence: The approach (often called “simulation testing” or “discrete event simulation”) is now used by TigerBeetle, Tigris Data, and others

Comparison Tables

Network Models

PropertyLAN (Same Rack)WAN (Same DC)Cross-DCPublic Internet
Typical latency< 1ms1–5ms1–100ms10–500ms
Latency boundNoNoNoNo
Packet lossVery rareRareOccasionalCommon
Partition riskVery lowLowMediumHigh
CongestionRareOccasionalCommonVery common

Clock Accuracy Options

MethodAccuracyCostInfrastructure Required
Software NTP±10–100msFreePublic NTP servers
HW-timestamped NTP±1–5msLowPTP-capable NICs
PTP (IEEE 1588)±1–100µsMediumHW-timestamped switches
Google TrueTime±7ms boundedVery HighGPS + atomic clock hardware
Logical clocksN/A (no physical time)ZeroNone
Hybrid Logical ClocksClose to wall timeLowSoftware only

Failure Detection Strategies

StrategyFalse Positive RateDetection LatencyUsed By
Fixed timeoutHigh under load= timeout valueSimple systems
Heartbeat + timeoutMedium= heartbeat intervalZooKeeper, etcd
Phi Accrual detectorLow (adaptive)AdaptiveCassandra, Akka
Gossip protocolVery lowEventualDynamoDB, SWIM
Hardware health checksLowFast (IPMI/BMC)Cloud providers

Important Points Summary

  • You cannot know what happened after a timeout — the remote node may have processed your request. Design all operations to be idempotent.
  • Wall clocks cannot be used for distributed event ordering or lock expiry decisions — NTP drift of even 10ms can silently discard writes.
  • GC pauses, VM migrations, and OS scheduling can pause a process for seconds without warning. Never assume a process is running continuously.
  • Fencing tokens are the correct solution for distributed leases — they move the safety check to the resource, not the lock holder.
  • Quorum (majority) defines truth in distributed systems — a single node’s local view cannot be trusted when it has been disconnected.
  • Most distributed algorithms assume the partially synchronous + crash-recovery model — realistic for production datacenter systems.
  • Byzantine fault tolerance is not needed for most datacenter systems (trusted environment), but is essential for blockchain, aerospace, and multi-organization systems.
  • TLA+ and Jepsen have found bugs in production-grade distributed systems (AWS, MongoDB, Redis, Cassandra) — formal verification and fault-injection testing are now industry-standard practice.
  • Simulation testing (FoundationDB approach) is the gold standard for finding rare distributed bugs — deterministic replay enables perfect reproducibility.
  • System models must be explicit — an algorithm’s correctness guarantees are only as good as its model assumptions about timing and failure.

Modern Context (2026)

Formal methods in mainstream industry:

  • Amazon mandates TLA+ for new distributed system designs (since ~2015; expanded since)
  • CockroachDB, TiKV, and MongoDB run Jepsen suites on every release
  • FoundationDB’s simulation testing is now a reference architecture for reliability-critical systems
  • TLA+ tooling has improved: VSCode extension, better error messages, faster TLC

Low-latency GC (Java ZGC, Shenandoah):

  • ZGC and Shenandoah achieve < 10ms pauses even for multi-TB heaps
  • Reduces (but does not eliminate) GC-induced lease expiry false positives
  • Fencing tokens remain the correct defense — GC cannot be fully eliminated
  • Java/JVM systems are now more viable for tight-lease distributed coordination

PTP (Precision Time Protocol) availability:

  • AWS EC2 provides PTP (microsecond accuracy) for instances as of 2022
  • GCP and Azure followed; PTP is now standard in public cloud
  • Reduces clock skew risk for LWW-based systems but doesn’t provide bounded uncertainty
  • For true external consistency (Spanner-style), TrueTime hardware is still required

eBPF for network observability:

  • eBPF allows deep kernel-level network inspection with < 1% overhead
  • Cilium, Pixie, Datadog: use eBPF to detect packet drops, retransmits, and partition-like conditions in real time
  • Helps diagnose subtle network partition bugs in production that would previously require wiresharks

Deterministic simulation testing spread:

  • TigerBeetle (financial ledger DB): entire system runs as a deterministic simulation
  • Antithesis (startup): provides simulation-as-a-service — runs any container application in deterministic simulation
  • This approach is increasingly considered the most effective way to test distributed systems

Questions for Reflection

  1. If you send a request to a remote server and receive no response after 5 seconds, what are all the possible states the system could be in? What are the implications for retrying?
  2. A developer argues “we don’t need fencing tokens because our GC pauses are < 5ms (ZGC).” What is the flaw in this reasoning?
  3. Why does the asynchronous network model make consensus impossible, but the partially synchronous model allows it? What assumption does the timeout exploit?
  4. A write was acknowledged by a database node, but the node crashed before replicating to followers. Under the crash-recovery model, when the node restarts, should it re-apply the write or discard it? What guarantees does each choice provide?
  5. Why does BFT require n ≥ 3f + 1 nodes while crash-stop consensus requires only n ≥ 2f + 1? What extra message round is needed and why?
  6. TLA+ found 10 bugs in AWS’s distributed systems, including some only reachable via sequences of 5+ concurrent failures. Could Jepsen have found these same bugs? What are the relative strengths of model checking vs fault-injection testing?

Last Updated: 2026-05-29