Chapter 9: The Trouble with Distributed Systems
ddia-2e distributed-systems networks clocks byzantine-faults formal-methods
Status: Notes complete
Overview
Chapter 9 is a systematic catalog of everything that can go wrong when building distributed systems. Unlike a single machine — where failure is typically total and detectable — distributed systems experience partial failures: some components fail while others continue working, and the system cannot always tell the difference between a failed node and a slow one. The chapter examines three fundamental unreliability sources — networks, clocks, and processes — and then explores how to reason about correctness using formal models.
This is the 2nd edition’s equivalent of 1st edition Chapter 8, with two major additions: an expanded Formal Methods and Randomized Testing section (TLA+, Jepsen, fuzzing) and an updated treatment of Byzantine faults reflecting renewed relevance through blockchain and BFT consensus systems.
Core thesis: Building distributed systems requires abandoning the assumption that networks are reliable, clocks are synchronized, and processes don’t pause. Partial failure is the normal operating state, not an exceptional condition.
Key Concepts
Faults and Partial Failures
The fundamental difference from single-machine systems:
- Single machine: hardware failure is usually total — the machine either works or doesn’t
- Distributed system: components fail independently, unpredictably, and partially
Partial failure: A condition where some nodes in the system are working correctly while others are not. The critical challenge is that the rest of the system must handle this gracefully — and often cannot even reliably detect which components have failed.
Nondeterminism: Unlike a single CPU executing instructions sequentially, distributed systems are inherently nondeterministic. Messages may arrive in different orders on each run. A node may pause at any point. This nondeterminism is the source of most distributed system bugs.
The network ambiguity problem: When you send a request to a remote node and receive no response, you cannot determine which of these happened:
- Request was lost before reaching the server
- Remote node is down or unreachable
- Remote node received request but is processing slowly
- Remote node processed the request; response was lost
- Remote node processed the request; crashed immediately after
- Network between you and the node is partitioned
You have no way to distinguish these cases with certainty. This fundamental ambiguity is the core challenge of distributed networking.
Synchronous Versus Asynchronous Networks
Synchronous networks (e.g., circuit-switched telephone networks):
- A fixed amount of bandwidth is reserved for each call end-to-end
- Maximum end-to-end latency is bounded and known in advance
- No queuing — packets never wait because their “slot” is always reserved
- Consequence: Can make timing guarantees; timeouts are reliable
Asynchronous networks (e.g., packet-switched internet, datacenter networks):
- No reserved bandwidth; packets compete for available capacity
- Maximum latency is unbounded — a packet may be delayed arbitrarily
- Queuing at routers introduces variable latency (jitter)
- Consequence: Cannot make timing guarantees; timeouts are unreliable
Why the internet is packet-switched: Packet switching is more efficient for bursty traffic (web requests, file transfers). Circuit switching wastes bandwidth when no data is being transmitted. The internet was optimized for utilization, not latency predictability.
Why not make datacenter networks synchronous?: Even inside a single datacenter, networks are asynchronous. Switches buffer packets during congestion. VM migration adds latency. Hardware interrupts pause processes. The gains in determinism would require sacrificing flexibility and efficiency.
Comparison: Synchronous vs Asynchronous Networks
───────────────────────────────────────────────────────────────────────────
Property Synchronous (Circuit) Asynchronous (Packet)
───────────────────────────────────────────────────────────────────────────
Bandwidth Reserved per connection Shared; best-effort
Latency bound Yes (known maximum) No (unbounded in theory)
Queuing None (slots reserved) Yes (at each router/switch)
Failure detection Reliable via timeout Ambiguous; no guarantee
Utilization Low (idle slots wasted) High (multiplexed)
Use cases Voice/video telephony Internet, datacenters
Algorithm design Can use timing Must assume no timing
───────────────────────────────────────────────────────────────────────────
Network partition: Two groups of nodes can communicate within their group but not between groups. Both groups may continue operating, potentially electing their own leaders — this is the root cause of split-brain scenarios.
Practical implication: Always set explicit timeouts on network requests. But understand that a timeout only tells you “I gave up waiting” — not what the remote node actually did.
Unreliable Clocks
Monotonic Versus Time-of-Day Clocks
Time-of-day clock (wall clock):
- Returns current date and time (UTC)
- Synchronized via NTP — can jump forwards or backwards during sync
System.currentTimeMillis()in Java;time.time()in Python;CLOCK_REALTIMEin Linux- Problem: Can go backward. Cannot use for measuring durations or ordering events across nodes.
Monotonic clock:
- Always moves forward; never jumps backward
- Measures elapsed time since an arbitrary epoch (usually process start)
System.nanoTime()in Java;time.monotonic()in Python;CLOCK_MONOTONICin Linux- Good for: measuring timeouts and elapsed duration within a single process
- Cannot compare across nodes: each node’s monotonic clock has an independent, arbitrary zero point
Clock Usage Guide:
─────────────────────────────────────────────────────────────────
Time-of-Day Clock Monotonic Clock
─────────────────────────────────────────────────────────────────
Display to users YES no
Log event timestamps YES (cautiously) no
Measure timeout no YES
Measure elapsed time no YES
Order events across DANGEROUS IMPOSSIBLE
multiple nodes
Distributed lock DANGEROUS meaningless
expiry check
─────────────────────────────────────────────────────────────────
Clock Synchronization and Accuracy
Clocks on different machines drift continuously. Without correction:
- Quartz crystal oscillator drift: ~200ms per day (varies with temperature, age)
- NTP correction: syncs to within ~10ms under good conditions; worse under load or network congestion
- Google Spanner TrueTime: GPS receivers + atomic clocks in every datacenter; uncertainty ≤ 7ms
Concrete example of clock drift impact:
Scenario: Two nodes, 1ms clock drift, using LWW (Last Write Wins)
─────────────────────────────────────────────────────────────────
Node A clock: 100ms Node A writes x=1 with timestamp=100
Node B clock: 99ms Node B writes x=2 with timestamp=99
(Node B's clock is 1ms behind real time)
LWW comparison: timestamp 100 > 99, so x=1 wins.
RESULT: x=2 is silently discarded. A more recent write is LOST.
No error, no warning — the data is simply gone.
─────────────────────────────────────────────────────────────────
NTP accuracy factors (all of these affect synchronization quality):
- Round-trip network latency to NTP server (asymmetry causes error)
- Quartz crystal temperature (datacenter cooling varies)
- NTP server load
- Multiple hops to NTP stratum 1 sources
- Leap second handling (some NTP implementations smear, others step)
Google Spanner TrueTime API:
- Returns
[earliest, latest]— an interval, not a point - Typical interval width: 1–7ms; never exceeds ~14ms in practice
- Commits wait until
TT.after(commit_timestamp)is true, ensuring the commit timestamp is safely in the past - Cost: Requires GPS antennas and atomic clock hardware in every datacenter
PTP (Precision Time Protocol, IEEE 1588):
- Hardware timestamping at NIC level eliminates software jitter
- Achieves microsecond-level accuracy (vs milliseconds for software NTP)
- AWS EC2, GCP, and Azure offer PTP for VMs as of 2026
- Still doesn’t provide the bounded-uncertainty guarantees of Spanner TrueTime
Relying on Synchronized Clocks
Dangerous uses of wall clock timestamps:
- Last Write Wins (LWW): As shown above, clock drift silently discards writes
- Distributed lock expiry: Your clock may disagree with the lock server’s clock about whether the lease has expired
- Causal ordering: A later write may receive an earlier timestamp due to drift
Safe alternatives:
- Logical clocks (Lamport timestamps, vector clocks): Increment-based, capture causal ordering without physical time
- Monotonic clocks: Safe for measuring elapsed time within a single process
- Hybrid Logical Clocks (HLC): Combine physical time with a logical counter — monotonically increasing while staying close to wall time; used in CockroachDB and YugabyteDB
Process Pauses
A process can pause for an arbitrarily long time in the middle of execution:
| Pause Source | Typical Duration | Why It Happens |
|---|---|---|
| JVM GC stop-the-world | 100ms — several seconds | Old GC algorithms freeze all threads |
| ZGC / Shenandoah | < 10ms | Low-pause GC (still nonzero!) |
| VM live migration | 10ms — 1 second | Cloud hypervisor copies VM memory to new host |
| OS context switch | < 1ms (usually) | Kernel preempts process to run another |
| Memory pressure / swap | Seconds | OS pages memory to disk; process waits on page fault |
| Synchronous disk I/O | 5–100ms | Process blocks waiting for fsync |
| Signal handler | Microseconds | OS delivers a signal (SIGTERM, SIGUSR1) |
Why process pauses invalidate distributed protocols:
The Lease Expiry Nightmare:
────────────────────────────────────────────────────────────────────
Time 0: Node acquires lease (valid for 30s), gets fencing token=10
Time 1: Node checks: "lease valid? yes, 29s remaining" ✓
Time 2: JVM GC pause begins ─────────────────────────────────┐
(process is frozen) │
...15 seconds pass... PAUSED
Lease expires at Time 17. │
New node B gets lease with token=11 │
Time 18: GC pause ends ─────────────────────────────────────────┘
Time 18: Node resumes. Thinks it still has the lease.
Time 18: Node writes to storage with token=10
WITHOUT FENCING TOKENS: Two nodes are now writing to the same resource.
WITH FENCING TOKENS: Storage rejects token=10 (< current max=11). Safe.
────────────────────────────────────────────────────────────────────
Fencing tokens: The correct solution. The lock service issues a monotonically increasing counter with every lock grant. Every write to the protected resource must include the token. The resource server rejects any write where token < max_token_seen. Safety is enforced by the resource, not the lock holder — even a paused process can’t corrupt data because the resource will reject its stale token.
Knowledge, Truth, and Lies
The Majority Rules
In distributed systems, truth is defined by the majority (quorum). A single node cannot trust its own local perception of the world. Examples:
- A node that was disconnected for 30 seconds may think it’s the leader; the rest of the cluster has moved on
- A node that is slow may conclude “nobody else responded, so I am the sole survivor”; actually it’s the slow one that looks dead to others
- A “dead” node may actually be alive and healthy — the detector is simply not receiving its heartbeats
Why quorum is the only safe abstraction: If truth required unanimous agreement, a single node failure would block all decisions. If it required only one node, any isolated node could declare itself master. Majority (n/2 + 1) is the minimum threshold that prevents two disjoint groups from both thinking they have authority.
Distributed Locks and Leases
The split-brain failure mode: Two nodes simultaneously believe they hold the same distributed lock. This can happen when:
- A GC pause causes lease expiry (as shown above)
- A network partition makes the lock service unreachable from one group
- A slow lock service response leaves a client uncertain about whether its lease was granted
Lease vs lock:
- Lock: Binary — you either hold it or don’t. Requires active release.
- Lease: A lock with a time-to-live (TTL). Automatically expires if holder fails. Safer in distributed environments because a crashed holder won’t hold the lock forever.
Single leader with fencing token flow:
Lock Service (etcd/ZooKeeper) → Client A → Token=42
→ Client A pauses (GC)
→ Token=43 → Client B
← Client A wakes, writes with Token=42
Storage Server: max_seen=43, reject Token=42 ← Safe!
Byzantine Faults
Byzantine fault: A node that does not simply crash or become unresponsive, but instead sends incorrect, contradictory, or malicious information to other nodes.
Name origin: The Byzantine Generals Problem (Lamport, Shostak, Pease, 1982) — Byzantine generals communicating by messenger, where some generals (or messengers) may be traitors sending false information.
Why most datacenter systems don’t need BFT:
- Nodes in a corporate/cloud datacenter are trusted — hardware bugs send wrong data by accident, not malice
- BFT algorithms require
n ≥ 3f + 1nodes to toleratefByzantine faults (vsn ≥ 2f + 1for crash faults) - BFT message complexity is roughly
O(n²)per consensus round — extremely expensive - Simpler protection: checksums, cryptographic MACs, TLS for authentication
When BFT is required:
- Blockchain: Nodes are untrusted (arbitrary participants); Tendermint, PBFT, HotStuff
- Aerospace / safety-critical systems: Hardware radiation errors can corrupt data in any node
- Multi-organizational systems: Nodes operated by competing organizations with potential conflicts of interest
Byzantine Fault Tolerance comparison:
Crash-Stop Model: Byzantine Model:
──────────────── ────────────────
Node fails silently Node can lie actively
n ≥ 2f + 1 nodes n ≥ 3f + 1 nodes
Simple quorum reads Requires signed messages
Low overhead ~3x message overhead
Used by: Raft, Paxos Used by: PBFT, Tendermint, HotStuff
Weak forms of Byzantine defense (used in most datacenter systems):
- Checksums / CRCs on data at rest and in transit
- TLS with certificate validation prevents message injection
- Input validation and type checking at API boundaries
- “Trust but verify” with cryptographic signatures on critical config changes
System Model and Reality
System model: A formal set of assumptions about the behavior of nodes and networks that an algorithm is designed for.
Timing models:
| Model | Assumption | Realistic? | Used for |
|---|---|---|---|
| Synchronous | Bounded message delay + processing time | No — real networks violate this | Theoretical analysis |
| Partially synchronous | Usually bounded; occasionally violates bounds | Yes — real systems | Most practical algorithms |
| Asynchronous | No timing assumptions at all | Too pessimistic | FLP impossibility proofs |
Node failure models:
| Model | Behavior | Typical Use |
|---|---|---|
| Crash-stop | Node fails by halting; never restarts | Simplest; most theoretical proofs |
| Crash-recovery | Node can crash, restart with durable state | Most practical DB/distributed systems |
| Byzantine | Node can behave arbitrarily | Blockchain, aerospace, adversarial environments |
The combined system model matrix:
Crash-Stop Crash-Recovery Byzantine
────────── ────────────── ─────────
Synchronous Simple Practical DB Aerospace
Partially sync Raft theory Raft in practice Tendermint
Asynchronous FLP proof Complex Extremely hard
Most distributed databases assume: Partially synchronous + crash-recovery. This is why Raft and Paxos use timeouts (partially synchronous assumption) and write to disk before acknowledging (crash-recovery assumption).
Safety vs liveness:
- Safety property: “Nothing bad happens” — system never enters an invalid state, regardless of timing
- Liveness property: “Something good eventually happens” — system eventually makes progress
- Most algorithms sacrifice liveness under adversarial timing (they may block) but never violate safety (they never return wrong results)
Formal Methods and Randomized Testing
(New in 2nd Edition) This section covers how the distributed systems community has moved from informal correctness arguments to rigorous mechanized verification.
Why informal reasoning fails:
- Distributed system bugs often involve 3–4 concurrent failure conditions that are individually unlikely but jointly catastrophic
- Human intuition handles 2-way race conditions but fails badly with 4+ concurrent conditions
- A “correct-looking” algorithm may fail in a 1-in-a-billion scenario that only appears under production load
TLA+ (Temporal Logic of Actions):
- Author: Leslie Lamport (also invented Paxos)
- What it is: A formal specification language for concurrent and distributed systems
- How it works: Write a formal model of the algorithm; the TLC model checker exhaustively checks all possible interleavings of events
- What it finds: Safety violations (bad states reachable) and liveness violations (progress never made)
- Used by: Amazon (AWS S3, DynamoDB, EBS, EC2), Microsoft Azure, MongoDB, CockroachDB
- Amazon finding: TLA+ found 10 bugs in their distributed protocols — 7 of which would have been missed by code review and testing alone
PlusCal: A higher-level language that compiles to TLA+; easier to write than raw TLA+
Alloy: An alternative formal modeling tool with automatic counterexample generation; used for data model and schema verification
Model checking limitations:
- State space explosion: checking all interleavings is exponential in number of concurrent agents
- TLC uses bounded model checking (limits state space depth); may miss bugs in very deep scenarios
- Models must be kept abstract — you model the algorithm’s logic, not the implementation code
- Finding: a correct TLA+ spec does not guarantee a correct implementation (gap between spec and code)
Jepsen:
- Author: Kyle Kingsbury (“aphyr”)
- What it is: A framework for testing distributed systems by injecting network partitions, clock skews, and node failures while running workloads
- How it works: Uses nemeses (fault injectors) + checkers (linearizability verifiers using Knossos model checker)
- Famous results:
- MongoDB (various versions): lost acknowledged writes during elections
- Redis Cluster: lost data during failover
- Cassandra: violated linearizability under clock skew
- Elasticsearch: acknowledged writes not readable after node failure
- CockroachDB: initially found linearizability violations; now passes (fixed by team)
- Impact: Most major distributed databases now run Jepsen testing before releases
Randomized / property-based testing:
- Fuzzing: Feed random or malformed inputs to find crashes and panics
- Property-based testing (QuickCheck, Hypothesis): Generate random test cases; define properties that must hold; find counterexamples automatically
- Simulation testing (FoundationDB approach): Run the entire distributed system in a deterministic simulation where all nondeterminism (clocks, network) is controlled; replay exact scenarios that trigger bugs
- Chaos engineering (Netflix Chaos Monkey): Randomly kill processes in production to verify resilience
FoundationDB’s deterministic simulation:
- FoundationDB runs its entire distributed system as a simulation: all network calls, disk I/O, and time are simulated with a single-threaded, deterministic event loop
- Any test that fails can be reproduced exactly (same seed = same execution)
- This approach found and eliminated entire classes of rare distributed bugs before production
- Industry influence: The approach (often called “simulation testing” or “discrete event simulation”) is now used by TigerBeetle, Tigris Data, and others
Comparison Tables
Network Models
| Property | LAN (Same Rack) | WAN (Same DC) | Cross-DC | Public Internet |
|---|---|---|---|---|
| Typical latency | < 1ms | 1–5ms | 1–100ms | 10–500ms |
| Latency bound | No | No | No | No |
| Packet loss | Very rare | Rare | Occasional | Common |
| Partition risk | Very low | Low | Medium | High |
| Congestion | Rare | Occasional | Common | Very common |
Clock Accuracy Options
| Method | Accuracy | Cost | Infrastructure Required |
|---|---|---|---|
| Software NTP | ±10–100ms | Free | Public NTP servers |
| HW-timestamped NTP | ±1–5ms | Low | PTP-capable NICs |
| PTP (IEEE 1588) | ±1–100µs | Medium | HW-timestamped switches |
| Google TrueTime | ±7ms bounded | Very High | GPS + atomic clock hardware |
| Logical clocks | N/A (no physical time) | Zero | None |
| Hybrid Logical Clocks | Close to wall time | Low | Software only |
Failure Detection Strategies
| Strategy | False Positive Rate | Detection Latency | Used By |
|---|---|---|---|
| Fixed timeout | High under load | = timeout value | Simple systems |
| Heartbeat + timeout | Medium | = heartbeat interval | ZooKeeper, etcd |
| Phi Accrual detector | Low (adaptive) | Adaptive | Cassandra, Akka |
| Gossip protocol | Very low | Eventual | DynamoDB, SWIM |
| Hardware health checks | Low | Fast (IPMI/BMC) | Cloud providers |
Important Points Summary
- You cannot know what happened after a timeout — the remote node may have processed your request. Design all operations to be idempotent.
- Wall clocks cannot be used for distributed event ordering or lock expiry decisions — NTP drift of even 10ms can silently discard writes.
- GC pauses, VM migrations, and OS scheduling can pause a process for seconds without warning. Never assume a process is running continuously.
- Fencing tokens are the correct solution for distributed leases — they move the safety check to the resource, not the lock holder.
- Quorum (majority) defines truth in distributed systems — a single node’s local view cannot be trusted when it has been disconnected.
- Most distributed algorithms assume the partially synchronous + crash-recovery model — realistic for production datacenter systems.
- Byzantine fault tolerance is not needed for most datacenter systems (trusted environment), but is essential for blockchain, aerospace, and multi-organization systems.
- TLA+ and Jepsen have found bugs in production-grade distributed systems (AWS, MongoDB, Redis, Cassandra) — formal verification and fault-injection testing are now industry-standard practice.
- Simulation testing (FoundationDB approach) is the gold standard for finding rare distributed bugs — deterministic replay enables perfect reproducibility.
- System models must be explicit — an algorithm’s correctness guarantees are only as good as its model assumptions about timing and failure.
Modern Context (2026)
Formal methods in mainstream industry:
- Amazon mandates TLA+ for new distributed system designs (since ~2015; expanded since)
- CockroachDB, TiKV, and MongoDB run Jepsen suites on every release
- FoundationDB’s simulation testing is now a reference architecture for reliability-critical systems
- TLA+ tooling has improved: VSCode extension, better error messages, faster TLC
Low-latency GC (Java ZGC, Shenandoah):
- ZGC and Shenandoah achieve < 10ms pauses even for multi-TB heaps
- Reduces (but does not eliminate) GC-induced lease expiry false positives
- Fencing tokens remain the correct defense — GC cannot be fully eliminated
- Java/JVM systems are now more viable for tight-lease distributed coordination
PTP (Precision Time Protocol) availability:
- AWS EC2 provides PTP (microsecond accuracy) for instances as of 2022
- GCP and Azure followed; PTP is now standard in public cloud
- Reduces clock skew risk for LWW-based systems but doesn’t provide bounded uncertainty
- For true external consistency (Spanner-style), TrueTime hardware is still required
eBPF for network observability:
- eBPF allows deep kernel-level network inspection with < 1% overhead
- Cilium, Pixie, Datadog: use eBPF to detect packet drops, retransmits, and partition-like conditions in real time
- Helps diagnose subtle network partition bugs in production that would previously require wiresharks
Deterministic simulation testing spread:
- TigerBeetle (financial ledger DB): entire system runs as a deterministic simulation
- Antithesis (startup): provides simulation-as-a-service — runs any container application in deterministic simulation
- This approach is increasingly considered the most effective way to test distributed systems
Questions for Reflection
- If you send a request to a remote server and receive no response after 5 seconds, what are all the possible states the system could be in? What are the implications for retrying?
- A developer argues “we don’t need fencing tokens because our GC pauses are < 5ms (ZGC).” What is the flaw in this reasoning?
- Why does the asynchronous network model make consensus impossible, but the partially synchronous model allows it? What assumption does the timeout exploit?
- A write was acknowledged by a database node, but the node crashed before replicating to followers. Under the crash-recovery model, when the node restarts, should it re-apply the write or discard it? What guarantees does each choice provide?
- Why does BFT require
n ≥ 3f + 1nodes while crash-stop consensus requires onlyn ≥ 2f + 1? What extra message round is needed and why? - TLA+ found 10 bugs in AWS’s distributed systems, including some only reachable via sequences of 5+ concurrent failures. Could Jepsen have found these same bugs? What are the relative strengths of model checking vs fault-injection testing?
Related Resources
- ch10-consistency-and-consensus — The formal solutions to partial failure: linearizability, consensus, Raft
- ch08-transactions — ACID and isolation levels; how single-node databases handle failure
- ch06-replication — How replicas handle network partitions and leader elections
- ch08-trouble-with-distributed-systems — 1st edition equivalent; useful for comparison
- ch09-consistency-and-consensus — 1st edition consensus chapter
Last Updated: 2026-05-29