Chapter 9: The Trouble with Distributed Systems

ddia-2e distributed-systems networks clocks byzantine-faults formal-methods

Status: Notes complete

Overview

Chapter 9 is a systematic catalog of everything that can go wrong when building distributed systems. Unlike a single machine — where failure is typically total and detectable — distributed systems experience partial failures: some components fail while others continue working, and the system cannot always tell the difference between a failed node and a slow one. The chapter examines three fundamental unreliability sources — networks, clocks, and processes — and then explores how to reason about correctness using formal models.

This is the 2nd edition’s equivalent of 1st edition Chapter 8, with two major additions: an expanded Formal Methods and Randomized Testing section (TLA+, Jepsen, fuzzing) and an updated treatment of Byzantine faults reflecting renewed relevance through blockchain and BFT consensus systems.

Core thesis: Building distributed systems requires abandoning the assumption that networks are reliable, clocks are synchronized, and processes don’t pause. Partial failure is the normal operating state, not an exceptional condition.

Key Concepts

Faults and Partial Failures

The fundamental difference from single-machine systems:

Single machine: hardware failure is usually total — the machine either works or doesn’t
Distributed system: components fail independently, unpredictably, and partially

Partial failure: A condition where some nodes in the system are working correctly while others are not. The critical challenge is that the rest of the system must handle this gracefully — and often cannot even reliably detect which components have failed.

Nondeterminism: Unlike a single CPU executing instructions sequentially, distributed systems are inherently nondeterministic. Messages may arrive in different orders on each run. A node may pause at any point. This nondeterminism is the source of most distributed system bugs.

The network ambiguity problem: When you send a request to a remote node and receive no response, you cannot determine which of these happened:

Request was lost before reaching the server
Remote node is down or unreachable
Remote node received request but is processing slowly
Remote node processed the request; response was lost
Remote node processed the request; crashed immediately after
Network between you and the node is partitioned

You have no way to distinguish these cases with certainty. This fundamental ambiguity is the core challenge of distributed networking.

Synchronous Versus Asynchronous Networks

Synchronous networks (e.g., circuit-switched telephone networks):

A fixed amount of bandwidth is reserved for each call end-to-end
Maximum end-to-end latency is bounded and known in advance
No queuing — packets never wait because their “slot” is always reserved
Consequence: Can make timing guarantees; timeouts are reliable

Asynchronous networks (e.g., packet-switched internet, datacenter networks):

No reserved bandwidth; packets compete for available capacity
Maximum latency is unbounded — a packet may be delayed arbitrarily
Queuing at routers introduces variable latency (jitter)
Consequence: Cannot make timing guarantees; timeouts are unreliable

Why the internet is packet-switched: Packet switching is more efficient for bursty traffic (web requests, file transfers). Circuit switching wastes bandwidth when no data is being transmitted. The internet was optimized for utilization, not latency predictability.

Why not make datacenter networks synchronous?: Even inside a single datacenter, networks are asynchronous. Switches buffer packets during congestion. VM migration adds latency. Hardware interrupts pause processes. The gains in determinism would require sacrificing flexibility and efficiency.

Comparison: Synchronous vs Asynchronous Networks
───────────────────────────────────────────────────────────────────────────
Property            Synchronous (Circuit)     Asynchronous (Packet)
───────────────────────────────────────────────────────────────────────────
Bandwidth           Reserved per connection   Shared; best-effort
Latency bound       Yes (known maximum)       No (unbounded in theory)
Queuing             None (slots reserved)     Yes (at each router/switch)
Failure detection   Reliable via timeout      Ambiguous; no guarantee
Utilization         Low (idle slots wasted)   High (multiplexed)
Use cases           Voice/video telephony     Internet, datacenters
Algorithm design    Can use timing            Must assume no timing
───────────────────────────────────────────────────────────────────────────

Network partition: Two groups of nodes can communicate within their group but not between groups. Both groups may continue operating, potentially electing their own leaders — this is the root cause of split-brain scenarios.

Practical implication: Always set explicit timeouts on network requests. But understand that a timeout only tells you “I gave up waiting” — not what the remote node actually did.

Unreliable Clocks

Monotonic Versus Time-of-Day Clocks

Time-of-day clock (wall clock):

Returns current date and time (UTC)
Synchronized via NTP — can jump forwards or backwards during sync
System.currentTimeMillis() in Java; time.time() in Python; CLOCK_REALTIME in Linux
Problem: Can go backward. Cannot use for measuring durations or ordering events across nodes.

Monotonic clock:

Always moves forward; never jumps backward
Measures elapsed time since an arbitrary epoch (usually process start)
System.nanoTime() in Java; time.monotonic() in Python; CLOCK_MONOTONIC in Linux
Good for: measuring timeouts and elapsed duration within a single process
Cannot compare across nodes: each node’s monotonic clock has an independent, arbitrary zero point

Clock Usage Guide:
─────────────────────────────────────────────────────────────────
                    Time-of-Day Clock    Monotonic Clock
─────────────────────────────────────────────────────────────────
Display to users        YES                  no
Log event timestamps    YES (cautiously)     no
Measure timeout         no                   YES
Measure elapsed time    no                   YES
Order events across     DANGEROUS            IMPOSSIBLE
 multiple nodes
Distributed lock        DANGEROUS            meaningless
 expiry check
─────────────────────────────────────────────────────────────────

Clock Synchronization and Accuracy

Clocks on different machines drift continuously. Without correction:

Quartz crystal oscillator drift: ~200ms per day (varies with temperature, age)
NTP correction: syncs to within ~10ms under good conditions; worse under load or network congestion
Google Spanner TrueTime: GPS receivers + atomic clocks in every datacenter; uncertainty ≤ 7ms

Concrete example of clock drift impact:

Scenario: Two nodes, 1ms clock drift, using LWW (Last Write Wins)
─────────────────────────────────────────────────────────────────
Node A clock: 100ms     Node A writes x=1 with timestamp=100
Node B clock:  99ms     Node B writes x=2 with timestamp=99
                        (Node B's clock is 1ms behind real time)

LWW comparison: timestamp 100 > 99, so x=1 wins.
RESULT: x=2 is silently discarded. A more recent write is LOST.
No error, no warning — the data is simply gone.
─────────────────────────────────────────────────────────────────

NTP accuracy factors (all of these affect synchronization quality):

Round-trip network latency to NTP server (asymmetry causes error)
Quartz crystal temperature (datacenter cooling varies)
NTP server load
Multiple hops to NTP stratum 1 sources
Leap second handling (some NTP implementations smear, others step)

Google Spanner TrueTime API:

Returns [earliest, latest] — an interval, not a point
Typical interval width: 1–7ms; never exceeds ~14ms in practice
Commits wait until TT.after(commit_timestamp) is true, ensuring the commit timestamp is safely in the past
Cost: Requires GPS antennas and atomic clock hardware in every datacenter

PTP (Precision Time Protocol, IEEE 1588):

Hardware timestamping at NIC level eliminates software jitter
Achieves microsecond-level accuracy (vs milliseconds for software NTP)
AWS EC2, GCP, and Azure offer PTP for VMs as of 2026
Still doesn’t provide the bounded-uncertainty guarantees of Spanner TrueTime

Relying on Synchronized Clocks

Dangerous uses of wall clock timestamps:

Last Write Wins (LWW): As shown above, clock drift silently discards writes
Distributed lock expiry: Your clock may disagree with the lock server’s clock about whether the lease has expired
Causal ordering: A later write may receive an earlier timestamp due to drift

Safe alternatives:

Logical clocks (Lamport timestamps, vector clocks): Increment-based, capture causal ordering without physical time
Monotonic clocks: Safe for measuring elapsed time within a single process
Hybrid Logical Clocks (HLC): Combine physical time with a logical counter — monotonically increasing while staying close to wall time; used in CockroachDB and YugabyteDB

Process Pauses

A process can pause for an arbitrarily long time in the middle of execution:

Pause Source	Typical Duration	Why It Happens
JVM GC stop-the-world	100ms — several seconds	Old GC algorithms freeze all threads
ZGC / Shenandoah	< 10ms	Low-pause GC (still nonzero!)
VM live migration	10ms — 1 second	Cloud hypervisor copies VM memory to new host
OS context switch	< 1ms (usually)	Kernel preempts process to run another
Memory pressure / swap	Seconds	OS pages memory to disk; process waits on page fault
Synchronous disk I/O	5–100ms	Process blocks waiting for fsync
Signal handler	Microseconds	OS delivers a signal (SIGTERM, SIGUSR1)

Why process pauses invalidate distributed protocols:

The Lease Expiry Nightmare:
────────────────────────────────────────────────────────────────────
Time 0:    Node acquires lease (valid for 30s), gets fencing token=10
Time 1:    Node checks: "lease valid? yes, 29s remaining" ✓
Time 2:    JVM GC pause begins  ─────────────────────────────────┐
           (process is frozen)                                    │
           ...15 seconds pass...                               PAUSED
           Lease expires at Time 17.                              │
           New node B gets lease with token=11                    │
Time 18:   GC pause ends ─────────────────────────────────────────┘
Time 18:   Node resumes. Thinks it still has the lease.
Time 18:   Node writes to storage with token=10

WITHOUT FENCING TOKENS: Two nodes are now writing to the same resource.
WITH FENCING TOKENS:    Storage rejects token=10 (< current max=11). Safe.
────────────────────────────────────────────────────────────────────

Fencing tokens: The correct solution. The lock service issues a monotonically increasing counter with every lock grant. Every write to the protected resource must include the token. The resource server rejects any write where token < max_token_seen. Safety is enforced by the resource, not the lock holder — even a paused process can’t corrupt data because the resource will reject its stale token.

Knowledge, Truth, and Lies

The Majority Rules

In distributed systems, truth is defined by the majority (quorum). A single node cannot trust its own local perception of the world. Examples:

A node that was disconnected for 30 seconds may think it’s the leader; the rest of the cluster has moved on
A node that is slow may conclude “nobody else responded, so I am the sole survivor”; actually it’s the slow one that looks dead to others
A “dead” node may actually be alive and healthy — the detector is simply not receiving its heartbeats

Why quorum is the only safe abstraction: If truth required unanimous agreement, a single node failure would block all decisions. If it required only one node, any isolated node could declare itself master. Majority (n/2 + 1) is the minimum threshold that prevents two disjoint groups from both thinking they have authority.

Distributed Locks and Leases

The split-brain failure mode: Two nodes simultaneously believe they hold the same distributed lock. This can happen when:

A GC pause causes lease expiry (as shown above)
A network partition makes the lock service unreachable from one group
A slow lock service response leaves a client uncertain about whether its lease was granted

Lease vs lock:

Lock: Binary — you either hold it or don’t. Requires active release.
Lease: A lock with a time-to-live (TTL). Automatically expires if holder fails. Safer in distributed environments because a crashed holder won’t hold the lock forever.

Single leader with fencing token flow:

Lock Service (etcd/ZooKeeper) → Client A → Token=42
                              → Client A pauses (GC)
                              → Token=43 → Client B
                              ← Client A wakes, writes with Token=42
Storage Server: max_seen=43, reject Token=42 ← Safe!

Byzantine Faults

Byzantine fault: A node that does not simply crash or become unresponsive, but instead sends incorrect, contradictory, or malicious information to other nodes.

Name origin: The Byzantine Generals Problem (Lamport, Shostak, Pease, 1982) — Byzantine generals communicating by messenger, where some generals (or messengers) may be traitors sending false information.

Why most datacenter systems don’t need BFT:

Nodes in a corporate/cloud datacenter are trusted — hardware bugs send wrong data by accident, not malice
BFT algorithms require n ≥ 3f + 1 nodes to tolerate f Byzantine faults (vs n ≥ 2f + 1 for crash faults)
BFT message complexity is roughly O(n²) per consensus round — extremely expensive
Simpler protection: checksums, cryptographic MACs, TLS for authentication

When BFT is required:

Blockchain: Nodes are untrusted (arbitrary participants); Tendermint, PBFT, HotStuff
Aerospace / safety-critical systems: Hardware radiation errors can corrupt data in any node
Multi-organizational systems: Nodes operated by competing organizations with potential conflicts of interest

Byzantine Fault Tolerance comparison:

Crash-Stop Model:          Byzantine Model:
────────────────           ────────────────
Node fails silently        Node can lie actively
n ≥ 2f + 1 nodes          n ≥ 3f + 1 nodes
Simple quorum reads        Requires signed messages
Low overhead               ~3x message overhead
Used by: Raft, Paxos       Used by: PBFT, Tendermint, HotStuff

Weak forms of Byzantine defense (used in most datacenter systems):

Checksums / CRCs on data at rest and in transit
TLS with certificate validation prevents message injection
Input validation and type checking at API boundaries
“Trust but verify” with cryptographic signatures on critical config changes

System Model and Reality

System model: A formal set of assumptions about the behavior of nodes and networks that an algorithm is designed for.

Timing models:

Model	Assumption	Realistic?	Used for
Synchronous	Bounded message delay + processing time	No — real networks violate this	Theoretical analysis
Partially synchronous	Usually bounded; occasionally violates bounds	Yes — real systems	Most practical algorithms
Asynchronous	No timing assumptions at all	Too pessimistic	FLP impossibility proofs

Node failure models:

Model	Behavior	Typical Use
Crash-stop	Node fails by halting; never restarts	Simplest; most theoretical proofs
Crash-recovery	Node can crash, restart with durable state	Most practical DB/distributed systems
Byzantine	Node can behave arbitrarily	Blockchain, aerospace, adversarial environments

The combined system model matrix:

                    Crash-Stop    Crash-Recovery    Byzantine
                    ──────────    ──────────────    ─────────
Synchronous         Simple        Practical DB       Aerospace
Partially sync      Raft theory   Raft in practice   Tendermint
Asynchronous        FLP proof     Complex            Extremely hard

Most distributed databases assume: Partially synchronous + crash-recovery. This is why Raft and Paxos use timeouts (partially synchronous assumption) and write to disk before acknowledging (crash-recovery assumption).

Safety vs liveness:

Safety property: “Nothing bad happens” — system never enters an invalid state, regardless of timing
Liveness property: “Something good eventually happens” — system eventually makes progress
Most algorithms sacrifice liveness under adversarial timing (they may block) but never violate safety (they never return wrong results)

Formal Methods and Randomized Testing

(New in 2nd Edition) This section covers how the distributed systems community has moved from informal correctness arguments to rigorous mechanized verification.

Why informal reasoning fails:

Distributed system bugs often involve 3–4 concurrent failure conditions that are individually unlikely but jointly catastrophic
Human intuition handles 2-way race conditions but fails badly with 4+ concurrent conditions
A “correct-looking” algorithm may fail in a 1-in-a-billion scenario that only appears under production load

TLA+ (Temporal Logic of Actions):

Author: Leslie Lamport (also invented Paxos)
What it is: A formal specification language for concurrent and distributed systems
How it works: Write a formal model of the algorithm; the TLC model checker exhaustively checks all possible interleavings of events
What it finds: Safety violations (bad states reachable) and liveness violations (progress never made)
Used by: Amazon (AWS S3, DynamoDB, EBS, EC2), Microsoft Azure, MongoDB, CockroachDB
Amazon finding: TLA+ found 10 bugs in their distributed protocols — 7 of which would have been missed by code review and testing alone

PlusCal: A higher-level language that compiles to TLA+; easier to write than raw TLA+

Alloy: An alternative formal modeling tool with automatic counterexample generation; used for data model and schema verification

Model checking limitations:

State space explosion: checking all interleavings is exponential in number of concurrent agents
TLC uses bounded model checking (limits state space depth); may miss bugs in very deep scenarios
Models must be kept abstract — you model the algorithm’s logic, not the implementation code
Finding: a correct TLA+ spec does not guarantee a correct implementation (gap between spec and code)

Jepsen:

Author: Kyle Kingsbury (“aphyr”)
What it is: A framework for testing distributed systems by injecting network partitions, clock skews, and node failures while running workloads
How it works: Uses nemeses (fault injectors) + checkers (linearizability verifiers using Knossos model checker)
Famous results:
- MongoDB (various versions): lost acknowledged writes during elections
- Redis Cluster: lost data during failover
- Cassandra: violated linearizability under clock skew
- Elasticsearch: acknowledged writes not readable after node failure
- CockroachDB: initially found linearizability violations; now passes (fixed by team)
Impact: Most major distributed databases now run Jepsen testing before releases

Randomized / property-based testing:

Fuzzing: Feed random or malformed inputs to find crashes and panics
Property-based testing (QuickCheck, Hypothesis): Generate random test cases; define properties that must hold; find counterexamples automatically
Simulation testing (FoundationDB approach): Run the entire distributed system in a deterministic simulation where all nondeterminism (clocks, network) is controlled; replay exact scenarios that trigger bugs
Chaos engineering (Netflix Chaos Monkey): Randomly kill processes in production to verify resilience

FoundationDB’s deterministic simulation:

FoundationDB runs its entire distributed system as a simulation: all network calls, disk I/O, and time are simulated with a single-threaded, deterministic event loop
Any test that fails can be reproduced exactly (same seed = same execution)
This approach found and eliminated entire classes of rare distributed bugs before production
Industry influence: The approach (often called “simulation testing” or “discrete event simulation”) is now used by TigerBeetle, Tigris Data, and others

Comparison Tables

Network Models

Property	LAN (Same Rack)	WAN (Same DC)	Cross-DC	Public Internet
Typical latency	< 1ms	1–5ms	1–100ms	10–500ms
Latency bound	No	No	No	No
Packet loss	Very rare	Rare	Occasional	Common
Partition risk	Very low	Low	Medium	High
Congestion	Rare	Occasional	Common	Very common

Clock Accuracy Options

Method	Accuracy	Cost	Infrastructure Required
Software NTP	±10–100ms	Free	Public NTP servers
HW-timestamped NTP	±1–5ms	Low	PTP-capable NICs
PTP (IEEE 1588)	±1–100µs	Medium	HW-timestamped switches
Google TrueTime	±7ms bounded	Very High	GPS + atomic clock hardware
Logical clocks	N/A (no physical time)	Zero	None
Hybrid Logical Clocks	Close to wall time	Low	Software only

Failure Detection Strategies

Strategy	False Positive Rate	Detection Latency	Used By
Fixed timeout	High under load	= timeout value	Simple systems
Heartbeat + timeout	Medium	= heartbeat interval	ZooKeeper, etcd
Phi Accrual detector	Low (adaptive)	Adaptive	Cassandra, Akka
Gossip protocol	Very low	Eventual	DynamoDB, SWIM
Hardware health checks	Low	Fast (IPMI/BMC)	Cloud providers

Important Points Summary

You cannot know what happened after a timeout — the remote node may have processed your request. Design all operations to be idempotent.
Wall clocks cannot be used for distributed event ordering or lock expiry decisions — NTP drift of even 10ms can silently discard writes.
GC pauses, VM migrations, and OS scheduling can pause a process for seconds without warning. Never assume a process is running continuously.
Fencing tokens are the correct solution for distributed leases — they move the safety check to the resource, not the lock holder.
Quorum (majority) defines truth in distributed systems — a single node’s local view cannot be trusted when it has been disconnected.
Most distributed algorithms assume the partially synchronous + crash-recovery model — realistic for production datacenter systems.
Byzantine fault tolerance is not needed for most datacenter systems (trusted environment), but is essential for blockchain, aerospace, and multi-organization systems.
TLA+ and Jepsen have found bugs in production-grade distributed systems (AWS, MongoDB, Redis, Cassandra) — formal verification and fault-injection testing are now industry-standard practice.
Simulation testing (FoundationDB approach) is the gold standard for finding rare distributed bugs — deterministic replay enables perfect reproducibility.
System models must be explicit — an algorithm’s correctness guarantees are only as good as its model assumptions about timing and failure.

Modern Context (2026)

Formal methods in mainstream industry:

Amazon mandates TLA+ for new distributed system designs (since ~2015; expanded since)
CockroachDB, TiKV, and MongoDB run Jepsen suites on every release
FoundationDB’s simulation testing is now a reference architecture for reliability-critical systems
TLA+ tooling has improved: VSCode extension, better error messages, faster TLC

Low-latency GC (Java ZGC, Shenandoah):

ZGC and Shenandoah achieve < 10ms pauses even for multi-TB heaps
Reduces (but does not eliminate) GC-induced lease expiry false positives
Fencing tokens remain the correct defense — GC cannot be fully eliminated
Java/JVM systems are now more viable for tight-lease distributed coordination

PTP (Precision Time Protocol) availability:

AWS EC2 provides PTP (microsecond accuracy) for instances as of 2022
GCP and Azure followed; PTP is now standard in public cloud
Reduces clock skew risk for LWW-based systems but doesn’t provide bounded uncertainty
For true external consistency (Spanner-style), TrueTime hardware is still required

eBPF for network observability:

eBPF allows deep kernel-level network inspection with < 1% overhead
Cilium, Pixie, Datadog: use eBPF to detect packet drops, retransmits, and partition-like conditions in real time
Helps diagnose subtle network partition bugs in production that would previously require wiresharks

Deterministic simulation testing spread:

TigerBeetle (financial ledger DB): entire system runs as a deterministic simulation
Antithesis (startup): provides simulation-as-a-service — runs any container application in deterministic simulation
This approach is increasingly considered the most effective way to test distributed systems

Questions for Reflection

If you send a request to a remote server and receive no response after 5 seconds, what are all the possible states the system could be in? What are the implications for retrying?
A developer argues “we don’t need fencing tokens because our GC pauses are < 5ms (ZGC).” What is the flaw in this reasoning?
Why does the asynchronous network model make consensus impossible, but the partially synchronous model allows it? What assumption does the timeout exploit?
A write was acknowledged by a database node, but the node crashed before replicating to followers. Under the crash-recovery model, when the node restarts, should it re-apply the write or discard it? What guarantees does each choice provide?
Why does BFT require n ≥ 3f + 1 nodes while crash-stop consensus requires only n ≥ 2f + 1? What extra message round is needed and why?
TLA+ found 10 bugs in AWS’s distributed systems, including some only reachable via sequences of 5+ concurrent failures. Could Jepsen have found these same bugs? What are the relative strengths of model checking vs fault-injection testing?

ch10-consistency-and-consensus — The formal solutions to partial failure: linearizability, consensus, Raft
ch08-transactions — ACID and isolation levels; how single-node databases handle failure
ch06-replication — How replicas handle network partitions and leader elections
ch08-trouble-with-distributed-systems — 1st edition equivalent; useful for comparison
ch09-consistency-and-consensus — 1st edition consensus chapter

Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch09-trouble-with-distributed-systems

Chapter 9: The Trouble with Distributed Systems

Overview

Key Concepts

Faults and Partial Failures

Synchronous Versus Asynchronous Networks

Unreliable Clocks

Monotonic Versus Time-of-Day Clocks

Clock Synchronization and Accuracy

Relying on Synchronized Clocks

Process Pauses

Knowledge, Truth, and Lies

The Majority Rules

Distributed Locks and Leases

Byzantine Faults

System Model and Reality

Formal Methods and Randomized Testing

Comparison Tables

Network Models

Clock Accuracy Options

Failure Detection Strategies

Important Points Summary

Modern Context (2026)

Questions for Reflection

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch09-trouble-with-distributed-systems

Chapter 9: The Trouble with Distributed Systems

Overview

Key Concepts

Faults and Partial Failures

Synchronous Versus Asynchronous Networks

Unreliable Clocks

Monotonic Versus Time-of-Day Clocks

Clock Synchronization and Accuracy

Relying on Synchronized Clocks

Process Pauses

Knowledge, Truth, and Lies

The Majority Rules

Distributed Locks and Leases

Byzantine Faults

System Model and Reality

Formal Methods and Randomized Testing

Comparison Tables

Network Models

Clock Accuracy Options

Failure Detection Strategies

Important Points Summary

Modern Context (2026)

Questions for Reflection

Related Resources

Graph View

Table of Contents

Backlinks