Chapter 9 Flashcards — The Trouble with Distributed Systems

flashcards ddia-2e chapter-9 distributed-systems networks clocks


Foundations

What is a partial failure and why is it the defining challenge of distributed systems?
?

  • Partial failure: Some nodes in the system work correctly while others do not — simultaneously
    • Example: 3 of 5 database replicas reachable; 2 unreachable due to switch failure
  • Why defining: Unlike a single machine (usually total failure or full operation), distributed systems require code to handle the in-between state
  • Nondeterminism: Different nodes may observe different subsets of failures at the same time
  • Core implication: Every distributed protocol must be designed to function correctly despite partial failure — you can’t assume “if A succeeded, B succeeded too”

What is the fundamental ambiguity problem when a distributed request times out?
?
When a request times out, you cannot determine which of these happened:

  1. Request was lost before reaching the server (server never saw it)
  2. Server is down or unreachable (request never processed)
  3. Server received the request but is processing slowly
  4. Server processed the request; response was lost
  5. Server processed the request; crashed immediately after
  6. Network is partitioned between you and the server
  • No way to distinguish these cases with certainty
  • Critical implication: You may retry an already-completed operation
  • Required design: All operations must be idempotent (safe to repeat without changing the result beyond the first execution)
  • Implementation: Use unique request IDs; server returns cached result for duplicate IDs

Networks

What is the difference between synchronous and asynchronous networks?
?
Synchronous network (circuit-switched, e.g., telephone):

  • Bandwidth is reserved end-to-end for each connection
  • Maximum end-to-end latency is bounded and known
  • No queuing — packets are never delayed by other traffic
  • ✅ Algorithms can make timing assumptions
  • Example: PSTN telephone network

Asynchronous network (packet-switched, e.g., internet, datacenters):

  • Bandwidth is shared — packets compete with all other traffic
  • Maximum latency is unbounded in theory (queuing at each router)
  • Packets may be delayed arbitrarily, reordered, dropped, or duplicated
  • ❌ Algorithms cannot assume anything about message timing
  • Example: The internet, corporate LANs, cloud datacenter networks

Key insight: All modern datacenter and internet networks are asynchronous. A “timeout” only tells you “I gave up waiting” — it says nothing about what the remote node actually did.

What is a network partition and why does it cause split-brain?
?

  • Network partition: Two groups of nodes can communicate within their group but not between groups. All nodes are still alive and running.

    • Example: A switch failure divides 6 nodes into Group A (3 nodes) and Group B (3 nodes)
    • Within groups: normal communication
    • Between groups: complete inability to communicate
  • Split-brain: Both groups may independently elect their own leader, believing the other group is dead

    • Group A: “Group B is unreachable, we’ll elect a new leader”
    • Group B: “Group A is unreachable, we’ll elect a new leader”
    • Result: Two active leaders simultaneously writing to the same data
  • Why dangerous: Both leaders may accept writes that are never reconciled, leading to data divergence or corruption


Clocks

What are the two types of clocks in distributed systems, and when should each be used?
?
Time-of-day clock (wall clock):

  • Returns current UTC date and time
  • Synchronized via NTP → can jump forward or backward during sync
  • System.currentTimeMillis() (Java), time.time() (Python), CLOCK_REALTIME (Linux)
  • ✅ Use for: displaying time to users, logging events, approximate timestamps
  • ❌ Never use for: measuring elapsed time, ordering events across nodes, LWW, lock expiry

Monotonic clock:

  • Always increases; never jumps backward
  • Measures elapsed time since an arbitrary (per-process) epoch
  • System.nanoTime() (Java), time.monotonic() (Python), CLOCK_MONOTONIC (Linux)
  • ✅ Use for: measuring timeouts, rate limiting, elapsed time within one process
  • ❌ Never use for: comparing across different nodes (each has an independent epoch)

The danger in one sentence: A wall clock on Node B may be 10ms behind Node A’s. If both write to the same key with LWW, Node A’s write (at timestamp=100) wins over Node B’s later write (at timestamp=99). Node B’s write is silently discarded.

Why is “last write wins” with wall clock timestamps dangerous?
?

  • LWW assumption: The write with the largest timestamp happened “last” and should win
  • The problem: Clocks on different nodes drift and disagree, even with NTP

Concrete failure scenario:

Node A clock: 100ms  →  Node A writes x=1 with timestamp=100
Node B clock:  99ms  →  Node B writes x=2 with timestamp=99 (1ms behind)
LWW result: timestamp 100 > 99, so x=1 wins
ACTUAL result: x=2 (the more recent write) is SILENTLY DISCARDED
No error. No warning. The data is simply gone.
  • Quartz crystal drift: ~200ms per day without NTP correction
  • NTP accuracy: ±10ms typical; ±50–100ms under load or congestion
  • Leap seconds: Can cause clocks to repeat or skip a second entirely

Safe alternatives:

  • Lamport timestamps or vector clocks (logical ordering, no physical time)
  • Hybrid Logical Clocks — stay close to wall time while monotonically increasing
  • Google Spanner TrueTime — physical time with bounded uncertainty (requires GPS + atomic clock hardware)

What is Google Spanner’s TrueTime API and how does it solve the clock problem?
?

  • TrueTime: GPS receivers + atomic clocks deployed in every Google datacenter
  • API: TT.now() returns an interval [earliest, latest], not a point
    • The interval represents the bounded uncertainty in current time
    • Width of interval: typically 1–7ms; never exceeds ~14ms
  • Usage in Spanner: Before committing a transaction, Spanner calls TT.after(commit_timestamp)
    • This waits until the commit timestamp is provably in the past on all nodes
    • Result: No node can observe a snapshot from “before” the commit — external consistency
  • Why this enables linearizability: The “commit wait” ensures no future transaction can see a time before this commit’s timestamp
  • Cost: GPS antennas + atomic clock cards in every datacenter — prohibitively expensive for most organizations
  • Impact: Spanner achieves linearizable distributed transactions at global scale — no other system does this without TrueTime-equivalent hardware

Process Pauses

What is the GC pause lease expiry problem and how do fencing tokens solve it?
?
The problem — without fencing tokens:

  1. Node A acquires distributed lock with 30s TTL; receives fencing token=10
  2. Node A checks: “lease valid? 29s remaining” ✓
  3. JVM GC stop-the-world pause begins — Node A is frozen for 15 seconds
  4. Lease expires; Node B acquires lock with fencing token=11
  5. GC pause ends; Node A resumes — still thinks it holds the lock
  6. Both A and B write to the resource → split brain / data corruption

Fencing token solution:

  • Lock service issues a monotonically increasing integer with each lock grant
  • Every write to the resource must include the fencing token
  • The resource server tracks max_token_seen and rejects any write with a lower token
  • When Node A (token=10) tries to write, resource sees token=11 already used → rejects
  • Node A’s stale write is blocked without Node A needing to know its lease expired

Key insight: Safety is enforced by the resource, not the lock holder. A paused or slow process cannot corrupt data because the resource will reject its stale token.

What are the main causes of process pauses in distributed systems?
?

SourceTypical DurationNotes
JVM GC (old G1)100ms — several secondsStop-the-world; all threads frozen
JVM ZGC / Shenandoah< 10msLow-pause; still nonzero
VM live migration10ms — 1 secondHypervisor suspends + migrates VM
OS context switch< 1msKernel preempts; another process runs
Memory swap (page fault)SecondsProcess memory paged to disk
Synchronous disk I/O5–100msProcess blocks on fsync/read
Debugging breakpointIndefiniteEntire process paused

Why they can’t be fully eliminated:

  • GC: Low-pause algorithms reduce pauses but cannot eliminate them entirely
  • VM migration: Cloud platform decides when to migrate (no JVM flag can prevent it)
  • OS scheduling: Required for multitasking; unavoidable

Design implication: Assume any process can pause for an arbitrary duration. Never rely on a process knowing how long it has been paused. Always design protocols that remain correct even if a process pauses for minutes.


System Models

What are the three timing models used in distributed algorithm design?
?
1. Synchronous model:

  • Message delay: bounded by known maximum d
  • Processing time: bounded by known maximum p
  • Clock drift: bounded by known maximum ε
  • ✅ Simplest to reason about; timeouts work reliably
  • Unrealistic: real networks can experience arbitrary delays

2. Asynchronous model:

  • No timing guarantees whatsoever
  • Algorithms cannot assume anything about delays or clocks
  • FLP theorem: Consensus is impossible if any single node can fail
  • Too pessimistic: real systems sometimes have reasonable timing

3. Partially synchronous modelThe realistic model:

  • System behaves synchronously most of the time
  • But can occasionally violate timing bounds (GC pauses, congestion, etc.)
  • Most practical distributed algorithms assume this model
  • Allows consensus via timeouts (which can sometimes be wrong, but “eventually” correct)

Bottom line: Write algorithms for the partially synchronous model. Document your timing assumptions explicitly.

What are the three node failure models in distributed systems?
?
1. Crash-stop:

  • Node fails by halting; never restarts
  • Safe to assume a dead node stays dead
  • Used mostly in theoretical analysis (unrealistic for production)

2. Crash-recovery (most practical):

  • Node can crash and then restart
  • Durable state (written to disk) survives the crash
  • In-memory state is lost
  • Example: A database node that restarts after a kernel panic, replaying its WAL
  • Most distributed DB algorithms assume this model

3. Byzantine:

  • Node can behave arbitrarily — lie, send contradictory messages, act maliciously
  • Hardware corruption, software bugs sending wrong data, or actual adversarial behavior
  • Requires n ≥ 3f + 1 nodes to tolerate f faults (vs 2f + 1 for crash models)
  • ~3x message overhead per consensus round
  • Used: blockchain, aerospace, multi-organization systems
  • Most datacenter systems do NOT need BFT — trusted environment

What is the FLP impossibility result and what does it mean in practice?
?

  • FLP = Fischer, Lynch, Paterson (1985 paper)
  • Result: In a fully asynchronous distributed system, there is no deterministic algorithm that can guarantee consensus if even one node may crash

Why it matters:

  • Every real consensus algorithm (Raft, Paxos, Zab) must make timing assumptions to work
  • They all use timeouts: “if no response in X seconds, assume the node crashed”
  • These timeouts allow progress in the partially synchronous model
  • But they can be wrong: a slow node might be declared dead and excluded; later it recovers

Practical implication:

  • Raft and Paxos are NOT proven correct for fully asynchronous networks
  • They work because real networks are partially synchronous (usually bounded)
  • Design for “usually works” not “always correct regardless of timing”
  • Accept that consensus can become temporarily unavailable under extreme timing violations

Byzantine Faults

What is the Byzantine fault problem and when does a distributed system need BFT?
?
Byzantine fault: A node that does not simply crash, but instead sends incorrect, contradictory, or malicious data to other nodes.

Byzantine Generals Problem: Generals communicating by messenger, where some generals may be traitors sending false information. The problem is to reach agreement despite some fraction of traitorous nodes.

BFT requirement: n ≥ 3f + 1 nodes to tolerate f Byzantine faults

  • Compared to crash-stop: n ≥ 2f + 1 (much cheaper)
  • Each consensus round requires additional message phases and cryptographic signing
  • Roughly 3× more expensive than crash-tolerant consensus

When BFT is needed:

  • ✅ Blockchain (Tendermint, PBFT, HotStuff): participants are untrusted
  • ✅ Aerospace / safety-critical: radiation can corrupt any node’s data
  • ✅ Multi-organization systems: operators have conflicting interests
  • ❌ Corporate datacenters: nodes are trusted; crash-recovery model sufficient
  • ❌ Cloud-hosted microservices: use mTLS for authentication, not BFT

Weak BFT defenses (suitable for most systems):

  • Checksums/CRCs on data at rest and in transit
  • TLS with certificate validation (prevents message injection/forgery)
  • Input validation at API boundaries (prevents malformed data propagation)

Formal Methods and Testing

What is TLA+ and what role does it play in distributed systems design?
?

  • TLA+ (Temporal Logic of Actions): A formal specification language created by Leslie Lamport
  • Purpose: Write a mathematical model of an algorithm; exhaustively check all possible execution interleavings
  • TLC model checker: The tooling that checks all reachable states for safety and liveness violations

What TLA+ finds:

  • Safety violations: “Can the system ever reach this bad state?”
  • Liveness violations: “Is it possible the system never makes progress?”
  • Bugs only reachable through rare concurrent failure combinations (3–5+ simultaneous failures)

Industry adoption:

  • Amazon: Mandates TLA+ for distributed system designs; found 10 bugs in AWS services (S3, DynamoDB, EBS); 7 would have been missed by code review + testing
  • Microsoft Azure, MongoDB, CockroachDB: Use TLA+ for critical protocol design

Limitations:

  • State space explosion: exponential in number of concurrent agents
  • TLC uses bounded model checking — may miss bugs in very deep scenarios
  • Models must be abstract (algorithm logic, not implementation code)
  • A correct spec doesn’t guarantee a correct implementation — gap between model and code

What is Jepsen and what distributed system bugs has it found?
?

  • Jepsen: A distributed system testing framework by Kyle Kingsbury (“aphyr”)
  • Method: Inject network partitions, clock skews, and node kills while running concurrent workloads; then verify linearizability using the Knossos model checker
  • Components: Nemesis (fault injector) + client (runs ops) + checker (verifies history)

Famous findings:

  • MongoDB (multiple versions): Lost acknowledged writes during leader elections
  • Redis Cluster: Lost data during failover (acknowledged writes not durable)
  • Cassandra: Violated linearizability under network partition + clock skew
  • Elasticsearch: Acknowledged writes unreadable after node failure
  • CockroachDB: Initially found violations; team fixed them; now passes Jepsen

Difference from TLA+:

  • Jepsen tests the actual running implementation (not a model)
  • Finds bugs in code that may not exist in the spec
  • Cannot enumerate all possible interleavings (only tests scenarios that actually occur)
  • TLA+ is broader (covers all interleavings) but doesn’t test real code

Impact: Most major distributed databases now run Jepsen tests before each release

What is simulation testing (FoundationDB approach) and why is it the gold standard?
?

  • Simulation testing: Run the entire distributed system as a deterministic, single-threaded simulation
    • All network I/O, disk I/O, timers, and randomness are controlled by the simulation
    • The simulation replays the same sequence of events every time for a given seed

FoundationDB’s approach:

  • Built from day one to run in simulation mode
  • Any bug that can be triggered can be reproduced exactly (same seed = same execution)
  • The simulator can inject faults at arbitrary points in the execution
  • FoundationDB ran billions of simulated hours before its first production release

Why it’s the gold standard:

  • Perfect reproducibility: Any rare bug found can be reproduced on demand
  • Exhaustive exploration: The simulator can explore corner cases too rare for integration tests
  • Faster iteration: No need to set up real clusters; bugs reproduced in milliseconds
  • Complete coverage: Covers all code paths, including rare concurrent failure combinations

Adoption (2026):

  • TigerBeetle (financial database): Entire system is a deterministic simulation
  • Antithesis (startup): Simulation-as-a-service for any containerized application
  • The approach is spreading as “the right way” to test distributed systems

Modern Context (2026)

How has ZGC changed distributed system design considerations for Java?
?

  • Before ZGC (pre-2020): JVM G1GC could pause for seconds → significant risk for distributed leases
  • ZGC (Java 11+, production-ready ~2020): Max pause < 10ms for heaps up to multi-TB
  • Shenandoah (Red Hat): Similar pause targets

Impact on distributed systems:

  • Short lease TTLs (100ms–1s) are now safer with ZGC/Shenandoah
  • Reduced GC-induced false positives in failure detectors
  • Java/JVM systems are now viable for tight-lease distributed coordination

What ZGC does NOT fix:

  • Cloud VM live migration still pauses the VM (10ms–1s) — no JVM flag prevents this
  • OS-level memory pressure can still cause page faults (process waits for disk)
  • 10ms pauses, while small, are still > 0ms — fencing tokens remain necessary
  • Bottom line: ZGC reduces GC pause risk substantially but doesn’t eliminate the fundamental need to design for arbitrary pauses

What is a Hybrid Logical Clock (HLC) and why is it used in databases like CockroachDB?
?

  • Hybrid Logical Clock (HLC): A clock algorithm that combines a physical timestamp with a logical counter
    • Format: (physical_time, logical_counter)
    • Always monotonically increasing — never goes backward
    • Stays close to wall clock time (within bounded drift)

Why HLC over pure logical clocks:

  • Lamport timestamps only give causal ordering — no physical time info
  • Wall clocks can go backward (NTP correction) — unsafe for ordering
  • HLC gives: monotonic ordering + approximate physical time + causal consistency

HLC properties:

  • If event A happens before event B: HLC(A) < HLC(B) always
  • HLC never goes backward (unlike wall clocks)
  • HLC ≈ wall clock time within a bounded skew

Used by: CockroachDB, YugabyteDB (for transaction timestamps and MVCC ordering)

Comparison:

Pure Lamport:    Monotonic, causal, no physical time
Wall clock:      Physical time, can go backward, not causal
HLC:             Monotonic, causal, close to physical time  ← best of both
TrueTime:        Bounded physical uncertainty, requires hardware

Quick Facts

What is the typical NTP clock accuracy and what does quartz crystal drift look like without correction?
?

  • NTP accuracy (software): ±10ms under good conditions
    • Loaded servers: ±50ms
    • Network congestion: ±100ms+
    • Misconfigured: drift to seconds before next NTP sync
  • Quartz crystal drift (no NTP): ~200ms per day (varies with temperature and aging)
  • PTP (Precision Time Protocol): ±1–100 microseconds (requires HW timestamping at NIC)
  • Google TrueTime: ±7ms maximum bounded uncertainty (GPS + atomic clocks)

Practical implication for LWW:

  • 10ms NTP drift: any two writes within a 10ms window may be reordered
  • A “last write wins” system may silently discard writes if both nodes’ clock differ by even 1ms

What is the safety vs liveness distinction in distributed algorithm correctness?
?

  • Safety property: “Nothing bad ever happens”

    • The system never enters an invalid or incorrect state
    • Must hold regardless of timing, partitions, or pauses
    • Examples: linearizability (reads always see latest write), consensus never decides two values
  • Liveness property: “Something good eventually happens”

    • The system eventually makes progress
    • Can be violated temporarily under adverse conditions
    • Examples: consensus eventually decides, all failed nodes eventually recover

The key trade-off:

  • Most distributed algorithms prioritize safety over liveness
  • Under network partition or timing violations: algorithm blocks (no progress) rather than returns wrong results
  • A system that sacrifices safety is dangerous; one that sacrifices liveness is just slow

Raft example:

  • Safety: a value once committed is never overwritten (always preserved)
  • Liveness: if a leader is elected and a majority is healthy, entries are committed
  • Under partition: Raft may block (cannot elect leader without majority) — sacrifices liveness, never safety

Total Cards: 18
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29