Chapter 8 Flashcards - The Trouble with Distributed Systems

flashcards chapter-8 ddia


Basic Concepts

What is the fundamental ambiguity problem with network timeouts in distributed systems?
?

  • When a request times out, you cannot determine which of these happened:

    1. Request lost before reaching server
    2. Server is down / unreachable
    3. Server received request but is processing slowly
    4. Server processed request, response was lost
    5. Server processed request but crashed immediately after
  • No way to distinguish these cases with certainty over a network

  • Critical implication: You may retry an operation that already succeeded → must design operations to be idempotent (safe to repeat without different effect)

What is a partial failure and why is it the central challenge of distributed systems?
?

  • Partial failure: Some nodes in the system are working correctly while others are not

    • Example: 3 of 5 nodes reachable; 2 nodes unreachable (partition, crash, slowness)
  • Why challenging: Unlike single-machine failure (usually total), partial failures require code to handle the in-between state

  • Contrast with single machines:

    • Single machine: usually fails fully (crash) or works fine
    • Distributed system: some components can fail independently, unpredictably
  • Key insight: Distributed systems must be designed to work correctly despite partial failures; cannot assume “if X worked, Y worked too”

Unreliable Networks

What are the differences between a network partition and a node failure?
?

  • Node failure: A specific machine has crashed or become unavailable

    • The machine and its data are gone until recovery
    • Other nodes can’t communicate with it at all
  • Network partition: Two groups of nodes can communicate within their group but not between groups

    • All nodes are still running
    • Nodes in Group A can talk to each other; Group B nodes can talk to each other
    • But Group A and Group B cannot communicate
  • Why matters: During a network partition, both groups may elect their own leader → split brain → data conflicts

  • Example: Data center network switch failure creates two groups; each group thinks the other is dead

What is idempotency and why is it essential for distributed systems?
?

  • Idempotency: An operation can be performed multiple times with the same effect as performing it once

    • x = 5 is idempotent (repeating has no additional effect)
    • x = x + 1 is NOT idempotent (repeating has additional effect)
    • INSERT INTO table ... WHERE NOT EXISTS is idempotent
    • INSERT INTO table ... is NOT (creates duplicate)
  • Why essential: Since network timeouts can’t tell you if a request succeeded, you must retry

  • Safe retries require idempotent operations (retrying won’t cause double charges, duplicate records, etc.)

  • Implementation: Use unique request IDs; server deduplicates requests with same ID

    • Client: POST /payments {idempotency_key: "txn-abc123", amount: 100}
    • Server: if “txn-abc123” seen before → return cached response; else process

Unreliable Clocks

What are the two types of clocks in distributed systems and when should each be used?
?
Time-of-day clock (wall clock):

  • Returns current date/time (UTC)
  • Synchronized via NTP → can jump forward or backward
  • System.currentTimeMillis() (Java), time.time() (Python)
  • ✅ Use for: displaying time, logging, rough timestamp in single system
  • ❌ Never use for: measuring elapsed time, ordering events across nodes, LWW

Monotonic clock:

  • Always increases, never jumps backward
  • Measures elapsed time since an arbitrary point (usually process start)
  • System.nanoTime() (Java), time.monotonic() (Python)
  • ✅ Use for: measuring timeouts, rate limiting, elapsed time within one process
  • ❌ Never use for: comparing time across different nodes (each has independent monotonic clock)

Why is using wall clock timestamps for “last write wins” (LWW) dangerous?
?

  • Problem: Clocks on different nodes drift and disagree, even with NTP

    • NTP accuracy: typically ~10ms; can be much worse on loaded or misconfigured systems
    • Quartz crystal drift: ~200ms per day without correction
    • Leap seconds can cause clocks to repeat or skip a second
  • LWW danger scenario:

    • Node A writes x=1 with timestamp t=100ms
    • Node B writes x=2 with timestamp t=99ms (1ms behind)
    • LWW compares timestamps: 100 > 99 → x=1 wins (WRONG! x=2 was more recent in real time)
    • A write is silently lost with no error or warning
  • Safe alternatives:

    • Logical clocks (Lamport timestamps) for ordering
    • Version vectors for detecting concurrent writes
    • Google Spanner TrueTime for physical time with bounded uncertainty

What is Google’s TrueTime API and how does it solve the distributed clock problem?
?

  • TrueTime: GPS receivers + atomic clocks deployed in every Google datacenter

  • API: TT.now() returns [earliest, latest] — a time interval, not a point

    • Interval represents the uncertainty in current time
    • latest - earliest ≤ 7ms (typical), max ~14ms observed
  • Usage: Before committing a Spanner transaction, wait until TT.after(commit_timestamp) is true

    • This ensures the committed timestamp is safely in the past
    • Eliminates “time traveler” reads (reading a snapshot from the future)
  • Cost: Special hardware (GPS antennas, atomic clock cards) in every datacenter

  • Result: Externally consistent (linearizable) distributed transactions at global scale

  • Why most systems can’t do this: Expensive hardware + Google-scale infrastructure required

Process Pauses

What is the “GC pause lease expiry” problem and how do fencing tokens solve it?
?
Problem:

  1. Process A acquires distributed lock with 30-second lease, gets token=10
  2. Process A checks: “lease valid” ✅ → begins operation
  3. JVM GC stop-the-world pause begins (process A frozen for 15 seconds)
  4. Lease expires; Process B acquires lock with token=11
  5. GC pause ends; Process A resumes, thinks it still holds lock
  6. A and B both write → data corruption

Fencing token solution:

  • Lock service issues monotonically increasing fencing token with each lock grant
  • Clients include token in every write request to the storage service
  • Storage service rejects any write with a lower token than the highest seen
  • When A (token=10) tries to write, storage sees token=11 already used → rejects A’s write

Key insight: Moves safety checking to the resource being accessed, not the lock holder — the resource can always verify freshness

What causes process pauses in distributed systems and why can’t they be eliminated?
?
Sources of process pauses:

  1. Garbage collection (JVM/CLR): Stop-the-world GC pauses can last seconds
  2. Virtual machine migration: Cloud hypervisor suspends VM, migrates to another host, resumes
  3. Context switches: OS scheduler preempts process; may not run for milliseconds
  4. Memory pressure: OS swaps process memory to disk (page fault) → seconds-long pause
  5. Disk I/O: Process blocks waiting for synchronous disk read
  6. Signal handlers: OS delivers signal, process handles it
  7. Hardware interrupts: CPU handles interrupt, process not running

Why can’t be eliminated:

  • GC: Low-pause GC (ZGC, Shenandoah) reduces pauses to <10ms but cannot eliminate
  • VM migration: Cloud feature, platform decides when to migrate
  • OS scheduling: Required for multi-process OS operation

Design implication: Always assume any process can pause for an arbitrary duration; design protocols that remain correct despite pauses

System Models

What are the three system timing models used in distributed algorithms?
?
1. Synchronous model:

  • Bounded network delay (≤ d)
  • Bounded processing time (≤ p)
  • Bounded clock drift (≤ ε)
  • ✅ Easiest to reason about; algorithms can use timeouts safely
  • ❌ Unrealistic: real networks can experience arbitrary delays

2. Asynchronous model:

  • No timing guarantees whatsoever
  • Algorithms cannot make any assumptions about time
  • ❌ FLP theorem: consensus impossible in this model if any node can fail
  • ❌ Too pessimistic for real systems

3. Partially synchronous model (most realistic):

  • System behaves synchronously most of the time
  • But can occasionally violate timing bounds (e.g., during GC pauses, congestion)
  • Most practical distributed algorithms assume this model
  • Allows consensus by adding timeouts (though not guaranteed)

What is the FLP impossibility result and what does it mean in practice?
?

  • FLP = Fischer, Lynch, Paterson (1985 paper)

  • Result: In a fully asynchronous distributed system, there is no deterministic algorithm that can guarantee consensus (agreement) if even ONE node might crash

  • Why it matters: Every distributed consensus algorithm (Paxos, Raft) must make timing assumptions to work

    • Uses timeouts: “if no response in X seconds, assume node crashed”
    • These timeouts work in practice (partially synchronous model) but can violate in worst case
    • Raft/Paxos are NOT proven correct in asynchronous model — they work because of timing assumptions
  • Practical implication: Perfect distributed consensus is impossible; all real consensus algorithms can get stuck or produce wrong results under adversarial network conditions

    • Design for “usually works” not “always correct regardless of timing”

What are the three node failure models and which do most distributed algorithms assume?
?
1. Crash-stop:

  • Node fails by stopping; never resumes
  • Simplest model for algorithm design

2. Crash-recovery:

  • Node can crash, then restart and recover state from disk
  • Must design for: what state is preserved? What is lost?
  • Most practical: DB nodes crash and restart with durable state intact

3. Byzantine:

  • Node can behave arbitrarily: lie, send contradictory messages, act maliciously
  • Requires Byzantine Fault Tolerant (BFT) algorithms (very expensive: ~3x message overhead)
  • Used in: blockchain (Tendermint, PBFT), aerospace, nuclear systems
  • NOT needed for most datacenter systems (trusted environment)

Most distributed DB algorithms assume: crash-recovery (with durable state on disk) + partially synchronous timing

Modern Context (2026)

How have low-latency GC algorithms changed distributed system design?
?

  • Pre-2018: JVM GC pauses could be seconds → huge problem for distributed locks/leases
  • ZGC (Java 11+, production ready ~2020): max pause ~10ms for multi-TB heaps
  • Shenandoah (Red Hat, Java 8+): Similar pause targets
  • G1GC with tuning: Still has pauses but more predictable

Impact on distributed systems:

  • Short lease timeouts (100ms-1s) are now safer with ZGC
  • Reduced GC-induced false positives in failure detectors
  • Fencing tokens still recommended (GC not eliminated, just reduced)

Caveats:

  • Low-latency GC trades throughput for pause reduction
  • Cloud VMs can still be paused for migration (no JVM flag prevents this)
  • 10ms pauses < most distributed timeout thresholds → acceptable

Bottom line: ZGC/Shenandoah make Java safer for distributed systems but don’t eliminate the fundamental need to design for arbitrary pauses

How does a service mesh handle network unreliability at the infrastructure layer?
?

  • Service mesh (Istio, Linkerd, Consul Connect): A layer of infrastructure proxies (sidecars) alongside each microservice
  • Each proxy intercepts all network traffic in/out of the service

What it provides automatically:

  • Retries: Automatic retry on timeout/5xx (with configurable limits, backoff)
  • Circuit breaking: Stop sending to failing services; fast-fail with fallback
  • Timeout management: Enforce timeouts per-route without application code changes
  • Load balancing: Weighted, health-aware routing
  • mTLS: Automatic mutual TLS between services (no Byzantine trust needed)

Benefits:

  • Application code freed from network retry/timeout logic
  • Consistent failure handling across services in different languages
  • Observable (metrics, tracing automatically captured)

Limitations:

  • Not a silver bullet: doesn’t solve idempotency (app must still handle duplicate requests)
  • Added latency (sidecar proxy hop)
  • Operational complexity of mesh itself

Interview Scenarios

Design a distributed lock service that is safe against GC pauses and network delays.
?
Requirements: Lease-based lock, correct even if holder pauses

Core design:

  1. Lock service (ZooKeeper/etcd with Raft): Issues leases with TTL + fencing tokens

    • Fencing token: monotonically increasing integer
    • acquire(resource) → returns {lease_id, token=42, expires=+30s}
  2. Lock holder (client):

    • Include token in every operation: write(data, token=42)
    • Renew lease before TTL expires (keepalive)
    • On GC pause recovery: re-check if still holding lock before proceeding
  3. Protected resource (storage server):

    • Track max_token_seen per resource
    • Reject any write where token ≤ max_token_seen
    • This prevents stale writes from paused/slow lock holders
  4. Failure handling:

    • If lock holder crashes: TTL expires → another client acquires lock with higher token
    • If lock service crashes: Raft consensus ensures durable leader election state

Key insight: Safety guaranteed by the resource, not by the lock holder — even a slow/paused process can’t corrupt data

Quick Facts

What is the typical NTP clock accuracy and why does it matter?
?

  • NTP accuracy: typically ±10ms under good conditions; can be much worse

    • Loaded servers: ±50ms
    • Network congestion: ±100ms+
    • Misconfigured: drift up to seconds before correction
  • Quartz crystal drift (no NTP): ~200ms per day

  • Google Spanner TrueTime: ±7ms (bounded by GPS + atomic clocks)

  • Why it matters:

    • LWW with timestamps: 10ms drift = writes within 10ms window can be reordered
    • Distributed lock expiry: if client’s clock says “30 seconds from now” and server’s disagrees, expiry time is ambiguous
    • Database backups: timestamp-based incremental backups can miss rows if clocks disagree
  • Rule: For anything that requires correct ordering across nodes, don’t trust wall clock timestamps — use logical clocks

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13