Chapter 8 Flashcards - The Trouble with Distributed Systems

flashcards chapter-8 ddia

Basic Concepts

What is the fundamental ambiguity problem with network timeouts in distributed systems?
?

When a request times out, you cannot determine which of these happened:
1. Request lost before reaching server
2. Server is down / unreachable
3. Server received request but is processing slowly
4. Server processed request, response was lost
5. Server processed request but crashed immediately after
No way to distinguish these cases with certainty over a network
Critical implication: You may retry an operation that already succeeded → must design operations to be idempotent (safe to repeat without different effect)

What is a partial failure and why is it the central challenge of distributed systems?
?

Partial failure: Some nodes in the system are working correctly while others are not
- Example: 3 of 5 nodes reachable; 2 nodes unreachable (partition, crash, slowness)
Why challenging: Unlike single-machine failure (usually total), partial failures require code to handle the in-between state
Contrast with single machines:
- Single machine: usually fails fully (crash) or works fine
- Distributed system: some components can fail independently, unpredictably
Key insight: Distributed systems must be designed to work correctly despite partial failures; cannot assume “if X worked, Y worked too”

Unreliable Networks

What are the differences between a network partition and a node failure?
?

Node failure: A specific machine has crashed or become unavailable
- The machine and its data are gone until recovery
- Other nodes can’t communicate with it at all
Network partition: Two groups of nodes can communicate within their group but not between groups
- All nodes are still running
- Nodes in Group A can talk to each other; Group B nodes can talk to each other
- But Group A and Group B cannot communicate
Why matters: During a network partition, both groups may elect their own leader → split brain → data conflicts
Example: Data center network switch failure creates two groups; each group thinks the other is dead

What is idempotency and why is it essential for distributed systems?
?

Idempotency: An operation can be performed multiple times with the same effect as performing it once
- x = 5 is idempotent (repeating has no additional effect)
- x = x + 1 is NOT idempotent (repeating has additional effect)
- INSERT INTO table ... WHERE NOT EXISTS is idempotent
- INSERT INTO table ... is NOT (creates duplicate)
Why essential: Since network timeouts can’t tell you if a request succeeded, you must retry
Safe retries require idempotent operations (retrying won’t cause double charges, duplicate records, etc.)
Implementation: Use unique request IDs; server deduplicates requests with same ID
- Client: POST /payments {idempotency_key: "txn-abc123", amount: 100}
- Server: if “txn-abc123” seen before → return cached response; else process

Unreliable Clocks

What are the two types of clocks in distributed systems and when should each be used?
?
Time-of-day clock (wall clock):

Returns current date/time (UTC)
Synchronized via NTP → can jump forward or backward
System.currentTimeMillis() (Java), time.time() (Python)
✅ Use for: displaying time, logging, rough timestamp in single system
❌ Never use for: measuring elapsed time, ordering events across nodes, LWW

Monotonic clock:

Always increases, never jumps backward
Measures elapsed time since an arbitrary point (usually process start)
System.nanoTime() (Java), time.monotonic() (Python)
✅ Use for: measuring timeouts, rate limiting, elapsed time within one process
❌ Never use for: comparing time across different nodes (each has independent monotonic clock)

Why is using wall clock timestamps for “last write wins” (LWW) dangerous?
?

Problem: Clocks on different nodes drift and disagree, even with NTP
- NTP accuracy: typically ~10ms; can be much worse on loaded or misconfigured systems
- Quartz crystal drift: ~200ms per day without correction
- Leap seconds can cause clocks to repeat or skip a second
LWW danger scenario:
- Node A writes x=1 with timestamp t=100ms
- Node B writes x=2 with timestamp t=99ms (1ms behind)
- LWW compares timestamps: 100 > 99 → x=1 wins (WRONG! x=2 was more recent in real time)
- A write is silently lost with no error or warning
Safe alternatives:
- Logical clocks (Lamport timestamps) for ordering
- Version vectors for detecting concurrent writes
- Google Spanner TrueTime for physical time with bounded uncertainty

What is Google’s TrueTime API and how does it solve the distributed clock problem?
?

TrueTime: GPS receivers + atomic clocks deployed in every Google datacenter
API: TT.now() returns [earliest, latest] — a time interval, not a point
- Interval represents the uncertainty in current time
- latest - earliest ≤ 7ms (typical), max ~14ms observed
Usage: Before committing a Spanner transaction, wait until TT.after(commit_timestamp) is true
- This ensures the committed timestamp is safely in the past
- Eliminates “time traveler” reads (reading a snapshot from the future)
Cost: Special hardware (GPS antennas, atomic clock cards) in every datacenter
Result: Externally consistent (linearizable) distributed transactions at global scale
Why most systems can’t do this: Expensive hardware + Google-scale infrastructure required

Process Pauses

What is the “GC pause lease expiry” problem and how do fencing tokens solve it?
?
Problem:

Process A acquires distributed lock with 30-second lease, gets token=10
Process A checks: “lease valid” ✅ → begins operation
JVM GC stop-the-world pause begins (process A frozen for 15 seconds)
Lease expires; Process B acquires lock with token=11
GC pause ends; Process A resumes, thinks it still holds lock
A and B both write → data corruption

Fencing token solution:

Lock service issues monotonically increasing fencing token with each lock grant
Clients include token in every write request to the storage service
Storage service rejects any write with a lower token than the highest seen
When A (token=10) tries to write, storage sees token=11 already used → rejects A’s write

Key insight: Moves safety checking to the resource being accessed, not the lock holder — the resource can always verify freshness

What causes process pauses in distributed systems and why can’t they be eliminated?
?
Sources of process pauses:

Garbage collection (JVM/CLR): Stop-the-world GC pauses can last seconds
Virtual machine migration: Cloud hypervisor suspends VM, migrates to another host, resumes
Context switches: OS scheduler preempts process; may not run for milliseconds
Memory pressure: OS swaps process memory to disk (page fault) → seconds-long pause
Disk I/O: Process blocks waiting for synchronous disk read
Signal handlers: OS delivers signal, process handles it
Hardware interrupts: CPU handles interrupt, process not running

Why can’t be eliminated:

GC: Low-pause GC (ZGC, Shenandoah) reduces pauses to <10ms but cannot eliminate
VM migration: Cloud feature, platform decides when to migrate
OS scheduling: Required for multi-process OS operation

Design implication: Always assume any process can pause for an arbitrary duration; design protocols that remain correct despite pauses

System Models

What are the three system timing models used in distributed algorithms?
?
1. Synchronous model:

Bounded network delay (≤ d)
Bounded processing time (≤ p)
Bounded clock drift (≤ ε)
✅ Easiest to reason about; algorithms can use timeouts safely
❌ Unrealistic: real networks can experience arbitrary delays

2. Asynchronous model:

No timing guarantees whatsoever
Algorithms cannot make any assumptions about time
❌ FLP theorem: consensus impossible in this model if any node can fail
❌ Too pessimistic for real systems

3. Partially synchronous model (most realistic):

System behaves synchronously most of the time
But can occasionally violate timing bounds (e.g., during GC pauses, congestion)
Most practical distributed algorithms assume this model
Allows consensus by adding timeouts (though not guaranteed)

What is the FLP impossibility result and what does it mean in practice?
?

FLP = Fischer, Lynch, Paterson (1985 paper)
Result: In a fully asynchronous distributed system, there is no deterministic algorithm that can guarantee consensus (agreement) if even ONE node might crash
Why it matters: Every distributed consensus algorithm (Paxos, Raft) must make timing assumptions to work
- Uses timeouts: “if no response in X seconds, assume node crashed”
- These timeouts work in practice (partially synchronous model) but can violate in worst case
- Raft/Paxos are NOT proven correct in asynchronous model — they work because of timing assumptions
Practical implication: Perfect distributed consensus is impossible; all real consensus algorithms can get stuck or produce wrong results under adversarial network conditions
- Design for “usually works” not “always correct regardless of timing”

What are the three node failure models and which do most distributed algorithms assume?
?
1. Crash-stop:

Node fails by stopping; never resumes
Simplest model for algorithm design

2. Crash-recovery:

Node can crash, then restart and recover state from disk
Must design for: what state is preserved? What is lost?
Most practical: DB nodes crash and restart with durable state intact

3. Byzantine:

Node can behave arbitrarily: lie, send contradictory messages, act maliciously
Requires Byzantine Fault Tolerant (BFT) algorithms (very expensive: ~3x message overhead)
Used in: blockchain (Tendermint, PBFT), aerospace, nuclear systems
NOT needed for most datacenter systems (trusted environment)

Most distributed DB algorithms assume: crash-recovery (with durable state on disk) + partially synchronous timing

Modern Context (2026)

How have low-latency GC algorithms changed distributed system design?
?

Pre-2018: JVM GC pauses could be seconds → huge problem for distributed locks/leases
ZGC (Java 11+, production ready ~2020): max pause ~10ms for multi-TB heaps
Shenandoah (Red Hat, Java 8+): Similar pause targets
G1GC with tuning: Still has pauses but more predictable

Impact on distributed systems:

Short lease timeouts (100ms-1s) are now safer with ZGC
Reduced GC-induced false positives in failure detectors
Fencing tokens still recommended (GC not eliminated, just reduced)

Caveats:

Low-latency GC trades throughput for pause reduction
Cloud VMs can still be paused for migration (no JVM flag prevents this)
10ms pauses < most distributed timeout thresholds → acceptable

Bottom line: ZGC/Shenandoah make Java safer for distributed systems but don’t eliminate the fundamental need to design for arbitrary pauses

How does a service mesh handle network unreliability at the infrastructure layer?
?

Service mesh (Istio, Linkerd, Consul Connect): A layer of infrastructure proxies (sidecars) alongside each microservice
Each proxy intercepts all network traffic in/out of the service

What it provides automatically:

Retries: Automatic retry on timeout/5xx (with configurable limits, backoff)
Circuit breaking: Stop sending to failing services; fast-fail with fallback
Timeout management: Enforce timeouts per-route without application code changes
Load balancing: Weighted, health-aware routing
mTLS: Automatic mutual TLS between services (no Byzantine trust needed)

Benefits:

Application code freed from network retry/timeout logic
Consistent failure handling across services in different languages
Observable (metrics, tracing automatically captured)

Limitations:

Not a silver bullet: doesn’t solve idempotency (app must still handle duplicate requests)
Added latency (sidecar proxy hop)
Operational complexity of mesh itself

Interview Scenarios

Design a distributed lock service that is safe against GC pauses and network delays.
?
Requirements: Lease-based lock, correct even if holder pauses

Core design:

Lock service (ZooKeeper/etcd with Raft): Issues leases with TTL + fencing tokens
- Fencing token: monotonically increasing integer
- acquire(resource) → returns {lease_id, token=42, expires=+30s}
Lock holder (client):
- Include token in every operation: write(data, token=42)
- Renew lease before TTL expires (keepalive)
- On GC pause recovery: re-check if still holding lock before proceeding
Protected resource (storage server):
- Track max_token_seen per resource
- Reject any write where token ≤ max_token_seen
- This prevents stale writes from paused/slow lock holders
Failure handling:
- If lock holder crashes: TTL expires → another client acquires lock with higher token
- If lock service crashes: Raft consensus ensures durable leader election state

Key insight: Safety guaranteed by the resource, not by the lock holder — even a slow/paused process can’t corrupt data

Quick Facts

What is the typical NTP clock accuracy and why does it matter?
?

NTP accuracy: typically ±10ms under good conditions; can be much worse
- Loaded servers: ±50ms
- Network congestion: ±100ms+
- Misconfigured: drift up to seconds before correction
Quartz crystal drift (no NTP): ~200ms per day
Google Spanner TrueTime: ±7ms (bounded by GPS + atomic clocks)
Why it matters:
- LWW with timestamps: 10ms drift = writes within 10ms window can be reordered
- Distributed lock expiry: if client’s clock says “30 seconds from now” and server’s disagrees, expiry time is ambiguous
- Database backups: timestamp-based incremental backups can miss rows if clocks disagree
Rule: For anything that requires correct ordering across nodes, don’t trust wall clock timestamps — use logical clocks

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch08-flashcards