Chapter 8 Flashcards - The Trouble with Distributed Systems
Basic Concepts
What is the fundamental ambiguity problem with network timeouts in distributed systems?
?
-
When a request times out, you cannot determine which of these happened:
- Request lost before reaching server
- Server is down / unreachable
- Server received request but is processing slowly
- Server processed request, response was lost
- Server processed request but crashed immediately after
-
No way to distinguish these cases with certainty over a network
-
Critical implication: You may retry an operation that already succeeded → must design operations to be idempotent (safe to repeat without different effect)
What is a partial failure and why is it the central challenge of distributed systems?
?
-
Partial failure: Some nodes in the system are working correctly while others are not
- Example: 3 of 5 nodes reachable; 2 nodes unreachable (partition, crash, slowness)
-
Why challenging: Unlike single-machine failure (usually total), partial failures require code to handle the in-between state
-
Contrast with single machines:
- Single machine: usually fails fully (crash) or works fine
- Distributed system: some components can fail independently, unpredictably
-
Key insight: Distributed systems must be designed to work correctly despite partial failures; cannot assume “if X worked, Y worked too”
Unreliable Networks
What are the differences between a network partition and a node failure?
?
-
Node failure: A specific machine has crashed or become unavailable
- The machine and its data are gone until recovery
- Other nodes can’t communicate with it at all
-
Network partition: Two groups of nodes can communicate within their group but not between groups
- All nodes are still running
- Nodes in Group A can talk to each other; Group B nodes can talk to each other
- But Group A and Group B cannot communicate
-
Why matters: During a network partition, both groups may elect their own leader → split brain → data conflicts
-
Example: Data center network switch failure creates two groups; each group thinks the other is dead
What is idempotency and why is it essential for distributed systems?
?
-
Idempotency: An operation can be performed multiple times with the same effect as performing it once
x = 5is idempotent (repeating has no additional effect)x = x + 1is NOT idempotent (repeating has additional effect)INSERT INTO table ... WHERE NOT EXISTSis idempotentINSERT INTO table ...is NOT (creates duplicate)
-
Why essential: Since network timeouts can’t tell you if a request succeeded, you must retry
-
Safe retries require idempotent operations (retrying won’t cause double charges, duplicate records, etc.)
-
Implementation: Use unique request IDs; server deduplicates requests with same ID
- Client:
POST /payments {idempotency_key: "txn-abc123", amount: 100} - Server: if “txn-abc123” seen before → return cached response; else process
- Client:
Unreliable Clocks
What are the two types of clocks in distributed systems and when should each be used?
?
Time-of-day clock (wall clock):
- Returns current date/time (UTC)
- Synchronized via NTP → can jump forward or backward
System.currentTimeMillis()(Java),time.time()(Python)- ✅ Use for: displaying time, logging, rough timestamp in single system
- ❌ Never use for: measuring elapsed time, ordering events across nodes, LWW
Monotonic clock:
- Always increases, never jumps backward
- Measures elapsed time since an arbitrary point (usually process start)
System.nanoTime()(Java),time.monotonic()(Python)- ✅ Use for: measuring timeouts, rate limiting, elapsed time within one process
- ❌ Never use for: comparing time across different nodes (each has independent monotonic clock)
Why is using wall clock timestamps for “last write wins” (LWW) dangerous?
?
-
Problem: Clocks on different nodes drift and disagree, even with NTP
- NTP accuracy: typically ~10ms; can be much worse on loaded or misconfigured systems
- Quartz crystal drift: ~200ms per day without correction
- Leap seconds can cause clocks to repeat or skip a second
-
LWW danger scenario:
- Node A writes x=1 with timestamp t=100ms
- Node B writes x=2 with timestamp t=99ms (1ms behind)
- LWW compares timestamps: 100 > 99 → x=1 wins (WRONG! x=2 was more recent in real time)
- A write is silently lost with no error or warning
-
Safe alternatives:
- Logical clocks (Lamport timestamps) for ordering
- Version vectors for detecting concurrent writes
- Google Spanner TrueTime for physical time with bounded uncertainty
What is Google’s TrueTime API and how does it solve the distributed clock problem?
?
-
TrueTime: GPS receivers + atomic clocks deployed in every Google datacenter
-
API:
TT.now()returns[earliest, latest]— a time interval, not a point- Interval represents the uncertainty in current time
latest - earliest≤ 7ms (typical), max ~14ms observed
-
Usage: Before committing a Spanner transaction, wait until
TT.after(commit_timestamp)is true- This ensures the committed timestamp is safely in the past
- Eliminates “time traveler” reads (reading a snapshot from the future)
-
Cost: Special hardware (GPS antennas, atomic clock cards) in every datacenter
-
Result: Externally consistent (linearizable) distributed transactions at global scale
-
Why most systems can’t do this: Expensive hardware + Google-scale infrastructure required
Process Pauses
What is the “GC pause lease expiry” problem and how do fencing tokens solve it?
?
Problem:
- Process A acquires distributed lock with 30-second lease, gets token=10
- Process A checks: “lease valid” ✅ → begins operation
- JVM GC stop-the-world pause begins (process A frozen for 15 seconds)
- Lease expires; Process B acquires lock with token=11
- GC pause ends; Process A resumes, thinks it still holds lock
- A and B both write → data corruption
Fencing token solution:
- Lock service issues monotonically increasing fencing token with each lock grant
- Clients include token in every write request to the storage service
- Storage service rejects any write with a lower token than the highest seen
- When A (token=10) tries to write, storage sees token=11 already used → rejects A’s write
Key insight: Moves safety checking to the resource being accessed, not the lock holder — the resource can always verify freshness
What causes process pauses in distributed systems and why can’t they be eliminated?
?
Sources of process pauses:
- Garbage collection (JVM/CLR): Stop-the-world GC pauses can last seconds
- Virtual machine migration: Cloud hypervisor suspends VM, migrates to another host, resumes
- Context switches: OS scheduler preempts process; may not run for milliseconds
- Memory pressure: OS swaps process memory to disk (page fault) → seconds-long pause
- Disk I/O: Process blocks waiting for synchronous disk read
- Signal handlers: OS delivers signal, process handles it
- Hardware interrupts: CPU handles interrupt, process not running
Why can’t be eliminated:
- GC: Low-pause GC (ZGC, Shenandoah) reduces pauses to <10ms but cannot eliminate
- VM migration: Cloud feature, platform decides when to migrate
- OS scheduling: Required for multi-process OS operation
Design implication: Always assume any process can pause for an arbitrary duration; design protocols that remain correct despite pauses
System Models
What are the three system timing models used in distributed algorithms?
?
1. Synchronous model:
- Bounded network delay (≤ d)
- Bounded processing time (≤ p)
- Bounded clock drift (≤ ε)
- ✅ Easiest to reason about; algorithms can use timeouts safely
- ❌ Unrealistic: real networks can experience arbitrary delays
2. Asynchronous model:
- No timing guarantees whatsoever
- Algorithms cannot make any assumptions about time
- ❌ FLP theorem: consensus impossible in this model if any node can fail
- ❌ Too pessimistic for real systems
3. Partially synchronous model (most realistic):
- System behaves synchronously most of the time
- But can occasionally violate timing bounds (e.g., during GC pauses, congestion)
- Most practical distributed algorithms assume this model
- Allows consensus by adding timeouts (though not guaranteed)
What is the FLP impossibility result and what does it mean in practice?
?
-
FLP = Fischer, Lynch, Paterson (1985 paper)
-
Result: In a fully asynchronous distributed system, there is no deterministic algorithm that can guarantee consensus (agreement) if even ONE node might crash
-
Why it matters: Every distributed consensus algorithm (Paxos, Raft) must make timing assumptions to work
- Uses timeouts: “if no response in X seconds, assume node crashed”
- These timeouts work in practice (partially synchronous model) but can violate in worst case
- Raft/Paxos are NOT proven correct in asynchronous model — they work because of timing assumptions
-
Practical implication: Perfect distributed consensus is impossible; all real consensus algorithms can get stuck or produce wrong results under adversarial network conditions
- Design for “usually works” not “always correct regardless of timing”
What are the three node failure models and which do most distributed algorithms assume?
?
1. Crash-stop:
- Node fails by stopping; never resumes
- Simplest model for algorithm design
2. Crash-recovery:
- Node can crash, then restart and recover state from disk
- Must design for: what state is preserved? What is lost?
- Most practical: DB nodes crash and restart with durable state intact
3. Byzantine:
- Node can behave arbitrarily: lie, send contradictory messages, act maliciously
- Requires Byzantine Fault Tolerant (BFT) algorithms (very expensive: ~3x message overhead)
- Used in: blockchain (Tendermint, PBFT), aerospace, nuclear systems
- NOT needed for most datacenter systems (trusted environment)
Most distributed DB algorithms assume: crash-recovery (with durable state on disk) + partially synchronous timing
Modern Context (2026)
How have low-latency GC algorithms changed distributed system design?
?
- Pre-2018: JVM GC pauses could be seconds → huge problem for distributed locks/leases
- ZGC (Java 11+, production ready ~2020): max pause ~10ms for multi-TB heaps
- Shenandoah (Red Hat, Java 8+): Similar pause targets
- G1GC with tuning: Still has pauses but more predictable
Impact on distributed systems:
- Short lease timeouts (100ms-1s) are now safer with ZGC
- Reduced GC-induced false positives in failure detectors
- Fencing tokens still recommended (GC not eliminated, just reduced)
Caveats:
- Low-latency GC trades throughput for pause reduction
- Cloud VMs can still be paused for migration (no JVM flag prevents this)
- 10ms pauses < most distributed timeout thresholds → acceptable
Bottom line: ZGC/Shenandoah make Java safer for distributed systems but don’t eliminate the fundamental need to design for arbitrary pauses
How does a service mesh handle network unreliability at the infrastructure layer?
?
- Service mesh (Istio, Linkerd, Consul Connect): A layer of infrastructure proxies (sidecars) alongside each microservice
- Each proxy intercepts all network traffic in/out of the service
What it provides automatically:
- Retries: Automatic retry on timeout/5xx (with configurable limits, backoff)
- Circuit breaking: Stop sending to failing services; fast-fail with fallback
- Timeout management: Enforce timeouts per-route without application code changes
- Load balancing: Weighted, health-aware routing
- mTLS: Automatic mutual TLS between services (no Byzantine trust needed)
Benefits:
- Application code freed from network retry/timeout logic
- Consistent failure handling across services in different languages
- Observable (metrics, tracing automatically captured)
Limitations:
- Not a silver bullet: doesn’t solve idempotency (app must still handle duplicate requests)
- Added latency (sidecar proxy hop)
- Operational complexity of mesh itself
Interview Scenarios
Design a distributed lock service that is safe against GC pauses and network delays.
?
Requirements: Lease-based lock, correct even if holder pauses
Core design:
-
Lock service (ZooKeeper/etcd with Raft): Issues leases with TTL + fencing tokens
- Fencing token: monotonically increasing integer
acquire(resource)→ returns{lease_id, token=42, expires=+30s}
-
Lock holder (client):
- Include token in every operation:
write(data, token=42) - Renew lease before TTL expires (keepalive)
- On GC pause recovery: re-check if still holding lock before proceeding
- Include token in every operation:
-
Protected resource (storage server):
- Track
max_token_seenper resource - Reject any write where
token ≤ max_token_seen - This prevents stale writes from paused/slow lock holders
- Track
-
Failure handling:
- If lock holder crashes: TTL expires → another client acquires lock with higher token
- If lock service crashes: Raft consensus ensures durable leader election state
Key insight: Safety guaranteed by the resource, not by the lock holder — even a slow/paused process can’t corrupt data
Quick Facts
What is the typical NTP clock accuracy and why does it matter?
?
-
NTP accuracy: typically ±10ms under good conditions; can be much worse
- Loaded servers: ±50ms
- Network congestion: ±100ms+
- Misconfigured: drift up to seconds before correction
-
Quartz crystal drift (no NTP): ~200ms per day
-
Google Spanner TrueTime: ±7ms (bounded by GPS + atomic clocks)
-
Why it matters:
- LWW with timestamps: 10ms drift = writes within 10ms window can be reordered
- Distributed lock expiry: if client’s clock says “30 seconds from now” and server’s disagrees, expiry time is ambiguous
- Database backups: timestamp-based incremental backups can miss rows if clocks disagree
-
Rule: For anything that requires correct ordering across nodes, don’t trust wall clock timestamps — use logical clocks
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13