Chapter 2 Flashcards — Defining Nonfunctional Requirements

flashcards ddia-2e chapter2

Definitions

What is the difference between a fault and a failure?
?

  • Fault: A single component deviating from its specification (e.g., a disk crashes, a network packet is dropped, a bug is triggered in an edge case)
  • Failure: The system as a whole stops providing the required service to users

A fault-tolerant (resilient) system prevents faults from escalating into failures.

  • Key insight: Do NOT try to prevent faults (impossible at scale) — prevent faults from causing failures
  • Example: Netflix Chaos Monkey intentionally creates faults to verify fault tolerance

What are the three types of faults and their key characteristics?
?

  1. Hardware faults — Random, independent; disks crash, RAM errors, power outages

    • MTTF 10–50 years per disk; at 10,000 disks, expect ~1/day
    • Mitigation: Hardware redundancy (RAID, failover) + software replication
  2. Software faults — Systematic, correlated across all nodes running the same code

    • A single bug can take down every replica simultaneously
    • Examples: Leap second bug (2012), runaway processes, cascading failures
    • Mitigation: Testing, isolation, monitoring, crash-only software design
  3. Human errors — Leading cause of production outages (majority of internet service outages)

    • Typically configuration or operational mistakes, not hardware
    • Mitigation: Good design (make wrong thing hard), gradual rollout, fast rollback, detailed monitoring

What is operability and what does a highly operable system look like?
?
Operability: Making it easy for operations teams to keep the system running smoothly day to day.

A highly operable system:

  • Has visibility: metrics, logs, distributed traces, and alerts that answer “what is the system doing right now?”
  • Supports automation: integrates with CI/CD, IaC (Terraform), and config management
  • Has predictable behavior: no surprising state transitions or auto-tuning magic
  • Provides good documentation: runbooks, architecture decision records, dependency maps
  • Avoids single points of human knowledge: bus factor > 1
  • Has good defaults: works without manual tuning; overridable when needed

What is the difference between essential complexity and accidental complexity?
?
Essential complexity: Inherent in the problem being solved; cannot be eliminated

  • Example: Distributed consensus IS fundamentally hard; handling concurrent writes IS complex
  • These must be dealt with — abstractions can hide them but not eliminate them

Accidental complexity: Introduced by poor design choices; could be avoided

  • Example: Spaghetti code, circular module dependencies, inconsistent APIs, magic globals
  • Can and should be eliminated through refactoring and better abstractions

Simplicity in DDIA means reducing accidental complexity, not essential complexity.
Good abstractions hide essential complexity behind clean interfaces (SQL hides B-trees, TCP hides packet loss and retransmission).

What are SLI, SLO, and SLA? What is an error budget?
?
SLI (Service Level Indicator): A measurable metric

  • Example: “Fraction of requests completing in < 200ms over a 1-minute window”

SLO (Service Level Objective): Internal target for an SLI

  • Example: “SLI must be ≥ 99.5% measured over a rolling 28-day window”

SLA (Service Level Agreement): Contractual commitment with consequences for violation

  • Example: “If SLO is missed, customer receives a 10% bill credit”

Error budget: The amount of unreliability the SLO allows

  • Formula: error budget = 1 - SLO
  • Example: 99.5% SLO → 0.5% error budget → ~3.6 hours/month of allowable failure
  • When budget is exhausted, freeze risky deployments; focus on reliability
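The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not something from the book: the function names and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of total unavailability the SLO tolerates within the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.5% SLO over 30 days: about 3.6 hours of allowable failure
    print(error_budget_minutes(0.995) / 60)
    # 2,500 failures out of 1,000,000 requests uses about half the budget
    print(budget_remaining(0.995, 1_000_000, 2_500))
```

A negative result from budget_remaining is the signal to freeze risky deployments.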

What are the three hardware architectures for scaling introduced in Ch2?
?
Shared-Memory (SMP):

  • Multiple CPUs share one pool of RAM and disk
  • Simple; any thread can access any data; no distribution
  • Expensive at scale; non-linear cost curve; single point of failure
  • Use for: workloads that fit on one large machine

Shared-Disk:

  • Multiple nodes have own CPU+RAM but share central storage (NAS, SAN, or cloud object store like S3)
  • Cloud version: Snowflake, Redshift — compute and storage scale independently
  • Use for: cloud data warehouses where compute and storage should scale separately

Shared-Nothing:

  • Each node has own CPU, RAM, and disk; communicate only via network
  • Data is partitioned (sharded) across nodes
  • Horizontally scalable; linear cost; no single hardware bottleneck
  • Requires: sharding, distributed state management, network coordination
  • Use for: large-scale distributed systems — Cassandra, Kafka, Spanner, CockroachDB

Trade-offs and Comparisons

What is tail latency amplification and why does it matter?
?
Tail latency amplification: When a request fans out to N backend calls in parallel, its response time is set by the slowest call, so the chance of hitting at least one slow call grows with every additional call.

Formula (assuming independent calls): P(all N calls finish within T) = P(single call < T)^N

Example: 10 parallel calls, each with p99 = 100ms

  • P(single call < 100ms) = 0.99
  • P(all 10 < 100ms) = 0.99^10 ≈ 0.904
  • So 100ms is only about the composite request’s p90 (not p99!)
  • The composite p99 is significantly worse: nearly 10% of user requests are slower than 100ms

Why it matters: Systems with large fan-out (timeline reads calling many services) see dramatically worse tail latency than any individual service. This is why large-scale systems obsess over p99 even when p50 looks fine.
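A small calculation makes the effect concrete. This sketch assumes the backend calls are independent, so the probability that all N finish within T is the single-call probability raised to the Nth power; the function names are made up for this card.

```python
def p_all_within(p_single: float, n_calls: int) -> float:
    """Probability that every one of n independent calls finishes within T."""
    return p_single ** n_calls


def required_single_quantile(p_composite: float, n_calls: int) -> float:
    """Per-call quantile needed so the fan-out as a whole meets p_composite."""
    return p_composite ** (1.0 / n_calls)


if __name__ == "__main__":
    for n in (1, 10, 100):
        # Share of composite requests that stay under the per-call p99 threshold
        print(n, round(p_all_within(0.99, n), 3))
    # To keep the composite p99 at T with 10 calls, each call needs roughly its p99.9 at T
    print(round(required_single_quantile(0.99, 10), 4))
```

Read the other way round: a fan-out of 10 only meets a p99 target if each backend meets that same target at roughly its own p99.9.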

What is the fan-out problem in social network timelines and what are the solution approaches?
?
Problem: When a user posts, deliver the post to all followers’ home timelines efficiently.

Fan-out on Read (Pull model):

  • Write: store the post once in a global posts table
  • Read: execute a join (posts × follows) on every timeline request
  • Pro: Simple writes, no duplication
  • Con: Expensive reads at scale; join over millions of posts and follows is slow

Fan-out on Write (Push model):

  • Write: propagate post to each follower’s timeline cache
  • Read: serve the pre-computed cache → fast
  • Pro: O(1) reads
  • Con: Write amplification — 1M followers × 1 post = 1M cache writes; celebrity with 30M followers is catastrophic

Hybrid (correct answer):

  • Normal users (< ~100K followers): push to timeline caches
  • Celebrities (> ~100K followers): pull/merge at read time
  • Threshold is tunable; configurable per user
  • Key insight: the load parameter is follower count distribution, not QPS
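The hybrid approach can be sketched as a toy in-memory model. Everything here is hypothetical (class name, method names, the counter used in place of timestamps); a real system would use durable storage and asynchronous fan-out workers.

```python
from collections import defaultdict
from itertools import count


class HybridTimelines:
    """Toy model: push for normal authors, pull-and-merge for celebrities."""

    def __init__(self, celebrity_threshold: int = 100_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.caches = defaultdict(list)           # user -> posts pushed at write time
        self.celebrity_posts = defaultdict(list)  # author -> posts kept for read-time pull
        self._seq = count()                       # stand-in for a timestamp

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def post(self, author, text):
        entry = (next(self._seq), author, text)
        if len(self.followers[author]) >= self.celebrity_threshold:
            # Celebrity: store once, no per-follower writes
            self.celebrity_posts[author].append(entry)
        else:
            # Normal user: fan out on write to every follower's cache
            for follower in self.followers[author]:
                self.caches[follower].append(entry)

    def home_timeline(self, user, limit=20):
        # Merge the precomputed cache with posts pulled from followed celebrities
        merged = list(self.caches[user])
        for author in self.following[user]:
            merged.extend(self.celebrity_posts.get(author, []))
        merged.sort(reverse=True)  # newest first
        return merged[:limit]
```

Each post takes exactly one path (push or pull), so the merge at read time never produces duplicates.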

Why are percentiles better than averages for measuring response time?
?
Averages hide tail latencies: If 98 requests take 10ms and 2 take 10,000ms:

  • Arithmetic mean ≈ 210ms: overstates typical experience and understates the worst case simultaneously
  • p50 = 10ms: reflects the typical user experience
  • p99 = 10,000ms: captures the slow tail that the mean blurs

Why tail latencies matter:

  1. Users with the most data (best customers) often experience the worst latency
  2. Amazon: 100ms increase in response time = 1% decrease in sales
  3. Tail latency amplification (see above) compounds the problem in distributed systems

Practical rule: Never use averages in SLOs. Always specify percentile: “p99 < 200ms” not “average < 200ms.”
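A short script shows how far the mean drifts from both the median and the tail. It assumes a hypothetical sample of 98 fast and 2 slow requests (chosen so that the 99th-ranked value lands in the slow tail) and uses the nearest-rank percentile definition; other definitions interpolate and give slightly different values.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 slow ones, in milliseconds
latencies_ms = [10] * 98 + [10_000] * 2

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p99: ", percentile(latencies_ms, 99))
```

No request in the sample actually took anywhere near the mean, which is why it makes a poor SLO target.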

What is the difference between vertical scaling and horizontal scaling?
?
Vertical scaling (scale-up):

  • Replace the current machine with a more powerful one
  • More CPUs, RAM, faster storage
  • Pro: No code changes required; simple operationally
  • Con: Cost scales non-linearly; hardware limit; single point of failure
  • Best first step; most apps never need to go beyond this

Horizontal scaling (scale-out):

  • Add more machines; distribute load across them
  • Requires data sharding and load balancing
  • Pro: Theoretically unlimited; uses commodity hardware; fault tolerant
  • Con: Complex — requires sharding, replication, distributed state management, network coordination
  • Use when vertical scaling is not cost-effective or hits hardware limits

Principle: Always try vertical scaling first. Horizontal scaling introduces distributed systems complexity that compounds over time.

What makes evolvability different from just “writing good code”?
?
Evolvability is specifically about making the system easy to change as requirements evolve — it goes beyond code quality to architectural design.

Why it’s important: Requirements change constantly. A system that is hard to change calcifies — teams avoid making necessary changes because the risk is too high, leading to technical debt accumulation and eventually “big bang rewrites.”

Evolvability requires:

  1. Good abstractions: Changes should be localized; interfaces stable even as implementations change
  2. Testability: Tests serve as a safety net for refactoring; catch regressions automatically
  3. Decoupling: Components that evolve at different rates should be separate
  4. Schema evolution: Data formats must support backward/forward compatibility (connects to Ch5 Encoding)
  5. Agile practices: TDD, CI/CD, feature flags enable small safe changes rather than big risky ones

Connection to simplicity: Simple, well-abstracted systems are easier to change. Accidental complexity is the primary enemy of evolvability.

Numbers and Precision

What are the key uptime numbers for “nines” of availability?
?

SLO                     | Downtime per year | Downtime per month
------------------------|-------------------|-------------------
99%                     | 87.6 hours        | 7.3 hours
99.9% (“three nines”)   | 8.76 hours        | 43.8 minutes
99.99% (“four nines”)   | 52.6 minutes      | 4.38 minutes
99.999% (“five nines”)  | 5.26 minutes      | 26.3 seconds

Context: Five nines requires < 5.3 minutes downtime per year — this means deployments, maintenance, and incidents must all fit within that budget. Extremely expensive to achieve.

Practical target: Most web services target 99.9%–99.99%. Five nines is reserved for critical infrastructure (telecom, payment networks, aviation).

What is the MTTF of a hard disk and what is the practical implication at scale?
?
MTTF (Mean Time To Failure): 10–50 years per disk in lab conditions.

Practical implication at scale:

  • At 10,000 disks: expect approximately 1 disk failure per day (10,000 disks / 10,000-day MTTF ≈ 1/day)
  • Google operates millions of machines — hardware failure is a daily, routine event, not an exception
  • The largest cloud providers run even bigger storage fleets; at that scale, any 1-in-a-million event happens many times per day

Conclusion: Hardware failure is normal and expected at scale. Design for it in software. Do not rely on hardware redundancy alone. Assume disks will fail; replicate data; test automatic failover.
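The fleet-level estimate is a one-line division. This sketch assumes a constant, independent failure rate per disk (a simplification) and a hypothetical function name.

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming a constant and independent failure rate."""
    if mttf_years <= 0:
        raise ValueError("mttf_years must be positive")
    mttf_days = mttf_years * 365
    return disk_count / mttf_days


if __name__ == "__main__":
    # Across the 10 to 50 year MTTF range, 10,000 disks give between about 2.7 and 0.5 failures a day
    for years in (10, 27.4, 50):
        print(years, round(expected_failures_per_day(10_000, years), 2))
```

An MTTF of about 27 years (roughly 10,000 days) sits in the middle of that range and gives the "one failure per day" rule of thumb.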

What is the Amazon latency impact number and what does it imply?
?
Amazon’s finding: A 100ms increase in response time = approximately 1% decrease in sales.

Implications:

  1. Latency is a direct business metric, not just a technical concern
  2. Tail latencies (p99, p99.9) affect real users and real revenue
  3. p99 users are often your best customers (most data = most engaged = most valuable)
  4. Optimizing from 200ms to 100ms response time has measurable ROI

Application: When arguing for latency improvements or p99 SLOs to business stakeholders, use this framing: latency is a conversion rate factor, not just a developer comfort issue.

Application and Failure Modes

You’re seeing good p50 latency but terrible p99. What do you investigate?
?
This pattern suggests tail latency from outlier requests, not general slowness.

Investigation areas:

  1. Database queries: Are slow p99 requests hitting unindexed queries? Do users with more data get slower queries?
  2. GC pauses: JVM or .NET garbage collection pauses spike p99. Check GC logs.
  3. Lock contention: Rare but expensive locks (database table locks, mutex contention) affect p99
  4. External dependencies: One external API call with high tail latency? Check per-service p99.
  5. Head-of-line blocking: Is a thread pool saturated? Are slow requests queuing behind fast ones?
  6. Cold cache: First request for a new object misses cache; p99 may represent cache misses

Solutions:

  • Add caching for expensive repeated operations
  • Query optimization + proper indexing
  • Separate thread pools per operation type (bulkhead pattern)
  • Aggressive timeouts on external calls + fallbacks
  • Async processing for non-critical work

How would you design the social network timeline system to handle a celebrity with 100M followers posting a tweet?
?
Problem: Naive fan-out on write → 100M cache writes per tweet → system falls over.

Solution: Hybrid fan-out:

  1. Identify celebrities: Any user with followers > threshold (e.g., 1M) is in “celebrity mode”
  2. Celebrity posts: Do NOT write to 100M individual timeline caches
    • Store the tweet in a celebrities table / high-follower post index
  3. Regular users’ posts: Fan-out on write as normal (push to follower timeline caches)
  4. Timeline read: Merge two sources:
    • Pre-computed cache (regular users’ posts, fast O(1) read)
    • Live query against celebrities the user follows (small set: most users follow < 100 celebrities)
    • Sort and merge by timestamp

Additional optimizations:

  • Pre-warm celebrity post caches (distributed to regional caches before read requests arrive)
  • Use lazy fan-out for moderately popular users (fan out only when they have active followers)
  • Rate-limit the fan-out worker to prevent spikes

What failure modes emerge when a system lacks operability?
?
Failure modes from poor operability:

  1. Silent failures: System is degraded but no alert fires → hours of customer impact before detection
  2. Runbook gaps: On-call engineer can’t find procedure for known failure mode → slow resolution
  3. Dependency mysteries: Service fails because upstream dependency changed silently; no visibility
  4. Configuration drift: Prod environment diverges from documented state; team doesn’t know actual config
  5. Hero dependency: One engineer knows how the system works; when they’re unavailable, no one can debug
  6. Manual capacity management: Traffic spike → team manually adds servers → 30 min to respond → customers see errors

What good operability prevents:

  • Monitoring catches p99 spike before customers complain
  • CI/CD pipeline with automated rollback reverts bad deployment in minutes
  • Runbook guides any on-call engineer through known failure modes
  • Infrastructure-as-code ensures prod matches documentation

When does a monolith become a scalability bottleneck, and what are the signs?
?
Signs a monolith is becoming a bottleneck:

  1. Deployment coupling: One team’s change requires redeploying everything; slow release cycles
  2. Resource contention: One high-CPU component starves others sharing the same process
  3. Independent scaling need: One module needs 10x more compute; can’t scale it without scaling everything
  4. Team coordination overhead: > 30–50 engineers → merge conflicts and coordination costs dominate
  5. Technology mismatch: Different components would benefit from different DB/language choices

Important: Most of these are organizational/team problems, not technical ones. The technical scalability of a well-built monolith is often higher than teams assume.

Threshold for extraction: Extract a service when:

  • Team autonomy and independent deployment are the bottleneck (not raw compute)
  • The service boundary is clear and stable (not likely to need cross-service transactions)
  • Operational maturity exists to run distributed services (monitoring, service mesh, distributed tracing)

What is head-of-line blocking and why does it affect tail latency?
?
Head-of-line blocking: A slow request occupying a thread or connection prevents all requests queued behind it from being processed, even if those requests would be fast on their own.

Why it inflates tail latency:

  • A single worker thread (or a pool whose threads are all busy with slow work) picks up a slow request (5 seconds) with 100 fast requests (1ms each) queued behind it
  • The fast requests wait for the slow one; the first of them has a measured response time of about 5001ms, not 1ms
  • p99 includes these artificially slow fast requests

Mitigations:

  • Shorter timeouts on requests (fail fast, don’t hold threads)
  • Separate thread pools per operation type (bulkhead pattern): slow operations can’t block fast ones
  • Async/non-blocking I/O: threads don’t block waiting for I/O
  • Request hedging: send duplicate request to a second server after a timeout, use whichever responds first
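The queueing effect is easy to reproduce with a tiny simulation. This sketch assumes all requests arrive at time 0 and are served first-in, first-out by a fixed number of identical workers; the function name is invented for this card.

```python
import heapq


def fifo_response_times(service_times_ms, workers=1):
    """Response time of each request when all arrive at t=0 and are served in FIFO order."""
    free_at = [0.0] * workers  # min-heap of the time each worker next becomes free
    heapq.heapify(free_at)
    response_times = []
    for service_time in service_times_ms:
        start = heapq.heappop(free_at)   # earliest available worker takes the request
        finish = start + service_time
        response_times.append(finish)    # arrival was at t=0, so finish == response time
        heapq.heappush(free_at, finish)
    return response_times


if __name__ == "__main__":
    requests = [5000] + [1] * 100        # one slow request ahead of 100 fast ones
    blocked = fifo_response_times(requests, workers=1)
    relieved = fifo_response_times(requests, workers=2)
    print("single worker, slowest fast request:", max(blocked[1:]))
    print("two workers, slowest fast request:  ", max(relieved[1:]))
```

With a second worker the slow request occupies one thread while the fast ones drain through the other, which is the same idea the bulkhead pattern applies per operation type.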

What is the error budget concept in SRE and how does it change team behavior?
?
Error budget: The amount of unreliability (downtime, errors) that a system’s SLO allows within a time window.

  • Formula: error budget = 1 - SLO
  • Example: 99.9% SLO → 0.1% error budget → ~43 minutes downtime per month

How it changes behavior:

  • When budget is healthy: Team can take risks (deploy risky changes, experiment in production)
  • When budget is nearly exhausted: Team must slow down risky deployments and focus on reliability
  • When budget is depleted: Feature deployments freeze until reliability is restored

Why it works: Creates a shared contract between development (wants to move fast) and operations (wants stability). Neither side can always “win” — the budget forces negotiation and trade-offs based on data, not politics.

What is chaos engineering and how does it differ from ordinary testing?
?
Chaos engineering: Intentionally injecting failures into a running system (often production) to verify that fault tolerance mechanisms actually work.

Difference from ordinary testing:

  • Unit/integration tests: Verify code correctness in controlled, isolated conditions
  • Chaos engineering: Verifies system resilience under realistic, production conditions — including unexpected combinations of failures

Why production matters: Staging environments don’t replicate all dependencies, traffic patterns, and state. Only production exposes real failure modes.

Examples:

  • Netflix Chaos Monkey: randomly terminates EC2 instances in production
  • Chaos Mesh (Kubernetes): injects network partitions, pod failures, CPU pressure
  • Gremlin: commercial platform for controlled chaos experiments

What it finds: Missing retries, inadequate timeouts, single points of failure that redundancy didn’t cover, cascading failures that tests didn’t anticipate.

What is the relationship between simplicity and evolvability?
?
Core relationship: Simple systems are easier to change. Accidental complexity is the primary enemy of evolvability.

Chain of reasoning:

  1. Complex systems have tight coupling and tangled dependencies
  2. Tight coupling means changing component A requires changing B, C, D
  3. Each required change adds risk of introducing bugs
  4. Risk discourages change → team avoids necessary modifications → technical debt grows
  5. Eventually: “big bang rewrite” rather than incremental evolution

Simplicity enables evolvability through:

  • Good abstractions: Changes localized to implementation; interface remains stable
  • Decoupling: Components that evolve at different rates can be changed independently
  • Testability: Tests catch regressions, making change safe
  • Clear naming: Engineers understand what code does → can modify it confidently

Practical implication: Every time you reduce accidental complexity, you are also improving the system’s ability to change in the future.

What distinguishes response time from latency from service time?
?
Three related but distinct concepts:

Response time (client perspective):

  • Total time from client sending request to receiving response
  • Includes: network round-trip + queuing time + service time
  • What users and SLOs should measure

Latency (loose usage):

  • Often used interchangeably with response time in common practice
  • Strict definition: time a request is “latent” (waiting to be handled), excluding service time
  • In practice: use “response time” for precision, “latency” in casual conversation

Service time (server perspective):

  • Time the server spends actually processing the request
  • Excludes network time and queuing time
  • Lower bound for response time

Why the distinction matters: A service with 1ms service time can still show 500ms response time due to network, queuing at overloaded upstream, or head-of-line blocking. Optimizing service time without addressing queuing is often wasted effort.

What is the circuit breaker pattern and when does it apply?
?
Circuit breaker: A fault tolerance pattern that stops sending requests to a failing service rather than letting each request time out individually.

States:

  1. Closed (normal): requests flow through; failures counted
  2. Open (tripped): requests immediately fail without attempting the call; entered when failure count exceeds threshold
  3. Half-open (recovery probe): a small number of requests allowed through to test if service has recovered

Why it matters:

  • Without it: every call to a failing service waits for the timeout (e.g., 30s); threads pile up; caller also starts failing
  • With it: calls fail immediately once the circuit is open; caller can fall back gracefully; failing service gets a chance to recover without being bombarded

Implementation: Hystrix (legacy), Resilience4j (Java), Polly (.NET), Envoy/Istio service mesh (infrastructure level)

Connection to reliability: Circuit breakers are a key mechanism for preventing cascading failures — one of the hardest reliability problems in distributed systems.
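The three states can be sketched as a small state machine. This is an illustrative, single-threaded sketch with hypothetical names (CircuitBreaker, CircuitOpenError, failure_threshold, reset_timeout_s); it is not the API of Hystrix, Resilience4j, or Polly, and it counts consecutive failures rather than a failure rate.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0        # consecutive failures seen while closed
        self._opened_at = None    # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"    # timeout elapsed: allow a probe call
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or too many failures while closed, (re)opens the circuit
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

Callers would wrap each remote call in breaker.call(...) and treat CircuitOpenError as the cue to use a fallback. Injecting the clock keeps the timeout logic testable without sleeping.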

What is a rolling percentile and why is it preferred over batch statistics for SLO monitoring?
?
Batch statistics: Calculate p99 over all requests in a fixed historical window (e.g., all requests yesterday)

  • Problem: Yesterday’s data is stale; doesn’t reflect current system state; slow feedback loop

Rolling percentile (sliding window): Calculate p99 over the most recent N minutes or M requests

  • Example: “p99 latency over the last 5-minute rolling window”
  • Reflects current state; detects degradation within minutes
  • Actionable: on-call engineer sees the metric worsening in real time

How it’s implemented:

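  • Histogram and sketch structures (HdrHistogram, t-digest, DDSketch) approximate percentiles cheaply and can be merged across machines and time windows
  • Most monitoring stacks ship an equivalent: Prometheus, for example, estimates quantiles from bucketed histograms with histogram_quantile()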

Practical SLO monitoring: Define SLO as a rolling percentile with a burn rate alert:

  • “Alert if p99 > 500ms for more than 5 minutes in a 1-hour window”
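  • Burn rate: how fast the error budget is being spent relative to the rate that would use it up exactly at the end of the SLO window; at a sustained 14x burn rate a 28-day budget is gone in about 2 days, so alert immediately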
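A sliding-window percentile and a burn-rate check can be sketched in a few lines. This version keeps every sample and sorts on each query, which is exact but too slow for production volumes; real monitoring systems use histograms or sketches instead. The class and function names are hypothetical.

```python
import math
from collections import deque


class RollingPercentile:
    """Exact percentile over the samples seen in the last window_s seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._samples = deque()  # (timestamp_s, value), oldest first

    def record(self, timestamp_s: float, value: float) -> None:
        self._samples.append((timestamp_s, value))
        self._evict(timestamp_s)

    def _evict(self, now_s: float) -> None:
        # Drop samples that have fallen out of the window
        while self._samples and self._samples[0][0] <= now_s - self.window_s:
            self._samples.popleft()

    def percentile(self, p: float, now_s: float):
        self._evict(now_s)
        if not self._samples:
            return None
        ordered = sorted(value for _, value in self._samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))  # nearest-rank definition
        return ordered[rank - 1]


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return observed_error_ratio / (1.0 - slo)
```

For example, an observed error ratio of 1.4% against a 99.9% SLO gives a burn rate of about 14.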

Total Cards: 25
Review Time: ~30 minutes
Priority: HIGH
Last Updated: 2026-05-29