Chapter 2 Flashcards — Defining Nonfunctional Requirements
Definitions
What is the difference between a fault and a failure?
?
- Fault: A single component deviating from its specification (e.g., a disk crashes, a network packet is dropped, a bug is triggered in an edge case)
- Failure: The system as a whole stops providing the required service to users
A fault-tolerant (resilient) system prevents faults from escalating into failures.
- Key insight: Do NOT try to prevent faults (impossible at scale) — prevent faults from causing failures
- Example: Netflix Chaos Monkey intentionally creates faults to verify fault tolerance
What are the three types of faults and their key characteristics?
?
- Hardware faults: Random, independent; disks crash, RAM errors, power outages
  - MTTF 10–50 years per disk; at 10,000 disks, expect ~1/day
  - Mitigation: Hardware redundancy (RAID, failover) + software replication
- Software faults: Systematic, correlated across all nodes running the same code
  - A single bug can take down every replica simultaneously
  - Examples: Leap second bug (2012), runaway processes, cascading failures
  - Mitigation: Testing, isolation, monitoring, crash-only software design
- Human errors: Leading cause of production outages (majority of internet service outages)
  - Typically configuration or operational mistakes, not hardware
  - Mitigation: Good design (make wrong thing hard), gradual rollout, fast rollback, detailed monitoring
What is operability and what does a highly operable system look like?
?
Operability: Making it easy for operations teams to keep the system running smoothly day to day.
A highly operable system:
- Has visibility: metrics, logs, distributed traces, and alerts that answer “what is the system doing right now?”
- Supports automation: integrates with CI/CD, IaC (Terraform), and config management
- Has predictable behavior: no surprising state transitions or auto-tuning magic
- Provides good documentation: runbooks, architecture decision records, dependency maps
- Avoids single points of human knowledge: bus factor > 1
- Has good defaults: works without manual tuning; overridable when needed
What is the difference between essential complexity and accidental complexity?
?
Essential complexity: Inherent in the problem being solved; cannot be eliminated
- Example: Distributed consensus IS fundamentally hard; handling concurrent writes IS complex
- These must be dealt with — abstractions can hide them but not eliminate them
Accidental complexity: Introduced by poor design choices; could be avoided
- Example: Spaghetti code, circular module dependencies, inconsistent APIs, magic globals
- Can and should be eliminated through refactoring and better abstractions
Simplicity in DDIA means reducing accidental complexity, not essential complexity.
Good abstractions hide essential complexity behind clean interfaces (SQL hides B-trees, TCP hides packet loss and retransmission).
What are SLI, SLO, and SLA? What is an error budget?
?
SLI (Service Level Indicator): A measurable metric
- Example: “Fraction of requests completing in < 200ms over a 1-minute window”
SLO (Service Level Objective): Internal target for an SLI
- Example: “SLI must be ≥ 99.5% measured over a rolling 28-day window”
SLA (Service Level Agreement): Contractual commitment with consequences for violation
- Example: “If SLO is missed, customer receives a 10% bill credit”
Error budget: The amount of unreliability the SLO allows
- Formula: error budget = 1 - SLO
- Example: 99.5% SLO → 0.5% error budget → ~3.6 hours/month of allowable failure
- When budget is exhausted, freeze risky deployments; focus on reliability
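The error budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a standard API: the function names and the 30-day month are assumptions chosen to match the "~3.6 hours/month" figure.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of allowed unavailability for a time-based SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# 99.5% SLO over an assumed 30-day month, expressed in hours.
print(error_budget_minutes(0.995) / 60)

# Hypothetical traffic: 1,000,000 requests, 2,500 of them failed.
print(budget_remaining(0.995, 1_000_000, 2_500))
```

With these hypothetical numbers, half of the budget is spent, which is the kind of signal a team could use to decide whether risky deployments should continue.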
What are the three memory architecture types introduced in Ch2?
?
Shared-Memory (SMP):
- Multiple CPUs share one pool of RAM and disk
- Simple; any thread can access any data; no distribution
- Expensive at scale; non-linear cost curve; single point of failure
- Use for: workloads that fit on one large machine
Shared-Disk:
- Multiple nodes have own CPU+RAM but share central storage (NAS, SAN, or cloud object store like S3)
- Cloud version: Snowflake, Redshift — compute and storage scale independently
- Use for: cloud data warehouses where compute and storage should scale separately
Shared-Nothing:
- Each node has own CPU, RAM, and disk; communicate only via network
- Data is partitioned (sharded) across nodes
- Horizontally scalable; linear cost; no single hardware bottleneck
- Requires: sharding, distributed state management, network coordination
- Use for: large-scale distributed systems — Cassandra, Kafka, Spanner, CockroachDB
Trade-offs and Comparisons
What is tail latency amplification and why does it matter?
?
Tail latency amplification: When a request calls N services in parallel, the response time is bounded by the slowest, causing the effective tail latency to worsen with each additional service called.
Formula (assuming the N calls are independent): P(all N calls complete within T) = P(single call < T)^N
Example: 10 parallel calls, each with p99 = 100ms
- P(single call < 100ms) = 0.99
- P(all 10 < 100ms) = 0.99^10 = 0.904
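- So the composite request’s p90 ≈ 100ms (not p99!): about 1 in 10 composite requests is slower than 100ms
- The composite p99 is significantly worse than 100ms; to bring it back to 100ms, each call would need roughly p99.9 ≤ 100ms (0.999^10 ≈ 0.990)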
Why it matters: Systems with large fan-out (timeline reads calling many services) see dramatically worse tail latency than any individual service. This is why large-scale systems obsess over p99 even when p50 looks fine.
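A quick way to see how fan-out erodes the tail is to compute the probability directly. The sketch below assumes the parallel calls are statistically independent, which real backends only approximate; the function names are illustrative.

```python
def prob_all_within(p_single: float, n_calls: int) -> float:
    """P(all n independent parallel calls finish within T), given P(one call < T)."""
    return p_single ** n_calls


def required_single_prob(p_composite: float, n_calls: int) -> float:
    """Per-call probability needed so that all n calls finish within T with p_composite."""
    return p_composite ** (1.0 / n_calls)


# Each backend call meets the 100ms target 99% of the time.
for n in (1, 10, 100):
    print(n, round(prob_all_within(0.99, n), 3))

# Per-call probability needed for a composite p99 with a fan-out of 10.
print(round(required_single_prob(0.99, 10), 4))
```

At a fan-out of 100, only about 37% of composite requests finish within the per-call p99, which is why high fan-out paths need much stricter per-service targets.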
What is the fan-out problem in social network timelines and what are the solution approaches?
?
Problem: When a user posts, deliver the post to all followers’ home timelines efficiently.
Fan-out on Read (Pull model):
- Write: store the post once in a global posts table
- Read: execute a join (posts × follows) on every timeline request
- Pro: Simple writes, no duplication
- Con: Expensive reads at scale; join over millions of posts and follows is slow
Fan-out on Write (Push model):
- Write: propagate post to each follower’s timeline cache
- Read: serve the pre-computed cache → fast
- Pro: O(1) reads
- Con: Write amplification — 1M followers × 1 post = 1M cache writes; celebrity with 30M followers is catastrophic
Hybrid (correct answer):
- Normal users (< ~100K followers): push to timeline caches
- Celebrities (> ~100K followers): pull/merge at read time
- Threshold is tunable; configurable per user
- Key insight: the load parameter is follower count distribution, not QPS
Why are percentiles better than averages for measuring response time?
?
Averages hide tail latencies: If 99 requests take 10ms and 1 takes 10,000ms:
- Arithmetic mean ≈ 110ms — overstates typical experience and understates the worst case simultaneously
- p50 = 10ms — reflects the typical user experience
- p99 = 10,000ms — captures the outlier that affects 1% of users
Why tail latencies matter:
- Users with the most data (best customers) often experience the worst latency
- Amazon: 100ms increase in response time = 1% decrease in sales
- Tail latency amplification (see above) compounds the problem in distributed systems
Practical rule: Never use averages in SLOs. Always specify percentile: “p99 < 200ms” not “average < 200ms.”
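The gap between mean and percentiles is easy to reproduce. The sketch below uses a hypothetical sample in which 2% of requests are slow (rather than exactly 1%), so that p99 falls unambiguously in the tail regardless of which percentile definition is used; the nearest-rank helper is one common definition, not the only one.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: 980 fast requests (10ms) and 20 slow ones (10,000ms).
latencies_ms = [10] * 980 + [10_000] * 20

print("mean:", mean(latencies_ms))
print("p50: ", percentile_nearest_rank(latencies_ms, 50))
print("p99: ", percentile_nearest_rank(latencies_ms, 99))
```

The mean lands near 210ms, a latency that no request in the sample actually experienced, while p50 and p99 each describe a real group of requests.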
What is the difference between vertical scaling and horizontal scaling?
?
Vertical scaling (scale-up):
- Replace the current machine with a more powerful one
- More CPUs, RAM, faster storage
- Pro: No code changes required; simple operationally
- Con: Cost scales non-linearly; hardware limit; single point of failure
- Best first step; most apps never need to go beyond this
Horizontal scaling (scale-out):
- Add more machines; distribute load across them
- Requires data sharding and load balancing
- Pro: Theoretically unlimited; uses commodity hardware; fault tolerant
- Con: Complex — requires sharding, replication, distributed state management, network coordination
- Use when vertical scaling is not cost-effective or hits hardware limits
Principle: Always try vertical scaling first. Horizontal scaling introduces distributed systems complexity that compounds over time.
What makes evolvability different from just “writing good code”?
?
Evolvability is specifically about making the system easy to change as requirements evolve — it goes beyond code quality to architectural design.
Why it’s important: Requirements change constantly. A system that is hard to change calcifies — teams avoid making necessary changes because the risk is too high, leading to technical debt accumulation and eventually “big bang rewrites.”
Evolvability requires:
- Good abstractions: Changes should be localized; interfaces stable even as implementations change
- Testability: Tests serve as a safety net for refactoring; catch regressions automatically
- Decoupling: Components that evolve at different rates should be separate
- Schema evolution: Data formats must support backward/forward compatibility (connects to Ch5 Encoding)
- Agile practices: TDD, CI/CD, feature flags enable small safe changes rather than big risky ones
Connection to simplicity: Simple, well-abstracted systems are easier to change. Accidental complexity is the primary enemy of evolvability.
Numbers and Precision
What are the key uptime numbers for “nines” of availability?
?
| SLO | Downtime per year | Downtime per month |
|---|---|---|
| 99% | 87.6 hours | 7.3 hours |
| 99.9% (“three nines”) | 8.76 hours | 43.8 minutes |
| 99.99% (“four nines”) | 52.6 minutes | 4.38 minutes |
| 99.999% (“five nines”) | 5.26 minutes | 26.3 seconds |
Context: Five nines requires < 5.3 minutes downtime per year — this means deployments, maintenance, and incidents must all fit within that budget. Extremely expensive to achieve.
Practical target: Most web services target 99.9%–99.99%. Five nines is reserved for critical infrastructure (telecom, payment networks, aviation).
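The table values follow from one line of arithmetic. The sketch below assumes a 365-day year and a month of exactly one twelfth of a year, which is how the figures above appear to be derived; the function name is illustrative.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Seconds of downtime permitted by an availability target over the period."""
    return (1.0 - availability_percent / 100.0) * period_days * 24 * 3600


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(target)
    per_month = per_year / 12
    print(f"{target}%: {per_year / 60:.1f} min/year, {per_month / 60:.2f} min/month")
```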
What is the MTTF of a hard disk and what is the practical implication at scale?
?
MTTF (Mean Time To Failure): 10–50 years per disk in lab conditions.
Practical implication at scale:
- At 10,000 disks: expect approximately 1 disk failure per day (10,000 disks / 10,000-day MTTF ≈ 1/day)
- Google operates millions of machines — hardware failure is a daily, routine event, not an exception
- Large cloud providers run storage fleets big enough that even a 1-in-a-million event happens many times per day
Conclusion: Hardware failure is normal and expected at scale. Design for it in software. Do not rely on hardware redundancy alone. Assume disks will fail; replicate data; test automatic failover.
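The "about one failure per day" estimate can be checked with a back-of-the-envelope calculation. This sketch assumes a constant failure rate and independent disks, which is a simplification; the function name is illustrative.

```python
def expected_failures_per_day(n_disks: int, mttf_years: float) -> float:
    """Expected daily disk failures, assuming independent disks and a constant failure rate."""
    return n_disks / (mttf_years * 365.0)


# Range implied by an MTTF of 10 to 50 years for a fleet of 10,000 disks.
for mttf_years in (10, 27.4, 50):
    print(mttf_years, round(expected_failures_per_day(10_000, mttf_years), 2))
```

An MTTF of 10,000 days is roughly 27.4 years, which sits inside the 10–50 year range and gives the round figure of one failure per day.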
What is the Amazon latency impact number and what does it imply?
?
Amazon’s finding: A 100ms increase in response time = approximately 1% decrease in sales.
Implications:
- Latency is a direct business metric, not just a technical concern
- Tail latencies (p99, p99.9) affect real users and real revenue
- p99 users are often your best customers (most data = most engaged = most valuable)
- Optimizing from 200ms to 100ms response time has measurable ROI
Application: When arguing for latency improvements or p99 SLOs to business stakeholders, use this framing: latency is a conversion rate factor, not just a developer comfort issue.
Application and Failure Modes
You’re seeing good p50 latency but terrible p99. What do you investigate?
?
This pattern suggests tail latency from outlier requests, not general slowness.
Investigation areas:
- Database queries: Are slow p99 requests hitting unindexed queries? Do users with more data get slower queries?
- GC pauses: JVM or .NET garbage collection pauses spike p99. Check GC logs.
- Lock contention: Rare but expensive locks (database table locks, mutex contention) affect p99
- External dependencies: One external API call with high tail latency? Check per-service p99.
- Head-of-line blocking: Is a thread pool saturated? Are slow requests queuing behind fast ones?
- Cold cache: First request for a new object misses cache; p99 may represent cache misses
Solutions:
- Add caching for expensive repeated operations
- Query optimization + proper indexing
- Separate thread pools per operation type (bulkhead pattern)
- Aggressive timeouts on external calls + fallbacks
- Async processing for non-critical work
How would you design the social network timeline system to handle a celebrity with 100M followers posting a tweet?
?
Problem: Naive fan-out on write → 100M cache writes per tweet → system falls over.
Solution: Hybrid fan-out:
- Identify celebrities: Any user with followers > threshold (e.g., 1M) is in “celebrity mode”
- Celebrity posts: Do NOT write to 100M individual timeline caches
  - Store the tweet in a celebrities table / high-follower post index
- Regular users’ posts: Fan-out on write as normal (push to follower timeline caches)
- Timeline read: Merge two sources:
  - Pre-computed cache (regular users’ posts, fast O(1) read)
  - Live query against celebrities the user follows (small set: most users follow < 100 celebrities)
  - Sort and merge by timestamp
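The hybrid design can be sketched with in-memory dictionaries standing in for the timeline cache and the celebrity post index. Everything here is hypothetical (class name, field names, and the tiny threshold used to keep the demo readable); a real system would use durable stores and asynchronous fan-out workers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Post:
    timestamp: int   # first field, so sorting orders posts by time
    author: str
    text: str


class TimelineService:
    def __init__(self, celebrity_threshold: int):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.timeline_cache = defaultdict(list)   # user -> posts pushed at write time
        self.celebrity_posts = defaultdict(list)  # author -> posts pulled at read time

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def post(self, author: str, text: str, timestamp: int) -> int:
        """Store a post and return the number of writes it caused."""
        post = Post(timestamp, author, text)
        if len(self.followers[author]) > self.celebrity_threshold:
            self.celebrity_posts[author].append(post)   # one write, no fan-out
            return 1
        for follower in self.followers[author]:          # fan-out on write
            self.timeline_cache[follower].append(post)
        return len(self.followers[author])

    def timeline(self, user: str, limit: int = 20) -> list:
        """Merge the pushed cache with posts pulled from followed celebrities."""
        pushed = self.timeline_cache.get(user, [])
        pulled = [
            post
            for author in self.following.get(user, ())
            for post in self.celebrity_posts.get(author, [])
        ]
        return sorted(pushed + pulled, reverse=True)[:limit]


service = TimelineService(celebrity_threshold=2)   # tiny threshold for the demo
for fan in ("alice", "bob", "carol"):
    service.follow(fan, "star")                     # "star" exceeds the threshold
for fan in ("alice", "bob"):
    service.follow(fan, "dave")                     # "dave" stays below it

dave_writes = service.post("dave", "lunch", timestamp=1)
star_writes = service.post("star", "new album", timestamp=2)
alice_timeline = service.timeline("alice")
print(dave_writes, star_writes, [p.author for p in alice_timeline])
```

Reading from every followed author that has entries in the celebrity index, rather than re-checking the threshold at read time, keeps posts visible even if an author's follower count later crosses the threshold in either direction.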
Additional optimizations:
- Pre-warm celebrity post caches (distributed to regional caches before read requests arrive)
- Use lazy fan-out for moderately popular users (fan out only when they have active followers)
- Rate-limit the fan-out worker to prevent spikes
What failure modes emerge when a system lacks operability?
?
Failure modes from poor operability:
- Silent failures: System is degraded but no alert fires → hours of customer impact before detection
- Runbook gaps: On-call engineer can’t find procedure for known failure mode → slow resolution
- Dependency mysteries: Service fails because upstream dependency changed silently; no visibility
- Configuration drift: Prod environment diverges from documented state; team doesn’t know actual config
- Hero dependency: One engineer knows how the system works; when they’re unavailable, no one can debug
- Manual capacity management: Traffic spike → team manually adds servers → 30 min to respond → customers see errors
What good operability prevents:
- Monitoring catches p99 spike before customers complain
- CI/CD pipeline with automated rollback reverts bad deployment in minutes
- Runbook guides any on-call engineer through known failure modes
- Infrastructure-as-code ensures prod matches documentation
When does a monolith become a scalability bottleneck, and what are the signs?
?
Signs a monolith is becoming a bottleneck:
- Deployment coupling: One team’s change requires redeploying everything; slow release cycles
- Resource contention: One high-CPU component starves others sharing the same process
- Independent scaling need: One module needs 10x more compute; can’t scale it without scaling everything
- Team coordination overhead: > 30–50 engineers → merge conflicts and coordination costs dominate
- Technology mismatch: Different components would benefit from different DB/language choices
Important: Most of these are organizational/team problems, not technical ones. The technical scalability of a well-built monolith is often higher than teams assume.
Threshold for extraction: Extract a service when:
- Team autonomy and independent deployment are the bottleneck (not raw compute)
- The service boundary is clear and stable (not likely to need cross-service transactions)
- Operational maturity exists to run distributed services (monitoring, service mesh, distributed tracing)
What is head-of-line blocking and why does it affect tail latency?
?
Head-of-line blocking: A slow request occupying a thread or connection prevents all requests queued behind it from being processed, even if those requests would be fast on their own.
Why it inflates tail latency:
- A pool of 10 threads receives 10 slow requests (5 seconds each), followed by 100 fast requests (1ms each)
- Every thread is busy with a slow request, so the fast requests wait in the queue; their measured response time is at least 5,001ms, not 1ms
- p99 includes these artificially slow fast requests
Mitigations:
- Shorter timeouts on requests (fail fast, don’t hold threads)
- Separate thread pools per operation type (bulkhead pattern): slow operations can’t block fast ones
- Async/non-blocking I/O: threads don’t block waiting for I/O
- Request hedging: send a duplicate request to a second server after a short delay (for example, once the first has exceeded the usual p95), and use whichever responds first
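A small deterministic simulation makes the queuing effect and the bulkhead mitigation concrete. It assumes all requests arrive at time zero, are served in FIFO order, and that all ten threads are initially taken by slow requests; the function name and the 8/2 pool split are illustrative choices.

```python
import heapq


def response_times_ms(service_times_ms, workers):
    """Response time of each request when all arrive at t=0 and are served FIFO."""
    free_at = [0] * workers            # min-heap: time at which each worker is next free
    heapq.heapify(free_at)
    finished = []
    for service in service_times_ms:
        start = heapq.heappop(free_at)  # earliest available worker takes the request
        end = start + service
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


SLOW_MS, FAST_MS = 5_000, 1

# Shared pool: 10 slow requests occupy all 10 threads, 100 fast ones queue behind them.
shared = response_times_ms([SLOW_MS] * 10 + [FAST_MS] * 100, workers=10)
fast_in_shared_pool = shared[10:]

# Bulkhead: the same 10 threads split into a slow pool (8) and a fast pool (2).
slow_in_own_pool = response_times_ms([SLOW_MS] * 10, workers=8)
fast_in_own_pool = response_times_ms([FAST_MS] * 100, workers=2)

print("fast, shared pool:  ", min(fast_in_shared_pool), "to", max(fast_in_shared_pool), "ms")
print("fast, bulkhead pool:", min(fast_in_own_pool), "to", max(fast_in_own_pool), "ms")
print("slow, bulkhead pool:", max(slow_in_own_pool), "ms worst case")
```

The bulkhead keeps fast requests fast, but the slow pool has fewer threads, so the slowest slow requests wait longer. That is the trade-off the pattern accepts.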
What is the error budget concept in SRE and how does it change team behavior?
?
Error budget: The amount of unreliability (downtime, errors) that a system’s SLO allows within a time window.
- Formula: error budget = 1 - SLO
- Example: 99.9% SLO → 0.1% error budget → ~43 minutes downtime per month
How it changes behavior:
- When budget is healthy: Team can take risks (deploy risky changes, experiment in production)
- When budget is nearly exhausted: Team must slow down risky deployments and focus on reliability
- When budget is depleted: Feature deployments freeze until reliability is restored
Why it works: Creates a shared contract between development (wants to move fast) and operations (wants stability). Neither side can always “win” — the budget forces negotiation and trade-offs based on data, not politics.
What is chaos engineering and how does it differ from ordinary testing?
?
Chaos engineering: Intentionally injecting failures into a running system (often production) to verify that fault tolerance mechanisms actually work.
Difference from ordinary testing:
- Unit/integration tests: Verify code correctness in controlled, isolated conditions
- Chaos engineering: Verifies system resilience under realistic, production conditions — including unexpected combinations of failures
Why production matters: Staging environments don’t replicate all dependencies, traffic patterns, and state. Only production exposes real failure modes.
Examples:
- Netflix Chaos Monkey: randomly terminates EC2 instances in production
- Chaos Mesh (Kubernetes): injects network partitions, pod failures, CPU pressure
- Gremlin: commercial platform for controlled chaos experiments
What it finds: Missing retries, inadequate timeouts, single points of failure that redundancy didn’t cover, cascading failures that tests didn’t anticipate.
What is the relationship between simplicity and evolvability?
?
Core relationship: Simple systems are easier to change. Accidental complexity is the primary enemy of evolvability.
Chain of reasoning:
- Complex systems have tight coupling and tangled dependencies
- Tight coupling means changing component A requires changing B, C, D
- Each required change adds risk of introducing bugs
- Risk discourages change → team avoids necessary modifications → technical debt grows
- Eventually: “big bang rewrite” rather than incremental evolution
Simplicity enables evolvability through:
- Good abstractions: Changes localized to implementation; interface remains stable
- Decoupling: Components that evolve at different rates can be changed independently
- Testability: Tests catch regressions, making change safe
- Clear naming: Engineers understand what code does → can modify it confidently
Practical implication: Every time you reduce accidental complexity, you are also improving the system’s ability to change in the future.
What distinguishes response time from latency from service time?
?
Three related but distinct concepts:
Response time (client perspective):
- Total time from client sending request to receiving response
- Includes: network round-trip + queuing time + service time
- What users and SLOs should measure
Latency (loose usage):
- Often used interchangeably with response time in common practice
- Strict definition: time a request is “latent” (waiting to be handled), excluding service time
- In practice: use “response time” for precision, “latency” in casual conversation
Service time (server perspective):
- Time the server spends actually processing the request
- Excludes network time and queuing time
- Lower bound for response time
Why the distinction matters: A service with 1ms service time can still show 500ms response time due to network, queuing at overloaded upstream, or head-of-line blocking. Optimizing service time without addressing queuing is often wasted effort.
What is the circuit breaker pattern and when does it apply?
?
Circuit breaker: A fault tolerance pattern that stops sending requests to a failing service rather than letting each request time out individually.
States:
- Closed (normal): requests flow through; failures counted
- Open (tripped): requests immediately fail without attempting the call; entered when failure count exceeds threshold
- Half-open (recovery probe): a small number of requests allowed through to test if service has recovered
Why it matters:
- Without it: every call to a failing service waits for the timeout (e.g., 30s); threads pile up; caller also starts failing
- With it: calls fail immediately once the circuit is open; caller can fall back gracefully; failing service gets a chance to recover without being bombarded
Implementation: Hystrix (legacy), Resilience4j (Java), Polly (.NET), Envoy/Istio service mesh (infrastructure level)
Connection to reliability: Circuit breakers are a key mechanism for preventing cascading failures — one of the hardest reliability problems in distributed systems.
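The three states can be sketched as a small state machine. This is a minimal single-threaded illustration, not how Resilience4j, Polly, or Envoy implement it: the class, parameter names, and the consecutive-failure counting rule are assumptions, and a production breaker would add locking and rate-based thresholds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"                # timeout elapsed: let a probe through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"                     # trip, or re-trip after a failed probe
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"


def always_failing():
    raise ConnectionError("upstream down")


now = [0.0]                                         # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(always_failing)
    except ConnectionError:
        pass
print(breaker.state)
```

Once the breaker is open, further calls raise `CircuitOpenError` immediately instead of waiting on the failing service, which gives the caller a cheap signal to fall back.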
What is a rolling percentile and why is it preferred over batch statistics for SLO monitoring?
?
Batch statistics: Calculate p99 over all requests in a fixed historical window (e.g., all requests yesterday)
- Problem: Yesterday’s data is stale; doesn’t reflect current system state; slow feedback loop
Rolling percentile (sliding window): Calculate p99 over the most recent N minutes or M requests
- Example: “p99 latency over the last 5-minute rolling window”
- Reflects current state; detects degradation within minutes
- Actionable: on-call engineer sees the metric worsening in real time
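A sliding-window percentile can be sketched with a deque that evicts samples older than the window. This version keeps every sample and sorts on each query, so it is exact but only suitable for small volumes; it also assumes timestamps arrive in non-decreasing order. The class and method names are illustrative.

```python
import math
from collections import deque


class RollingPercentile:
    """Exact percentile over samples recorded in the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.samples = deque()           # (timestamp, value) pairs, oldest first

    def record(self, timestamp: float, value: float) -> None:
        self.samples.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()

    def percentile(self, p: float, now: float):
        """Nearest-rank percentile of the samples still inside the window, or None."""
        self._evict(now)
        if not self.samples:
            return None
        ordered = sorted(value for _, value in self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


rolling = RollingPercentile(window_seconds=300)
for t in range(100):                     # healthy period: 20ms responses
    rolling.record(t, 20)
healthy_p99 = rolling.percentile(99, now=150)

for t in range(300, 400):                # degraded period: 800ms responses
    rolling.record(t, 800)
degraded_p99 = rolling.percentile(99, now=399)
print(healthy_p99, degraded_p99)
```

Because the healthy samples age out of the 5-minute window, the metric reflects the degradation within minutes instead of being diluted by older data.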
How it’s implemented:
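- HdrHistogram (High Dynamic Range Histogram), t-digest, and similar sketches: compact structures that estimate percentiles without storing every sample; a rolling window is built by keeping one per time slice and merging or rotating them
- Monitoring tools such as Prometheus, Datadog, New Relic, and Honeycomb can report windowed percentiles, each with its own histogram or sketch implementation (not necessarily HdrHistogram)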
Practical SLO monitoring: Define SLO as a rolling percentile with a burn rate alert:
- “Alert if p99 > 500ms for more than 5 minutes in a 1-hour window”
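- Burn rate: observed error rate divided by the error budget (1 - SLO); if the budget is being consumed about 14x faster than the rate that would exactly exhaust it over the SLO window, alert immediately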
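Burn rate is a simple ratio, sketched below. The function names, the 30-day window, and the sample error rates are assumptions for illustration; the 14x paging threshold is the figure from the card above.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0 to leave an error budget")
    return error_rate / budget


def days_to_exhaustion(error_rate: float, slo: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current error rate persists."""
    return window_days / burn_rate(error_rate, slo)


def should_page(error_rate: float, slo: float, threshold: float = 14.0) -> bool:
    return burn_rate(error_rate, slo) >= threshold


# Hypothetical last-hour error rates against a 99.9% SLO.
for observed in (0.0005, 0.005, 0.02):
    print(observed, round(burn_rate(observed, 0.999), 1), should_page(observed, 0.999))
```

At a burn rate of 14, a 30-day budget lasts only a little over two days, which is why that level is treated as page-worthy and not left for a ticket.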
Total Cards: 25
Review Time: ~30 minutes
Priority: HIGH
Last Updated: 2026-05-29