Chapter 2 Flashcards — Defining Nonfunctional Requirements

Definitions

What is the difference between a fault and a failure?
?

Fault: A single component deviating from its specification (e.g., a disk crashes, a network packet is dropped, a bug is triggered in an edge case)
Failure: The system as a whole stops providing the required service to users

A fault-tolerant (resilient) system prevents faults from escalating into failures.

Key insight: Do NOT try to prevent faults (impossible at scale) — prevent faults from causing failures
Example: Netflix Chaos Monkey intentionally creates faults to verify fault tolerance

What are the three types of faults and their key characteristics?
?

Hardware faults — Random, independent; disks crash, RAM errors, power outages
- MTTF 10–50 years per disk; at 10,000 disks, expect ~1/day
- Mitigation: Hardware redundancy (RAID, failover) + software replication
Software faults — Systematic, correlated across all nodes running the same code
- A single bug can take down every replica simultaneously
- Examples: Leap second bug (2012), runaway processes, cascading failures
- Mitigation: Testing, isolation, monitoring, crash-only software design
Human errors — Leading cause of production outages (majority of internet service outages)
- Typically configuration or operational mistakes, not hardware
- Mitigation: Good design (make wrong thing hard), gradual rollout, fast rollback, detailed monitoring

What is operability and what does a highly operable system look like?
?
Operability: Making it easy for operations teams to keep the system running smoothly day to day.

A highly operable system:

Has visibility: metrics, logs, distributed traces, and alerts that answer “what is the system doing right now?”
Supports automation: integrates with CI/CD, IaC (Terraform), and config management
Has predictable behavior: no surprising state transitions or auto-tuning magic
Provides good documentation: runbooks, architecture decision records, dependency maps
Avoids single points of human knowledge: bus factor > 1
Has good defaults: works without manual tuning; overridable when needed

What is the difference between essential complexity and accidental complexity?
?
Essential complexity: Inherent in the problem being solved; cannot be eliminated

Example: Distributed consensus IS fundamentally hard; handling concurrent writes IS complex
These must be dealt with — abstractions can hide them but not eliminate them

Accidental complexity: Introduced by poor design choices; could be avoided

Example: Spaghetti code, circular module dependencies, inconsistent APIs, magic globals
Can and should be eliminated through refactoring and better abstractions

Simplicity in DDIA means reducing accidental complexity, not essential complexity.
Good abstractions hide essential complexity behind clean interfaces (SQL hides B-trees, TCP hides routing).

What are SLI, SLO, and SLA? What is an error budget?
?
SLI (Service Level Indicator): A measurable metric

Example: “Fraction of requests completing in < 200ms over a 1-minute window”

SLO (Service Level Objective): Internal target for an SLI

Example: “SLI must be ≥ 99.5% measured over a rolling 28-day window”

SLA (Service Level Agreement): Contractual commitment with consequences for violation

Example: “If SLO is missed, customer receives a 10% bill credit”

Error budget: The amount of unreliability the SLO allows

Formula: error budget = 1 - SLO
Example: 99.5% SLO → 0.5% error budget → ~3.6 hours/month of allowable failure
When budget is exhausted, freeze risky deployments; focus on reliability
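
As a rough illustration of the formula above, the following Python sketch converts an availability SLO into allowable downtime. The function name and the 30-day window (standing in for "a month") are assumptions made for this example, not part of the source material.

```python
def error_budget_hours(slo: float, window_days: float = 30) -> float:
    """Return the allowable downtime in hours for an availability SLO.

    slo is a fraction (0.995 for 99.5%); window_days is an assumed
    measurement window (30 days here, standing in for "a month").
    """
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    error_budget = 1 - slo  # fraction of the window that is allowed to fail
    return error_budget * window_days * 24


print(round(error_budget_hours(0.995), 2))       # 99.5% over 30 days, in hours
print(round(error_budget_hours(0.999) * 60, 1))  # 99.9% over 30 days, in minutes
```

With a 30-day window the 99.5% case works out to about 3.6 hours, matching the example above; a slightly longer average month gives slightly larger figures.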

What are the three scaling architectures introduced in Ch2?
?
Shared-Memory (SMP):

Multiple CPUs share one pool of RAM and disk
Simple; any thread can access any data; no distribution
Expensive at scale; non-linear cost curve; single point of failure
Use for: workloads that fit on one large machine

Shared-Disk:

Multiple nodes have own CPU+RAM but share central storage (NAS, SAN, or cloud object store like S3)
Cloud version: Snowflake, Redshift — compute and storage scale independently
Use for: cloud data warehouses where compute and storage should scale separately

Shared-Nothing:

Each node has own CPU, RAM, and disk; communicate only via network
Data is partitioned (sharded) across nodes
Horizontally scalable; linear cost; no single hardware bottleneck
Requires: sharding, distributed state management, network coordination
Use for: large-scale distributed systems — Cassandra, Kafka, Spanner, CockroachDB

Trade-offs and Comparisons

What is tail latency amplification and why does it matter?
?
Tail latency amplification: When a request calls N services in parallel, its response time is determined by the slowest call, so the effective tail latency worsens with each additional service called.

Formula: P(all N calls complete within T) = P(single call < T)^N (assuming the calls are independent)

Example: 10 parallel calls, each with p99 = 100ms

P(single call < 100ms) = 0.99
P(all 10 < 100ms) = 0.99^10 = 0.904
So your composite request’s p90 ≈ 100ms (not p99!): roughly 1 in 10 composite requests takes longer than 100ms
The composite p99 is significantly worse

Why it matters: Systems with large fan-out (timeline reads calling many services) see dramatically worse tail latency than any individual service. This is why large-scale systems obsess over p99 even when p50 looks fine.
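
The arithmetic in this card can be checked with a few lines of Python. This is a minimal sketch that assumes the parallel calls are statistically independent, which real services often are not; the function name is made up for illustration.

```python
def prob_all_within(p_single: float, n_calls: int) -> float:
    """Probability that all n parallel calls finish within T, given that each
    one independently finishes within T with probability p_single."""
    return p_single ** n_calls


# Each backend call meets its p99 target 99% of the time.
for n in (1, 10, 100):
    print(n, round(prob_all_within(0.99, n), 3))
```

At a fan-out of 10 the result is about 0.904, and at a fan-out of 100 only about 37% of composite requests stay within the single-service p99.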

What is the fan-out problem in social network timelines and what are the solution approaches?
?
Problem: When a user posts, deliver the post to all followers’ home timelines efficiently.

Fan-out on Read (Pull model):

Write: store the post once in a global posts table
Read: execute a join (posts × follows) on every timeline request
Pro: Simple writes, no duplication
Con: Expensive reads at scale; join over millions of posts and follows is slow

Fan-out on Write (Push model):

Write: propagate post to each follower’s timeline cache
Read: serve the pre-computed cache → fast
Pro: O(1) reads
Con: Write amplification — 1M followers × 1 post = 1M cache writes; celebrity with 30M followers is catastrophic

Hybrid (correct answer):

Normal users (< ~100K followers): push to timeline caches
Celebrities (> ~100K followers): pull/merge at read time
Threshold is tunable; configurable per user
Key insight: the load parameter is follower count distribution, not QPS

Why are percentiles better than averages for measuring response time?
?
Averages hide tail latencies: If 99 requests take 10ms and 1 takes 10,000ms:

Arithmetic mean ≈ 110ms — overstates typical experience and understates the worst case simultaneously
p50 = 10ms — reflects the typical user experience
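p99.9 = 10,000ms: captures the outlier that affects 1% of requests (with exactly 100 samples, the nearest-rank p99 is still 10ms, so the outlier only appears above p99)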

Why tail latencies matter:

Users with the most data (best customers) often experience the worst latency
Amazon: 100ms increase in response time = 1% decrease in sales
Tail latency amplification (see above) compounds the problem in distributed systems

Practical rule: Never use averages in SLOs. Always specify percentile: “p99 < 200ms” not “average < 200ms.”
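
The 99-fast, 1-slow example can be reproduced with a short Python sketch. It assumes the nearest-rank definition of a percentile; other definitions (for example, linear interpolation) give different values near the top of a small sample. Under nearest rank with exactly 100 samples, the single outlier first shows up above p99, for instance at p99.9 or the maximum.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [10] * 99 + [10_000]   # 99 fast requests, 1 very slow one

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms:.1f} p50={p50_ms} p99={p99_ms} p99.9={p999_ms}")
```

The mean lands near 110ms, a value no request actually experienced, while the percentiles separate the typical case from the outlier.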

What is the difference between vertical scaling and horizontal scaling?
?
Vertical scaling (scale-up):

Replace the current machine with a more powerful one
More CPUs, RAM, faster storage
Pro: No code changes required; simple operationally
Con: Cost scales non-linearly; hardware limit; single point of failure
Best first step; most apps never need to go beyond this

Horizontal scaling (scale-out):

Add more machines; distribute load across them
Requires data sharding and load balancing
Pro: Theoretically unlimited; uses commodity hardware; fault tolerant
Con: Complex — requires sharding, replication, distributed state management, network coordination
Use when vertical scaling is not cost-effective or hits hardware limits

Principle: Always try vertical scaling first. Horizontal scaling introduces distributed systems complexity that compounds over time.

What makes evolvability different from just “writing good code”?
?
Evolvability is specifically about making the system easy to change as requirements evolve — it goes beyond code quality to architectural design.

Why it’s important: Requirements change constantly. A system that is hard to change calcifies — teams avoid making necessary changes because the risk is too high, leading to technical debt accumulation and eventually “big bang rewrites.”

Evolvability requires:

Good abstractions: Changes should be localized; interfaces stable even as implementations change
Testability: Tests serve as a safety net for refactoring; catch regressions automatically
Decoupling: Components that evolve at different rates should be separate
Schema evolution: Data formats must support backward/forward compatibility (connects to Ch5 Encoding)
Agile practices: TDD, CI/CD, feature flags enable small safe changes rather than big risky ones

Connection to simplicity: Simple, well-abstracted systems are easier to change. Accidental complexity is the primary enemy of evolvability.

Numbers and Precision

What are the key uptime numbers for “nines” of availability?
?

SLO	Downtime per year	Downtime per month
99%	87.6 hours	7.3 hours
99.9% (“three nines”)	8.76 hours	43.8 minutes
99.99% (“four nines”)	52.6 minutes	4.38 minutes
99.999% (“five nines”)	5.26 minutes	26.3 seconds

Context: Five nines requires < 5.3 minutes downtime per year — this means deployments, maintenance, and incidents must all fit within that budget. Extremely expensive to achieve.

Practical target: Most web services target 99.9%–99.99%. Five nines is reserved for critical infrastructure (telecom, payment networks, aviation).

What is the MTTF of a hard disk and what is the practical implication at scale?
?
MTTF (Mean Time To Failure): 10–50 years per disk in lab conditions.

Practical implication at scale:

At 10,000 disks: expect approximately 1 disk failure per day (10,000 disks / 10,000-day MTTF ≈ 1/day; 10,000 days is about 27 years, inside the 10–50 year range)
Google operates millions of machines — hardware failure is a daily, routine event, not an exception
Large cloud providers run so many disks that even a 1-in-a-million event happens many times per day

Conclusion: Hardware failure is normal and expected at scale. Design for it in software. Do not rely on hardware redundancy alone. Assume disks will fail; replicate data; test automatic failover.
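
The back-of-envelope estimate above can be written out as follows. This is a sketch that assumes a constant failure rate and independent failures, which is a simplification; the function name is hypothetical.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day for a fleet, assuming a constant
    failure rate of 1/MTTF per disk and independent failures."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


# Sweep the 10-50 year MTTF range quoted above for a 10,000-disk fleet.
for years in (10, 27.4, 50):
    print(years, round(expected_failures_per_day(10_000, years), 2))
```

Across the quoted MTTF range the fleet sees somewhere between roughly one failure every two days and nearly three failures per day, so "about one per day" is a reasonable midpoint.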

What is the Amazon latency impact number and what does it imply?
?
Amazon’s finding: A 100ms increase in response time = approximately 1% decrease in sales.

Implications:

Latency is a direct business metric, not just a technical concern
Tail latencies (p99, p99.9) affect real users and real revenue
p99 users are often your best customers (most data = most engaged = most valuable)
Optimizing from 200ms to 100ms response time has measurable ROI

Application: When arguing for latency improvements or p99 SLOs to business stakeholders, use this framing: latency is a conversion rate factor, not just a developer comfort issue.

Application and Failure Modes

You’re seeing good p50 latency but terrible p99. What do you investigate?
?
This pattern suggests tail latency from outlier requests, not general slowness.

Investigation areas:

Database queries: Are slow p99 requests hitting unindexed queries? Do users with more data get slower queries?
GC pauses: JVM or .NET garbage collection pauses spike p99. Check GC logs.
Lock contention: Rare but expensive locks (database table locks, mutex contention) affect p99
External dependencies: One external API call with high tail latency? Check per-service p99.
Head-of-line blocking: Is a thread pool saturated? Are slow requests queuing behind fast ones?
Cold cache: First request for a new object misses cache; p99 may represent cache misses

Solutions:

Add caching for expensive repeated operations
Query optimization + proper indexing
Separate thread pools per operation type (bulkhead pattern)
Aggressive timeouts on external calls + fallbacks
Async processing for non-critical work

How would you design the social network timeline system to handle a celebrity with 100M followers posting a tweet?
?
Problem: Naive fan-out on write → 100M cache writes per tweet → system falls over.

Solution: Hybrid fan-out:

Identify celebrities: Any user with followers > threshold (e.g., 1M) is in “celebrity mode”
Celebrity posts: Do NOT write to 100M individual timeline caches
- Store the tweet in a celebrities table / high-follower post index
Regular users’ posts: Fan-out on write as normal (push to follower timeline caches)
Timeline read: Merge two sources:
- Pre-computed cache (regular users’ posts, fast O(1) read)
- Live query against celebrities the user follows (small set: most users follow < 100 celebrities)
- Sort and merge by timestamp

Additional optimizations:

Pre-warm celebrity post caches (distributed to regional caches before read requests arrive)
Use lazy fan-out for moderately popular users (fan out only when they have active followers)
Rate-limit the fan-out worker to prevent spikes
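
The hybrid design described in this card can be sketched as a toy in-memory model. Everything here (class name, method names, the tuple layout of a post) is hypothetical and chosen for illustration; a real system would use durable storage and caches, and would need to handle users who cross the threshold over time.

```python
from collections import defaultdict


class HybridTimeline:
    """Toy in-memory model of hybrid fan-out (all names are illustrative)."""

    def __init__(self, celebrity_threshold: int = 1_000_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> follower ids
        self.following = defaultdict(set)         # user -> authors followed
        self.timeline_cache = defaultdict(list)   # user -> posts pushed at write time
        self.celebrity_posts = defaultdict(list)  # celebrity -> own posts, stored once

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > self.celebrity_threshold

    def post(self, author: str, timestamp: int, text: str) -> None:
        entry = (timestamp, author, text)
        if self.is_celebrity(author):
            # Celebrity: store the post once, with no per-follower writes.
            self.celebrity_posts[author].append(entry)
        else:
            # Regular user: fan out on write to every follower's cache.
            for follower in self.followers[author]:
                self.timeline_cache[follower].append(entry)

    def read_timeline(self, user: str, limit: int = 20) -> list:
        # Start from the precomputed cache, then pull followed celebrities' posts.
        entries = list(self.timeline_cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.extend(self.celebrity_posts[author])
        # Merge by timestamp, newest first.
        return sorted(entries, reverse=True)[:limit]


tl = HybridTimeline(celebrity_threshold=2)   # tiny threshold for the demo
for fan in ("ann", "bob", "cat"):
    tl.follow(fan, "star")                   # 3 followers > 2, so "star" is a celebrity
tl.follow("ann", "bob")
tl.post("star", 1, "hello from star")        # stored once, not pushed
tl.post("bob", 2, "hello from bob")          # pushed to ann's cache
print(tl.read_timeline("ann"))
```

The write cost for the celebrity stays constant regardless of follower count, and the read path pays a small merge cost proportional to the number of celebrities a user follows.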

What failure modes emerge when a system lacks operability?
?
Failure modes from poor operability:

Silent failures: System is degraded but no alert fires → hours of customer impact before detection
Runbook gaps: On-call engineer can’t find procedure for known failure mode → slow resolution
Dependency mysteries: Service fails because upstream dependency changed silently; no visibility
Configuration drift: Prod environment diverges from documented state; team doesn’t know actual config
Hero dependency: One engineer knows how the system works; when they’re unavailable, no one can debug
Manual capacity management: Traffic spike → team manually adds servers → 30 min to respond → customers see errors

What good operability prevents:

Monitoring catches p99 spike before customers complain
CI/CD pipeline with automated rollback reverts bad deployment in minutes
Runbook guides any on-call engineer through known failure modes
Infrastructure-as-code ensures prod matches documentation

When does a monolith become a scalability bottleneck, and what are the signs?
?
Signs a monolith is becoming a bottleneck:

Deployment coupling: One team’s change requires redeploying everything; slow release cycles
Resource contention: One high-CPU component starves others sharing the same process
Independent scaling need: One module needs 10x more compute; can’t scale it without scaling everything
Team coordination overhead: > 30–50 engineers → merge conflicts and coordination costs dominate
Technology mismatch: Different components would benefit from different DB/language choices

Important: Most of these are organizational/team problems, not technical ones. The technical scalability of a well-built monolith is often higher than teams assume.

Threshold for extraction: Extract a service when:

Team autonomy and independent deployment are the bottleneck (not raw compute)
The service boundary is clear and stable (not likely to need cross-service transactions)
Operational maturity exists to run distributed services (monitoring, service mesh, distributed tracing)

What is head-of-line blocking and why does it affect tail latency?
?
Head-of-line blocking: A slow request occupying a thread or connection prevents all requests queued behind it from being processed, even if those requests would be fast on their own.

Why it inflates tail latency:

A single worker thread (or a pool whose threads are all busy with slow work) receives a slow request (5 seconds) followed by 100 fast requests (1ms each)
The fast requests queue up behind the slow one; their measured response time is 5001ms or more, not 1ms
p99 includes these artificially slow fast requests

Mitigations:

Shorter timeouts on requests (fail fast, don’t hold threads)
Separate thread pools per operation type (bulkhead pattern): slow operations can’t block fast ones
Async/non-blocking I/O: threads don’t block waiting for I/O
Request hedging: send duplicate request to a second server after a timeout, use whichever responds first
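
To make the queueing effect concrete, here is a minimal simulation sketch. It assumes all requests arrive at time 0 and are served first-in, first-out by a fixed number of workers; the function name and parameters are made up for this example.

```python
def fifo_response_times(service_times_ms, workers=1):
    """Simulate requests that all arrive at t=0 and are served FIFO.

    Returns each request's response time (queueing + service) in ms.
    """
    free_at = [0.0] * workers  # time at which each worker next becomes idle
    responses = []
    for service in service_times_ms:
        # The next request in line goes to whichever worker frees up first.
        i = min(range(workers), key=free_at.__getitem__)
        finish = free_at[i] + service
        free_at[i] = finish
        responses.append(finish)  # arrival time is 0, so response time == finish
    return responses


requests = [5000] + [1] * 100                   # one slow request, then 100 fast ones
shared = fifo_response_times(requests, workers=1)
isolated = fifo_response_times([1] * 100, workers=1)  # fast requests in their own pool

print("shared worker, first/last fast request:", shared[1], shared[-1])
print("separate pool, first/last fast request:", isolated[0], isolated[-1])
```

Giving the fast requests their own worker, in the spirit of the bulkhead mitigation, removes the 5-second wait even though their service time is unchanged.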

What is the error budget concept in SRE and how does it change team behavior?
?
Error budget: The amount of unreliability (downtime, errors) that a system’s SLO allows within a time window.

Formula: error budget = 1 - SLO
Example: 99.9% SLO → 0.1% error budget → ~43 minutes downtime per month

How it changes behavior:

When budget is healthy: Team can take risks (deploy risky changes, experiment in production)
When budget is nearly exhausted: Team must slow down risky deployments and focus on reliability
When budget is depleted: Feature deployments freeze until reliability is restored

Why it works: Creates a shared contract between development (wants to move fast) and operations (wants stability). Neither side can always “win” — the budget forces negotiation and trade-offs based on data, not politics.
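
The behavior rules above can be expressed as a small policy function. This sketch assumes a request-based budget (allowed failed requests) rather than a time-based one, and the 25% warning cutoff is an arbitrary value picked for illustration; both function names are hypothetical.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def deployment_policy(remaining: float, warn_below: float = 0.25) -> str:
    """Map remaining budget to a release stance. warn_below is an assumed cutoff."""
    if remaining <= 0:
        return "freeze feature deployments"
    if remaining < warn_below:
        return "slow down risky deployments"
    return "normal release cadence"


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
for failed in (100, 900, 1200):
    remaining = budget_remaining(0.999, 1_000_000, failed)
    print(failed, round(remaining, 2), deployment_policy(remaining))
```

Encoding the policy as data-driven thresholds is what lets the decision be made from measurements rather than negotiated case by case.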

What is chaos engineering and how does it differ from ordinary testing?
?
Chaos engineering: Intentionally injecting failures into a running system (often production) to verify that fault tolerance mechanisms actually work.

Difference from ordinary testing:

Unit/integration tests: Verify code correctness in controlled, isolated conditions
Chaos engineering: Verifies system resilience under realistic, production conditions — including unexpected combinations of failures

Why production matters: Staging environments don’t replicate all dependencies, traffic patterns, and state. Only production exposes real failure modes.

Examples:

Netflix Chaos Monkey: randomly terminates EC2 instances in production
Chaos Mesh (Kubernetes): injects network partitions, pod failures, CPU pressure
Gremlin: commercial platform for controlled chaos experiments

What it finds: Missing retries, inadequate timeouts, single points of failure that redundancy didn’t cover, cascading failures that tests didn’t anticipate.

What is the relationship between simplicity and evolvability?
?
Core relationship: Simple systems are easier to change. Accidental complexity is the primary enemy of evolvability.

Chain of reasoning:

Complex systems have tight coupling and tangled dependencies
Tight coupling means changing component A requires changing B, C, D
Each required change adds risk of introducing bugs
Risk discourages change → team avoids necessary modifications → technical debt grows
Eventually: “big bang rewrite” rather than incremental evolution

Simplicity enables evolvability through:

Good abstractions: Changes localized to implementation; interface remains stable
Decoupling: Components that evolve at different rates can be changed independently
Testability: Tests catch regressions, making change safe
Clear naming: Engineers understand what code does → can modify it confidently

Practical implication: Every time you reduce accidental complexity, you are also improving the system’s ability to change in the future.

What distinguishes response time from latency from service time?
?
Three related but distinct concepts:

Response time (client perspective):

Total time from client sending request to receiving response
Includes: network round-trip + queuing time + service time
What users and SLOs should measure

Latency (loose usage):

Often used interchangeably with response time in common practice
Strict definition: time a request is “latent” (waiting to be handled), excluding service time
In practice: use “response time” for precision, “latency” in casual conversation

Service time (server perspective):

Time the server spends actually processing the request
Excludes network time and queuing time
Lower bound for response time

Why the distinction matters: A service with 1ms service time can still show 500ms response time due to network, queuing at overloaded upstream, or head-of-line blocking. Optimizing service time without addressing queuing is often wasted effort.

What is the circuit breaker pattern and when does it apply?
?
Circuit breaker: A fault tolerance pattern that stops sending requests to a failing service rather than letting each request time out individually.

States:

Closed (normal): requests flow through; failures counted
Open (tripped): requests immediately fail without attempting the call; entered when failure count exceeds threshold
Half-open (recovery probe): a small number of requests allowed through to test if service has recovered

Why it matters:

Without it: every call to a failing service waits for the timeout (e.g., 30s); threads pile up; caller also starts failing
With it: calls fail immediately once the circuit is open; caller can fall back gracefully; failing service gets a chance to recover without being bombarded

Implementation: Hystrix (legacy), Resilience4j (Java), Polly (.NET), Envoy/Istio service mesh (infrastructure level)

Connection to reliability: Circuit breakers are a key mechanism for preventing cascading failures — one of the hardest reliability problems in distributed systems.
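
The three states can be sketched as a small Python class. This is a simplified, single-threaded illustration and not the API of Hystrix, Resilience4j, or Polly; the class name, parameters, and the consecutive-failure trip rule are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; circuit is open")
            self.state = "half-open"                # timeout elapsed: allow one probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each remote call in `breaker.call(...)` and catch `CircuitOpenError` to serve a fallback. Production libraries add thread safety, sliding failure-rate windows, and limits on concurrent half-open probes, none of which are modeled here.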

What is a rolling percentile and why is it preferred over batch statistics for SLO monitoring?
?
Batch statistics: Calculate p99 over all requests in a fixed historical window (e.g., all requests yesterday)

Problem: Yesterday’s data is stale; doesn’t reflect current system state; slow feedback loop

Rolling percentile (sliding window): Calculate p99 over the most recent N minutes or M requests

Example: “p99 latency over the last 5-minute rolling window”
Reflects current state; detects degradation within minutes
Actionable: on-call engineer sees the metric worsening in real time

How it’s implemented:

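HdrHistogram (High Dynamic Range Histogram): a compact structure for recording latencies and querying percentiles; rolling views are typically built by rotating or aggregating one histogram per interval
Monitoring tools such as Prometheus, Datadog, New Relic, and Honeycomb offer windowed percentile queries, usually computed from histograms or sketches (Prometheus, for example, estimates quantiles from bucketed histograms with histogram_quantile)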

Practical SLO monitoring: Define SLO as a rolling percentile with a burn rate alert:

“Alert if p99 > 500ms for more than 5 minutes in a 1-hour window”
Burn rate: if you’re consuming error budget 14x faster than the rate that would exactly use it up over the SLO window, alert immediately
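
A sliding-window percentile can be sketched with a deque of timestamped samples. This version is exact and sorts on every query, so it is only suitable for small windows; production systems typically use histograms or sketches instead. The class name and the assumption that timestamps arrive in non-decreasing order are specific to this example.

```python
import math
from collections import deque


class RollingPercentile:
    """Exact nearest-rank percentile over samples from the last window_s seconds."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, latency_ms), oldest first

    def record(self, timestamp: float, latency_ms: float) -> None:
        # Assumes timestamps are non-decreasing.
        self.samples.append((timestamp, latency_ms))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()

    def percentile(self, p: float, now: float):
        self._evict(now)
        if not self.samples:
            return None  # no data in the current window
        ordered = sorted(latency for _, latency in self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


rp = RollingPercentile(window_s=300)
for t in range(100):
    rp.record(t, 10)          # 100 fast requests
rp.record(100, 900)           # two slow requests arrive
rp.record(101, 900)
print(rp.percentile(50, now=101), rp.percentile(99, now=101))
print(rp.percentile(99, now=1000))  # window has emptied by now
```

Because old samples are evicted, the reported p99 reflects only recent traffic, which is the property that makes it useful for alerting.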

Total Cards: 25
Review Time: ~30 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI
