Chapter 2 Cheat Sheet — Defining Nonfunctional Requirements
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Reliability | System works correctly even when faults occur |
| Fault vs Failure | Component deviation vs system-wide breakdown |
| Fault tolerance | Preventing faults from causing failures |
| Scalability | Ability to cope with increased load; always tied to a specific load parameter |
| Load parameter | The specific metric that characterizes demand (QPS, follower count, concurrency) |
| Maintainability | Easy for teams to operate, understand, and evolve over time |
| Operability | Making it easy for ops teams to keep the system running |
| Simplicity | Managing accidental complexity; good abstractions |
| Evolvability | Making it easy to change the system as requirements change |
| p99 latency | 99% of requests complete faster than this value; captures tail latency |
| SLO | Internal target for an SLI; the basis for any SLA |
| Error budget | How much unreliability is allowed: (1 - SLO) per time window |
| Fan-out | Writing to multiple downstream locations on a single write (timeline problem) |
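The error budget one-liner, (1 - SLO) per time window, can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 30-day default window, and the request counts in the usage note are assumptions, not part of the chapter.

```python
def error_budget(slo_pct: float, window_days: int = 30):
    """Return (budget fraction, budget as hours of full downtime) for a window."""
    fraction = 1 - slo_pct / 100
    return fraction, fraction * window_days * 24


def budget_remaining(slo_pct: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent).

    Assumes slo_pct < 100; a 100% SLO leaves no budget to divide by.
    """
    allowed_failures = (1 - slo_pct / 100) * total_requests
    return 1 - failed_requests / allowed_failures


fraction, hours = error_budget(99.5, window_days=30)
print(f"budget: {fraction:.3%} of the window, about {hours:.2f} hours")
```

For a request-based SLI, `budget_remaining(99.5, 1_000_000, 2_500)` reports that half the budget is left, since 5,000 failed requests would be allowed in that hypothetical window.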
Quick Numbers to Remember
| Metric | Value | Context |
|---|---|---|
| Hard disk MTTF | 10–50 years | Per disk; at 10K disks, expect 1 failure/day |
| Twitter timeline reads | ~300K req/sec | Fan-out problem; the key load parameter is the distribution of followers per user, not raw QPS |
| Amazon latency impact | 100ms increase = 1% sales loss | Tail latency has direct business cost |
| 99.9% uptime | 8.76 hours downtime/year | "Three nines" |
| 99.99% uptime | 52.6 minutes/year | "Four nines" |
| 99.999% uptime | 5.26 minutes/year | "Five nines" |
| Tail latency amplification | P(all N parallel calls within per-call p99) = 0.99^N | 10 independent parallel calls: 0.99^10 ≈ 0.904, so ~9.6% of composite requests exceed the per-call p99 |
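The uptime rows above follow directly from the length of a year. A minimal sketch, assuming a 365-day year (leap years ignored) and a hypothetical helper name:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Each extra nine divides the allowance by ten, which is why moving from three nines to five nines usually changes the architecture rather than just the tuning.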
Fan-out Decision Framework (Home Timeline)
For each post, deliver to followers' timelines:
Fan-out on Read (Pull)
├─ Write: store post once → fast
├─ Read: execute join across follows + posts → slow at scale
├─ Best for: posts from accounts with very many followers; timelines that are read rarely
└─ Problem: expensive join on every timeline view
Fan-out on Write (Push)
├─ Write: propagate to all followers' timeline caches → slow for celebrities
├─ Read: serve cached timeline → fast
├─ Best for: users with moderate follower counts
└─ Problem: 1M followers × 1 post = 1M cache writes (write amplification)
Hybrid (correct answer for Twitter-like systems):
├─ Normal users (< ~100K followers) → push to timeline caches
└─ Celebrities (> ~100K followers) → pull/merge at read time
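The hybrid strategy above can be sketched as a toy in-memory service. This is an illustration under stated assumptions, not how any real system is built: the class, its method names, the in-process dictionaries, and the logical sequence counter are all hypothetical, and the 100,000 default simply mirrors the approximate cutoff above.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 100_000  # assumed cutoff, mirroring the ~100K figure


class TimelineService:
    """Toy hybrid fan-out: push for normal authors, pull for celebrities."""

    def __init__(self, threshold: int = CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)        # author -> users following them
        self.following = defaultdict(set)        # user -> authors they follow
        self.posts = defaultdict(list)           # author -> [(seq, text)]
        self.timeline_cache = defaultdict(list)  # user -> [(seq, author, text)]
        self.seq = 0                             # logical clock used for ordering

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author) -> bool:
        return len(self.followers[author]) > self.threshold

    def post(self, author, text):
        self.seq += 1
        self.posts[author].append((self.seq, text))
        if not self.is_celebrity(author):
            # Push path: one cache write per follower, bounded by the threshold.
            for user in self.followers[author]:
                self.timeline_cache[user].append((self.seq, author, text))

    def home_timeline(self, user, limit: int = 10):
        # Key by sequence number so a post is never listed twice, even if its
        # author crossed the threshold after earlier posts were already pushed.
        merged = {seq: (author, text)
                  for seq, author, text in self.timeline_cache[user]}
        # Pull path: merge posts from followed celebrities at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.posts[author]:
                    merged[seq] = (author, text)
        newest_first = sorted(merged, reverse=True)[:limit]
        return [merged[seq] for seq in newest_first]
```

With `threshold=2`, an author with three followers is treated as a celebrity: their posts never touch the timeline caches and are merged in only when a follower reads the home timeline.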
Percentile Quick Reference
100 requests: 99 take 10ms, 1 takes 10,000ms
Arithmetic mean: ~110ms (MISLEADING)
p50 (median): 10ms (half of requests are faster)
p95: 10ms (95% are faster; outlier not captured yet)
p99: 10ms (nearest-rank: the outlier is the 100th value, just past the cutoff)
p99.9: 10,000ms (captures the outlier!)
Use percentiles, never averages, for latency SLOs.
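The example can be checked with a few lines of code. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; the helper name is hypothetical. Note that with exactly 100 samples the single outlier is the 100th sorted value, so nearest-rank reports it only above p99, while interpolating definitions give a value between 10ms and 10,000ms at p99.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


latencies_ms = [10] * 99 + [10_000]

print("mean:", statistics.mean(latencies_ms))
for p in (50, 95, 99, 99.9):
    print(f"p{p}:", percentile(latencies_ms, p))
```

The mean lands near 110ms, a latency that no request actually experienced, which is the reason to report percentiles instead.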
Tail latency amplification formula:
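P(all N parallel calls finish within T) = P(single call < T)^N (assuming independent calls)
Example: 5 parallel calls, each with p99 = 100ms
P(all 5 < 100ms) = 0.99^5 ≈ 0.951
So only ~95% of composite requests finish within 100ms: the composite's p99 is WORSE than 100ms
Composite p95 ≈ per-call p99; a composite p99 of 100ms needs each call at ≈ p99.8 (0.99^(1/5) ≈ 0.998)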
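The formula is easy to explore numerically. A minimal sketch, assuming the parallel calls are statistically independent (real backends often share causes of slowness, so treat the result as a rough guide); both function names are illustrative.

```python
def prob_all_within(per_call_prob: float, n_calls: int) -> float:
    """P(all n independent parallel calls finish within T)."""
    return per_call_prob ** n_calls


def required_per_call_prob(composite_prob: float, n_calls: int) -> float:
    """Per-call probability needed for all n calls to meet a composite target."""
    return composite_prob ** (1 / n_calls)


for n in (1, 5, 10, 100):
    print(f"N={n:3d}: P(all within per-call p99) = {prob_all_within(0.99, n):.3f}")

print(f"per-call level for composite p99, N=5: {required_per_call_prob(0.99, 5):.4f}")
```

The second function inverts the first: to keep a composite p99 at T with five parallel calls, each call has to meet T roughly 99.8% of the time.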
Three Memory Architectures
| Aspect | Shared-Memory (SMP) | Shared-Disk | Shared-Nothing |
|---|---|---|---|
| Structure | Multiple CPUs, one RAM + disk | Multiple nodes, shared central storage (NAS, SAN, S3) | Multiple nodes, own disk per node |
| Character | Simple programming; expensive at scale | Cloud warehouse model (Snowflake, Redshift) | True horizontal scale (Cassandra, Kafka) |
| Key property | Single point of failure | Compute scales freely | Distributed complexity |
| Use | Small to medium PostgreSQL VM | Cloud analytics data warehouses | Large-scale web systems |
Fault Types and Mitigations
Hardware Faults
├─ Characteristics: Random, independent events
├─ Example: Disk crash (MTTF 10–50 years; expect daily at 10K disks)
├─ Old approach: RAID, dual power supplies, hot-swap CPUs
└─ Modern approach: Software fault tolerance + commodity hardware
Software Faults (Harder!)
├─ Characteristics: Systematic, correlated across all nodes
├─ Example: Leap second bug (2012), runaway process, cascading failure
├─ Cannot predict; affect all replicas simultaneously
└─ Mitigation: Testing, isolation, monitoring, crash-only software
Human Errors (Most Common!)
├─ Characteristics: Config errors cause most internet outages
├─ Example: Wrong flag in deployment, misconfigured firewall
├─ Mitigation:
│ ├─ Design: Make wrong thing hard (good APIs, staging env)
│ ├─ Decouple: Feature flags, canary deployments, gradual rollout
│ └─ Recover: Fast rollback, PITR backup, detailed monitoring
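The "expect daily at 10K disks" estimate in the hardware faults entry above is plain arithmetic on the MTTF range. A rough sketch, assuming independent failures at a constant average rate; the function name is hypothetical.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average disk failures per day for a fleet, given per-disk MTTF in years."""
    return num_disks / (mttf_years * 365)


for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years, 10,000 disks: {rate:.2f} failures/day")
```

Across the 10 to 50 year range the fleet sees somewhere between roughly half a failure and three failures per day, so "about one per day" is the right order of magnitude.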
SLI / SLO / SLA / Error Budget
SLI (Service Level Indicator)
└─ A measurable metric
Example: "% of requests completing in < 200ms"
SLO (Service Level Objective)
└─ Internal target for an SLI
Example: "SLI must be ≥ 99.5% over rolling 28-day window"
SLA (Service Level Agreement)
└─ Contractual commitment to an SLO with consequences
Example: "If SLO is missed, customer gets 10% credit"
Error Budget
└─ 1 - SLO = how much unreliability you can "spend"
Example: 99.5% SLO → 0.5% budget → ~3.6 hours of full downtime per 30-day month
"If budget is exhausted, freeze risky deployments"
Scalability Decision Tree
System needs to handle more load. What to do?
│
├─ Read-heavy (reads >> writes)?
│ ├─ Step 1: Add caching (Redis, CDN)
│ ├─ Step 2: Add read replicas
│ └─ Step 3: Shared-disk architecture (separate read compute)
│
├─ Write-heavy (writes >> reads)?
│ ├─ Step 1: Profile single-node limits first
│ ├─ Step 2: Vertical scale (bigger machine)
│ └─ Step 3: Sharding (shared-nothing, partition writes by key)
│
├─ Latency-sensitive (p99 matters)?
│ ├─ Reduce downstream fan-out (fewer parallel calls)
│ ├─ Timeout aggressively, use bulkheads
│ └─ Use in-memory cache for hot data
│
└─ Variable load (spiky)?
├─ Elastic auto-scaling (Kubernetes HPA, AWS Auto Scaling)
└─ Serverless for stateless components
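For the sharding step in the write-heavy branch, "partition writes by key" is often done by hashing the key. The sketch below is one simple way to do it, not a recommendation for any particular database: the function name and key format are made up, and `hashlib` is used because Python's built-in `hash()` for strings is randomized per process and so is unsuitable for routing.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Route some hypothetical user keys across 4 shards and count the spread.
spread = Counter(shard_for(f"user:{i}", 4) for i in range(1_000))
print(dict(spread))
```

Plain modulo hashing moves most keys when `num_shards` changes, so systems that reshard often tend to use a fixed number of logical partitions or consistent hashing instead.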
Maintainability Three Pillars
Operability
├─ Visibility: metrics, logs, traces, alerts
├─ Automation: CI/CD, config management, IaC
├─ Predictability: avoid surprising behavior
├─ Runbooks and documentation
└─ Question: "What is the system doing right now?"
Simplicity
├─ Manage accidental complexity (not essential complexity)
├─ Good abstractions hide complexity without leaking it
├─ Warning signs: tight coupling, tangled deps, inconsistent naming
└─ Question: "Can a new engineer understand this in a day?"
Evolvability
├─ Requirements change constantly — design for change
├─ TDD as safety net for refactoring
├─ Feature flags for gradual rollout
├─ Schema evolution with backward compatibility
└─ Question: "Can we add this new feature safely?"
Essential vs Accidental Complexity
Essential Complexity
├─ Inherent in the problem
├─ Cannot be eliminated
└─ Examples: distributed consensus IS hard;
handling schema evolution IS complex
Accidental Complexity
├─ Introduced by poor design choices
├─ CAN be eliminated with better abstractions
└─ Examples: spaghetti code, circular dependencies,
inconsistent APIs, magic numbers scattered everywhere
Good abstractions hide implementation detail behind a clean interface:
SQL → hides B-tree internals
TCP → hides packet routing
Good service API → hides internal data model
Reliability vs Availability vs Durability
| Property | Definition | Measured By | Example Target |
|---|---|---|---|
| Reliability | Works correctly when faults occur | Fault recovery success rate | 99% of single-disk failures recovered automatically |
| Availability | Service is accessible to users | Uptime % or error rate | 99.9% of requests succeed |
| Durability | Data survives hardware failure | Data loss rate | 0 data loss if ≤ 2 disks fail simultaneously |
These are distinct but related. A system can be available (responding) but unreliable (returning wrong answers).
Common Interview Patterns for Ch2 Topics
Pattern 1: Define the SLO first
"Before discussing architecture, let's define the SLO:
- What's the expected QPS? Read/write ratio?
- What's the acceptable p99 latency?
- What's the acceptable error rate?
- What's the durability requirement?"
Pattern 2: Identify the load parameter
"The key load parameter here isn't just QPS —
it's [follower count distribution / data size per user / cache hit rate].
Let me explain how that shapes the architecture..."
Pattern 3: Scale incrementally
1. Single node (vertical scale first)
2. Read replicas (if read-heavy)
3. Caching layer (if hot data exists)
4. Sharding (when writes outgrow single node)
5. Async fan-out (if write amplification is a problem)
Red Flags / Green Flags
Red Flags:
❌ "We'll never have more than 1000 users" (design for 10x)
❌ "Average latency is fine" (use percentiles)
❌ "We need 100% uptime" (impossible; define your SLO)
❌ "Let's go microservices from day one" (premature distribution)
❌ "The database will never fail" (it will)
Green Flags:
✅ "What's our SLO? Which percentile should we optimize for?"
✅ "Let's define the load parameter before the architecture"
✅ "We should instrument this from day one"
✅ "What happens when the database is unavailable?"
✅ "Let's start with a monolith and extract services as needed"
Quick Revision Time
5-minute review: One-Line Summaries + Percentile Quick Reference
15-minute review: Add Fan-out Decision Framework + Three Memory Architectures + Fault Types
30-minute review: Full cheat sheet + trace through Scalability Decision Tree
Last Updated: 2026-05-29