Chapter 2 Cheat Sheet — Defining Nonfunctional Requirements

One-Line Summaries

Concept           One-Liner
────────────────  ──────────────────────────────────────────────────────────────────────
Reliability       System works correctly even when faults occur
Fault vs Failure  Component deviation vs system-wide breakdown
Fault tolerance   Preventing faults from causing failures
Scalability       Ability to cope with increased load; always tied to a specific load parameter
Load parameter    The specific metric that characterizes demand (QPS, follower count, concurrency)
Maintainability   Easy for teams to operate, understand, and evolve over time
Operability       Making it easy for ops teams to keep the system running
Simplicity        Managing accidental complexity; good abstractions
Evolvability      Making it easy to change the system as requirements change
p99 latency       99% of requests complete faster than this value; captures tail latency
SLO               Internal performance target; precursor to SLA
Error budget      How much unreliability is allowed: (1 - SLO) per time window
Fan-out           Writing to multiple downstream locations on a single write (timeline problem)

Quick Numbers to Remember

Metric                       Value                            Context
──────────────────────────   ──────────────────────────────   ──────────────────────────────────────────
Hard disk MTTF               10–50 years                      Per disk; at 10K disks, expect ~1 failure/day
Twitter timeline reads       ~300K req/sec                    Fan-out problem; key load parameter is follower count, not QPS
Amazon latency impact        100ms increase = 1% sales loss   Tail latency has direct business cost
99.9% uptime                 8.76 hours downtime/year         "Three nines"
99.99% uptime                52.6 minutes downtime/year       "Four nines"
99.999% uptime               5.26 minutes downtime/year       "Five nines"
Tail latency amplification   0.99^N for N parallel calls      10 parallel calls: 0.99^10 ≈ 0.904, so ~9.6% of requests wait on at least one call slower than its p99
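
The downtime figures for the "nines" rows follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from the source):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_per_year_hours(uptime_pct: float) -> float:
    """Allowed downtime per year, in hours, for a given uptime percentage."""
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    hours = downtime_per_year_hours(pct)
    print(f"{pct}% uptime -> {hours:.2f} h/year ({hours * 60:.2f} min/year)")
```

Each extra nine divides the allowed downtime by ten, which is why five nines leaves only about five minutes per year.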

Fan-out Decision Framework (Home Timeline)

For each post, deliver to followers' timelines:

Fan-out on Read (Pull)
├─ Write: store post once → fast
├─ Read: execute join across follows + posts → slow at scale
├─ Best for: authors with very many followers; timelines that are read rarely
└─ Problem: expensive join on every timeline view

Fan-out on Write (Push)
├─ Write: propagate to all followers' timeline caches → slow for celebrities
├─ Read: serve cached timeline → fast
├─ Best for: users with moderate follower counts
└─ Problem: 1M followers × 1 post = 1M cache writes (write amplification)

Hybrid (correct answer for Twitter-like systems):
├─ Normal users (< ~100K followers) → push to timeline caches
└─ Celebrities (> ~100K followers) → pull/merge at read time
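
The hybrid strategy can be sketched in a few lines. This is an in-memory illustration only, not how any real service implements it: the class name, method names, and data layout are hypothetical, and the default threshold simply reuses the ~100K figure above.

```python
from collections import defaultdict


class HybridTimeline:
    """Toy model: push posts for normal authors, pull posts for celebrities."""

    def __init__(self, celebrity_threshold=100_000):  # assumed cutoff
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)    # author -> users following them
        self.following = defaultdict(set)    # user -> authors they follow
        self.cache = defaultdict(list)       # user -> posts pushed at write time
        self.pull_posts = defaultdict(list)  # author -> posts merged at read time

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def publish(self, author, ts, text):
        post = (ts, author, text)
        if len(self.followers[author]) >= self.celebrity_threshold:
            # Fan-out on read: store once, merge when a follower reads.
            self.pull_posts[author].append(post)
        else:
            # Fan-out on write: one cache write per follower.
            for follower in self.followers[author]:
                self.cache[follower].append(post)

    def read(self, user, limit=10):
        posts = list(self.cache[user])
        for author in self.following[user]:
            posts.extend(self.pull_posts.get(author, []))
        return sorted(posts, reverse=True)[:limit]  # newest first


tl = HybridTimeline(celebrity_threshold=2)  # tiny threshold for the demo
tl.follow("alice", "star")
tl.follow("bob", "star")
tl.follow("alice", "carol")
tl.publish("carol", 1, "hi")     # pushed into alice's cache
tl.publish("star", 2, "hello")   # stored once, pulled at read time
print(tl.read("alice"))
```

The read path pays a small merge cost only for the few celebrity accounts a user follows, while the write path avoids millions of cache writes per celebrity post.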

Percentile Quick Reference

100 requests: 99 take 10ms, 1 takes 10,000ms

Arithmetic mean: ~110ms (MISLEADING)

p50 (median): 10ms       (half of requests are faster)
p95:          10ms       (95% are faster; outlier not captured yet)
p99:          10ms       (borderline: exactly 99 of 100 requests take 10ms, so the outlier is still hidden)
p99.9:        10,000ms   (captures the outlier)

Use percentiles, never averages, for latency SLOs.
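
Percentile values close to a boundary depend on the definition used. A minimal sketch using the nearest-rank definition (one common convention, chosen here as an assumption; interpolating methods give different values near the tail) on the 100-request sample above:

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [10] * 99 + [10_000]  # 99 fast requests, 1 slow outlier

print("mean :", statistics.mean(latencies_ms))
for p in (50, 95, 99, 99.9):
    print(f"p{p}:", percentile(latencies_ms, p))
```

With nearest-rank, the 99th of 100 sorted samples is still 10 ms, and the outlier first appears at ranks above 99. The mean of about 110 ms describes no request that actually occurred.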

Tail latency amplification formula (assumes the N calls are independent):

P(all N parallel calls succeed within T) = P(single call < T)^N

Example: 5 parallel calls, each p99 = 100ms
P(all 5 < 100ms) = 0.99^5 ≈ 0.951

So only ~95% of composite requests finish within 100ms; the composite's p99 is WORSE than 100ms
p95 of composite ≈ p99 of individual calls; a composite p99 of 100ms needs each call's ~p99.8 ≤ 100ms
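
The formula can be evaluated in both directions. The sketch below assumes the parallel calls are independent, which real backends often are not; the function names are illustrative.

```python
def prob_all_within(single_call_quantile: float, n_calls: int) -> float:
    """P(all n independent parallel calls finish within T),
    given P(single call finishes within T)."""
    return single_call_quantile ** n_calls


def required_single_call_quantile(composite_quantile: float, n_calls: int) -> float:
    """Quantile each call must meet at T so the composite meets its target at T."""
    return composite_quantile ** (1 / n_calls)


print(round(prob_all_within(0.99, 5), 3))                # 0.951
print(round(prob_all_within(0.99, 10), 3))               # 0.904
print(round(required_single_call_quantile(0.99, 5), 4))  # 0.998
```

The last line shows the inverse view: for a 5-way fan-out to hit a composite p99 of T, each backend call needs roughly its p99.8 at or below T.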

Three Memory Architectures

Shared-Memory (SMP)           Shared-Disk              Shared-Nothing
─────────────────────         ────────────────────     ────────────────────
Multiple CPUs,                Multiple nodes,           Multiple nodes,
one RAM + disk                shared central storage    own disk per node
                              (NAS, SAN, S3)
Simple programming            Cloud warehouse model     True horizontal scale
Expensive at scale            (Snowflake, Redshift)     (Cassandra, Kafka)
Single point of failure       Compute scales freely     Distributed complexity
Use: small to medium          Use: cloud analytics      Use: large-scale web
     PostgreSQL VM                 data warehouses           systems

Fault Types and Mitigations

Hardware Faults
├─ Characteristics: Random, independent events
├─ Example: Disk crash (MTTF 10–50 years; expect daily at 10K disks)
├─ Old approach: RAID, dual power supplies, hot-swap CPUs
└─ Modern approach: Software fault tolerance + commodity hardware
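
The "expect daily at 10K disks" figure is an order-of-magnitude estimate. A minimal sketch of the arithmetic, assuming failures are independent and spread evenly over the MTTF (a simplification; the function name is illustrative):

```python
def expected_disk_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average failures per day across a fleet, assuming a constant failure rate."""
    return num_disks / (mttf_years * 365)


for mttf_years in (10, 50):
    rate = expected_disk_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years -> {rate:.2f} failures/day")
```

Across the 10 to 50 year MTTF range, a 10,000-disk cluster sees between roughly half a failure and three failures per day, so "about one per day" is the right mental model.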

Software Faults (Harder!)
├─ Characteristics: Systematic, correlated across all nodes
├─ Example: Leap second bug (2012), runaway process, cascading failure
├─ Cannot predict; affect all replicas simultaneously
└─ Mitigation: Testing, isolation, monitoring, crash-only software

Human Errors (Most Common!)
├─ Characteristics: Config errors cause most internet outages
├─ Example: Wrong flag in deployment, misconfigured firewall
└─ Mitigation:
   ├─ Design: Make the wrong thing hard to do (good APIs, staging env)
   ├─ Decouple: Feature flags, canary deployments, gradual rollout
   └─ Recover: Fast rollback, point-in-time recovery (PITR) backup, detailed monitoring

SLI / SLO / SLA / Error Budget

SLI (Service Level Indicator)
└─ A measurable metric
   Example: "% of requests completing in < 200ms"

SLO (Service Level Objective)
└─ Internal target for an SLI
   Example: "SLI must be ≥ 99.5% over rolling 28-day window"

SLA (Service Level Agreement)
└─ Contractual commitment to an SLO with consequences
   Example: "If SLO is missed, customer gets 10% credit"

Error Budget
└─ 1 - SLO = how much unreliability you can "spend"
   Example: 99.5% SLO → 0.5% budget → ~3.6 hours per 30-day month (~3.4 hours per 28-day window)
   "If budget is exhausted, freeze risky deployments"

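The error budget arithmetic can be sketched as follows. The function names are illustrative, and the request-based variant assumes the SLI counts good requests out of total requests, as in the SLI example above.

```python
def error_budget_hours(slo_pct: float, window_days: int) -> float:
    """Time-based budget: hours of full unavailability allowed in the window."""
    return (1 - slo_pct / 100) * window_days * 24


def budget_remaining(slo_pct: float, total_requests: int, bad_requests: int) -> float:
    """Request-based budget: fraction still unspent (negative means overspent)."""
    allowed_bad = (1 - slo_pct / 100) * total_requests
    return 1 - bad_requests / allowed_bad


print(round(error_budget_hours(99.5, 30), 2))               # 3.6
print(round(error_budget_hours(99.5, 28), 2))               # 3.36
print(round(budget_remaining(99.5, 1_000_000, 2_000), 2))   # 0.6
```

A release policy could compare the remaining fraction against zero to decide when to freeze risky deployments.
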
Scalability Decision Tree

System needs to handle more load. What to do?
│
├─ Read-heavy (reads >> writes)?
│  ├─ Step 1: Add caching (Redis, CDN)
│  ├─ Step 2: Add read replicas
│  └─ Step 3: Shared-disk architecture (separate read compute)
│
├─ Write-heavy (writes >> reads)?
│  ├─ Step 1: Profile single-node limits first
│  ├─ Step 2: Vertical scale (bigger machine)
│  └─ Step 3: Sharding (shared-nothing, partition writes by key)
│
├─ Latency-sensitive (p99 matters)?
│  ├─ Reduce downstream fan-out (fewer parallel calls)
│  ├─ Timeout aggressively, use bulkheads
│  └─ Use in-memory cache for hot data
│
└─ Variable load (spiky)?
   ├─ Elastic auto-scaling (Kubernetes HPA, AWS Auto Scaling)
   └─ Serverless for stateless components

Maintainability Three Pillars

Operability
├─ Visibility: metrics, logs, traces, alerts
├─ Automation: CI/CD, config management, IaC
├─ Predictability: avoid surprising behavior
├─ Runbooks and documentation
└─ Question: "What is the system doing right now?"

Simplicity
├─ Manage accidental complexity (not essential complexity)
├─ Good abstractions hide complexity without leaking it
├─ Warning signs: tight coupling, tangled deps, inconsistent naming
└─ Question: "Can a new engineer understand this in a day?"

Evolvability
├─ Requirements change constantly — design for change
├─ TDD as safety net for refactoring
├─ Feature flags for gradual rollout
├─ Schema evolution with backward compatibility
└─ Question: "Can we add this new feature safely?"

Essential vs Accidental Complexity

Essential Complexity
├─ Inherent in the problem
├─ Cannot be eliminated
└─ Examples: distributed consensus IS hard;
            handling schema evolution IS complex

Accidental Complexity
├─ Introduced by poor design choices
├─ CAN be eliminated with better abstractions
└─ Examples: spaghetti code, circular dependencies,
            inconsistent APIs, magic numbers scattered everywhere

Abstraction hides essential complexity:
SQL → hides B-tree internals
TCP → hides packet routing
Good service API → hides internal data model

Reliability vs Availability vs Durability

Reliability
├─ Definition: Works correctly when faults occur
├─ Measured by: Fault recovery success rate
└─ Example target: 99% of single-disk failures recovered automatically

Availability
├─ Definition: Service is accessible to users
├─ Measured by: Uptime % or error rate
└─ Example target: 99.9% of requests succeed

Durability
├─ Definition: Data survives hardware failure
├─ Measured by: Data loss rate
└─ Example target: 0 data loss if ≤ 2 disks fail simultaneously

These are distinct but related. A system can be available (responding) but unreliable (returning wrong answers).


Common Interview Patterns for Ch2 Topics

Pattern 1: Define the SLO first

"Before discussing architecture, let's define the SLO:
- What's the expected QPS? Read/write ratio?
- What's the acceptable p99 latency?
- What's the acceptable error rate?
- What's the durability requirement?"

Pattern 2: Identify the load parameter

"The key load parameter here isn't just QPS —
it's [follower count distribution / data size per user / cache hit rate].
Let me explain how that shapes the architecture..."

Pattern 3: Scale incrementally

1. Single node (vertical scale first)
2. Read replicas (if read-heavy)
3. Caching layer (if hot data exists)
4. Sharding (when writes outgrow single node)
5. Async fan-out (if write amplification is a problem)

Red Flags / Green Flags

Red Flags:
❌ "We'll never have more than 1000 users" (design for 10x)
❌ "Average latency is fine" (use percentiles)
❌ "We need 100% uptime" (impossible; define your SLO)
❌ "Let's go microservices from day one" (premature distribution)
❌ "The database will never fail" (it will)

Green Flags:
✅ "What's our SLO? Which percentile should we optimize for?"
✅ "Let's define the load parameter before the architecture"
✅ "We should instrument this from day one"
✅ "What happens when the database is unavailable?"
✅ "Let's start with a monolith and extract services as needed"

Quick Revision Time

5-minute review: One-Line Summaries + Percentile Quick Reference
15-minute review: Add Fan-out Decision Framework + Three Memory Architectures + Fault Types
30-minute review: Full cheatsheet + trace through Scalability Decision Tree
Last Updated: 2026-05-29