Chapter 2 Cheat Sheet — Defining Nonfunctional Requirements

One-Line Summaries

Concept	One-Liner
Reliability	System works correctly even when faults occur
Fault vs Failure	Component deviation vs system-wide breakdown
Fault tolerance	Preventing faults from causing failures
Scalability	Ability to cope with increased load; always tied to a specific load parameter
Load parameter	The specific metric that characterizes demand (QPS, follower count, concurrency)
Maintainability	Easy for teams to operate, understand, and evolve over time
Operability	Making it easy for ops teams to keep the system running
Simplicity	Managing accidental complexity; good abstractions
Evolvability	Making it easy to change the system as requirements change
p99 latency	99% of requests complete faster than this value; captures tail latency
SLO	Internal target for an SLI; the basis for an external SLA
Error budget	How much unreliability is allowed: (1 - SLO) per time window
Fan-out	Writing to multiple downstream locations on a single write (timeline problem)

Quick Numbers to Remember

Metric	Value	Context
Hard disk MTTF	10–50 years	Per disk; at 10K disks, expect 1 failure/day
Twitter timeline reads	~300K req/sec	Fan-out problem; key load parameter is follower count, not QPS
Amazon latency impact	100ms increase = 1% sales loss	Tail latency has direct business cost
99.9% uptime	8.76 hours downtime/year	"Three nines"
99.99% uptime	52.6 minutes/year	"Four nines"
99.999% uptime	5.26 minutes/year	"Five nines"
Tail latency amplification	0.99^N for N parallel calls	10 parallel calls: 0.99^10 ≈ 0.904 chance that all finish within the per-call p99
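
The downtime rows above are plain arithmetic on the availability target. A minimal sketch, assuming a 365-day year; the helper name `downtime_per_year` is illustrative and not from the source:

```python
from datetime import timedelta

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_per_year(availability: float) -> timedelta:
    """Maximum downtime per year allowed by an availability target in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return timedelta(seconds=(1.0 - availability) * SECONDS_PER_YEAR)


for target in (0.999, 0.9999, 0.99999):
    minutes = downtime_per_year(target).total_seconds() / 60
    print(f"{target:.3%} uptime -> {minutes:.1f} minutes/year")
```

Each extra nine divides the allowed downtime by ten, which is why the step from three to four nines is usually an architectural change rather than a tuning exercise.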

Fan-out Decision Framework (Home Timeline)

For each post, deliver to followers' timelines:

Fan-out on Read (Pull)
├─ Write: store post once → fast
├─ Read: execute join across follows + posts → slow at scale
├─ Best for: posts from accounts with very many followers; timelines that are read rarely
└─ Problem: expensive join on every timeline view

Fan-out on Write (Push)
├─ Write: propagate to all followers' timeline caches → slow for celebrities
├─ Read: serve cached timeline → fast
├─ Best for: users with moderate follower counts
└─ Problem: 1M followers × 1 post = 1M cache writes (write amplification)

Hybrid (correct answer for Twitter-like systems):
├─ Normal users (< ~100K followers) → push to timeline caches
└─ Celebrities (> ~100K followers) → pull/merge at read time
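
The hybrid rule can be sketched as a toy in-memory service. This is an illustration under stated assumptions, not how any real system implements it: the class `TimelineService`, its method names, and the use of a logical sequence number for ordering are all hypothetical, and the threshold simply mirrors the ~100K cutoff above.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 100_000  # assumption: mirrors the ~100K cutoff above


class TimelineService:
    """Toy hybrid fan-out: push for normal authors, pull for celebrities."""

    def __init__(self, threshold: int = CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)  # author -> users following them
        self.following = defaultdict(set)  # user -> authors they follow
        self.posts = defaultdict(list)     # author -> [(seq, text)]
        self.cache = defaultdict(list)     # user -> [(seq, author, text)] pushed entries
        self._seq = 0                      # logical clock used to order posts

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > self.threshold

    def post(self, author: str, text: str) -> None:
        self._seq += 1
        self.posts[author].append((self._seq, text))
        if not self.is_celebrity(author):
            # Fan-out on write: one cache append per follower.
            for user in self.followers[author]:
                self.cache[user].append((self._seq, author, text))

    def home_timeline(self, user: str, limit: int = 20):
        # Start from pushed entries, keyed by seq so merged posts never duplicate.
        merged = {entry[0]: entry for entry in self.cache[user]}
        # Fan-out on read: merge posts from followed celebrities at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.posts[author]:
                    merged[seq] = (seq, author, text)
        return sorted(merged.values(), reverse=True)[:limit]
```

With a small threshold for demonstration, `TimelineService(threshold=2)` treats any author with three or more followers as a celebrity: their posts cost one write, and the merge cost moves to each follower's read.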

Percentile Quick Reference

100 requests: 99 take 10ms, 1 takes 10,000ms

Arithmetic mean: ~110ms (MISLEADING)

p50 (median): 10ms       (half of requests are faster)
p95:          10ms       (95% are faster; outlier not captured yet)
p99:          10ms       (nearest-rank: 99th of 100 sorted values; outlier sits just above)
p99.9:        10,000ms   (captures the outlier)
max:          10,000ms   (the single slowest request)

Use percentiles, never averages, for latency SLOs.
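
The numbers for this example can be checked with a few lines. A minimal sketch using the nearest-rank definition (other definitions, such as linear interpolation, give different values near the tail); `percentile_nearest_rank` is an illustrative helper, not a library function. Under nearest-rank, the 99th of 100 sorted samples is still 10 ms, so a single outlier first shows up just above p99.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, p):
    """Smallest sample such that at least p percent of samples are <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


latencies_ms = [10] * 99 + [10_000]  # 99 fast requests, 1 slow outlier

print(f"mean   = {mean(latencies_ms):.1f} ms")
for p in (50, 95, 99, 99.9):
    print(f"p{p:<5} = {percentile_nearest_rank(latencies_ms, p)} ms")
```

The mean lands near 110 ms, a latency that no request actually experienced, which is the reason to report percentiles instead.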

Tail latency amplification formula:

P(all N parallel calls finish within T) = P(single call < T)^N   (assuming independent calls)

Example: 5 parallel calls, each p99 = 100ms
P(all 5 < 100ms) = 0.99^5 ≈ 0.951

So 100ms is only about the p95 of the composite request; its p99 is WORSE than 100ms.
For a composite p99 of 100ms, each call must hit 100ms at ≈ p99.8 (0.99^(1/5) ≈ 0.998).
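
The amplification formula is easy to tabulate. A minimal sketch, assuming the parallel calls are independent (real backends often share causes of slowness, so treat this as an approximation); both function names are illustrative:

```python
def prob_all_within(p_single: float, n_calls: int) -> float:
    """P(all n independent parallel calls finish within T), given P(one call < T)."""
    return p_single ** n_calls


def required_single_quantile(p_composite: float, n_calls: int) -> float:
    """Per-call quantile needed so the composite request meets p_composite."""
    return p_composite ** (1.0 / n_calls)


for n in (1, 5, 10, 100):
    print(f"N={n:>3}: P(all within per-call p99) = {prob_all_within(0.99, n):.3f}")

needed = required_single_quantile(0.99, 5)
print(f"per-call quantile for a composite p99 with N=5: {needed:.4f}")
```

Reading the second function the other way round gives a budgeting rule: the more calls a request fans out to, the further into the tail each backend must be fast.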

Three Memory Architectures

Shared-Memory (SMP)           Shared-Disk               Shared-Nothing
─────────────────────         ────────────────────      ────────────────────
Multiple CPUs,                Multiple nodes,           Multiple nodes,
one RAM + disk                shared central storage    own disk per node
                              (NAS, SAN, S3)
Simple programming            Cloud warehouse model     True horizontal scale
Expensive at scale            (Snowflake, Redshift)     (Cassandra, Kafka)
Single point of failure       Compute scales freely     Distributed complexity
Use: small to medium          Use: cloud analytics      Use: large-scale web
     PostgreSQL VM                 data warehouses           systems

Fault Types and Mitigations

Hardware Faults
├─ Characteristics: Random, independent events
├─ Example: Disk crash (MTTF 10–50 years; expect daily at 10K disks)
├─ Old approach: RAID, dual power supplies, hot-swap CPUs
└─ Modern approach: Software fault tolerance + commodity hardware
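
The "expect daily at 10K disks" figure follows from the MTTF range. A back-of-the-envelope sketch, assuming independent failures at a constant rate (a simplification); `expected_failures_per_day` is an illustrative helper:

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day across a fleet, at a constant failure rate."""
    failures_per_disk_per_day = 1.0 / (mttf_years * 365)
    return num_disks * failures_per_disk_per_day


for mttf in (10, 30, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf:>2} years, 10,000 disks -> {rate:.2f} failures/day")
```

Across the 10 to 50 year range the result stays between roughly half a failure and three failures per day, so "about one a day" is the right order of magnitude.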

Software Faults (Harder!)
├─ Characteristics: Systematic, correlated across all nodes
├─ Example: Leap second bug (2012), runaway process, cascading failure
├─ Hard to anticipate; can hit all replicas at the same time
└─ Mitigation: Testing, isolation, monitoring, crash-only software

Human Errors (Most Common!)
├─ Characteristics: Operator configuration errors are a leading cause of outages in large internet services
├─ Example: Wrong flag in deployment, misconfigured firewall
└─ Mitigation:
   ├─ Design: Make the wrong thing hard (good APIs, staging env)
   ├─ Decouple: Feature flags, canary deployments, gradual rollout
   └─ Recover: Fast rollback, point-in-time recovery (PITR) backups, detailed monitoring
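
One common way to implement a gradual rollout is to hash each user into a stable percentage bucket. A minimal sketch under that assumption; `rollout_bucket`, `is_enabled`, and the flag name in the usage note are hypothetical and not tied to any feature-flag product:

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if the user falls inside the first rollout_percent buckets."""
    return rollout_bucket(flag, user_id) < rollout_percent
```

Because the bucket is stable, raising the percentage for a flag such as `"new-timeline"` from 10 to 50 only adds users, and setting it to 0 acts as an immediate rollback without a redeploy.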

SLI / SLO / SLA / Error Budget

SLI (Service Level Indicator)
└─ A measurable metric
   Example: "% of requests completing in < 200ms"

SLO (Service Level Objective)
└─ Internal target for an SLI
   Example: "SLI must be ≥ 99.5% over rolling 28-day window"

SLA (Service Level Agreement)
└─ Contractual commitment to an SLO with consequences
   Example: "If SLO is missed, customer gets 10% credit"

Error Budget
└─ 1 - SLO = how much unreliability you can "spend"
   Example: 99.5% SLO → 0.5% budget → ~3.6 hours per 30-day month (~3.4 hours per 28-day window)
   "If budget is exhausted, freeze risky deployments"

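The error budget arithmetic can be sketched in both its time-based and request-based forms. The function names are illustrative, and treating the budget as "requests that may miss the SLI target" is an assumption about how the SLI is counted:

```python
def error_budget_hours(slo: float, window_days: int) -> float:
    """Time-based budget: hours of allowed unavailability in the window."""
    return (1.0 - slo) * window_days * 24


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of a request-based budget still unspent (negative if overspent)."""
    allowed = (1.0 - slo) * total_requests
    if allowed <= 0:
        raise ValueError("no error budget: SLO is 100% or there was no traffic")
    return 1.0 - bad_requests / allowed


print(f"99.5% over 30 days: {error_budget_hours(0.995, 30):.2f} hours")
print(f"99.5% over 28 days: {error_budget_hours(0.995, 28):.2f} hours")
print(f"budget left: {budget_remaining(0.995, 1_000_000, 2_500):.0%}")
```

A deployment freeze policy can then be expressed as a threshold on `budget_remaining`, for example pausing risky changes once it drops to zero.
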
Scalability Decision Tree

System needs to handle more load. What to do?
│
├─ Read-heavy (reads >> writes)?
│  ├─ Step 1: Add caching (Redis, CDN)
│  ├─ Step 2: Add read replicas
│  └─ Step 3: Shared-disk architecture (separate read compute)
│
├─ Write-heavy (writes >> reads)?
│  ├─ Step 1: Profile single-node limits first
│  ├─ Step 2: Vertical scale (bigger machine)
│  └─ Step 3: Sharding (shared-nothing, partition writes by key)
│
├─ Latency-sensitive (p99 matters)?
│  ├─ Reduce downstream fan-out (fewer parallel calls)
│  ├─ Timeout aggressively, use bulkheads
│  └─ Use in-memory cache for hot data
│
└─ Variable load (spiky)?
   ├─ Elastic auto-scaling (Kubernetes HPA, AWS Auto Scaling)
   └─ Serverless for stateless components
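
For the latency-sensitive branch, an aggressive timeout plus a bulkhead can be sketched with `asyncio`. This is one possible shape, not a standard API: the `Bulkhead` class, its parameters, and the stand-in dependencies are hypothetical, and the wait for a free slot is not itself bounded in this sketch.

```python
import asyncio


class Bulkhead:
    """Caps concurrent calls to one dependency and bounds how long each may take."""

    def __init__(self, max_concurrent: int, timeout_s: float):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._timeout_s = timeout_s

    async def call(self, make_request, fallback=None):
        """Run make_request() under the concurrency cap; return fallback on timeout."""
        async with self._slots:
            try:
                return await asyncio.wait_for(make_request(), timeout=self._timeout_s)
            except asyncio.TimeoutError:
                return fallback


async def demo():
    async def slow_dependency():
        await asyncio.sleep(1.0)  # stand-in for a slow downstream call
        return "fresh"

    async def fast_dependency():
        return "fresh"

    bulkhead = Bulkhead(max_concurrent=2, timeout_s=0.05)
    slow = await bulkhead.call(slow_dependency, fallback="cached")
    fast = await bulkhead.call(fast_dependency, fallback="cached")
    return slow, fast


print(asyncio.run(demo()))
```

Giving each downstream dependency its own `Bulkhead` means a slow dependency can exhaust only its own slots, while the timeout keeps the caller's tail latency bounded at the cost of serving a fallback.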

Maintainability Three Pillars

Operability
├─ Visibility: metrics, logs, traces, alerts
├─ Automation: CI/CD, config management, IaC
├─ Predictability: avoid surprising behavior
├─ Runbooks and documentation
└─ Question: "What is the system doing right now?"

Simplicity
├─ Manage accidental complexity (not essential complexity)
├─ Good abstractions hide complexity without leaking it
├─ Warning signs: tight coupling, tangled deps, inconsistent naming
└─ Question: "Can a new engineer understand this in a day?"

Evolvability
├─ Requirements change constantly — design for change
├─ TDD as safety net for refactoring
├─ Feature flags for gradual rollout
├─ Schema evolution with backward compatibility
└─ Question: "Can we add this new feature safely?"
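
Backward-compatible schema evolution usually means new code can still read old records. A minimal sketch, assuming records are plain dictionaries and a `locale` field was added in a second version; `UserV2` and `decode_user` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class UserV2:
    user_id: str
    name: str
    locale: str = "en"  # field added in v2; the default keeps v1 records readable


def decode_user(record: dict) -> UserV2:
    """Decode a v1 or v2 record; unknown fields from newer writers are ignored."""
    return UserV2(
        user_id=record["user_id"],
        name=record["name"],
        locale=record.get("locale", "en"),
    )
```

Adding only optional fields with defaults, and ignoring fields the reader does not know, lets old and new versions of the code run side by side during a rolling upgrade.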

Essential vs Accidental Complexity

Essential Complexity
├─ Inherent in the problem
├─ Cannot be eliminated
└─ Examples: distributed consensus IS hard;
            handling schema evolution IS complex

Accidental Complexity
├─ Introduced by poor design choices
├─ CAN be eliminated with better abstractions
└─ Examples: spaghetti code, circular dependencies,
            inconsistent APIs, magic numbers scattered everywhere

Abstraction hides implementation detail behind a clean interface:
SQL → hides B-tree internals
TCP → hides packet loss, reordering, and retransmission
Good service API → hides internal data model

Reliability vs Availability vs Durability

Property	Definition	Measured By	Example Target
Reliability	Works correctly when faults occur	Fault recovery success rate	99% of single-disk failures recovered automatically
Availability	Service is accessible to users	Uptime % or error rate	99.9% of requests succeed
Durability	Data survives hardware failure	Data loss rate	0 data loss if ≤ 2 disks fail simultaneously

These are distinct but related. A system can be available (responding) but unreliable (returning wrong answers).

Common Interview Patterns for Ch2 Topics

Pattern 1: Define the SLO first

"Before discussing architecture, let's define the SLO:
- What's the expected QPS? Read/write ratio?
- What's the acceptable p99 latency?
- What's the acceptable error rate?
- What's the durability requirement?"

Pattern 2: Identify the load parameter

"The key load parameter here isn't just QPS —
it's [follower count distribution / data size per user / cache hit rate].
Let me explain how that shapes the architecture..."

Pattern 3: Scale incrementally

1. Single node (vertical scale first)
2. Read replicas (if read-heavy)
3. Caching layer (if hot data exists)
4. Sharding (when writes outgrow single node)
5. Async fan-out (if write amplification is a problem)
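
Step 4 needs a rule that maps each key to a shard. A minimal sketch of hash partitioning, assuming string keys and a fixed shard count; `shard_for` is an illustrative helper:

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard using a stable hash (not Python's salted hash())."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for key in ("user:1", "user:2", "user:3"):
    print(key, "-> shard", shard_for(key, 8))
```

Plain hash-mod-N is the simplest choice but moves most keys when the shard count changes, so systems that expect to grow often use a fixed number of logical partitions or consistent hashing instead.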

Red Flags / Green Flags

Red Flags:
❌ "We'll never have more than 1000 users" (design for 10x)
❌ "Average latency is fine" (use percentiles)
❌ "We need 100% uptime" (impossible; define your SLO)
❌ "Let's go microservices from day one" (premature distribution)
❌ "The database will never fail" (it will)

Green Flags:
✅ "What's our SLO? Which percentile should we optimize for?"
✅ "Let's define the load parameter before the architecture"
✅ "We should instrument this from day one"
✅ "What happens when the database is unavailable?"
✅ "Let's start with a monolith and extract services as needed"

Quick Revision Time

5-minute review: One-Line Summaries + Percentile Quick Reference
15-minute review: Add Fan-out Decision Framework + Three Memory Architectures + Fault Types
30-minute review: Full cheatsheet + trace through Scalability Decision Tree
Last Updated: 2026-05-29

Study Notes by Niladri & AI
