Chapter 1 Cheat Sheet - Quick Reference

One-Line Summaries

Concept	One-Liner
Reliability	System works correctly even when faults occur
Scalability	System copes with increased load effectively
Maintainability	Easy for teams to operate, understand, and evolve
Fault vs Failure	Component deviation vs system-wide breakdown
Percentiles	p50/p95/p99 capture performance better than averages
Fan-out	One request triggers work in many places (e.g., one tweet written to every follower’s timeline)

Quick Numbers to Remember

Metric	Value	Context
Hard disk MTTF	10-50 years	But at 10,000 disks, expect 1 failure/day
Twitter timeline reads	300K req/sec	vs 12K req/sec at peak for posting tweets
Amazon latency impact	+100ms latency = -1% sales	Every millisecond counts
Google machine failures	1-5% per year	Expect failures at scale
SLA example	p50 < 200ms, p99 < 1s	Use percentiles in SLAs
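
The hard disk row above follows from simple arithmetic. A minimal sketch, assuming failures are independent and occur at a constant rate of 1/MTTF per disk (the function name is illustrative):

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Expected disk failures per day across a fleet.

    Assumes independent failures at a constant rate of 1/MTTF per disk,
    which is a simplification of real failure behaviour.
    """
    mttf_days = mttf_years * 365
    return disk_count / mttf_days


for mttf_years in (10, 30, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {rate:.2f} failures/day")
```

For an MTTF between 10 and 50 years this gives roughly 0.5 to 2.7 failures per day, which is where the "about one per day" rule of thumb comes from.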

Three Types of Faults

Hardware Faults
├── Hard disk crashes
├── RAM errors
├── Power outages
└── Solution: Redundancy + Software fault-tolerance

Software Faults (Harder!)
├── Systematic bugs
├── Cascading failures
├── Runaway processes
└── Solution: Testing, isolation, monitoring, recovery

Human Errors (Most Common!)
├── Config mistakes
├── Operational errors
└── Solution: Good design, testing, quick recovery, monitoring

Scaling Decision Tree

Need to scale?
│
├─ Read-heavy?
│  ├─ Cache (Redis, CDN)
│  └─ Read replicas
│
├─ Write-heavy?
│  ├─ Sharding
│  └─ Write-behind cache or queue (batch writes)
│
├─ Low latency?
│  ├─ In-memory data
│  └─ Edge locations
│
└─ High volume?
   ├─ Horizontal scaling
   └─ Load balancing
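
For the sharding branch, one common way to pick a shard is to hash the key and take the result modulo the shard count. The sketch below is illustrative only: the function name, the key format, and the shard count of 4 are assumptions, and plain modulo hashing remaps most keys whenever the shard count changes.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), whose output for
    strings varies between Python processes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


for user_id in ("user:1", "user:2", "user:3"):
    print(user_id, "-> shard", shard_for(user_id, 4))
```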

Twitter Fan-out Problem

Problem: Posting tweets to followers’ timelines

Approach 1 (Pull): Write once, join on read
├─ Pro: Simple writes
└─ Con: Expensive reads (join for every timeline view)

Approach 2 (Push): Pre-compute all timelines
├─ Pro: Fast reads
└─ Con: Write amplification (celebrity with 30M followers = 30M writes)

Solution (Hybrid):
├─ Regular users: Push (pre-compute)
└─ Celebrities: Pull (compute on read)
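
The hybrid approach can be sketched in memory as follows. All names, the data structures, and the celebrity threshold of 3 followers are assumptions made for illustration; a real system would use a far larger cutoff and persistent storage.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, tiny so the demo is readable

followers = defaultdict(set)          # author -> users who follow them
following = defaultdict(set)          # user -> authors they follow
tweets_by_author = defaultdict(list)  # author -> [(seq, text)]
timeline_cache = defaultdict(list)    # user -> [(seq, author, text)], pushed
_seq = 0                              # global ordering of tweets


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post_tweet(author, text):
    global _seq
    _seq += 1
    tweets_by_author[author].append((_seq, text))
    if not is_celebrity(author):
        # Push: one write per follower into the precomputed timelines.
        for user in followers[author]:
            timeline_cache[user].append((_seq, author, text))


def read_timeline(user):
    # Start from pushed entries, keyed by sequence number so that a
    # tweet is never listed twice.
    entries = {seq: (author, text) for seq, author, text in timeline_cache[user]}
    # Pull: merge in tweets from followed celebrities at read time.
    for author in following[user]:
        if is_celebrity(author):
            for seq, text in tweets_by_author[author]:
                entries[seq] = (author, text)
    return [entries[seq] for seq in sorted(entries, reverse=True)]


for fan in ("ann", "bob", "cat"):
    follow(fan, "star")
follow("ann", "bob")

post_tweet("bob", "hello")      # bob has 1 follower: pushed to ann
post_tweet("star", "big news")  # star is a celebrity: pulled on read
print(read_timeline("ann"))
```

The sketch evaluates celebrity status at call time and does not backfill timelines when an account crosses the threshold in either direction.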

Percentiles Explained

100 requests: 98 take 10ms, 2 take 1000ms

Average: ~30ms (misleading! no request actually took 30ms)

p50 (median): 10ms (half of requests are at or below this)
p95: 10ms (95% are at or below this)
p99: 1000ms (99% are at or below this) ← Captures outliers!

Why p99 matters: Slowest requests often from users with most data (best customers)
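
A minimal sketch of how percentiles can be computed, using the nearest-rank definition. The sample is hypothetical: 100 requests with two slow outliers, chosen so that the 99th-ranked value is itself an outlier.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 slow ones (milliseconds).
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.1f} ms")
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Libraries such as NumPy interpolate between ranks by default, so their p99 on a small sample can differ from the nearest-rank value.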

Maintainability Three Pillars

Operability
├─ Easy to keep running
├─ Good monitoring
├─ Automation support
└─ "What's the system doing right now?"

Simplicity
├─ Easy to understand
├─ Manage complexity via abstraction
└─ "Can a new engineer understand this?"

Evolvability
├─ Easy to change
├─ Adapt to new requirements
└─ "Can we add this feature safely?"

SLO/SLA Example

Service Level Objective (SLO):
├─ Median response time (p50) < 200ms
├─ 95th percentile (p95) < 500ms
├─ 99th percentile (p99) < 1 second
└─ Uptime: 99.9% (about 8.76 hours downtime/year)

Service Level Agreement (SLA):
└─ SLO + consequences (refunds, penalties)
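
The downtime implied by an uptime target is a one-line calculation. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Yearly downtime budget implied by an uptime target."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime -> {hours:.2f} hours/year")
```

Each extra "nine" cuts the budget by a factor of ten: about 87.6 hours at 99%, 8.76 hours at 99.9%, and under one hour at 99.99%.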

Common Interview Patterns

Pattern 1: Identify Bottleneck

System slow? Ask:
1. What's the load parameter? (QPS, users, data size)
2. What's the bottleneck? (CPU, memory, disk, network)
3. What percentile? (p50 vs p99)
4. Solution: Scale the bottleneck

Pattern 2: Design for Reliability

Single point of failure? Add:
├─ Replication (data redundancy)
├─ Load balancing (traffic distribution)
├─ Health checks (failure detection)
└─ Automatic failover (recovery)
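
How health checks and failover fit together can be sketched as below. The Replica class and route function are hypothetical stand-ins; a real health check would be a network probe, and a real router would also need timeouts and retry limits.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def health_check(self):
        # Stand-in for a real probe such as an HTTP ping or TCP connect.
        return self.healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def route(request, replicas):
    """Send the request to the first replica that passes its health check."""
    for replica in replicas:
        if replica.health_check():
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # failed between check and call; try the next one
    raise RuntimeError("no healthy replica available")


replicas = [Replica("primary"), Replica("standby-1"), Replica("standby-2")]
print(route("GET /users/1", replicas))
replicas[0].healthy = False  # simulate a fault on the primary
print(route("GET /users/1", replicas))
```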

Pattern 3: Scale Gradually

Scale progression:
1. Single server (simple)
2. Separate DB tier (flexibility)
3. Add cache + CDN (performance)
4. Horizontal scaling (capacity)
5. Sharding + async (complexity)

Key Trade-offs

Decision	Pro	Con	When to Use
Vertical scaling	Simple, no code changes	Limited, expensive	Small to medium scale
Horizontal scaling	Capacity grows by adding machines, commodity hardware	Complex, network latency, coordination overhead	Large scale
Synchronous replication	No loss of acknowledged writes	Slower writes, blocked if a replica is down	Critical data
Asynchronous replication	Fast writes	Recent writes may be lost on failover	Non-critical data
Caching	Fast reads	Stale data, complexity	Read-heavy workloads
Pre-computation	Fast queries	Storage cost, staleness	Known access patterns
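
The caching row's trade-off (fast reads vs stale data) can be seen in a small cache-aside sketch with a time-to-live. The TTLCache class, the 30-second TTL, and the dictionary standing in for a database are all assumptions for illustration.

```python
import time


class TTLCache:
    """Cache-aside with a time-to-live that bounds staleness."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load        # function that reads the source of truth
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]      # hit: possibly stale, but within the TTL
        value = self._load(key)  # miss or expired: read the backing store
        self._entries[key] = (value, now + self._ttl)
        return value


database = {"user:1": "Ada"}     # stand-in for a slow data store
cache = TTLCache(load=database.__getitem__, ttl_seconds=30)

print(cache.get("user:1"))       # loaded from the database
database["user:1"] = "Grace"
print(cache.get("user:1"))       # still the cached value until the TTL expires
```

A shorter TTL reduces staleness but sends more reads to the backing store.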

Red Flags in System Design

❌ “We’ll never have more than 1000 users” (Design for 10x growth)
❌ “Average response time is good” (Use percentiles)
❌ “We need 100% uptime” (Be realistic)
❌ “Let’s optimize everything” (Measure first)
❌ “This worked at my last company” (Context matters)

Green Flags in System Design

✅ “Let’s start simple and iterate”
✅ “What’s our SLO for this service?”
✅ “We should measure p99 latency”
✅ “Let’s add monitoring from day one”
✅ “What happens when this component fails?”

Modern Additions (2026)

Reliability:
├─ Chaos engineering (Netflix Chaos Monkey)
├─ SRE practices (error budgets)
└─ Observability platforms (Datadog, Honeycomb)

Scalability:
├─ Auto-scaling (Kubernetes HPA)
├─ Serverless (AWS Lambda)
└─ Edge computing (Cloudflare Workers)

Maintainability:
├─ Platform engineering
├─ Infrastructure as Code (Terraform)
└─ AI code assistants (GitHub Copilot)

Interview Response Templates

When Asked About Scaling

“First, let’s understand the scale requirements. How many users? What’s the read/write ratio? Then we can discuss whether to scale vertically first or go horizontal. We should measure with percentiles, not averages, and optimize based on our SLO.”

When Asked About Reliability

“We need to eliminate single points of failure through replication and redundancy. I’d implement health checks and automatic failover. We should also consider what happens with hardware faults, software bugs, and human errors. Our SLA will define acceptable downtime.”

When Asked About Trade-offs

“There are always trade-offs. For example, synchronous replication provides strong consistency but increases write latency. Asynchronous replication is faster but risks data loss. The choice depends on our requirements: can we afford to lose data? What’s more important, consistency or availability?”

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-08

Study Notes by Niladri & AI
