Chapter 1 Cheat Sheet - Quick Reference

One-Line Summaries

Concept          | One-Liner
-----------------|----------------------------------------------------------
Reliability      | System works correctly even when faults occur
Scalability      | System copes with increased load effectively
Maintainability  | Easy for teams to operate, understand, and evolve
Fault vs Failure | One component deviating from spec vs the whole system failing to serve users
Percentiles      | p50/p95/p99 describe typical and tail latency better than averages
Fan-out          | One write delivered to many destinations (e.g., Twitter timeline problem)

Quick Numbers to Remember

Metric                  | Value                 | Context
------------------------|-----------------------|--------------------------------------------
Hard disk MTTF          | 10-50 years           | With 10,000 disks, expect about 1 failure/day
Twitter timeline reads  | 300K req/sec          | vs 4.6K req/sec average (12K at peak) for posting tweets
Amazon latency impact   | 100ms = 1% sales      | Every millisecond counts
Google machine failures | 1-5% per year         | Expect failures at scale
SLA example             | p50 < 200ms, p99 < 1s | Use percentiles in SLAs
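
The disk-failure figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes failures are independent and occur at a constant rate of 1/MTTF; the function name is hypothetical.

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Expected failures per day, assuming a constant failure rate of 1/MTTF per disk."""
    return disk_count / (mttf_years * 365)


for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: {rate:.2f} failures/day")
# MTTF 10 years: 2.74 failures/day
# MTTF 50 years: 0.55 failures/day
```

Both ends of the MTTF range land within a small factor of one failure per day, which is where the rule of thumb comes from.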

Three Types of Faults

Hardware Faults
├── Hard disk crashes
├── RAM errors
├── Power outages
└── Solution: Redundancy + Software fault-tolerance

Software Faults (Harder!)
├── Systematic bugs
├── Cascading failures
├── Runaway processes
└── Solution: Testing, isolation, monitoring, recovery

Human Errors (Most Common!)
├── Config mistakes
├── Operational errors
└── Solution: Good design, testing, quick recovery, monitoring

Scaling Decision Tree

Need to scale?
│
├─ Read-heavy?
│  ├─ Cache (Redis, CDN)
│  └─ Read replicas
│
├─ Write-heavy?
│  ├─ Sharding
│  └─ Write-behind buffering (queue + batched writes)
│
├─ Low latency?
│  ├─ In-memory data
│  └─ Edge locations
│
└─ High volume?
   ├─ Horizontal scaling
   └─ Load balancing
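
As an illustration of the read-heavy branch, here is a minimal cache-aside sketch. It assumes a plain dict stands in for both the database and the cache (in practice the cache might be Redis); the class and attribute names are hypothetical.

```python
class CacheAsideStore:
    """Read path: check the cache first, fall back to the database on a miss."""

    def __init__(self, db: dict):
        self.db = db          # stand-in for the real database
        self.cache = {}       # stand-in for an external cache
        self.db_reads = 0     # counts how often the database is actually hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db[key]
        self.cache[key] = value   # populate the cache for later reads
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # invalidate so the next read sees fresh data


store = CacheAsideStore({"user:1": "Ada"})
store.get("user:1")
store.get("user:1")
print(store.db_reads)  # 1
```

Invalidating on write keeps the sketch simple; it trades one extra database read after each update for not having to keep the cached copy in sync.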

Twitter Fan-out Problem

Problem: Posting tweets to followers’ timelines

Approach 1 (Pull): Write once, join on read
├─ Pro: Simple writes
└─ Con: Expensive reads (join for every timeline view)

Approach 2 (Push): Pre-compute all timelines
├─ Pro: Fast reads
└─ Con: Write amplification (celebrity with 30M followers = 30M writes)

Solution (Hybrid):
├─ Regular users: Push (pre-compute)
└─ Celebrities: Pull (compute on read)
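
The hybrid approach can be sketched in memory as follows. This is an illustration under stated assumptions, not Twitter's implementation: the class, its method names, and the tiny CELEBRITY_THRESHOLD are all hypothetical, and users only receive pushed tweets posted after they followed the author.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real system would use a far larger number


class TimelineService:
    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.tweets = defaultdict(list)     # author -> [(seq, text)]
        self.timelines = defaultdict(list)  # user -> precomputed [(seq, author, text)]
        self._seq = 0

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self._seq += 1
        self.tweets[author].append((self._seq, text))
        if not self.is_celebrity(author):
            # Push: fan the tweet out to every follower's precomputed timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append((self._seq, author, text))

    def timeline(self, user):
        # Start from the precomputed entries, keyed by sequence number to avoid duplicates.
        entries = {seq: (author, text) for seq, author, text in self.timelines[user]}
        # Pull: merge in tweets from followed celebrities at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.tweets[author]:
                    entries[seq] = (author, text)
        return [entries[seq] for seq in sorted(entries, reverse=True)]
```

A regular user's post costs one write per follower, while a celebrity's post costs a single write and is merged in only when a follower reads their timeline.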

Percentiles Explained

100 requests: 98 take 10ms, 2 take 1000ms

Average: ~30ms (misleading!)

p50 (median): 10ms (half of requests are at least this fast)
p95: 10ms (95% are at least this fast)
p99: 1000ms (99% are at least this fast) ← Captures outliers!

Why p99 matters: The slowest requests often come from users with the most data, who tend to be the best customers
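
To see how the mean hides the tail, here is a minimal sketch using the nearest-rank definition of a percentile. The sample is hypothetical: 100 latencies of which two are slow, chosen so that the nearest-rank p99 lands on a slow request. Other percentile definitions (for example, linear interpolation) would give somewhat different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [10] * 98 + [1000] * 2  # hypothetical sample

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean:.1f} ms")               # mean: 29.8 ms
print("p50:", percentile(latencies_ms, 50))  # p50: 10
print("p95:", percentile(latencies_ms, 95))  # p95: 10
print("p99:", percentile(latencies_ms, 99))  # p99: 1000
```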

Maintainability Three Pillars

Operability
├─ Easy to keep running
├─ Good monitoring
├─ Automation support
└─ "What's the system doing right now?"

Simplicity
├─ Easy to understand
├─ Manage complexity via abstraction
└─ "Can a new engineer understand this?"

Evolvability
├─ Easy to change
├─ Adapt to new requirements
└─ "Can we add this feature safely?"

SLO/SLA Example

Service Level Objective (SLO):
├─ Median response time (p50) < 200ms
├─ 95th percentile (p95) < 500ms
├─ 99th percentile (p99) < 1 second
└─ Uptime: 99.9% (about 8.76 hours downtime/year)

Service Level Agreement (SLA):
└─ SLO + consequences (refunds, penalties)
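
An uptime target translates into a downtime budget with simple arithmetic. This is a small sketch assuming a 365-day year; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


print(f"{allowed_downtime_hours(99.9):.2f} hours/year")  # 8.76 hours/year
```

Each additional "nine" cuts the budget by a factor of ten, which is one reason 100% uptime is not a realistic target.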

Common Interview Patterns

Pattern 1: Identify Bottleneck

System slow? Ask:
1. What's the load parameter? (QPS, users, data size)
2. What's the bottleneck? (CPU, memory, disk, network)
3. What percentile? (p50 vs p99)
4. Solution: Scale the bottleneck

Pattern 2: Design for Reliability

Single point of failure? Add:
├─ Replication (data redundancy)
├─ Load balancing (traffic distribution)
├─ Health checks (failure detection)
└─ Automatic failover (recovery)
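
Health checks and failover can be combined in a few lines. The sketch below assumes each replica exposes a health probe (here a plain callable standing in for something like an HTTP health endpoint); the Replica type and pick_primary function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical probe; a real one would call the service


def pick_primary(replicas: List[Replica]) -> Replica:
    """Return the first replica, in priority order, whose health check passes."""
    for replica in replicas:
        if replica.is_healthy():
            return replica
    raise RuntimeError("no healthy replica available")


replicas = [
    Replica("db-1", lambda: False),  # simulated failed primary
    Replica("db-2", lambda: True),
]
print(pick_primary(replicas).name)  # db-2
```

A production failover would also need to guard against flapping and split-brain, which this sketch deliberately leaves out.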

Pattern 3: Scale Gradually

Scale progression:
1. Single server (simple)
2. Separate DB tier (flexibility)
3. Add cache + CDN (performance)
4. Horizontal scaling (capacity)
5. Sharding + async (complexity)

Key Trade-offs

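Decision                 | Pro                                       | Con                                     | When to Use
-------------------------|-------------------------------------------|-----------------------------------------|----------------------
Vertical scaling         | Simple, no code changes                   | Limited, expensive                      | Small to medium scale
Horizontal scaling       | Scales past one machine, cost-effective   | Operational complexity, network latency | Large scale
Synchronous replication  | No loss of acknowledged writes            | Slower writes                           | Critical data
Asynchronous replication | Fast writes                               | Possible data loss                      | Non-critical data
Caching                  | Fast reads                                | Stale data, complexity                  | Read-heavy workloads
Pre-computation          | Fast queries                              | Storage cost, staleness                 | Known access patterns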
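
The replication trade-off can be made concrete with a toy model. This sketch assumes a single leader and a single follower held in memory; the ReplicatedLog class and its methods are hypothetical and skip networking entirely.

```python
class ReplicatedLog:
    """Toy leader with one follower, in synchronous or asynchronous mode."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.leader_log = []
        self.follower_log = []
        self.pending = []  # writes acknowledged but not yet shipped (async only)

    def write(self, value) -> str:
        self.leader_log.append(value)
        if self.synchronous:
            # Wait for the follower before acknowledging: slower, but durable.
            self.follower_log.append(value)
        else:
            # Acknowledge immediately and ship later: faster, but at risk.
            self.pending.append(value)
        return "ack"

    def replicate_pending(self):
        self.follower_log.extend(self.pending)
        self.pending.clear()

    def failover(self):
        """The leader is lost; only what reached the follower survives."""
        return list(self.follower_log)


sync_log = ReplicatedLog(synchronous=True)
sync_log.write("order-1")
print(sync_log.failover())   # ['order-1']

async_log = ReplicatedLog(synchronous=False)
async_log.write("order-1")
print(async_log.failover())  # []
```

In the asynchronous case the client received an acknowledgment for a write that did not survive the failover, which is the data-loss risk named in the table.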

Red Flags in System Design

❌ “We’ll never have more than 1000 users” (Design for 10x growth)
❌ “Average response time is good” (Use percentiles)
❌ “We need 100% uptime” (Be realistic)
❌ “Let’s optimize everything” (Measure first)
❌ “This worked at my last company” (Context matters)

Green Flags in System Design

✅ “Let’s start simple and iterate”
✅ “What’s our SLO for this service?”
✅ “We should measure p99 latency”
✅ “Let’s add monitoring from day one”
✅ “What happens when this component fails?”

Modern Additions (2026)

Reliability:
├─ Chaos engineering (Netflix Chaos Monkey)
├─ SRE practices (error budgets)
└─ Observability platforms (Datadog, Honeycomb)

Scalability:
├─ Auto-scaling (Kubernetes HPA)
├─ Serverless (AWS Lambda)
└─ Edge computing (Cloudflare Workers)

Maintainability:
├─ Platform engineering
├─ Infrastructure as Code (Terraform)
└─ AI code assistants (GitHub Copilot)

Interview Response Templates

When Asked About Scaling

“First, let’s understand the scale requirements. How many users? What’s the read/write ratio? Then we can discuss whether to scale vertically first or go horizontal. We should measure with percentiles, not averages, and optimize based on our SLO.”

When Asked About Reliability

“We need to eliminate single points of failure through replication and redundancy. I’d implement health checks and automatic failover. We should also consider what happens with hardware faults, software bugs, and human errors. Our SLA will define acceptable downtime.”

When Asked About Trade-offs

“There are always trade-offs. For example, synchronous replication guarantees that a follower holds every acknowledged write, but it increases write latency. Asynchronous replication is faster but can lose recent writes if the leader fails. The choice depends on our requirements: can we afford to lose data? What’s more important, consistency or availability?”


Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-08