Chapter 1 Cheat Sheet - Quick Reference
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Reliability | System works correctly even when faults occur |
| Scalability | System copes with increased load effectively |
| Maintainability | Easy for teams to operate, understand, and evolve |
| Fault vs Failure | Component deviation vs system-wide breakdown |
| Percentiles | p50/p95/p99 capture performance better than averages |
| Fan-out | One incoming request triggers many downstream operations (e.g., one tweet written to many follower timelines) |
Quick Numbers to Remember
| Metric | Value | Context |
|---|---|---|
| Hard disk MTTF | 10-50 years | But at 10,000 disks, expect 1 failure/day |
| Twitter timeline reads | 300K req/sec | vs 4.6K req/sec average (12K peak) for posting tweets |
| Amazon latency impact | +100ms ≈ -1% sales | Small latency increases measurably cost revenue |
| Google machine failures | 1-5% per year | Expect failures at scale |
| SLA example | p50 < 200ms, p99 < 1s | Use percentiles in SLAs |
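The disk figure in the table follows from simple arithmetic. A minimal sketch, assuming disks fail independently at a constant rate (the function name and the 365-day year are illustrative assumptions, not from the source):

```python
DAYS_PER_YEAR = 365  # simplification: ignores leap years


def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day for a fleet of independent disks."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)


# Bounds for a 10,000-disk cluster at both ends of the 10-50 year MTTF range
low = expected_failures_per_day(10_000, 50)   # optimistic MTTF
high = expected_failures_per_day(10_000, 10)  # pessimistic MTTF
print(f"{low:.2f} to {high:.2f} failures/day")
```

The range brackets the "about one failure per day" rule of thumb, which is why disk failure at this scale is treated as routine rather than exceptional.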
Three Types of Faults
Hardware Faults
├── Hard disk crashes
├── RAM errors
├── Power outages
└── Solution: Redundancy + Software fault-tolerance
Software Faults (Harder!)
├── Systematic bugs
├── Cascading failures
├── Runaway processes
└── Solution: Testing, isolation, monitoring, recovery
Human Errors (Most Common!)
├── Config mistakes
├── Operational errors
└── Solution: Good design, testing, quick recovery, monitoring
Scaling Decision Tree
Need to scale?
│
├─ Read-heavy?
│ ├─ Cache (Redis, CDN)
│ └─ Read replicas
│
├─ Write-heavy?
│ ├─ Sharding
│ └─ Write-behind cache / write queue (batch writes)
│
├─ Low latency?
│ ├─ In-memory data
│ └─ Edge locations
│
└─ High volume?
  ├─ Horizontal scaling
  └─ Load balancing
Twitter Fan-out Problem
Problem: Posting tweets to followers’ timelines
Approach 1 (Pull): Write once, join on read
├─ Pro: Simple writes
└─ Con: Expensive reads (join for every timeline view)
Approach 2 (Push): Pre-compute all timelines
├─ Pro: Fast reads
└─ Con: Write amplification (celebrity with 30M followers = 30M writes)
Solution (Hybrid):
├─ Regular users: Push (pre-compute)
└─ Celebrities: Pull (compute on read)
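The hybrid approach can be sketched in a few lines. This is an illustrative in-memory model, not Twitter's implementation: the data structures, function names, and the tiny `CELEBRITY_THRESHOLD` are all assumptions chosen to keep the example small.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real system would use a far larger number

followers = defaultdict(set)          # author -> users who follow them
following = defaultdict(set)          # user -> authors they follow
tweets_by_author = defaultdict(list)  # author -> [(timestamp, text)]
timeline_cache = defaultdict(list)    # user -> [(timestamp, author, text)], pre-computed


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post_tweet(author, timestamp, text):
    tweets_by_author[author].append((timestamp, text))
    if not is_celebrity(author):
        # Push: fan out on write to every follower's cached timeline
        for user in followers[author]:
            timeline_cache[user].append((timestamp, author, text))


def read_timeline(user):
    # Start from pushed entries, then pull celebrity tweets at read time.
    # A set removes duplicates if an author became a celebrity after a push.
    entries = set(timeline_cache[user])
    for author in following[user]:
        if is_celebrity(author):
            entries.update((ts, author, text) for ts, text in tweets_by_author[author])
    return sorted(entries, reverse=True)  # newest first


follow("bob", "alice")
for fan in ("bob", "carol", "dave"):
    follow(fan, "star")

post_tweet("alice", 1, "hi")          # pushed to bob's cached timeline
post_tweet("star", 2, "hello fans")   # stored once, merged on read
bob_timeline = read_timeline("bob")
```

Posting as `star` costs one write regardless of follower count, while bob's read merges one extra author. That is the trade the hybrid makes: a slightly more expensive read in exchange for bounded write amplification.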
Percentiles Explained
100 requests: 98 take 10ms, 2 take 1000ms
Average: ~30ms (misleading!)
p50 (median): 10ms (half are at or below this)
p95: 10ms (95% are at or below this)
p99: 1000ms (99% are at or below this) ← Captures outliers!
Why p99 matters: Slowest requests often from users with most data (best customers)
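To see how an average hides the tail, here is a minimal sketch using the nearest-rank percentile definition (one of several common definitions; libraries that interpolate will give slightly different values). The sample of 98 fast requests and 2 slow ones is an assumption chosen for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


latencies_ms = [10] * 98 + [1000] * 2  # assumed sample: 98 fast, 2 slow
mean = sum(latencies_ms) / len(latencies_ms)

print(f"mean={mean:.1f}ms")
for p in (50, 95, 99):
    print(f"p{p}={percentile(latencies_ms, p)}ms")
```

Note that with only one slow request in 100, the nearest-rank p99 would still be 10 ms and the outlier would show up only at the maximum. Which percentile exposes the tail depends on how large the slow fraction is.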
Maintainability Three Pillars
Operability
├─ Easy to keep running
├─ Good monitoring
├─ Automation support
└─ "What's the system doing right now?"
Simplicity
├─ Easy to understand
├─ Manage complexity via abstraction
└─ "Can a new engineer understand this?"
Evolvability
├─ Easy to change
├─ Adapt to new requirements
└─ "Can we add this feature safely?"
SLO/SLA Example
Service Level Objective (SLO):
├─ Median response time (p50) < 200ms
├─ 95th percentile (p95) < 500ms
├─ 99th percentile (p99) < 1 second
└─ Uptime: 99.9% (~8.76 hours downtime/year)
Service Level Agreement (SLA):
└─ SLO + consequences (refunds, penalties)
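The downtime allowed by an uptime target is straightforward arithmetic. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {downtime_hours_per_year(target):.2f} h/year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is worth stating when negotiating an SLO.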
Common Interview Patterns
Pattern 1: Identify Bottleneck
System slow? Ask:
1. What's the load parameter? (QPS, users, data size)
2. What's the bottleneck? (CPU, memory, disk, network)
3. What percentile? (p50 vs p99)
4. Solution: Scale the bottleneck
Pattern 2: Design for Reliability
Single point of failure? Add:
├─ Replication (data redundancy)
├─ Load balancing (traffic distribution)
├─ Health checks (failure detection)
└─ Automatic failover (recovery)
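Health checks and failover fit together roughly as follows. This is a simplified sketch: `Replica`, `FailoverRouter`, and the health-probe callables are hypothetical names, and a production system would add timeouts, retries, and protection against flapping.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical probe, e.g. wrapping an HTTP ping


class FailoverRouter:
    """Routes to the first healthy replica, in priority order."""

    def __init__(self, replicas: List[Replica]):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.replicas = replicas

    def pick(self) -> Replica:
        for replica in self.replicas:
            if replica.is_healthy():
                return replica
        raise RuntimeError("no healthy replica available")


# Simulated health state standing in for real probes
status = {"primary": True, "standby": True}
router = FailoverRouter([
    Replica("primary", lambda: status["primary"]),
    Replica("standby", lambda: status["standby"]),
])

before = router.pick().name   # primary is healthy, so it is chosen
status["primary"] = False     # simulate a primary failure
after = router.pick().name    # traffic moves to the standby
```

The router itself remains a single point of failure in this sketch, which is a useful follow-up point in an interview: the component doing failover needs redundancy too.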
Pattern 3: Scale Gradually
Scale progression:
1. Single server (simple)
2. Separate DB tier (flexibility)
3. Add cache + CDN (performance)
4. Horizontal scaling (capacity)
5. Sharding + async (complexity)
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| Vertical scaling | Simple, no code changes | Limited, expensive | Small to medium scale |
| Horizontal scaling | Scales far beyond one machine, cost-effective | Complex, added network latency | Large scale |
| Synchronous replication | No loss of acknowledged writes | Slower writes | Critical data |
| Asynchronous replication | Fast writes | Possible data loss | Non-critical data |
| Caching | Fast reads | Stale data, complexity | Read-heavy workloads |
| Pre-computation | Fast queries | Storage cost, staleness | Known access patterns |
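The caching row's "stale data" cost is easy to demonstrate. A minimal cache-aside sketch with a time-to-live, where `TTLCache`, the `load` callback, and the injectable clock are all illustrative assumptions rather than any particular library's API:

```python
import time


class TTLCache:
    """Cache-aside with expiry: serve cached values until their TTL elapses."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load          # function that fetches from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the example can control time
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # hit: fast, but possibly stale
        value = self._load(key)    # miss or expired: go to the source
        self._entries[key] = (value, now + self._ttl)
        return value


db = {"user:1": "Ada"}   # stand-in for the database
now = [0.0]              # fake clock, advanced manually
cache = TTLCache(db.__getitem__, ttl_seconds=30, clock=lambda: now[0])

first = cache.get("user:1")   # miss: loads from db
db["user:1"] = "Grace"        # the source of truth changes
stale = cache.get("user:1")   # hit: still the old value
now[0] = 31.0                 # TTL elapses
fresh = cache.get("user:1")   # expired: reloads the new value
```

The TTL bounds how stale a read can be. Shorter TTLs reduce staleness but send more traffic to the database, which is the trade-off the table row summarizes.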
Red Flags in System Design
❌ “We’ll never have more than 1000 users” (Design for 10x growth)
❌ “Average response time is good” (Use percentiles)
❌ “We need 100% uptime” (Be realistic)
❌ “Let’s optimize everything” (Measure first)
❌ “This worked at my last company” (Context matters)
Green Flags in System Design
✅ “Let’s start simple and iterate”
✅ “What’s our SLO for this service?”
✅ “We should measure p99 latency”
✅ “Let’s add monitoring from day one”
✅ “What happens when this component fails?”
Modern Additions (2026)
Reliability:
├─ Chaos engineering (Netflix Chaos Monkey)
├─ SRE practices (error budgets)
└─ Observability platforms (Datadog, Honeycomb)
Scalability:
├─ Auto-scaling (Kubernetes HPA)
├─ Serverless (AWS Lambda)
└─ Edge computing (Cloudflare Workers)
Maintainability:
├─ Platform engineering
├─ Infrastructure as Code (Terraform)
└─ AI code assistants (GitHub Copilot)
Interview Response Templates
When Asked About Scaling
“First, let’s understand the scale requirements. How many users? What’s the read/write ratio? Then we can discuss whether to scale vertically first or go horizontal. We should measure with percentiles, not averages, and optimize based on our SLO.”
When Asked About Reliability
“We need to eliminate single points of failure through replication and redundancy. I’d implement health checks and automatic failover. We should also consider what happens with hardware faults, software bugs, and human errors. Our SLA will define acceptable downtime.”
When Asked About Trade-offs
“There are always trade-offs. For example, synchronous replication provides strong consistency but increases write latency. Asynchronous replication is faster but risks data loss. The choice depends on our requirements: can we afford to lose data? Which matters more, consistency or availability?”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-08