Chapter 1 Flashcards - Reliability, Scalability, Maintainability
Basic Concepts
What is the difference between a fault and a failure?
?
- Fault: A component deviating from its specification (e.g., hard disk crash, network packet lost)
- Failure: The system as a whole stops providing service to users
- A fault-tolerant system prevents faults from causing failures
What are the three fundamental concerns in designing data-intensive applications?
?
- Reliability - System continues working correctly even when faults occur
- Scalability - System’s ability to cope with increased load
- Maintainability - Making life easier for engineering and operations teams
What are the three types of faults?
?
- Hardware faults - Disk crashes, RAM errors, power outages
- Software errors - Systematic bugs, cascading failures, runaway processes
- Human errors - Configuration mistakes, operational errors (most common cause of outages)
Why are software faults harder to anticipate than hardware faults?
?
- Software errors are systematic and often correlated across nodes
- Hardware faults are typically random and independent
- Software bugs can cause cascading failures affecting multiple components
- Harder to predict edge cases and interaction effects
What is the Mean Time To Failure (MTTF) for hard disks and what does this mean at scale?
?
- MTTF: 10-50 years per disk
- At scale: With 10,000 disks, expect about 1 disk to fail per day
- Key insight: Redundancy and fault tolerance are essential at scale
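The at-scale figure follows from simple arithmetic. A minimal sketch, assuming disks fail independently at a constant rate of 1/MTTF (a simplification; the function name is illustrative):

```python
DAYS_PER_YEAR = 365

def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent failures at rate 1/MTTF."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)

for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {rate:.2f} failures/day")
# MTTF 10 years: 2.74 failures/day
# MTTF 50 years: 0.55 failures/day
```

Across the quoted MTTF range, "about one per day" is the right order of magnitude.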
Reliability
What are the three strategies for handling human errors (the leading cause of outages)?
?
- Design systems that minimize opportunities for error (good abstractions, clear APIs)
- Decouple places where mistakes are made from places where they cause failures (sandbox environments, thorough testing)
- Allow quick recovery (fast rollback, gradual rollout, detailed monitoring)
What is the difference between hardware redundancy and software fault-tolerance?
?
- Hardware redundancy (traditional): RAID, dual power supplies, hot-swappable CPUs
- Software fault-tolerance (modern/cloud): Expect commodity hardware failures, handle them in software
- Cloud approach is more flexible and cost-effective, can handle datacenter-level failures
What is chaos engineering and why is it important?
?
- Definition: Intentionally injecting failures into production systems to test resilience
- Example: Netflix Chaos Monkey randomly kills production servers
- Purpose: Ensure system handles failures correctly before they happen naturally
- Result: Increased confidence in system reliability
Scalability
Why are percentiles better than averages for measuring response time?
?
- Averages hide outliers: 99 requests at 10ms + 1 request at 1000ms = average ~20ms (misleading)
- Percentiles capture distribution:
- p50 (median): 50% of requests are faster
- p95: 95% of requests are faster
- p99: 99% of requests are faster (captures tail latencies)
- Tail latencies matter: The slowest requests often belong to users with the most data (your best customers)
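A small sketch of the difference, using a hypothetical sample of 98 fast requests and 2 slow ones (so that p99 lands in the tail) and the nearest-rank percentile definition, which is one of several common conventions:

```python
import math
import statistics

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

# Hypothetical sample: 98 fast requests and 2 slow ones (milliseconds).
latencies_ms = sorted([10] * 98 + [1000] * 2)

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.1f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
# mean=29.8ms p50=10ms p95=10ms p99=1000ms
```

The mean (29.8ms) matches no request that any user actually experienced, while p50 and p99 together show both the typical case and the tail.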
What is a Service Level Objective (SLO) and how does it differ from an SLA?
?
- SLO (Service Level Objective): Internal target for system performance
- Example: p50 < 200ms, p99 < 1s, uptime 99.9%
- SLA (Service Level Agreement): Contractual commitment with consequences
- Example: The same kind of targets as an SLO, plus penalties/refunds if they are not met
- SLO is what you aim for, SLA is what you promise customers
What are load parameters and give examples?
?
- Definition: Metrics that describe the current load on a system
- Examples:
- Requests per second to web server
- Ratio of reads to writes in database
- Number of simultaneously active users
- Cache hit rate
- Number of concurrent connections
- Different for each application; for Twitter, the key parameter is the distribution of followers per user, which determines the fan-out load
Explain the Twitter fan-out problem and its solution.
?
Problem: When a user posts a tweet, how do you deliver it to all followers’ home timelines efficiently?
Approach 1 (Pull): Write to global collection, join on read
- Pro: Simple writes
- Con: Expensive reads (join for every timeline view)
Approach 2 (Push): Pre-compute each user’s timeline
- Pro: Fast reads
- Con: Write amplification (celebrity with 30M followers = 30M writes per tweet)
Solution (Hybrid):
- Regular users: Push to timeline caches
- Celebrities: Pull at read time
- Optimize for common case, handle edge cases differently
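The hybrid approach can be sketched in memory as follows. This is an illustration only, not Twitter's implementation: the class name, the follower-count threshold, and the in-process dictionaries standing in for caches and storage are all assumptions.

```python
from collections import defaultdict

class TimelineService:
    """Toy hybrid fan-out: push for regular authors, pull for celebrities."""

    def __init__(self, celebrity_threshold):
        self.celebrity_threshold = celebrity_threshold  # hypothetical cutoff
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(seq, author, text)]
        self.timeline_cache = defaultdict(list)    # user -> pushed entries
        self.next_seq = 0

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post_tweet(self, author, text):
        self.next_seq += 1
        entry = (self.next_seq, author, text)
        self.tweets_by_author[author].append(entry)
        if not self.is_celebrity(author):
            # Push: one cache write per follower (write amplification).
            for follower in self.followers[author]:
                self.timeline_cache[follower].append(entry)

    def home_timeline(self, user):
        # Start from the precomputed cache, keyed by sequence number to dedupe.
        merged = {entry[0]: entry for entry in self.timeline_cache[user]}
        # Pull: merge in tweets from followed celebrities at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for entry in self.tweets_by_author[author]:
                    merged[entry[0]] = entry
        newest_first = sorted(merged.values(), reverse=True)
        return [(author, text) for _, author, text in newest_first]

svc = TimelineService(celebrity_threshold=3)
svc.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    svc.follow(fan, "star")
svc.post_tweet("bob", "hi")       # pushed to alice's cache
svc.post_tweet("star", "news")    # stored once, pulled at read time
print(svc.home_timeline("alice"))
```

Posting as `star` costs one write regardless of follower count; the cost moves to the read path, but only for users who follow a celebrity.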
What is write amplification and when does it occur?
?
- Definition: When one logical write results in multiple physical writes
- Example: Twitter celebrity with 30M followers posting one tweet = 30M cache writes
- Occurs when: Pre-computing derived data, replication, maintaining indexes
- Trade-off: Faster reads at cost of more expensive writes
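The amplification factor is a simple product. A short sketch using the averages the book reports for Twitter in 2012 (about 4.6k tweets/sec, each delivered to about 75 followers) next to the celebrity case; the function name is illustrative:

```python
def timeline_cache_writes(tweets: int, followers_per_tweet: int) -> int:
    """Physical cache writes produced by pushing each tweet to every follower."""
    return tweets * followers_per_tweet

average_load = timeline_cache_writes(4_600, 75)          # per second, average case
celebrity_tweet = timeline_cache_writes(1, 30_000_000)   # a single tweet

print(f"average: {average_load:,} cache writes/sec")
print(f"one celebrity tweet: {celebrity_tweet:,} cache writes")
# average: 345,000 cache writes/sec
# one celebrity tweet: 30,000,000 cache writes
```

A single celebrity tweet produces roughly 87 seconds' worth of the average write load, which is why the push approach alone does not hold up.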
What are the two main approaches to scaling?
?
- Vertical scaling (scale-up): Add more resources to single machine (CPU, RAM, disk). Pro: Simple, no code changes. Con: Limited by hardware, expensive, single point of failure.
- Horizontal scaling (scale-out): Add more machines, distribute load. Pro: Scales beyond the limits of a single machine, cost-effective on commodity hardware. Con: More complex, need load balancing and distribution logic.
What real-world example demonstrates the business impact of latency?
?
- Amazon: 100ms increase in response time = 1% decrease in sales
- Shows direct correlation between performance and revenue
- Justifies investment in latency optimization
- Tail latencies (p99) often affect your best customers, because they have the most data
What is head-of-line blocking?
?
- Definition: When slow requests hold up subsequent requests in a queue
- Problem: One slow request can delay all requests behind it
- Solution: Async/non-blocking I/O, separate queues, timeouts
- Important consideration when analyzing tail latencies
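The effect is easy to see in a toy model. The sketch below assumes all requests arrive at the same moment and a single worker serves them strictly in FIFO order; service times are hypothetical.

```python
from itertools import accumulate

def fifo_response_times(service_times_ms):
    """Response time of each request when one worker serves the queue in order."""
    return list(accumulate(service_times_ms))

behind_slow_request = fifo_response_times([1000, 10, 10, 10])
fast_requests_only = fifo_response_times([10, 10, 10])

print(behind_slow_request)  # [1000, 1010, 1020, 1030]
print(fast_requests_only)   # [10, 20, 30]
```

The three 10ms requests each take over a second from the client's point of view, which is why response time should be measured on the client side and why a separate queue for slow work helps.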
Maintainability
What are the three design principles of maintainability?
?
- Operability: Make it easy to keep system running smoothly. Good monitoring, automation support, documentation.
- Simplicity: Make it easy to understand. Manage complexity via abstraction. Avoid: tight coupling, inconsistent naming, special cases.
- Evolvability (Extensibility): Make it easy to change. Adapt to new requirements. Agile practices, TDD, refactoring.
What is the relationship between simplicity and abstraction?
?
- Abstraction hides complexity behind clean interfaces
- Good abstraction: Hides implementation details, provides clear API
- Not the same as UI simplicity: a simple interface can sit on top of a complex implementation
- Goal: Make system easier to understand and modify
- Example: SQL abstracts storage engine complexity
What are symptoms of unnecessary complexity in a system?
?
- Explosion of state space
- Tight coupling between modules
- Tangled dependencies
- Inconsistent naming and terminology
- Special-case code to work around issues
- Hacks for performance optimization
- These indicate need for refactoring/simplification
Why does the book emphasize that most software cost is in maintenance, not initial development?
?
- Reality: Most time spent on:
- Bug fixes and debugging
- Adding new features
- Operational work (deployments, migrations)
- Adapting to new platforms
- Understanding existing code
- Implication: Design for maintainability from day one
- Impact: Good maintainability reduces long-term costs significantly
What is operability and what does it include?
?
Definition: Making it easy for operations teams to keep system running
Includes:
- Good monitoring and visibility into system health
- Support for automation and tool integration
- Avoiding dependency on individual machines
- Good documentation and operational procedures
- Predictable behavior, avoiding surprises
- Self-healing capabilities with manual override options
- Good default behavior with configuration options
Modern Context (2026)
What is Site Reliability Engineering (SRE) and how does it relate to reliability?
?
- Definition: Discipline that applies software engineering to operations
- Key concepts:
- Error budgets (the amount of unreliability the SLO permits, e.g. allowed downtime)
- SLOs/SLIs (measurable reliability targets)
- Toil reduction (automate repetitive work)
- Goal: Balance reliability with velocity
- Origin: Pioneered by Google, now industry standard
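An error budget is the gap between 100% and the SLO over some window. A minimal sketch, assuming a 30-day window and a hypothetical 30 minutes of downtime already consumed:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO permits within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

budget = error_budget_minutes(99.9)
downtime_so_far = 30  # hypothetical minutes of downtime this window
remaining = budget - downtime_so_far

print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
# budget=43.2 min, remaining=13.2 min
```

While budget remains, teams can keep shipping risky changes; once it is spent, the usual policy is to prioritize reliability work.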
How has scalability changed with serverless computing?
?
- Traditional: Manual provisioning, capacity planning, managing servers
- Serverless (2026):
- Automatic scaling from 0 to millions of requests
- Pay only for actual usage
- No server management
- Examples: AWS Lambda, Cloud Functions, Cloud Run
- Trade-off: Less control, potential cold starts, vendor lock-in
What is Platform Engineering and how does it improve maintainability?
?
- Definition: Building internal developer platforms (IDPs)
- Purpose: Make it easy for developers to deploy and operate services
- Includes:
- Self-service infrastructure provisioning
- Standardized deployment pipelines
- Integrated observability
- Developer portals and documentation
- Result: Improved developer experience (DevEx), faster delivery
What are DORA metrics and why are they important?
?
DORA = DevOps Research and Assessment
Four key metrics:
- Deployment frequency - How often you deploy
- Lead time for changes - Time from commit to production
- Mean time to recovery (MTTR) - How fast you recover from failures
- Change failure rate - % of deployments causing failures
Importance: Measure software delivery performance and maintainability
Interview Scenarios
You’re seeing high latency at p99 but median latency is good. What should you investigate?
?
Investigation areas:
- Database queries: Check for missing indexes, slow queries on large datasets
- User patterns: Are p99 requests from users with more data?
- Resource contention: CPU/memory saturation, lock contention
- External dependencies: Third-party API slowness
- Garbage collection: Long GC pauses in JVM/.NET
Solutions:
- Add caching for expensive operations
- Query optimization and proper indexing
- Separate thread pools for different operations (bulkheads)
- Async processing for non-critical work
- Connection pooling to reduce overhead
Design a system that needs to scale from 100 to 1 million users. What’s your progression?
?
Phase 1: Single Server (100-1K users)
- Everything on one machine
- Simple deployment
Phase 2: Separate Tiers (1K-10K users)
- Separate web and database servers
- Enables independent scaling
Phase 3: Add Cache & CDN (10K-100K users)
- Redis/Memcached for data
- CDN for static assets
- Reduces database load
Phase 4: Horizontal Scaling (100K-1M users)
- Multiple web servers + load balancer
- Database read replicas
- Consider sharding if needed
- Message queues for async work
Key principle: Start simple, add complexity as needed
How do you prevent cascading failures in microservices?
?
Strategies:
- Circuit breaker: Stop sending requests to failing service
- Bulkhead pattern: Isolate resources (thread pools, connections)
- Timeout & retry: Fail fast, exponential backoff with jitter
- Rate limiting: Protect downstream services from overload
- Graceful degradation: Disable non-critical features
- Load shedding: Reject excess requests (return 503)
- Chaos engineering: Test failure scenarios regularly
Tools: Resilience4j, Envoy, Hystrix (now in maintenance mode)
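To make the circuit breaker concrete, here is a minimal single-threaded sketch. It is not the API of any of the tools listed; the class name, parameters, and injectable clock are assumptions chosen for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so tests need not sleep
        self.failures = 0           # consecutive failures seen
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage is `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the downstream service (a hypothetical name here). While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a timeout, which frees threads and gives the failing service room to recover.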
When should you optimize for p99 latency vs p50 (median)?
?
Optimize for p50 when:
- High-volume, low-value requests
- Best-effort services
- Limited budget
Optimize for p99 when:
- Users with most data = most valuable
- User-facing synchronous operations
- SLA specifies tail latencies
- Competitive advantage from performance
Optimize for p99.9 when:
- Critical path operations
- Large fan-out (one slow call blocks many)
- High-value B2B customers
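Trade-off: Each higher percentile costs disproportionately more to optimize; the book notes that Amazon targets p99.9 for internal services but judged p99.99 too expensive for the benefit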
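The large fan-out case can be quantified. A short sketch, assuming each backend call independently has a 1% chance of being slow (that is, of landing beyond the backend's own p99):

```python
def p_slow_response(fan_out: int, p_slow_backend: float = 0.01) -> float:
    """Probability that at least one of `fan_out` parallel backend calls is slow."""
    return 1 - (1 - p_slow_backend) ** fan_out

for n in (1, 10, 100):
    print(f"{n:>3} backend calls: {p_slow_response(n):.1%} of user requests are slow")
#   1 backend calls: 1.0% of user requests are slow
#  10 backend calls: 9.6% of user requests are slow
# 100 backend calls: 63.4% of user requests are slow
```

With 100 parallel calls, the backend's p99 effectively becomes the user's median experience, which is why high fan-out services need tighter percentile targets.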
Your database crashed. How do you ensure zero data loss?
?
Prevention:
- Synchronous replication: Data written to multiple nodes before ack
- Write-Ahead Log (WAL): Changes logged before applying
- Regular backups: Point-in-time recovery (PITR)
Recovery:
- Automatic failover: Detect failure, promote replica
- Validate data: Check integrity, verify replication lag was zero
- Update routing: Change DNS/load balancer
Key metrics:
- RPO (Recovery Point Objective): How much data loss acceptable?
- RTO (Recovery Time Objective): How fast must we recover?
Trade-off: Sync replication = no data loss but slower writes
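The write-ahead log idea can be shown with a toy key-value store. This is a sketch under simplifying assumptions (single process, JSON lines, no handling of a torn final record, no log compaction); `TinyKV` is a hypothetical name, not a real library.

```python
import json
import os
import tempfile

class TinyKV:
    """Toy store: every write is logged and fsynced before it is applied."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Recovery: rebuild in-memory state by re-applying the log in order.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk.
        with open(self.wal_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.data[key] = value

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyKV(wal_path)
db.put("balance", 100)
db.put("balance", 80)

recovered = TinyKV(wal_path)   # simulates a restart after a crash
print(recovered.data)          # {'balance': 80}
```

Any write acknowledged to the caller is already on disk, so a crash after `put` returns loses nothing on that node; protecting against loss of the disk itself still requires replication.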
Quick Facts
What did Twitter’s load parameters look like in 2012?
?
- Posting tweets: 4,600 requests/sec on average, over 12,000 requests/sec at peak
- Home timeline reads: 300,000 requests/sec
- Reads outnumber writes roughly 65x on average (about 25x against peak writes)
- Fan-out is the key scalability challenge, not absolute request volume
What percentage of Google’s machines fail per year?
?
- 1-5% of machines fail per year
- At scale, hardware failures are normal, not exceptional
- Must design systems to handle failures gracefully
What is 99.9% uptime in hours of downtime per year?
?
- 99.9% uptime = 8.76 hours downtime per year
- 99.99% = 52.6 minutes/year
- 99.999% = 5.26 minutes/year
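These figures come from multiplying the unavailable fraction by the length of a year. A quick sketch, assuming a 365-day year (leap years ignored):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year permitted by the given availability."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100

for pct in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.4f} hours = {hours * 60:.2f} minutes per year")
```

Each extra nine cuts the allowance by a factor of ten, so 99.999% leaves only a few minutes per year for all incidents and maintenance combined.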
Total Cards: 33
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-08