Chapter 1 Flashcards - Reliability, Scalability, Maintainability

flashcards chapter1 ddia

Basic Concepts

What is the difference between a fault and a failure?
?

Fault: A component deviating from its specification (e.g., hard disk crash, network packet lost)
Failure: The system as a whole stops providing service to users
A fault-tolerant system prevents faults from causing failures

What are the three fundamental concerns in designing data-intensive applications?
?

Reliability - System continues working correctly even when faults occur
Scalability - System’s ability to cope with increased load
Maintainability - Making life easier for engineering and operations teams

What are the three types of faults?
?

Hardware faults - Disk crashes, RAM errors, power outages
Software errors - Systematic bugs, cascading failures, runaway processes
Human errors - Configuration mistakes, operational errors (operator configuration errors were the leading cause of outages in one study of large internet services)

Why are software faults harder to anticipate than hardware faults?
?

Software errors are systematic and often correlated across nodes
Hardware faults are typically random and independent
Software bugs can cause cascading failures affecting multiple components
Harder to predict edge cases and interaction effects

What is the Mean Time To Failure (MTTF) for hard disks and what does this mean at scale?
?

MTTF: 10-50 years per disk
At scale: With 10,000 disks, expect about 1 disk to fail per day
Key insight: Redundancy and fault tolerance are essential at scale
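The one-disk-per-day figure follows from simple arithmetic. A minimal sketch, assuming failures are independent and spread evenly over the MTTF (a simplification; the function name is illustrative):

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Average disk failures per day, assuming independent failures
    spread evenly over the MTTF (a simplifying assumption)."""
    return disk_count / (mttf_years * 365)


for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {rate:.2f} failures/day")
```

The two ends of the MTTF range bracket roughly one failure per day for a 10,000-disk cluster.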

Reliability

What are the three strategies for handling human errors (the leading cause of outages)?
?

Design systems that minimize opportunities for error (good abstractions, clear APIs)
Decouple places where mistakes are made from places where they cause failures (sandbox environments, thorough testing)
Allow quick recovery (fast rollback, gradual rollout, detailed monitoring)

What is the difference between hardware redundancy and software fault-tolerance?
?

Hardware redundancy (traditional): RAID, dual power supplies, hot-swappable CPUs
Software fault-tolerance (modern/cloud): Expect commodity hardware failures, handle them in software
Cloud approach is more flexible and cost-effective, can handle datacenter-level failures

What is chaos engineering and why is it important?
?

Definition: Intentionally injecting failures into production systems to test resilience
Example: Netflix Chaos Monkey randomly kills production servers
Purpose: Ensure system handles failures correctly before they happen naturally
Result: Increased confidence in system reliability

Scalability

Why are percentiles better than averages for measuring response time?
?

Averages hide outliers: 99 requests at 10ms + 1 request at 1000ms = average ~20ms (misleading)
Percentiles capture distribution:
- p50 (median): 50% of requests are faster
- p95: 95% of requests are faster
- p99: 99% of requests are faster (captures tail latencies)
Tail latencies matter: Often from users with most data (your best customers)
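The gap between average and percentiles can be checked on a small sample. A minimal sketch using the nearest-rank method; the sample (97 fast requests, 3 slow ones) and the helper name are assumptions chosen for illustration:

```python
import math
from statistics import mean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    p percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 97 requests at 10 ms and 3 requests at 1000 ms.
latencies_ms = [10] * 97 + [1000] * 3

print("mean:", mean(latencies_ms))
for p in (50, 95, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
```

The mean (about 40 ms) matches no real request, while p50 stays at 10 ms and p99 exposes the 1000 ms tail. With exactly one slow request in 100, nearest-rank p99 would still report 10 ms, which is one reason to watch more than a single percentile.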

What is a Service Level Objective (SLO) and how does it differ from an SLA?
?

SLO (Service Level Objective): Internal target for system performance
- Example: p50 < 200ms, p99 < 1s, uptime 99.9%
SLA (Service Level Agreement): Contractual commitment with consequences
- Example: SLO + penalties/refunds if not met
SLO is what you aim for, SLA is what you promise customers

What are load parameters and give examples?
?

Definition: Metrics that describe the current load on a system
Examples:
- Requests per second to web server
- Ratio of reads to writes in database
- Number of simultaneously active users
- Cache hit rate
- Number of concurrent connections
Different for each application; Twitter’s key parameter is fan-out

Explain the Twitter fan-out problem and its solution.
?
Problem: When user posts tweet, how to show on all followers’ timelines efficiently?
Approach 1 (Pull): Write to global collection, join on read
- Pro: Simple writes
- Con: Expensive reads (join for every timeline view)
Approach 2 (Push): Pre-compute each user’s timeline
- Pro: Fast reads
- Con: Write amplification (celebrity with 30M followers = 30M writes per tweet)
Solution (Hybrid):
- Regular users: Push to timeline caches
- Celebrities: Pull at read time
- Optimize for common case, handle edge cases differently
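The hybrid approach can be sketched in memory. This is an illustration under stated assumptions, not Twitter's implementation: the class, its method names, and the tiny celebrity threshold are all hypothetical, and followers do not receive tweets posted before they followed.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real system would use a far larger value


class TimelineService:
    def __init__(self) -> None:
        self.followers = defaultdict(set)          # author -> users following them
        self.following = defaultdict(set)          # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(seq, text)]
        self.timeline_cache = defaultdict(list)    # user -> [(seq, author, text)]
        self._seq = 0

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self._seq += 1
        self.tweets_by_author[author].append((self._seq, text))
        if not self.is_celebrity(author):
            # Push: fan out on write into each follower's cached timeline.
            for follower in self.followers[author]:
                self.timeline_cache[follower].append((self._seq, author, text))

    def home_timeline(self, user: str) -> list[tuple[str, str]]:
        # Start from the precomputed cache, keyed by seq to avoid duplicates.
        entries = {seq: (author, text) for seq, author, text in self.timeline_cache[user]}
        # Pull: merge in celebrity tweets at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.tweets_by_author[author]:
                    entries[seq] = (author, text)
        return [entries[seq] for seq in sorted(entries, reverse=True)]


svc = TimelineService()
for fan in ("ann", "bob", "cat"):
    svc.follow(fan, "star")   # star reaches the celebrity threshold
svc.follow("ann", "dave")     # dave stays a regular user
svc.post("dave", "hello")
svc.post("star", "big news")
print(svc.home_timeline("ann"))
```

Posting as `star` touches no timeline cache, so the write cost stays constant regardless of follower count; the cost moves to readers who follow celebrities.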

What is write amplification and when does it occur?
?

Definition: When one logical write results in multiple physical writes
Example: Twitter celebrity with 30M followers posting one tweet = 30M cache writes
Occurs when: Pre-computing derived data, replication, maintaining indexes
Trade-off: Faster reads at cost of more expensive writes

What are the two main approaches to scaling?
?

Vertical scaling (scale-up): Add more resources to single machine (CPU, RAM, disk). Pro: Simple, no code changes. Con: Limited by hardware, expensive, single point of failure.
Horizontal scaling (scale-out): Add more machines, distribute load. Pro: No single-machine ceiling, can use cheaper commodity hardware. Con: More complex, need load balancing and distribution logic.

What real-world example demonstrates the business impact of latency?
?

Amazon: 100ms increase in response time = 1% decrease in sales
Shows direct correlation between performance and revenue
Justifies investment in latency optimization
Tail latencies (p99) often from best customers with most data

What is head-of-line blocking?
?

Definition: When slow requests hold up subsequent requests in a queue
Problem: One slow request can delay all requests behind it
Solution: Async/non-blocking I/O, separate queues, timeouts
Important consideration when analyzing tail latencies
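The effect is easy to see with a toy model. A minimal sketch, assuming a single worker, strict FIFO order, and all requests arriving at the same instant (the function name is illustrative):

```python
def fifo_response_times(service_times_ms: list[int]) -> list[int]:
    """Response time of each request when all arrive at t=0 and one
    worker serves them strictly in arrival order."""
    finished_at = 0
    response_times = []
    for service_time in service_times_ms:
        finished_at += service_time
        response_times.append(finished_at)
    return response_times


print(fifo_response_times([1000, 10, 10, 10]))  # slow request at the head
print(fifo_response_times([10, 10, 10, 1000]))  # slow request at the tail
```

With the slow request at the head, every request takes over a second even though three of them need only 10 ms of work. Measuring response time on the client side captures this queueing delay.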

Maintainability

What are the three design principles of maintainability?
?

Operability: Make it easy to keep system running smoothly. Good monitoring, automation support, documentation.
Simplicity: Make it easy to understand. Manage complexity via abstraction. Avoid: tight coupling, inconsistent naming, special cases.
Evolvability (Extensibility): Make it easy to change. Adapt to new requirements. Agile practices, TDD, refactoring.

What is the relationship between simplicity and abstraction?
?

Abstraction hides complexity behind clean interfaces
Good abstraction: Hides implementation details, provides clear API
Not the same as UI simplicity - can have simple interface with complex implementation
Goal: Make system easier to understand and modify
Example: SQL abstracts storage engine complexity

What are symptoms of unnecessary complexity in a system?
?

Explosion of state space
Tight coupling between modules
Tangled dependencies
Inconsistent naming and terminology
Special-case code to work around issues
Hacks for performance optimization
These indicate need for refactoring/simplification

Why does the book emphasize that most software cost is in maintenance, not initial development?
?

Reality: Most time spent on:
- Bug fixes and debugging
- Adding new features
- Operational work (deployments, migrations)
- Adapting to new platforms
- Understanding existing code
Implication: Design for maintainability from day one
Impact: Good maintainability reduces long-term costs significantly

What is operability and what does it include?
?
Definition: Making it easy for operations teams to keep system running
Includes:
- Good monitoring and visibility into system health
- Support for automation and tool integration
- Avoiding dependency on individual machines
- Good documentation and operational procedures
- Predictable behavior, avoiding surprises
- Self-healing capabilities with manual override options
- Good default behavior with configuration options

Modern Context (2026)

What is Site Reliability Engineering (SRE) and how does it relate to reliability?
?

Definition: Discipline that applies software engineering to operations
Key concepts:
- Error budgets (acceptable downtime)
- SLOs/SLIs (measurable reliability targets)
- Toil reduction (automate repetitive work)
Goal: Balance reliability with velocity
Origin: Pioneered by Google, now industry standard
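An error budget is the failure allowance left over by the SLO. A minimal sketch, assuming a request-based SLO; the function name and the sample numbers are hypothetical:

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 99.9% SLO, 1,000,000 requests, 400 of them failed.
remaining = error_budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(f"{remaining:.0%} of the error budget remains")
```

A team might slow down risky releases as the remaining fraction approaches zero and speed up when plenty of budget is left.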

How has scalability changed with serverless computing?
?

Traditional: Manual provisioning, capacity planning, managing servers
Serverless (2026):
- Automatic scaling from 0 to millions of requests
- Pay only for actual usage
- No server management
Examples: AWS Lambda, Google Cloud Functions, Google Cloud Run
Trade-off: Less control, potential cold starts, vendor lock-in

What is Platform Engineering and how does it improve maintainability?
?

Definition: Building internal developer platforms (IDPs)
Purpose: Make it easy for developers to deploy and operate services
Includes:
- Self-service infrastructure provisioning
- Standardized deployment pipelines
- Integrated observability
- Developer portals and documentation
Result: Improved developer experience (DevEx), faster delivery

What are DORA metrics and why are they important?
?
DORA = DevOps Research and Assessment
Four key metrics:
- Deployment frequency - How often you deploy
- Lead time for changes - Time from commit to production
- Mean time to recovery (MTTR) - How fast you recover from failures
- Change failure rate - % of deployments causing failures
Importance: Measure software delivery performance and maintainability
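All four metrics can be derived from deployment records. A minimal sketch, assuming a hypothetical `Deployment` record with commit, deploy, and restore timestamps; real tooling would pull these from CI/CD and incident systems:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Hypothetical two-day window with four deployments, one of which failed.
t0 = datetime(2026, 4, 1, 9, 0)
hours = lambda n: timedelta(hours=n)
deployments = [
    Deployment(t0, t0 + hours(2)),
    Deployment(t0 + hours(3), t0 + hours(4)),
    Deployment(t0 + hours(24), t0 + hours(28), failed=True,
               restored_at=t0 + hours(28) + timedelta(minutes=30)),
    Deployment(t0 + hours(30), t0 + hours(32)),
]
metrics = dora_metrics(deployments, period_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```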

Interview Scenarios

You’re seeing high latency at p99 but median latency is good. What should you investigate?
?
Investigation areas:
- Database queries: Check for missing indexes, slow queries on large datasets
- User patterns: Are p99 requests from users with more data?
- Resource contention: CPU/memory saturation, lock contention
- External dependencies: Third-party API slowness
- Garbage collection: Long GC pauses in JVM/.NET
Solutions:
- Add caching for expensive operations
- Query optimization and proper indexing
- Separate thread pools for different operations (bulkheads)
- Async processing for non-critical work
- Connection pooling to reduce overhead

Design a system that needs to scale from 100 to 1 million users. What’s your progression?
?
Phase 1: Single Server (100-1K users)
- Everything on one machine
- Simple deployment
Phase 2: Separate Tiers (1K-10K users)
- Separate web and database servers
- Enables independent scaling
Phase 3: Add Cache & CDN (10K-100K users)
- Redis/Memcached for data
- CDN for static assets
- Reduces database load
Phase 4: Horizontal Scaling (100K-1M users)
- Multiple web servers + load balancer
- Database read replicas
- Consider sharding if needed
- Message queues for async work
Key principle: Start simple, add complexity as needed

How do you prevent cascading failures in microservices?
?
Strategies:
- Circuit breaker: Stop sending requests to failing service
- Bulkhead pattern: Isolate resources (thread pools, connections)
- Timeout & retry: Fail fast, exponential backoff with jitter
- Rate limiting: Protect downstream services from overload
- Graceful degradation: Disable non-critical features
- Load shedding: Reject excess requests (return 503)
- Chaos engineering: Test failure scenarios regularly
Tools: Hystrix (now in maintenance mode), Resilience4j, Envoy
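The circuit breaker pattern can be sketched in a few lines. This is a simplified, single-threaded illustration, not the API of any of the tools listed; the class name, parameters, and the injectable `clock` are assumptions made for the example:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of tying up threads and connections on a dependency that is already struggling, which is what stops the failure from spreading upstream.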

When should you optimize for p99 latency vs p50 (median)?
?
Optimize for p50 when:
- High-volume, low-value requests
- Best-effort services
- Limited budget
Optimize for p99 when:
- Users with most data = most valuable
- User-facing synchronous operations
- SLA specifies tail latencies
- Competitive advantage from performance
Optimize for p99.9 when:
- Critical path operations
- Large fan-out (one slow call blocks many)
- High-value B2B customers
Trade-off: Optimizing higher percentiles gets disproportionately expensive (the ~10x-harder figure for p99 vs p50 is a rough rule of thumb, not a measured constant)
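The large fan-out case can be quantified. A minimal sketch of tail latency amplification, assuming each backend call is independently slow 1% of the time (the independence assumption and the function name are illustrative):

```python
def chance_of_slow_response(backend_calls: int, slow_fraction: float = 0.01) -> float:
    """Probability that at least one of N parallel backend calls is slow,
    assuming the calls are independent (a simplifying assumption)."""
    return 1 - (1 - slow_fraction) ** backend_calls


for n in (1, 10, 100):
    share = chance_of_slow_response(n)
    print(f"{n:>3} backend calls: {share:.1%} of user requests wait on a slow call")
```

At 100 parallel calls, roughly 63% of user requests hit at least one p99-slow backend, so a backend's tail becomes the typical user experience.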

Your database crashed. How do you ensure zero data loss?
?
Prevention:
- Synchronous replication: Data written to multiple nodes before ack
- Write-Ahead Log (WAL): Changes logged before applying
- Regular backups: Point-in-time recovery (PITR)
Recovery:
- Automatic failover: Detect failure, promote replica
- Validate data: Check integrity, verify replication lag was zero
- Update routing: Change DNS/load balancer
Key metrics:
- RPO (Recovery Point Objective): How much data loss acceptable?
- RTO (Recovery Time Objective): How fast must we recover?
Trade-off: Sync replication = no loss of acknowledged writes (as long as a synchronous replica survives) but slower writes
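The replication trade-off can be shown with a toy model. A minimal in-memory sketch, assuming one leader and one replica; the classes are hypothetical and ignore networking, durability, and timing:

```python
class Replica:
    def __init__(self) -> None:
        self.log: list[str] = []


class Leader:
    def __init__(self, replica: Replica, synchronous: bool) -> None:
        self.replica = replica
        self.synchronous = synchronous
        self.log: list[str] = []
        self.unsent: list[str] = []  # acknowledged writes not yet shipped

    def write(self, record: str) -> str:
        self.log.append(record)
        if self.synchronous:
            self.replica.log.append(record)  # replica has it before we ack
        else:
            self.unsent.append(record)       # ack now, replicate later
        return "ack"


def acked_writes_lost_on_crash(synchronous: bool) -> int:
    replica = Replica()
    leader = Leader(replica, synchronous)
    acked = [r for r in ("w1", "w2", "w3") if leader.write(r) == "ack"]
    # The leader crashes here, before any background shipping runs,
    # and the replica is promoted with whatever it has received.
    return len([r for r in acked if r not in replica.log])


print("sync: ", acked_writes_lost_on_crash(synchronous=True), "acknowledged writes lost")
print("async:", acked_writes_lost_on_crash(synchronous=False), "acknowledged writes lost")
```

In this model the asynchronous leader loses every write that was acknowledged but not yet shipped, which is the data-loss window that RPO is meant to bound.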

Quick Facts

What did Twitter’s load parameters look like in 2012?
?

4,600 requests/sec on average for posting tweets (over 12,000 requests/sec at peak)
300,000 requests/sec for home timeline reads
Roughly 65x more reads than writes on average (25x at the peak write rate)
Fan-out is the key scalability challenge, not absolute request volume
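The fan-out cost can be worked out from the figures the book reports: an average of 4.6k tweets/sec (12k at peak), each delivered to about 75 followers on average. A short sketch of that arithmetic (the constant names are illustrative):

```python
AVG_TWEETS_PER_SEC = 4_600        # average posting rate reported in the book
AVG_FOLLOWERS_PER_TWEET = 75      # average delivery fan-out reported in the book
CELEBRITY_FOLLOWERS = 30_000_000  # follower count of the largest accounts

cache_writes_per_sec = AVG_TWEETS_PER_SEC * AVG_FOLLOWERS_PER_TWEET
print(f"{cache_writes_per_sec:,} timeline-cache writes/sec on average")
print(f"{CELEBRITY_FOLLOWERS:,} cache writes for one celebrity tweet")
```

A modest posting rate turns into hundreds of thousands of cache writes per second, and a single celebrity tweet adds tens of millions more.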

What percentage of Google’s machines fail per year?
?

1-5% of machines fail per year
At scale, hardware failures are normal, not exceptional
Must design systems to handle failures gracefully

What is 99.9% uptime in hours of downtime per year?
?

99.9% uptime = 8.76 hours downtime per year
99.99% = 52.6 minutes/year
99.999% = 5.26 minutes/year
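These downtime figures come from multiplying the allowed unavailability by the hours in a year. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; ignores leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime: {hours:.2f} hours = {hours * 60:.1f} minutes per year")
```

Each additional nine cuts the allowed downtime by a factor of ten.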

Total Cards: 33
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-08

Study Notes by Niladri & AI
