Chapter 1 Flashcards - Reliability, Scalability, Maintainability

flashcards chapter1 ddia

Basic Concepts

What is the difference between a fault and a failure?
?

  • Fault: A component deviating from its specification (e.g., hard disk crash, network packet lost)
  • Failure: The system as a whole stops providing service to users
  • A fault-tolerant system prevents faults from causing failures

What are the three fundamental concerns in designing data-intensive applications?
?

  1. Reliability - System continues working correctly even when faults occur
  2. Scalability - System’s ability to cope with increased load
  3. Maintainability - Making life easier for engineering and operations teams

What are the three types of faults?
?

  1. Hardware faults - Disk crashes, RAM errors, power outages
  2. Software errors - Systematic bugs, cascading failures, runaway processes
  3. Human errors - Configuration mistakes, operational errors (most common cause of outages)

Why are software faults harder to anticipate than hardware faults?
?

  • Software errors are systematic and often correlated across nodes
  • Hardware faults are typically random and independent
  • Software bugs can cause cascading failures affecting multiple components
  • Harder to predict edge cases and interaction effects

What is the Mean Time To Failure (MTTF) for hard disks and what does this mean at scale?
?

  • MTTF: 10-50 years per disk
  • At scale: With 10,000 disks, expect about 1 disk to fail per day
  • Key insight: Redundancy and fault tolerance are essential at scale
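
The "about 1 disk per day" figure can be checked with quick arithmetic. The sketch below assumes disks fail independently at a constant rate; the function name is made up for illustration.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent, constant-rate failures."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


# The card's range of MTTF values, applied to a 10,000-disk cluster.
for mttf_years in (10, 30, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {rate:.2f} failures/day")
```

Across the 10-50 year range the result falls between roughly 0.5 and 2.7 failures per day, which is where "about one per day" comes from.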

Reliability

What are the three strategies for handling human errors (the leading cause of outages)?
?

  1. Design systems that minimize opportunities for error (good abstractions, clear APIs)
  2. Decouple places where mistakes are made from places where they cause failures (sandbox environments, thorough testing)
  3. Allow quick recovery (fast rollback, gradual rollout, detailed monitoring)

What is the difference between hardware redundancy and software fault-tolerance?
?

  • Hardware redundancy (traditional): RAID, dual power supplies, hot-swappable CPUs
  • Software fault-tolerance (modern/cloud): Expect commodity hardware failures, handle them in software
  • Cloud approach is more flexible and cost-effective, can handle datacenter-level failures

What is chaos engineering and why is it important?
?

  • Definition: Intentionally injecting failures into production systems to test resilience
  • Example: Netflix Chaos Monkey randomly kills production servers
  • Purpose: Ensure system handles failures correctly before they happen naturally
  • Result: Increased confidence in system reliability

Scalability

Why are percentiles better than averages for measuring response time?
?

  • Averages hide outliers: 99 requests at 10ms + 1 request at 1000ms = average ~20ms (misleading)
  • Percentiles capture distribution:
    • p50 (median): 50% of requests are faster
    • p95: 95% of requests are faster
    • p99: 99% of requests are faster (captures tail latencies)
  • Tail latencies matter: Often from users with most data (your best customers)
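
A small sketch of the mean-versus-percentile gap. It uses the nearest-rank percentile definition (one of several common definitions) and a slightly different sample than the card, 98 fast requests and 2 slow ones, so that p99 lands on a slow request; both choices are assumptions made for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 requests at 10 ms and 2 requests at 1000 ms.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50} ms, p95={p95} ms, p99={p99} ms")
```

The mean (29.8 ms) matches no request that was actually served, while p50 and p99 show the two populations directly.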

What is a Service Level Objective (SLO) and how does it differ from an SLA?
?

  • SLO (Service Level Objective): Internal target for system performance
    • Example: p50 < 200ms, p99 < 1s, uptime 99.9%
  • SLA (Service Level Agreement): Contractual commitment with consequences
    • Example: SLO + penalties/refunds if not met
  • SLO is what you aim for, SLA is what you promise customers

What are load parameters and give examples?
?

  • Definition: Metrics that describe the current load on a system
  • Examples:
    • Requests per second to web server
    • Ratio of reads to writes in database
    • Number of simultaneously active users
    • Cache hit rate
    • Number of concurrent connections
  • Different for each application; Twitter’s key parameter is fan-out

Explain the Twitter fan-out problem and its solution.
?
Problem: When user posts tweet, how to show on all followers’ timelines efficiently?

Approach 1 (Pull): Write to global collection, join on read

  • Pro: Simple writes
  • Con: Expensive reads (join for every timeline view)

Approach 2 (Push): Pre-compute each user’s timeline

  • Pro: Fast reads
  • Con: Write amplification (celebrity with 30M followers = 30M writes per tweet)

Solution (Hybrid):

  • Regular users: Push to timeline caches
  • Celebrities: Pull at read time
  • Optimize for common case, handle edge cases differently
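
The hybrid approach can be sketched in memory as follows. This is an illustration only, not Twitter's implementation: the class name, the tiny `CELEBRITY_THRESHOLD`, and the in-memory dictionaries are all hypothetical stand-ins for real storage and caches.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, kept tiny so the demo is easy to follow


class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)          # author -> users following them
        self.following = defaultdict(set)          # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(seq, author, text)]
        self.timeline_cache = defaultdict(list)    # user -> [(seq, author, text)]
        self.seq = 0                               # global ordering of tweets

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self.seq += 1
        entry = (self.seq, author, text)
        self.tweets_by_author[author].append(entry)
        if not self.is_celebrity(author):
            # Push path: fan out to every follower's cached timeline at write time.
            for follower in self.followers[author]:
                self.timeline_cache[follower].append(entry)

    def home_timeline(self, user):
        # Start from the precomputed cache, keyed by seq to avoid duplicates.
        entries = {e[0]: e for e in self.timeline_cache[user]}
        # Pull path: merge in tweets from followed celebrities at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update({e[0]: e for e in self.tweets_by_author[author]})
        newest_first = sorted(entries.values(), reverse=True)
        return [(author, text) for _, author, text in newest_first]
```

Regular authors pay the cost at write time and celebrities at read time, so neither path has to handle the other's worst case.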

What is write amplification and when does it occur?
?

  • Definition: When one logical write results in multiple physical writes
  • Example: Twitter celebrity with 30M followers posting one tweet = 30M cache writes
  • Occurs when: Pre-computing derived data, replication, maintaining indexes
  • Trade-off: Faster reads at cost of more expensive writes

What are the two main approaches to scaling?
?

  1. Vertical scaling (scale-up): Add more resources to single machine (CPU, RAM, disk). Pro: Simple, no code changes. Con: Limited by hardware, expensive, single point of failure.
  2. Horizontal scaling (scale-out): Add more machines, distribute load. Pro: No single-machine ceiling, can use commodity hardware. Con: More complex, need load balancing and distribution logic.

What real-world example demonstrates the business impact of latency?
?

  • Amazon: 100ms increase in response time = 1% decrease in sales
  • Shows direct correlation between performance and revenue
  • Justifies investment in latency optimization
  • Tail latencies (p99) often from best customers with most data

What is head-of-line blocking?
?

  • Definition: When slow requests hold up subsequent requests in a queue
  • Problem: One slow request can delay all requests behind it
  • Solution: Async/non-blocking I/O, separate queues, timeouts
  • Important consideration when analyzing tail latencies
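
A minimal model of the effect, assuming a single FIFO worker and all requests arriving at time zero. The service times are invented for illustration.

```python
from itertools import accumulate


def completion_times(service_times_ms):
    """Completion time of each request on one FIFO worker when all arrive at t=0."""
    return list(accumulate(service_times_ms))


# One slow request at the head of a shared queue delays everything behind it.
shared_queue = completion_times([1000, 10, 10, 10])

# With a separate queue (and worker) for the fast requests, they are unaffected.
fast_queue = completion_times([10, 10, 10])

print("shared queue:", shared_queue)
print("fast queue:  ", fast_queue)
```

The fast requests each need only 10 ms of work, yet in the shared queue the client observes more than a second for every one of them, which is why queueing delay dominates tail latency.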

Maintainability

What are the three design principles of maintainability?
?

  1. Operability: Make it easy to keep system running smoothly. Good monitoring, automation support, documentation.
  2. Simplicity: Make it easy to understand. Manage complexity via abstraction. Avoid: tight coupling, inconsistent naming, special cases.
  3. Evolvability (Extensibility): Make it easy to change. Adapt to new requirements. Agile practices, TDD, refactoring.

What is the relationship between simplicity and abstraction?
?

  • Abstraction hides complexity behind clean interfaces
  • Good abstraction: Hides implementation details, provides clear API
  • Not the same as UI simplicity - can have simple interface with complex implementation
  • Goal: Make system easier to understand and modify
  • Example: SQL abstracts storage engine complexity

What are symptoms of unnecessary complexity in a system?
?

  • Explosion of state space
  • Tight coupling between modules
  • Tangled dependencies
  • Inconsistent naming and terminology
  • Special-case code to work around issues
  • Hacks for performance optimization
  • These indicate need for refactoring/simplification

Why does the book emphasize that most software cost is in maintenance, not initial development?
?

  • Reality: Most time spent on:
    • Bug fixes and debugging
    • Adding new features
    • Operational work (deployments, migrations)
    • Adapting to new platforms
    • Understanding existing code
  • Implication: Design for maintainability from day one
  • Impact: Good maintainability reduces long-term costs significantly

What is operability and what does it include?
?
Definition: Making it easy for operations teams to keep system running

Includes:

  • Good monitoring and visibility into system health
  • Support for automation and tool integration
  • Avoiding dependency on individual machines
  • Good documentation and operational procedures
  • Predictable behavior, avoiding surprises
  • Self-healing capabilities with manual override options
  • Good default behavior with configuration options

Modern Context (2026)

What is Site Reliability Engineering (SRE) and how does it relate to reliability?
?

  • Definition: Discipline that applies software engineering to operations
  • Key concepts:
    • Error budgets (acceptable downtime)
    • SLOs/SLIs (measurable reliability targets)
    • Toil reduction (automate repetitive work)
  • Goal: Balance reliability with velocity
  • Origin: Pioneered by Google, now industry standard
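
An error budget follows directly from the SLO. The sketch below assumes a time-based availability SLO over a rolling 30-day window; the function names and the 30-day default are illustrative choices, not a standard API.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already incurred; negative means the SLO is blown."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


print(f"99.9% over 30 days: {error_budget_minutes(99.9):.1f} minutes of budget")
print(f"after 30 minutes of downtime: {budget_remaining(99.9, 30):.1f} minutes left")
```

While budget remains, teams can spend it on risky releases; once it is exhausted, the usual response is to slow down and prioritize reliability work.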

How has scalability changed with serverless computing?
?

  • Traditional: Manual provisioning, capacity planning, managing servers
  • Serverless (2026):
    • Automatic scaling from 0 to millions of requests
    • Pay only for actual usage
    • No server management
  • Examples: AWS Lambda, Cloud Functions, Cloud Run
  • Trade-off: Less control, potential cold starts, vendor lock-in

What is Platform Engineering and how does it improve maintainability?
?

  • Definition: Building internal developer platforms (IDPs)
  • Purpose: Make it easy for developers to deploy and operate services
  • Includes:
    • Self-service infrastructure provisioning
    • Standardized deployment pipelines
    • Integrated observability
    • Developer portals and documentation
  • Result: Improved developer experience (DevEx), faster delivery

What are DORA metrics and why are they important?
?
DORA = DevOps Research and Assessment
Four key metrics:

  1. Deployment frequency - How often you deploy
  2. Lead time for changes - Time from commit to production
  3. Mean time to recovery (MTTR) - How fast you recover from failures
  4. Change failure rate - % of deployments causing failures

Importance: Measure software delivery performance and maintainability
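
The four metrics can be computed from a log of deployments. The sketch below is one possible way to do it; the `Deployment` record and its fields are hypothetical, and real tooling usually derives these values from CI/CD and incident data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set when a failed deployment was recovered


def dora_metrics(deployments, period_days):
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
        "change_failure_rate": len(failures) / len(deployments),
    }


# Invented sample data: four deployments over two days, one of which failed.
t0 = datetime(2026, 1, 1, 9, 0)
day = timedelta(days=1)
hour = timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0, t0 + 4 * hour, failed=True, restored_at=t0 + 5 * hour),
    Deployment(t0 + day, t0 + day + hour),
    Deployment(t0 + day, t0 + day + hour),
]
metrics = dora_metrics(sample, period_days=2)
print(metrics)
```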

Interview Scenarios

You’re seeing high latency at p99 but median latency is good. What should you investigate?
?
Investigation areas:

  1. Database queries: Check for missing indexes, slow queries on large datasets
  2. User patterns: Are p99 requests from users with more data?
  3. Resource contention: CPU/memory saturation, lock contention
  4. External dependencies: Third-party API slowness
  5. Garbage collection: Long GC pauses in JVM/.NET

Solutions:

  • Add caching for expensive operations
  • Query optimization and proper indexing
  • Separate thread pools for different operations (bulkheads)
  • Async processing for non-critical work
  • Connection pooling to reduce overhead

Design a system that needs to scale from 100 to 1 million users. What’s your progression?
?
Phase 1: Single Server (100-1K users)

  • Everything on one machine
  • Simple deployment

Phase 2: Separate Tiers (1K-10K users)

  • Separate web and database servers
  • Enables independent scaling

Phase 3: Add Cache & CDN (10K-100K users)

  • Redis/Memcached for data
  • CDN for static assets
  • Reduces database load

Phase 4: Horizontal Scaling (100K-1M users)

  • Multiple web servers + load balancer
  • Database read replicas
  • Consider sharding if needed
  • Message queues for async work

Key principle: Start simple, add complexity as needed

How do you prevent cascading failures in microservices?
?
Strategies:

  1. Circuit breaker: Stop sending requests to failing service
  2. Bulkhead pattern: Isolate resources (thread pools, connections)
  3. Timeout & retry: Fail fast, exponential backoff with jitter
  4. Rate limiting: Protect downstream services from overload
  5. Graceful degradation: Disable non-critical features
  6. Load shedding: Reject excess requests (return 503)
  7. Chaos engineering: Test failure scenarios regularly

Tools: Hystrix, Resilience4j, Envoy
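
A minimal circuit breaker sketch, to make the first strategy concrete. It is not the API of any of the tools listed; the class, its parameters, and the injectable `clock` are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # time the circuit opened; None means closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast instead of piling load onto a failing service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and treat `CircuitOpenError` as a signal to degrade gracefully.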

When should you optimize for p99 latency vs p50 (median)?
?
Optimize for p50 when:

  • High-volume, low-value requests
  • Best-effort services
  • Limited budget

Optimize for p99 when:

  • Users with most data = most valuable
  • User-facing synchronous operations
  • SLA specifies tail latencies
  • Competitive advantage from performance

Optimize for p99.9 when:

  • Critical path operations
  • Large fan-out (one slow call blocks many)
  • High-value B2B customers

Trade-off: Each higher percentile costs disproportionately more to improve than the median, so target the highest percentile the business case justifies
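
The fan-out case can be quantified. If one user request waits on several backend calls and each call independently has a small chance of being slow, the chance that the whole request is slow grows quickly. The sketch below assumes independence between calls, which real systems only approximate.

```python
def prob_request_slow(fan_out: int, per_call_slow_prob: float = 0.01) -> float:
    """Probability that at least one of `fan_out` independent backend calls is slow."""
    return 1 - (1 - per_call_slow_prob) ** fan_out


# per_call_slow_prob=0.01 means each backend call exceeds its p99 latency 1% of the time.
for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {prob_request_slow(n):.1%} of user requests hit a slow call")
```

With a fan-out of 100, well over half of user requests see at least one backend call slower than that backend's p99, so the backend's tail effectively becomes the user's median.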

Your database crashed. How do you ensure zero data loss?
?
Prevention:

  1. Synchronous replication: Data written to multiple nodes before ack
  2. Write-Ahead Log (WAL): Changes logged before applying
  3. Regular backups: Point-in-time recovery (PITR)

Recovery:

  1. Automatic failover: Detect failure, promote replica
  2. Validate data: Check integrity, verify replication lag was zero
  3. Update routing: Change DNS/load balancer

Key metrics:

  • RPO (Recovery Point Objective): How much data loss acceptable?
  • RTO (Recovery Time Objective): How fast must we recover?

Trade-off: Sync replication = no data loss but slower writes

Quick Facts

What did Twitter’s load parameters look like in 2012?
?

  • 4,600 requests/sec on average for posting tweets (over 12,000 at peak)
  • 300,000 requests/sec for timeline reads
  • Roughly 65x more reads than average writes (25x measured against peak writes)
  • Fan-out is the key scalability challenge, not absolute request volume
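
The arithmetic behind "fan-out, not request volume" can be worked through with the book's figures: an average of about 4.6k tweets/sec (12k is the peak rate) and about 75 followers per tweet on average. The constant and function names below are only for illustration.

```python
AVG_TWEETS_PER_SEC = 4_600       # average posting rate from the book (12k is the peak)
AVG_FOLLOWERS_PER_TWEET = 75     # average fan-out per tweet from the book
CELEBRITY_FOLLOWERS = 30_000_000


def cache_writes_per_sec(tweets_per_sec: int, followers_per_tweet: int) -> int:
    """Timeline-cache writes generated per second under the push approach."""
    return tweets_per_sec * followers_per_tweet


average_writes = cache_writes_per_sec(AVG_TWEETS_PER_SEC, AVG_FOLLOWERS_PER_TWEET)
celebrity_writes = cache_writes_per_sec(1, CELEBRITY_FOLLOWERS)

print(f"average load: {average_writes:,} cache writes/sec")
print(f"one celebrity tweet: {celebrity_writes:,} cache writes")
```

A modest posting rate turns into 345,000 cache writes per second on average, and a single celebrity tweet alone generates 30 million writes.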

What percentage of Google’s machines fail per year?
?

  • 1-5% of machines fail per year
  • At scale, hardware failures are normal, not exceptional
  • Must design systems to handle failures gracefully

What is 99.9% uptime in hours of downtime per year?
?

  • 99.9% uptime = 8.76 hours downtime per year
  • 99.99% = 52.6 minutes/year
  • 99.999% = 5.26 minutes/year
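
These figures come from multiplying the unavailable fraction by the length of a year. A quick check, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for availability in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability}%: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```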

Total Cards: 33
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-08