Q&A Session Notes

Chapter 1: Reliability, Scalability, Maintainability

Conceptual Understanding Questions

Q1: What’s the difference between a fault and a failure?

Answer:

  • Fault: One component deviating from its specification (e.g., a hard disk crashes, a network packet is lost)
  • Failure: The system as a whole stops providing the required service to users
  • Key insight: A fault-tolerant system prevents faults from causing failures
  • Example: If one server crashes (fault) but the load balancer redirects traffic to healthy servers, no user sees downtime (no failure)
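
The load balancer example can be sketched in a few lines of Python. This is only an illustration of the fault/failure distinction, not a real load balancer; `call_with_failover`, `crashed_server` and `healthy_server` are hypothetical names invented for the sketch.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every backend faulted: the faults have become a failure."""


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in turn; a fault in one is masked by the next."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except ConnectionError as exc:  # a fault in a single component
            errors.append(exc)
    raise AllBackendsFailed(errors)     # only now does the user see a failure


def crashed_server() -> str:
    raise ConnectionError("server A is down")


def healthy_server() -> str:
    return "200 OK from server B"


print(call_with_failover([crashed_server, healthy_server]))
```

The caller gets a normal response even though one component faulted. The failure only surfaces if every backend in the list is down.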

Q2: Why are percentiles better than averages for measuring response time?

Answer:

  • Averages hide outliers: If 99 requests take 10ms and 1 takes 1000ms, the average is ~20ms, a latency that no request actually experienced
  • Median (p50): Half of requests are faster than this value, half are slower
  • p95/p99: Only 5%/1% of requests are slower than this value; these capture tail latencies
  • Why it matters: The slowest requests often come from the users with the most data (your best customers)
  • Real impact: Amazon observed that a 100ms increase in response time reduced sales by 1%
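
To make the arithmetic concrete, here is a small sketch. `percentile` is a hypothetical helper using the nearest-rank definition, which is only one of several conventions; monitoring tools may interpolate differently.

```python
import math
from statistics import mean


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# The example from the answer: 99 fast requests and one slow one.
one_slow = [10] * 99 + [1000]
# A second, assumed data set where 2% of requests are slow.
two_slow = [10] * 98 + [1000] * 2

for name, latencies_ms in (("1% slow", one_slow), ("2% slow", two_slow)):
    print(name, mean(latencies_ms), percentile(latencies_ms, 50),
          percentile(latencies_ms, 99), max(latencies_ms))
```

With exactly one slow request in 100, the mean is 19.9 ms, the median is 10 ms, and the nearest-rank p99 is still 10 ms; the outlier only appears at the maximum. Once 2% of requests are slow, p99 jumps to 1000 ms while the mean moves only to 29.8 ms, which is why alerting on tail percentiles catches problems that averages smooth over.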

Q3: What is the Twitter fan-out problem and how was it solved?

Answer:
Problem: When a user tweets, how do you efficiently show the tweet on all of their followers’ timelines?

Approach 1 (Pull):

  • Write tweet to global collection
  • When a user loads their timeline, look up the accounts they follow, fetch those accounts’ tweets, and merge them by time
  • Issue: Expensive reads (join on every timeline view)

Approach 2 (Push):

  • Pre-compute each user’s timeline cache
  • When tweet posted, write to all followers’ caches
  • Issue: For celebrities with 30M followers, one tweet = 30M writes (write amplification)

Solution (Hybrid):

  • Most users: Approach 2 (push to caches)
  • Celebrities: Approach 1 (pull at read time)
  • Trade-off: Optimize for common case (regular users), handle edge cases differently
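
A toy, in-memory sketch of the hybrid approach follows. It is not Twitter's implementation: the `Timelines` class, the `CELEBRITY_THRESHOLD` cutoff, and the use of a counter in place of timestamps are all assumptions made for illustration, and it does not backfill caches when someone follows an account late.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # assumed cutoff; tiny so the demo is easy to follow


class Timelines:
    def __init__(self) -> None:
        self.followers = defaultdict(set)  # author -> users who follow them
        self.following = defaultdict(set)  # user -> authors they follow
        self.tweets = defaultdict(list)    # author -> [(seq, text)]
        self.cache = defaultdict(list)     # user -> [(seq, author, text)]
        self.seq = 0                       # counter standing in for a timestamp

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self.seq += 1
        self.tweets[author].append((self.seq, text))
        if not self.is_celebrity(author):
            # Approach 2: fan out on write for ordinary accounts.
            for user in self.followers[author]:
                self.cache[user].append((self.seq, author, text))

    def timeline(self, user: str) -> list:
        # Start from the precomputed cache, keyed by seq to avoid duplicates.
        entries = {seq: (author, text) for seq, author, text in self.cache[user]}
        # Approach 1: pull celebrity tweets at read time and merge them in.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.tweets[author]:
                    entries[seq] = (author, text)
        return [f"{a}: {t}" for _, (a, t) in sorted(entries.items(), reverse=True)]
```

Posting as a celebrity costs one write regardless of follower count, while a follower's read pays for one extra lookup per celebrity they follow.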

Q4: Why is software fault-tolerance more important in cloud than hardware redundancy?

Answer:

  • Traditional approach: RAID, dual power supplies, hot-swappable CPUs (expensive hardware)
  • Cloud approach: Use commodity hardware, expect failures, handle in software
  • Benefits:
    • More flexible (can handle entire datacenter failures, not just single machine)
    • Cost-effective (cheaper machines, failures handled in software)
    • Can do rolling upgrades without downtime
    • Better for multi-region deployments
  • Example: Netflix’s Chaos Monkey - randomly kills production servers to ensure resilience

Technical Interview Questions

Q5: Design a system that needs to scale from 100 to 1 million users. What’s your approach?

Answer Framework:

Step 1: Single Server (100-1000 users)

  • Web app + database on one machine
  • Vertical scaling as needed

Step 2: Separate Database (1K-10K users)

  • Web tier separate from data tier
  • Enables independent scaling

Step 3: Add Cache & CDN (10K-100K users)

  • Redis/Memcached for frequently accessed data
  • CDN for static assets
  • Reduce database load
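
One common way to use such a cache is the cache-aside pattern: check the cache first, and on a miss load from the database and store the result. The sketch below assumes a plain dict in place of Redis/Memcached; `CacheAside`, `load` and `ttl_s` are hypothetical names.

```python
import time
from typing import Callable


class CacheAside:
    """Cache-aside lookup with a TTL; a dict stands in for Redis/Memcached."""

    def __init__(self, load: Callable[[str], str], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.load = load      # slow path, e.g. a database query
        self.ttl_s = ttl_s
        self.clock = clock    # injectable so tests need not sleep
        self.entries = {}     # key -> (expires_at, value)

    def get(self, key: str) -> str:
        hit = self.entries.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]                     # cache hit: no database work
        value = self.load(key)                # miss or expired: hit the database
        self.entries[key] = (self.clock() + self.ttl_s, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so readers do not keep the old value until expiry."""
        self.entries.pop(key, None)
```

The TTL bounds how stale a value can get, and `invalidate` covers the case where the application knows a write just happened.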

Step 4: Horizontal Scaling (100K-1M users)

  • Multiple web servers behind load balancer
  • Database replication (read replicas)
  • Consider database sharding if needed

Step 5: Additional Considerations

  • Message queue for async processing
  • Monitor with percentiles (p95, p99)
  • Auto-scaling based on load
  • Multi-region for reliability

Key Points to Mention:

  • Start simple, add complexity as needed
  • Measure before optimizing
  • Different scaling strategies for reads vs writes

Q6: How would you handle a service that has p99 latency of 2 seconds (SLA requires < 500ms)?

Answer Approach:

1. Investigate & Measure

  • Where does latency come from? (DB queries, external APIs, computation)
  • Profile the slow requests specifically (not averages)
  • Check if tail latencies correlate with specific patterns (user types, data volumes)

2. Common Causes & Solutions

Database Issues:

  • Missing indexes → Add appropriate indexes
  • Slow queries → Query optimization, consider caching
  • Lock contention → Optimize transaction boundaries
  • Read replicas overloaded → Add more replicas or cache

External Dependencies:

  • Third-party API slow → Add timeout, async processing, circuit breaker
  • Downstream service slow → Implement bulkheads, rate limiting

Resource Contention:

  • CPU/Memory saturation → Vertical or horizontal scaling
  • Thread pool exhausted → Tune thread pool sizes
  • Head-of-line blocking → Use async/non-blocking I/O

3. Architectural Solutions:

  • Caching: Cache expensive operations (Redis, CDN)
  • Async Processing: Move non-critical work to background queues
  • Read Replicas: Separate read and write traffic
  • Connection Pooling: Reduce connection overhead
  • Load Shedding: Reject requests when overwhelmed (return 503)
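
Load shedding is the least familiar item on this list, so here is a minimal sketch. It assumes a simple cap on in-flight requests; `LoadShedder` and `max_in_flight` are hypothetical names, and production systems often shed based on queue time or CPU instead.

```python
import threading
from typing import Callable, Tuple


class LoadShedder:
    """Reject work once too many requests are already in flight."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def handle(self, work: Callable[[], str]) -> Tuple[int, str]:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return 503, "Service Unavailable"  # shed before doing any work
            self.in_flight += 1
        try:
            return 200, work()
        finally:
            with self.lock:
                self.in_flight -= 1
```

Rejecting early keeps latency bounded for the requests that are admitted, instead of letting every request queue up and miss the SLA together.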

4. Monitoring Strategy:

  • Set up alerts on p95/p99, not just averages
  • Track latency by endpoint, user type, region
  • Correlate with error rates and throughput

Q7: Your database server crashes. How do you ensure zero data loss?

Answer:

Prevention (Before Crash):

  1. Replication

    • Synchronous replication to at least one replica
    • Ensures data written to multiple servers before ack
    • Trade-off: Increased write latency
  2. Write-Ahead Log (WAL)

    • All changes written to log before data files
    • Can replay log after crash
    • Most databases have this by default
  3. Backups

    • Regular automated backups
    • Point-in-time recovery (PITR)
    • Store in different location/region
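
The write-ahead log idea can be shown with a toy key-value store. This is a sketch under simplifying assumptions, not how any particular database implements its WAL: `TinyKV` is a hypothetical class, the log is a JSON-lines file, and there is no checksum to detect a half-written final record.

```python
import json
import os
import tempfile


class TinyKV:
    """Key-value store that logs every write before applying it in memory."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.data = {}
        self._replay()

    def _replay(self) -> None:
        # After a crash the in-memory state is gone; rebuild it from the log.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key: str, value: str) -> None:
        with open(self.wal_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # on disk before we acknowledge the write
        self.data[key] = value      # apply only after the log write succeeded


wal = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyKV(wal)
store.put("balance", "100")
recovered = TinyKV(wal)  # simulates a restart after a crash
print(recovered.data)
```

The second `TinyKV` instance never saw the `put` call, yet it ends up with the same data because it replays the log on startup.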

Recovery (After Crash):

  1. Automatic Failover

    • Detect failure (heartbeat timeout)
    • Promote replica to primary
    • Update DNS/load balancer
    • Trade-off: Risk of split-brain
  2. Data Validation

    • Check data integrity after recovery
    • Verify the promoted replica had zero replication lag (otherwise recent writes may be missing)
    • Run consistency checks

Key Considerations:

  • RPO (Recovery Point Objective): How much data can you afford to lose? (Dictates sync vs async replication)
  • RTO (Recovery Time Objective): How quickly must you recover? (Dictates failover automation)
  • Consistency: Strong consistency requires synchronous replication (slower writes)
  • Monitoring: Detect failures quickly (heartbeats, health checks)

Q8: You’re seeing cascading failures across your microservices. How do you prevent this?

Answer:

1. Circuit Breaker Pattern

  • Detect when service is failing
  • Stop sending requests (fail fast)
  • Periodically retry to check if service recovered
  • Tools: Resilience4j; Netflix Hystrix (now in maintenance mode)
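
The three bullets above map onto the closed, open and half-open states. The sketch below is a minimal hand-rolled version for study purposes, not the API of Hystrix or Resilience4j; `CircuitBreaker`, `max_failures` and `reset_after_s` are hypothetical names.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is currently marked as failing."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast")  # open: skip the call
            # Half-open: the wait is over, let one trial request through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()           # open, or reopen after a failed trial
            raise
        self.failures = 0
        self.opened_at = None                           # success closes the circuit
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than tying up a thread waiting on a dependency that is known to be unhealthy.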

2. Bulkhead Pattern

  • Isolate resources (thread pools, connection pools)
  • Failure in one area doesn’t affect others
  • Like watertight compartments in a ship
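
A bulkhead can be as simple as one bounded semaphore per dependency. In this sketch `Bulkhead`, `payments` and `search` are hypothetical names, and the limit of 2 is arbitrary.

```python
import threading
from typing import Callable


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self.slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn: Callable[[], str]) -> str:
        # Do not wait for a slot: a full compartment rejects immediately.
        if not self.slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            return fn()
        finally:
            self.slots.release()


# One compartment per dependency, so a slow one cannot use up the others' capacity.
payments = Bulkhead("payments", max_concurrent=2)
search = Bulkhead("search", max_concurrent=2)
```

If every `payments` slot is stuck on a slow call, further payment calls are rejected, but `search` still has its own two slots available.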

3. Timeout & Retry Strategy

  • Set aggressive timeouts (don’t wait forever)
  • Exponential backoff with jitter for retries
  • Limit retry attempts
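
A sketch of capped exponential backoff with "full jitter" (sleep a random amount between zero and the current ceiling) follows. `retry_with_backoff` and its parameter names and defaults are assumptions for illustration, and it only retries `ConnectionError` to keep the example small.

```python
import random
import time
from typing import Callable


def retry_with_backoff(fn: Callable[[], str], max_attempts: int = 4,
                       base_s: float = 0.1, cap_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry fn on ConnectionError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                    # retry budget exhausted
            ceiling = min(cap_s, base_s * 2 ** attempt)  # 0.1, 0.2, 0.4, ... capped
            sleep(random.uniform(0, ceiling))            # jitter spreads retries out
    raise AssertionError("unreachable")
```

The jitter matters because clients that failed at the same moment would otherwise all retry at the same moment too, recreating the overload that caused the failure.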

4. Rate Limiting & Load Shedding

  • Limit requests per client/service
  • Reject excess requests early (503 Service Unavailable)
  • Protect downstream services from overload
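
Per-client limits are often implemented with a token bucket. The version below is a minimal single-process sketch; `TokenBucket`, `rate_per_s`, `capacity` and `handle` are hypothetical names, and a real deployment would typically keep the counters in shared storage.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle(bucket: TokenBucket) -> int:
    # Reject early with 503, as in the notes, before doing any real work.
    return 200 if bucket.allow() else 503
```

A bucket with capacity 2 and a rate of 1 per second lets a client send two requests back to back, then one more per second after that.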

5. Graceful Degradation

  • Identify critical vs non-critical features
  • Disable non-critical features under load
  • Return cached/stale data instead of failing

6. Monitoring & Alerting

  • Track error rates, latency, saturation
  • Alert on unusual patterns
  • Distributed tracing to identify bottlenecks (Jaeger, Zipkin)

7. Chaos Engineering

  • Regularly test failure scenarios
  • Netflix Chaos Monkey kills random servers
  • Ensures system handles failures correctly

Example Scenario:

User Service → Order Service → Payment Service → Email Service
                                      ↓ (slow)
                            Everything backs up

Solution:

  • Circuit breaker on Payment Service
  • Async email (queue)
  • Timeout on Order → Payment
  • Bulkhead: Separate thread pool for each dependency

Advanced Topics & Discussion

Q9: How do you measure and improve maintainability?

Answer:

Measuring Maintainability:

  1. DORA Metrics

    • Deployment frequency (how often)
    • Lead time for changes (commit → production)
    • Mean time to recovery (MTTR)
    • Change failure rate
  2. Code Metrics

    • Cyclomatic complexity
    • Code churn (how often code changes)
    • Test coverage
    • Documentation coverage
  3. Team Metrics

    • Time for new developer to first commit
    • Frequency of production incidents
    • Time spent on maintenance vs new features

Improving Maintainability:

  1. Operability

    • Good observability (logs, metrics, traces)
    • Automated deployments (CI/CD)
    • Runbooks for common issues
    • Self-service tools for developers
  2. Simplicity

    • Remove unnecessary complexity
    • Good abstractions (hide details)
    • Consistent patterns and naming
    • Avoid premature optimization
  3. Evolvability

    • Modular architecture (loosely coupled)
    • Comprehensive tests (enable refactoring)
    • Feature flags (gradual rollout)
    • Good documentation

Red Flags:

  • “Only Bob knows how this works”
  • Fear of making changes
  • Long debugging sessions
  • Frequent production incidents

Q10: When should you optimize for p99 vs p50 latency?

Answer:

Optimize for p50 (Median) when:

  • High-volume, low-value requests
  • Best-effort services
  • Internal tools with forgiving users
  • Limited resources/budget

Optimize for p99 when:

  • The slowest requests come from the users with the most data, who are often the most valuable customers
  • Synchronous user-facing requests
  • SLA requirements specify tail latencies
  • Competitive advantage from performance
  • Financial transactions (every millisecond matters)

Optimize for p99.9 when:

  • Critical path operations
  • Large fanout operations (one slow call blocks many)
  • Health/safety critical systems
  • High-value B2B customers
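
The fanout point is worth a quick calculation. Assuming backend calls are independent (a simplification), a request that waits on all of them is slow whenever at least one of them is slow. `chance_request_is_slow` is a hypothetical helper name.

```python
def chance_request_is_slow(p_slow_call: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** fanout


for fanout in (1, 10, 100):
    # p_slow_call=0.01 means each backend call exceeds its own p99 1% of the time.
    print(fanout, round(chance_request_is_slow(0.01, fanout), 3))
```

With a fanout of 100, roughly 63% of user requests hit at least one backend call slower than that backend's p99, so the backend's tail becomes the user's typical experience.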

Trade-offs:

  • p99 optimization is expensive (often far harder than improving p50)
  • Diminishing returns (p99.9 and beyond may be impossible to optimize cost-effectively)
  • Sometimes better to retry than optimize extreme tail

Example (Amazon):

  • Customer with most items in cart = best customer
  • Their requests hit more data = slower
  • Optimizing p99 = keeping best customers happy
  • Measurable business impact (1% sales per 100ms)

Interview Preparation Tips

System Design Framework Using Chapter 1 Concepts

Step 1: Clarify Requirements

  • Functional requirements (what features)
  • Non-functional requirements:
    • Reliability: Uptime requirement? (99.9% ≈ 8.76h downtime/year)
    • Scalability: Users? Growth rate? Read/write ratio?
    • Maintainability: Team size? Deployment frequency?
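
The downtime figure for an availability target is easy to derive on the spot. The helper below is a sketch; `downtime_hours_per_year` is a hypothetical name and the calculation ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Yearly downtime allowed by an availability target given in percent."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} h/year")
```

Each extra nine cuts the budget by a factor of ten: about 87.6 hours at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.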

Step 2: Back-of-envelope Estimation

  • QPS (queries per second)
  • Storage requirements
  • Bandwidth needs
  • Use round numbers for easier math
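
A worked estimate might look like the following. Every input is a made-up round number chosen for illustration, not a figure from the chapter.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; often rounded to 100,000 for mental math

# Assumed inputs (hypothetical round numbers).
daily_active_users = 1_000_000
requests_per_user_per_day = 20
peak_to_average = 3
bytes_stored_per_request = 1_000

requests_per_day = daily_active_users * requests_per_user_per_day
average_qps = requests_per_day / SECONDS_PER_DAY
peak_qps = average_qps * peak_to_average
storage_gb_per_day = requests_per_day * bytes_stored_per_request / 1e9

print(f"average QPS ~ {average_qps:.0f}")
print(f"peak QPS ~ {peak_qps:.0f}")
print(f"storage ~ {storage_gb_per_day:.0f} GB/day")
```

Under these assumptions the system averages about 230 QPS, peaks near 700 QPS, and stores about 20 GB per day, which is well within reach of a small cluster. The point of the exercise is the order of magnitude, not the exact digits.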

Step 3: High-level Design

  • Start simple (single server)
  • Identify bottlenecks
  • Add components as needed

Step 4: Deep Dive

  • Address reliability (replication, backups)
  • Address scalability (caching, sharding)
  • Address maintainability (monitoring, logging)

Step 5: Discuss Trade-offs

  • Always present alternatives
  • Explain why you chose your approach
  • Mention what you’d do differently at different scales

Key Phrases to Use

For Reliability:

  • “We need to consider single points of failure”
  • “Let’s add replication for redundancy”
  • “We should implement health checks and automatic failover”
  • “What’s our tolerance for data loss?” (RPO)
  • “How quickly do we need to recover?” (RTO)

For Scalability:

  • “Let’s think about this at 10x scale”
  • “This approach works for 1K users, but at 1M users…”
  • “We can scale vertically initially, then horizontally”
  • “The bottleneck here is [database/network/CPU]”
  • “We should measure p95/p99 latency, not averages”

For Maintainability:

  • “This design is simple to understand and debug”
  • “We can monitor this using [metrics]”
  • “The system is loosely coupled, easy to modify”
  • “New developers can understand this quickly”

Common Mistakes to Avoid

  1. Over-engineering: Don’t design for 1M users when you have 100
  2. Ignoring trade-offs: Every decision has pros and cons
  3. Forgetting monitoring: Without observability you cannot detect issues
  4. Assuming 100% uptime: Be realistic about reliability
  5. Optimizing too early: Measure before optimizing
  6. Ignoring costs: Mention cost implications of design choices

Ongoing Questions

  • How do modern serverless architectures change the reliability/scalability/maintainability trade-offs?
  • What’s the role of AI in improving system reliability and maintainability in 2026?
  • How do you balance developer velocity vs system reliability?

Clarifications Needed

  • Deep dive into chaos engineering practices
  • Modern observability vs traditional monitoring differences
  • Platform engineering and how it improves maintainability