Q&A Session Notes
Chapter 1: Reliability, Scalability, Maintainability
Conceptual Understanding Questions
Q1: What’s the difference between a fault and a failure?
Answer:
- Fault: A component deviating from its specification (e.g., hard disk crashes, network packet lost)
- Failure: The system as a whole stops providing service to users
- Key insight: A fault-tolerant system prevents faults from causing failures
- Example: If one server crashes (fault) but load balancer redirects to healthy servers, no user sees downtime (no failure)
Q2: Why are percentiles better than averages for measuring response time?
Answer:
- Averages hide outliers: If 99 requests take 10ms and 1 takes 1000ms, average is ~20ms (misleading)
- Median (p50): 50% of requests are faster than this value
- p95/p99: Only 5%/1% of requests are slower - captures tail latencies
- Why it matters: Slowest requests often from users with most data (your best customers)
- Real impact: Amazon found 100ms increase in response time = 1% drop in sales
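The gap between averages and percentiles is easy to check by hand. The sketch below uses the 99-fast-plus-1-slow example from the answer for the mean, then a second, hypothetical data set (2% slow requests) to show how p50 and p99 differ. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions and an assumption here.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Example from the notes: 99 requests at 10 ms, 1 request at 1000 ms.
notes_example_ms = [10] * 99 + [1000]
mean_notes_ms = statistics.mean(notes_example_ms)   # 19.9, a value no request actually had

# Hypothetical data set: 2% of 1000 requests are slow.
slow_tail_ms = [10] * 980 + [1000] * 20
mean_ms = statistics.mean(slow_tail_ms)    # 29.8
p50_ms = percentile(slow_tail_ms, 50)      # 10   (the typical request)
p99_ms = percentile(slow_tail_ms, 99)      # 1000 (what the unluckiest 1% experience)
```

The mean sits between the two groups and describes neither, while p50 and p99 each describe a real request.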
Q3: What is the Twitter fan-out problem and how was it solved?
Answer:
Problem: When user tweets, how to efficiently show it on all followers’ timelines?
Approach 1 (Pull):
- Write tweet to global collection
- When user loads timeline, look up everyone they follow and join against those users’ tweets
- Issue: Expensive reads (join on every timeline view)
Approach 2 (Push):
- Pre-compute each user’s timeline cache
- When tweet posted, write to all followers’ caches
- Issue: For celebrities with 30M followers, one tweet = 30M writes (write amplification)
Solution (Hybrid):
- Most users: Approach 2 (push to caches)
- Celebrities: Approach 1 (pull at read time)
- Trade-off: Optimize for common case (regular users), handle edge cases differently
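A minimal in-memory sketch of the hybrid approach, assuming a follower-count cutoff decides who counts as a celebrity. The class name, the `CELEBRITY_THRESHOLD` value, and the data layout are all illustrative and not taken from Twitter's actual implementation.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real system would use a far larger number

class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)          # author -> users who follow them
        self.following = defaultdict(set)          # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(seq, text)]
        self.timeline_cache = defaultdict(list)    # user -> [(seq, author, text)]
        self._seq = 0                              # global ordering of tweets

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self._seq += 1
        self.tweets_by_author[author].append((self._seq, text))
        if not self.is_celebrity(author):
            # Approach 2 (push): fan out on write to every follower's cache.
            for user in self.followers[author]:
                self.timeline_cache[user].append((self._seq, author, text))

    def timeline(self, user):
        # Start from the precomputed cache, keyed by seq so nothing appears twice.
        merged = {seq: (author, text) for seq, author, text in self.timeline_cache[user]}
        # Approach 1 (pull): merge in celebrity tweets at read time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for seq, text in self.tweets_by_author[author]:
                    merged[seq] = (author, text)
        return [merged[seq] for seq in sorted(merged, reverse=True)]  # newest first
```

A celebrity's tweet costs one write regardless of follower count, and only readers who follow celebrities pay the extra merge at read time.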
Q4: Why is software fault-tolerance more important in cloud than hardware redundancy?
Answer:
- Traditional approach: RAID, dual power supplies, hot-swappable CPUs (expensive hardware)
- Cloud approach: Use commodity hardware, expect failures, handle in software
- Benefits:
- More flexible (can handle entire datacenter failures, not just single machine)
- Cost-effective (cheaper machines, failures handled in software)
- Can do rolling upgrades without downtime
- Better for multi-region deployments
- Example: Netflix’s Chaos Monkey - randomly kills production servers to ensure resilience
Technical Interview Questions
Q5: Design a system that needs to scale from 100 to 1 million users. What’s your approach?
Answer Framework:
Step 1: Single Server (100-1000 users)
- Web app + database on one machine
- Vertical scaling as needed
Step 2: Separate Database (1K-10K users)
- Web tier separate from data tier
- Enables independent scaling
Step 3: Add Cache & CDN (10K-100K users)
- Redis/Memcached for frequently accessed data
- CDN for static assets
- Reduce database load
Step 4: Horizontal Scaling (100K-1M users)
- Multiple web servers behind load balancer
- Database replication (read replicas)
- Consider database sharding if needed
Step 5: Additional Considerations
- Message queue for async processing
- Monitor with percentiles (p95, p99)
- Auto-scaling based on load
- Multi-region for reliability
Key Points to Mention:
- Start simple, add complexity as needed
- Measure before optimizing
- Different scaling strategies for reads vs writes
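The caching in Step 3 is often implemented with the cache-aside pattern: check the cache, fall back to the database on a miss, then store the result. The sketch below keeps the cache in a local dict so it runs on its own. In practice the dict would be Redis or Memcached, and `load_from_db`, `ttl_seconds`, and `clock` are assumed names for illustration.

```python
import time

class CacheAside:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # function: key -> value (the slow path)
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._entries = {}             # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # cache hit: no database work
        value = self._load(key)        # cache miss or expired entry: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after a write so readers do not see stale data until the TTL expires.
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be, which is one concrete case of the read-versus-write trade-off mentioned in the key points.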
Q6: How would you handle a service that has p99 latency of 2 seconds (SLA requires < 500ms)?
Answer Approach:
1. Investigate & Measure
- Where does latency come from? (DB queries, external APIs, computation)
- Profile the slow requests specifically (not averages)
- Check if tail latencies correlate with specific patterns (user types, data volumes)
2. Common Causes & Solutions
Database Issues:
- Missing indexes → Add appropriate indexes
- Slow queries → Query optimization, consider caching
- Lock contention → Optimize transaction boundaries
- Read replicas overloaded → Add more replicas or cache
External Dependencies:
- Third-party API slow → Add timeout, async processing, circuit breaker
- Downstream service slow → Implement bulkheads, rate limiting
Resource Contention:
- CPU/Memory saturation → Vertical or horizontal scaling
- Thread pool exhausted → Tune thread pool sizes
- Head-of-line blocking → Use async/non-blocking I/O
3. Architectural Solutions:
- Caching: Cache expensive operations (Redis, CDN)
- Async Processing: Move non-critical work to background queues
- Read Replicas: Separate read and write traffic
- Connection Pooling: Reduce connection overhead
- Load Shedding: Reject requests when overwhelmed (return 503)
4. Monitoring Strategy:
- Set up alerts on p95/p99, not just averages
- Track latency by endpoint, user type, region
- Correlate with error rates and throughput
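Load shedding from the architectural solutions above can be sketched as a cap on in-flight requests: once the cap is reached, new requests are rejected immediately with a 503 rather than queuing and inflating tail latency. `LoadShedder` and `max_in_flight` are illustrative names, not part of any framework.

```python
import threading

class LoadShedder:
    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, work):
        """Run work() if capacity allows; otherwise reject right away."""
        with self._lock:
            if self._in_flight >= self._max:
                return 503, "Service Unavailable"
            self._in_flight += 1
        try:
            # The lock is not held while the work runs, only while counting.
            return 200, work()
        finally:
            with self._lock:
                self._in_flight -= 1
```

Picking `max_in_flight` is the hard part. A reasonable starting point, under the assumption that latency degrades with concurrency, is the highest concurrency at which measured p99 still meets the SLA.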
Q7: Your database server crashes. How do you ensure zero data loss?
Answer:
Prevention (Before Crash):
Replication:
- Synchronous replication to at least one replica
- Ensures data written to multiple servers before ack
- Trade-off: Increased write latency
Write-Ahead Log (WAL):
- All changes written to log before data files
- Can replay log after crash
- Most databases have this by default
Backups:
- Regular automated backups
- Point-in-time recovery (PITR)
- Store in different location/region
Recovery (After Crash):
Automatic Failover:
- Detect failure (heartbeat timeout)
- Promote replica to primary
- Update DNS/load balancer
- Trade-off: Risk of split-brain (two nodes both believing they are the primary)
Data Validation:
- Check data integrity after recovery
- Verify replication lag was zero at failover (any lag means acknowledged writes may be missing on the new primary)
- Run consistency checks
Key Considerations:
- RPO (Recovery Point Objective): How much data can you afford to lose? (Dictates sync vs async replication)
- RTO (Recovery Time Objective): How quickly must you recover? (Dictates failover automation)
- Consistency: Strong consistency requires synchronous replication (slower writes)
- Monitoring: Detect failures quickly (heartbeats, health checks)
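The write-ahead log idea from the prevention list can be sketched with a toy key-value store: each write is appended to a log file and fsynced before the in-memory state changes, and the log is replayed on startup. `TinyKV` and its JSON-lines log format are assumptions for illustration. Real databases use binary records with checksums and periodic checkpoints.

```python
import json
import os

class TinyKV:
    """Toy key-value store: every write is logged durably before it is applied."""

    def __init__(self, log_path):
        self._log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "ab")

    def _replay(self):
        if not os.path.exists(self._log_path):
            return
        valid_bytes = 0
        with open(self._log_path, "rb") as log:
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # torn final record: the process died mid-write
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.data[record["key"]] = record["value"]
                valid_bytes += len(raw)
        # Drop anything after the last complete record so new appends stay parseable.
        with open(self._log_path, "r+b") as log:
            log.truncate(valid_bytes)

    def put(self, key, value):
        line = json.dumps({"key": key, "value": value}) + "\n"
        self._log.write(line.encode("utf-8"))
        self._log.flush()
        os.fsync(self._log.fileno())  # on disk before we acknowledge the write
        self.data[key] = value        # only now apply the change

    def close(self):
        self._log.close()
```

This covers a process or machine crash where the disk survives. Losing the disk itself is what replication and off-site backups are for.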
Q8: You’re seeing cascading failures across your microservices. How do you prevent this?
Answer:
1. Circuit Breaker Pattern
- Detect when service is failing
- Stop sending requests (fail fast)
- Periodically retry to check if service recovered
- Tools: Resilience4j; Netflix Hystrix (now in maintenance mode)
2. Bulkhead Pattern
- Isolate resources (thread pools, connection pools)
- Failure in one area doesn’t affect others
- Like watertight compartments in a ship
3. Timeout & Retry Strategy
- Set aggressive timeouts (don’t wait forever)
- Exponential backoff with jitter for retries
- Limit retry attempts
4. Rate Limiting & Load Shedding
- Limit requests per client/service
- Reject excess requests early (503 Service Unavailable)
- Protect downstream services from overload
5. Graceful Degradation
- Identify critical vs non-critical features
- Disable non-critical features under load
- Return cached/stale data instead of failing
6. Monitoring & Alerting
- Track error rates, latency, saturation
- Alert on unusual patterns
- Distributed tracing to identify bottlenecks (Jaeger, Zipkin)
7. Chaos Engineering
- Regularly test failure scenarios
- Netflix Chaos Monkey kills random servers
- Ensures system handles failures correctly
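The retry strategy in item 3 (exponential backoff with jitter, limited attempts) can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and the current exponential ceiling. The function names, defaults, and the choice of which exceptions count as retryable are assumptions.

```python
import random
import time

def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)

def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying transient errors up to max_attempts total tries."""
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error
            sleep(delay)
```

The jitter matters during an outage: without it, every client retries at the same instants and the recovering service is hit by synchronized waves of traffic.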
Example Scenario:
User Service → Order Service → Payment Service → Email Service
↓ (slow)
Everything backs up
Solution:
- Circuit breaker on Payment Service
- Async email (queue)
- Timeout on Order → Payment
- Bulkhead: Separate thread pool for each dependency
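A circuit breaker such as the one proposed for the Payment Service can be sketched as a small state machine: closed (calls pass through), open (calls fail fast), and half-open (one trial call after a cool-down). This is a single-threaded illustration with assumed names and defaults, not the API of Hystrix or Resilience4j.

```python
import time

class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0          # consecutive failures while closed
        self._opened_at = None      # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In the example scenario, Order Service would wrap its Payment Service call in `breaker.call(...)` and return a degraded response when `CircuitOpenError` is raised, so its own threads are not tied up waiting.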
Advanced Topics & Discussion
Q9: How do you measure and improve maintainability?
Answer:
Measuring Maintainability:
DORA Metrics:
- Deployment frequency (how often)
- Lead time for changes (commit → production)
- Mean time to recovery (MTTR)
- Change failure rate
Code Metrics:
- Cyclomatic complexity
- Code churn (how often code changes)
- Test coverage
- Documentation coverage
Team Metrics:
- Time for new developer to first commit
- Frequency of production incidents
- Time spent on maintenance vs new features
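The four DORA metrics can be computed from a plain list of deployment records. The sketch below assumes a hypothetical `Deployment` record and measures lead time from commit to deploy. The field names and the use of the upper median are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did this change cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed

def dora_summary(deployments, window_days):
    n = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = sorted(d.deployed_at - d.committed_at for d in deployments)
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": n / window_days,
        "median_lead_time": lead_times[n // 2] if n else None,  # upper median
        "change_failure_rate": len(failures) / n if n else None,
        "mean_time_to_restore": (sum(restores, timedelta()) / len(restores)
                                 if restores else None),
    }
```

Tracking these over time is more useful than any single reading, since the trend shows whether changes to the system are making it easier or harder to work on.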
Improving Maintainability:
Operability:
- Good observability (logs, metrics, traces)
- Automated deployments (CI/CD)
- Runbooks for common issues
- Self-service tools for developers
Simplicity:
- Remove unnecessary complexity
- Good abstractions (hide details)
- Consistent patterns and naming
- Avoid premature optimization
Evolvability:
- Modular architecture (loosely coupled)
- Comprehensive tests (enable refactoring)
- Feature flags (gradual rollout)
- Good documentation
Red Flags:
- “Only Bob knows how this works”
- Fear of making changes
- Long debugging sessions
- Frequent production incidents
Q10: When should you optimize for p99 vs p50 latency?
Answer:
Optimize for p50 (Median) when:
- High-volume, low-value requests
- Best-effort services
- Internal tools with forgiving users
- Limited resources/budget
Optimize for p99 when:
- Users with most data = most valuable customers
- Synchronous user-facing requests
- SLA requirements specify tail latencies
- Competitive advantage from performance
- Financial transactions (every millisecond matters)
Optimize for p99.9 when:
- Critical path operations
- Large fanout operations (one slow backend call delays the whole request)
- Health/safety critical systems
- High-value B2B customers
Trade-offs:
- p99 optimization is expensive (often far more effort than improving p50)
- Diminishing returns (p99.9 may be impossible to optimize cost-effectively)
- Sometimes better to retry than optimize extreme tail
Example (Amazon):
- Customer with most items in cart = best customer
- Their requests hit more data = slower
- Optimizing p99 = keeping best customers happy
- Measurable business impact (1% sales per 100ms)
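The fanout point above can be made concrete with a little arithmetic. If each backend call independently has a 1% chance of being slow (slower than its p99), the chance that a request touching `fanout` backends hits at least one slow call grows quickly. Independence is an assumption here, and real backends are often correlated.

```python
def prob_any_slow(fanout, per_call_slow_prob=0.01):
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1 - (1 - per_call_slow_prob) ** fanout

for fanout in (1, 10, 100):
    print(f"fanout={fanout:>3}: {prob_any_slow(fanout):.1%} of requests see a slow call")
```

With a fanout of 100, roughly 63% of user requests wait on at least one backend call from the slowest 1%, which is why high-fanout paths push the target toward p99.9.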
Interview Preparation Tips
System Design Framework Using Chapter 1 Concepts
Step 1: Clarify Requirements
- Functional requirements (what features)
- Non-functional requirements:
- Reliability: Uptime requirement? (99.9% ≈ 8.76h downtime/year)
- Scalability: Users? Growth rate? Read/write ratio?
- Maintainability: Team size? Deployment frequency?
Step 2: Back-of-envelope Estimation
- QPS (queries per second)
- Storage requirements
- Bandwidth needs
- Use round numbers for easier math
Step 3: High-level Design
- Start simple (single server)
- Identify bottlenecks
- Add components as needed
Step 4: Deep Dive
- Address reliability (replication, backups)
- Address scalability (caching, sharding)
- Address maintainability (monitoring, logging)
Step 5: Discuss Trade-offs
- Always present alternatives
- Explain why you chose your approach
- Mention what you’d do differently at different scales
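The estimation steps in this framework reduce to a few lines of arithmetic. The sketch below turns an availability target into a yearly downtime budget and a user count into average QPS. The traffic figures (1M daily users, 20 requests each) are made-up round numbers for illustration.

```python
HOURS_PER_YEAR = 365 * 24
SECONDS_PER_DAY = 24 * 60 * 60

def downtime_hours_per_year(availability_percent):
    """Yearly downtime budget implied by an availability target."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR

def average_qps(daily_active_users, requests_per_user_per_day):
    """Average queries per second, assuming traffic is spread evenly over the day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY

for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {downtime_hours_per_year(target):.2f} h downtime/year")

print(f"1M users x 20 requests/day -> {average_qps(1_000_000, 20):.0f} QPS on average")
```

Peak traffic is usually a multiple of the average, so in an interview it is worth stating the multiplier you assume rather than sizing for the average alone.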
Key Phrases to Use
For Reliability:
- “We need to consider single points of failure”
- “Let’s add replication for redundancy”
- “We should implement health checks and automatic failover”
- “What’s our tolerance for data loss?” (RPO)
- “How quickly do we need to recover?” (RTO)
For Scalability:
- “Let’s think about this at 10x scale”
- “This approach works for 1K users, but at 1M users…”
- “We can scale vertically initially, then horizontally”
- “The bottleneck here is [database/network/CPU]”
- “We should measure p95/p99 latency, not averages”
For Maintainability:
- “This design is simple to understand and debug”
- “We can monitor this using [metrics]”
- “The system is loosely coupled, easy to modify”
- “New developers can understand this quickly”
Common Mistakes to Avoid
- Over-engineering: Don’t design for 1M users when you have 100
- Ignoring trade-offs: Every decision has pros and cons
- Forgetting monitoring: “We need observability to detect issues”
- Assuming 100% uptime: Be realistic about reliability
- Optimizing too early: Measure before optimizing
- Ignoring costs: Mention cost implications of design choices
Ongoing Questions
- How do modern serverless architectures change the reliability/scalability/maintainability trade-offs?
- What’s the role of AI in improving system reliability and maintainability in 2026?
- How do you balance developer velocity vs system reliability?
Clarifications Needed
- Deep dive into chaos engineering practices
- Modern observability vs traditional monitoring differences
- Platform engineering and how it improves maintainability