Q&A Session Notes

Chapter 1: Reliability, Scalability, Maintainability

Conceptual Understanding Questions

Q1: What’s the difference between a fault and a failure?

Answer:

Fault: A component deviating from its specification (e.g., hard disk crashes, network packet lost)
Failure: The system as a whole stops providing service to users
Key insight: A fault-tolerant system prevents faults from causing failures
Example: If one server crashes (fault) but load balancer redirects to healthy servers, no user sees downtime (no failure)
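
A minimal sketch of the load balancer example above. The names `ServerDown` and `handle_with_failover` are hypothetical, and servers are modelled as plain callables; a real load balancer would also use health checks rather than discovering faults per request.

```python
class ServerDown(Exception):
    """Raised by a server that cannot handle the request (a fault)."""


def handle_with_failover(servers, request):
    """Try each server in turn; only fail if every one of them is down."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ServerDown as exc:
            # A fault: one component deviated from its spec. Keep going.
            last_error = exc
    # A failure: the system as a whole could not serve the user.
    raise RuntimeError("all servers down") from last_error


def crashed(request):
    raise ServerDown("disk crashed")


def healthy(request):
    return f"200 OK: {request}"


response = handle_with_failover([crashed, healthy], "GET /")
```

Here the first server faults, but the caller still gets a response, so no failure is visible to the user.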

Q2: Why are percentiles better than averages for measuring response time?

Answer:

Averages hide outliers: If 99 requests take 10ms and 1 takes 1000ms, the average is ~20ms, which describes neither the fast requests nor the slow one
Median (p50): 50% of requests are faster than this value
p95/p99: Only 5%/1% of requests are slower - captures tail latencies
Why it matters: The slowest requests often come from the users with the most data (your best customers)
Real impact: Amazon observed that a 100ms increase in response time reduced sales by about 1%
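
To make the average-versus-percentile point concrete, here is a small sketch. The sample is invented for illustration (1000 requests, 2% of them slow), and `percentile` is a hypothetical helper using the nearest-rank method; monitoring systems often use other interpolation rules.

```python
import statistics


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # ceil(p / 100 * n), computed with integers to avoid float rounding
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical sample: 980 fast requests (10 ms) and 20 slow ones (1000 ms).
latencies_ms = [10] * 980 + [1000] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

With this sample the mean is about 30 ms, a value no request actually experienced, while p99 exposes the 1000 ms tail that p50 and p95 miss.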

Q3: What is the Twitter fan-out problem and how was it solved?

Answer:
Problem: When a user tweets, how do you efficiently show the tweet on all of their followers’ timelines?

Approach 1 (Pull):

Write tweet to global collection
When a user loads their timeline, look up the people they follow and join against those users’ tweets
Issue: Expensive reads (join on every timeline view)

Approach 2 (Push):

Pre-compute each user’s timeline cache
When tweet posted, write to all followers’ caches
Issue: For celebrities with 30M followers, one tweet = 30M writes (write amplification)

Solution (Hybrid):

Most users: Approach 2 (push to caches)
Celebrities: Approach 1 (pull at read time)
Trade-off: Optimize for common case (regular users), handle edge cases differently
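
The hybrid approach can be sketched with in-memory dictionaries. This is an illustration only, not Twitter's implementation: `CELEBRITY_THRESHOLD`, the data structures, and all function names are assumptions, and the threshold is tiny so the example stays readable.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real one would be far larger

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
tweets_by_author = defaultdict(list) # author -> [(seq, author, text)]
timeline_cache = defaultdict(list)   # user -> [(seq, author, text)]
_next_seq = 0


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post_tweet(author, text):
    """Always store the tweet; push to follower caches only for regular users."""
    global _next_seq
    _next_seq += 1
    tweet = (_next_seq, author, text)
    tweets_by_author[author].append(tweet)
    if not is_celebrity(author):
        for user in followers[author]:  # Approach 2: fan out on write
            timeline_cache[user].append(tweet)


def load_timeline(user):
    """Merge the pre-computed cache with celebrity tweets pulled at read time."""
    merged = {seq: (author, text) for seq, author, text in timeline_cache[user]}
    for author in following[user]:
        if is_celebrity(author):  # Approach 1: pull on read
            for seq, name, text in tweets_by_author[author]:
                merged[seq] = (name, text)  # keyed by seq, so no duplicates
    return [merged[seq] for seq in sorted(merged, reverse=True)]


follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    follow(fan, "celeb")
post_tweet("bob", "hi")        # pushed into alice's cache
post_tweet("celeb", "hello")   # stored once, not pushed
alice_timeline = load_timeline("alice")
```

The celebrity's tweet costs one write instead of one per follower, and alice pays a small extra read cost only for the celebrities she follows.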

Q4: Why is software fault-tolerance more important in cloud than hardware redundancy?

Answer:

Traditional approach: RAID, dual power supplies, hot-swappable CPUs (expensive hardware)
Cloud approach: Use commodity hardware, expect failures, handle in software
Benefits:
- More flexible (can handle entire datacenter failures, not just single machine)
- Cost-effective (cheaper machines, failures handled in software)
- Can do rolling upgrades without downtime
- Better for multi-region deployments
Example: Netflix’s Chaos Monkey - randomly kills production servers to ensure resilience

Technical Interview Questions

Q5: Design a system that needs to scale from 100 to 1 million users. What’s your approach?

Answer Framework:

Step 1: Single Server (100-1000 users)

Web app + database on one machine
Vertical scaling as needed

Step 2: Separate Database (1K-10K users)

Web tier separate from data tier
Enables independent scaling

Step 3: Add Cache & CDN (10K-100K users)

Redis/Memcached for frequently accessed data
CDN for static assets
Reduce database load

Step 4: Horizontal Scaling (100K-1M users)

Multiple web servers behind load balancer
Database replication (read replicas)
Consider database sharding if needed

Step 5: Additional Considerations

Message queue for async processing
Monitor with percentiles (p95, p99)
Auto-scaling based on load
Multi-region for reliability

Key Points to Mention:

Start simple, add complexity as needed
Measure before optimizing
Different scaling strategies for reads vs writes
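
For Step 3, a cache-aside read path might look like the sketch below. It uses an in-process dictionary as a stand-in for Redis/Memcached; the class name `CacheAside`, the `load_from_db` callback, and the TTL value are all assumptions made for illustration.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, load from the database and remember it."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._load(key)  # the expensive call we want to avoid repeating
        self._store[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so readers do not see stale data until the TTL."""
        self._store.pop(key, None)


db_calls = []


def slow_db_lookup(user_id):
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = CacheAside(slow_db_lookup, ttl_seconds=30.0)
first = cache.get(42)
second = cache.get(42)  # served from the cache, no database call
```

Measuring the hit rate (here `hits` and `misses`) before and after adding the cache fits the "measure before optimizing" point above.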

Q6: How would you handle a service that has p99 latency of 2 seconds (SLA requires < 500ms)?

Answer Approach:

1. Investigate & Measure

Where does latency come from? (DB queries, external APIs, computation)
Profile the slow requests specifically (not averages)
Check if tail latencies correlate with specific patterns (user types, data volumes)

2. Common Causes & Solutions

Database Issues:

Missing indexes → Add appropriate indexes
Slow queries → Query optimization, consider caching
Lock contention → Optimize transaction boundaries
Read replicas overloaded → Add more replicas or cache

External Dependencies:

Third-party API slow → Add timeout, async processing, circuit breaker
Downstream service slow → Implement bulkheads, rate limiting

Resource Contention:

CPU/Memory saturation → Vertical or horizontal scaling
Thread pool exhausted → Tune thread pool sizes
Head-of-line blocking → Use async/non-blocking I/O

3. Architectural Solutions:

Caching: Cache expensive operations (Redis, CDN)
Async Processing: Move non-critical work to background queues
Read Replicas: Separate read and write traffic
Connection Pooling: Reduce connection overhead
Load Shedding: Reject requests when overwhelmed (return 503)

4. Monitoring Strategy:

Set up alerts on p95/p99, not just averages
Track latency by endpoint, user type, region
Correlate with error rates and throughput

Q7: Your database server crashes. How do you ensure zero data loss?

Answer:

Prevention (Before Crash):

Replication
- Synchronous replication to at least one replica
- Ensures data written to multiple servers before ack
- Trade-off: Increased write latency
Write-Ahead Log (WAL)
- All changes written to log before data files
- Can replay log after crash
- Most databases have this by default
Backups
- Regular automated backups
- Point-in-time recovery (PITR)
- Store in different location/region

Recovery (After Crash):

Automatic Failover
- Detect failure (heartbeat timeout)
- Promote replica to primary
- Update DNS/load balancer
- Trade-off: Risk of split-brain
Data Validation
- Check data integrity after recovery
- Verify replication lag was zero
- Run consistency checks

Key Considerations:

RPO (Recovery Point Objective): How much data can you afford to lose? (Dictates sync vs async replication)
RTO (Recovery Time Objective): How quickly must you recover? (Dictates failover automation)
Consistency: Strong consistency requires synchronous replication (slower writes)
Monitoring: Detect failures quickly (heartbeats, health checks)
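
The write-ahead log idea above can be shown with a toy key-value store. This is a sketch under simplifying assumptions (one JSON record per line, a single writer, no checkpoints or log compaction); `TinyKV` is a hypothetical name and real database WALs are considerably more involved.

```python
import json
import os
import tempfile


class TinyKV:
    """Toy store: every write is appended to a log before it is applied."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "ab")

    def _replay(self):
        """Rebuild in-memory state from the log, dropping a torn final record."""
        if not os.path.exists(self.log_path):
            return
        good_end = 0
        with open(self.log_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # incomplete record: the crash hit mid-write
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop at the last good one
                self.data[record["key"]] = record["value"]
                good_end += len(raw)
        with open(self.log_path, "r+b") as f:
            f.truncate(good_end)  # discard anything after the last good record

    def put(self, key, value):
        line = json.dumps({"key": key, "value": value}).encode("utf-8") + b"\n"
        self._log.write(line)
        self._log.flush()
        os.fsync(self._log.fileno())  # force to disk before acknowledging
        self.data[key] = value

    def close(self):
        self._log.close()


# Usage: write some records, simulate a crash mid-write, then recover.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    db = TinyKV(path)
    db.put("a", 1)
    db.put("a", 2)
    db.put("b", 3)
    db.close()
    with open(path, "ab") as f:
        f.write(b'{"key": "c", "va')  # torn record left behind by the "crash"
    recovered = TinyKV(path)
    recovered_data = dict(recovered.data)
    recovered.close()
```

Every acknowledged write survives the restart, while the half-written record for `"c"` was never acknowledged and is discarded. Note this only protects against a process or machine restart; surviving the loss of the disk itself still needs replication or backups.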

Q8: You’re seeing cascading failures across your microservices. How do you prevent this?

Answer:

1. Circuit Breaker Pattern

Detect when service is failing
Stop sending requests (fail fast)
Periodically retry to check if service recovered
Tools: Resilience4j, Netflix Hystrix (Hystrix is now in maintenance mode)

2. Bulkhead Pattern

Isolate resources (thread pools, connection pools)
Failure in one area doesn’t affect others
Like watertight compartments in a ship

3. Timeout & Retry Strategy

Set aggressive timeouts (don’t wait forever)
Exponential backoff with jitter for retries
Limit retry attempts
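
A minimal sketch of bounded retries with exponential backoff and jitter. The function name, default values, and the choice of "full jitter" (sleeping a random amount between zero and the backoff cap) are assumptions; `sleep` and `rng` are injectable only to make the behaviour easy to test.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on failure wait a jittered, growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            # Cap doubles each attempt (0.1, 0.2, 0.4, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(rng() * cap)


attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)
```

In practice you would catch only errors known to be transient and retry only idempotent operations, since a retried write can otherwise be applied twice.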

4. Rate Limiting & Load Shedding

Limit requests per client/service
Reject excess requests early (503 Service Unavailable)
Protect downstream services from overload

5. Graceful Degradation

Identify critical vs non-critical features
Disable non-critical features under load
Return cached/stale data instead of failing

6. Monitoring & Alerting

Track error rates, latency, saturation
Alert on unusual patterns
Distributed tracing to identify bottlenecks (Jaeger, Zipkin)

7. Chaos Engineering

Regularly test failure scenarios
Netflix Chaos Monkey kills random servers
Ensures system handles failures correctly

Example Scenario:

User Service → Order Service → Payment Service → Email Service
                                      ↓ (slow)
                            Everything backs up

Solution:

Circuit breaker on Payment Service
Async email (queue)
Timeout on Order → Payment
Bulkhead: Separate thread pool for each dependency
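
The circuit breaker on the Payment Service could be sketched roughly as follows. This is a simplified, single-threaded illustration with hypothetical names and thresholds; it is not the API of Hystrix or Resilience4j.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is known to be failing."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call to probe for recovery
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.state == "half-open"
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def charge_card(amount):
    raise TimeoutError("payment service too slow")  # stand-in for a slow call


payment_breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(2):
    try:
        payment_breaker.call(charge_card, 100)
    except TimeoutError:
        pass
state_after_failures = payment_breaker.state
```

Once open, the Order Service gets an immediate `CircuitOpenError` instead of tying up a thread waiting on Payment, which is what stops the backlog from spreading upstream.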

Advanced Topics & Discussion

Q9: How do you measure and improve maintainability?

Answer:

Measuring Maintainability:

DORA Metrics
- Deployment frequency (how often)
- Lead time for changes (commit → production)
- Mean time to recovery (MTTR)
- Change failure rate
Code Metrics
- Cyclomatic complexity
- Code churn (how often code changes)
- Test coverage
- Documentation coverage
Team Metrics
- Time for new developer to first commit
- Frequency of production incidents
- Time spent on maintenance vs new features

Improving Maintainability:

Operability
- Good observability (logs, metrics, traces)
- Automated deployments (CI/CD)
- Runbooks for common issues
- Self-service tools for developers
Simplicity
- Remove unnecessary complexity
- Good abstractions (hide details)
- Consistent patterns and naming
- Avoid premature optimization
Evolvability
- Modular architecture (loosely coupled)
- Comprehensive tests (enable refactoring)
- Feature flags (gradual rollout)
- Good documentation

Red Flags:

“Only Bob knows how this works”
Fear of making changes
Long debugging sessions
Frequent production incidents

Q10: When should you optimize for p99 vs p50 latency?

Answer:

Optimize for p50 (Median) when:

High-volume, low-value requests
Best-effort services
Internal tools with forgiving users
Limited resources/budget

Optimize for p99 when:

Users with most data = most valuable customers
Synchronous user-facing requests
SLA requirements specify tail latencies
Competitive advantage from performance
Financial transactions (every millisecond matters)

Optimize for p99.9 when:

Critical path operations
Large fanout operations (one slow call blocks many)
Health/safety critical systems
High-value B2B customers
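
Why fan-out pushes you toward p99.9: if one user request needs many backend calls, the chance that at least one of them lands in the slow tail grows quickly. The sketch below assumes the backend calls are independent, which real systems only approximate; `prob_any_slow` is a hypothetical helper.

```python
def prob_any_slow(fanout, percentile=0.99):
    """Probability that at least one of `fanout` calls exceeds the given percentile."""
    return 1 - percentile ** fanout


for n in (1, 10, 100):
    print(f"fanout={n:>3}: "
          f"P(slower than p99)={prob_any_slow(n):.1%}, "
          f"P(slower than p99.9)={prob_any_slow(n, 0.999):.1%}")
```

With 100 parallel calls, roughly 63% of user requests wait on at least one call slower than the backend's p99, so the backend's p99 effectively becomes the user's typical experience.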

Trade-offs:

p99 optimization is expensive (often far harder than improving p50)
Diminishing returns (p99.9 may be impossible to optimize cost-effectively)
Sometimes better to retry than optimize extreme tail

Example (Amazon):

Customer with most items in cart = best customer
Their requests hit more data = slower
Optimizing p99 = keeping best customers happy
Measurable business impact (about 1% of sales per 100ms of added latency)

Interview Preparation Tips

System Design Framework Using Chapter 1 Concepts

Step 1: Clarify Requirements

Functional requirements (what features)
Non-functional requirements:
- Reliability: Uptime requirement? (99.9% ≈ 8.76h downtime/year)
- Scalability: Users? Growth rate? Read/write ratio?
- Maintainability: Team size? Deployment frequency?
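
The uptime arithmetic is worth being able to do quickly. A small sketch, assuming a 365-day year (the helper name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Allowed downtime per year for a given availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.0f} min)")
```

Each extra nine cuts the downtime budget by a factor of ten: about 87.6 hours at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.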

Step 2: Back-of-envelope Estimation

QPS (queries per second)
Storage requirements
Bandwidth needs
Use round numbers for easier math
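
A worked back-of-envelope estimate, with every input invented for illustration (1M daily active users, 20 requests and 2 writes per user per day, 1 KB per write, peak traffic at twice the average):

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_active_users, requests_per_user, writes_per_user,
             bytes_per_write, peak_factor=2.0):
    """Rough QPS and storage numbers from a handful of assumed inputs."""
    avg_qps = daily_active_users * requests_per_user / SECONDS_PER_DAY
    storage_per_day = daily_active_users * writes_per_user * bytes_per_write
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_gb_per_day": storage_per_day / 1e9,
        "storage_gb_per_year": storage_per_day * 365 / 1e9,
    }


numbers = estimate(
    daily_active_users=1_000_000,
    requests_per_user=20,
    writes_per_user=2,
    bytes_per_write=1_000,
)
for name, value in numbers.items():
    print(f"{name}: {value:,.0f}")
```

In an interview you would round further: 86,400 seconds is close to 100,000, so 20M requests per day is roughly 200 QPS on average and a few hundred at peak.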

Step 3: High-level Design

Start simple (single server)
Identify bottlenecks
Add components as needed

Step 4: Deep Dive

Address reliability (replication, backups)
Address scalability (caching, sharding)
Address maintainability (monitoring, logging)

Step 5: Discuss Trade-offs

Always present alternatives
Explain why you chose your approach
Mention what you’d do differently at different scales

Key Phrases to Use

For Reliability:

“We need to consider single points of failure”
“Let’s add replication for redundancy”
“We should implement health checks and automatic failover”
“What’s our tolerance for data loss?” (RPO)
“How quickly do we need to recover?” (RTO)

For Scalability:

“Let’s think about this at 10x scale”
“This approach works for 1K users, but at 1M users…”
“We can scale vertically initially, then horizontally”
“The bottleneck here is [database/network/CPU]”
“We should measure p95/p99 latency, not averages”

For Maintainability:

“This design is simple to understand and debug”
“We can monitor this using [metrics]”
“The system is loosely coupled, easy to modify”
“New developers can understand this quickly”

Common Mistakes to Avoid

Over-engineering: Don’t design for 1M users when you have 100
Ignoring trade-offs: Every decision has pros and cons
Forgetting monitoring: Without observability you cannot detect issues
Assuming 100% uptime: Be realistic about reliability
Optimizing too early: Measure before optimizing
Ignoring costs: Mention cost implications of design choices

Ongoing Questions

How do modern serverless architectures change the reliability/scalability/maintainability trade-offs?
What’s the role of AI in improving system reliability and maintainability in 2026?
How do you balance developer velocity vs system reliability?

Clarifications Needed

Deep dive into chaos engineering practices
Modern observability vs traditional monitoring differences
Platform engineering and how it improves maintainability

Study Notes by Niladri & AI
