Chapter 1: Reliable, Scalable, and Maintainable Applications
Overview
This chapter introduces the three fundamental concerns in software systems design: Reliability, Scalability, and Maintainability. These properties form the foundation for building successful data-intensive applications.
Data-Intensive vs Compute-Intensive: Modern applications are often data-intensive (bottlenecked by data volume, complexity, speed of change) rather than compute-intensive (bottlenecked by CPU).
Common Building Blocks:
- Databases (store and retrieve data)
- Caches (remember expensive operations)
- Search indexes (search/filter by keywords)
- Stream processing (asynchronous message handling)
- Batch processing (periodic large-scale data crunching)
Key Concepts
Reliability
Definition: System continues to work correctly (performing the correct function at the desired performance) even when things go wrong.
Key Terms:
- Faults: Things that can go wrong (component deviating from spec)
- Failures: System as a whole stops providing the required service to the user
- Fault-tolerant/Resilient: System that anticipates and copes with faults
Types of Faults:
- Hardware Faults
  - Hard disk crashes (MTTF ~10-50 years, so with 10,000 disks, expect about 1 disk to die per day on average)
  - RAM errors
  - Power outages
  - Network issues
  - Traditional approach: Redundancy (RAID, dual power supplies, hot-swappable CPUs)
  - Modern approach: Software fault-tolerance (cloud platforms prioritize flexibility and elasticity over single-machine reliability)
- Software Errors
  - Systematic errors (correlated across nodes)
  - Bug triggered by a particular bad input that crashes every instance
  - Runaway process consuming resources
  - Service dependency slowdown/unresponsiveness
  - Cascading failures (a small fault in one component triggers faults in others)
  - Harder to anticipate than hardware faults
  - Mitigation: Testing, process isolation, monitoring, crash recovery
- Human Errors
  - Humans are unreliable (one study of large internet services found operator configuration errors to be the leading cause of outages)
  - Strategies:
    - Design systems that minimize opportunities for error
    - Decouple places where mistakes are made from places where they cause failures (e.g. sandbox environments)
    - Test thoroughly (unit, integration, manual tests)
    - Allow quick recovery (fast rollback, gradual rollout)
    - Detailed monitoring (telemetry)
    - Good management practices and training
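A gradual rollout (one of the quick-recovery strategies above) can be sketched as a percentage gate. This is only an illustration under assumptions: the function name `in_rollout`, the feature name `new-timeline`, and the 10,000-bucket hashing scheme are hypothetical, not something the chapter prescribes.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user is inside the first `percent` of buckets.

    Hashing (feature, user_id) gives each user a stable bucket, so raising
    `percent` only ever adds users and lowering it acts as a rollback.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100

def exposed_fraction(users, feature, percent):
    """Fraction of `users` that currently see the feature."""
    return sum(in_rollout(u, feature, percent) for u in users) / len(users)

# Hypothetical user population, used only to show the gate widening.
users = [f"user-{i}" for i in range(20_000)]
for percent in (1, 10, 50, 100):
    print(percent, round(exposed_fraction(users, "new-timeline", percent), 3))
```

Because the bucket depends only on the feature and user, a mistake found at 1% affects a small, stable group, and setting `percent` back to 0 removes the change for everyone.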
Importance: Even “non-critical” apps need reliability. Lost data = lost trust/revenue. Critical apps (healthcare, nuclear) = safety issues.
Scalability
Definition: System’s ability to cope with increased load. Not a binary yes/no, but about strategies for handling growth.
Load Parameters (describing load):
- Requests per second to web server
- Ratio of reads to writes in database
- Number of simultaneously active users
- Hit rate on cache
- Example: Twitter (Nov 2012) - posting tweets averaged 4.6K requests/sec (over 12K requests/sec at peak), while home timeline reads ran at 300K requests/sec
Twitter Case Study (Fan-out problem):
- Approach 1: Write each tweet to a global collection; build a timeline at read time by looking up everyone the user follows and merging their tweets (high read load)
- Approach 2: Maintain timeline cache for each user, write to all followers’ caches (high write load)
- Hybrid: Approach 2 for most users, Approach 1 for celebrities (millions of followers)
- Key Insight: Load parameters differ per user, requiring hybrid strategies
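The hybrid strategy can be sketched in memory as follows. This is a toy model under assumptions, not Twitter's implementation: the class `Timelines`, the `CELEBRITY_THRESHOLD` cutoff (set very low so the demo triggers it), and the tuple layout of a tweet are all hypothetical.

```python
from collections import defaultdict, deque
from itertools import count

CELEBRITY_THRESHOLD = 3   # hypothetical cutoff, tiny so the demo exercises it
TIMELINE_LENGTH = 100     # hypothetical cap on cached and returned tweets

class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list) # the "global collection"
        self.cache = defaultdict(lambda: deque(maxlen=TIMELINE_LENGTH))
        self._clock = count()                     # monotonically increasing tweet id

    def follow(self, follower, author):
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        tweet = (next(self._clock), author, text)
        self.tweets_by_author[author].append(tweet)
        # Approach 2: fan out on write, but skip it for celebrities.
        if not self.is_celebrity(author):
            for follower in self.followers[author]:
                self.cache[follower].append(tweet)
        return tweet

    def home_timeline(self, user):
        # Cached tweets plus Approach 1 (merge on read) for followed celebrities.
        # A set removes duplicates if an author became a celebrity after fan-out.
        merged = set(self.cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                merged.update(self.tweets_by_author[author])
        return sorted(merged, reverse=True)[:TIMELINE_LENGTH]

timelines = Timelines()
for fan in ("alice", "bob", "carol"):
    timelines.follow(fan, "star")
timelines.follow("alice", "dave")
timelines.post("dave", "lunch")       # fanned out to alice's cache
timelines.post("star", "new album")   # stored once, merged at read time
print(timelines.home_timeline("alice"))
```

The sketch ignores backfilling a cache when a user follows someone new; it only shows where the write cost and the read cost land for each kind of author.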
Performance Metrics:
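- Throughput: Records processed per second, or total time to run a job on a dataset of a given size (batch systems)
- Response time: Time between a client sending a request and receiving the response (online systems)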
- Use percentiles not averages (median p50, p95, p99, p999)
- Tail latencies matter (p99 and above) - the slowest requests often belong to the customers with the most data
- Head-of-line blocking: Slow requests hold up subsequent requests
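A small worked example of why percentiles are preferred over the average. The sample data is invented for illustration, and `percentile` uses the nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

# Hypothetical response times in ms: 97 fast requests and 3 slow ones.
response_times_ms = sorted([20] * 50 + [40] * 47 + [900, 1500, 3000])

mean = statistics.mean(response_times_ms)
print(f"mean  = {mean:.1f} ms")
for p in (50, 95, 99, 99.9):
    print(f"p{p:<5}= {percentile(response_times_ms, p)} ms")
```

In this sample the mean lands between the typical request and the slow ones, so it describes neither, while the median and the high percentiles show the two populations separately.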
Service Level Objectives (SLOs) / Agreements (SLAs):
- Define expected performance and availability
- Example: Service is up if median response < 200ms and p99 < 1s
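The example SLO above can be expressed as a check over a window of measurements. This is a sketch under assumptions: the function name `service_is_up` and the sample windows are hypothetical, and real systems typically evaluate this over a rolling time window rather than a fixed list.

```python
import statistics

def service_is_up(response_times_ms, median_limit_ms=200, p99_limit_ms=1000):
    """Apply the example SLO: median < 200 ms and p99 < 1 s."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(response_times_ms, n=100, method="inclusive")
    median, p99 = cuts[49], cuts[98]
    return median < median_limit_ms and p99 < p99_limit_ms

# Hypothetical measurement windows (milliseconds).
healthy = [100] * 99 + [500]
degraded_tail = [100] * 95 + [2000] * 5   # median fine, tail too slow
slow_median = [250] * 100                 # typical request too slow

for name, window in [("healthy", healthy),
                     ("degraded_tail", degraded_tail),
                     ("slow_median", slow_median)]:
    print(name, service_is_up(window))
```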
Approaches to Scaling:
- Vertical Scaling (Scale-up): More powerful machine
- Horizontal Scaling (Scale-out): Distribute load across multiple machines
- Elastic: Automatically add resources when load increases (cloud-friendly)
- Manual: Human analyzes capacity and provisions
Architecture Considerations:
- No magic scaling sauce - architecture depends on application
- Read-heavy vs write-heavy
- Data volume and complexity
- Response time requirements
- Access patterns
- Common wisdom: Keep system simple and pragmatic; don’t over-engineer
Maintainability
Definition: Making life easier for engineering and operations teams. Majority of software cost is ongoing maintenance, not initial development.
Three Design Principles:
- Operability - Make it easy for operations to keep system running smoothly
  - Good monitoring and visibility
  - Support for automation and integration
  - Good documentation
  - Predictable behavior, avoiding surprises
  - Self-healing where appropriate, but manual control when needed
  - Good default behavior with override options
- Simplicity - Make it easy for new engineers to understand the system
  - Manage complexity
  - Symptoms of complexity: Explosion of state space, tight coupling, tangled dependencies, inconsistent naming, hacks for performance, special cases
  - Solution: Abstraction (hiding implementation details behind clean interfaces)
  - Not same as simplicity of UI - can have simple interface with complex implementation
- Evolvability (Extensibility/Modifiability/Plasticity) - Make it easy to make changes
  - Requirements constantly change
  - Agile working patterns
  - Test-driven development (TDD)
  - Refactoring
  - Simple and easy-to-understand systems are easier to modify
Important Points
- Trade-offs are central: No one-size-fits-all solution. Every design decision involves trade-offs.
- Hardware redundancy → Software fault-tolerance shift: Cloud era emphasizes software techniques over hardware redundancy.
- Percentiles over averages: Averages hide outliers. Use p50, p95, p99 for meaningful metrics.
- Load parameters are application-specific: Twitter’s bottleneck is fan-out, not request volume.
- Simplicity through abstraction: Hide complexity behind clean interfaces, don’t eliminate necessary complexity.
- Most time spent on maintenance: Design for future developers, not just initial launch.
- Human error is a leading cause of outages: Design systems that are forgiving of mistakes.
Examples & Case Studies
- Twitter Timeline Fan-out
  - Problem: Delivering tweets to followers efficiently
  - Naive solution: Query on read (expensive joins)
  - Better solution: Pre-compute timelines (write amplification)
  - Best solution: Hybrid approach based on follower count
- Amazon Response Times
  - Internal services have strict SLAs
  - p99.9 matters because customers with most data = most valuable
  - 100ms increase in response time = 1% sales loss
- Hardware Failure Rates
  - Google: 1-5% of machines fail per year
  - Disk MTTF: 10-50 years, but at scale expect failures daily
- Software Bug Examples
  - Leap second bug (June 30, 2012) - Linux kernel bug caused many services to hang
  - Runaway process consuming all CPU/memory/disk
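The "failures daily" estimate in the Hardware Failure Rates example follows from simple arithmetic. The sketch below assumes disks fail independently at a constant rate of 1/MTTF, which is a simplification (real failure rates vary with disk age); the function name is hypothetical.

```python
def expected_failures_per_day(disk_count, mttf_years):
    """Expected daily failures, assuming independent disks that each
    fail at a constant rate of 1 / MTTF."""
    return disk_count / (mttf_years * 365)

# Cluster size taken from the notes; MTTF values span the quoted 10-50 year range.
for mttf_years in (10, 27.4, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years:>4} years -> {rate:.2f} failures/day")
```

Across the quoted MTTF range, a 10,000-disk cluster would see roughly between one failure every two days and three failures per day, which is where the "about one per day" rule of thumb comes from.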
Questions
- How do you decide between vertical and horizontal scaling?
- What percentile should you optimize for? (p50 vs p95 vs p99)
- How do you prevent cascading failures in distributed systems?
- What’s the difference between fault tolerance and failure prevention?
- How do you measure and improve maintainability?
- What are effective strategies for reducing human error in operations?
- How do you balance simplicity with necessary complexity?
- When should you use elastic scaling vs manual provisioning?
Modern Context (2026)
Reliability:
- Chaos engineering is now standard (Netflix Chaos Monkey, Gremlin)
- Site Reliability Engineering (SRE) practices widespread
- Observability over monitoring (traces, metrics, logs)
- Progressive delivery (canary, blue-green deployments)
- Error budgets to balance reliability vs velocity
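An error budget turns an availability target into an amount of allowed downtime. A minimal sketch, assuming a 30-day window and downtime measured in minutes; the function names and the incident figure are hypothetical.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime allowed by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Budget left after the downtime already spent (negative if overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} min")

# Hypothetical: a 30-minute incident against a 99.9% target.
print(f"remaining: {budget_remaining(99.9, 30):.2f} min")
```

While budget remains, a team can spend it on risky releases; once it is exhausted, the usual convention is to prioritize reliability work over new features.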
Scalability:
- Auto-scaling is default in cloud platforms (AWS Auto Scaling, Kubernetes HPA)
- Serverless hides most scaling decisions from the application (Lambda, Cloud Functions), though concurrency limits and cold starts still apply
- Multi-region active-active architectures common
- Edge computing and CDNs for global scale
- Cost optimization = performance optimization at scale
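The auto-scaling item above can be made concrete with the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas x current metric / target metric). The sketch below only models that rule; the function name, the replica bounds, and the sample numbers are assumptions, and the real controller adds stabilization windows and other behavior not shown here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA-style ratio rule.

    `tolerance` skips scaling when the ratio is close to 1.0
    (0.1 is the documented Kubernetes default).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical readings: average CPU utilization (%) against a 60% target.
print(desired_replicas(4, 90, 60))   # load above target: scale out
print(desired_replicas(4, 62, 60))   # within tolerance: no change
print(desired_replicas(4, 15, 60))   # load far below target: scale in
```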
Maintainability:
- Platform engineering teams build internal developer platforms
- Infrastructure as Code (Terraform, Pulumi) standard
- GitOps for operations (ArgoCD, Flux)
- Developer experience (DevEx) as key metric
- AI-assisted development and documentation
New Challenges:
- Carbon footprint and sustainability considerations
- AI/ML workload scaling (GPU resources, batch inference)
- Data sovereignty and compliance (GDPR, regional requirements)
- Supply chain security (dependency vulnerabilities)
Status: Notes complete
Last Updated: 2026-04-08