Chapter 1: Reliable, Scalable, and Maintainable Applications

Overview

This chapter introduces the three fundamental concerns in software systems design: Reliability, Scalability, and Maintainability. These properties form the foundation for building successful data-intensive applications.

Data-Intensive vs Compute-Intensive: Modern applications are often data-intensive (bottlenecked by data volume, complexity, speed of change) rather than compute-intensive (bottlenecked by CPU).

Common Building Blocks:

Databases (store and retrieve data)
Caches (remember the results of expensive operations to speed up reads)
Search indexes (search/filter by keywords)
Stream processing (asynchronous message handling)
Batch processing (periodic large-scale data crunching)

Key Concepts

Reliability

Definition: System continues to work correctly (performing the correct function at the desired performance) even when things go wrong.

Key Terms:

Faults: Things that can go wrong (component deviating from spec)
Failures: System as a whole stops providing the required service to the user
Fault-tolerant/Resilient: System that anticipates and copes with faults

Types of Faults:

Hardware Faults
- Hard disk crashes (MTTF ~10-50 years, so a cluster of 10,000 disks should expect about 1 disk to die per day on average)
- RAM errors
- Power outages
- Network issues
- Traditional approach: Redundancy (RAID, dual power supplies, hot-swappable CPUs)
- Modern approach: Software fault-tolerance that survives the loss of whole machines (cloud platforms prioritize flexibility and elasticity over single-machine reliability)
Software Errors
- Systematic errors (correlated across nodes)
- Bug that crashes every instance of a service when given a particular bad input
- Runaway process consuming resources
- Service dependency slowdown/unresponsiveness
- Cascading failures
- Harder to anticipate than hardware faults
- Mitigation: Testing, process isolation, monitoring, crash recovery
Human Errors
- Humans are unreliable (operator configuration errors are a leading cause of outages; hardware faults play a role in only 10-25% of them)
- Strategies:
  - Design systems that minimize opportunities for error
  - Decouple places where mistakes are made from places where they cause failures
  - Test thoroughly (unit, integration, manual tests)
  - Allow quick recovery (fast rollback, gradual rollout)
  - Detailed monitoring (telemetry)
  - Good management practices and training
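
The disk figure under Hardware Faults can be checked with a quick calculation. This is a rough sketch that assumes disks fail independently at a constant rate, which real fleets only approximate; the function name is illustrative.

```python
def expected_disk_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average failures per day, assuming independent disks with a constant failure rate."""
    return num_disks / (mttf_years * 365)


for mttf_years in (10, 50):
    per_day = expected_disk_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {per_day:.2f} failures/day")
```

For 10,000 disks this gives about 2.74 failures per day at a 10-year MTTF and about 0.55 at a 50-year MTTF, so "one per day" is an order-of-magnitude figure.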

Importance: Even “non-critical” apps need reliability: bugs and lost data cost user trust and revenue. In critical apps (healthcare, nuclear), failures become safety issues.

Scalability

Definition: System’s ability to cope with increased load. Not a binary yes/no, but about strategies for handling growth.

Load Parameters (describing load):

Requests per second to web server
Ratio of reads to writes in database
Number of simultaneously active users
Hit rate on cache
Example: Twitter (2012) - posting tweets averaged 4.6K requests/sec (over 12K requests/sec at peak), while home timeline reads ran at 300K requests/sec

Twitter Case Study (Fan-out problem):

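Approach 1: Write each tweet to a global collection; build a home timeline at read time by joining against the users being followed (cheap writes, high read load)
Approach 2: Maintain a timeline cache for each user; on each new tweet, write it to all followers’ caches (cheap reads, high write load)
Hybrid: Approach 2 for most users, Approach 1 for celebrities (millions of followers)
Key Insight: The distribution of followers per user (weighted by how often those users tweet) is the key load parameter, because it determines the fan-out load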
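
The hybrid strategy can be sketched in a few lines of in-memory Python. This is an illustrative toy, not Twitter's implementation: every name here (follow, post_tweet, home_timeline, CELEBRITY_THRESHOLD) is hypothetical, the threshold is tiny so the example is easy to trace, and only tweets posted after a follow are delivered to the regular caches.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real system would use a far larger number

followers = defaultdict(set)        # author -> users who follow them
following = defaultdict(set)        # user -> authors they follow
timeline_cache = defaultdict(list)  # user -> precomputed entries (Approach 2)
pull_on_read = defaultdict(list)    # celebrity author -> tweets merged at read time (Approach 1)
_seq = 0                            # global ordering stand-in for a timestamp


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post_tweet(author, text):
    global _seq
    _seq += 1
    entry = (_seq, author, text)
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        # Too many followers to fan out: store once and merge on read.
        pull_on_read[author].append(entry)
    else:
        # Fan out on write: one cache insert per follower.
        for user in followers[author]:
            timeline_cache[user].append(entry)


def home_timeline(user):
    entries = list(timeline_cache[user])
    for author in following[user]:
        entries.extend(pull_on_read[author])
    # Newest first, ordered by sequence number.
    return [(author, text) for _, author, text in sorted(entries, reverse=True)]


follow("alice", "bob")
for fan in ("alice", "u2", "u3"):
    follow(fan, "star")
post_tweet("bob", "hi")       # fanned out to alice's cache
post_tweet("star", "hello")   # kept aside, merged when a follower reads
print(home_timeline("alice"))
```

The write cost of post_tweet grows with the follower count only for non-celebrity authors, which is the point of the hybrid.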

Performance Metrics:

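Throughput: Records processed per second, or total time to run a job on a dataset of a given size (batch systems)
Response time: Time between a client sending a request and receiving the response, including network and queueing delays
- Use percentiles, not averages (median = p50; p95, p99, p99.9)
- Tail latencies (p99 and above) matter - the slowest requests often belong to the customers with the most data
- Head-of-line blocking: A few slow requests hold up the requests queued behind them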
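
To see why percentiles say more than the average, the sketch below computes both for a made-up sample of response times. It uses the nearest-rank definition of a percentile, which is one of several common definitions; the sample data and function name are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests, 4 slower ones, and 1 extreme outlier.
response_times_ms = [20] * 95 + [40] * 4 + [2000]

mean = sum(response_times_ms) / len(response_times_ms)
print("mean:", mean)
for p in (50, 95, 99, 99.9):
    print(f"p{p}:", percentile(response_times_ms, p))
```

Here the mean is 40.6 ms even though 95% of requests took 20 ms, while p99.9 exposes the 2000 ms outlier that the mean blurs away.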

Service Level Objectives (SLOs) / Agreements (SLAs):

Define expected performance and availability
Example: Service is up if median response < 200ms and p99 < 1s
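
The example SLO above can be written as a simple check over a window of measured response times. This is a minimal sketch: the function name and default limits are assumptions taken from the example, p99 uses the nearest-rank definition, and the window is assumed to be non-empty.

```python
import math
from statistics import median


def meets_slo(response_times_ms, median_limit_ms=200, p99_limit_ms=1000):
    """True if the median is under median_limit_ms and p99 is under p99_limit_ms."""
    ordered = sorted(response_times_ms)
    rank = max(1, math.ceil(99 * len(ordered) / 100))  # nearest-rank p99
    p99 = ordered[rank - 1]
    return median(ordered) < median_limit_ms and p99 < p99_limit_ms


healthy = [100] * 99 + [900]      # one slow request out of 100
slow_tail = [100] * 98 + [1500] * 2  # two very slow requests out of 100
print(meets_slo(healthy), meets_slo(slow_tail))
```

Both windows have the same median, so only the p99 condition tells them apart.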

Approaches to Scaling:

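Vertical Scaling (Scale-up): Move to a more powerful machine
Horizontal Scaling (Scale-out): Distribute load across multiple smaller machines (shared-nothing architecture)
Elastic: Automatically add resources when load increases (useful when load is highly unpredictable)
Manual: A human analyzes capacity and decides when to add machines (simpler, fewer operational surprises)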
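
An elastic scaling decision can be as simple as a proportional rule: scale the replica count by the ratio of observed load to target load, then clamp it. The sketch below is an assumption-laden illustration (names and bounds are hypothetical); the Kubernetes Horizontal Pod Autoscaler documents a rule of the same shape, but real autoscalers add smoothing and cooldowns that are omitted here.

```python
import math


def desired_replicas(current_replicas, load_per_replica, target_load_per_replica,
                     min_replicas=1, max_replicas=20):
    """Proportional scaling: keep per-replica load near the target, within fixed bounds."""
    raw = math.ceil(current_replicas * load_per_replica / target_load_per_replica)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 150, 100))   # overloaded: scale out
print(desired_replicas(4, 20, 100))    # mostly idle: scale in
print(desired_replicas(10, 500, 100))  # demand exceeds the configured maximum
```

The max_replicas bound is where manual capacity planning re-enters: someone still has to decide how far automatic growth may go.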

Architecture Considerations:

No magic scaling sauce - architecture depends on application
Read-heavy vs write-heavy
Data volume and complexity
Response time requirements
Access patterns
Common wisdom: Keep system simple and pragmatic; don’t over-engineer

Maintainability

Definition: Making life easier for engineering and operations teams. The majority of the cost of software is ongoing maintenance, not initial development.

Three Design Principles:

Operability - Make it easy for operations to keep system running smoothly
- Good monitoring and visibility
- Support for automation and integration
- Good documentation
- Predictable behavior, avoiding surprises
- Self-healing where appropriate, but manual control when needed
- Good default behavior with override options
Simplicity - Make it easy for new engineers to understand the system
- Manage complexity
- Symptoms of complexity: Explosion of state space, tight coupling, tangled dependencies, inconsistent naming, hacks for performance, special cases
- Solution: Abstraction (hiding implementation details behind clean interfaces)
- Not the same as a simple user interface - a simple interface can sit in front of a complex implementation
Evolvability (Extensibility/Modifiability/Plasticity) - Make it easy to make changes
- Requirements constantly change
- Agile working patterns
- Test-driven development (TDD)
- Refactoring
- Simple and easy-to-understand systems are easier to modify
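
The abstraction point under Simplicity can be illustrated with a small interface. In this sketch the caller depends only on a two-method contract, so the storage implementation can change without touching it. KeyValueStore, InMemoryStore, and remember_login are hypothetical names chosen for the example.

```python
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """The clean interface: callers see only get and put."""

    def get(self, key: str) -> Optional[str]: ...

    def put(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """One possible implementation; a disk- or network-backed one could replace it."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


def remember_login(store: KeyValueStore, user: str, when: str) -> None:
    # Application code knows nothing about how or where the value is stored.
    store.put(f"last_login:{user}", when)


store = InMemoryStore()
remember_login(store, "alice", "2026-04-08")
print(store.get("last_login:alice"))
```

Swapping the implementation later is also an evolvability win, since remember_login does not change.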

Important Points

Trade-offs are central: No one-size-fits-all solution. Every design decision involves trade-offs.
Hardware redundancy → Software fault-tolerance shift: Cloud era emphasizes software techniques over hardware redundancy.
Percentiles over averages: Averages hide outliers. Use p50, p95, p99 for meaningful metrics.
Load parameters are application-specific: Twitter’s bottleneck is fan-out, not request volume.
Simplicity through abstraction: Hide implementation details behind clean interfaces to remove accidental complexity; the complexity inherent in the problem remains.
Most time spent on maintenance: Design for future developers, not just initial launch.
Human errors are a leading cause of outages: Design systems that are forgiving of mistakes.

Examples & Case Studies

Twitter Timeline Fan-out
- Problem: Delivering tweets to followers efficiently
- Naive solution: Query on read (expensive joins)
- Better solution: Pre-compute timelines (write amplification)
- Best solution: Hybrid approach based on follower count
Amazon Response Times
- Internal services state response time requirements at the 99.9th percentile
- p99.9 matters because the slowest requests often come from the customers with the most data, who are the most valuable
- 100ms increase in response time = 1% sales loss
Hardware Failure Rates
- Google: 1-5% of machines fail per year
- Disk MTTF: 10-50 years, but at scale expect failures daily
Software Bug Examples
- Leap second bug (June 30, 2012) - Linux kernel bug caused many services to hang
- Runaway process consuming all CPU/memory/disk

Questions

How do you decide between vertical and horizontal scaling?
What percentile should you optimize for? (p50 vs p95 vs p99)
How do you prevent cascading failures in distributed systems?
What’s the difference between fault tolerance and failure prevention?
How do you measure and improve maintainability?
What are effective strategies for reducing human error in operations?
How do you balance simplicity with necessary complexity?
When should you use elastic scaling vs manual provisioning?

Modern Context (2026)

Reliability:

Chaos engineering is widely practiced (Netflix Chaos Monkey, Gremlin)
Site Reliability Engineering (SRE) practices widespread
Observability over monitoring (traces, metrics, logs)
Progressive delivery (canary, blue-green deployments)
Error budgets to balance reliability vs velocity
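
An error budget is the amount of unreliability an availability target permits over a window. The sketch below converts a target into minutes of allowed downtime; the function name and the 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability target over the window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


for slo in (0.99, 0.999):
    print(f"{slo:.1%} over 30 days: {error_budget_minutes(slo):.1f} minutes")
```

A 99.9% target over 30 days leaves about 43.2 minutes. While budget remains, a team can spend it on risky releases; once it is used up, reliability work takes priority.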

Scalability:

Auto-scaling is default in cloud platforms (AWS Auto Scaling, Kubernetes HPA)
Serverless hides most scaling decisions from the developer (Lambda, Cloud Functions), though concurrency limits and cold starts still apply
Multi-region active-active architectures common
Edge computing and CDNs for global scale
Cost optimization = performance optimization at scale

Maintainability:

Platform engineering teams build internal developer platforms
Infrastructure as Code (Terraform, Pulumi) standard
GitOps for operations (ArgoCD, Flux)
Developer experience (DevEx) as key metric
AI-assisted development and documentation

New Challenges:

Carbon footprint and sustainability considerations
AI/ML workload scaling (GPU resources, batch inference)
Data sovereignty and compliance (GDPR, regional requirements)
Supply chain security (dependency vulnerabilities)

Status: Notes complete
Last Updated: 2026-04-08

Study Notes by Niladri & AI
