Chapter 2: Defining Nonfunctional Requirements
ddia-2e nonfunctional reliability scalability maintainability performance
Status: Notes complete
Overview
This chapter is the 2nd edition’s restructuring of the 1st edition’s Chapter 1. The core topics—reliability, scalability, and maintainability—remain, but the chapter is grounded in a concrete case study (a social network home timeline system) rather than abstract Twitter statistics. The chapter adds a formal treatment of memory architectures (shared-memory, shared-disk, shared-nothing) that was absent from the first edition. The central argument is that nonfunctional requirements—how a system behaves under load, failure, and change—are as important as functional requirements, and they must be defined with precision before architecture decisions are made. As a concrete example, “the system should be fast” is useless, while “p99 read latency must be < 200ms at 100K concurrent users” is actionable. The difference between these two statements is the difference between a system that accidentally meets its goals and one that is engineered to meet them.
Key Concepts
Case Study: Social Network Home Timelines
The chapter opens with a concrete case study: build the home timeline feature of a social network (similar to Twitter/X, Instagram, or Mastodon). This is a canonical system design problem that exposes all the nonfunctional trade-offs discussed later. Before abstract definitions, the reader sees exactly why these properties matter.
Representing Users, Posts, and Follows
The domain has three entities:
- User: Account with profile information
- Post: Content created by a user (text, media, etc.)
- Follow: A directed relationship (user A follows user B)
Naive relational representation:
CREATE TABLE users (user_id INT PRIMARY KEY, username TEXT, ...);
CREATE TABLE posts (post_id INT PRIMARY KEY, author_id INT REFERENCES users, content TEXT, created_at TIMESTAMP);
CREATE TABLE follows (follower_id INT REFERENCES users, followee_id INT REFERENCES users, PRIMARY KEY (follower_id, followee_id));Fetching a user’s home timeline (naive):
SELECT posts.*
FROM posts
JOIN follows ON posts.author_id = follows.followee_id
WHERE follows.follower_id = :current_user_id
ORDER BY posts.created_at DESC
LIMIT 50;This works at small scale. The problem: with millions of users and hundreds of millions of posts, this join is re-executed on every timeline request. At 300K timeline reads/second, no single database can execute this join fast enough.
Materializing and Updating Timelines
The materialized timeline approach: precompute the timeline for each user and store it in a fast cache.
Fan-out on write (push model):
┌─────────────┐ post created ┌─────────────────────────┐
│ User Alice │─────────────────▶ │ Fan-out worker │
│ (1M followers) │ │ Write to 1M timelines │
└─────────────┘ └─────────────────────────┘
│
▼
Timeline cache per user
[user_1_timeline: [post_ids]]
[user_2_timeline: [post_ids]]
[user_3_timeline: [post_ids]]
...
Fan-out on read (pull model): Execute the join at read time. Fast writes, slow reads.
Fan-out on write (push model): Precompute at write time. Fast reads, slow writes. Problem: celebrity with 1M followers posting one tweet triggers 1M cache writes = write amplification.
Hybrid solution (the right answer for Twitter-like systems):
- Most users: fan-out on write (their posts go directly to followers’ timeline caches)
- Celebrities (high follower count): fan-out on read (merged into timeline at read time)
- Threshold is configurable (typically followers > ~100K–1M)
This case study demonstrates the key insight: there is no universally correct answer. The right approach depends on the ratio of reads to writes and the distribution of follower counts—both are load parameters that must be measured.
Describing Performance
Before you can reason about reliability or scalability, you need precise vocabulary for describing how a system performs.
Latency and Response Time
Latency and response time are often confused:
- Response time: What the client measures—the total time from sending a request to receiving a response. Includes network time, queuing time, and processing time.
- Latency: The time a request is “latent” (waiting to be processed). Strictly, it excludes processing time, but in common usage, the terms are interchangeable.
- Service time: The actual time the server spends processing the request (excludes queuing and network).
The distinction matters: a request might have 1ms service time but 500ms response time due to network round-trips and queuing at overloaded upstream services.
Average, Median, and Percentiles
Why averages are misleading: If 99 requests take 10ms and 1 request takes 10,000ms:
- Arithmetic mean: ~110ms (heavily skewed by the outlier)
- p50 (median): 10ms (half of requests are at or below this)
- p99: 10,000ms (captures the outlier)
Percentiles express: “X% of requests completed in less than Y ms.”
| Percentile | Meaning | When to use |
|---|---|---|
| p50 | Median; 50% of requests are faster | Typical user experience |
| p95 | 95% of requests are faster | Near-worst-case experience |
| p99 | 99% of requests are faster | Tail latency; most users’ worst experience |
| p99.9 | 99.9% of requests are faster | Very expensive to optimize; for critical paths |
Tail latency amplification: In a system that makes N parallel calls to downstream services, the response time is bounded by the slowest call. If each service has p99 of 100ms, and a request makes 10 parallel downstream calls, the effective p99 for the composite request is much worse than 100ms.
Fan-out request: call 10 downstream services in parallel
P(all 10 complete within T) = P(single service < T)^10
If single service p99 = 100ms:
P(single service < 100ms) = 0.99
P(all 10 < 100ms) = 0.99^10 = 0.904
So effectively ~p90 at the composite level = 100ms
p99 of composite ≈ much higher
This explains why large distributed systems relentlessly focus on tail latency—each additional hop in a request path multiplies the tail latency problem.
Head-of-line blocking: In a queue or thread pool, a slow request can block all subsequent requests waiting behind it. This causes good requests to appear slow simply because they are queued behind a slow request. Mitigation: shorter timeouts, separate thread pools for different operation types (bulkheads), async processing.
Use of Response Time Metrics
Service Level Indicators (SLIs): Measurable metrics for a service’s performance (e.g., “request success rate,” “p99 latency”).
Service Level Objectives (SLOs): Targets for SLIs (e.g., “p99 latency < 200ms measured over 5-minute windows”).
Service Level Agreements (SLAs): Contractual commitments to SLOs with consequences for violation (e.g., “if SLO is violated, customer receives 10% credit”).
Error budgets (SRE concept): If the SLO is 99.9% availability, the error budget is 0.1% downtime (~43 minutes/month). Teams can spend this budget on risky deployments. When the budget is exhausted, the team must focus on reliability, not new features.
Measuring response time correctly: client-side measurement is more accurate than server-side (captures network and queue time). Rolling percentiles over a time window (e.g., 1-minute rolling p99) are more actionable than batch statistics.
Reliability and Fault Tolerance
Reliability: A system works correctly (performing the correct function at the desired level of performance) even when things go wrong.
Fault vs Failure:
- Fault: One component deviating from its specification (a disk crashes, a network packet is dropped, a bug is triggered)
- Failure: The system as a whole stops providing service to users
A fault-tolerant (or resilient) system prevents faults from causing failures. The goal is not to prevent faults (impossible at scale) but to prevent them from causing service failures.
Why tolerating faults beats preventing them: At the scale of large systems, hardware fails every day, software has bugs, and humans make mistakes. The only achievable goal is ensuring faults don’t cascade into failures.
Fault Tolerance
Fault tolerance mechanisms work by:
- Detecting the fault quickly (health checks, watchdogs, timeouts)
- Isolating it from the rest of the system (circuit breakers, bulkheads)
- Recovering by routing around the fault (failover, retry, degrade gracefully)
Circuit breaker pattern: When a service is failing, stop sending it requests immediately (instead of waiting for each to time out). After a cooldown period, try again. Prevents cascading failures and reduces load on already-struggling services.
Hardware and Software Faults
Hardware faults are random and typically independent:
- Hard disk MTTF (Mean Time To Failure): 10–50 years per disk
- At 10,000 disks: expect ~1 failure per day
- RAM errors, NIC failures, power outages, cooling failures
- Traditional mitigation: Hardware redundancy (RAID, dual power supplies, hot-swap CPUs)
- Modern mitigation: Software fault tolerance — assume commodity hardware fails; handle it in software (replicate data, failover automatically). Cloud instances are terminated with no notice.
Software faults are systematic and often correlated:
- A bug affecting all nodes when they all run the same code
- A runaway process consuming all CPU or memory
- A slow external service dependency (e.g., DNS resolver latency spike cascades to all services)
- Cascading failures (service A slows → service B’s queue fills → service B slows → service A’s queue fills)
- Harder to anticipate than hardware faults because they often manifest only under specific conditions (high load, specific data, leap second)
- Mitigation: Testing (unit, integration, chaos engineering), process isolation, careful handling of external dependencies, crash-only software
Classic examples:
- Linux kernel leap second bug (June 30, 2012): Caused CPU spin in
clock_gettime, bringing down hundreds of services that relied on a correct monotonic clock - AWS EC2 EBS outage (2011): Single storage component failure cascaded through the retry logic of multiple services
Humans and Reliability
Human error is the leading cause of production outages in most organizations. Studies show configuration errors cause the majority of internet service outages, not hardware failures.
Strategies to reduce human error impact:
- Design to minimize errors: Good APIs and abstractions make the right thing easy and the wrong thing hard. Sandbox/staging environments let people experiment safely.
- Decouple mistakes from failures: Use feature flags for gradual rollout; canary deployments catch bugs before full rollout; schema migrations separate from code deployments.
- Allow quick recovery: Fast rollback (one-click revert), point-in-time database recovery, roll-forward through replayed events.
- Telemetry: Detailed monitoring, structured logging, distributed tracing. You cannot fix what you cannot see.
- Good operational practices: Runbooks, on-call rotations, blameless post-mortems, training.
Scalability
Scalability: The system’s ability to cope with increased load. It is not a binary yes/no property, but a question: “If the load increases by 10x, what are our options for keeping the system working?”
Scalability is always about a specific load parameter—the metric that characterizes the demand on the system. Identifying the right load parameter is half the work.
Understanding Load
Load parameters are the numbers that describe the demand on the system:
- Requests per second to a web server
- Ratio of reads to writes in a database
- Number of simultaneously active users in a chat system
- Cache hit rate
- Number of followers per user (for fan-out calculations)
Different systems have different bottleneck parameters. For the home timeline case study:
- The key parameter is not QPS—it is the distribution of follower counts. An average user has 200 followers; a celebrity has 10M. The tail of the distribution determines system architecture.
Two ways to describe scalability:
- Throughput: If load stays the same, how does performance change as we add resources? (Batch processing context)
- Response time: If load increases by X, how much do we need to scale to keep performance constant? (Online systems context)
Shared-Memory, Shared-Disk, and Shared-Nothing Architectures
This is a new section in the 2nd edition that was absent from the first edition. It provides a taxonomy for horizontal scaling approaches.
Shared-Memory Architecture (SMP):
- Multiple CPUs (or cores) share a single pool of RAM and disk
- A single server with 128 cores and 1TB RAM
- Pro: Simple programming model; any thread can access any data
- Con: Cost scales non-linearly (high-end servers are very expensive); hardware becomes a single point of failure; practical upper limit (~a few hundred cores per machine)
- When to use: When you haven’t yet hit the limits of the most powerful available machine; when simplicity outweighs cost
Shared-Memory (SMP):
┌─────────────────────────────────┐
│ CPU1 CPU2 CPU3 CPU4 ... │
│ └──────┴──────┘ │
│ Shared RAM │
│ Shared Disk │
└─────────────────────────────────┘
Single machine, all resources shared
Shared-Disk Architecture:
- Multiple machines each with their own CPU and RAM, but sharing a central disk (NAS/SAN) or object store (S3)
- Cloud version: Separate compute clusters all reading from S3 (Snowflake, Redshift Spectrum)
- Pro: Storage is highly available and elastic independently of compute; compute can be added/removed freely
- Con: Shared disk can become a bottleneck; network latency to storage vs local disk
- When to use: Cloud data warehouses (Snowflake, BigQuery); systems where compute and storage need to scale independently
Shared-Disk:
┌──────┐ ┌──────┐ ┌──────┐
│ CPU+ │ │ CPU+ │ │ CPU+ │
│ RAM │ │ RAM │ │ RAM │
└──┬───┘ └──┬───┘ └──┬───┘
│ │ │
└──────────┴──────────┘
│
┌──────▼──────┐
│ Shared Disk │
│ (NAS/SAN/ │
│ S3) │
└─────────────┘
Shared-Nothing Architecture (SN / Horizontal Scaling):
- Each machine has its own CPU, RAM, and disk; machines communicate only via a network
- Data is partitioned (sharded) across machines; each machine handles a subset
- Pro: Linear cost scaling; no single hardware bottleneck; can use commodity hardware globally
- Con: Requires data partitioning (sharding); cross-node operations require network communication; distributed transactions are expensive; consistency is harder
- When to use: When vertical scaling limits are reached; when fault tolerance across machines is required; most modern distributed databases
Shared-Nothing:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ CPU+RAM │ │ CPU+RAM │ │ CPU+RAM │
│ Disk │ │ Disk │ │ Disk │
│ (shard A)│ │ (shard B)│ │ (shard C)│
└────┬─────┘ └────┬─────┘ └────┬─────┘
└──────────────┼──────────────┘
Network
Principles for Scalability
Vertical scaling (scale-up): Move to a more powerful machine. Simple—no code changes required. Limited by the most powerful available hardware and cost curves. First step when you hit a bottleneck.
Horizontal scaling (scale-out): Add more machines. Unlimited theoretical ceiling. Requires data distribution (sharding), load balancing, and application logic for handling distributed state.
Elastic vs manual scaling:
- Elastic: System automatically adds resources when load increases (AWS Auto Scaling, Kubernetes HPA)
- Manual: Humans analyze load and provision capacity ahead of time
- Elastic is essential for highly variable workloads; manual is appropriate for predictable workloads where over-provisioning is controllable
Key principle: Architecture decisions for scalability are specific to the application’s load parameters. There is no magic one-size-fits-all scalable architecture. The Twitter hybrid timeline approach is correct for Twitter’s specific workload; it would be wrong for a different read/write ratio.
The two-pizza team rule (from Amazon, indirectly): Architecture should enable small teams to own and scale individual components independently. Coupling between components limits both organizational and system scalability.
Maintainability
Maintainability: Making it easy for engineering and operations teams to work on the system over time. The majority of software cost is maintenance (bug fixes, new features, operational work, adaptation to new platforms), not initial development.
The authors name three principles of maintainability:
Operability: Making Life Easy for Operations
Operability means making it easy for operations teams to keep the system running smoothly day to day.
Characteristics of a highly operable system:
- Visibility: Good monitoring, distributed tracing, and alerting. “What is the system doing right now?” should always have an answer.
- Automation support: Standard interfaces for deployment, configuration management, and health checking. Works with CI/CD pipelines and infrastructure-as-code tools.
- Predictable behavior: Surprises are the enemy of operations. Systems should behave deterministically under known conditions. Avoid magic, auto-tuning behavior that changes without notice.
- Good default behavior: Sensible defaults that work without manual tuning; easy-to-override when needed.
- Documentation: Operational procedures, runbooks, architecture decision records (ADRs), dependency maps.
- Avoiding single points of human knowledge: The system’s operation should not depend on one expert. Bus factor > 1.
Operations in 2026: The SRE (Site Reliability Engineering) model from Google has become the industry standard. Key SRE practices:
- Error budgets: define acceptable unreliability; balance reliability with velocity
- Toil reduction: automate repetitive manual work
- Postmortems: blameless retrospectives after incidents to prevent recurrence
Simplicity: Managing Complexity
Simplicity means making it easy for new engineers to understand the system. This is distinct from simplicity of the user interface—a system can have a simple interface hiding enormous complexity.
Accidental vs essential complexity:
- Essential complexity: The inherent complexity of the problem being solved (e.g., distributed consensus is fundamentally hard)
- Accidental complexity: Complexity introduced by the implementation—it could be avoided with better design
Symptoms of accidental complexity:
- Explosion of state space (too many special cases)
- Tight coupling between modules (changes in one require changes in many others)
- Tangled dependencies (circular references, implicit ordering)
- Inconsistent naming and terminology
- Hacks accumulated to work around earlier design mistakes
- Undocumented implicit contracts between components
Solution: Abstraction. Good abstractions hide implementation complexity behind clean, stable interfaces. SQL abstracts away B-tree implementations. TCP abstracts away packet routing. A well-designed service API abstracts away the internal data model. Abstractions let you reason about a system at the right level of detail.
The abstraction quality test: A good abstraction hides complexity without leaking it. A leaky abstraction forces users to understand the hidden complexity anyway (e.g., ORM that requires knowing about SQL indexes).
Evolvability: Making Change Easy
Evolvability (also called extensibility, modifiability, or plasticity) means making it easy to change the system as requirements change.
Requirements change constantly: new features, business pivots, regulatory requirements, performance improvements, technology migrations. A system that is hard to change eventually calcifies—it can’t adapt, and the cost of change pushes teams toward riskier “big bang” rewrites.
Connection to simplicity: Simple, well-abstracted systems are easier to change. Accidental complexity is the primary enemy of evolvability.
Agile practices that support evolvability:
- Test-driven development (TDD): Tests serve as a safety net for change; they catch regressions
- Refactoring: Continuously improve the design without changing behavior
- Continuous integration/continuous deployment (CI/CD): Small, frequent changes reduce risk vs large infrequent releases
- Feature flags: Deploy code without activating features; activate gradually
Schema evolution: Data systems must handle schema changes gracefully. This connects to later chapters on encoding (Protobuf backward/forward compatibility), replication (schema changes during rolling upgrades), and storage (LSM trees and compaction).
Comparison Tables
Architecture Comparison: Shared-Memory vs Shared-Disk vs Shared-Nothing
| Dimension | Shared-Memory (SMP) | Shared-Disk | Shared-Nothing |
|---|---|---|---|
| Memory | All CPUs share RAM | Each node has own RAM | Each node has own RAM |
| Storage | Shared | Shared (NAS/SAN/S3) | Each node has own disk |
| Communication | Memory bus | Network to storage | Network between nodes |
| Scalability ceiling | ~few hundred cores | Storage IOPS | Theoretically unlimited |
| Cost curve | Non-linear (premium for large SMP) | Linear compute + separate storage | Linear |
| Consistency | Easy (shared memory) | Medium (shared storage) | Hard (distributed) |
| Fault tolerance | Hardware RAID; single PoF | High (distributed storage) | High (replication) |
| Examples | PostgreSQL on big VM | Snowflake, Redshift on S3 | Cassandra, Spanner, Kafka |
| When to use | Haven’t hit SMP limits | Cloud warehouse scaling | Horizontal scale required |
Fault Types and Mitigations
| Fault Type | Characteristics | Detection | Mitigation |
|---|---|---|---|
| Hardware | Random, independent, predictable rate | Health checks, SMART | Redundancy, replication, failover |
| Software | Systematic, correlated, hard to predict | Monitoring, alerting, crash reports | Testing, isolation, chaos engineering |
| Human | Leading cause of outages, config-related | Audit logs, change management | Good design, staging, gradual rollout, monitoring |
Scaling Approaches
| Approach | Mechanism | Pros | Cons | When to Use |
|---|---|---|---|---|
| Vertical (scale-up) | Bigger machine | No code changes, simple | Limited ceiling, expensive, single PoF | Start here; most apps |
| Horizontal read scaling | Read replicas | Handle read-heavy workloads | Replication lag; writes still single-node | Read-heavy OLTP |
| Horizontal write scaling | Sharding | True write scalability | Complex; cross-shard queries hard | Write-heavy OLTP |
| Elastic auto-scaling | Auto add/remove nodes on load | Variable load; no over-provisioning | Warm-up time; state management | Cloud-native, variable load |
| Caching | Pre-computed results in memory | Dramatic read performance boost | Staleness; invalidation complexity | Read-heavy, repeated queries |
SLI / SLO / SLA Definitions
| Term | Full Name | What It Is | Example |
|---|---|---|---|
| SLI | Service Level Indicator | A measurable metric | ”Fraction of requests with latency < 200ms” |
| SLO | Service Level Objective | Internal target for an SLI | ”SLI must be ≥ 99.5% over a 28-day window” |
| SLA | Service Level Agreement | Contractual commitment with penalties | ”If SLO is missed, customer gets 10% credit” |
| Error budget | — | How much unreliability is allowed | ”1 - 99.5% = 0.5% failure budget per month” |
Important Points Summary
- Nonfunctional requirements must be precise: “Fast” and “reliable” are useless targets. “p99 < 200ms at 50K RPS” is actionable. Define SLOs with specific numbers before designing.
- The home timeline case study reveals the core scalability tension: Fan-out on write (push) vs fan-out on read (pull). Neither is universally right—the correct hybrid depends on the read/write ratio and follower distribution.
- Percentiles, not averages: Averages hide tail latencies. Use p50, p95, p99. Tail latency amplification means p99 compounds rapidly in systems with many downstream calls.
- Fault tolerance beats fault prevention: At scale, faults are inevitable. Design to contain faults, not to prevent them. Netflix Chaos Monkey, SRE error budgets, and circuit breakers are all applications of this principle.
- Hardware faults are random; software faults are correlated: A software bug can take down every replica simultaneously. Hardware failures are independent. Both require different mitigation strategies.
- Human error is the leading cause of outages: Design systems that make the wrong thing hard—staging environments, gradual rollout, fast rollback, detailed monitoring.
- Shared-nothing scales linearly but adds distributed complexity: The choice of memory architecture determines where complexity lives. Shared-memory is simple but expensive; shared-nothing is scalable but requires sharding and distributed state management.
- Simplicity and evolvability are connected: Accidental complexity is the primary enemy of the ability to change the system. Abstractions that hide essential complexity without leaking it are the primary tool.
- Maintainability costs more than initial development: Most software budget is maintenance. Design for the engineers who will own this system in 3 years, not just for the deadline.
- Operability requires visibility: A system you cannot observe is a system you cannot operate. Monitoring, tracing, and structured logging are not optional features—they are foundational infrastructure.
1st Edition vs 2nd Edition: What Changed
| Aspect | 1st Edition (Ch1) | 2nd Edition (Ch2) |
|---|---|---|
| Case study | Twitter (statistics cited) | Social network (full worked example with SQL and fan-out) |
| Memory architectures | Not covered | New section: Shared-Memory, Shared-Disk, Shared-Nothing |
| SLI/SLO/SLA | Mentioned briefly | Formal definitions with error budget concept |
| Tail latency amplification | Mentioned | Quantified with probability calculation |
| Fault coverage | Hardware, software, human | Same, but more structured with fault tolerance patterns |
| Operability | Bullet list | Connected to SRE practices |
Modern Context (2026)
Percentiles are now standard practice: p99 and p99.9 are reported by default in Datadog, New Relic, Honeycomb, and every major observability platform. The 1st edition had to argue for this; in 2026, it is assumed.
SRE is mainstream: Google’s SRE model (error budgets, SLIs/SLOs, toil reduction) is now standard at companies of all sizes, not just Google-scale. The “SRE Book” and “Site Reliability Workbook” are standard references.
Platform engineering has emerged: Internal developer platforms (IDPs) built by dedicated platform teams provide self-service infrastructure, automated deployments, and standardized observability. Tools: Backstage (developer portal), Crossplane (infrastructure as code), ArgoCD (GitOps).
Chaos engineering is a discipline: Netflix Chaos Monkey, Gremlin, and Chaos Mesh (Kubernetes) are production tools. Chaos engineering is not a stunt—it is a systematic approach to verifying fault tolerance before faults occur in production.
Auto-scaling is the default: Kubernetes Horizontal Pod Autoscaler (HPA), AWS Auto Scaling groups, and cloud function auto-scaling have made elastic scaling the expected baseline. Manual capacity planning is now considered a legacy pattern for most web workloads.
Shared-nothing at cloud scale: The “shared-nothing” model is the architecture of every major cloud database (DynamoDB, Spanner, Cassandra, Kafka). The 2026 landscape confirms the 2nd edition’s emphasis on shared-nothing as the dominant pattern for large-scale systems.
Questions for Reflection
- The home timeline case study uses a hybrid fan-out approach. At what follower count threshold would you switch from push to pull? How would you measure this in production?
- If your system’s p99 latency is 200ms but p99.9 is 5 seconds, what does that suggest about the system’s behavior? What would you investigate first?
- Explain tail latency amplification with a specific example: a web request that calls 5 microservices in parallel, each with p99 of 100ms. What is the approximate p99 of the overall request?
- A startup is deciding between a single PostgreSQL node with vertical scaling vs. a Cassandra cluster from the start. What questions would you ask to guide this decision?
- What is the difference between essential and accidental complexity? Give an example of each from a real system you’ve worked on or studied.
- How would you design a social network’s home timeline system to handle GDPR right to erasure for posts? Specifically, when a user deletes a post, how do you update millions of cached timelines?
Related Resources
- ch01-tradeoffs-data-systems — Architectural context (OLTP/OLAP, cloud, distributed vs single-node)
- ch06-replication — How replication enables fault tolerance and read scaling
- ch07-sharding — Shared-nothing scaling via data partitioning
- ch08-transactions — Reliability at the transaction level (ACID)
- ch09-trouble-with-distributed-systems — Hardware, network, and clock failures in distributed systems
- ch10-consistency-and-consensus — The consistency side of fault-tolerant design
- ch01-reliable-scalable-maintainable — 1st edition Chapter 1 (same topics, different structure)
Last Updated: 2026-05-29