Chapter 2: Defining Nonfunctional Requirements

ddia-2e nonfunctional reliability scalability maintainability performance

Status: Notes complete


Overview

This chapter is the 2nd edition’s restructuring of the 1st edition’s Chapter 1. The core topics—reliability, scalability, and maintainability—remain, but the chapter is grounded in a concrete case study (a social network home timeline system) rather than abstract Twitter statistics. The chapter adds a formal treatment of memory architectures (shared-memory, shared-disk, shared-nothing) that was absent from the first edition. The central argument is that nonfunctional requirements—how a system behaves under load, failure, and change—are as important as functional requirements, and they must be defined with precision before architecture decisions are made. As a concrete example, “the system should be fast” is useless, while “p99 read latency must be < 200ms at 100K concurrent users” is actionable. The difference between these two statements is the difference between a system that accidentally meets its goals and one that is engineered to meet them.


Key Concepts

Case Study: Social Network Home Timelines

The chapter opens with a concrete case study: build the home timeline feature of a social network (similar to Twitter/X, Instagram, or Mastodon). This is a canonical system design problem that exposes all the nonfunctional trade-offs discussed later. Before abstract definitions, the reader sees exactly why these properties matter.

Representing Users, Posts, and Follows

The domain has three entities:

  • User: Account with profile information
  • Post: Content created by a user (text, media, etc.)
  • Follow: A directed relationship (user A follows user B)

Naive relational representation:

CREATE TABLE users (user_id INT PRIMARY KEY, username TEXT, ...);
CREATE TABLE posts (post_id INT PRIMARY KEY, author_id INT REFERENCES users, content TEXT, created_at TIMESTAMP);
CREATE TABLE follows (follower_id INT REFERENCES users, followee_id INT REFERENCES users, PRIMARY KEY (follower_id, followee_id));

Fetching a user’s home timeline (naive):

SELECT posts.*
FROM posts
JOIN follows ON posts.author_id = follows.followee_id
WHERE follows.follower_id = :current_user_id
ORDER BY posts.created_at DESC
LIMIT 50;

This works at small scale. The problem: with millions of users and hundreds of millions of posts, this join is re-executed on every timeline request. At 300K timeline reads/second, no single database can execute this join fast enough.

Materializing and Updating Timelines

The materialized timeline approach: precompute the timeline for each user and store it in a fast cache.

Fan-out on write (push model):
┌─────────────┐    post created    ┌─────────────────────────┐
│  User Alice  │─────────────────▶ │ Fan-out worker          │
│  (1M followers) │                │ Write to 1M timelines   │
└─────────────┘                    └─────────────────────────┘
                                          │
                                          ▼
                                   Timeline cache per user
                                   [user_1_timeline: [post_ids]]
                                   [user_2_timeline: [post_ids]]
                                   [user_3_timeline: [post_ids]]
                                   ...

Fan-out on read (pull model): Execute the join at read time. Fast writes, slow reads.

Fan-out on write (push model): Precompute at write time. Fast reads, slow writes. Problem: celebrity with 1M followers posting one tweet triggers 1M cache writes = write amplification.

Hybrid solution (the right answer for Twitter-like systems):

  • Most users: fan-out on write (their posts go directly to followers’ timeline caches)
  • Celebrities (high follower count): fan-out on read (merged into timeline at read time)
  • Threshold is configurable (typically followers > ~100K–1M)

This case study demonstrates the key insight: there is no universally correct answer. The right approach depends on the ratio of reads to writes and the distribution of follower counts—both are load parameters that must be measured.


Describing Performance

Before you can reason about reliability or scalability, you need precise vocabulary for describing how a system performs.

Latency and Response Time

Latency and response time are often confused:

  • Response time: What the client measures—the total time from sending a request to receiving a response. Includes network time, queuing time, and processing time.
  • Latency: The time a request is “latent” (waiting to be processed). Strictly, it excludes processing time, but in common usage, the terms are interchangeable.
  • Service time: The actual time the server spends processing the request (excludes queuing and network).

The distinction matters: a request might have 1ms service time but 500ms response time due to network round-trips and queuing at overloaded upstream services.

Average, Median, and Percentiles

Why averages are misleading: If 99 requests take 10ms and 1 request takes 10,000ms:

  • Arithmetic mean: ~110ms (heavily skewed by the outlier)
  • p50 (median): 10ms (half of requests are at or below this)
  • p99: 10,000ms (captures the outlier)

Percentiles express: “X% of requests completed in less than Y ms.”

PercentileMeaningWhen to use
p50Median; 50% of requests are fasterTypical user experience
p9595% of requests are fasterNear-worst-case experience
p9999% of requests are fasterTail latency; most users’ worst experience
p99.999.9% of requests are fasterVery expensive to optimize; for critical paths

Tail latency amplification: In a system that makes N parallel calls to downstream services, the response time is bounded by the slowest call. If each service has p99 of 100ms, and a request makes 10 parallel downstream calls, the effective p99 for the composite request is much worse than 100ms.

Fan-out request: call 10 downstream services in parallel
P(all 10 complete within T) = P(single service < T)^10

If single service p99 = 100ms:
P(single service < 100ms) = 0.99
P(all 10 < 100ms) = 0.99^10 = 0.904

So effectively ~p90 at the composite level = 100ms
p99 of composite ≈ much higher

This explains why large distributed systems relentlessly focus on tail latency—each additional hop in a request path multiplies the tail latency problem.

Head-of-line blocking: In a queue or thread pool, a slow request can block all subsequent requests waiting behind it. This causes good requests to appear slow simply because they are queued behind a slow request. Mitigation: shorter timeouts, separate thread pools for different operation types (bulkheads), async processing.

Use of Response Time Metrics

Service Level Indicators (SLIs): Measurable metrics for a service’s performance (e.g., “request success rate,” “p99 latency”).

Service Level Objectives (SLOs): Targets for SLIs (e.g., “p99 latency < 200ms measured over 5-minute windows”).

Service Level Agreements (SLAs): Contractual commitments to SLOs with consequences for violation (e.g., “if SLO is violated, customer receives 10% credit”).

Error budgets (SRE concept): If the SLO is 99.9% availability, the error budget is 0.1% downtime (~43 minutes/month). Teams can spend this budget on risky deployments. When the budget is exhausted, the team must focus on reliability, not new features.

Measuring response time correctly: client-side measurement is more accurate than server-side (captures network and queue time). Rolling percentiles over a time window (e.g., 1-minute rolling p99) are more actionable than batch statistics.


Reliability and Fault Tolerance

Reliability: A system works correctly (performing the correct function at the desired level of performance) even when things go wrong.

Fault vs Failure:

  • Fault: One component deviating from its specification (a disk crashes, a network packet is dropped, a bug is triggered)
  • Failure: The system as a whole stops providing service to users

A fault-tolerant (or resilient) system prevents faults from causing failures. The goal is not to prevent faults (impossible at scale) but to prevent them from causing service failures.

Why tolerating faults beats preventing them: At the scale of large systems, hardware fails every day, software has bugs, and humans make mistakes. The only achievable goal is ensuring faults don’t cascade into failures.

Fault Tolerance

Fault tolerance mechanisms work by:

  1. Detecting the fault quickly (health checks, watchdogs, timeouts)
  2. Isolating it from the rest of the system (circuit breakers, bulkheads)
  3. Recovering by routing around the fault (failover, retry, degrade gracefully)

Circuit breaker pattern: When a service is failing, stop sending it requests immediately (instead of waiting for each to time out). After a cooldown period, try again. Prevents cascading failures and reduces load on already-struggling services.

Hardware and Software Faults

Hardware faults are random and typically independent:

  • Hard disk MTTF (Mean Time To Failure): 10–50 years per disk
  • At 10,000 disks: expect ~1 failure per day
  • RAM errors, NIC failures, power outages, cooling failures
  • Traditional mitigation: Hardware redundancy (RAID, dual power supplies, hot-swap CPUs)
  • Modern mitigation: Software fault tolerance — assume commodity hardware fails; handle it in software (replicate data, failover automatically). Cloud instances are terminated with no notice.

Software faults are systematic and often correlated:

  • A bug affecting all nodes when they all run the same code
  • A runaway process consuming all CPU or memory
  • A slow external service dependency (e.g., DNS resolver latency spike cascades to all services)
  • Cascading failures (service A slows → service B’s queue fills → service B slows → service A’s queue fills)
  • Harder to anticipate than hardware faults because they often manifest only under specific conditions (high load, specific data, leap second)
  • Mitigation: Testing (unit, integration, chaos engineering), process isolation, careful handling of external dependencies, crash-only software

Classic examples:

  • Linux kernel leap second bug (June 30, 2012): Caused CPU spin in clock_gettime, bringing down hundreds of services that relied on a correct monotonic clock
  • AWS EC2 EBS outage (2011): Single storage component failure cascaded through the retry logic of multiple services

Humans and Reliability

Human error is the leading cause of production outages in most organizations. Studies show configuration errors cause the majority of internet service outages, not hardware failures.

Strategies to reduce human error impact:

  1. Design to minimize errors: Good APIs and abstractions make the right thing easy and the wrong thing hard. Sandbox/staging environments let people experiment safely.
  2. Decouple mistakes from failures: Use feature flags for gradual rollout; canary deployments catch bugs before full rollout; schema migrations separate from code deployments.
  3. Allow quick recovery: Fast rollback (one-click revert), point-in-time database recovery, roll-forward through replayed events.
  4. Telemetry: Detailed monitoring, structured logging, distributed tracing. You cannot fix what you cannot see.
  5. Good operational practices: Runbooks, on-call rotations, blameless post-mortems, training.

Scalability

Scalability: The system’s ability to cope with increased load. It is not a binary yes/no property, but a question: “If the load increases by 10x, what are our options for keeping the system working?”

Scalability is always about a specific load parameter—the metric that characterizes the demand on the system. Identifying the right load parameter is half the work.

Understanding Load

Load parameters are the numbers that describe the demand on the system:

  • Requests per second to a web server
  • Ratio of reads to writes in a database
  • Number of simultaneously active users in a chat system
  • Cache hit rate
  • Number of followers per user (for fan-out calculations)

Different systems have different bottleneck parameters. For the home timeline case study:

  • The key parameter is not QPS—it is the distribution of follower counts. An average user has 200 followers; a celebrity has 10M. The tail of the distribution determines system architecture.

Two ways to describe scalability:

  1. Throughput: If load stays the same, how does performance change as we add resources? (Batch processing context)
  2. Response time: If load increases by X, how much do we need to scale to keep performance constant? (Online systems context)

Shared-Memory, Shared-Disk, and Shared-Nothing Architectures

This is a new section in the 2nd edition that was absent from the first edition. It provides a taxonomy for horizontal scaling approaches.

Shared-Memory Architecture (SMP):

  • Multiple CPUs (or cores) share a single pool of RAM and disk
  • A single server with 128 cores and 1TB RAM
  • Pro: Simple programming model; any thread can access any data
  • Con: Cost scales non-linearly (high-end servers are very expensive); hardware becomes a single point of failure; practical upper limit (~a few hundred cores per machine)
  • When to use: When you haven’t yet hit the limits of the most powerful available machine; when simplicity outweighs cost
Shared-Memory (SMP):
┌─────────────────────────────────┐
│  CPU1  CPU2  CPU3  CPU4  ...    │
│      └──────┴──────┘            │
│           Shared RAM            │
│           Shared Disk           │
└─────────────────────────────────┘
Single machine, all resources shared

Shared-Disk Architecture:

  • Multiple machines each with their own CPU and RAM, but sharing a central disk (NAS/SAN) or object store (S3)
  • Cloud version: Separate compute clusters all reading from S3 (Snowflake, Redshift Spectrum)
  • Pro: Storage is highly available and elastic independently of compute; compute can be added/removed freely
  • Con: Shared disk can become a bottleneck; network latency to storage vs local disk
  • When to use: Cloud data warehouses (Snowflake, BigQuery); systems where compute and storage need to scale independently
Shared-Disk:
┌──────┐  ┌──────┐  ┌──────┐
│ CPU+  │  │ CPU+  │  │ CPU+  │
│ RAM  │  │ RAM  │  │ RAM  │
└──┬───┘  └──┬───┘  └──┬───┘
   │          │          │
   └──────────┴──────────┘
              │
       ┌──────▼──────┐
       │ Shared Disk │
       │  (NAS/SAN/  │
       │    S3)      │
       └─────────────┘

Shared-Nothing Architecture (SN / Horizontal Scaling):

  • Each machine has its own CPU, RAM, and disk; machines communicate only via a network
  • Data is partitioned (sharded) across machines; each machine handles a subset
  • Pro: Linear cost scaling; no single hardware bottleneck; can use commodity hardware globally
  • Con: Requires data partitioning (sharding); cross-node operations require network communication; distributed transactions are expensive; consistency is harder
  • When to use: When vertical scaling limits are reached; when fault tolerance across machines is required; most modern distributed databases
Shared-Nothing:
┌──────────┐   ┌──────────┐   ┌──────────┐
│ CPU+RAM  │   │ CPU+RAM  │   │ CPU+RAM  │
│ Disk     │   │ Disk     │   │ Disk     │
│ (shard A)│   │ (shard B)│   │ (shard C)│
└────┬─────┘   └────┬─────┘   └────┬─────┘
     └──────────────┼──────────────┘
                 Network

Principles for Scalability

Vertical scaling (scale-up): Move to a more powerful machine. Simple—no code changes required. Limited by the most powerful available hardware and cost curves. First step when you hit a bottleneck.

Horizontal scaling (scale-out): Add more machines. Unlimited theoretical ceiling. Requires data distribution (sharding), load balancing, and application logic for handling distributed state.

Elastic vs manual scaling:

  • Elastic: System automatically adds resources when load increases (AWS Auto Scaling, Kubernetes HPA)
  • Manual: Humans analyze load and provision capacity ahead of time
  • Elastic is essential for highly variable workloads; manual is appropriate for predictable workloads where over-provisioning is controllable

Key principle: Architecture decisions for scalability are specific to the application’s load parameters. There is no magic one-size-fits-all scalable architecture. The Twitter hybrid timeline approach is correct for Twitter’s specific workload; it would be wrong for a different read/write ratio.

The two-pizza team rule (from Amazon, indirectly): Architecture should enable small teams to own and scale individual components independently. Coupling between components limits both organizational and system scalability.


Maintainability

Maintainability: Making it easy for engineering and operations teams to work on the system over time. The majority of software cost is maintenance (bug fixes, new features, operational work, adaptation to new platforms), not initial development.

The authors name three principles of maintainability:

Operability: Making Life Easy for Operations

Operability means making it easy for operations teams to keep the system running smoothly day to day.

Characteristics of a highly operable system:

  • Visibility: Good monitoring, distributed tracing, and alerting. “What is the system doing right now?” should always have an answer.
  • Automation support: Standard interfaces for deployment, configuration management, and health checking. Works with CI/CD pipelines and infrastructure-as-code tools.
  • Predictable behavior: Surprises are the enemy of operations. Systems should behave deterministically under known conditions. Avoid magic, auto-tuning behavior that changes without notice.
  • Good default behavior: Sensible defaults that work without manual tuning; easy-to-override when needed.
  • Documentation: Operational procedures, runbooks, architecture decision records (ADRs), dependency maps.
  • Avoiding single points of human knowledge: The system’s operation should not depend on one expert. Bus factor > 1.

Operations in 2026: The SRE (Site Reliability Engineering) model from Google has become the industry standard. Key SRE practices:

  • Error budgets: define acceptable unreliability; balance reliability with velocity
  • Toil reduction: automate repetitive manual work
  • Postmortems: blameless retrospectives after incidents to prevent recurrence

Simplicity: Managing Complexity

Simplicity means making it easy for new engineers to understand the system. This is distinct from simplicity of the user interface—a system can have a simple interface hiding enormous complexity.

Accidental vs essential complexity:

  • Essential complexity: The inherent complexity of the problem being solved (e.g., distributed consensus is fundamentally hard)
  • Accidental complexity: Complexity introduced by the implementation—it could be avoided with better design

Symptoms of accidental complexity:

  • Explosion of state space (too many special cases)
  • Tight coupling between modules (changes in one require changes in many others)
  • Tangled dependencies (circular references, implicit ordering)
  • Inconsistent naming and terminology
  • Hacks accumulated to work around earlier design mistakes
  • Undocumented implicit contracts between components

Solution: Abstraction. Good abstractions hide implementation complexity behind clean, stable interfaces. SQL abstracts away B-tree implementations. TCP abstracts away packet routing. A well-designed service API abstracts away the internal data model. Abstractions let you reason about a system at the right level of detail.

The abstraction quality test: A good abstraction hides complexity without leaking it. A leaky abstraction forces users to understand the hidden complexity anyway (e.g., ORM that requires knowing about SQL indexes).

Evolvability: Making Change Easy

Evolvability (also called extensibility, modifiability, or plasticity) means making it easy to change the system as requirements change.

Requirements change constantly: new features, business pivots, regulatory requirements, performance improvements, technology migrations. A system that is hard to change eventually calcifies—it can’t adapt, and the cost of change pushes teams toward riskier “big bang” rewrites.

Connection to simplicity: Simple, well-abstracted systems are easier to change. Accidental complexity is the primary enemy of evolvability.

Agile practices that support evolvability:

  • Test-driven development (TDD): Tests serve as a safety net for change; they catch regressions
  • Refactoring: Continuously improve the design without changing behavior
  • Continuous integration/continuous deployment (CI/CD): Small, frequent changes reduce risk vs large infrequent releases
  • Feature flags: Deploy code without activating features; activate gradually

Schema evolution: Data systems must handle schema changes gracefully. This connects to later chapters on encoding (Protobuf backward/forward compatibility), replication (schema changes during rolling upgrades), and storage (LSM trees and compaction).


Comparison Tables

Architecture Comparison: Shared-Memory vs Shared-Disk vs Shared-Nothing

DimensionShared-Memory (SMP)Shared-DiskShared-Nothing
MemoryAll CPUs share RAMEach node has own RAMEach node has own RAM
StorageSharedShared (NAS/SAN/S3)Each node has own disk
CommunicationMemory busNetwork to storageNetwork between nodes
Scalability ceiling~few hundred coresStorage IOPSTheoretically unlimited
Cost curveNon-linear (premium for large SMP)Linear compute + separate storageLinear
ConsistencyEasy (shared memory)Medium (shared storage)Hard (distributed)
Fault toleranceHardware RAID; single PoFHigh (distributed storage)High (replication)
ExamplesPostgreSQL on big VMSnowflake, Redshift on S3Cassandra, Spanner, Kafka
When to useHaven’t hit SMP limitsCloud warehouse scalingHorizontal scale required

Fault Types and Mitigations

Fault TypeCharacteristicsDetectionMitigation
HardwareRandom, independent, predictable rateHealth checks, SMARTRedundancy, replication, failover
SoftwareSystematic, correlated, hard to predictMonitoring, alerting, crash reportsTesting, isolation, chaos engineering
HumanLeading cause of outages, config-relatedAudit logs, change managementGood design, staging, gradual rollout, monitoring

Scaling Approaches

ApproachMechanismProsConsWhen to Use
Vertical (scale-up)Bigger machineNo code changes, simpleLimited ceiling, expensive, single PoFStart here; most apps
Horizontal read scalingRead replicasHandle read-heavy workloadsReplication lag; writes still single-nodeRead-heavy OLTP
Horizontal write scalingShardingTrue write scalabilityComplex; cross-shard queries hardWrite-heavy OLTP
Elastic auto-scalingAuto add/remove nodes on loadVariable load; no over-provisioningWarm-up time; state managementCloud-native, variable load
CachingPre-computed results in memoryDramatic read performance boostStaleness; invalidation complexityRead-heavy, repeated queries

SLI / SLO / SLA Definitions

TermFull NameWhat It IsExample
SLIService Level IndicatorA measurable metric”Fraction of requests with latency < 200ms”
SLOService Level ObjectiveInternal target for an SLI”SLI must be ≥ 99.5% over a 28-day window”
SLAService Level AgreementContractual commitment with penalties”If SLO is missed, customer gets 10% credit”
Error budgetHow much unreliability is allowed”1 - 99.5% = 0.5% failure budget per month”

Important Points Summary

  • Nonfunctional requirements must be precise: “Fast” and “reliable” are useless targets. “p99 < 200ms at 50K RPS” is actionable. Define SLOs with specific numbers before designing.
  • The home timeline case study reveals the core scalability tension: Fan-out on write (push) vs fan-out on read (pull). Neither is universally right—the correct hybrid depends on the read/write ratio and follower distribution.
  • Percentiles, not averages: Averages hide tail latencies. Use p50, p95, p99. Tail latency amplification means p99 compounds rapidly in systems with many downstream calls.
  • Fault tolerance beats fault prevention: At scale, faults are inevitable. Design to contain faults, not to prevent them. Netflix Chaos Monkey, SRE error budgets, and circuit breakers are all applications of this principle.
  • Hardware faults are random; software faults are correlated: A software bug can take down every replica simultaneously. Hardware failures are independent. Both require different mitigation strategies.
  • Human error is the leading cause of outages: Design systems that make the wrong thing hard—staging environments, gradual rollout, fast rollback, detailed monitoring.
  • Shared-nothing scales linearly but adds distributed complexity: The choice of memory architecture determines where complexity lives. Shared-memory is simple but expensive; shared-nothing is scalable but requires sharding and distributed state management.
  • Simplicity and evolvability are connected: Accidental complexity is the primary enemy of the ability to change the system. Abstractions that hide essential complexity without leaking it are the primary tool.
  • Maintainability costs more than initial development: Most software budget is maintenance. Design for the engineers who will own this system in 3 years, not just for the deadline.
  • Operability requires visibility: A system you cannot observe is a system you cannot operate. Monitoring, tracing, and structured logging are not optional features—they are foundational infrastructure.

1st Edition vs 2nd Edition: What Changed

Aspect1st Edition (Ch1)2nd Edition (Ch2)
Case studyTwitter (statistics cited)Social network (full worked example with SQL and fan-out)
Memory architecturesNot coveredNew section: Shared-Memory, Shared-Disk, Shared-Nothing
SLI/SLO/SLAMentioned brieflyFormal definitions with error budget concept
Tail latency amplificationMentionedQuantified with probability calculation
Fault coverageHardware, software, humanSame, but more structured with fault tolerance patterns
OperabilityBullet listConnected to SRE practices

Modern Context (2026)

Percentiles are now standard practice: p99 and p99.9 are reported by default in Datadog, New Relic, Honeycomb, and every major observability platform. The 1st edition had to argue for this; in 2026, it is assumed.

SRE is mainstream: Google’s SRE model (error budgets, SLIs/SLOs, toil reduction) is now standard at companies of all sizes, not just Google-scale. The “SRE Book” and “Site Reliability Workbook” are standard references.

Platform engineering has emerged: Internal developer platforms (IDPs) built by dedicated platform teams provide self-service infrastructure, automated deployments, and standardized observability. Tools: Backstage (developer portal), Crossplane (infrastructure as code), ArgoCD (GitOps).

Chaos engineering is a discipline: Netflix Chaos Monkey, Gremlin, and Chaos Mesh (Kubernetes) are production tools. Chaos engineering is not a stunt—it is a systematic approach to verifying fault tolerance before faults occur in production.

Auto-scaling is the default: Kubernetes Horizontal Pod Autoscaler (HPA), AWS Auto Scaling groups, and cloud function auto-scaling have made elastic scaling the expected baseline. Manual capacity planning is now considered a legacy pattern for most web workloads.

Shared-nothing at cloud scale: The “shared-nothing” model is the architecture of every major cloud database (DynamoDB, Spanner, Cassandra, Kafka). The 2026 landscape confirms the 2nd edition’s emphasis on shared-nothing as the dominant pattern for large-scale systems.


Questions for Reflection

  1. The home timeline case study uses a hybrid fan-out approach. At what follower count threshold would you switch from push to pull? How would you measure this in production?
  2. If your system’s p99 latency is 200ms but p99.9 is 5 seconds, what does that suggest about the system’s behavior? What would you investigate first?
  3. Explain tail latency amplification with a specific example: a web request that calls 5 microservices in parallel, each with p99 of 100ms. What is the approximate p99 of the overall request?
  4. A startup is deciding between a single PostgreSQL node with vertical scaling vs. a Cassandra cluster from the start. What questions would you ask to guide this decision?
  5. What is the difference between essential and accidental complexity? Give an example of each from a real system you’ve worked on or studied.
  6. How would you design a social network’s home timeline system to handle GDPR right to erasure for posts? Specifically, when a user deletes a post, how do you update millions of cached timelines?

Last Updated: 2026-05-29