Chapter 2: Defining Nonfunctional Requirements

ddia-2e nonfunctional reliability scalability maintainability performance

Status: Notes complete

Overview

This chapter is the 2nd edition’s restructuring of the 1st edition’s Chapter 1. The core topics (reliability, scalability, and maintainability) remain, but the chapter is now grounded in a concrete case study, a social network home timeline system, rather than a brief discussion of Twitter statistics. It also adds a formal treatment of shared-memory, shared-disk, and shared-nothing architectures, which the 1st edition’s Chapter 1 did not cover. The central argument is that nonfunctional requirements (how a system behaves under load, failure, and change) are as important as functional requirements, and that they must be defined precisely before architecture decisions are made. For example, “the system should be fast” is useless as a target, while “p99 read latency must be < 200ms at 100K concurrent users” is actionable. The difference between the two statements is the difference between a system that meets its goals by accident and one that is engineered to meet them.

Key Concepts

Case Study: Social Network Home Timelines

The chapter opens with a concrete case study: build the home timeline feature of a social network (similar to Twitter/X, Instagram, or Mastodon). This is a canonical system design problem that exposes all the nonfunctional trade-offs discussed later. Before abstract definitions, the reader sees exactly why these properties matter.

Representing Users, Posts, and Follows

The domain has three entities:

User: Account with profile information
Post: Content created by a user (text, media, etc.)
Follow: A directed relationship (user A follows user B)

Naive relational representation:

CREATE TABLE users (user_id INT PRIMARY KEY, username TEXT, ...);
CREATE TABLE posts (post_id INT PRIMARY KEY, author_id INT REFERENCES users, content TEXT, created_at TIMESTAMP);
CREATE TABLE follows (follower_id INT REFERENCES users, followee_id INT REFERENCES users, PRIMARY KEY (follower_id, followee_id));

Fetching a user’s home timeline (naive):

SELECT posts.*
FROM posts
JOIN follows ON posts.author_id = follows.followee_id
WHERE follows.follower_id = :current_user_id
ORDER BY posts.created_at DESC
LIMIT 50;

This works at small scale. The problem: with millions of users and hundreds of millions of posts, this join is re-executed on every timeline request. At 300K timeline reads/second, no single database can execute this join fast enough.
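To make the naive approach concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module with an in-memory database. The sample users and posts are invented for illustration, and created_at is stored as an integer rather than a TIMESTAMP to keep the example short.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the three tables and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT);
        CREATE TABLE posts (post_id INTEGER PRIMARY KEY,
                            author_id INTEGER REFERENCES users,
                            content TEXT, created_at INTEGER);
        CREATE TABLE follows (follower_id INTEGER REFERENCES users,
                              followee_id INTEGER REFERENCES users,
                              PRIMARY KEY (follower_id, followee_id));
    """)
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "alice"), (2, "bob"), (3, "carol")])
    # alice follows bob and carol (hypothetical sample data)
    conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)",
                     [(10, 2, "bob's first post", 100),
                      (11, 3, "carol's post", 200),
                      (12, 1, "alice's own post", 300),
                      (13, 2, "bob's second post", 400)])
    return conn

def home_timeline(conn, user_id, limit=50):
    """Run the naive join; it is recomputed from scratch on every call."""
    rows = conn.execute("""
        SELECT posts.post_id, posts.content
        FROM posts
        JOIN follows ON posts.author_id = follows.followee_id
        WHERE follows.follower_id = ?
        ORDER BY posts.created_at DESC
        LIMIT ?
    """, (user_id, limit))
    return rows.fetchall()

conn = build_demo_db()
timeline = home_timeline(conn, user_id=1)
print(timeline)
```

Alice's own post is excluded because she does not follow herself; the remaining posts come back newest first.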

Materializing and Updating Timelines

The materialized timeline approach: precompute the timeline for each user and store it in a fast cache.

Fan-out on write (push model):
┌─────────────────┐   post created   ┌─────────────────────────┐
│  User Alice     │─────────────────▶│ Fan-out worker          │
│  (1M followers) │                  │ Write to 1M timelines   │
└─────────────────┘                  └─────────────────────────┘
                                                  │
                                                  ▼
                                     Timeline cache per user
                                     [user_1_timeline: [post_ids]]
                                     [user_2_timeline: [post_ids]]
                                     [user_3_timeline: [post_ids]]
                                     ...

Fan-out on read (pull model): Execute the join at read time. Fast writes, slow reads.

Fan-out on write (push model): Precompute at write time. Fast reads, slow writes. Problem: a single post by a celebrity with 1M followers triggers 1M timeline cache writes (write amplification).

Hybrid solution (the right answer for Twitter-like systems):

Most users: fan-out on write (their posts go directly to followers’ timeline caches)
Celebrities (high follower count): fan-out on read (merged into timeline at read time)
The threshold is configurable and should be tuned from measurements (for example, somewhere in the range of 100K–1M followers)
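The hybrid approach can be sketched with in-memory dictionaries standing in for the post store and the timeline cache. This is an illustrative sketch, not the book's implementation: the names (CELEBRITY_THRESHOLD, publish, read_timeline) are hypothetical, and the threshold is set to 3 only so the demo stays small.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; tiny so the demo is easy to follow

followers = defaultdict(set)         # author -> readers who follow them
following = defaultdict(set)         # reader -> authors they follow
posts_by_author = defaultdict(list)  # author -> [(timestamp, post_id)]
timeline_cache = defaultdict(list)   # reader -> [(timestamp, post_id)] pushed at write time

def follow(reader, author):
    followers[author].add(reader)
    following[reader].add(author)

def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD

def publish(author, post_id, timestamp):
    """Store the post; push it to follower caches only for non-celebrity authors."""
    posts_by_author[author].append((timestamp, post_id))
    if not is_celebrity(author):
        for reader in followers[author]:
            timeline_cache[reader].append((timestamp, post_id))

def read_timeline(reader, limit=50):
    """Merge the precomputed cache with celebrity posts pulled at read time."""
    entries = list(timeline_cache[reader])
    for author in following[reader]:
        if is_celebrity(author):
            entries.extend(posts_by_author[author])
    entries.sort(reverse=True)  # newest first
    return [post_id for _, post_id in entries[:limit]]

# Demo: "star" has 3 followers (a celebrity here); "bob" has 1.
for reader in ("u1", "u2", "u3"):
    follow(reader, "star")
follow("u1", "bob")

publish("bob", "bob-1", timestamp=1)    # pushed to u1's cache
publish("star", "star-1", timestamp=2)  # not pushed; pulled at read time
print(read_timeline("u1"))
print(timeline_cache["u2"])
```

The sketch ignores what happens when an author crosses the threshold after some posts were already pushed; a real system would need to deduplicate or migrate those entries.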

This case study demonstrates the key insight: there is no universally correct answer. The right approach depends on the ratio of reads to writes and the distribution of follower counts—both are load parameters that must be measured.

Describing Performance

Before you can reason about reliability or scalability, you need precise vocabulary for describing how a system performs.

Latency and Response Time

Latency and response time are often confused:

Response time: What the client measures—the total time from sending a request to receiving a response. Includes network time, queuing time, and processing time.
Latency: The time a request is “latent” (waiting to be processed). Strictly, it excludes processing time, but in common usage, the terms are interchangeable.
Service time: The actual time the server spends processing the request (excludes queuing and network).

The distinction matters: a request might have 1ms service time but 500ms response time due to network round-trips and queuing at overloaded upstream services.

Average, Median, and Percentiles

Why averages are misleading: If 99 requests take 10ms and 1 request takes 10,000ms:

Arithmetic mean: ~110ms (heavily skewed by the outlier)
p50 (median): 10ms (half of requests are at or below this)
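p99 (nearest-rank): 10ms, because 99 of the 100 requests finish in 10ms. The single outlier shows up only at the maximum (p100 = 10,000ms); with two or more slow requests per hundred, p99 would jump to 10,000ms.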

Percentiles express: “X% of requests completed in less than Y ms.”
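A short sketch of how these statistics can be computed. It assumes the nearest-rank definition of a percentile (monitoring tools may interpolate instead), and it uses a slightly different hypothetical sample, 97 fast requests and 3 slow ones, so that the slow requests make up more than 1% of the traffic.

```python
import math
import statistics

def percentile_nearest_rank(samples, p):
    """Return the smallest sample such that at least p% of samples are <= it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical sample: 97 fast requests and 3 slow ones (milliseconds).
response_times_ms = [10] * 97 + [10_000] * 3

mean_ms = statistics.mean(response_times_ms)
p50_ms = percentile_nearest_rank(response_times_ms, 50)
p95_ms = percentile_nearest_rank(response_times_ms, 95)
p99_ms = percentile_nearest_rank(response_times_ms, 99)
print(mean_ms, p50_ms, p95_ms, p99_ms)
```

The mean lands near 310ms, a value that no request actually experienced, while p50 and p95 stay at 10ms and p99 exposes the 10,000ms tail.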

Percentile	Meaning	When to use
p50	Median; 50% of requests are faster	Typical user experience
p95	95% of requests are faster	Near-worst-case experience
p99	99% of requests are faster	Tail latency; the slowest 1 in 100 requests
p99.9	99.9% of requests are faster	Very expensive to optimize; for critical paths

Tail latency amplification: In a system that makes N parallel calls to downstream services, the response time is determined by the slowest call. If each service has a p99 of 100ms and a request makes 10 parallel downstream calls, the p99 of the composite request is much worse than 100ms.

Fan-out request: call 10 downstream services in parallel
P(all 10 complete within T) = P(single service < T)^10

If single service p99 = 100ms:
P(single service < 100ms) = 0.99
P(all 10 < 100ms) = 0.99^10 = 0.904

So 100ms is only about the p90 of the composite request.
The composite p99 is considerably higher than 100ms.
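The same arithmetic for other fan-out widths, as a small sketch. It assumes the downstream calls are statistically independent, which real systems only approximate; the function names are hypothetical.

```python
def prob_all_within(per_call_prob: float, fanout: int) -> float:
    """Probability that all `fanout` independent calls finish within the threshold."""
    return per_call_prob ** fanout

def required_per_call_prob(target_prob: float, fanout: int) -> float:
    """Per-call probability needed so that the whole fan-out meets `target_prob`."""
    return target_prob ** (1 / fanout)

for n in (1, 5, 10, 100):
    print(n, round(prob_all_within(0.99, n), 3))

# What each of 10 calls must achieve for the composite to reach 99% within the threshold.
needed = required_per_call_prob(0.99, 10)
print(round(needed, 4))
```

With a fan-out of 100, only about 37% of composite requests finish within the single-service p99. Holding a composite p99 with a fan-out of 10 requires each call to hit roughly its own p99.9.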

This explains why large distributed systems relentlessly focus on tail latency—each additional hop in a request path multiplies the tail latency problem.

Head-of-line blocking: In a queue or thread pool, a slow request can block all subsequent requests waiting behind it. This causes good requests to appear slow simply because they are queued behind a slow request. Mitigation: shorter timeouts, separate thread pools for different operation types (bulkheads), async processing.
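A deliberately simplified model of head-of-line blocking: one worker, a FIFO queue, and all requests arriving at the same instant. The service times are made-up numbers for illustration.

```python
def fifo_response_times(service_times_ms):
    """Single worker, FIFO queue, all requests arrive at t=0.

    Response time = time spent waiting in the queue + the request's own service time.
    """
    clock = 0
    responses = []
    for service in service_times_ms:
        clock += service
        responses.append(clock)
    return responses

# One slow request at the head of the queue delays every fast request behind it.
shared_queue = fifo_response_times([1000, 1, 1, 1])
# The same fast requests in their own pool (a bulkhead) are unaffected.
separate_pool = fifo_response_times([1, 1, 1])
print(shared_queue)
print(separate_pool)
```

Each fast request needs only 1ms of service time, yet in the shared queue it is observed as taking over a second.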

Use of Response Time Metrics

Service Level Indicators (SLIs): Measurable metrics for a service’s performance (e.g., “request success rate,” “p99 latency”).

Service Level Objectives (SLOs): Targets for SLIs (e.g., “p99 latency < 200ms measured over 5-minute windows”).

Service Level Agreements (SLAs): Contractual commitments to SLOs with consequences for violation (e.g., “if SLO is violated, customer receives 10% credit”).

Error budgets (SRE concept): If the SLO is 99.9% availability, the error budget is 0.1% downtime (~43 minutes/month). Teams can spend this budget on risky deployments. When the budget is exhausted, the team must focus on reliability, not new features.
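The error budget arithmetic can be checked with a few lines. This sketch assumes a 30-day window and treats the budget purely as downtime minutes; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, downtime_minutes=10.8), 2))
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime; after a 10.8-minute incident, three quarters of the budget remains.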

Measuring response time correctly: client-side measurement reflects what users actually experience, because it includes network delays and queuing that server-side timers miss. Rolling percentiles over a time window (e.g., a 1-minute rolling p99) are more actionable than statistics computed once over a long batch.

Reliability and Fault Tolerance

Reliability: A system works correctly (performing the correct function at the desired level of performance) even when things go wrong.

Fault vs Failure:

Fault: One component deviating from its specification (a disk crashes, a network packet is dropped, a bug is triggered)
Failure: The system as a whole stops providing service to users

A fault-tolerant (or resilient) system prevents faults from causing failures. The goal is not to prevent faults (impossible at scale) but to prevent them from causing service failures.

Why tolerating faults beats preventing them: At the scale of large systems, hardware fails every day, software has bugs, and humans make mistakes. The only achievable goal is ensuring faults don’t cascade into failures.

Fault Tolerance

Fault tolerance mechanisms work by:

Detecting the fault quickly (health checks, watchdogs, timeouts)
Isolating it from the rest of the system (circuit breakers, bulkheads)
Recovering by routing around the fault (failover, retry, degrade gracefully)

Circuit breaker pattern: When a service is failing, stop sending it requests immediately (instead of waiting for each to time out). After a cooldown period, try again. Prevents cascading failures and reduces load on already-struggling services.
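A minimal circuit breaker sketch, under a few assumptions: it counts consecutive failures, it is not thread-safe, and the class and parameter names are hypothetical rather than taken from any particular library. The clock is injectable so the demo runs without sleeping.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast; downstream is cooling down")
            # Cooldown elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result

# Demo with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=10.0, clock=lambda: now[0])

def flaky():
    raise TimeoutError("downstream timed out")

outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("failed")

now[0] = 11.0  # cooldown has elapsed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

The third call is rejected immediately instead of waiting for another timeout, which is the point of the pattern: callers fail fast and the struggling service gets breathing room.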

Hardware and Software Faults

Hardware faults are random and typically independent:

Hard disk MTTF (Mean Time To Failure): 10–50 years per disk
At 10,000 disks: expect ~1 failure per day
RAM errors, NIC failures, power outages, cooling failures
Traditional mitigation: Hardware redundancy (RAID, dual power supplies, hot-swap CPUs)
Modern mitigation: Software fault tolerance. Assume commodity hardware fails and handle it in software (replicate data, fail over automatically). Cloud instances can be terminated with little or no notice.
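The "one failure per day" figure follows from simple arithmetic. This sketch assumes disks fail independently at a constant rate, which is a simplification; the function name is hypothetical.

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Expected daily failures, assuming independent disks with a constant failure rate."""
    return disk_count / (mttf_years * 365)

for mttf in (10, 30, 50):
    print(mttf, round(expected_failures_per_day(10_000, mttf), 2))
```

Across the 10 to 50 year MTTF range, a 10,000-disk cluster sees between roughly 0.5 and 2.7 failures per day, so "about one per day" is a reasonable rule of thumb.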

Software faults are systematic and often correlated:

A bug affecting all nodes when they all run the same code
A runaway process consuming all CPU or memory
A slow external service dependency (e.g., DNS resolver latency spike cascades to all services)
Cascading failures (service A slows → service B’s queue fills → service B slows → service A’s queue fills)
Harder to anticipate than hardware faults because they often manifest only under specific conditions (high load, specific data, leap second)
Mitigation: Testing (unit, integration, chaos engineering), process isolation, careful handling of external dependencies, crash-only software

Classic examples:

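Linux kernel leap second bug (June 30, 2012): A bug in the kernel’s handling of the inserted leap second caused many applications to hang or spin at high CPU usage simultaneously, disrupting a number of internet services at the same moment
AWS EBS outage in US-East (April 2011): A network configuration change during maintenance triggered a storm of volume re-mirroring, and the resulting overload cascaded to the services that depended on EBS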

Humans and Reliability

Operator mistakes are a leading cause of production outages. One study of large internet services found that configuration changes by operators were the leading cause of outages, while hardware faults played a role in only 10–25% of them.

Strategies to reduce human error impact:

Design to minimize errors: Good APIs and abstractions make the right thing easy and the wrong thing hard. Sandbox/staging environments let people experiment safely.
Decouple mistakes from failures: Use feature flags for gradual rollout; canary deployments catch bugs before full rollout; schema migrations separate from code deployments.
Allow quick recovery: Fast rollback (one-click revert), point-in-time database recovery, roll-forward through replayed events.
Telemetry: Detailed monitoring, structured logging, distributed tracing. You cannot fix what you cannot see.
Good operational practices: Runbooks, on-call rotations, blameless post-mortems, training.
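One way to decouple mistakes from failures is a percentage-based feature flag. The sketch below is a hypothetical illustration (the flag name and functions are invented): each user is hashed to a stable bucket, so raising the percentage only adds users and never flips existing ones back and forth.

```python
import hashlib

def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout if their bucket falls below the percentage."""
    return rollout_bucket(flag_name, user_id) < rollout_percent

users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if is_enabled("new-timeline", u, 5)}
at_50 = {u for u in users if is_enabled("new-timeline", u, 50)}
print(len(at_5), len(at_50))
```

Setting the percentage back to 0 acts as a fast rollback that needs no redeploy, which ties into the "allow quick recovery" strategy above.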

Scalability

Scalability: The system’s ability to cope with increased load. It is not a binary yes/no property, but a question: “If the load increases by 10x, what are our options for keeping the system working?”

Scalability is always about a specific load parameter—the metric that characterizes the demand on the system. Identifying the right load parameter is half the work.

Understanding Load

Load parameters are the numbers that describe the demand on the system:

Requests per second to a web server
Ratio of reads to writes in a database
Number of simultaneously active users in a chat system
Cache hit rate
Number of followers per user (for fan-out calculations)

Different systems have different bottleneck parameters. For the home timeline case study:

The key parameter is not QPS alone but the distribution of follower counts. An average user might have around 200 followers, while a celebrity can have millions. The tail of the distribution determines the system architecture.

Two questions for describing scalability:

If load increases and resources stay the same, how is performance affected? (For example, how far does throughput fall or response time rise?)
If load increases by X, how much must resources grow to keep performance constant?

Shared-Memory, Shared-Disk, and Shared-Nothing Architectures

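This section is new to this chapter in the 2nd edition; the 1st edition’s Chapter 1 did not include it (the 1st edition introduced these terms in its Part II introduction). It provides a taxonomy of scaling approaches, from a single large machine to many independent nodes.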

Shared-Memory Architecture (SMP):

Multiple CPUs (or cores) share a single pool of RAM and disk
A single server with 128 cores and 1TB RAM
Pro: Simple programming model; any thread can access any data
Con: Cost scales non-linearly (high-end servers are very expensive); hardware becomes a single point of failure; practical upper limit (~a few hundred cores per machine)
When to use: When you haven’t yet hit the limits of the most powerful available machine; when simplicity outweighs cost

Shared-Memory (SMP):
┌─────────────────────────────────┐
│  CPU1  CPU2  CPU3  CPU4  ...    │
│      └──────┴──────┘            │
│           Shared RAM            │
│           Shared Disk           │
└─────────────────────────────────┘
Single machine, all resources shared

Shared-Disk Architecture:

Multiple machines each with their own CPU and RAM, but sharing a central disk (NAS/SAN) or object store (S3)
Cloud version: Separate compute clusters all reading from S3 (Snowflake, Redshift Spectrum)
Pro: Storage is highly available and elastic independently of compute; compute can be added/removed freely
Con: Shared disk can become a bottleneck; network latency to storage vs local disk
When to use: Cloud data warehouses (Snowflake, BigQuery); systems where compute and storage need to scale independently

Shared-Disk:
┌──────┐  ┌──────┐  ┌──────┐
│ CPU+ │  │ CPU+ │  │ CPU+ │
│ RAM  │  │ RAM  │  │ RAM  │
└──┬───┘  └──┬───┘  └──┬───┘
   │         │         │
   └─────────┴─────────┘
             │
      ┌──────▼──────┐
      │ Shared Disk │
      │  (NAS/SAN/  │
      │    S3)      │
      └─────────────┘

Shared-Nothing Architecture (SN / Horizontal Scaling):

Each machine has its own CPU, RAM, and disk; machines communicate only via a network
Data is partitioned (sharded) across machines; each machine handles a subset
Pro: Roughly linear cost scaling; no single hardware bottleneck; runs on commodity hardware and can be spread across multiple geographic regions
Con: Requires data partitioning (sharding); cross-node operations require network communication; distributed transactions are expensive; consistency is harder
When to use: When vertical scaling limits are reached; when fault tolerance across machines is required; most modern distributed databases

Shared-Nothing:
┌──────────┐   ┌──────────┐   ┌──────────┐
│ CPU+RAM  │   │ CPU+RAM  │   │ CPU+RAM  │
│ Disk     │   │ Disk     │   │ Disk     │
│ (shard A)│   │ (shard B)│   │ (shard C)│
└────┬─────┘   └────┬─────┘   └────┬─────┘
     └──────────────┼──────────────┘
                 Network
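In a shared-nothing system, something has to decide which node owns each key. A minimal sketch of hash-based routing follows; the shard names mirror the diagram, and the function name is hypothetical. It uses a stable hash from hashlib because Python's built-in hash() is salted per process for strings.

```python
import hashlib

SHARDS = ["shard-A", "shard-B", "shard-C"]

def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

placement = {}
for user_id in (f"user-{i}" for i in range(9)):
    placement.setdefault(shard_for(user_id), []).append(user_id)
print(placement)
```

Simple modulo hashing is only a starting point: changing the number of shards remaps most keys, which is one reason real systems use more careful partitioning schemes.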

Principles for Scalability

Vertical scaling (scale-up): Move to a more powerful machine. Simple—no code changes required. Limited by the most powerful available hardware and cost curves. First step when you hit a bottleneck.

Horizontal scaling (scale-out): Add more machines. Unlimited theoretical ceiling. Requires data distribution (sharding), load balancing, and application logic for handling distributed state.

Elastic vs manual scaling:

Elastic: System automatically adds resources when load increases (AWS Auto Scaling, Kubernetes HPA)
Manual: Humans analyze load and provision capacity ahead of time
Elastic is essential for highly variable workloads; manual is appropriate for predictable workloads where over-provisioning is controllable
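Elastic scaling usually comes down to a control rule that compares observed load with a target. The sketch below is modeled on the proportional rule described in the Kubernetes Horizontal Pod Autoscaler documentation (desired = ceil(current × observed / target)); it omits the tolerance band, stabilization windows, and other safeguards of the real controller, and the function and parameter names are hypothetical.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Proportional rule: scale replicas by the ratio of observed to target load."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Observed average CPU of 90% against a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))
# Load has dropped to 20%: scale in.
print(desired_replicas(4, current_metric=20, target_metric=60))
# A huge spike is clamped at max_replicas.
print(desired_replicas(4, current_metric=600, target_metric=60))
```

The min and max bounds are what keep over-provisioning controllable: they cap cost during a spike and keep a floor of capacity when load is low.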

Key principle: Architecture decisions for scalability are specific to the application’s load parameters. There is no magic one-size-fits-all scalable architecture. The Twitter hybrid timeline approach suits Twitter’s specific workload; it may be the wrong choice for a system with a different read/write ratio or follower distribution.

The two-pizza team rule (from Amazon, indirectly): Architecture should enable small teams to own and scale individual components independently. Coupling between components limits both organizational and system scalability.

Maintainability

Maintainability: Making it easy for engineering and operations teams to work on the system over time. The majority of software cost is maintenance (bug fixes, new features, operational work, adaptation to new platforms), not initial development.

The authors name three principles of maintainability: operability, simplicity, and evolvability.

Operability: Making Life Easy for Operations

Operability means making it easy for operations teams to keep the system running smoothly day to day.

Characteristics of a highly operable system:

Visibility: Good monitoring, distributed tracing, and alerting. “What is the system doing right now?” should always have an answer.
Automation support: Standard interfaces for deployment, configuration management, and health checking. Works with CI/CD pipelines and infrastructure-as-code tools.
Predictable behavior: Surprises are the enemy of operations. Systems should behave deterministically under known conditions. Avoid magic, auto-tuning behavior that changes without notice.
Good default behavior: Sensible defaults that work without manual tuning; easy-to-override when needed.
Documentation: Operational procedures, runbooks, architecture decision records (ADRs), dependency maps.
Avoiding single points of human knowledge: The system’s operation should not depend on one expert. Bus factor > 1.

Operations in 2026: Google’s SRE (Site Reliability Engineering) model has been widely adopted across the industry. Key SRE practices:

Error budgets: define acceptable unreliability; balance reliability with velocity
Toil reduction: automate repetitive manual work
Postmortems: blameless retrospectives after incidents to prevent recurrence

Simplicity: Managing Complexity

Simplicity means making it easy for new engineers to understand the system. This is distinct from simplicity of the user interface—a system can have a simple interface hiding enormous complexity.

Accidental vs essential complexity:

Essential complexity: The inherent complexity of the problem being solved (e.g., distributed consensus is fundamentally hard)
Accidental complexity: Complexity introduced by the implementation—it could be avoided with better design

Symptoms of accidental complexity:

Explosion of state space (too many special cases)
Tight coupling between modules (changes in one require changes in many others)
Tangled dependencies (circular references, implicit ordering)
Inconsistent naming and terminology
Hacks accumulated to work around earlier design mistakes
Undocumented implicit contracts between components

Solution: Abstraction. Good abstractions hide implementation complexity behind clean, stable interfaces. SQL abstracts away B-tree implementations. TCP abstracts away packet routing. A well-designed service API abstracts away the internal data model. Abstractions let you reason about a system at the right level of detail.

The abstraction quality test: A good abstraction hides complexity without leaking it. A leaky abstraction forces users to understand the hidden complexity anyway (e.g., ORM that requires knowing about SQL indexes).

Evolvability: Making Change Easy

Evolvability (also called extensibility, modifiability, or plasticity) means making it easy to change the system as requirements change.

Requirements change constantly: new features, business pivots, regulatory requirements, performance improvements, technology migrations. A system that is hard to change eventually calcifies—it can’t adapt, and the cost of change pushes teams toward riskier “big bang” rewrites.

Connection to simplicity: Simple, well-abstracted systems are easier to change. Accidental complexity is the primary enemy of evolvability.

Agile practices that support evolvability:

Test-driven development (TDD): Tests serve as a safety net for change; they catch regressions
Refactoring: Continuously improve the design without changing behavior
Continuous integration/continuous deployment (CI/CD): Small, frequent changes reduce risk vs large infrequent releases
Feature flags: Deploy code without activating features; activate gradually

Schema evolution: Data systems must handle schema changes gracefully. This connects to later chapters on encoding (Protobuf backward/forward compatibility), replication (schema changes during rolling upgrades), and storage (LSM trees and compaction).

Comparison Tables

Architecture Comparison: Shared-Memory vs Shared-Disk vs Shared-Nothing

Dimension	Shared-Memory (SMP)	Shared-Disk	Shared-Nothing
Memory	All CPUs share RAM	Each node has own RAM	Each node has own RAM
Storage	Shared	Shared (NAS/SAN/S3)	Each node has own disk
Communication	Memory bus	Network to storage	Network between nodes
Scalability ceiling	~few hundred cores	Storage IOPS	Theoretically unlimited
Cost curve	Non-linear (premium for large SMP)	Linear compute + separate storage	Linear
Consistency	Easy (shared memory)	Medium (shared storage)	Hard (distributed)
Fault tolerance	Hardware RAID; single PoF	High (distributed storage)	High (replication)
Examples	PostgreSQL on big VM	Snowflake, Redshift on S3	Cassandra, Spanner, Kafka
When to use	Haven’t hit SMP limits	Cloud warehouse scaling	Horizontal scale required

Fault Types and Mitigations

Fault Type	Characteristics	Detection	Mitigation
Hardware	Random, independent, predictable rate	Health checks, SMART	Redundancy, replication, failover
Software	Systematic, correlated, hard to predict	Monitoring, alerting, crash reports	Testing, isolation, chaos engineering
Human	Leading cause of outages, config-related	Audit logs, change management	Good design, staging, gradual rollout, monitoring

Scaling Approaches

Approach	Mechanism	Pros	Cons	When to Use
Vertical (scale-up)	Bigger machine	No code changes, simple	Limited ceiling, expensive, single PoF	Start here; most apps
Horizontal read scaling	Read replicas	Handle read-heavy workloads	Replication lag; writes still single-node	Read-heavy OLTP
Horizontal write scaling	Sharding	True write scalability	Complex; cross-shard queries hard	Write-heavy OLTP
Elastic auto-scaling	Auto add/remove nodes on load	Variable load; no over-provisioning	Warm-up time; state management	Cloud-native, variable load
Caching	Pre-computed results in memory	Dramatic read performance boost	Staleness; invalidation complexity	Read-heavy, repeated queries

SLI / SLO / SLA Definitions

Term	Full Name	What It Is	Example
SLI	Service Level Indicator	A measurable metric	“Fraction of requests with latency < 200ms”
SLO	Service Level Objective	Internal target for an SLI	“SLI must be ≥ 99.5% over a 28-day window”
SLA	Service Level Agreement	Contractual commitment with penalties	“If SLO is missed, customer gets 10% credit”
Error budget	—	How much unreliability is allowed	“1 - 99.5% = 0.5% failure budget per 28-day window”
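The SLI and SLO rows of the table can be turned into a small check. The window of 1,000 requests below is made-up data, and the function names are hypothetical.

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)

def meets_slo(sli, slo_target=0.995):
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= slo_target

# Hypothetical window: 1,000 requests, 8 of them slower than 200ms.
window = [50] * 992 + [750] * 8
sli = latency_sli(window)
print(sli, meets_slo(sli))
```

A 99.5% target over 1,000 requests allows 5 slow requests; with 8 slow requests the SLI is 99.2% and the SLO is missed.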

Important Points Summary

Nonfunctional requirements must be precise: “Fast” and “reliable” are useless targets. “p99 < 200ms at 50K RPS” is actionable. Define SLOs with specific numbers before designing.
The home timeline case study reveals the core scalability tension: Fan-out on write (push) vs fan-out on read (pull). Neither is universally right—the correct hybrid depends on the read/write ratio and follower distribution.
Percentiles, not averages: Averages hide tail latencies. Use p50, p95, p99. Tail latency amplification means p99 compounds rapidly in systems with many downstream calls.
Fault tolerance beats fault prevention: At scale, faults are inevitable. Design to contain faults, not to prevent them. Netflix Chaos Monkey, SRE error budgets, and circuit breakers are all applications of this principle.
Hardware faults are random; software faults are correlated: A software bug can take down every replica simultaneously, whereas hardware failures are largely independent of one another. The two call for different mitigation strategies.
Human error is the leading cause of outages: Design systems that make the wrong thing hard—staging environments, gradual rollout, fast rollback, detailed monitoring.
Shared-nothing scales linearly but adds distributed complexity: The choice of memory architecture determines where complexity lives. Shared-memory is simple but expensive; shared-nothing is scalable but requires sharding and distributed state management.
Simplicity and evolvability are connected: Accidental complexity is the primary enemy of the ability to change the system. Abstractions that hide essential complexity without leaking it are the primary tool.
Maintenance costs more than initial development: Most of the software budget goes to maintenance. Design for the engineers who will own this system in 3 years, not just for the deadline.
Operability requires visibility: A system you cannot observe is a system you cannot operate. Monitoring, tracing, and structured logging are not optional features—they are foundational infrastructure.

1st Edition vs 2nd Edition: What Changed

Aspect	1st Edition (Ch1)	2nd Edition (Ch2)
Case study	Twitter (statistics cited)	Social network (full worked example with SQL and fan-out)
Memory architectures	Not covered in Ch1	New section: Shared-Memory, Shared-Disk, Shared-Nothing
SLI/SLO/SLA	Mentioned briefly	Formal definitions with error budget concept
Tail latency amplification	Mentioned	Quantified with probability calculation
Fault coverage	Hardware, software, human	Same, but more structured with fault tolerance patterns
Operability	Bullet list	Connected to SRE practices

Modern Context (2026)

Percentiles are now standard practice: p99 and p99.9 are routinely reported by observability platforms such as Datadog, New Relic, and Honeycomb. The 1st edition had to argue for this; in 2026, it is assumed.

SRE is mainstream: Google’s SRE model (error budgets, SLIs/SLOs, toil reduction) is now standard at companies of all sizes, not just Google-scale. The “SRE Book” and “Site Reliability Workbook” are standard references.

Platform engineering has emerged: Internal developer platforms (IDPs) built by dedicated platform teams provide self-service infrastructure, automated deployments, and standardized observability. Tools: Backstage (developer portal), Crossplane (infrastructure as code), ArgoCD (GitOps).

Chaos engineering is a discipline: Netflix Chaos Monkey, Gremlin, and Chaos Mesh (Kubernetes) are production tools. Chaos engineering is not a stunt—it is a systematic approach to verifying fault tolerance before faults occur in production.

Auto-scaling is the default: Kubernetes Horizontal Pod Autoscaler (HPA), AWS Auto Scaling groups, and cloud function auto-scaling have made elastic scaling the expected baseline. Manual capacity planning is now considered a legacy pattern for most web workloads.

Shared-nothing at cloud scale: The shared-nothing model underlies many large-scale distributed data systems (for example DynamoDB, Spanner, Cassandra, and Kafka), while cloud data warehouses such as Snowflake and BigQuery separate compute from shared storage. The 2026 landscape is consistent with the 2nd edition’s treatment of shared-nothing as a dominant pattern for large-scale systems.

Questions for Reflection

The home timeline case study uses a hybrid fan-out approach. At what follower count threshold would you switch from push to pull? How would you measure this in production?
If your system’s p99 latency is 200ms but p99.9 is 5 seconds, what does that suggest about the system’s behavior? What would you investigate first?
Explain tail latency amplification with a specific example: a web request that calls 5 microservices in parallel, each with p99 of 100ms. What is the approximate p99 of the overall request?
A startup is deciding between a single PostgreSQL node with vertical scaling vs. a Cassandra cluster from the start. What questions would you ask to guide this decision?
What is the difference between essential and accidental complexity? Give an example of each from a real system you’ve worked on or studied.
How would you design a social network’s home timeline system to handle GDPR right to erasure for posts? Specifically, when a user deletes a post, how do you update millions of cached timelines?

Related Resources

ch01-tradeoffs-data-systems — Architectural context (OLTP/OLAP, cloud, distributed vs single-node)
ch06-replication — How replication enables fault tolerance and read scaling
ch07-sharding — Shared-nothing scaling via data partitioning
ch08-transactions — Reliability at the transaction level (ACID)
ch09-trouble-with-distributed-systems — Hardware, network, and clock failures in distributed systems
ch10-consistency-and-consensus — The consistency side of fault-tolerant design
ch01-reliable-scalable-maintainable — 1st edition Chapter 1 (same topics, different structure)

Last Updated: 2026-05-29

Study Notes by Niladri & AI
