Chapter 1: Trade-Offs in Data Systems Architecture
ddia-2e architecture tradeoffs oltp-olap cloud
Status: Notes complete
Overview
Chapter 1 of the 2nd edition is entirely new—it replaces the 1st edition’s opening on reliability/scalability with a broader architectural framing that situates the rest of the book. The central argument is that every data system design involves a set of recurring, high-stakes trade-offs: operational vs. analytical workloads, cloud vs. self-hosted infrastructure, and distributed vs. single-node deployments. Getting these three dimensions right before writing a line of code determines most of a system’s long-term cost, complexity, and capability. As a concrete example, a startup choosing a managed cloud warehouse (Snowflake) over a self-hosted Hadoop cluster is making all three trade-offs simultaneously—analytical over operational, cloud over self-hosted, and distributed over single-node—and that choice shapes every subsequent architectural decision.
Key Concepts
Operational Versus Analytical Systems
The most fundamental split in data systems is between systems that serve live user requests and systems that answer questions about data at rest. This distinction predates cloud computing; it emerged in the 1990s when it became clear that running complex analytical queries on the same database that served customers would cripple performance for both.
OLTP (Online Transaction Processing) systems handle the interactive operations of a running application: inserting a new order, reading a user’s profile, updating account balances. The defining characteristics are low latency (milliseconds), high concurrency, and small reads/writes that touch a handful of rows. Every time you click “buy” on an e-commerce site or swipe a payment card, an OLTP system processes that request. Examples: PostgreSQL, MySQL, DynamoDB, Amazon Aurora, Google Spanner.
OLAP (Online Analytical Processing) systems answer aggregation questions over large datasets: “What were total sales by region last quarter?”, “Which customers churned in the last 30 days?” These queries scan millions or billions of rows, compute aggregates, and return a result that no one is waiting on in real time. Latency is measured in seconds to minutes. Examples: Snowflake, Google BigQuery, Amazon Redshift, Apache Spark, ClickHouse.
The distinction matters because the two workloads have opposing optimization strategies. OLTP benefits from row-oriented storage (read whole rows fast), indexes, and low-overhead transaction processing. OLAP benefits from columnar storage (read only the columns you need), compression, and vectorized execution across full table scans.
Characterizing Transaction Processing and Analytics
“Transaction” in OLTP does not necessarily mean ACID transactions; it means any low-latency interactive read or write. The access pattern is random-access by primary key or index: find the record for user 12345, update order 98765. Write patterns are frequent small inserts and updates. Query optimizer aims to minimize the number of rows touched.
OLAP access is fundamentally different: sequential scans over entire columns, aggregation (SUM, COUNT, AVG), grouping, and filtering. A single OLAP query might read 100 GB of compressed column data. Writes are typically batch-loaded in bulk (ETL/ELT). The query optimizer aims to minimize I/O by reading only the columns and row groups that the query touches.
Key insight: Running OLAP queries on an OLTP system hurts both workloads. Long sequential scans hold locks, evict the OLTP working set from the buffer cache, and spike CPU. This is why organizations separate them.
Data Warehousing
A data warehouse is a dedicated analytical store designed to receive data from one or more OLTP source systems and serve OLAP queries. The classic architecture:
OLTP Systems (source of truth)
PostgreSQL, MySQL, DynamoDB
|
ETL / ELT pipeline
|
Data Warehouse (analytical)
Snowflake, BigQuery, Redshift
|
BI Tools / SQL Clients
Tableau, Looker, dbt
The ETL (Extract-Transform-Load) process copies data from operational databases into the warehouse, transforming it into a query-optimized schema (typically star schema or snowflake schema). Modern warehouses favor ELT—load raw data first, transform inside the warehouse—because the warehouse’s computation is cheap.
Data lakes (S3, GCS, Azure Data Lake) extend the warehouse concept by storing raw, unprocessed data in open formats (Parquet, ORC, Iceberg) at very low cost. A lakehouse (Databricks Delta Lake, Apache Iceberg + Trino) combines the cheap storage of a data lake with the query performance and ACID semantics of a warehouse.
HTAP (Hybrid Transactional/Analytical Processing) is the emerging approach where a single system serves both workloads by maintaining separate row-oriented and column-oriented storage internally (TiDB, Google AlloyDB, CockroachDB). The benefit is eliminating the ETL pipeline and getting analytical results on live operational data. The trade-off is higher system complexity.
Systems of Record and Derived Data
A critical organizing principle in data architecture is the distinction between systems of record (also called source of truth) and derived data systems.
A system of record is the authoritative version of the data. When two systems disagree, the system of record wins. It is typically normalized, write-optimized, and transactional. Examples: the operational PostgreSQL database holding current user accounts, the Kafka topic carrying raw event streams.
Derived data is computed from the system of record by applying some transformation, aggregation, or denormalization. The data warehouse, the search index, the recommendation model’s feature store, the pre-computed timeline cache—all are derived. They can be reconstructed from the system of record if corrupted or lost. They are optimized for read performance at the cost of synchronization complexity.
This distinction matters enormously for consistency: derived data can be stale, but it should be explicitly acknowledged as such. A common failure mode is treating a derived cache as a system of record, leading to split-brain inconsistencies when the transformation pipeline falls behind.
Cloud Versus Self-Hosting
Before ~2010, almost every organization hosted its own data infrastructure: racks of servers in a data center, managed by an operations team. Today most new systems are built on public cloud (AWS, Azure, GCP), and many legacy systems are migrating. But self-hosting remains relevant for specific use cases.
Pros and Cons of Cloud Services
Cloud advantages:
- Operational simplicity: Managed databases (RDS, Aurora, Cloud SQL) handle patching, failover, backups, and scaling automatically. A 3-person startup can run a highly available multi-AZ PostgreSQL cluster with point-in-time recovery without a single dedicated DBA.
- Elastic scaling: Add capacity in minutes. Scale down when load drops—this is economically impossible with owned hardware.
- Global footprint: Deploy in 30+ regions without building data centers. Crucial for low-latency global applications.
- Managed services ecosystem: Kafka (MSK, Confluent Cloud), Redis (ElastiCache), Elasticsearch (OpenSearch)—use battle-hardened infrastructure without operating it.
- Pay-per-use economics: No upfront hardware purchase; cost tracks usage. Favorable for variable workloads.
Cloud disadvantages:
- Cost at scale: For predictable, sustained, high-volume workloads, reserved cloud capacity often costs more than equivalent owned hardware. Dropbox famously saved ~$75M over 2 years by repatriating storage from AWS.
- Vendor lock-in: Proprietary APIs (DynamoDB, Bigtable, Cosmos DB) create switching costs. Open standards (S3-compatible APIs, PostgreSQL wire protocol) mitigate but don’t eliminate this.
- Data gravity and egress costs: Moving large datasets between cloud providers or out of the cloud is expensive. Data tends to “stay” near where it was created.
- Regulatory and sovereignty constraints: Some data (healthcare in the EU, government data in many countries) must reside in specific jurisdictions, which cloud providers may or may not support.
- Reduced control over hardware and network topology: Critical for ultra-low-latency financial systems or specialized hardware (GPU clusters, FPGAs).
Cloud Native System Architecture
Cloud-native means designing systems specifically for cloud primitives rather than porting on-premises patterns to the cloud. Key principles:
- Treat infrastructure as disposable: Instances terminate unexpectedly. Design for stateless compute; store all durable state in managed storage services.
- Decouple compute and storage: Cloud warehouses (Snowflake, BigQuery) separate query processing from object storage. Scale each independently.
- Prefer managed services over self-operated components: Let the cloud provider operate Kafka, Redis, and PostgreSQL unless you have specific reasons not to.
- Design for availability zones: Spread replicas across AZs within a region for fault tolerance without cross-region latency costs.
- Use object storage as the universal substrate: S3 and equivalents offer 11 nines durability at low cost. Build data pipelines around Parquet-on-S3 rather than HDFS.
Cloud-Native Data Architecture (2026):
Sources Ingestion Storage Serving
┌─────────┐ ┌──────────────────┐ ┌──────────┐ ┌──────────────┐
│ Apps │───▶│ Kafka / Kinesis │─▶│ S3 / │─▶│ Snowflake / │
│ APIs │ │ (event streams) │ │ GCS │ │ BigQuery │
│ CDC │ └──────────────────┘ │ (Parquet)│ │ (warehouse) │
└─────────┘ └──────────┘ └──────────────┘
│ │
┌────▼──────────────▼────┐
│ dbt / Spark transform │
└────────────────────────┘
Operations in the Cloud Era
The cloud shifts operations from infrastructure management to configuration management. Instead of racking servers and installing OS patches, platform teams manage IAM policies, VPC configurations, and Terraform modules. The SRE role evolves: less capacity planning, more reliability engineering at the application and pipeline level.
Key operational difference: On-premises failures are typically hardware events (disk dies, NIC fails) that ops teams detect and remediate physically. Cloud failures are often transient (spot instance preemption, throttling, AZ availability) and should be handled by the application itself via retries, circuit breakers, and redundancy across AZs.
Distributed Versus Single-Node Systems
Adding distribution—spreading a system across multiple machines—dramatically increases complexity. Many systems that “go distributed” do so prematurely, paying the complexity cost before the scale need materializes.
Problems with Distributed Systems
Distribution introduces a class of problems that simply do not exist on a single machine:
- Partial failures: In a distributed system, some components can fail while others continue running. A single-node system either works or doesn’t. This makes reasoning about system state far harder.
- Network partitions: The network between nodes can fail, creating a split where each partition thinks the other is down. The CAP theorem formalizes the impossible trade-off: during a partition, choose consistency (reject writes on the minority side) or availability (accept writes that may diverge).
- Clock skew: There is no global clock in a distributed system. Event ordering must be achieved through logical clocks (Lamport timestamps, vector clocks) or specialized hardware (Google TrueTime in Spanner). This matters enormously for conflict resolution in replicated data.
- Byzantine faults: In open networks (blockchains, multi-tenant environments), nodes may lie or behave arbitrarily. Most distributed databases assume crash-fail (nodes fail silently) not Byzantine behavior.
- Coordination overhead: Any operation requiring agreement across nodes (distributed transactions, leader election, schema changes) incurs latency and throughput costs proportional to the number of nodes involved.
The authors’ key recommendation: do not distribute until you must. A single beefy PostgreSQL instance can handle more load than most applications will ever see. Distributed systems are the right tool when you need fault tolerance across availability zones, horizontal write scaling beyond what one machine can handle, or geographic distribution of data.
Microservices and Serverless
Microservices decompose a monolithic application into independently deployable services, each with its own database. This is a distributed architecture even if each service runs on a single node. The trade-offs:
- Benefits: Independent deployment and scaling, team autonomy, fault isolation between services, technology heterogeneity.
- Costs: Network calls replace in-process function calls (latency increases, partial failures possible), data consistency across services requires careful design (no cross-service ACID transactions), operational complexity multiplies.
A common failure mode is adopting microservices too early, before team size and deployment frequency justify the overhead. A monolith with a well-structured codebase often outperforms a premature microservices architecture on every dimension until the team reaches 50+ engineers or deploys dozens of times per day.
Serverless (AWS Lambda, Google Cloud Functions, Azure Functions) takes distribution further by removing the server concept entirely. Each function invocation runs independently, with no persistent state. Benefits: zero operational overhead, automatic scaling to zero. Costs: cold start latency, execution time limits, no persistent in-memory state, vendor lock-in.
Serverless is not a universal pattern. It excels for event-driven, stateless workloads (API handlers, image resizing, webhook processing). It struggles for long-running computations, stateful streaming, or latency-sensitive workloads where cold starts are unacceptable.
Cloud Computing Versus Supercomputing
Distributed data systems share ancestry with high-performance computing (HPC) but have diverged in philosophy:
| Aspect | Cloud / Commodity | Supercomputer / HPC |
|---|---|---|
| Hardware | Commodity servers, high failure rate | Specialized hardware, high reliability |
| Networking | Ethernet (lossy, variable latency) | InfiniBand (lossless, sub-microsecond) |
| Failure model | Design around frequent failures | Checkpoint/restart, assume rare failures |
| Programming model | Shared-nothing, message passing | MPI, shared memory, tight coupling |
| Workload | Varied, interactive + batch | Primarily batch (simulation, ML training) |
| Cost | Pay-per-use, elastic | Large upfront investment |
Modern ML training (GPUs, TPUs) blurs this line—large transformer training runs look more like HPC than cloud, while inference serving looks like cloud.
Data Systems, Law, and Society
A chapter unique to the 2nd edition: data systems do not exist in a legal or ethical vacuum. Engineers building data pipelines and storage systems are making decisions with real societal consequences.
Key legal frameworks (as of 2026):
- GDPR (EU): Right to erasure, data minimization, purpose limitation, cross-border transfer restrictions. Requires technical mechanisms: the ability to delete all data for a given user, even in denormalized derived stores.
- CCPA/CPRA (California): Similar consumer rights in the US. “Right to know,” “right to delete,” “right to opt-out of sale.”
- HIPAA (US healthcare): Strict rules on storing and transmitting protected health information (PHI). Requires encryption, audit logs, access controls.
- AI Act (EU, 2024): Classifies AI systems by risk; high-risk systems (hiring, credit scoring, medical) require transparency and human oversight.
Technical implications for architects:
- Store user IDs, not names/emails, in event logs and analytics pipelines. Personal data should be in a single authoritative store, referenced by ID everywhere else.
- Design for right to erasure: Can you delete one user’s data without corrupting aggregate analytics? Cryptographic erasure (encrypt all user data under a per-user key; delete the key) is a practical approach.
- Data lineage: Know where every piece of data came from and what transformations it has passed through. Required for audit trails and demonstrating compliance.
- Purpose limitation: Data collected for one purpose should not be repurposed without consent. Architectural separation of data stores by purpose enforces this technically.
Comparison Tables
OLTP vs OLAP
| Dimension | OLTP | OLAP |
|---|---|---|
| Primary use | Serve live application requests | Answer analytical questions |
| Read pattern | Fetch 1–100 rows by key or index | Scan millions–billions of rows, aggregate columns |
| Write pattern | Frequent small inserts/updates/deletes | Bulk batch loads (ETL/ELT) |
| Latency target | < 10ms (p99 < 100ms) | Seconds to minutes acceptable |
| Concurrency | Thousands of concurrent transactions | Tens of concurrent queries |
| Storage layout | Row-oriented | Column-oriented |
| Indexes | B-tree indexes on many columns | Sparse indexes, zone maps, bloom filters |
| Schema style | Normalized (3NF) | Denormalized (star/snowflake schema) |
| Examples | PostgreSQL, MySQL, DynamoDB, Spanner | Snowflake, BigQuery, Redshift, ClickHouse |
| Scale challenge | Write throughput, low-latency reads | Query throughput on large datasets |
Cloud vs Self-Hosting
| Dimension | Cloud (Managed) | Self-Hosted (On-Prem / Colocation) |
|---|---|---|
| Cost structure | Variable (pay-per-use); high at sustained scale | High upfront capex; lower marginal cost at scale |
| Ops burden | Low (patching, failover managed) | High (full stack owned by ops team) |
| Scaling | Elastic, minutes | Slow (procurement cycles: weeks–months) |
| Control | Limited (black-box services) | Full (tune kernel, choose hardware) |
| Global reach | 30+ regions instantly | Requires building/leasing data centers |
| Vendor risk | Lock-in; price changes; outages outside your control | Full control; no external dependency |
| Latency | Virtualization overhead; AZ network latency | Bare-metal, predictable latency |
| Best for | Startups, variable workloads, fast iteration | Finance, regulated industries, large sustained workloads |
| Examples | RDS, DynamoDB, Snowflake, GKE | PostgreSQL on-prem, Cassandra in colocation, bare-metal Kafka |
Distributed vs Single-Node
| Dimension | Single-Node | Distributed |
|---|---|---|
| Consistency | Simple—no coordination needed | Hard—must choose consistency model |
| Fault tolerance | Single point of failure | Can survive node failures |
| Scalability | Limited by hardware (vertical only) | Horizontal scale-out |
| Complexity | Low | High (partitioning, replication, consensus) |
| Latency | In-process calls are fast | Network calls add latency |
| Transactions | ACID is easy | Distributed transactions are expensive |
| Operational cost | Low | High (more machines, more failure modes) |
| When to use | Start here; use until bottleneck | When fault tolerance or scale demands it |
| Examples | PostgreSQL, SQLite, Redis (single node) | Cassandra, Spanner, CockroachDB, Kafka clusters |
Important Points Summary
- OLTP and OLAP have incompatible optimization strategies: Serving both from one system requires careful isolation (HTAP adds complexity) or explicit separation (warehouse + pipeline).
- ETL pipelines create derived data: The data warehouse is not a second source of truth—it is derived from operational systems. Treat it as such to avoid consistency confusion.
- Cloud economics favor variable workloads: For predictable high-volume workloads, cloud costs can exceed self-hosting. Dropbox, Basecamp, and others demonstrate this at scale.
- Cloud-native means designing around managed primitives: Decouple compute and storage; design for transient failures; treat instances as disposable.
- Distribution is a last resort for complexity management: The problems introduced by distribution (partial failures, clock skew, network partitions) are qualitatively harder than single-node challenges. Distribute when fault tolerance or scale demands it, not by default.
- Microservices trade deployment independence for distributed systems complexity: Appropriate at sufficient team and deployment scale; dangerous when premature.
- Serverless is optimized for stateless, event-driven workloads: Cold starts, time limits, and no persistent state make it unsuitable for stateful or latency-sensitive applications.
- GDPR and similar laws create concrete technical requirements: Right to erasure, data lineage, and purpose limitation must be designed in, not bolted on.
- Systems of record vs derived data is the most important architectural distinction: Confusion between the two is the root cause of many data consistency bugs.
- HTAP (Hybrid OLTP+OLAP) is converging the two worlds: TiDB, AlloyDB, and SingleStore attempt both in one system—watch this space but understand the complexity trade-off.
Modern Context (2026)
The cloud data warehouse has become the default OLAP tier: Snowflake and BigQuery have effectively commoditized analytical query execution. The competitive differentiation has moved to the pipeline (dbt, Fivetran, Airbyte) and the semantic layer (Cube, Lightdash).
Data lakehouses are displacing pure warehouses at scale: Apache Iceberg + Trino (or Spark) on S3 offers warehouse-like semantics (ACID, time travel, schema evolution) on cheap object storage. Databricks and Snowflake are converging on open table formats.
The “modern data stack” is now the baseline: CDC (Debezium, Airbyte) → Kafka/Kinesis → Iceberg/Delta on S3 → dbt → Snowflake/BigQuery → Looker/Tableau. This 5-layer pipeline is standard enough that “data platform” teams spend most time on governance, quality, and cost rather than infrastructure.
Kubernetes has become the distributed systems abstraction layer: Where teams once chose between microservices and monoliths, they now also choose between K8s-native microservices and serverless functions. The operational complexity of K8s is non-trivial but manageable with managed clusters (EKS, GKE, AKS).
AI inference workloads are creating a third access pattern beyond OLTP and OLAP: Low-latency vector search (sub-10ms approximate nearest neighbor over billions of vectors) doesn’t fit cleanly into either category. Dedicated vector databases (Pinecone, Weaviate, Qdrant) and vector extensions to PostgreSQL (pgvector) are addressing this.
Regulatory complexity is increasing: The EU AI Act (2024), US state privacy laws, and India’s PDPB all add jurisdiction-specific technical requirements. Data residency, audit logging, and model governance are now table stakes for enterprise data systems.
Questions for Reflection
- A startup is building a product analytics pipeline. They’re debating Kafka → BigQuery vs. direct PostgreSQL inserts with a materialized views layer. What questions would you ask to guide this decision?
- Why can’t you simply run OLAP queries against your OLTP database during off-peak hours? What goes wrong?
- A company processes 50 TB/day of clickstream data. At what point does the economics of Snowflake vs. self-hosted Spark on EC2 flip? What factors drive that calculation?
- Explain why “right to erasure” under GDPR is technically hard to implement in a data warehouse. What design patterns make it tractable?
- Your team is considering splitting a monolith into microservices. What thresholds (team size, deployment frequency, service count) would you use to decide if this is the right time?
- A financial services company is debating moving their transaction processing database from on-premise Oracle to AWS Aurora. What are the three most important questions they need to answer before deciding?
Related Resources
- ch02-nonfunctional-requirements — How to define reliability, scalability, maintainability for the systems described here
- ch06-replication — Distributed vs single-node replication trade-offs
- ch07-sharding — When and how to partition data across nodes
- ch08-transactions — ACID in OLTP; why distributed transactions are expensive
- ch11-batch-processing — The ETL/ELT pipelines feeding data warehouses
- ch12-stream-processing — Event-driven ingestion (Kafka, CDC) into derived stores
- ch01-reliable-scalable-maintainable — 1st edition Chapter 1 for comparison
Last Updated: 2026-05-29