Chapter 1 Cheat Sheet — Trade-Offs in Data Systems Architecture
One-Line Summaries
Concept
One-Liner
OLTP
Low-latency row reads/writes for live application requests
OLAP
High-throughput column scans for analytical aggregations
Data Warehouse
Derived analytical store fed by ETL from OLTP systems
Data Lake
Cheap raw storage (S3 + Parquet); query via lakehouse engine
HTAP
Single system serving both OLTP and OLAP (TiDB, AlloyDB)
System of Record
Authoritative source of truth; wins all conflicts
Derived Data
Computed from system of record; can be rebuilt if lost
Cloud-Native
Designed around managed cloud primitives, not ported from on-prem
Microservices
Independent deployable services, each with its own DB
Serverless
Stateless function invocations; no persistent server state
Quick Numbers to Remember
Metric
Value
Context
OLTP latency target
< 10ms p50, < 100ms p99
User-facing interactive requests
OLAP latency tolerance
Seconds to minutes
Analytical queries, no user waiting
Snowflake / BigQuery
Petabyte scale
Cloud warehouses, pay-per-query
Dropbox repatriation
~$75M saved over 2 years
Moving storage from AWS to own hardware
GDPR right to erasure
30 days to comply
Technical: cryptographic erasure is practical
Microservices inflection
~50+ engineers
Before this, monolith often wins
S3 durability
99.999999999% (11 nines)
Object storage as universal substrate
OLTP vs OLAP: Key Distinctions
OLTP OLAP
──────────────────────────────── ──────────────────────────────────
Access: random by key/index Access: sequential column scan
Write: frequent small updates Write: bulk batch loads
Latency: milliseconds Latency: seconds to minutes
Concurrency: thousands of txns Concurrency: tens of queries
Storage: row-oriented Storage: column-oriented
Schema: normalized (3NF) Schema: star/snowflake
Examples: Postgres, DynamoDB Examples: Snowflake, BigQuery
Data Architecture Decision Tree
What's the primary workload?
│
├─ Serve live app requests → OLTP
│ ├─ Single region, < 10K QPS → PostgreSQL / MySQL (single node)
│ ├─ Multi-region / global → Spanner / CockroachDB / DynamoDB
│ └─ Serverless / event-driven → DynamoDB / Aurora Serverless
│
├─ Answer analytics questions → OLAP
│ ├─ < 10TB, team < 10 engineers → BigQuery / Snowflake (pay-per-query)
│ ├─ > 100TB, cost-sensitive → Iceberg on S3 + Trino (lakehouse)
│ └─ Real-time on live data → HTAP (TiDB, AlloyDB)
│
└─ Both workloads → Separate systems + ETL/CDC pipeline
├─ OLTP (system of record)
└─ OLAP warehouse (derived)
Cloud vs Self-Hosting Decision Tree
Choosing hosting model?
│
├─ Variable / unpredictable load → Cloud (elastic scaling)
├─ Startup / fast iteration → Cloud (managed ops)
├─ Regulated data (EU, HIPAA) → Cloud with data residency controls
├─ Sustained high-volume, cost-sensitive → Self-hosted (break-even ~$5M ARR)
├─ Ultra-low latency (< 1ms) → Self-hosted bare metal
└─ Global footprint from day one → Cloud (30+ regions)
System of Record vs Derived Data
System of Record (Source of Truth)
├─ Authoritative; wins conflicts
├─ Normalized, write-optimized
├─ Examples: PostgreSQL (user accounts), Kafka (event log)
└─ Ask: "Is this the ONLY place this data lives?"
│
│ ETL / CDC / transformation
▼
Derived Data (Computed Copy)
├─ Can be rebuilt from system of record
├─ Optimized for specific read patterns
├─ Examples: Snowflake warehouse, Elasticsearch index, Redis cache
└─ Ask: "If I delete this, can I recreate it from source?"
Red flag: Treating derived data as a system of record. Leads to split-brain when pipelines lag.
Distributed System Problems (Quick Reference)
Problem | Cause | Mitigation
────────────────────|──────────────────────────|──────────────────────────
Partial failures | Some nodes fail, others | Circuit breakers, retries,
| continue | idempotency
Network partition | Network splits cluster | CAP trade-off: choose
| into isolated groups | CP or AP
Clock skew | No global clock | Logical clocks, NTP bounds
Byzantine faults | Nodes lie/corrupt msgs | BFT protocols (rare in
| | internal systems)
Coordination cost | Consensus requires | Minimize coordination;
| round trips | avoid distributed txns
Microservices vs Monolith Trade-offs
Aspect
Monolith
Microservices
Deployment
Deploy everything together
Deploy each service independently
Data consistency
ACID within one DB
Eventual consistency across services
Team scaling
Harder past ~20-30 engineers
Better with large, autonomous teams
Latency
In-process function calls
Network calls (add latency + failure modes)
Debugging
Single stack trace
Distributed tracing needed
When to use
Start here; most teams
> 50 engineers, > 10 deploys/day
Serverless vs Containers vs VMs
Serverless (Lambda, Cloud Functions)
├─ Pro: Zero ops, auto-scales to 0, pay-per-invocation
└─ Con: Cold starts (~100ms-1s), time limits, no persistent state
Containers (ECS, Kubernetes, Cloud Run)
├─ Pro: Flexible, portable, good density
└─ Con: Operational overhead (orchestration, networking, scaling)
VMs (EC2, GCE)
├─ Pro: Full control, predictable performance, long-running workloads
└─ Con: Slow to provision, pay for idle capacity
Legal / Compliance Quick Reference
Regulation
Region
Key Technical Requirement
GDPR
EU
Right to erasure, data minimization, 72h breach notification
CCPA/CPRA
California
Right to know, right to delete, opt-out of sale
HIPAA
US Healthcare
PHI encryption, audit logs, access controls
AI Act
EU (2024)
High-risk AI needs transparency, human oversight
Practical pattern for right to erasure:
Instead of: Store name/email in every event log
Do this: Store userId (opaque) in events
Name/email only in user profile store
To "erase": delete profile; events become anonymous
Alternative: Encrypt all user data under per-user key
Delete the key → data is cryptographically erased
Note: Each layer is swappable. The architecture pattern is stable even as specific tools evolve.
Key Trade-offs Summary
Decision
Choose A
Choose B
Deciding Factor
OLTP vs OLAP
OLTP
OLAP
Interactive writes vs analytical reads
Warehouse vs Lakehouse
Snowflake/BigQuery
Iceberg on S3
Cost sensitivity, query engine flexibility
Cloud vs self-host
Cloud
Self-host
Scale predictability, team ops capacity
Monolith vs microservices
Monolith
Microservices
Team size, deployment frequency
Serverless vs containers
Serverless
Containers
Statefulness, latency requirements
Distributed vs single-node
Single-node
Distributed
Fault tolerance and scale requirements
Common Architectural Mistakes
Mistake 1: OLAP queries on OLTP DB
Problem: Kills transaction performance, evicts buffer cache
Fix: Separate warehouse + ETL/CDC pipeline
Mistake 2: Treating derived data as system of record
Problem: Split-brain when pipeline lags; corrupt analytics
Fix: Document which store is authoritative
Mistake 3: Premature microservices
Problem: All distributed system complexity, no team scale benefit
Fix: Monolith first; migrate when team/deploy frequency demands it
Mistake 4: No plan for GDPR right to erasure
Problem: Can't delete user data from event logs/analytics
Fix: Store userId (not PII) in events; use cryptographic erasure
Mistake 5: Distributing before hitting single-node limits
Problem: Unnecessary complexity
Fix: Profile single-node PostgreSQL under realistic load first
Quick Revision Time
5-minute review: One-Line Summaries table + OLTP vs OLAP key distinctions 15-minute review: Add all decision trees + comparison tables 30-minute review: Full cheatsheet + trace through Modern Data Stack Last Updated: 2026-05-29