Chapter 1 Cheat Sheet — Trade-Offs in Data Systems Architecture

One-Line Summaries

Concept	One-Liner
OLTP	Low-latency row reads/writes for live application requests
OLAP	High-throughput column scans for analytical aggregations
Data Warehouse	Derived analytical store fed by ETL from OLTP systems
Data Lake	Cheap raw storage (S3 + Parquet); query via lakehouse engine
HTAP	Single system serving both OLTP and OLAP (TiDB, AlloyDB)
System of Record	Authoritative source of truth; wins all conflicts
Derived Data	Computed from system of record; can be rebuilt if lost
Cloud-Native	Designed around managed cloud primitives, not ported from on-prem
Microservices	Independent deployable services, each with its own DB
Serverless	Stateless function invocations; no persistent server state

Quick Numbers to Remember

Metric	Value	Context
OLTP latency target	< 10ms p50, < 100ms p99	User-facing interactive requests
OLAP latency tolerance	Seconds to minutes	Analytical queries, no user waiting
Snowflake / BigQuery	Petabyte scale	Cloud warehouses, pay-per-query
Dropbox repatriation	~$75M saved over 2 years	Moving storage from AWS to own hardware
GDPR right to erasure	30 days to comply	Technical: cryptographic erasure is practical
Microservices inflection	~50+ engineers	Before this, monolith often wins
S3 durability	99.999999999% (11 nines)	Object storage as universal substrate

OLTP vs OLAP: Key Distinctions

OLTP                                OLAP
────────────────────────────────    ──────────────────────────────────
Access: random by key/index         Access: sequential column scan
Write: frequent small updates       Write: bulk batch loads
Latency: milliseconds               Latency: seconds to minutes
Concurrency: thousands of txns      Concurrency: tens of queries
Storage: row-oriented               Storage: column-oriented
Schema: normalized (3NF)            Schema: star/snowflake
Examples: Postgres, DynamoDB        Examples: Snowflake, BigQuery

Data Architecture Decision Tree

What's the primary workload?
│
├─ Serve live app requests → OLTP
│  ├─ Single region, < 10K QPS → PostgreSQL / MySQL (single node)
│  ├─ Multi-region / global → Spanner / CockroachDB / DynamoDB
│  └─ Serverless / event-driven → DynamoDB / Aurora Serverless
│
├─ Answer analytics questions → OLAP
│  ├─ < 10TB, team < 10 engineers → BigQuery / Snowflake (pay-per-query)
│  ├─ > 100TB, cost-sensitive → Iceberg on S3 + Trino (lakehouse)
│  └─ Real-time on live data → HTAP (TiDB, AlloyDB)
│
└─ Both workloads → Separate systems + ETL/CDC pipeline
   ├─ OLTP (system of record)
   └─ OLAP warehouse (derived)

Cloud vs Self-Hosting Decision Tree

Choosing hosting model?
│
├─ Variable / unpredictable load → Cloud (elastic scaling)
├─ Startup / fast iteration → Cloud (managed ops)
├─ Regulated data (EU, HIPAA) → Cloud with data residency controls
├─ Sustained high-volume, cost-sensitive → Self-hosted (break-even ~$5M ARR)
├─ Ultra-low latency (< 1ms) → Self-hosted bare metal
└─ Global footprint from day one → Cloud (30+ regions)

System of Record vs Derived Data

System of Record (Source of Truth)
├─ Authoritative; wins conflicts
├─ Normalized, write-optimized
├─ Examples: PostgreSQL (user accounts), Kafka (event log)
└─ Ask: "Is this the ONLY place this data lives?"

         │
         │  ETL / CDC / transformation
         ▼

Derived Data (Computed Copy)
├─ Can be rebuilt from system of record
├─ Optimized for specific read patterns
├─ Examples: Snowflake warehouse, Elasticsearch index, Redis cache
└─ Ask: "If I delete this, can I recreate it from source?"

Red flag: Treating derived data as a system of record. Leads to split-brain when pipelines lag.

Distributed System Problems (Quick Reference)

Problem             | Cause                    | Mitigation
────────────────────|──────────────────────────|──────────────────────────
Partial failures    | Some nodes fail, others  | Circuit breakers, retries,
                    | continue                 | idempotency
Network partition   | Network splits cluster   | CAP trade-off: choose
                    | into isolated groups     | CP or AP
Clock skew          | No global clock          | Logical clocks, NTP bounds
Byzantine faults    | Nodes lie/corrupt msgs   | BFT protocols (rare in
                    |                          | internal systems)
Coordination cost   | Consensus requires       | Minimize coordination;
                    | round trips              | avoid distributed txns

Microservices vs Monolith Trade-offs

Aspect	Monolith	Microservices
Deployment	Deploy everything together	Deploy each service independently
Data consistency	ACID within one DB	Eventual consistency across services
Team scaling	Harder past ~20-30 engineers	Better with large, autonomous teams
Latency	In-process function calls	Network calls (add latency + failure modes)
Debugging	Single stack trace	Distributed tracing needed
When to use	Start here; most teams	> 50 engineers, > 10 deploys/day

Serverless vs Containers vs VMs

Serverless (Lambda, Cloud Functions)
├─ Pro: Zero ops, auto-scales to 0, pay-per-invocation
└─ Con: Cold starts (~100ms-1s), time limits, no persistent state

Containers (ECS, Kubernetes, Cloud Run)
├─ Pro: Flexible, portable, good density
└─ Con: Operational overhead (orchestration, networking, scaling)

VMs (EC2, GCE)
├─ Pro: Full control, predictable performance, long-running workloads
└─ Con: Slow to provision, pay for idle capacity

Legal / Compliance Quick Reference

Regulation	Region	Key Technical Requirement
GDPR	EU	Right to erasure, data minimization, 72h breach notification
CCPA/CPRA	California	Right to know, right to delete, opt-out of sale
HIPAA	US Healthcare	PHI encryption, audit logs, access controls
AI Act	EU (2024)	High-risk AI needs transparency, human oversight

Practical pattern for right to erasure:

Instead of: Store name/email in every event log
Do this:    Store userId (opaque) in events
            Name/email only in user profile store
            To "erase": delete profile; events become anonymous
Alternative: Encrypt all user data under per-user key
             Delete the key → data is cryptographically erased

Modern Data Stack (2026 Baseline)

Sources → Ingestion → Storage → Transform → Serve → Consume
──────────────────────────────────────────────────────────────
App DBs   Debezium   S3/GCS    dbt        Snowflake  Tableau
APIs      Airbyte    Parquet   Spark      BigQuery   Looker
Kafka     Fivetran   Iceberg   Flink      Redshift   Superset
Webhooks  Kinesis    Delta     Python     Trino      Custom

Note: Each layer is swappable. The architecture pattern is stable even as specific tools evolve.

Key Trade-offs Summary

Decision	Choose A	Choose B	Deciding Factor
OLTP vs OLAP	OLTP	OLAP	Interactive writes vs analytical reads
Warehouse vs Lakehouse	Snowflake/BigQuery	Iceberg on S3	Cost sensitivity, query engine flexibility
Cloud vs self-host	Cloud	Self-host	Scale predictability, team ops capacity
Monolith vs microservices	Monolith	Microservices	Team size, deployment frequency
Serverless vs containers	Serverless	Containers	Statefulness, latency requirements
Distributed vs single-node	Single-node	Distributed	Fault tolerance and scale requirements

Common Architectural Mistakes

Mistake 1: OLAP queries on OLTP DB
  Problem: Kills transaction performance, evicts buffer cache
  Fix: Separate warehouse + ETL/CDC pipeline

Mistake 2: Treating derived data as system of record
  Problem: Split-brain when pipeline lags; corrupt analytics
  Fix: Document which store is authoritative

Mistake 3: Premature microservices
  Problem: All distributed system complexity, no team scale benefit
  Fix: Monolith first; migrate when team/deploy frequency demands it

Mistake 4: No plan for GDPR right to erasure
  Problem: Can't delete user data from event logs/analytics
  Fix: Store userId (not PII) in events; use cryptographic erasure

Mistake 5: Distributing before hitting single-node limits
  Problem: Unnecessary complexity
  Fix: Profile single-node PostgreSQL under realistic load first

Quick Revision Time

5-minute review: One-Line Summaries table + OLTP vs OLAP key distinctions
15-minute review: Add all decision trees + comparison tables
30-minute review: Full cheatsheet + trace through Modern Data Stack
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch01-cheatsheet