Chapter 1 Cheat Sheet — Trade-Offs in Data Systems Architecture

One-Line Summaries

ConceptOne-Liner
OLTPLow-latency row reads/writes for live application requests
OLAPHigh-throughput column scans for analytical aggregations
Data WarehouseDerived analytical store fed by ETL from OLTP systems
Data LakeCheap raw storage (S3 + Parquet); query via lakehouse engine
HTAPSingle system serving both OLTP and OLAP (TiDB, AlloyDB)
System of RecordAuthoritative source of truth; wins all conflicts
Derived DataComputed from system of record; can be rebuilt if lost
Cloud-NativeDesigned around managed cloud primitives, not ported from on-prem
MicroservicesIndependent deployable services, each with its own DB
ServerlessStateless function invocations; no persistent server state

Quick Numbers to Remember

MetricValueContext
OLTP latency target< 10ms p50, < 100ms p99User-facing interactive requests
OLAP latency toleranceSeconds to minutesAnalytical queries, no user waiting
Snowflake / BigQueryPetabyte scaleCloud warehouses, pay-per-query
Dropbox repatriation~$75M saved over 2 yearsMoving storage from AWS to own hardware
GDPR right to erasure30 days to complyTechnical: cryptographic erasure is practical
Microservices inflection~50+ engineersBefore this, monolith often wins
S3 durability99.999999999% (11 nines)Object storage as universal substrate

OLTP vs OLAP: Key Distinctions

OLTP                                OLAP
────────────────────────────────    ──────────────────────────────────
Access: random by key/index         Access: sequential column scan
Write: frequent small updates       Write: bulk batch loads
Latency: milliseconds               Latency: seconds to minutes
Concurrency: thousands of txns      Concurrency: tens of queries
Storage: row-oriented               Storage: column-oriented
Schema: normalized (3NF)            Schema: star/snowflake
Examples: Postgres, DynamoDB        Examples: Snowflake, BigQuery

Data Architecture Decision Tree

What's the primary workload?
│
├─ Serve live app requests → OLTP
│  ├─ Single region, < 10K QPS → PostgreSQL / MySQL (single node)
│  ├─ Multi-region / global → Spanner / CockroachDB / DynamoDB
│  └─ Serverless / event-driven → DynamoDB / Aurora Serverless
│
├─ Answer analytics questions → OLAP
│  ├─ < 10TB, team < 10 engineers → BigQuery / Snowflake (pay-per-query)
│  ├─ > 100TB, cost-sensitive → Iceberg on S3 + Trino (lakehouse)
│  └─ Real-time on live data → HTAP (TiDB, AlloyDB)
│
└─ Both workloads → Separate systems + ETL/CDC pipeline
   ├─ OLTP (system of record)
   └─ OLAP warehouse (derived)

Cloud vs Self-Hosting Decision Tree

Choosing hosting model?
│
├─ Variable / unpredictable load → Cloud (elastic scaling)
├─ Startup / fast iteration → Cloud (managed ops)
├─ Regulated data (EU, HIPAA) → Cloud with data residency controls
├─ Sustained high-volume, cost-sensitive → Self-hosted (break-even ~$5M ARR)
├─ Ultra-low latency (< 1ms) → Self-hosted bare metal
└─ Global footprint from day one → Cloud (30+ regions)

System of Record vs Derived Data

System of Record (Source of Truth)
├─ Authoritative; wins conflicts
├─ Normalized, write-optimized
├─ Examples: PostgreSQL (user accounts), Kafka (event log)
└─ Ask: "Is this the ONLY place this data lives?"

         │
         │  ETL / CDC / transformation
         ▼

Derived Data (Computed Copy)
├─ Can be rebuilt from system of record
├─ Optimized for specific read patterns
├─ Examples: Snowflake warehouse, Elasticsearch index, Redis cache
└─ Ask: "If I delete this, can I recreate it from source?"

Red flag: Treating derived data as a system of record. Leads to split-brain when pipelines lag.


Distributed System Problems (Quick Reference)

Problem             | Cause                    | Mitigation
────────────────────|──────────────────────────|──────────────────────────
Partial failures    | Some nodes fail, others  | Circuit breakers, retries,
                    | continue                 | idempotency
Network partition   | Network splits cluster   | CAP trade-off: choose
                    | into isolated groups     | CP or AP
Clock skew          | No global clock          | Logical clocks, NTP bounds
Byzantine faults    | Nodes lie/corrupt msgs   | BFT protocols (rare in
                    |                          | internal systems)
Coordination cost   | Consensus requires       | Minimize coordination;
                    | round trips              | avoid distributed txns

Microservices vs Monolith Trade-offs

AspectMonolithMicroservices
DeploymentDeploy everything togetherDeploy each service independently
Data consistencyACID within one DBEventual consistency across services
Team scalingHarder past ~20-30 engineersBetter with large, autonomous teams
LatencyIn-process function callsNetwork calls (add latency + failure modes)
DebuggingSingle stack traceDistributed tracing needed
When to useStart here; most teams> 50 engineers, > 10 deploys/day

Serverless vs Containers vs VMs

Serverless (Lambda, Cloud Functions)
├─ Pro: Zero ops, auto-scales to 0, pay-per-invocation
└─ Con: Cold starts (~100ms-1s), time limits, no persistent state

Containers (ECS, Kubernetes, Cloud Run)
├─ Pro: Flexible, portable, good density
└─ Con: Operational overhead (orchestration, networking, scaling)

VMs (EC2, GCE)
├─ Pro: Full control, predictable performance, long-running workloads
└─ Con: Slow to provision, pay for idle capacity

RegulationRegionKey Technical Requirement
GDPREURight to erasure, data minimization, 72h breach notification
CCPA/CPRACaliforniaRight to know, right to delete, opt-out of sale
HIPAAUS HealthcarePHI encryption, audit logs, access controls
AI ActEU (2024)High-risk AI needs transparency, human oversight

Practical pattern for right to erasure:

Instead of: Store name/email in every event log
Do this:    Store userId (opaque) in events
            Name/email only in user profile store
            To "erase": delete profile; events become anonymous
Alternative: Encrypt all user data under per-user key
             Delete the key → data is cryptographically erased

Modern Data Stack (2026 Baseline)

Sources → Ingestion → Storage → Transform → Serve → Consume
──────────────────────────────────────────────────────────────
App DBs   Debezium   S3/GCS    dbt        Snowflake  Tableau
APIs      Airbyte    Parquet   Spark      BigQuery   Looker
Kafka     Fivetran   Iceberg   Flink      Redshift   Superset
Webhooks  Kinesis    Delta     Python     Trino      Custom

Note: Each layer is swappable. The architecture pattern is stable even as specific tools evolve.


Key Trade-offs Summary

DecisionChoose AChoose BDeciding Factor
OLTP vs OLAPOLTPOLAPInteractive writes vs analytical reads
Warehouse vs LakehouseSnowflake/BigQueryIceberg on S3Cost sensitivity, query engine flexibility
Cloud vs self-hostCloudSelf-hostScale predictability, team ops capacity
Monolith vs microservicesMonolithMicroservicesTeam size, deployment frequency
Serverless vs containersServerlessContainersStatefulness, latency requirements
Distributed vs single-nodeSingle-nodeDistributedFault tolerance and scale requirements

Common Architectural Mistakes

Mistake 1: OLAP queries on OLTP DB
  Problem: Kills transaction performance, evicts buffer cache
  Fix: Separate warehouse + ETL/CDC pipeline

Mistake 2: Treating derived data as system of record
  Problem: Split-brain when pipeline lags; corrupt analytics
  Fix: Document which store is authoritative

Mistake 3: Premature microservices
  Problem: All distributed system complexity, no team scale benefit
  Fix: Monolith first; migrate when team/deploy frequency demands it

Mistake 4: No plan for GDPR right to erasure
  Problem: Can't delete user data from event logs/analytics
  Fix: Store userId (not PII) in events; use cryptographic erasure

Mistake 5: Distributing before hitting single-node limits
  Problem: Unnecessary complexity
  Fix: Profile single-node PostgreSQL under realistic load first

Quick Revision Time

5-minute review: One-Line Summaries table + OLTP vs OLAP key distinctions
15-minute review: Add all decision trees + comparison tables
30-minute review: Full cheatsheet + trace through Modern Data Stack
Last Updated: 2026-05-29