Chapter 1 Flashcards — Trade-Offs in Data Systems Architecture

Definitions

What is OLTP and what are its defining characteristics?
?
OLTP (Online Transaction Processing): Systems that handle live application requests — inserting records, reading user profiles, updating balances.

Access pattern: Random reads/writes by primary key or index
Latency target: < 10ms p50, < 100ms p99
Write pattern: Frequent small inserts, updates, deletes
Concurrency: Thousands of simultaneous transactions
Storage: Row-oriented (read whole rows)
Schema: Normalized (3NF)
Examples: PostgreSQL, MySQL, DynamoDB, Aurora, Spanner

What is OLAP and what are its defining characteristics?
?
OLAP (Online Analytical Processing): Systems that answer aggregation questions over large historical datasets.

Access pattern: Sequential column scans over millions–billions of rows
Latency: Seconds to minutes acceptable
Write pattern: Bulk batch loads (ETL/ELT), not frequent small writes
Concurrency: Tens of concurrent queries
Storage: Column-oriented (read only needed columns)
Schema: Denormalized (star/snowflake schema)
Examples: Snowflake, BigQuery, Redshift, ClickHouse

What is a data warehouse and how does it relate to OLTP systems?
?
A data warehouse is a dedicated analytical store designed to receive data from one or more OLTP source systems and serve OLAP queries.

Data flows from OLTP → warehouse via ETL (Extract-Transform-Load) or ELT pipelines
Warehouse is derived data: it can be rebuilt from the OLTP source if corrupted
Optimized for column-oriented reads, not transactional writes
Modern examples: Snowflake, BigQuery, Amazon Redshift
Key insight: the warehouse is NOT a second source of truth — it is derived from the operational system

What is the difference between a system of record and derived data?
?
System of Record (Source of Truth):

The authoritative version of the data; wins all conflicts
Normalized, write-optimized, transactional
Examples: PostgreSQL user accounts DB, raw Kafka event log
Criterion: if two systems disagree, the system of record is correct

Derived Data:

Computed from the system of record via transformations
Can be rebuilt if corrupted or lost
Optimized for specific read patterns (speed, denormalization)
Examples: Snowflake warehouse, Elasticsearch index, Redis cache, ML feature store

Critical failure mode: Treating derived data as a system of record leads to split-brain inconsistencies when the transformation pipeline lags.

What is a data lake and how does it differ from a data warehouse?
?
Data Lake: Storage layer of raw, unprocessed data in open formats (Parquet, ORC, Iceberg) on cheap object storage (S3, GCS).

Cost: Much cheaper per TB than warehouse compute+storage
Schema: Schema-on-read (flexible, no pre-defined schema)
Governance: Less structured; data quality varies
Query engine: Needs external engine (Athena, Trino, Spark)

Data Warehouse: Structured, curated, transformed data with enforced schema.

Cost: Higher per TB but better query performance
Schema: Schema-on-write (enforced)
Governance: Strong

Lakehouse (Iceberg + Trino, Databricks Delta): Combines cheap lake storage with warehouse-like ACID semantics and query performance.

What is HTAP?
?
HTAP (Hybrid Transactional/Analytical Processing): A system architecture that serves both OLTP and OLAP workloads from a single data store.

Internally maintains both row-oriented (for transactions) and column-oriented (for analytics) storage
Benefit: Eliminates ETL pipeline; analytical queries see live operational data
Trade-off: Significantly higher system complexity
Examples: TiDB, Google AlloyDB, SingleStore, CockroachDB (developing)
When to use: When real-time analytics on operational data is required and the complexity cost is worth it

Trade-offs and Comparisons

What are the main advantages of cloud hosting over self-hosting?
?
Cloud advantages:

Operational simplicity: Managed services handle patching, failover, backups, scaling
Elastic scaling: Add capacity in minutes; scale down when load drops
Global footprint: Deploy in 30+ regions without building data centers
Managed ecosystem: Managed Kafka (MSK), Redis (ElastiCache), Postgres (RDS) — use without operating
Pay-per-use economics: Cost tracks usage; favorable for variable workloads
No upfront hardware capex: Critical for startups

What are the main disadvantages of cloud hosting versus self-hosting?
?
Cloud disadvantages:

Cost at scale: For predictable, sustained high-volume workloads, cloud can cost more than owned hardware (Dropbox saved ~$75M repatriating from AWS)
Vendor lock-in: Proprietary APIs (DynamoDB, Bigtable) create switching costs
Data egress costs: Moving large datasets out of cloud is expensive
Data sovereignty: Some regulated data must reside in specific jurisdictions
Reduced control: Cannot tune kernel, choose hardware, or guarantee sub-millisecond latency
Opaque failures: Cloud outages are outside your control; AZ failures require application-level handling

What are the core problems introduced by distributed systems that don’t exist on a single machine?
?
Five problems unique to distributed systems:

Partial failures: Some components fail while others run — system state becomes ambiguous
Network partitions: Network split creates isolated groups; CAP theorem forces choose consistency or availability
Clock skew: No global clock; event ordering requires logical clocks (Lamport timestamps, vector clocks)
Byzantine faults: In open networks, nodes may lie or corrupt messages (most internal systems assume crash-fail, not Byzantine)
Coordination overhead: Agreement across nodes (distributed transactions, leader election) adds latency proportional to node count

Key principle: Do not distribute until you must. Start with a single node.

What are the trade-offs of microservices versus a monolith?
?
Monolith advantages:

In-process function calls (no network latency or partial failures)
Single ACID database (no cross-service consistency problems)
Simple debugging (single stack trace)
Lower operational complexity (one deployment unit)

Microservices advantages:

Independent deployment and scaling per service
Team autonomy (each team owns their service)
Fault isolation (one service failure doesn’t cascade)
Technology heterogeneity

When microservices are appropriate: > 50 engineers, > 10 deploys/day, clear service boundaries
Warning: Premature microservices introduce all distributed system complexity with no team-scale benefit

What are the trade-offs of serverless vs containers vs VMs?
?
Serverless (Lambda, Cloud Functions):

Pro: Zero ops, auto-scales to 0, pay-per-invocation
Con: Cold start latency (~100ms–1s), execution time limits, no persistent state, vendor lock-in
Best for: Stateless, event-driven, variable workloads

Containers (Kubernetes, ECS):

Pro: Flexible, portable, good density, persistent state possible
Con: Operational overhead (orchestration, networking, scaling config)
Best for: Long-running services, stateful workloads

VMs (EC2, GCE):

Pro: Full control, predictable performance, suitable for long-running workloads
Con: Slow to provision, pay for idle capacity
Best for: Specialized hardware needs, compliance requirements

Numbers and Precision

What are typical latency targets for OLTP vs OLAP systems?
?
OLTP:

p50 (median): < 10ms
p99: < 100ms
User impact: direct; slow OLTP = poor user experience

OLAP:

Acceptable range: seconds to minutes
User is not waiting interactively; query is background analysis
Key metric: query throughput (queries per hour), not individual latency

HTAP attempts to bridge this: sub-second analytical queries on live data, but at the cost of system complexity

What did Dropbox’s repatriation from AWS demonstrate about cloud economics?
?
Dropbox moved storage workloads from AWS back to their own hardware and saved approximately $75 million over 2 years.

Key condition: Dropbox had predictable, high-volume, sustained storage workload
Cloud pay-per-use is economical for variable workloads; at sustained scale, owned hardware can be cheaper
Break-even point: Roughly ~ $5 M A R R or$ 1M/year cloud spend justifies dedicated hardware evaluation
Lesson: Cloud is not always cheaper — evaluate based on workload predictability and volume

What are the key GDPR deadlines and requirements architects must plan for?
?
Key GDPR requirements with technical implications:

Right to erasure: Must delete all personal data for a user within 30 days of request
Breach notification: Notify supervisory authority within 72 hours of discovering a breach
Data minimization: Don’t collect data you don’t need
Purpose limitation: Don’t repurpose data beyond stated collection purpose

Technical challenge: Right to erasure in analytics pipelines

Event logs with user actions are hard to delete individual records from
Solution 1: Store userId (not PII) in events; delete only the profile store
Solution 2: Cryptographic erasure — encrypt user data under per-user key; delete the key

What is the S3 durability guarantee and why does it matter architecturally?
?
S3 durability: 99.999999999% (11 nines)

Equivalent to expected loss of 1 object per 10,000 years if you have 10,000 objects
Achieved through redundant storage across multiple AZs and integrity checksums

Architectural implication: Object storage is reliable enough to serve as the universal substrate for data pipelines.

Build data lakes on S3 (or GCS/Azure Blob) rather than HDFS
Lakehouse architectures (Iceberg, Delta) built entirely on S3 + compute-storage separation
Cloud-native means: compute is disposable; state lives in object storage

Application and Failure Modes

When should you NOT use microservices?
?
Avoid microservices when:

Team size is < 30–50 engineers (coordination overhead outweighs autonomy benefits)
Deployment frequency is low (< few deploys/day; independence benefit doesn’t materialize)
Service boundaries are unclear (premature decomposition creates wrong seams)
Strong transactional consistency is required across what would be service boundaries (cross-service ACID is expensive)

Failure mode of premature microservices:

Each service call becomes a distributed systems problem (retries, timeouts, tracing)
Data consistency requires sagas or 2PC — complex and error-prone
Operational overhead multiplies: N services to monitor, deploy, and debug
Rule: Start with a monolith; extract services when team and deployment frequency demand it

Why can’t you run OLAP queries directly against your OLTP database?
?
Running OLAP queries against an OLTP database causes problems for both workloads:

Harm to OLTP:

Long sequential scans hold table locks, blocking concurrent transactions
Large scans evict the OLTP working set from the buffer pool (cache thrashing)
CPU and I/O spikes degrade response times for live users

Harm to OLAP:

Row-oriented storage is inefficient for column scans (reads unnecessary data)
Missing column compression and vectorized execution → slow queries
No isolation from OLTP write traffic

Solution: Separate systems with ETL/CDC pipeline. The warehouse is derived from the OLTP system.
Exception: HTAP systems (TiDB, AlloyDB) mitigate this at the cost of higher system complexity.

What are the failure modes of treating derived data as a system of record?
?
What goes wrong:

Stale data mistaken for current: Pipeline lag causes downstream systems to believe old data is current
Split-brain inconsistency: Two derived stores diverge because one pipeline is behind; neither is authoritative
Incorrect conflict resolution: When two sources disagree, you don’t know which is right
Data loss on rebuild: If you treat the derived store as authoritative, you may discard the true source
Schema confusion: Schema changes in source don’t propagate; derived store has different structure

Correct mental model:

System of record = gold copy; always correct; if others disagree, it wins
Derived = bronze copy; may be stale; rebuild from source if corrupted

What happens when a company adopts cloud-native architecture but keeps on-premises thinking?
?
Anti-patterns from porting on-premises patterns to cloud:

Stateful instances: Storing session state in application server memory — fails when cloud instances are terminated (spot preemption, auto-scaling events)
Coupled compute and storage: Running databases on single large EC2 instances instead of managed services — loses elasticity
Manual capacity planning: Pre-allocating instance count instead of using auto-scaling groups
Long-lived instances as pets: Treating cloud VMs like physical servers (manual patching, no replacements) — fails when cloud hardware is recycled

Cloud-native principles instead:

Stateless compute; all durable state in managed storage
Decouple compute and storage (Snowflake model)
Treat every instance as disposable; design for termination at any time
Use managed services; avoid operating infrastructure that a cloud provider can operate for you

What are the practical technical steps to comply with GDPR right to erasure in a data pipeline?
?
The problem: Event logs and analytical tables contain PII mixed with behavioral data. Can’t delete individual rows from columnar files.

Approach 1 — Separation by design:

Store only userId (opaque integer) in event logs and warehouse
Personal data (name, email, DOB) lives only in the user profile database
On erasure: delete user profile row → events become pseudonymous, not personal data
Trade-off: Loses ability to re-identify events if needed

Approach 2 — Cryptographic erasure:

Encrypt all user-specific data with a per-user encryption key
Store the key separately (key management service)
On erasure: delete the key → all encrypted data is unreadable = effectively erased
Trade-off: Adds encryption/decryption overhead; key management complexity

Approach 3 — Partitioned storage with TTL:

Partition event data by time; apply retention policies
Accept that old events beyond retention period are automatically purged

What is the difference between ETL and ELT, and when does each make sense?
?
ETL (Extract-Transform-Load):

Transform data before loading into the warehouse
Data is clean and structured when it arrives
Traditional approach when warehouse compute was expensive
Best for: Complex transformations that need specialized tools; regulated data that must be clean before storage

ELT (Extract-Load-Transform):

Load raw data into warehouse first, then transform inside the warehouse using SQL
Takes advantage of cheap and powerful warehouse compute (Snowflake, BigQuery)
dbt (data build tool) is the dominant ELT transformation layer in 2026
Best for: Modern cloud warehouses; when you want to preserve raw data for reprocessing; faster ingestion

Trend: ELT has largely displaced ETL for cloud warehouse workloads. The raw data layer serves as a replay buffer.

What is cloud-native architecture and what are its core principles?
?
Cloud-native means designing systems for cloud primitives rather than porting on-premises patterns to the cloud.

Core principles:

Disposable infrastructure: Instances terminate unexpectedly — design for it; no state in application servers
Compute-storage separation: Scale each independently (Snowflake separates query compute from S3 storage)
Prefer managed services: Let the cloud provider operate Kafka, Redis, Postgres unless you have specific reasons not to
Design for availability zones: Spread replicas across AZs for fault tolerance without cross-region latency
Object storage as substrate: S3/GCS with 11-nines durability is the universal data foundation

Anti-patterns (on-premises thinking in the cloud):

Storing session state in application server memory
Treating cloud VMs as pets rather than cattle
Manual capacity planning instead of auto-scaling

What is the difference between star schema and snowflake schema in data warehouses?
?
Both are dimensional modeling schemas for OLAP warehouses. They differ in how dimension tables are structured.

Star schema:

One central fact table (events/transactions) surrounded by dimension tables (product, customer, date)
Dimension tables are denormalized — contain redundant data
Simpler joins; typically 2 joins to answer a question
Better query performance for simple queries

Snowflake schema:

Dimension tables are normalized — hierarchies split into separate tables
Example: product → product_category → department (3 tables instead of 1)
More joins required; lower storage cost; harder to query

2026 practice: Most modern data warehouses (Snowflake, BigQuery) use star schema with some denormalization. dbt models typically implement star schema patterns.

What is the difference between ETL and ELT?
?
ETL (Extract-Transform-Load):

Extract data from source systems
Transform (clean, join, aggregate) outside the warehouse in a dedicated ETL tool
Load cleaned data into the warehouse

ELT (Extract-Load-Transform):

Extract raw data from source systems
Load raw data directly into the warehouse
Transform inside the warehouse using SQL (dbt, stored procedures)

Why ELT dominates in 2026:

Cloud warehouses (Snowflake, BigQuery) have cheap, powerful compute — transform there
Preserves raw data for reprocessing as requirements change
dbt has made SQL-based transformation the standard
Faster ingestion; raw data available sooner

When ETL still makes sense: Complex transformations requiring specialized tools; regulated data that must be cleaned before storage

What is write amplification in the fan-out context?
?
Write amplification is when one logical write operation results in many physical writes.

Fan-out context: When a user posts, the message must be written to all followers’ timeline caches.

A user with 1M followers posts once → 1M cache writes
A celebrity with 30M followers (e.g., Taylor Swift on a social network) posts → 30M writes
If each write takes 1ms, 30M writes take 30,000 seconds sequentially — must be parallelized

Why it matters: Write amplification determines the system’s write capacity ceiling and is the primary reason celebrities are handled differently from regular users in fan-out architectures.

Other contexts where write amplification appears:

LSM-tree compaction (Ch4): data rewritten multiple times during compaction
Replication (Ch6): single write replicated to N replicas
Indexed writes: each write updates multiple index structures

What are the key characteristics of a data lakehouse?
?
A lakehouse combines the low cost and flexibility of a data lake with the query performance and ACID semantics of a data warehouse.

Key properties:

Storage: Open table formats (Apache Iceberg, Delta Lake, Apache Hudi) on object storage (S3, GCS)
ACID transactions: Atomic writes, consistent reads across concurrent writers
Time travel: Query data as it existed at a past point in time
Schema evolution: Add/rename/reorder columns without rewriting data
Multiple compute engines: Same data accessed by Spark, Trino, Flink, DuckDB, Snowflake

Examples: Databricks Delta Lake, Apache Iceberg (used by Netflix, Apple), Apache Hudi (used by Uber)

Why it’s replacing pure warehouses: Avoids data duplication (lake + warehouse); cheaper at scale; open formats prevent vendor lock-in; supports both batch and streaming ingestion.

What is Change Data Capture (CDC) and why is it used in data pipelines?
?
Change Data Capture (CDC): A technique that continuously captures row-level changes (INSERT, UPDATE, DELETE) from a source database’s transaction log and publishes them as events.

How it works:

Database writes every change to its transaction log (WAL in PostgreSQL, binlog in MySQL)
CDC tool reads the log (Debezium, AWS DMS, Airbyte)
Changes are published to a message queue (Kafka, Kinesis)
Downstream systems consume the change events to update derived data

Why CDC over polling:

Real-time (milliseconds vs minutes for scheduled queries)
Low source DB load (reads the log, doesn’t query the table)
Captures all changes including deletes (polling misses deletes)
Preserves change order

Use cases: Feeding data warehouse with near-real-time data; keeping search indexes in sync; cache invalidation; audit logging

Total Cards: 27
Review Time: ~35 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch01-flashcards

Chapter 1 Flashcards — Trade-Offs in Data Systems Architecture

Definitions

Trade-offs and Comparisons

Numbers and Precision

Application and Failure Modes

Graph View

Table of Contents