Chapter 1 Flashcards — Trade-Offs in Data Systems Architecture

flashcards ddia-2e chapter1

Definitions

What is OLTP and what are its defining characteristics?
?
OLTP (Online Transaction Processing): Systems that handle live application requests — inserting records, reading user profiles, updating balances.

  • Access pattern: Random reads/writes by primary key or index
  • Latency target: < 10ms p50, < 100ms p99
  • Write pattern: Frequent small inserts, updates, deletes
  • Concurrency: Thousands of simultaneous transactions
  • Storage: Row-oriented (read whole rows)
  • Schema: Normalized (3NF)
  • Examples: PostgreSQL, MySQL, DynamoDB, Aurora, Spanner

What is OLAP and what are its defining characteristics?
?
OLAP (Online Analytical Processing): Systems that answer aggregation questions over large historical datasets.

  • Access pattern: Sequential column scans over millions–billions of rows
  • Latency: Seconds to minutes acceptable
  • Write pattern: Bulk batch loads (ETL/ELT), not frequent small writes
  • Concurrency: Tens of concurrent queries
  • Storage: Column-oriented (read only needed columns)
  • Schema: Denormalized (star/snowflake schema)
  • Examples: Snowflake, BigQuery, Redshift, ClickHouse

What is a data warehouse and how does it relate to OLTP systems?
?
A data warehouse is a dedicated analytical store designed to receive data from one or more OLTP source systems and serve OLAP queries.

  • Data flows from OLTP → warehouse via ETL (Extract-Transform-Load) or ELT pipelines
  • Warehouse is derived data: it can be rebuilt from the OLTP source if corrupted
  • Optimized for column-oriented reads, not transactional writes
  • Modern examples: Snowflake, BigQuery, Amazon Redshift
  • Key insight: the warehouse is NOT a second source of truth — it is derived from the operational system

What is the difference between a system of record and derived data?
?
System of Record (Source of Truth):

  • The authoritative version of the data; wins all conflicts
  • Normalized, write-optimized, transactional
  • Examples: PostgreSQL user accounts DB, raw Kafka event log
  • Criterion: if two systems disagree, the system of record is correct

Derived Data:

  • Computed from the system of record via transformations
  • Can be rebuilt if corrupted or lost
  • Optimized for specific read patterns (speed, denormalization)
  • Examples: Snowflake warehouse, Elasticsearch index, Redis cache, ML feature store

Critical failure mode: Treating derived data as a system of record leads to split-brain inconsistencies when the transformation pipeline lags.

What is a data lake and how does it differ from a data warehouse?
?
Data Lake: Storage layer of raw, unprocessed data in open formats (Parquet, ORC, Iceberg) on cheap object storage (S3, GCS).

  • Cost: Much cheaper per TB than warehouse compute+storage
  • Schema: Schema-on-read (flexible, no pre-defined schema)
  • Governance: Less structured; data quality varies
  • Query engine: Needs external engine (Athena, Trino, Spark)

Data Warehouse: Structured, curated, transformed data with enforced schema.

  • Cost: Higher per TB but better query performance
  • Schema: Schema-on-write (enforced)
  • Governance: Strong

Lakehouse (Iceberg + Trino, Databricks Delta): Combines cheap lake storage with warehouse-like ACID semantics and query performance.

What is HTAP?
?
HTAP (Hybrid Transactional/Analytical Processing): A system architecture that serves both OLTP and OLAP workloads from a single data store.

  • Internally maintains both row-oriented (for transactions) and column-oriented (for analytics) storage
  • Benefit: Eliminates ETL pipeline; analytical queries see live operational data
  • Trade-off: Significantly higher system complexity
  • Examples: TiDB, Google AlloyDB, SingleStore, CockroachDB (developing)
  • When to use: When real-time analytics on operational data is required and the complexity cost is worth it

Trade-offs and Comparisons

What are the main advantages of cloud hosting over self-hosting?
?
Cloud advantages:

  1. Operational simplicity: Managed services handle patching, failover, backups, scaling
  2. Elastic scaling: Add capacity in minutes; scale down when load drops
  3. Global footprint: Deploy in 30+ regions without building data centers
  4. Managed ecosystem: Managed Kafka (MSK), Redis (ElastiCache), Postgres (RDS) — use without operating
  5. Pay-per-use economics: Cost tracks usage; favorable for variable workloads
  6. No upfront hardware capex: Critical for startups

What are the main disadvantages of cloud hosting versus self-hosting?
?
Cloud disadvantages:

  1. Cost at scale: For predictable, sustained high-volume workloads, cloud can cost more than owned hardware (Dropbox saved ~$75M repatriating from AWS)
  2. Vendor lock-in: Proprietary APIs (DynamoDB, Bigtable) create switching costs
  3. Data egress costs: Moving large datasets out of cloud is expensive
  4. Data sovereignty: Some regulated data must reside in specific jurisdictions
  5. Reduced control: Cannot tune kernel, choose hardware, or guarantee sub-millisecond latency
  6. Opaque failures: Cloud outages are outside your control; AZ failures require application-level handling

What are the core problems introduced by distributed systems that don’t exist on a single machine?
?
Five problems unique to distributed systems:

  1. Partial failures: Some components fail while others run — system state becomes ambiguous
  2. Network partitions: Network split creates isolated groups; CAP theorem forces choose consistency or availability
  3. Clock skew: No global clock; event ordering requires logical clocks (Lamport timestamps, vector clocks)
  4. Byzantine faults: In open networks, nodes may lie or corrupt messages (most internal systems assume crash-fail, not Byzantine)
  5. Coordination overhead: Agreement across nodes (distributed transactions, leader election) adds latency proportional to node count

Key principle: Do not distribute until you must. Start with a single node.

What are the trade-offs of microservices versus a monolith?
?
Monolith advantages:

  • In-process function calls (no network latency or partial failures)
  • Single ACID database (no cross-service consistency problems)
  • Simple debugging (single stack trace)
  • Lower operational complexity (one deployment unit)

Microservices advantages:

  • Independent deployment and scaling per service
  • Team autonomy (each team owns their service)
  • Fault isolation (one service failure doesn’t cascade)
  • Technology heterogeneity

When microservices are appropriate: > 50 engineers, > 10 deploys/day, clear service boundaries
Warning: Premature microservices introduce all distributed system complexity with no team-scale benefit

What are the trade-offs of serverless vs containers vs VMs?
?
Serverless (Lambda, Cloud Functions):

  • Pro: Zero ops, auto-scales to 0, pay-per-invocation
  • Con: Cold start latency (~100ms–1s), execution time limits, no persistent state, vendor lock-in
  • Best for: Stateless, event-driven, variable workloads

Containers (Kubernetes, ECS):

  • Pro: Flexible, portable, good density, persistent state possible
  • Con: Operational overhead (orchestration, networking, scaling config)
  • Best for: Long-running services, stateful workloads

VMs (EC2, GCE):

  • Pro: Full control, predictable performance, suitable for long-running workloads
  • Con: Slow to provision, pay for idle capacity
  • Best for: Specialized hardware needs, compliance requirements

Numbers and Precision

What are typical latency targets for OLTP vs OLAP systems?
?
OLTP:

  • p50 (median): < 10ms
  • p99: < 100ms
  • User impact: direct; slow OLTP = poor user experience

OLAP:

  • Acceptable range: seconds to minutes
  • User is not waiting interactively; query is background analysis
  • Key metric: query throughput (queries per hour), not individual latency

HTAP attempts to bridge this: sub-second analytical queries on live data, but at the cost of system complexity

What did Dropbox’s repatriation from AWS demonstrate about cloud economics?
?
Dropbox moved storage workloads from AWS back to their own hardware and saved approximately $75 million over 2 years.

  • Key condition: Dropbox had predictable, high-volume, sustained storage workload
  • Cloud pay-per-use is economical for variable workloads; at sustained scale, owned hardware can be cheaper
  • Break-even point: Roughly ~1M/year cloud spend justifies dedicated hardware evaluation
  • Lesson: Cloud is not always cheaper — evaluate based on workload predictability and volume

What are the key GDPR deadlines and requirements architects must plan for?
?
Key GDPR requirements with technical implications:

  • Right to erasure: Must delete all personal data for a user within 30 days of request
  • Breach notification: Notify supervisory authority within 72 hours of discovering a breach
  • Data minimization: Don’t collect data you don’t need
  • Purpose limitation: Don’t repurpose data beyond stated collection purpose

Technical challenge: Right to erasure in analytics pipelines

  • Event logs with user actions are hard to delete individual records from
  • Solution 1: Store userId (not PII) in events; delete only the profile store
  • Solution 2: Cryptographic erasure — encrypt user data under per-user key; delete the key

What is the S3 durability guarantee and why does it matter architecturally?
?
S3 durability: 99.999999999% (11 nines)

  • Equivalent to expected loss of 1 object per 10,000 years if you have 10,000 objects
  • Achieved through redundant storage across multiple AZs and integrity checksums

Architectural implication: Object storage is reliable enough to serve as the universal substrate for data pipelines.

  • Build data lakes on S3 (or GCS/Azure Blob) rather than HDFS
  • Lakehouse architectures (Iceberg, Delta) built entirely on S3 + compute-storage separation
  • Cloud-native means: compute is disposable; state lives in object storage

Application and Failure Modes

When should you NOT use microservices?
?
Avoid microservices when:

  • Team size is < 30–50 engineers (coordination overhead outweighs autonomy benefits)
  • Deployment frequency is low (< few deploys/day; independence benefit doesn’t materialize)
  • Service boundaries are unclear (premature decomposition creates wrong seams)
  • Strong transactional consistency is required across what would be service boundaries (cross-service ACID is expensive)

Failure mode of premature microservices:

  • Each service call becomes a distributed systems problem (retries, timeouts, tracing)
  • Data consistency requires sagas or 2PC — complex and error-prone
  • Operational overhead multiplies: N services to monitor, deploy, and debug
  • Rule: Start with a monolith; extract services when team and deployment frequency demand it

Why can’t you run OLAP queries directly against your OLTP database?
?
Running OLAP queries against an OLTP database causes problems for both workloads:

Harm to OLTP:

  • Long sequential scans hold table locks, blocking concurrent transactions
  • Large scans evict the OLTP working set from the buffer pool (cache thrashing)
  • CPU and I/O spikes degrade response times for live users

Harm to OLAP:

  • Row-oriented storage is inefficient for column scans (reads unnecessary data)
  • Missing column compression and vectorized execution → slow queries
  • No isolation from OLTP write traffic

Solution: Separate systems with ETL/CDC pipeline. The warehouse is derived from the OLTP system.
Exception: HTAP systems (TiDB, AlloyDB) mitigate this at the cost of higher system complexity.

What are the failure modes of treating derived data as a system of record?
?
What goes wrong:

  1. Stale data mistaken for current: Pipeline lag causes downstream systems to believe old data is current
  2. Split-brain inconsistency: Two derived stores diverge because one pipeline is behind; neither is authoritative
  3. Incorrect conflict resolution: When two sources disagree, you don’t know which is right
  4. Data loss on rebuild: If you treat the derived store as authoritative, you may discard the true source
  5. Schema confusion: Schema changes in source don’t propagate; derived store has different structure

Correct mental model:

  • System of record = gold copy; always correct; if others disagree, it wins
  • Derived = bronze copy; may be stale; rebuild from source if corrupted

What happens when a company adopts cloud-native architecture but keeps on-premises thinking?
?
Anti-patterns from porting on-premises patterns to cloud:

  1. Stateful instances: Storing session state in application server memory — fails when cloud instances are terminated (spot preemption, auto-scaling events)
  2. Coupled compute and storage: Running databases on single large EC2 instances instead of managed services — loses elasticity
  3. Manual capacity planning: Pre-allocating instance count instead of using auto-scaling groups
  4. Long-lived instances as pets: Treating cloud VMs like physical servers (manual patching, no replacements) — fails when cloud hardware is recycled

Cloud-native principles instead:

  • Stateless compute; all durable state in managed storage
  • Decouple compute and storage (Snowflake model)
  • Treat every instance as disposable; design for termination at any time
  • Use managed services; avoid operating infrastructure that a cloud provider can operate for you

What are the practical technical steps to comply with GDPR right to erasure in a data pipeline?
?
The problem: Event logs and analytical tables contain PII mixed with behavioral data. Can’t delete individual rows from columnar files.

Approach 1 — Separation by design:

  • Store only userId (opaque integer) in event logs and warehouse
  • Personal data (name, email, DOB) lives only in the user profile database
  • On erasure: delete user profile row → events become pseudonymous, not personal data
  • Trade-off: Loses ability to re-identify events if needed

Approach 2 — Cryptographic erasure:

  • Encrypt all user-specific data with a per-user encryption key
  • Store the key separately (key management service)
  • On erasure: delete the key → all encrypted data is unreadable = effectively erased
  • Trade-off: Adds encryption/decryption overhead; key management complexity

Approach 3 — Partitioned storage with TTL:

  • Partition event data by time; apply retention policies
  • Accept that old events beyond retention period are automatically purged

What is the difference between ETL and ELT, and when does each make sense?
?
ETL (Extract-Transform-Load):

  • Transform data before loading into the warehouse
  • Data is clean and structured when it arrives
  • Traditional approach when warehouse compute was expensive
  • Best for: Complex transformations that need specialized tools; regulated data that must be clean before storage

ELT (Extract-Load-Transform):

  • Load raw data into warehouse first, then transform inside the warehouse using SQL
  • Takes advantage of cheap and powerful warehouse compute (Snowflake, BigQuery)
  • dbt (data build tool) is the dominant ELT transformation layer in 2026
  • Best for: Modern cloud warehouses; when you want to preserve raw data for reprocessing; faster ingestion

Trend: ELT has largely displaced ETL for cloud warehouse workloads. The raw data layer serves as a replay buffer.

What is cloud-native architecture and what are its core principles?
?
Cloud-native means designing systems for cloud primitives rather than porting on-premises patterns to the cloud.

Core principles:

  1. Disposable infrastructure: Instances terminate unexpectedly — design for it; no state in application servers
  2. Compute-storage separation: Scale each independently (Snowflake separates query compute from S3 storage)
  3. Prefer managed services: Let the cloud provider operate Kafka, Redis, Postgres unless you have specific reasons not to
  4. Design for availability zones: Spread replicas across AZs for fault tolerance without cross-region latency
  5. Object storage as substrate: S3/GCS with 11-nines durability is the universal data foundation

Anti-patterns (on-premises thinking in the cloud):

  • Storing session state in application server memory
  • Treating cloud VMs as pets rather than cattle
  • Manual capacity planning instead of auto-scaling

What is the difference between star schema and snowflake schema in data warehouses?
?
Both are dimensional modeling schemas for OLAP warehouses. They differ in how dimension tables are structured.

Star schema:

  • One central fact table (events/transactions) surrounded by dimension tables (product, customer, date)
  • Dimension tables are denormalized — contain redundant data
  • Simpler joins; typically 2 joins to answer a question
  • Better query performance for simple queries

Snowflake schema:

  • Dimension tables are normalized — hierarchies split into separate tables
  • Example: product → product_category → department (3 tables instead of 1)
  • More joins required; lower storage cost; harder to query

2026 practice: Most modern data warehouses (Snowflake, BigQuery) use star schema with some denormalization. dbt models typically implement star schema patterns.

What is the difference between ETL and ELT?
?
ETL (Extract-Transform-Load):

  1. Extract data from source systems
  2. Transform (clean, join, aggregate) outside the warehouse in a dedicated ETL tool
  3. Load cleaned data into the warehouse

ELT (Extract-Load-Transform):

  1. Extract raw data from source systems
  2. Load raw data directly into the warehouse
  3. Transform inside the warehouse using SQL (dbt, stored procedures)

Why ELT dominates in 2026:

  • Cloud warehouses (Snowflake, BigQuery) have cheap, powerful compute — transform there
  • Preserves raw data for reprocessing as requirements change
  • dbt has made SQL-based transformation the standard
  • Faster ingestion; raw data available sooner

When ETL still makes sense: Complex transformations requiring specialized tools; regulated data that must be cleaned before storage

What is write amplification in the fan-out context?
?
Write amplification is when one logical write operation results in many physical writes.

Fan-out context: When a user posts, the message must be written to all followers’ timeline caches.

  • A user with 1M followers posts once → 1M cache writes
  • A celebrity with 30M followers (e.g., Taylor Swift on a social network) posts → 30M writes
  • If each write takes 1ms, 30M writes take 30,000 seconds sequentially — must be parallelized

Why it matters: Write amplification determines the system’s write capacity ceiling and is the primary reason celebrities are handled differently from regular users in fan-out architectures.

Other contexts where write amplification appears:

  • LSM-tree compaction (Ch4): data rewritten multiple times during compaction
  • Replication (Ch6): single write replicated to N replicas
  • Indexed writes: each write updates multiple index structures

What are the key characteristics of a data lakehouse?
?
A lakehouse combines the low cost and flexibility of a data lake with the query performance and ACID semantics of a data warehouse.

Key properties:

  • Storage: Open table formats (Apache Iceberg, Delta Lake, Apache Hudi) on object storage (S3, GCS)
  • ACID transactions: Atomic writes, consistent reads across concurrent writers
  • Time travel: Query data as it existed at a past point in time
  • Schema evolution: Add/rename/reorder columns without rewriting data
  • Multiple compute engines: Same data accessed by Spark, Trino, Flink, DuckDB, Snowflake

Examples: Databricks Delta Lake, Apache Iceberg (used by Netflix, Apple), Apache Hudi (used by Uber)

Why it’s replacing pure warehouses: Avoids data duplication (lake + warehouse); cheaper at scale; open formats prevent vendor lock-in; supports both batch and streaming ingestion.

What is Change Data Capture (CDC) and why is it used in data pipelines?
?
Change Data Capture (CDC): A technique that continuously captures row-level changes (INSERT, UPDATE, DELETE) from a source database’s transaction log and publishes them as events.

How it works:

  1. Database writes every change to its transaction log (WAL in PostgreSQL, binlog in MySQL)
  2. CDC tool reads the log (Debezium, AWS DMS, Airbyte)
  3. Changes are published to a message queue (Kafka, Kinesis)
  4. Downstream systems consume the change events to update derived data

Why CDC over polling:

  • Real-time (milliseconds vs minutes for scheduled queries)
  • Low source DB load (reads the log, doesn’t query the table)
  • Captures all changes including deletes (polling misses deletes)
  • Preserves change order

Use cases: Feeding data warehouse with near-real-time data; keeping search indexes in sync; cache invalidation; audit logging


Total Cards: 27
Review Time: ~35 minutes
Priority: HIGH
Last Updated: 2026-05-29