Chapter 1 Flashcards — Trade-Offs in Data Systems Architecture
Definitions
What is OLTP and what are its defining characteristics?
?
OLTP (Online Transaction Processing): Systems that handle live application requests — inserting records, reading user profiles, updating balances.
- Access pattern: Random reads/writes by primary key or index
- Latency target: < 10ms p50, < 100ms p99
- Write pattern: Frequent small inserts, updates, deletes
- Concurrency: Thousands of simultaneous transactions
- Storage: Row-oriented (read whole rows)
- Schema: Normalized (3NF)
- Examples: PostgreSQL, MySQL, DynamoDB, Aurora, Spanner
What is OLAP and what are its defining characteristics?
?
OLAP (Online Analytical Processing): Systems that answer aggregation questions over large historical datasets.
- Access pattern: Sequential column scans over millions–billions of rows
- Latency: Seconds to minutes acceptable
- Write pattern: Bulk batch loads (ETL/ELT), not frequent small writes
- Concurrency: Tens of concurrent queries
- Storage: Column-oriented (read only needed columns)
- Schema: Denormalized (star/snowflake schema)
- Examples: Snowflake, BigQuery, Redshift, ClickHouse
What is a data warehouse and how does it relate to OLTP systems?
?
A data warehouse is a dedicated analytical store designed to receive data from one or more OLTP source systems and serve OLAP queries.
- Data flows from OLTP → warehouse via ETL (Extract-Transform-Load) or ELT pipelines
- Warehouse is derived data: it can be rebuilt from the OLTP source if corrupted
- Optimized for column-oriented reads, not transactional writes
- Modern examples: Snowflake, BigQuery, Amazon Redshift
- Key insight: the warehouse is NOT a second source of truth — it is derived from the operational system
What is the difference between a system of record and derived data?
?
System of Record (Source of Truth):
- The authoritative version of the data; wins all conflicts
- Normalized, write-optimized, transactional
- Examples: PostgreSQL user accounts DB, raw Kafka event log
- Criterion: if two systems disagree, the system of record is correct
Derived Data:
- Computed from the system of record via transformations
- Can be rebuilt if corrupted or lost
- Optimized for specific read patterns (speed, denormalization)
- Examples: Snowflake warehouse, Elasticsearch index, Redis cache, ML feature store
Critical failure mode: Treating derived data as a system of record leads to split-brain inconsistencies when the transformation pipeline lags.
What is a data lake and how does it differ from a data warehouse?
?
Data Lake: Storage layer of raw, unprocessed data in open formats (Parquet, ORC, Iceberg) on cheap object storage (S3, GCS).
- Cost: Much cheaper per TB than warehouse compute+storage
- Schema: Schema-on-read (flexible, no pre-defined schema)
- Governance: Less structured; data quality varies
- Query engine: Needs external engine (Athena, Trino, Spark)
Data Warehouse: Structured, curated, transformed data with enforced schema.
- Cost: Higher per TB but better query performance
- Schema: Schema-on-write (enforced)
- Governance: Strong
Lakehouse (Iceberg + Trino, Databricks Delta): Combines cheap lake storage with warehouse-like ACID semantics and query performance.
What is HTAP?
?
HTAP (Hybrid Transactional/Analytical Processing): A system architecture that serves both OLTP and OLAP workloads from a single data store.
- Internally maintains both row-oriented (for transactions) and column-oriented (for analytics) storage
- Benefit: Eliminates ETL pipeline; analytical queries see live operational data
- Trade-off: Significantly higher system complexity
- Examples: TiDB, Google AlloyDB, SingleStore, CockroachDB (developing)
- When to use: When real-time analytics on operational data is required and the complexity cost is worth it
Trade-offs and Comparisons
What are the main advantages of cloud hosting over self-hosting?
?
Cloud advantages:
- Operational simplicity: Managed services handle patching, failover, backups, scaling
- Elastic scaling: Add capacity in minutes; scale down when load drops
- Global footprint: Deploy in 30+ regions without building data centers
- Managed ecosystem: Managed Kafka (MSK), Redis (ElastiCache), Postgres (RDS) — use without operating
- Pay-per-use economics: Cost tracks usage; favorable for variable workloads
- No upfront hardware capex: Critical for startups
What are the main disadvantages of cloud hosting versus self-hosting?
?
Cloud disadvantages:
- Cost at scale: For predictable, sustained high-volume workloads, cloud can cost more than owned hardware (Dropbox saved ~$75M repatriating from AWS)
- Vendor lock-in: Proprietary APIs (DynamoDB, Bigtable) create switching costs
- Data egress costs: Moving large datasets out of cloud is expensive
- Data sovereignty: Some regulated data must reside in specific jurisdictions
- Reduced control: Cannot tune kernel, choose hardware, or guarantee sub-millisecond latency
- Opaque failures: Cloud outages are outside your control; AZ failures require application-level handling
What are the core problems introduced by distributed systems that don’t exist on a single machine?
?
Five problems unique to distributed systems:
- Partial failures: Some components fail while others run — system state becomes ambiguous
- Network partitions: Network split creates isolated groups; CAP theorem forces choose consistency or availability
- Clock skew: No global clock; event ordering requires logical clocks (Lamport timestamps, vector clocks)
- Byzantine faults: In open networks, nodes may lie or corrupt messages (most internal systems assume crash-fail, not Byzantine)
- Coordination overhead: Agreement across nodes (distributed transactions, leader election) adds latency proportional to node count
Key principle: Do not distribute until you must. Start with a single node.
What are the trade-offs of microservices versus a monolith?
?
Monolith advantages:
- In-process function calls (no network latency or partial failures)
- Single ACID database (no cross-service consistency problems)
- Simple debugging (single stack trace)
- Lower operational complexity (one deployment unit)
Microservices advantages:
- Independent deployment and scaling per service
- Team autonomy (each team owns their service)
- Fault isolation (one service failure doesn’t cascade)
- Technology heterogeneity
When microservices are appropriate: > 50 engineers, > 10 deploys/day, clear service boundaries
Warning: Premature microservices introduce all distributed system complexity with no team-scale benefit
What are the trade-offs of serverless vs containers vs VMs?
?
Serverless (Lambda, Cloud Functions):
- Pro: Zero ops, auto-scales to 0, pay-per-invocation
- Con: Cold start latency (~100ms–1s), execution time limits, no persistent state, vendor lock-in
- Best for: Stateless, event-driven, variable workloads
Containers (Kubernetes, ECS):
- Pro: Flexible, portable, good density, persistent state possible
- Con: Operational overhead (orchestration, networking, scaling config)
- Best for: Long-running services, stateful workloads
VMs (EC2, GCE):
- Pro: Full control, predictable performance, suitable for long-running workloads
- Con: Slow to provision, pay for idle capacity
- Best for: Specialized hardware needs, compliance requirements
Numbers and Precision
What are typical latency targets for OLTP vs OLAP systems?
?
OLTP:
- p50 (median): < 10ms
- p99: < 100ms
- User impact: direct; slow OLTP = poor user experience
OLAP:
- Acceptable range: seconds to minutes
- User is not waiting interactively; query is background analysis
- Key metric: query throughput (queries per hour), not individual latency
HTAP attempts to bridge this: sub-second analytical queries on live data, but at the cost of system complexity
What did Dropbox’s repatriation from AWS demonstrate about cloud economics?
?
Dropbox moved storage workloads from AWS back to their own hardware and saved approximately $75 million over 2 years.
- Key condition: Dropbox had predictable, high-volume, sustained storage workload
- Cloud pay-per-use is economical for variable workloads; at sustained scale, owned hardware can be cheaper
- Break-even point: Roughly ~1M/year cloud spend justifies dedicated hardware evaluation
- Lesson: Cloud is not always cheaper — evaluate based on workload predictability and volume
What are the key GDPR deadlines and requirements architects must plan for?
?
Key GDPR requirements with technical implications:
- Right to erasure: Must delete all personal data for a user within 30 days of request
- Breach notification: Notify supervisory authority within 72 hours of discovering a breach
- Data minimization: Don’t collect data you don’t need
- Purpose limitation: Don’t repurpose data beyond stated collection purpose
Technical challenge: Right to erasure in analytics pipelines
- Event logs with user actions are hard to delete individual records from
- Solution 1: Store userId (not PII) in events; delete only the profile store
- Solution 2: Cryptographic erasure — encrypt user data under per-user key; delete the key
What is the S3 durability guarantee and why does it matter architecturally?
?
S3 durability: 99.999999999% (11 nines)
- Equivalent to expected loss of 1 object per 10,000 years if you have 10,000 objects
- Achieved through redundant storage across multiple AZs and integrity checksums
Architectural implication: Object storage is reliable enough to serve as the universal substrate for data pipelines.
- Build data lakes on S3 (or GCS/Azure Blob) rather than HDFS
- Lakehouse architectures (Iceberg, Delta) built entirely on S3 + compute-storage separation
- Cloud-native means: compute is disposable; state lives in object storage
Application and Failure Modes
When should you NOT use microservices?
?
Avoid microservices when:
- Team size is < 30–50 engineers (coordination overhead outweighs autonomy benefits)
- Deployment frequency is low (< few deploys/day; independence benefit doesn’t materialize)
- Service boundaries are unclear (premature decomposition creates wrong seams)
- Strong transactional consistency is required across what would be service boundaries (cross-service ACID is expensive)
Failure mode of premature microservices:
- Each service call becomes a distributed systems problem (retries, timeouts, tracing)
- Data consistency requires sagas or 2PC — complex and error-prone
- Operational overhead multiplies: N services to monitor, deploy, and debug
- Rule: Start with a monolith; extract services when team and deployment frequency demand it
Why can’t you run OLAP queries directly against your OLTP database?
?
Running OLAP queries against an OLTP database causes problems for both workloads:
Harm to OLTP:
- Long sequential scans hold table locks, blocking concurrent transactions
- Large scans evict the OLTP working set from the buffer pool (cache thrashing)
- CPU and I/O spikes degrade response times for live users
Harm to OLAP:
- Row-oriented storage is inefficient for column scans (reads unnecessary data)
- Missing column compression and vectorized execution → slow queries
- No isolation from OLTP write traffic
Solution: Separate systems with ETL/CDC pipeline. The warehouse is derived from the OLTP system.
Exception: HTAP systems (TiDB, AlloyDB) mitigate this at the cost of higher system complexity.
What are the failure modes of treating derived data as a system of record?
?
What goes wrong:
- Stale data mistaken for current: Pipeline lag causes downstream systems to believe old data is current
- Split-brain inconsistency: Two derived stores diverge because one pipeline is behind; neither is authoritative
- Incorrect conflict resolution: When two sources disagree, you don’t know which is right
- Data loss on rebuild: If you treat the derived store as authoritative, you may discard the true source
- Schema confusion: Schema changes in source don’t propagate; derived store has different structure
Correct mental model:
- System of record = gold copy; always correct; if others disagree, it wins
- Derived = bronze copy; may be stale; rebuild from source if corrupted
What happens when a company adopts cloud-native architecture but keeps on-premises thinking?
?
Anti-patterns from porting on-premises patterns to cloud:
- Stateful instances: Storing session state in application server memory — fails when cloud instances are terminated (spot preemption, auto-scaling events)
- Coupled compute and storage: Running databases on single large EC2 instances instead of managed services — loses elasticity
- Manual capacity planning: Pre-allocating instance count instead of using auto-scaling groups
- Long-lived instances as pets: Treating cloud VMs like physical servers (manual patching, no replacements) — fails when cloud hardware is recycled
Cloud-native principles instead:
- Stateless compute; all durable state in managed storage
- Decouple compute and storage (Snowflake model)
- Treat every instance as disposable; design for termination at any time
- Use managed services; avoid operating infrastructure that a cloud provider can operate for you
What are the practical technical steps to comply with GDPR right to erasure in a data pipeline?
?
The problem: Event logs and analytical tables contain PII mixed with behavioral data. Can’t delete individual rows from columnar files.
Approach 1 — Separation by design:
- Store only
userId(opaque integer) in event logs and warehouse - Personal data (name, email, DOB) lives only in the user profile database
- On erasure: delete user profile row → events become pseudonymous, not personal data
- Trade-off: Loses ability to re-identify events if needed
Approach 2 — Cryptographic erasure:
- Encrypt all user-specific data with a per-user encryption key
- Store the key separately (key management service)
- On erasure: delete the key → all encrypted data is unreadable = effectively erased
- Trade-off: Adds encryption/decryption overhead; key management complexity
Approach 3 — Partitioned storage with TTL:
- Partition event data by time; apply retention policies
- Accept that old events beyond retention period are automatically purged
What is the difference between ETL and ELT, and when does each make sense?
?
ETL (Extract-Transform-Load):
- Transform data before loading into the warehouse
- Data is clean and structured when it arrives
- Traditional approach when warehouse compute was expensive
- Best for: Complex transformations that need specialized tools; regulated data that must be clean before storage
ELT (Extract-Load-Transform):
- Load raw data into warehouse first, then transform inside the warehouse using SQL
- Takes advantage of cheap and powerful warehouse compute (Snowflake, BigQuery)
- dbt (data build tool) is the dominant ELT transformation layer in 2026
- Best for: Modern cloud warehouses; when you want to preserve raw data for reprocessing; faster ingestion
Trend: ELT has largely displaced ETL for cloud warehouse workloads. The raw data layer serves as a replay buffer.
What is cloud-native architecture and what are its core principles?
?
Cloud-native means designing systems for cloud primitives rather than porting on-premises patterns to the cloud.
Core principles:
- Disposable infrastructure: Instances terminate unexpectedly — design for it; no state in application servers
- Compute-storage separation: Scale each independently (Snowflake separates query compute from S3 storage)
- Prefer managed services: Let the cloud provider operate Kafka, Redis, Postgres unless you have specific reasons not to
- Design for availability zones: Spread replicas across AZs for fault tolerance without cross-region latency
- Object storage as substrate: S3/GCS with 11-nines durability is the universal data foundation
Anti-patterns (on-premises thinking in the cloud):
- Storing session state in application server memory
- Treating cloud VMs as pets rather than cattle
- Manual capacity planning instead of auto-scaling
What is the difference between star schema and snowflake schema in data warehouses?
?
Both are dimensional modeling schemas for OLAP warehouses. They differ in how dimension tables are structured.
Star schema:
- One central fact table (events/transactions) surrounded by dimension tables (product, customer, date)
- Dimension tables are denormalized — contain redundant data
- Simpler joins; typically 2 joins to answer a question
- Better query performance for simple queries
Snowflake schema:
- Dimension tables are normalized — hierarchies split into separate tables
- Example: product → product_category → department (3 tables instead of 1)
- More joins required; lower storage cost; harder to query
2026 practice: Most modern data warehouses (Snowflake, BigQuery) use star schema with some denormalization. dbt models typically implement star schema patterns.
What is the difference between ETL and ELT?
?
ETL (Extract-Transform-Load):
- Extract data from source systems
- Transform (clean, join, aggregate) outside the warehouse in a dedicated ETL tool
- Load cleaned data into the warehouse
ELT (Extract-Load-Transform):
- Extract raw data from source systems
- Load raw data directly into the warehouse
- Transform inside the warehouse using SQL (dbt, stored procedures)
Why ELT dominates in 2026:
- Cloud warehouses (Snowflake, BigQuery) have cheap, powerful compute — transform there
- Preserves raw data for reprocessing as requirements change
- dbt has made SQL-based transformation the standard
- Faster ingestion; raw data available sooner
When ETL still makes sense: Complex transformations requiring specialized tools; regulated data that must be cleaned before storage
What is write amplification in the fan-out context?
?
Write amplification is when one logical write operation results in many physical writes.
Fan-out context: When a user posts, the message must be written to all followers’ timeline caches.
- A user with 1M followers posts once → 1M cache writes
- A celebrity with 30M followers (e.g., Taylor Swift on a social network) posts → 30M writes
- If each write takes 1ms, 30M writes take 30,000 seconds sequentially — must be parallelized
Why it matters: Write amplification determines the system’s write capacity ceiling and is the primary reason celebrities are handled differently from regular users in fan-out architectures.
Other contexts where write amplification appears:
- LSM-tree compaction (Ch4): data rewritten multiple times during compaction
- Replication (Ch6): single write replicated to N replicas
- Indexed writes: each write updates multiple index structures
What are the key characteristics of a data lakehouse?
?
A lakehouse combines the low cost and flexibility of a data lake with the query performance and ACID semantics of a data warehouse.
Key properties:
- Storage: Open table formats (Apache Iceberg, Delta Lake, Apache Hudi) on object storage (S3, GCS)
- ACID transactions: Atomic writes, consistent reads across concurrent writers
- Time travel: Query data as it existed at a past point in time
- Schema evolution: Add/rename/reorder columns without rewriting data
- Multiple compute engines: Same data accessed by Spark, Trino, Flink, DuckDB, Snowflake
Examples: Databricks Delta Lake, Apache Iceberg (used by Netflix, Apple), Apache Hudi (used by Uber)
Why it’s replacing pure warehouses: Avoids data duplication (lake + warehouse); cheaper at scale; open formats prevent vendor lock-in; supports both batch and streaming ingestion.
What is Change Data Capture (CDC) and why is it used in data pipelines?
?
Change Data Capture (CDC): A technique that continuously captures row-level changes (INSERT, UPDATE, DELETE) from a source database’s transaction log and publishes them as events.
How it works:
- Database writes every change to its transaction log (WAL in PostgreSQL, binlog in MySQL)
- CDC tool reads the log (Debezium, AWS DMS, Airbyte)
- Changes are published to a message queue (Kafka, Kinesis)
- Downstream systems consume the change events to update derived data
Why CDC over polling:
- Real-time (milliseconds vs minutes for scheduled queries)
- Low source DB load (reads the log, doesn’t query the table)
- Captures all changes including deletes (polling misses deletes)
- Preserves change order
Use cases: Feeding data warehouse with near-real-time data; keeping search indexes in sync; cache invalidation; audit logging
Total Cards: 27
Review Time: ~35 minutes
Priority: HIGH
Last Updated: 2026-05-29