Chapter 12 Flashcards - The Future of Data Systems

flashcards chapter-12 ddia

Basic Concepts

What does “unbundling” databases mean and why does Kleppmann advocate for it?
?

Unbundling: Use specialized, purpose-built tools for each data workload instead of one monolithic database
- PostgreSQL for OLTP; Elasticsearch for search; Redis for caching; Snowflake for analytics; Kafka for streaming
- Each tool optimized for its specific workload; compose them via an event log
Why unbundle:
- No single database is best for all workloads (OLTP + full-text search + analytics = incompatible optimizations)
- Specialized tools can be orders of magnitude faster for their specific workload
- Independent scaling and upgrades
How to keep them in sync: CDC → Kafka → each system subscribes and updates itself
Trade-offs:
- ✅ Performance, flexibility, best-of-breed for each workload
- ❌ Operational complexity (many systems), eventual consistency between systems
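
The sync step can be sketched as follows. This is a minimal illustration, assuming change events shaped like Debezium's envelope (`op`, `before`, `after`); the two dictionaries are stand-ins for a search index and a cache, and `apply_change` is a hypothetical name, not part of any library.

```python
# Stand-ins for derived stores; real systems would be e.g. Elasticsearch and Redis.
search_index = {}   # product id -> lowercased searchable text
cache = {}          # product id -> full row


def apply_change(event):
    """Apply one CDC change event to both derived stores."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        search_index[row["id"]] = row["name"].lower()
        cache[row["id"]] = row
    elif op == "d":
        row_id = event["before"]["id"]
        search_index.pop(row_id, None)
        cache.pop(row_id, None)


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Red Mug"}},
    {"op": "u", "before": {"id": 1, "name": "Red Mug"},
     "after": {"id": 1, "name": "Blue Mug"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Tea Pot"}},
    {"op": "d", "before": {"id": 2, "name": "Tea Pot"}, "after": None},
]
for event in events:
    apply_change(event)
```

Because every store applies the same ordered events, they converge on the same final state even though each one updates at its own pace.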

What is the role of a durable ordered event log (Kafka) as the “integration backbone”?
?

Event log as backbone: Kafka serves as the central, immutable record of all changes, ordered within each partition
Producer: Any system that generates events writes to Kafka (databases via CDC, applications directly)
Consumers: Any derived system subscribes to relevant topics and maintains its own optimized view
Why it works:
1. Single source of truth: All changes captured once (no N² point-to-point connections)
2. Replay: A consumer can re-read from the earliest retained offset and rebuild its derived view from scratch
3. Fan-out: Multiple independent consumers from same events
4. Decoupling: Producers don’t know about consumers; consumers can be added later
Derived data principle: Search index, cache, OLAP store are all derived — can be regenerated from the log if corrupted
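
Fan-out, replay, and decoupling can be seen in a toy model. This sketch assumes a single in-memory list as a stand-in for one Kafka partition; `EventLog`, `Consumer`, and the two view functions are hypothetical names for illustration.

```python
class EventLog:
    """Append-only, ordered log; a toy stand-in for a single Kafka partition."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    """Tracks its own offset and folds events into a private derived view."""

    def __init__(self, log, apply):
        self.log = log
        self.apply = apply
        self.offset = 0
        self.view = {}

    def poll(self):
        for event in self.log.read_from(self.offset):
            self.apply(self.view, event)
            self.offset += 1


def count_by_user(view, event):
    view[event["user"]] = view.get(event["user"], 0) + 1


def latest_item(view, event):
    view[event["user"]] = event["item"]


log = EventLog()
counts = Consumer(log, count_by_user)
log.append({"user": "ann", "item": "book"})
log.append({"user": "bob", "item": "pen"})
counts.poll()
log.append({"user": "ann", "item": "lamp"})

latest = Consumer(log, latest_item)  # added later; replays from offset 0
counts.poll()
latest.poll()
```

The producer code never changes when `latest` is added, which is the decoupling property in miniature.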

Architectures

What is the Lambda architecture and why is it considered an antipattern?
?

Lambda architecture (Nathan Marz):
- Batch layer: Recomputes everything periodically (Spark, full historical data)
- Speed layer: Processes only recent data in real-time (Flink/Storm)
- Serving layer: Merges batch results + speed layer results to answer queries
Why it’s problematic (antipattern):
1. Two code paths: Same computation implemented twice (batch + streaming versions)
2. Hard to keep identical: The two implementations drift apart; a bug in one does not appear in the other → inconsistent results
3. Complex merge: Combining batch and speed results at query time is complex
4. Operational burden: Two systems to maintain, monitor, debug
Kappa architecture (Jay Kreps, LinkedIn, 2014):
- Single stream processor for everything
- Historical reprocessing: replay Kafka log through same stream processor
- ✅ One code path, simpler operation, same logic for real-time and backfill
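
The "one code path" idea can be sketched like this. It assumes events are plain dictionaries and that the retained log is available as a list; `process` and `run` are hypothetical names, and a real deployment would use a stream processor reading from Kafka.

```python
def process(state, event):
    """The single code path: running revenue per product."""
    product = event["product"]
    state[product] = state.get(product, 0) + event["amount"]
    return state


def run(events, state=None):
    """Feed events through process(); used for both backfill and live traffic."""
    state = {} if state is None else state
    for event in events:
        process(state, event)
    return state


history = [
    {"product": "mug", "amount": 5},
    {"product": "pot", "amount": 20},
]
live = [{"product": "mug", "amount": 5}]

state = run(history)       # backfill: replay the retained log
state = run(live, state)   # live: the same function handles new events

# Reprocessing everything from scratch gives the same answer.
reprocessed = run(history + live)
```

There is no separate batch implementation to keep in agreement with the streaming one, which is the point of contrast with Lambda.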

What does “end-to-end correctness” mean and why do database guarantees not automatically provide it?
?

End-to-end correctness: The application produces correct results for users, from input to output, even in the presence of failures and concurrency
Why DB guarantees aren’t enough:
- Database ACID prevents data corruption within the DB
- BUT: application code can still have bugs in business logic
- Application may retry operations without idempotency → duplicate side effects
- Cross-service operations (microservices) have no shared transaction
- User-visible behavior requires correct application logic at every layer
Example: Even with a perfectly ACID database, a payment service can double-charge if:
- Network timeout → retry → payment processed twice
- Solution required: idempotency key at application level, not just DB level
Principle: Design for correctness at each layer; lower layers don’t guarantee higher layers
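
A minimal sketch of the application-level idempotency key from the payment example. `PaymentService` and its fields are hypothetical; a real service would persist the key and the charge in the same atomic transaction rather than in memory.

```python
class PaymentService:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self):
        self.total_charged = 0
        self._results = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount):
        # A retry with a key we have already seen returns the stored result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-42", 100)
retry = service.charge("order-42", 100)  # client retried after a timeout
```

The client generates the key once per logical operation and reuses it on every retry, so the deduplication spans the whole request path rather than a single database transaction.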

Derived Data and Immutability

What is the “derived data” principle and how does it simplify fault tolerance?
?

Derived data: Any data that can be recomputed from an immutable source (the event log)
- Search index = derived from event log
- Cache = derived from database
- OLAP aggregates = derived from raw events
- ML features = derived from interaction events
Fault tolerance simplification:
- If a derived store is corrupted or lost: just re-derive from the source
- No need for complex backup/restore of derived stores — they’re computable
- New derived views can be built by replaying all historical events
Contrast with source-of-truth data: Raw events/transactions are not derived; they must be preserved and backed up
Practical implication: Run batch/stream jobs to rebuild derived stores as needed; event log is what must be durably stored
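
Recovery by re-derivation can be sketched as follows, assuming the log holds product events with a `title` field and the derived store is a simple inverted index. `build_index` is a hypothetical helper, and the tuple stands in for the durable event log.

```python
def build_index(log):
    """Derive an inverted index (word -> set of product ids) from product events."""
    index = {}
    for event in log:
        for word in event["title"].lower().split():
            index.setdefault(word, set()).add(event["product_id"])
    return index


# The source of truth: immutable, and the only thing that must be backed up.
event_log = (
    {"product_id": 1, "title": "Red Ceramic Mug"},
    {"product_id": 2, "title": "Blue Ceramic Bowl"},
)

index = build_index(event_log)
index.clear()                    # simulate losing the derived store
index = build_index(event_log)   # recovery is just re-derivation
```

Since `build_index` is a pure function of the log, the rebuilt index is identical to the one that was lost.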

How do you implement GDPR “right to erasure” when using an immutable event log?
?

Challenge: Immutable event logs (Kafka, event sourcing) are append-only by design; removing individual records is impractical and undermines the log’s role as a replayable source of truth
Solution 1: Cryptographic erasure:
1. Generate a unique encryption key K_user per user at account creation
2. All events containing personal data encrypted with K_user before logging
3. “Right to erasure”: delete K_user from key store
4. Events still in log but now indecipherable → functionally erased
5. Cheap: only need to delete a small key, not scan/modify the log
Solution 2: Pseudonymization:
- Store user_id → random_token mapping separately
- Events use random_token (not user_id) as identifier
- “Erase”: delete the mapping → events can no longer be linked to the user
Remaining challenge: Derived views (search, cache, ML models) must also be updated/rebuilt to remove erased user’s data
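
The cryptographic-erasure steps can be sketched as below. This is an illustration of the key-management flow only: the XOR keystream built from `hashlib.shake_256` is a toy stand-in for an authenticated cipher such as AES-GCM from a vetted library and should not be used in production. `KeyStore`, `append_event`, and `read_event` are hypothetical names.

```python
import hashlib
import json
import secrets


class KeyStore:
    """Hypothetical per-user key store; deleting a key is the erasure step."""

    def __init__(self):
        self._keys = {}

    def create(self, key_id):
        self._keys[key_id] = secrets.token_bytes(32)

    def get(self, key_id):
        return self._keys.get(key_id)

    def erase(self, key_id):
        self._keys.pop(key_id, None)


def _xor_stream(key, nonce, data):
    # Toy cipher (assumption, not secure practice): XOR with SHAKE-256 output.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def append_event(log, keys, key_id, payload):
    nonce = secrets.token_bytes(16)
    plaintext = json.dumps(payload).encode("utf-8")
    ciphertext = _xor_stream(keys.get(key_id), nonce, plaintext)
    log.append({"key_id": key_id, "nonce": nonce, "ciphertext": ciphertext})


def read_event(entry, keys):
    key = keys.get(entry["key_id"])
    if key is None:
        return None  # key erased: the payload is no longer recoverable
    return json.loads(_xor_stream(key, entry["nonce"], entry["ciphertext"]))


keys = KeyStore()
log = []
keys.create("k1")  # opaque key id assigned to the user at account creation
append_event(log, keys, "k1", {"email": "ann@example.com"})

before = read_event(log[0], keys)
keys.erase("k1")   # "right to erasure": delete only the key
after = read_event(log[0], keys)
```

The log itself is never modified; only the small key record is deleted, which is what makes the approach cheap. Any backups of the key store need the same deletion for the erasure to hold.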

Ethics and Responsibility

What are the key ethical responsibilities of data engineers and architects?
?

Data minimization: Only collect data you have a clear purpose for
- “Just in case” data collection creates liability without value
- Less data = smaller breach impact
Purpose limitation: Data used only for the stated purpose
- Data collected for fraud detection should not be used for advertising
- Technical enforcement via access controls, not just policy
Bias and fairness: ML models trained on historical data perpetuate historical biases
- Example: Hiring algorithm trained on past (mostly male) hires → discriminates against women
- Responsibility: audit models for disparate impact; document training data
Transparency: Be clear about automated decision-making; provide explanations
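- GDPR Article 22: right not to be subject to solely automated decisions with legal or similarly significant effects; Articles 13-15 and Recital 71 add a right to meaningful information about the logic involved (often called the “right to explanation”)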
“Just following specifications” is not an excuse: Engineers have professional responsibility for what they build
Data sovereignty: Users should have meaningful control over their data (access, portability, deletion)

Modern Context (2026)

What is the Data Mesh organizational pattern and how does it relate to DDIA’s unbundling?
?

Data Mesh (Zhamak Dehghani, 2019-2020): Organizational approach to data that decentralizes ownership
Core principles:
1. Domain ownership: Each business domain (marketing, payments, orders) owns its data
2. Data as a product: Domain teams provide data with SLAs, documentation, quality guarantees
3. Self-serve data platform: Central platform team provides infrastructure (Kafka, data catalog)
4. Federated governance: Common standards (schemas, lineage, quality) but local implementation
Relationship to DDIA’s unbundling:
- DDIA: technical unbundling (right tool for right workload)
- Data Mesh: organizational unbundling (right team owns right data)
- Complementary: Data Mesh uses unbundled technical stack; provides organizational ownership model
Why needed: Centralized data teams can’t scale to support all domain data needs; domain teams know their data best

How are LLMs and AI changing data system design patterns in 2026?
?

RAG (Retrieval-Augmented Generation):
- LLMs augmented with real-time data retrieval at inference time
- Pattern: user query → retrieve relevant documents from vector DB/search → include in LLM context
- New data access pattern: similarity search over embeddings (vector DBs)
- Data freshness critical: an LLM’s training data may be months old; RAG supplies current data at query time
Text2SQL:
- Natural language → SQL query → execute → return results
- Democratizes data access: analysts without SQL skills can query data
- Challenges: schema understanding, handling ambiguity, safety (SQL injection equivalent risks)
AI-generated data pipelines:
- LLMs generating dbt models, Spark jobs, Flink pipelines from specifications
- Early adoption 2024-2026; reduces time to build but requires human review
AI governance:
- Training data provenance: what data was used to train models?
- EU AI Act (2024): high-risk AI must document training data and testing for bias
- Model cards, data sheets: documentation standards for AI systems
New architectural patterns: LLM as transformation operator in data pipelines (document parsing, classification, enrichment)
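
The RAG retrieval pattern can be sketched with the standard library alone. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the list `index` stands in for a vector DB, and all names here are hypothetical. The final LLM call is omitted.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts. A real system would call a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


documents = [
    "Refunds are processed within five business days",
    "Orders ship from the warehouse every weekday",
    "Gift cards never expire",
]
index = [(doc, embed(doc)) for doc in documents]  # stand-in for a vector DB


def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query):
    """Place retrieved documents in the context that would be sent to the LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt("when are refunds processed")
```

Keeping `index` up to date is an ordinary derived-data problem: it can be maintained by a consumer of the same event log as the other derived stores.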

Interview Scenarios

Design a data architecture for a large e-commerce platform with many downstream use cases.
?
Requirements: OLTP orders + search + recommendations + analytics + real-time notifications + fraud detection

Architecture (unbundled + Kafka backbone):

Write side (source of truth):

PostgreSQL: orders, products, users (OLTP, ACID)
CDC via Debezium → Kafka topics (orders, products, users)

Derived systems (via Kafka consumers):

Elasticsearch: Full-text search (products) — consumes products topic
Redis: Session cache, user preferences — consumes users topic
Snowflake/BigQuery: Analytics, reporting — consumes all topics via Kafka Connect
Flink (stream processing): Real-time fraud scoring — consumes orders topic
ML platform: Recommendations — batch (Spark) + real-time feature store

Stream processing (Flink):

Fraud detection: window aggregates, stream-table join with user risk profiles
Notification triggers: “your order shipped” → push notification

Serving pattern:

Read orders: PostgreSQL (consistent, authoritative)
Search products: Elasticsearch (specialized, fast)
Analytics: Snowflake (OLAP, complex queries)
Recommendations: Redis (pre-computed by ML model)

Key principle: Each downstream system can be rebuilt from Kafka if corrupted (given sufficient retention or compacted topics); Kafka is the backbone
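
A window aggregate of the kind used for fraud scoring can be sketched as below. This is plain Python rather than Flink, it assumes events arrive in timestamp order, and `VelocityCheck` with its thresholds is a hypothetical example rather than a recommended rule.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flags a user who places more than max_orders within window_seconds."""

    def __init__(self, max_orders, window_seconds):
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # user id -> timestamps in the window

    def on_order(self, user_id, timestamp):
        window = self._recent[user_id]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_orders


check = VelocityCheck(max_orders=2, window_seconds=60)
flags = [check.on_order("u1", t) for t in (0, 10, 20, 100)]
```

The third order is flagged because three orders fall inside one 60-second window; by the fourth, the earlier ones have expired.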

How would you explain the trade-off between strong consistency and system availability to a product manager?
?
Simple explanation:
“Think of it like a bank with multiple tellers. Strong consistency means all tellers must check with each other before every transaction: accurate, but slower, and if the tellers can’t reach each other they have to stop serving customers. Eventual consistency means each teller works from their own records and syncs up later: fast and always open, but temporarily inconsistent.”

Practical trade-off in our system:

Strong consistency required for:
- User sees their own just-posted item (read-after-write)
- Payment deducted exactly once (no double-charge)
- Username is truly unique
Eventual consistency acceptable for:
- Search index takes 2 seconds to show a new product (users won’t notice)
- Analytics dashboard shows yesterday’s data (expected)
- Recommendation model updates hourly (freshness isn’t critical)

Business impact framing:

Strong consistency costs more in infrastructure and latency because of coordination overhead (roughly 2-5x as a rule of thumb; the real figure depends on the workload)
Eventual consistency means some users see stale data briefly (usually imperceptible)
Wrong choice: paying for strong consistency everywhere, or using eventual where users notice

Recommendation: Default to eventual consistency; add strong consistency only where business rules or user experience require it
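
One way to add consistency only where users notice is to route selectively. The sketch below sends a user's reads to the leader for a short time after that user writes, which gives read-after-write behaviour while other reads stay on followers. `ReadRouter` and the five-second allowance are hypothetical, and the approach assumes replication lag stays below the allowance.

```python
class ReadRouter:
    """Send a user's reads to the leader for a short time after they write."""

    def __init__(self, lag_allowance_seconds):
        self.lag_allowance = lag_allowance_seconds
        self._last_write = {}  # user id -> time of most recent write

    def record_write(self, user_id, now):
        self._last_write[user_id] = now

    def choose_replica(self, user_id, now):
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.lag_allowance:
            return "leader"    # the user's own write may not have replicated yet
        return "follower"      # eventual consistency is acceptable here


router = ReadRouter(lag_allowance_seconds=5)
router.record_write("ann", now=100)

soon_after = router.choose_replica("ann", now=102)
later = router.choose_replica("ann", now=106)
other_user = router.choose_replica("bob", now=102)
```

Only the small fraction of reads that follow a recent write pay the cost of going to the leader.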

Quick Facts

What are the four core principles of GDPR most relevant to data system design?
?

Data minimization: Only collect data “adequate, relevant, and limited to what is necessary”
- Technical: schema design, collection APIs should default to minimal collection
Purpose limitation: Data collected for one purpose cannot be reused for an incompatible purpose
- Technical: access control, data catalog with purpose documentation
Storage limitation: Personal data not kept longer than necessary
- Technical: automated TTL/expiry, deletion pipelines, log retention policies
Right to erasure (Art. 17): Users can request deletion of their personal data
- Technical: cryptographic erasure for event logs, deletion propagation to derived stores

Additional rights:

Right to access: Users can see what data you hold about them
Right to portability: Users can receive their data in machine-readable format
Right to explanation: People subject to automated decisions can request meaningful information about the logic involved

Engineer’s responsibility: These are legal requirements (fines up to €20 million or 4% of global annual turnover, whichever is higher); design systems to support them from the start
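
Storage limitation is the principle that maps most directly to code. A minimal sketch of a retention purge, assuming each record carries a `purpose` and a `created_at` timestamp; the retention periods shown are illustrative placeholders, not legal guidance, and `purge_expired` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-purpose retention periods.
RETENTION = {
    "session": timedelta(days=30),
    "order": timedelta(days=365 * 7),
}


def purge_expired(records, now):
    """Return only records still inside the retention period for their purpose."""
    return [
        r for r in records
        if now - r["created_at"] < RETENTION[r["purpose"]]
    ]


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "purpose": "session", "created_at": now - timedelta(days=45)},
    {"id": 2, "purpose": "session", "created_at": now - timedelta(days=5)},
    {"id": 3, "purpose": "order", "created_at": now - timedelta(days=400)},
]

kept = purge_expired(records, now)
```

Tagging each record with its purpose also supports purpose limitation, since access checks can be keyed on the same field.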

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI
