Chapter 12 Flashcards - The Future of Data Systems
Basic Concepts
What does “unbundling” databases mean and why does Kleppmann advocate for it?
?
-
Unbundling: Use specialized, purpose-built tools for each data workload instead of one monolithic database
- PostgreSQL for OLTP; Elasticsearch for search; Redis for caching; Snowflake for analytics; Kafka for streaming
- Each tool optimized for its specific workload; compose them via an event log
-
Why unbundle:
- No single database is best for all workloads (OLTP + full-text search + analytics = incompatible optimizations)
- Specialized tools can be orders of magnitude faster for their specific workload
- Independent scaling and upgrades
-
How to keep them in sync: CDC → Kafka → each system subscribes and updates itself
-
Trade-offs:
- ✅ Performance, flexibility, best-of-breed for each workload
- ❌ Operational complexity (many systems), eventual consistency between systems
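The CDC → Kafka → subscriber flow can be sketched in a few lines. This is a minimal in-memory illustration, not Debezium's or Kafka's actual API: the `ChangeEvent` shape and the `SearchIndex` and `Cache` classes are hypothetical names invented for the example, and a plain list stands in for the topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC event: one row-level change from the source database."""
    op: str                 # "upsert" or "delete"
    key: int                # primary key of the changed row
    row: Optional[dict]     # new row contents; None for deletes


class SearchIndex:
    """Derived view 1: answers word queries over product names (linear scan, for brevity)."""
    def __init__(self):
        self.names = {}     # key -> product name

    def apply(self, ev):
        if ev.op == "delete":
            self.names.pop(ev.key, None)
        else:
            self.names[ev.key] = ev.row["name"]

    def search(self, word):
        w = word.lower()
        return sorted(k for k, name in self.names.items() if w in name.lower().split())


class Cache:
    """Derived view 2: key -> full row, for fast point lookups."""
    def __init__(self):
        self.rows = {}

    def apply(self, ev):
        if ev.op == "delete":
            self.rows.pop(ev.key, None)
        else:
            self.rows[ev.key] = ev.row


# The ordered change stream stands in for a Kafka topic fed by CDC.
changes = [
    ChangeEvent("upsert", 1, {"name": "Red Kettle", "price": 30}),
    ChangeEvent("upsert", 2, {"name": "Blue Kettle", "price": 35}),
    ChangeEvent("upsert", 1, {"name": "Red Teapot", "price": 28}),
    ChangeEvent("delete", 2, None),
]

index, cache = SearchIndex(), Cache()
for ev in changes:          # every subscriber applies the same events in the same order
    index.apply(ev)
    cache.apply(ev)
```

Because both views consume the same ordered stream, they converge on the same final state without the source database knowing either of them exists.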
What is the role of a durable ordered event log (Kafka) as the “integration backbone”?
?
-
Event log as backbone: Kafka serves as the central, immutable, ordered record of all changes
-
Producer: Any system that generates events writes to Kafka (databases via CDC, applications directly)
-
Consumers: Any derived system subscribes to relevant topics and maintains its own optimized view
-
Why it works:
- Single source of truth: All changes captured once (no N² point-to-point connections)
- Replay: A consumer can re-read the log from offset 0 to rebuild its derived view from scratch
- Fan-out: Multiple independent consumers from same events
- Decoupling: Producers don’t know about consumers; consumers can be added later
-
Derived data principle: Search index, cache, OLAP store are all derived — can be regenerated from the log if corrupted
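Fan-out, late-joining consumers, and replay can be shown with a toy log. This is a sketch under simplifying assumptions: `EventLog` stands in for a single Kafka partition held in memory, and `Consumer`, `poll`, and `rebuild` are hypothetical names, not Kafka client calls.

```python
class EventLog:
    """Append-only, ordered log; stands in for a single Kafka partition."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1      # offset of the new event

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    """Tracks its own offset and folds events into a derived view."""
    def __init__(self, log, fold, initial):
        self.log, self.fold, self.initial = log, fold, initial
        self.view, self.offset = initial, 0

    def poll(self):
        for event in self.log.read_from(self.offset):
            self.view = self.fold(self.view, event)   # fold returns a new view
            self.offset += 1

    def rebuild(self):
        """Discard the view and replay the log from offset 0."""
        self.view, self.offset = self.initial, 0
        self.poll()


log = EventLog()
revenue = Consumer(log, lambda total, e: total + e["amount"], 0)
per_user = Consumer(
    log, lambda v, e: {**v, e["user"]: v.get(e["user"], 0) + e["amount"]}, {}
)

for user, amount in [("ann", 10), ("bob", 5), ("ann", 7)]:
    log.append({"user": user, "amount": amount})

revenue.poll()
per_user.poll()

# A consumer added later starts at offset 0 and catches up on history.
order_count = Consumer(log, lambda n, e: n + 1, 0)
order_count.poll()

# Simulate corruption of a derived view, then regenerate it from the log.
per_user.view = {"ann": -999}
per_user.rebuild()
```

The producer loop never references any consumer, which is the decoupling property the card describes.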
Architectures
What is the Lambda architecture and why is it considered an antipattern?
?
-
Lambda architecture (Nathan Marz):
- Batch layer: Recomputes everything periodically (Spark, full historical data)
- Speed layer: Processes only recent data in real-time (Flink/Storm)
- Serving layer: Merges batch results + speed layer results to answer queries
-
Why it’s problematic (antipattern):
- Two code paths: Same computation implemented twice (batch + streaming versions)
- Hard to keep identical: A fix or change made in one implementation is easily missed in the other → inconsistent results
- Complex merge: Combining batch and speed results at query time is complex
- Operational burden: Two systems to maintain, monitor, debug
-
Kappa architecture (Jay Kreps, LinkedIn, 2014):
- Single stream processor for everything
- Historical reprocessing: replay Kafka log through same stream processor
- ✅ One code path, simpler operation, same logic for real-time and backfill
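The "one code path" idea can be illustrated with a single processing function used for both live events and a full replay. This is a simplified sketch: the event fields and the `count_page_views` function are hypothetical, and a list stands in for the retained Kafka log.

```python
from collections import Counter


def count_page_views(events, state=None):
    """The single processing function, used for live traffic and for backfill."""
    state = Counter() if state is None else state
    for event in events:
        if event["type"] == "page_view":
            state[event["page"]] += 1
    return state


history = [
    {"type": "page_view", "page": "/home"},
    {"type": "click", "page": "/home"},
    {"type": "page_view", "page": "/cart"},
    {"type": "page_view", "page": "/home"},
]

# Live path: events are processed one at a time as they arrive.
live_state = Counter()
for event in history:
    count_page_views([event], live_state)

# Backfill path: replay the whole retained log through the same function.
replayed_state = count_page_views(history)
```

Since both paths run identical logic, the live and replayed results agree by construction, which is what Lambda's two implementations cannot guarantee.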
What does “end-to-end correctness” mean and why do database guarantees not automatically provide it?
?
-
End-to-end correctness: The application produces correct results for users, from input to output, even in the presence of failures and concurrency
-
Why DB guarantees aren’t enough:
- ACID transactions protect invariants only within a single database
- BUT: application code can still have bugs in business logic
- Application may retry operations without idempotency → duplicate side effects
- Cross-service operations (microservices) have no shared transaction
- User-visible behavior requires correct application logic at every layer
-
Example: Even with a perfectly ACID database, a payment service can double-charge if:
- Network timeout → retry → payment processed twice
- Solution required: idempotency key at application level, not just DB level
-
Principle: Design for correctness at each layer; lower layers don’t guarantee higher layers
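The double-charge example can be made concrete with an application-level idempotency key. This is a minimal sketch, assuming a hypothetical in-memory `PaymentService`; a production service would store the key and the charge in the same database transaction.

```python
import uuid


class PaymentService:
    """Hypothetical payment backend that deduplicates requests by idempotency key."""
    def __init__(self):
        self.charged = {}      # account -> total amount charged
        self._seen = {}        # idempotency key -> stored result

    def charge(self, idempotency_key, account, amount):
        # In a real service this lookup and the charge must commit atomically
        # (for example, a unique constraint on the key in the same transaction).
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.charged[account] = self.charged.get(account, 0) + amount
        result = {"status": "charged", "account": account, "amount": amount}
        self._seen[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())                      # generated once per logical payment by the client
first = service.charge(key, "acct-1", 50)    # suppose this response is lost to a timeout
retry = service.charge(key, "acct-1", 50)    # the client retries with the SAME key
```

The retry returns the stored result instead of charging again, so the account is debited 50 once even though the request arrived twice.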
Derived Data and Immutability
What is the “derived data” principle and how does it simplify fault tolerance?
?
-
Derived data: Any data that can be recomputed from an immutable source (the event log)
- Search index = derived from event log
- Cache = derived from database
- OLAP aggregates = derived from raw events
- ML features = derived from interaction events
-
Fault tolerance simplification:
- If a derived store is corrupted or lost: just re-derive from the source
- No need for complex backup/restore of derived stores — they’re computable
- New derived views can be built by replaying all historical events
-
Contrast with source-of-truth data: Raw events/transactions are not derived; they must be preserved and backed up
-
Practical implication: Run batch/stream jobs to rebuild derived stores as needed; event log is what must be durably stored
How do you implement GDPR “right to erasure” when using an immutable event log?
?
-
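Challenge: Append-only event logs (Kafka, event sourcing) are designed so that past records are never modified; removing one user's records from the middle of the log means rewriting history that downstream consumers rely on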
-
Solution 1: Cryptographic erasure:
- Generate a unique encryption key K_user per user at account creation
- All events containing personal data encrypted with K_user before logging
- “Right to erasure”: delete K_user from key store
- Events still in log but now indecipherable → functionally erased
- Cheap: only need to delete a small key, not scan/modify the log
-
Solution 2: Pseudonymization:
- Store user_id → random_token mapping separately
- Events use random_token (not user_id) as identifier
- “Erase”: delete the mapping → events can no longer be linked to the user
-
Remaining challenge: Derived views (search, cache, ML models) must also be updated/rebuilt to remove erased user’s data
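Cryptographic erasure (Solution 1) can be sketched with the standard library alone. The cipher below is a toy keystream built from SHA-256, used only so the example is self-contained; it is an assumption of this sketch, not a recommendation, and a real system would use a vetted authenticated cipher from a cryptography library. `KeyStore`, `append_event`, and `read_event` are hypothetical names.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || nonce || counter) blocks.
    Illustration only (no authentication); applying it twice decrypts."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


class KeyStore:
    """Holds one key per user; this is the only place erasure has to touch."""
    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Creates K_user on first use (stands in for "at account creation").
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)

    def erase(self, user_id):
        self._keys.pop(user_id, None)


log = []   # append-only event log; entries are never modified or removed


def append_event(keys, user_id, payload: str):
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(keys.key_for(user_id), nonce, payload.encode())
    log.append({"user_id": user_id, "nonce": nonce, "data": ciphertext})


def read_event(keys, event):
    key = keys.get(event["user_id"])
    if key is None:
        return None            # key deleted: the ciphertext can no longer be read
    return _keystream_xor(key, event["nonce"], event["data"]).decode()


keys = KeyStore()
append_event(keys, "u1", "email=ann@example.com")
append_event(keys, "u2", "email=bob@example.com")
keys.erase("u1")               # "right to erasure" for user u1
readable = [read_event(keys, e) for e in log]
```

The log still holds two entries after erasure, but only the second can be decrypted. Note that this sketch keeps `user_id` in the clear, so in practice it would be combined with the token mapping from Solution 2.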
Ethics and Responsibility
What are the key ethical responsibilities of data engineers and architects?
?
-
Data minimization: Only collect data you have a clear purpose for
- “Just in case” data collection creates liability without value
- Less data = smaller breach impact
-
Purpose limitation: Data used only for the stated purpose
- Data collected for fraud detection should not be used for advertising
- Technical enforcement via access controls, not just policy
-
Bias and fairness: ML models trained on historical data can perpetuate historical biases
- Example: Hiring algorithm trained on past (mostly male) hires → discriminates against women
- Responsibility: audit models for disparate impact; document training data
-
Transparency: Be clear about automated decision-making; provide explanations
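- GDPR Article 22: restricts decisions based solely on automated processing; together with Recital 71 and Articles 13-15 it is commonly read as implying a “right to explanation”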
-
“Just following specifications” is not an excuse: Engineers have professional responsibility for what they build
-
Data sovereignty: Users should have meaningful control over their data (access, portability, deletion)
Modern Context (2026)
What is the Data Mesh organizational pattern and how does it relate to DDIA’s unbundling?
?
-
Data Mesh (Zhamak Dehghani, 2019-2020): Organizational approach to data that decentralizes ownership
-
Core principles:
- Domain ownership: Each business domain (marketing, payments, orders) owns its data
- Data as a product: Domain teams provide data with SLAs, documentation, quality guarantees
- Self-serve data platform: Central platform team provides infrastructure (Kafka, data catalog)
- Federated governance: Common standards (schemas, lineage, quality) but local implementation
-
Relationship to DDIA’s unbundling:
- DDIA: technical unbundling (right tool for right workload)
- Data Mesh: organizational unbundling (right team owns right data)
- Complementary: Data Mesh uses unbundled technical stack; provides organizational ownership model
-
Why needed: Centralized data teams can’t scale to support all domain data needs; domain teams know their data best
How are LLMs and AI changing data system design patterns in 2026?
?
-
RAG (Retrieval-Augmented Generation):
- LLMs augmented with real-time data retrieval at inference time
- Pattern: user query → retrieve relevant documents from vector DB/search → include in LLM context
- New data access pattern: similarity search over embeddings (vector DBs)
- Data freshness critical: an LLM's training data may be months old; RAG supplies current data at query time
-
Text2SQL:
- Natural language → SQL query → execute → return results
- Democratizes data access: analysts without SQL skills can query data
- Challenges: schema understanding, handling ambiguity, safety (generated queries carry SQL-injection-like risks)
-
AI-generated data pipelines:
- LLMs generating dbt models, Spark jobs, Flink pipelines from specifications
- Early adoption 2024-2026; reduces time to build but requires human review
-
AI governance:
- Training data provenance: what data was used to train models?
- EU AI Act (2024): high-risk AI must document training data and testing for bias
- Model cards, data sheets: documentation standards for AI systems
-
New architectural patterns: LLM as transformation operator in data pipelines (document parsing, classification, enrichment)
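The RAG retrieval step (query → similarity search → context for the LLM) can be sketched as follows. This is a toy under stated assumptions: `embed` uses word counts as a stand-in for a learned embedding model, a Python list stands in for the vector DB, and no LLM is called; the sketch stops at building the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


documents = [
    "Refunds are processed within 5 business days.",
    "Standard shipping takes 3 to 7 days.",
    "Gift cards never expire.",
]
index = [(doc, embed(doc)) for doc in documents]   # stands in for a vector DB


def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, k=1):
    """Place the retrieved documents into the context passed to the LLM."""
    context = "\n".join(retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


top = retrieve("How long does shipping take?")
prompt = build_prompt("How long does shipping take?")
```

Keeping `index` fresh is an ordinary derived-data problem: it can be maintained by the same change stream that feeds a search index.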
Interview Scenarios
Design a data architecture for a large e-commerce platform with many downstream use cases.
?
Requirements: OLTP orders + search + recommendations + analytics + real-time notifications + fraud detection
Architecture (unbundled + Kafka backbone):
Write side (source of truth):
- PostgreSQL: orders, products, users (OLTP, ACID)
- CDC via Debezium → Kafka topics (orders, products, users)
Derived systems (via Kafka consumers):
- Elasticsearch: Full-text search (products) — consumes products topic
- Redis: Session cache, user preferences — consumes users topic
- Snowflake/BigQuery: Analytics, reporting — consumes all topics via Kafka Connect
- Flink (stream processing): Real-time fraud scoring — consumes orders topic
- ML platform: Recommendations — batch (Spark) + real-time feature store
Stream processing (Flink):
- Fraud detection: window aggregates, stream-table join with user risk profiles
- Notification triggers: “your order shipped” → push notification
Serving pattern:
- Read orders: PostgreSQL (consistent, authoritative)
- Search products: Elasticsearch (specialized, fast)
- Analytics: Snowflake (OLAP, complex queries)
- Recommendations: Redis (pre-computed by ML model)
Key principle: Each downstream system can be rebuilt by replaying Kafka if corrupted, provided the topics are retained long enough (long retention or log compaction); Kafka is the backbone
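The "window aggregates" used for fraud scoring can be illustrated with a per-user sliding window. This is a plain-Python sketch, not Flink code: `VelocityRule` and its thresholds are hypothetical, and it assumes events arrive in timestamp order (a real stream processor also has to handle out-of-order events).

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flags a user who places more than max_orders orders within window_s seconds."""
    def __init__(self, window_s=60, max_orders=3):
        self.window_s, self.max_orders = window_s, max_orders
        self._recent = defaultdict(deque)   # user -> timestamps still inside the window

    def on_order(self, user, ts):
        q = self._recent[user]
        q.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q) > self.max_orders


# (user, timestamp in seconds) pairs stand in for the orders topic.
orders = [("ann", 0), ("ann", 10), ("ann", 20), ("bob", 25), ("ann", 30), ("ann", 200)]

rule = VelocityRule(window_s=60, max_orders=3)
flags = [(user, ts) for user, ts in orders if rule.on_order(user, ts)]
```

Here the fourth order from the same user inside 60 seconds is flagged, while the order at t=200 is not because the earlier ones have left the window.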
How would you explain the trade-off between strong consistency and system availability to a product manager?
?
Simple explanation:
“Think of it like a bank with multiple tellers. Strong consistency means all tellers must check with each other before every transaction — accurate, but slow. Eventual consistency means each teller works from their own records and syncs up later — fast, but temporarily inconsistent.”
Practical trade-off in our system:
-
Strong consistency required for:
- User sees their own just-posted item (read-after-write)
- Payment deducted exactly once (no double-charge)
- Username is truly unique
-
Eventual consistency acceptable for:
- Search index takes 2 seconds to show a new product (users won’t notice)
- Analytics dashboard shows yesterday’s data (expected)
- Recommendation model updates hourly (freshness isn’t critical)
Business impact framing:
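- Strong consistency typically costs more in infrastructure and latency because of coordination overhead (roughly 2-5x as a rule of thumb; the actual factor depends on workload and topology)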
- Eventual consistency means some users see stale data briefly (usually imperceptible)
- Wrong choice: paying for strong consistency everywhere, or using eventual where users notice
Recommendation: Default to eventual consistency; add strong consistency only where business rules or user experience require it
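The difference between the two read paths can be shown with a leader and one lagging follower. This is a minimal simulation, assuming a hypothetical `ReplicatedStore` in which replication happens only when `replicate()` is called.

```python
class ReplicatedStore:
    """One leader plus one asynchronously updated follower."""
    def __init__(self):
        self.leader, self.follower = {}, {}
        self._pending = []               # writes not yet applied to the follower

    def write(self, key, value):
        self.leader[key] = value
        self._pending.append((key, value))

    def replicate(self):
        """Apply pending writes to the follower (the 'sync up later' step)."""
        for key, value in self._pending:
            self.follower[key] = value
        self._pending.clear()

    def read(self, key, consistent=False):
        # Consistent reads go to the leader; cheap reads go to the follower.
        source = self.leader if consistent else self.follower
        return source.get(key)


store = ReplicatedStore()
store.write("listing:42", "Red Teapot")

stale = store.read("listing:42")                    # follower has not caught up yet
fresh = store.read("listing:42", consistent=True)   # the author's own read goes to the leader
store.replicate()
later = store.read("listing:42")                    # follower has now converged
```

Routing only the reads that need it (a user viewing their own new listing) to the leader is one way to pay for consistency selectively.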
Quick Facts
What four GDPR principles and rights are most relevant to data system design?
?
-
Data minimization: Only collect data “adequate, relevant, and limited to what is necessary”
- Technical: schema design, collection APIs should default to minimal collection
-
Purpose limitation: Data collected for one purpose cannot be reused for another
- Technical: access control, data catalog with purpose documentation
-
Storage limitation: Personal data not kept longer than necessary
- Technical: automated TTL/expiry, deletion pipelines, log retention policies
-
Right to erasure (Art. 17): Users can request deletion of their personal data
- Technical: cryptographic erasure for event logs, deletion propagation to derived stores
Additional rights:
- Right to access: Users can see what data you hold about them
- Right to portability: Users can receive their data in machine-readable format
- Automated decisions (Art. 22): Users can contest decisions made solely by automated processing and obtain meaningful information about the logic involved
Engineer’s responsibility: These are legal requirements (fines up to €20 million or 4% of global annual turnover, whichever is higher); design systems to support them from the start
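Two of the technical measures above, automated expiry and purpose-based access control, can be sketched together. The retention periods, the allowed purposes, and the function names are all hypothetical; real values come from legal and business requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy tables, for illustration only.
RETENTION = {
    "session_log": timedelta(days=30),
    "invoice": timedelta(days=3650),
}
ALLOWED_PURPOSES = {
    "email": {"billing", "fraud_detection"},
}


def purge_expired(records, now):
    """Storage limitation: keep only records still inside their retention period."""
    return [r for r in records if now - r["created_at"] <= RETENTION[r["kind"]]]


def read_field(record, field, purpose):
    """Purpose limitation: refuse access unless the declared purpose is allowed."""
    if purpose not in ALLOWED_PURPOSES.get(field, set()):
        raise PermissionError(f"{field!r} may not be used for {purpose!r}")
    return record[field]


now = datetime(2026, 4, 13, tzinfo=timezone.utc)
records = [
    {"kind": "session_log", "created_at": now - timedelta(days=45), "email": "ann@example.com"},
    {"kind": "invoice", "created_at": now - timedelta(days=45), "email": "ann@example.com"},
]
kept = purge_expired(records, now)
```

Calling `read_field(kept[0], "email", "advertising")` raises `PermissionError`, which turns the purpose policy into something the code enforces.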
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13