Chapter 12 Flashcards - The Future of Data Systems

flashcards chapter-12 ddia


Basic Concepts

What does “unbundling” databases mean and why does Kleppmann advocate for it?
?

  • Unbundling: Use specialized, purpose-built tools for each data workload instead of one monolithic database

    • PostgreSQL for OLTP; Elasticsearch for search; Redis for caching; Snowflake for analytics; Kafka for streaming
    • Each tool optimized for its specific workload; compose them via an event log
  • Why unbundle:

    • No single database is best for all workloads (OLTP + full-text search + analytics = incompatible optimizations)
    • Specialized tools can be orders of magnitude faster for their specific workload
    • Independent scaling and upgrades
  • How to keep them in sync: CDC → Kafka → each system subscribes and updates itself

  • Trade-offs:

    • ✅ Performance, flexibility, best-of-breed for each workload
    • ❌ Operational complexity (many systems), eventual consistency between systems

What is the role of a durable ordered event log (Kafka) as the “integration backbone”?
?

  • Event log as backbone: Kafka serves as the central, immutable, ordered record of all changes

  • Producer: Any system that generates events writes to Kafka (databases via CDC, applications directly)

  • Consumers: Any derived system subscribes to relevant topics and maintains its own optimized view

  • Why it works:

    1. Single source of truth: All changes captured once (no N² point-to-point connections)
    2. Replay: A consumer can re-read the log from the earliest retained offset to rebuild its derived view from scratch
    3. Fan-out: Multiple independent consumers from same events
    4. Decoupling: Producers don’t know about consumers; consumers can be added later
  • Derived data principle: Search index, cache, OLAP store are all derived — can be regenerated from the log if corrupted
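
The fan-out, decoupling, and replay properties above can be illustrated with a minimal in-memory sketch. `EventLog` and `Consumer` are hypothetical names standing in for a Kafka topic partition and a consumer group; this is not the Kafka client API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class EventLog:
    """Append-only, ordered log (in-memory stand-in for one topic partition)."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the newly appended event

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


@dataclass
class Consumer:
    """Tracks its own offset, so each consumer progresses independently."""
    apply: Callable[[dict], Any]
    offset: int = 0

    def poll(self, log: EventLog) -> None:
        for event in log.read_from(self.offset):
            self.apply(event)
            self.offset += 1


log = EventLog()
log.append({"type": "product_created", "id": 1, "name": "red shoes"})
log.append({"type": "product_created", "id": 2, "name": "blue hat"})

# A derived view (a cache) fed by the events.
cache: dict = {}


def update_cache(event: dict) -> None:
    cache[event["id"]] = event["name"]


cache_consumer = Consumer(update_cache)
cache_consumer.poll(log)

# A consumer added later replays from offset 0; the producer is unchanged.
search_terms: set = set()


def update_terms(event: dict) -> None:
    search_terms.update(event["name"].split())


late_consumer = Consumer(update_terms)
late_consumer.poll(log)
```

Because each consumer owns its offset, adding `late_consumer` required no change to the code that appends events, which is the decoupling the card describes.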

Architectures

What is the Lambda architecture and why is it considered an antipattern?
?

  • Lambda architecture (Nathan Marz):

    • Batch layer: Recomputes everything periodically (Spark, full historical data)
    • Speed layer: Processes only recent data in real-time (Flink/Storm)
    • Serving layer: Merges batch results + speed layer results to answer queries
  • Why it’s problematic (antipattern):

    1. Two code paths: Same computation implemented twice (batch + streaming versions)
    2. Hard to keep identical: Bugs in one version don’t appear in the other → inconsistent results
    3. Complex merge: Combining batch and speed results at query time is complex
    4. Operational burden: Two systems to maintain, monitor, debug
  • Kappa architecture (Jay Kreps, LinkedIn, 2014):

    • Single stream processor for everything
    • Historical reprocessing: replay Kafka log through same stream processor
    • ✅ One code path, simpler operation, same logic for real-time and backfill
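
As a rough sketch of the Kappa idea, the same function below handles both live events and a full backfill. The event shapes and the `count_orders_per_user` name are assumptions made for illustration; a plain list stands in for a Kafka topic with long retention.

```python
from collections import Counter
from typing import Iterable


def count_orders_per_user(events: Iterable[dict], state: Counter) -> Counter:
    """The single code path: used for live traffic and for backfills alike."""
    for event in events:
        if event["type"] == "order_placed":
            state[event["user"]] += 1
    return state


# Hypothetical retained log (stand-in for a topic with long retention).
log = [
    {"type": "order_placed", "user": "alice"},
    {"type": "page_view", "user": "bob"},
    {"type": "order_placed", "user": "alice"},
]

# Live processing: events are folded into state as they arrive.
live_state = count_orders_per_user(log, Counter())

# A new event arrives and goes through the same function.
new_event = {"type": "order_placed", "user": "bob"}
log.append(new_event)
count_orders_per_user([new_event], live_state)

# Backfill: replay the whole log through the same function into fresh state.
rebuilt_state = count_orders_per_user(log, Counter())
```

Since there is only one implementation, the live and rebuilt states cannot drift apart the way separate batch and speed layers can.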

What does “end-to-end correctness” mean and why do database guarantees not automatically provide it?
?

  • End-to-end correctness: The application produces correct results for users, from input to output, even in the presence of failures and concurrency

  • Why DB guarantees aren’t enough:

    • Database ACID prevents data corruption within the DB
    • BUT: application code can still have bugs in business logic
    • Application may retry operations without idempotency → duplicate side effects
    • Cross-service operations (microservices) have no shared transaction
    • User-visible behavior requires correct application logic at every layer
  • Example: Even with a perfectly ACID database, a payment service can double-charge if:

    • Network timeout → retry → payment processed twice
    • Solution required: idempotency key at application level, not just DB level
  • Principle: Design for correctness at each layer; lower layers don’t guarantee higher layers
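
The double-charge example can be sketched with an application-level idempotency key. `PaymentService` is a hypothetical in-memory class, not a real payment API; in a real service the key lookup and the charge would need to happen atomically (for example, under a unique constraint in one transaction).

```python
import uuid


class PaymentService:
    """Hypothetical payment service that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.total_charged_cents = 0
        self._results: dict = {}  # idempotency key -> stored result

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key returns the stored result: no second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged_cents += amount_cents
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry

first = service.charge(key, 1999)
# The response is lost to a network timeout, so the client retries.
retry = service.charge(key, 1999)
```

The client, not the database, decides that the two requests are the same operation, which is why this guarantee has to be designed end to end.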

Derived Data and Immutability

What is the “derived data” principle and how does it simplify fault tolerance?
?

  • Derived data: Any data that can be recomputed from a source of truth (the immutable event log or a system of record)

    • Search index = derived from event log
    • Cache = derived from database
    • OLAP aggregates = derived from raw events
    • ML features = derived from interaction events
  • Fault tolerance simplification:

    • If a derived store is corrupted or lost: just re-derive from the source
    • No need for complex backup/restore of derived stores — they’re computable
    • New derived views can be built by replaying all historical events
  • Contrast with source-of-truth data: Raw events/transactions are not derived; they must be preserved and backed up

  • Practical implication: Run batch/stream jobs to rebuild derived stores as needed; event log is what must be durably stored
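
A small sketch of "just re-derive from the source": the inverted index below is a pure function of the events, so losing it is recoverable by running the same function again. The event type `product_upserted` and the function name are illustrative assumptions.

```python
from collections import defaultdict


def build_search_index(events: list) -> dict:
    """Derive an inverted index (word -> product ids) purely from the events."""
    index: dict = defaultdict(set)
    for event in events:
        if event["type"] == "product_upserted":
            # Remove stale entries for this product before re-indexing it.
            for ids in index.values():
                ids.discard(event["id"])
            for word in event["name"].lower().split():
                index[word].add(event["id"])
    # Drop words that no longer point at any product.
    return {word: ids for word, ids in index.items() if ids}


events = [
    {"type": "product_upserted", "id": 1, "name": "Red Shoes"},
    {"type": "product_upserted", "id": 2, "name": "Red Hat"},
    {"type": "product_upserted", "id": 1, "name": "Crimson Shoes"},
]

index = build_search_index(events)
index.clear()                         # simulate losing the derived store
rebuilt = build_search_index(events)  # recover by re-deriving from the source
```

Only `events` needs durable storage and backups here; `index` can always be recomputed.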

How do you implement GDPR “right to erasure” when using an immutable event log?
?

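  • Challenge: In an immutable event log (Kafka, event sourcing), individual records can't be edited or deleted in place without breaking the log's integrity; Kafka itself only removes data through retention expiry or log compaction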

  • Solution 1: Cryptographic erasure:

    1. Generate a unique encryption key K_user per user at account creation
    2. All events containing personal data encrypted with K_user before logging
    3. “Right to erasure”: delete K_user from key store
    4. Events still in log but now indecipherable → functionally erased
    5. Cheap: only need to delete a small key, not scan/modify the log
  • Solution 2: Pseudonymization:

    • Store user_id → random_token mapping separately
    • Events use random_token (not user_id) as identifier
    • “Erase”: delete the mapping → events can no longer be linked to the user (provided the event payloads contain no other identifying fields)
  • Remaining challenge: Derived views (search, cache, ML models) must also be updated/rebuilt to remove erased user’s data
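
Solution 1 can be sketched as follows. To stay runnable with only the standard library, the cipher is a toy SHA-256 keystream XOR, which is an assumption of this sketch and not secure; a real system would use an authenticated cipher such as AES-GCM from a vetted library. `KeyStore`, `append_event`, and `read_event` are hypothetical names.

```python
import hashlib
import secrets
from typing import Optional


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). NOT for production use."""
    out = bytearray()
    for start in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + start.to_bytes(8, "big")).digest()
        chunk = data[start:start + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class KeyStore:
    def __init__(self) -> None:
        self._keys: dict = {}

    def key_for(self, user_id: str) -> bytes:
        # One key per user, created on first use.
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id: str) -> Optional[bytes]:
        return self._keys.get(user_id)

    def erase(self, user_id: str) -> None:
        self._keys.pop(user_id, None)  # the whole "right to erasure" operation


keys = KeyStore()
log: list = []  # append-only event log; entries are never modified


def append_event(user_id: str, personal_data: str) -> None:
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(keys.key_for(user_id), nonce, personal_data.encode())
    log.append({"user_id": user_id, "nonce": nonce, "ciphertext": ciphertext})


def read_event(event: dict) -> Optional[str]:
    key = keys.get(event["user_id"])
    if key is None:
        return None  # key deleted: the payload can no longer be decrypted
    return _keystream_xor(key, event["nonce"], event["ciphertext"]).decode()


append_event("u1", "alice@example.com")
append_event("u2", "bob@example.com")
before = [read_event(e) for e in log]
keys.erase("u1")
after = [read_event(e) for e in log]
```

The log still holds two entries after erasure, but the first is unreadable. Keeping `user_id` in clear text is a simplification; combining this with the token mapping from Solution 2 would remove that identifier as well.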

Ethics and Responsibility

What are the key ethical responsibilities of data engineers and architects?
?

  • Data minimization: Only collect data you have a clear purpose for

    • “Just in case” data collection creates liability without value
    • Less data = smaller breach impact
  • Purpose limitation: Data used only for the stated purpose

    • Data collected for fraud detection should not be used for advertising
    • Technical enforcement via access controls, not just policy
  • Bias and fairness: ML models trained on historical data perpetuate historical biases

    • Example: Hiring algorithm trained on past (mostly male) hires → discriminates against women
    • Responsibility: audit models for disparate impact; document training data
  • Transparency: Be clear about automated decision-making; provide explanations

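    • GDPR Article 22: restricts decisions based solely on automated processing that have legal or similarly significant effects; the “right to explanation” is commonly derived from it together with Recital 71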
  • “Just following specifications” is not an excuse: Engineers have professional responsibility for what they build

  • Data sovereignty: Users should have meaningful control over their data (access, portability, deletion)

Modern Context (2026)

What is the Data Mesh organizational pattern and how does it relate to DDIA’s unbundling?
?

  • Data Mesh (Zhamak Dehghani, 2019-2020): Organizational approach to data that decentralizes ownership

  • Core principles:

    1. Domain ownership: Each business domain (marketing, payments, orders) owns its data
    2. Data as a product: Domain teams provide data with SLAs, documentation, quality guarantees
    3. Self-serve data platform: Central platform team provides infrastructure (Kafka, data catalog)
    4. Federated governance: Common standards (schemas, lineage, quality) but local implementation
  • Relationship to DDIA’s unbundling:

    • DDIA: technical unbundling (right tool for right workload)
    • Data Mesh: organizational unbundling (right team owns right data)
    • Complementary: Data Mesh uses unbundled technical stack; provides organizational ownership model
  • Why needed: Centralized data teams can’t scale to support all domain data needs; domain teams know their data best

How are LLMs and AI changing data system design patterns in 2026?
?

  • RAG (Retrieval-Augmented Generation):

    • LLMs augmented with real-time data retrieval at inference time
    • Pattern: user query → retrieve relevant documents from vector DB/search → include in LLM context
    • New data access pattern: similarity search over embeddings (vector DBs)
    • Data freshness critical: an LLM's training data may be months old; RAG supplies current data at query time
  • Text2SQL:

    • Natural language → SQL query → execute → return results
    • Democratizes data access: analysts without SQL skills can query data
    • Challenges: schema understanding, handling ambiguity, safety (generated SQL carries risks analogous to SQL injection, so it should run with read-only, least-privilege access)
  • AI-generated data pipelines:

    • LLMs generating dbt models, Spark jobs, Flink pipelines from specifications
    • Early adoption 2024-2026; reduces time to build but requires human review
  • AI governance:

    • Training data provenance: what data was used to train models?
    • EU AI Act (2024): high-risk AI must document training data and testing for bias
    • Model cards, data sheets: documentation standards for AI systems
  • New architectural patterns: LLM as transformation operator in data pipelines (document parsing, classification, enrichment)
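
The RAG pattern (query → retrieve → include in context) can be sketched with the standard library alone. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the list `index` stands in for a vector DB, and no LLM is actually called; all names here are illustrative assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "refund policy: refunds are issued within 14 days",
    "shipping times: orders ship within 2 business days",
    "warranty: electronics carry a 1 year warranty",
]
# Index step: embed every document once (the "vector DB").
index = [(doc, embed(doc)) for doc in documents]


def retrieve(query: str, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    # The prompt, not the model weights, carries the up-to-date facts.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt("how long do refunds take")
```

Keeping `index` fresh is an ordinary derived-data problem: it can be maintained by a consumer of the same event log that feeds the search index.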

Interview Scenarios

Design a data architecture for a large e-commerce platform with many downstream use cases.
?
Requirements: OLTP orders + search + recommendations + analytics + real-time notifications + fraud detection

Architecture (unbundled + Kafka backbone):

Write side (source of truth):

  • PostgreSQL: orders, products, users (OLTP, ACID)
  • CDC via Debezium → Kafka topics (orders, products, users)

Derived systems (via Kafka consumers):

  • Elasticsearch: Full-text search (products) — consumes products topic
  • Redis: Session cache, user preferences — consumes users topic
  • Snowflake/BigQuery: Analytics, reporting — consumes all topics via Kafka Connect
  • Flink (stream processing): Real-time fraud scoring — consumes orders topic
  • ML platform: Recommendations — batch (Spark) + real-time feature store

Stream processing (Flink):

  • Fraud detection: window aggregates, stream-table join with user risk profiles
  • Notification triggers: “your order shipped” → push notification

Serving pattern:

  • Read orders: PostgreSQL (consistent, authoritative)
  • Search products: Elasticsearch (specialized, fast)
  • Analytics: Snowflake (OLAP, complex queries)
  • Recommendations: Redis (pre-computed by ML model)

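Key principle: PostgreSQL is the source of truth and Kafka is the backbone that carries its changes; each downstream system can rebuild itself by replaying from Kafka if corrupted (this requires long enough retention or log-compacted topics)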
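
The "window aggregates" used for fraud scoring can be sketched as a per-user sliding-window counter. This is plain Python, not the Flink API; `OrderVelocityCheck` and its thresholds are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class OrderVelocityCheck:
    """Flags a user who places more than `max_orders` within `window_s` seconds."""

    def __init__(self, window_s: int, max_orders: int) -> None:
        self.window_s = window_s
        self.max_orders = max_orders
        self._recent: dict = defaultdict(deque)  # user -> recent timestamps

    def on_order(self, user: str, ts: int) -> bool:
        recent = self._recent[user]
        recent.append(ts)
        # Evict timestamps that have fallen out of the window.
        while recent and recent[0] <= ts - self.window_s:
            recent.popleft()
        return len(recent) > self.max_orders


check = OrderVelocityCheck(window_s=60, max_orders=3)
orders = [
    ("alice", 0), ("alice", 10), ("bob", 15),
    ("alice", 20), ("alice", 30), ("alice", 200),
]
flags = [check.on_order(user, ts) for user, ts in orders]
```

Only Alice's fourth order inside one minute is flagged; her order at `ts=200` is not, because the earlier timestamps have left the window. A production stream processor would also have to handle out-of-order events and checkpoint this state.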

How would you explain the trade-off between strong consistency and system availability to a product manager?
?
Simple explanation:
“Think of it like a bank with multiple tellers. Strong consistency means all tellers must check with each other before every transaction: accurate, but slower, and if the tellers can't reach each other they have to stop serving customers. Eventual consistency means each teller works from their own records and syncs up later: fast and always open, but temporarily inconsistent.”

Practical trade-off in our system:

  • Strong consistency required for:

    • User sees their own just-posted item (read-after-write)
    • Payment deducted exactly once (no double-charge)
    • Username is truly unique
  • Eventual consistency acceptable for:

    • Search index takes 2 seconds to show a new product (users won’t notice)
    • Analytics dashboard shows yesterday’s data (expected)
    • Recommendation model updates hourly (freshness isn’t critical)

Business impact framing:

  • Strong consistency costs more in infrastructure and latency because of coordination overhead (~2-5x is a rough rule of thumb; the real figure depends on the workload)
  • Eventual consistency means some users see stale data briefly (usually imperceptible)
  • Wrong choice: paying for strong consistency everywhere, or using eventual where users notice

Recommendation: Default to eventual consistency; add strong consistency only where business rules or user experience require it
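
One way to apply "strong consistency only where required" to the read-after-write case is to route a user's reads to the leader only while that user has unreplicated writes. `ReplicatedStore` is a toy model invented for this sketch, with replication triggered manually by `sync()`.

```python
from typing import Optional


class ReplicatedStore:
    """Toy leader/replica pair where replication only happens on `sync()`."""

    def __init__(self) -> None:
        self.leader: dict = {}
        self.replica: dict = {}
        self._dirty_users: set = set()  # users with unreplicated writes

    def write(self, user: str, key: str, value: str) -> None:
        self.leader[key] = value
        self._dirty_users.add(user)

    def read(self, user: str, key: str) -> Optional[str]:
        # Pay for the consistent (leader) read only when this user could
        # otherwise miss their own write; everyone else uses the replica.
        source = self.leader if user in self._dirty_users else self.replica
        return source.get(key)

    def sync(self) -> None:
        """Simulate replication catching up."""
        self.replica = dict(self.leader)
        self._dirty_users.clear()


store = ReplicatedStore()
store.write("alice", "listing:1", "bike for sale")
alice_sees = store.read("alice", "listing:1")    # leader read: sees own write
bob_sees = store.read("bob", "listing:1")        # replica read: briefly stale
store.sync()
bob_sees_later = store.read("bob", "listing:1")  # replica has caught up
```

Alice always sees her own listing, while Bob sees it after a short delay, which matches the split between the "required" and "acceptable" lists above.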

Quick Facts

What are the four GDPR principles and rights most relevant to data system design?
?

  1. Data minimization: Only collect data “adequate, relevant, and limited to what is necessary”

    • Technical: schema design, collection APIs should default to minimal collection
  2. Purpose limitation: Data collected for one purpose cannot be reused for another

    • Technical: access control, data catalog with purpose documentation
  3. Storage limitation: Personal data not kept longer than necessary

    • Technical: automated TTL/expiry, deletion pipelines, log retention policies
  4. Right to erasure (Art. 17): Users can request deletion of their personal data

    • Technical: cryptographic erasure for event logs, deletion propagation to derived stores
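
The technical measures above (purpose tagging, automated expiry, deletion) can be sketched for a mutable store as follows. The `Record` type, the purposes, and the retention periods in `RETENTION` are hypothetical values chosen for illustration, not legal guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Record:
    user_id: str
    value: str
    purpose: str           # purpose declared at collection time
    collected_at: datetime


# Hypothetical retention policy per purpose.
RETENTION = {
    "fraud_detection": timedelta(days=90),
    "order_fulfilment": timedelta(days=365),
}


def read_for(records: list, purpose: str) -> list:
    """Purpose limitation: callers only get records collected for their purpose."""
    return [r for r in records if r.purpose == purpose]


def expire(records: list, now: datetime) -> list:
    """Storage limitation: drop records older than their purpose's retention."""
    return [r for r in records if now - r.collected_at <= RETENTION[r.purpose]]


def erase_user(records: list, user_id: str) -> list:
    """Right to erasure in a mutable store: remove all of a user's records."""
    return [r for r in records if r.user_id != user_id]


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
records = [
    Record("u1", "ip=203.0.113.7", "fraud_detection", now - timedelta(days=120)),
    Record("u1", "address=1 Example Street", "order_fulfilment", now - timedelta(days=120)),
    Record("u2", "ip=198.51.100.4", "fraud_detection", now - timedelta(days=5)),
]
kept = expire(records, now)                   # the 120-day-old fraud record expires
fraud_view = read_for(kept, "fraud_detection")
after_erasure = erase_user(kept, "u2")
```

Tagging each record with its purpose at collection time is what makes both the access check and the per-purpose expiry enforceable in code.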

Additional rights:

  • Right to access: Users can see what data you hold about them
  • Right to portability: Users can receive their data in machine-readable format
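  • Right to explanation: Individuals subject to solely automated decisions with significant effects can ask for meaningful information about the logic involved (Art. 22 with Recital 71), so such decisions must be explainable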

Engineer’s responsibility: These are legal requirements (fines up to €20 million or 4% of global annual turnover, whichever is higher); design systems to support them from the start

Total Cards: 12
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13