Chapter 12 Cheat Sheet - The Future of Data Systems

One-Line Summaries

Concept                 One-Liner
──────────────────────  ─────────────────────────────────────────────────────────────────────
Unbundling databases    Use specialized tools for each concern; compose via event log
Event log as backbone   Kafka as durable, replayable, ordered source of truth
Derived data            Any state computable from the event log; re-derivable if lost
Lambda architecture     Batch + speed + serving layers; antipattern (dual code paths)
Kappa architecture      Single stream processor; replay log for historical reprocessing
End-to-end correctness  Low-level DB guarantees don’t automatically mean correct app behavior
Data minimalism         Don’t collect data you don’t need; less data = less risk
Data mesh               Domain teams own their data as products; federated governance
Idempotent writes       Operations safe to retry; same result regardless of repetition

The Unbundled Data Stack

Traditional monolith:            Unbundled (modern):
─────────────────────────        ───────────────────────────────────────
One database does:               Specialized tools connected by event log:
├─ OLTP                          │
├─ Full-text search              ├─ PostgreSQL (OLTP, ACID)
├─ Analytics                     ├─ Elasticsearch (full-text search)
├─ Caching                       ├─ Redis (low-latency cache)
├─ Graph queries                 ├─ Snowflake (analytics/OLAP)
└─ Message queuing               ├─ Flink (stream processing)
                                 ├─ Neo4j (graph)
                                 └─ Kafka (event log / integration backbone)
                                 
How they stay in sync:
  DB → CDC → Kafka → each system subscribes and updates itself
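
The sync flow above can be sketched in a few lines. This is a minimal in-memory illustration, not a Kafka client: `EventLog`, `ChangeEvent`, `SearchIndex`, and `Cache` are hypothetical names, and the list-backed log only stands in for a durable, ordered topic. It shows each derived system applying the same change events, and a late subscriber rebuilding itself by replaying from offset 0.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class ChangeEvent:
    offset: int
    key: str
    value: Optional[str]  # None represents a deletion


class EventLog:
    """Append-only, ordered log (in-memory stand-in for a Kafka topic)."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []
        self._subscribers: List[Callable[[ChangeEvent], None]] = []

    def append(self, key: str, value: Optional[str]) -> ChangeEvent:
        event = ChangeEvent(len(self._events), key, value)
        self._events.append(event)
        for handler in self._subscribers:  # fan out to every derived system
            handler(event)
        return event

    def subscribe(self, handler: Callable[[ChangeEvent], None], from_offset: int = 0) -> None:
        for event in self._events[from_offset:]:  # replay history first
            handler(event)
        self._subscribers.append(handler)


class SearchIndex:
    """Derived view: word -> keys of documents containing that word."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}
        self._postings: Dict[str, Set[str]] = {}

    def apply(self, event: ChangeEvent) -> None:
        old = self._docs.pop(event.key, None)
        if old is not None:  # drop postings of the previous version
            for word in old.lower().split():
                self._postings[word].discard(event.key)
        if event.value is not None:
            self._docs[event.key] = event.value
            for word in event.value.lower().split():
                self._postings.setdefault(word, set()).add(event.key)

    def search(self, word: str) -> Set[str]:
        return set(self._postings.get(word.lower(), set()))


class Cache:
    """Derived view: latest value per key."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, event: ChangeEvent) -> None:
        if event.value is None:
            self.data.pop(event.key, None)
        else:
            self.data[event.key] = event.value


log = EventLog()
index, cache = SearchIndex(), Cache()
log.subscribe(index.apply)
log.subscribe(cache.apply)

log.append("doc1", "stream processing basics")
log.append("doc2", "batch processing")
log.append("doc1", "stream joins")  # an update to doc1

rebuilt = Cache()             # a view added (or lost and recreated) later
log.subscribe(rebuilt.apply)  # catches up by replaying from offset 0
```

Because every view is a function of the log, adding a new derived system needs no changes to the existing ones.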

Data Flow Architecture

SOURCES                    BACKBONE              DERIVED VIEWS
───────────────────────    ─────────────────     ──────────────────────
User actions               │              │      Search index (Elasticsearch)
Application events         │    Kafka     │  →   Cache (Redis)
Database changes (CDC)  →  │   (durable   │      OLAP (Snowflake)
IoT sensors                │    ordered   │      ML features (Feature Store)
External APIs              │     log)     │      Data warehouse (dbt models)
                           │              │      Materialized views
                                          
Serving layer: Application reads from derived views (fast, specialized)
Write path: All writes captured in Kafka; derived views eventually consistent

Lambda vs Kappa Architecture

LAMBDA ARCHITECTURE (antipattern):
  Input →─┬─→ Batch Layer (Spark, full recompute)   ──┐
           │                                           ├─→ Serving Layer
           └─→ Speed Layer (Flink, real-time only)  ──┘
  
  Problems:
  ❌ Two code paths for same logic (batch + streaming)
  ❌ Hard to keep both identical
  ❌ Complex operational overhead
  ❌ Merging results at serving layer is complex

KAPPA ARCHITECTURE (better):
  Input → Kafka log ──→ Stream Processor ──→ Serving Layer
                          (Flink/Spark)
  Historical reprocessing: replay Kafka log from t=0

  Advantages:
  ✅ Single code path (streaming handles both)
  ✅ Replay for reprocessing (Kafka retention)
  ✅ Simpler operation
  ✅ Same logic for real-time and backfill
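
A small sketch of the "same logic for real-time and backfill" point, under simplifying assumptions: the retained log is a plain Python list, and `PageViewCounter` with its `skip_bots` flag is a hypothetical processor invented for illustration. The live job and the reprocessing job run the same `process` method; reprocessing just starts a new instance at offset 0.

```python
from typing import Dict, Iterable


class PageViewCounter:
    """One code path, used both for live processing and for replay."""

    def __init__(self, skip_bots: bool = False) -> None:
        self.skip_bots = skip_bots
        self.view: Dict[str, int] = {}  # the derived serving view

    def process(self, event: dict) -> None:
        if event["type"] != "page_view":
            return
        if self.skip_bots and event["bot"]:
            return
        self.view[event["page"]] = self.view.get(event["page"], 0) + 1


def run(processor: PageViewCounter, events: Iterable[dict]) -> PageViewCounter:
    for event in events:
        processor.process(event)
    return processor


# Retained log (stand-in for a Kafka topic with long retention).
retained_log = [
    {"type": "page_view", "page": "/home", "bot": False},
    {"type": "page_view", "page": "/home", "bot": True},
    {"type": "click", "page": "/home", "bot": False},
    {"type": "page_view", "page": "/docs", "bot": False},
]

# Version 1 has been consuming events as they arrived.
live = run(PageViewCounter(), retained_log)

# Version 2 changes the logic; replay from offset 0 into a fresh view,
# then switch the serving layer over once it has caught up.
reprocessed = run(PageViewCounter(skip_bots=True), retained_log)
```

There is no separate batch implementation to keep in sync: the backfill is the stream job pointed at the start of the log.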

Correctness Layers

Layer 7: Application logic          ← your code
         (correct business behavior)
         
Layer 6: End-to-end idempotency     ← request IDs, deduplication
         (safe retries at app layer)
         
Layer 5: Distributed transactions   ← 2PC, Saga pattern
         (atomic commit via 2PC; compensating actions via Sagas)
         
Layer 4: Consensus / ordering       ← Raft, Kafka per-partition order
         (agreed-upon event sequence)
         
Layer 3: Replication                ← multi-leader, quorum writes
         (durability across nodes)
         
Layer 2: Storage engine             ← WAL, B-tree/LSM correctness
         (crash recovery)
         
Layer 1: Hardware / network         ← disk checksums, TCP
         (physical correctness)

Point: Lower layers don't guarantee upper layers.
Database ACID ≠ correct application behavior.
Must design for correctness at each layer.
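
To make Layer 6 concrete, here is a minimal sketch of request-ID deduplication. `LedgerService` and its in-memory dictionaries are hypothetical; a real service would store the balance change and the request ID in the same database transaction so that the two cannot diverge after a crash.

```python
import uuid
from typing import Dict


class LedgerService:
    """Applies each credit at most once, keyed by a client-supplied request ID."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self._results: Dict[str, int] = {}  # request_id -> balance returned

    def credit(self, request_id: str, account: str, amount: int) -> int:
        # A retry of an already-applied request returns the stored result
        # instead of applying the credit a second time.
        if request_id in self._results:
            return self._results[request_id]
        new_balance = self.balances.get(account, 0) + amount
        self.balances[account] = new_balance
        self._results[request_id] = new_balance
        return new_balance


ledger = LedgerService()

# The client generates the ID once and reuses it on every retry.
request_id = str(uuid.uuid4())
first = ledger.credit(request_id, "acct-1", 50)
retry = ledger.credit(request_id, "acct-1", 50)  # e.g. after a network timeout
```

The database can be fully ACID and the account would still be credited twice without the ID check, which is the gap between lower-layer guarantees and application correctness.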

Handling Immutability + GDPR

Problem: Event log is immutable → how to implement "right to erasure"?

Solution: Cryptographic erasure
  1. Generate per-user encryption key: K_user
  2. Encrypt all user's events with K_user before storing in log
  3. To "forget" user: delete K_user from key store
  4. Events still in log but now indecipherable → functionally erased
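
The four steps above can be sketched as follows. Two loud assumptions: `KeyStore`, `append_event`, and `read_event` are hypothetical names, and the SHA-256 keystream XOR is a toy cipher used only so the example runs on the standard library. A real system would use an authenticated cipher such as AES-GCM from a vetted library. The sketch also leaves `user_id` in plaintext in the log, which a real design would likely avoid.

```python
import hashlib
import secrets
from typing import Dict, List, Tuple


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # TOY keystream (hash in counter mode). Illustration only, not for production.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


class KeyStore:
    """Mutable store of per-user keys, kept outside the immutable log."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def key_for(self, user_id: str) -> bytes:
        # Step 1: generate K_user on first use.
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id: str) -> bytes:
        return self._keys[user_id]  # raises KeyError once the key is erased

    def forget(self, user_id: str) -> None:
        # Step 3: deleting the key is the erasure.
        self._keys.pop(user_id, None)


Entry = Tuple[str, bytes, bytes]  # (user_id, nonce, ciphertext)
log: List[Entry] = []             # append-only; entries are never modified


def append_event(keys: KeyStore, user_id: str, payload: str) -> None:
    # Step 2: encrypt with K_user before the event reaches the log.
    nonce = secrets.token_bytes(16)
    data = payload.encode("utf-8")
    log.append((user_id, nonce, _xor(data, _keystream(keys.key_for(user_id), nonce, len(data)))))


def read_event(keys: KeyStore, entry: Entry) -> str:
    user_id, nonce, ciphertext = entry
    key = keys.get(user_id)
    return _xor(ciphertext, _keystream(key, nonce, len(ciphertext))).decode("utf-8")


keys = KeyStore()
append_event(keys, "alice", "changed email")
append_event(keys, "bob", "placed order")

before = read_event(keys, log[0])  # readable while K_alice exists
keys.forget("alice")               # step 4: log[0] is still there but unreadable
```

The log itself is never rewritten; only the small, mutable key store changes, so the key store (and its backups) becomes the thing that must support true deletion.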

Alternative: Pseudonymization
  Replace user_id in events with a random token
  "Forget" = delete mapping from user_id to token
  Events remain but not linkable to the user
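
A minimal sketch of the token-replacement approach, assuming a hypothetical `Pseudonymizer` class that holds the only mapping from `user_id` to token. Note the limits: the events are unlinkable only if their payloads contain no other identifying fields, and while the mapping exists the data still counts as personal data under GDPR.

```python
import secrets
from typing import Dict, List


class Pseudonymizer:
    """Holds the user_id -> token mapping, separate from the event log."""

    def __init__(self) -> None:
        self._token_by_user: Dict[str, str] = {}

    def token_for(self, user_id: str) -> str:
        if user_id not in self._token_by_user:
            self._token_by_user[user_id] = secrets.token_hex(16)  # random, not derived from user_id
        return self._token_by_user[user_id]

    def forget(self, user_id: str) -> None:
        # Deleting the mapping severs the link between user and events.
        self._token_by_user.pop(user_id, None)

    def events_for(self, user_id: str, log: List[dict]) -> List[dict]:
        token = self._token_by_user.get(user_id)
        if token is None:
            return []  # no mapping, so nothing can be linked to this user
        return [event for event in log if event["subject"] == token]


pseudo = Pseudonymizer()
log: List[dict] = []
log.append({"subject": pseudo.token_for("user-42"), "action": "login"})
log.append({"subject": pseudo.token_for("user-42"), "action": "purchase"})

linked_before = pseudo.events_for("user-42", log)  # both events
pseudo.forget("user-42")
linked_after = pseudo.events_for("user-42", log)   # empty: the link is gone
```

The token is random rather than a hash of `user_id`, since a hash could be recomputed by anyone who knows the ID.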

Risk: Derived stores (search indexes, caches) may still hold readable copies of the user's data
→ With CDC-based derivation, rebuild the derived views from the log after erasure

Ethical Principles Checklist

Data Collection:
  □ Do we need this data? (minimalism)
  □ Have we informed users clearly? (consent)
  □ Can users opt out? (control)
  □ Is it proportionate to the benefit? (proportionality)

Data Use:
  □ Only used for stated purpose? (purpose limitation)
  □ Not used to discriminate? (fairness)
  □ Not sold without consent? (commercialization)
  □ Not accessible to unauthorized parties? (security)

Data Retention:
  □ How long do we keep it? (retention limits)
  □ When and how do we delete? (right to erasure)
  □ What's our breach response? (notification)

System Design:
  □ Could this system be used for surveillance?
  □ Could ML models trained on this data be biased?
  □ Is there a "right to explanation" for automated decisions?

Key Trade-offs

Decision          Pro                                       Con                                             When to Use
────────────────  ────────────────────────────────────────  ──────────────────────────────────────────────  ───────────────────────────────
Monolithic DB     Simple, consistent, one system to manage  Not optimized for all workloads                 Small systems, early-stage
Unbundled stack   Each tool best for its workload           Complex operations, eventual consistency        Large-scale, multiple workloads
Lambda arch       Historical + real-time coverage           Dual code paths, complex merge                  Avoid; use Kappa instead
Kappa arch        Single code path, simpler                 Requires long Kafka retention for reprocessing  Most modern streaming systems
Immutable events  Audit trail, re-derivability              Storage cost, right-to-erasure complications    Event-driven architectures
Mutable state     Simpler, less storage                     No history, hard to audit                       Simple CRUD apps

Red Flags

❌ Treating derived data as source of truth (can’t re-derive; single point of corruption)
❌ Point-to-point synchronization between N systems (N² complexity)
❌ Lambda architecture (two code paths; complexity with no benefit over Kappa)
❌ Collecting data “just in case” without clear purpose (GDPR risk + ethical problem)
❌ Assuming DB-level ACID means your application is correct (end-to-end thinking needed)
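
The N² claim in the point-to-point red flag is simple arithmetic. The sketch below assumes the worst case, where every system must push changes to every other system, and compares it with each system holding a single connection to a shared log. The function names are illustrative only.

```python
def point_to_point_links(n: int) -> int:
    """Directed sync paths when each of n systems pushes to every other one."""
    return n * (n - 1)


def log_connections(n: int) -> int:
    """Connections when each of n systems talks only to the shared log."""
    return n


for n in (3, 6, 12):
    print(n, point_to_point_links(n), log_connections(n))
```

Doubling the number of systems roughly quadruples the point-to-point paths but only doubles the log connections.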

Green Flags

✅ Event log (Kafka) as integration backbone; all systems derive from it
✅ CDC + derived stores: DB is source of truth; search/cache/analytics derived
✅ Replay capability: long Kafka retention enables historical reprocessing
✅ Data minimalism: only collect what you need
✅ Design idempotency at every layer of the stack

Modern Additions (2026)

Data Mesh:
├─ Domain teams own their data as products
├─ Federated governance: shared standards, independent ownership
└─ Complements technical unbundling with organizational structure

AI/LLM integration:
├─ RAG (Retrieval-Augmented Generation): LLMs query data systems at inference time
├─ Text2SQL: democratize data access via natural language
└─ Data governance for AI: model cards, data lineage for training data

Privacy Preserving:
├─ Differential privacy (Apple, Google): mathematical privacy guarantees on aggregate queries
├─ Federated learning: train ML without centralizing raw data
└─ Synthetic data: replace sensitive data with realistic generated data
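
As one illustration of a mathematical privacy guarantee on an aggregate query, here is a sketch of the Laplace mechanism for a count. It is a textbook construction, not a description of how Apple or Google implement differential privacy. `private_count` and the sample `users` are hypothetical, and the noise sampling relies on the fact that the difference of two independent exponential variables with mean `scale` follows a Laplace distribution with that scale.

```python
import random
from typing import Callable, Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    # Difference of two i.i.d. exponentials (mean = scale) is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records: Iterable[dict],
                  predicate: Callable[[dict], bool],
                  epsilon: float,
                  rng: Optional[random.Random] = None) -> float:
    """Count matching records, adding noise calibrated to the privacy budget epsilon."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


users = [{"age": age} for age in (17, 22, 35, 41, 58, 63)]
noisy_adults = private_count(users, lambda u: u["age"] >= 18, epsilon=0.5)
```

A smaller `epsilon` means more noise and stronger privacy; repeated queries consume the budget, which this sketch does not track.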

Regulatory landscape (2026):
├─ GDPR (EU, 2018): Right to erasure, data minimization, purpose limitation
├─ CCPA (California, 2020): Consumer privacy rights
├─ EU AI Act (2024): Transparency, bias audits for high-risk AI systems
└─ Data Governance Act (EU): data sharing frameworks

Interview Response Templates

When Asked About Data System Architecture

“I’d start with the principle that the event log (Kafka) is the backbone. The authoritative source writes to Kafka; derived systems (search, cache, analytics) subscribe and build their own optimized views. This avoids point-to-point synchronization between systems and gives us replay capability — if a derived view is corrupted, we rebuild from the Kafka log. The key trade-off is operational complexity vs. performance optimization.”

When Asked About Eventual Consistency in Data Systems

“In an unbundled architecture, derived stores (Elasticsearch, Redis) are eventually consistent with the source database. For most reads this is fine: users can usually tolerate a delay on the order of 100 ms before a change shows up in search results. Where read-your-writes consistency matters (a user viewing their own recent update), route those reads to the source DB or use a consistency protocol. The key is to identify which operations require strong consistency and which can tolerate eventual consistency, rather than paying the latency cost of strong consistency everywhere.”
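
One way to back up the "route reads to the source DB" part of that answer is a small routing sketch. `ReadRouter` is hypothetical, the two dictionaries stand in for the source database and a lagging derived store, and `pin_seconds` is an assumed upper bound on replication lag that a real system would have to measure.

```python
import time
from typing import Callable, Dict, Optional


class ReadRouter:
    """Serves a user's reads from the primary for a short window after they write."""

    def __init__(self, primary: Dict[str, str], replica: Dict[str, str],
                 pin_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.primary = primary
        self.replica = replica
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write: Dict[str, float] = {}  # user_id -> time of last write

    def write(self, user_id: str, key: str, value: str) -> None:
        self.primary[key] = value
        self._last_write[user_id] = self.clock()

    def read(self, user_id: str, key: str) -> Optional[str]:
        last = self._last_write.get(user_id, float("-inf"))
        recently_wrote = self.clock() - last < self.pin_seconds
        store = self.primary if recently_wrote else self.replica
        return store.get(key)


now = [0.0]  # manual clock so the example is deterministic
primary: Dict[str, str] = {}
replica: Dict[str, str] = {}
router = ReadRouter(primary, replica, pin_seconds=1.0, clock=lambda: now[0])

router.write("alice", "profile:alice", "new bio")
own_read = router.read("alice", "profile:alice")    # primary: sees her own write
other_read = router.read("bob", "profile:alice")    # replica: not caught up yet

replica.update(primary)  # replication catches up
now[0] = 5.0
later_read = router.read("alice", "profile:alice")  # back on the replica
```

Only the writer pays the cost of hitting the primary, and only briefly; everyone else keeps reading from the fast derived store.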


Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13