Chapter 12 Cheat Sheet - The Future of Data Systems

One-Line Summaries

Concept	One-Liner
Unbundling databases	Use specialized tools for each concern; compose via event log
Event log as backbone	Kafka as durable, replayable source of truth, ordered within each partition
Derived data	Any state computable from the event log; re-derivable if lost
Lambda architecture	Batch + speed + serving layers; antipattern (dual code paths)
Kappa architecture	Single stream processor; replay log for historical reprocessing
End-to-end correctness	Low-level DB guarantees don’t automatically mean correct app behavior
Data minimalism	Don’t collect data you don’t need; less data = less risk
Data mesh	Domain teams own their data as products; federated governance
Idempotent writes	Operations safe to retry; same result regardless of repetition

The Unbundled Data Stack

Traditional monolith:            Unbundled (modern):
─────────────────────────        ───────────────────────────────────────
One database does:               Specialized tools connected by event log:
├─ OLTP                          │
├─ Full-text search              ├─ PostgreSQL (OLTP, ACID)
├─ Analytics                     ├─ Elasticsearch (full-text search)
├─ Caching                       ├─ Redis (low-latency cache)
├─ Graph queries                 ├─ Snowflake (analytics/OLAP)
└─ Message queuing               ├─ Flink (stream processing)
                                 ├─ Neo4j (graph)
                                 └─ Kafka (event log / integration backbone)
                                 
How they stay in sync:
  DB → CDC → Kafka → each system subscribes and updates itself
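
A minimal in-memory sketch of the sync flow above, assuming a single ordered log and two subscribers that each track their own offset. EventLog, ChangeEvent, and DerivedView are hypothetical names for illustration, not a Kafka or CDC client API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One CDC record: the row's key and its new value (None means deleted)."""
    key: str
    value: Optional[dict]


@dataclass
class EventLog:
    """Stand-in for one Kafka partition: an append-only, ordered list."""
    events: list = field(default_factory=list)

    def append(self, event: ChangeEvent) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the new event

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


class DerivedView:
    """A subscriber that tracks its own offset and applies events in order."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.offset = 0
        self.state: dict = {}

    def catch_up(self) -> None:
        for event in self.log.read_from(self.offset):
            if event.value is None:
                self.state.pop(event.key, None)
            else:
                self.state[event.key] = event.value
            self.offset += 1


log = EventLog()
cache = DerivedView(log)          # plays the role of Redis
search_index = DerivedView(log)   # plays the role of Elasticsearch

log.append(ChangeEvent("user:1", {"name": "Ada"}))
log.append(ChangeEvent("user:2", {"name": "Lin"}))
cache.catch_up()                  # cache is current; search index lags behind
log.append(ChangeEvent("user:2", None))
search_index.catch_up()           # search index has now applied all three events
```

Each view is only eventually consistent: here the cache still holds user:2 until its next catch_up() call, after which both views agree.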

Data Flow Architecture

SOURCES                    BACKBONE              DERIVED VIEWS
───────────────────────    ─────────────────     ──────────────────────
User actions               │              │      Search index (Elasticsearch)
Application events         │    Kafka     │  →   Cache (Redis)
Database changes (CDC)  →  │   (durable   │      OLAP (Snowflake)
IoT sensors                │    ordered   │      ML features (Feature Store)
External APIs              │     log)     │      Data warehouse (dbt models)
                           │              │      Materialized views
                                          
Serving layer: Application reads from derived views (fast, specialized)
Write path: All writes captured in Kafka; derived views eventually consistent

Lambda vs Kappa Architecture

LAMBDA ARCHITECTURE (antipattern):
  Input ──┬─→ Batch Layer (Spark, full recompute)  ──┐
          │                                          ├─→ Serving Layer
          └─→ Speed Layer (Flink, real-time only)  ──┘
  
  Problems:
  ❌ Two code paths for same logic (batch + streaming)
  ❌ Hard to keep both identical
  ❌ Complex operational overhead
  ❌ Merging results at serving layer is complex

KAPPA ARCHITECTURE (better):
  Input → Kafka log ──→ Stream Processor ──→ Serving Layer
                          (Flink/Spark)
  Historical reprocessing: replay the Kafka log from the earliest retained offset

  Advantages:
  ✅ Single code path (streaming handles both)
  ✅ Replay for reprocessing (Kafka retention)
  ✅ Simpler operation
  ✅ Same logic for real-time and backfill
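
A small sketch of the single-code-path idea, assuming the retained log fits in a Python list. apply_events and the event fields ("type", "page") are hypothetical; a real deployment would run the same logic inside a stream processor reading from Kafka.

```python
from collections import Counter
from typing import Iterable, Optional


def apply_events(events: Iterable[dict], view: Optional[Counter] = None) -> Counter:
    """The single code path: fold page-view events into per-page counts."""
    view = Counter() if view is None else view
    for event in events:
        if event["type"] == "page_view":
            view[event["page"]] += 1
    return view


# Hypothetical retained log; in a Kappa setup this would be a Kafka topic.
retained_log = [
    {"type": "page_view", "page": "/home"},
    {"type": "click", "page": "/home"},
    {"type": "page_view", "page": "/docs"},
]

# Live processing: the view is updated incrementally as events arrive.
live_view = apply_events(retained_log)
new_event = {"type": "page_view", "page": "/home"}
retained_log.append(new_event)
apply_events([new_event], live_view)

# Backfill: replay the whole log from the start through the same function.
rebuilt_view = apply_events(retained_log)
```

Because live updates and backfill share one function, the rebuilt view matches the live view by construction, which is the property Lambda's two code paths make hard to guarantee.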

Correctness Layers

Layer 7: Application logic          ← your code
         (correct business behavior)
         
Layer 6: End-to-end idempotency     ← request IDs, deduplication
         (safe retries at app layer)
         
Layer 5: Distributed transactions   ← 2PC, Saga pattern
         (2PC: atomic commit across services; sagas: compensating actions)
         
Layer 4: Consensus / ordering       ← Raft, Kafka per-partition order
         (agreed-upon event sequence)
         
Layer 3: Replication                ← multi-leader, quorum writes
         (durability across nodes)
         
Layer 2: Storage engine             ← WAL, B-tree/LSM correctness
         (crash recovery)
         
Layer 1: Hardware / network         ← disk checksums, TCP
         (physical correctness)

Point: Lower layers don't guarantee upper layers.
Database ACID ≠ correct application behavior.
Must design for correctness at each layer.
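
As an illustration of Layer 6, a minimal sketch of request-ID deduplication, assuming the client generates one ID per logical operation and reuses it on every retry. PaymentService and its fields are hypothetical names.

```python
import uuid


class PaymentService:
    """Toy service that deduplicates by client-generated request ID."""

    def __init__(self) -> None:
        self.balances: dict = {}
        self.processed: dict = {}  # request_id -> balance returned the first time

    def credit(self, request_id: str, account: str, amount: int) -> int:
        # A retried request returns the stored result instead of re-applying.
        if request_id in self.processed:
            return self.processed[request_id]
        new_balance = self.balances.get(account, 0) + amount
        self.balances[account] = new_balance
        self.processed[request_id] = new_balance
        return new_balance


service = PaymentService()
request_id = str(uuid.uuid4())   # generated once by the client, reused on retry

first = service.credit(request_id, "acct-1", 50)
retry = service.credit(request_id, "acct-1", 50)   # e.g. resent after a timeout
```

In a real system the balance update and the processed-request record would need to be written in the same transaction; otherwise a crash between the two writes reintroduces the duplicate.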

Handling Immutability + GDPR

Problem: Event log is immutable → how to implement "right to erasure"?

Solution: Cryptographic erasure
  1. Generate per-user encryption key: K_user
  2. Encrypt all user's events with K_user before storing in log
  3. To "forget" user: delete K_user from key store
  4. Events still in log but now indecipherable → functionally erased
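
The four steps above can be sketched as follows. This is an illustration only: the cipher is a toy keystream built from SHA-256 so the example runs with the standard library, and a real system would use a vetted authenticated cipher (for example AES-GCM) with a managed key store. All names (key_store, append_event, forget_user) are hypothetical.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + nonce + counter. Not for production."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)  # fresh nonce per event
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))


key_store: dict = {}   # mutable store holding K_user per user
event_log: list = []   # append-only list of (user_id, encrypted payload)


def append_event(user_id: str, payload: bytes) -> None:
    # Steps 1 and 2: create K_user on first use, encrypt before appending.
    if user_id not in key_store:
        key_store[user_id] = secrets.token_bytes(32)
    event_log.append((user_id, encrypt(key_store[user_id], payload)))


def read_events(user_id: str) -> list:
    key = key_store.get(user_id)
    if key is None:
        return []  # key deleted: ciphertext stays in the log but is unreadable
    return [decrypt(key, blob) for uid, blob in event_log if uid == user_id]


def forget_user(user_id: str) -> None:
    # Step 3: deleting the key is the only mutation; the log is untouched.
    key_store.pop(user_id, None)


append_event("u1", b"signup")
append_event("u1", b"login")
append_event("u2", b"signup")
before = read_events("u1")
forget_user("u1")
after = read_events("u1")
```

Note that this sketch still stores user_id in the clear next to each ciphertext; only the payload is erased.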

Alternative: Pseudonymization
  Replace user_id in events with a random token
  "Forget" = delete mapping from user_id to token
  Events remain but not linkable to the user
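
A minimal sketch of the token-replacement alternative, assuming the mapping table is the only place where user_id and token appear together. token_by_user, record_event, and forget_mapping are hypothetical names.

```python
import secrets

token_by_user: dict = {}    # the only place user_id and token meet
pseudonymous_log: list = []  # append-only; stores tokens, never user_ids


def token_for(user_id: str) -> str:
    """Return the user's random token, creating it on first use."""
    if user_id not in token_by_user:
        token_by_user[user_id] = secrets.token_hex(16)
    return token_by_user[user_id]


def record_event(user_id: str, action: str) -> None:
    pseudonymous_log.append({"subject": token_for(user_id), "action": action})


def events_for(user_id: str) -> list:
    token = token_by_user.get(user_id)
    if token is None:
        return []  # mapping deleted: events can no longer be linked to the user
    return [e for e in pseudonymous_log if e["subject"] == token]


def forget_mapping(user_id: str) -> None:
    token_by_user.pop(user_id, None)


record_event("alice", "signup")
record_event("alice", "purchase")
linked_before = events_for("alice")
forget_mapping("alice")
linked_after = events_for("alice")
```

This only works if the event payload itself carries no identifying fields (email, address, free text); otherwise the events remain linkable even without the mapping.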

Risk: Derived stores (search indexes, caches) still hold copies of the erased data
→ Because they are derived from the log, rebuild or purge them after erasure

Ethical Principles Checklist

Data Collection:
  □ Do we need this data? (minimalism)
  □ Have we informed users clearly? (consent)
  □ Can users opt out? (control)
  □ Is it proportionate to the benefit? (proportionality)

Data Use:
  □ Only used for stated purpose? (purpose limitation)
  □ Not used to discriminate? (fairness)
  □ Not sold without consent? (commercialization)
  □ Not accessible to unauthorized parties? (security)

Data Retention:
  □ How long do we keep it? (retention limits)
  □ When and how do we delete? (right to erasure)
  □ What's our breach response? (notification)

System Design:
  □ Could this system be used for surveillance?
  □ Could ML models trained on this data be biased?
  □ Is there a "right to explanation" for automated decisions?

Key Trade-offs

Decision	Pro	Con	When to Use
Monolithic DB	Simple, consistent, one system to manage	Not optimized for all workloads	Small systems, early-stage
Unbundled stack	Each tool best for its workload	Complex operations, eventual consistency	Large-scale, multiple workloads
Lambda arch	Historical + real-time coverage	Dual code paths, complex merge	Avoid; use Kappa instead
Kappa arch	Single code path, simpler	Requires long Kafka retention for reprocessing	Most modern streaming systems
Immutable events	Audit trail, re-derivability	Storage cost, right-to-erasure complications	Event-driven architectures
Mutable state	Simpler, less storage	No history, hard to audit	Simple CRUD apps

Red Flags

❌ Treating derived data as source of truth (can’t re-derive; single point of corruption)
❌ Point-to-point synchronization between N systems (O(N²) connections to maintain)
❌ Lambda architecture (two code paths; complexity with no benefit over Kappa)
❌ Collecting data “just in case” without clear purpose (GDPR risk + ethical problem)
❌ Assuming DB-level ACID means your application is correct (end-to-end thinking needed)
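
A quick worked count behind the point-to-point red flag, under two simplifying assumptions: in the worst case every system feeds every other system directly, and with a shared log each system needs only one connection to it. The function names are hypothetical.

```python
def point_to_point_links(n: int) -> int:
    """Directed sync pipelines if each of n systems feeds every other one."""
    return n * (n - 1)


def log_based_links(n: int) -> int:
    """Connections if each of n systems only talks to the shared log."""
    return n


for n in (3, 6, 12):
    print(n, point_to_point_links(n), log_based_links(n))
```

Doubling the number of systems roughly quadruples the point-to-point pipelines but only doubles the log connections.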

Green Flags

✅ Event log (Kafka) as integration backbone; all systems derive from it
✅ CDC + derived stores: DB is source of truth; search/cache/analytics derived
✅ Replay capability: long Kafka retention enables historical reprocessing
✅ Data minimalism: only collect what you need
✅ Design idempotency at every layer of the stack

Modern Additions (2026)

Data Mesh:
├─ Domain teams own their data as products
├─ Federated governance: shared standards, independent ownership
└─ Complements technical unbundling with organizational structure

AI/LLM integration:
├─ RAG (Retrieval-Augmented Generation): LLMs query data systems at inference time
├─ Text2SQL: democratize data access via natural language
└─ Data governance for AI: model cards, data lineage for training data

Privacy-preserving techniques:
├─ Differential privacy (Apple, Google): mathematical privacy guarantees on aggregate queries
├─ Federated learning: train ML without centralizing raw data
└─ Synthetic data: replace sensitive data with realistic generated data
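
To make the differential privacy bullet concrete, here is a minimal sketch of the Laplace mechanism on a count query, assuming sensitivity 1 (adding or removing one person changes the count by at most 1). The function names and the epsilon value are illustrative assumptions, not a production library.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values: list, epsilon: float, rng: random.Random) -> float:
    """Noisy count of truthy values; noise scale is sensitivity / epsilon."""
    true_count = sum(1 for v in values if v)
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)
opted_in = [True] * 40 + [False] * 60
noisy = private_count(opted_in, epsilon=1.0, rng=rng)
```

A smaller epsilon means more noise and stronger privacy; each individual answer is perturbed, while the average over many queries stays close to the true count.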

Regulatory landscape (2026):
├─ GDPR (EU, 2018): Right to erasure, data minimization, purpose limitation
├─ CCPA (California, 2020): Consumer privacy rights
├─ EU AI Act (2024): Transparency, bias audits for high-risk AI systems
└─ Data Governance Act (EU): data sharing frameworks

Interview Response Templates

When Asked About Data System Architecture

“I’d start with the principle that the event log (Kafka) is the backbone. The authoritative source writes to Kafka; derived systems (search, cache, analytics) subscribe and build their own optimized views. This avoids point-to-point synchronization between systems and gives us replay capability: if a derived view is corrupted, we rebuild it from the Kafka log. The key trade-off is higher operational complexity in exchange for a store optimized for each workload.”

When Asked About Eventual Consistency in Data Systems

“In an unbundled architecture, derived stores (Elasticsearch, Redis) are eventually consistent with the source database. For most reads this is fine: users can tolerate a delay of 100 ms or so in search index updates. For reads that must reflect the user’s own recent writes (read-your-writes), route them to the source DB or use a consistency protocol. The key is to identify which operations require strong consistency and which can tolerate eventual consistency, so you don’t pay the latency cost of strong consistency everywhere.”
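
One way to sketch the routing rule in this answer, assuming an upper bound on how far derived stores lag the source. ReadRouter, lag_window_s, and the store labels are hypothetical; the clock is injectable so the behavior can be checked without sleeping.

```python
import time


class ReadRouter:
    """Route a user's reads to the source DB shortly after that user writes."""

    def __init__(self, lag_window_s: float, clock=time.monotonic) -> None:
        self.lag_window_s = lag_window_s   # assumed upper bound on derived-store lag
        self.clock = clock
        self.last_write_at: dict = {}

    def record_write(self, user_id: str) -> None:
        self.last_write_at[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self.last_write_at.get(user_id)
        if last is not None and self.clock() - last < self.lag_window_s:
            return "source_db"      # strong: sees the user's own write
        return "derived_store"      # eventual: acceptable for other reads


now = [0.0]
router = ReadRouter(lag_window_s=1.0, clock=lambda: now[0])
router.record_write("alice")
now[0] = 0.5
soon_after = router.target_for_read("alice")   # within the lag window
other_user = router.target_for_read("bob")     # never wrote
now[0] = 2.0
later = router.target_for_read("alice")        # window has passed
```

Only the writing user pays the cost of hitting the source database, and only for the duration of the window.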

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13

Study Notes by Niladri & AI
