Chapter 12 Cheat Sheet - The Future of Data Systems
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Unbundling databases | Use specialized tools for each concern; compose via event log |
| Event log as backbone | Kafka as durable, replayable source of truth, ordered within each partition |
| Derived data | Any state computable from the event log; re-derivable if lost |
| Lambda architecture | Batch + speed + serving layers; antipattern (dual code paths) |
| Kappa architecture | Single stream processor; replay log for historical reprocessing |
| End-to-end correctness | Low-level DB guarantees don’t automatically mean correct app behavior |
| Data minimalism | Don’t collect data you don’t need; less data = less risk |
| Data mesh | Domain teams own their data as products; federated governance |
| Idempotent writes | Operations safe to retry; same result regardless of repetition |
The Unbundled Data Stack
Traditional monolith:          Unbundled (modern):
─────────────────────────      ───────────────────────────────────────
One database does:             Specialized tools connected by event log:
├─ OLTP                        │
├─ Full-text search            ├─ PostgreSQL (OLTP, ACID)
├─ Analytics                   ├─ Elasticsearch (full-text search)
├─ Caching                     ├─ Redis (low-latency cache)
├─ Graph queries               ├─ Snowflake (analytics/OLAP)
└─ Message queuing             ├─ Flink (stream processing)
                               ├─ Neo4j (graph)
                               └─ Kafka (event log / integration backbone)
How they stay in sync:
DB → CDC → Kafka → each system subscribes and updates itself
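The sync flow above can be illustrated with a small in-memory sketch. It is not a Kafka client: `EventLog` stands in for a single ordered partition, and `ChangeEvent`, `SearchIndex`, and `Cache` are hypothetical names chosen for this example. The point is only that each derived system applies the same ordered stream of changes to its own view.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class ChangeEvent:
    """One change captured from the source database (hypothetical shape)."""
    key: str
    value: Optional[str]  # None marks a delete


@dataclass
class EventLog:
    """In-memory stand-in for one ordered, append-only log partition."""
    events: List[ChangeEvent] = field(default_factory=list)
    subscribers: List[Callable[[ChangeEvent], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[ChangeEvent], None]) -> None:
        self.subscribers.append(handler)

    def append(self, event: ChangeEvent) -> None:
        self.events.append(event)
        for handler in self.subscribers:  # every subscriber sees the same order
            handler(event)


class SearchIndex:
    """Derived view: word -> set of keys whose current value contains it."""
    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.postings: Dict[str, Set[str]] = {}

    def apply(self, event: ChangeEvent) -> None:
        old = self.docs.pop(event.key, None)
        if old is not None:  # drop postings for the previous value
            for word in old.lower().split():
                self.postings[word].discard(event.key)
        if event.value is not None:
            self.docs[event.key] = event.value
            for word in event.value.lower().split():
                self.postings.setdefault(word, set()).add(event.key)

    def search(self, word: str) -> List[str]:
        return sorted(self.postings.get(word.lower(), set()))


class Cache:
    """Derived view: key -> latest value."""
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, event: ChangeEvent) -> None:
        if event.value is None:
            self.data.pop(event.key, None)
        else:
            self.data[event.key] = event.value


log = EventLog()
index, cache = SearchIndex(), Cache()
log.subscribe(index.apply)
log.subscribe(cache.apply)

log.append(ChangeEvent("p1", "red bicycle"))
log.append(ChangeEvent("p2", "red kettle"))
log.append(ChangeEvent("p1", None))  # delete propagates to both views
```

In a real deployment the subscribers would be separate consumers reading at their own pace, which is where the eventual consistency comes from; the synchronous fan-out here is a simplification.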
Data Flow Architecture
SOURCES                        BACKBONE             DERIVED VIEWS
───────────────────────        ─────────────────    ──────────────────────
User actions                   │           │        Search index (Elasticsearch)
Application events             │  Kafka    │   →    Cache (Redis)
Database changes (CDC)    →    │ (durable  │        OLAP (Snowflake)
IoT sensors                    │  ordered  │        ML features (Feature Store)
External APIs                  │  log)     │        Data warehouse (dbt models)
                               │           │        Materialized views
Serving layer: Application reads from derived views (fast, specialized)
Write path: All writes captured in Kafka; derived views eventually consistent
Lambda vs Kappa Architecture
LAMBDA ARCHITECTURE (antipattern):
Input →─┬─→ Batch Layer (Spark, full recompute) ──┐
        │                                         ├─→ Serving Layer
        └─→ Speed Layer (Flink, real-time only) ──┘
Problems:
❌ Two code paths for same logic (batch + streaming)
❌ Hard to keep both identical
❌ Complex operational overhead
❌ Merging results at serving layer is complex
KAPPA ARCHITECTURE (better):
Input → Kafka log ──→ Stream Processor ──→ Serving Layer
                      (Flink/Spark)
Historical reprocessing: replay Kafka log from t=0
Advantages:
✅ Single code path (streaming handles both)
✅ Replay for reprocessing (Kafka retention)
✅ Simpler operation
✅ Same logic for real-time and backfill
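The "same logic for real-time and backfill" point can be shown with a toy example. This is a sketch only: the log is a plain Python list standing in for a retained Kafka topic, and `count_by_page` and `run` are illustrative names. One processing function serves both incremental updates and a full replay from offset 0.

```python
from collections import defaultdict


def count_by_page(view, event):
    """The single piece of processing logic: count views per page."""
    view[event["page"]] += 1


def run(log, start_offset, view=None):
    """Apply the processing logic to every event from start_offset onward."""
    view = defaultdict(int) if view is None else view
    for event in log[start_offset:]:
        count_by_page(view, event)
    return view


log = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]

live_view = run(log, 0)          # initial processing
log.append({"page": "/home"})    # a new event arrives
run(log, 3, live_view)           # real-time path: process only the new offset

rebuilt = run(log, 0)            # backfill path: replay the whole log
```

Because both paths call the same function, the rebuilt view matches the live one, which is the property Lambda has to work to maintain across two codebases.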
Correctness Layers
Layer 7: Application logic ← your code
(correct business behavior)
Layer 6: End-to-end idempotency ← request IDs, deduplication
(safe retries at app layer)
Layer 5: Distributed transactions ← 2PC, Saga pattern
(atomic commits across services; Sagas compensate rather than commit atomically)
Layer 4: Consensus / ordering ← Raft, Kafka per-partition order
(agreed-upon event sequence)
Layer 3: Replication ← multi-leader, quorum writes
(durability across nodes)
Layer 2: Storage engine ← WAL, B-tree/LSM correctness
(crash recovery)
Layer 1: Hardware / network ← disk checksums, TCP
(physical correctness)
Point: Lower layers don't guarantee upper layers.
Database ACID ≠ correct application behavior.
Must design for correctness at each layer.
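Layer 6 in the stack above (request IDs and deduplication) can be sketched in a few lines. This is a minimal in-memory illustration, assuming the client generates a unique `request_id` per logical operation; `Account` and `deposit` are hypothetical names, not part of any library.

```python
class Account:
    """Toy account whose deposits are safe to retry."""

    def __init__(self, balance=0):
        self.balance = balance
        self._results = {}  # request_id -> result of the first execution

    def deposit(self, request_id, amount):
        # A retried request returns the stored result instead of
        # applying the deposit a second time.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


account = Account(100)
first = account.deposit("req-1", 50)
retry = account.deposit("req-1", 50)   # e.g. client timed out and resent
other = account.deposit("req-2", 25)   # a different request still applies
```

In a real system the deduplication record and the balance update would need to be committed in the same transaction; otherwise a crash between the two reintroduces the duplicate.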
Handling Immutability + GDPR
Problem: Event log is immutable → how to implement "right to erasure"?
Solution: Cryptographic erasure
1. Generate per-user encryption key: K_user
2. Encrypt all user's events with K_user before storing in log
3. To "forget" user: delete K_user from key store
4. Events still in log but now indecipherable → functionally erased
Alternative: Pseudonymization
Replace user_id in events with a random token
"Forget" = delete mapping from user_id to token
Events remain but not linkable to the user
Risk: Derived stores (search indexes, caches) may hold readable copies and must also be purged
→ CDC-based derivation: rebuild derived views from the log after erasure
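The cryptographic erasure steps above can be sketched as follows. Everything here is an assumption made for illustration: `KeyStore`, `append_event`, and `read_event` are hypothetical names, and `toy_encrypt` is a stand-in built from SHA-256 so the example runs with the standard library only. It is not a secure cipher; a real system would use an authenticated cipher such as AES-GCM from a vetted library.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (NOT secure): SHA-256 over key, nonce, and a counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


class KeyStore:
    """Mutable store of per-user keys (K_user), kept outside the log."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        self._keys.pop(user_id, None)


keys = KeyStore()
log = []  # append-only: entries are never modified or removed


def append_event(user_id, payload):
    log.append((user_id, toy_encrypt(keys.key_for(user_id), payload.encode())))


def read_event(i):
    user_id, blob = log[i]
    key = keys.get(user_id)
    return None if key is None else toy_decrypt(key, blob).decode()


append_event("alice", "viewed /pricing")
append_event("bob", "viewed /home")
before = read_event(0)
keys.forget("alice")        # the "right to erasure" request
after = read_event(0)       # ciphertext remains, but is no longer readable
```

The log itself is untouched by the erasure. Note that this sketch still stores `user_id` in clear text next to each ciphertext; combining it with the token-based alternative above would remove that link as well.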
Ethical Principles Checklist
Data Collection:
□ Do we need this data? (minimalism)
□ Have we informed users clearly? (consent)
□ Can users opt out? (control)
□ Is it proportionate to the benefit? (proportionality)
Data Use:
□ Only used for stated purpose? (purpose limitation)
□ Not used to discriminate? (fairness)
□ Not sold without consent? (commercialization)
□ Not accessible to unauthorized parties? (security)
Data Retention:
□ How long do we keep it? (retention limits)
□ When and how do we delete? (right to erasure)
□ What's our breach response? (notification)
System Design:
□ Could this system be used for surveillance?
□ Could ML models trained on this data be biased?
□ Is there a "right to explanation" for automated decisions?
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| Monolithic DB | Simple, consistent, one system to manage | Not optimized for all workloads | Small systems, early-stage |
| Unbundled stack | Each tool best for its workload | Complex operations, eventual consistency | Large-scale, multiple workloads |
| Lambda arch | Historical + real-time coverage | Dual code paths, complex merge | Avoid; use Kappa instead |
| Kappa arch | Single code path, simpler | Requires long Kafka retention for reprocessing | Most modern streaming systems |
| Immutable events | Audit trail, re-derivability | Storage cost, right-to-erasure complications | Event-driven architectures |
| Mutable state | Simpler, less storage | No history, hard to audit | Simple CRUD apps |
Red Flags
❌ Treating derived data as source of truth (can’t re-derive; single point of corruption)
❌ Point-to-point synchronization between N systems (N² complexity)
❌ Lambda architecture (two code paths to keep in sync; Kappa covers the same needs when the log can be replayed)
❌ Collecting data “just in case” without clear purpose (GDPR risk + ethical problem)
❌ Assuming DB-level ACID means your application is correct (end-to-end thinking needed)
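The N² figure in the point-to-point red flag can be checked with simple arithmetic. A sketch, assuming every pair of systems needs a sync link in each direction, and that with a shared log each system needs only one connection to the log (function names are illustrative):

```python
def point_to_point_links(n: int) -> int:
    """Directed sync links if every system pushes changes to every other."""
    return n * (n - 1)


def log_backbone_links(n: int) -> int:
    """Connections when each system only talks to the shared log."""
    return n


for n in (3, 6, 12):
    print(n, point_to_point_links(n), log_backbone_links(n))
```

Doubling the number of systems roughly quadruples the point-to-point links, while the log-based count only doubles.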
Green Flags
✅ Event log (Kafka) as integration backbone; all systems derive from it
✅ CDC + derived stores: DB is source of truth; search/cache/analytics derived
✅ Replay capability: long Kafka retention enables historical reprocessing
✅ Data minimalism: only collect what you need
✅ Design idempotency at every layer of the stack
Modern Additions (2026)
Data Mesh:
├─ Domain teams own their data as products
├─ Federated governance: shared standards, independent ownership
└─ Complements technical unbundling with organizational structure
AI/LLM integration:
├─ RAG (Retrieval-Augmented Generation): LLMs query data systems at inference time
├─ Text2SQL: democratize data access via natural language
└─ Data governance for AI: model cards, data lineage for training data
Privacy Preserving:
├─ Differential privacy (Apple, Google): mathematical privacy guarantees on aggregate queries
├─ Federated learning: train ML without centralizing raw data
└─ Synthetic data: replace sensitive data with realistic generated data
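The differential privacy item above can be made concrete with the Laplace mechanism, one common way to answer a count query privately. This is a sketch under stated assumptions: the query is a simple count (sensitivity 1), `epsilon` is the privacy budget for that one query, and `private_count` and `laplace_noise` are illustrative names. Production systems also track cumulative budget across queries, which is omitted here.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count: true count plus Laplace noise with scale 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1).
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)  # seeded only to make the example repeatable
ages = [23, 35, 41, 29, 52, 38, 61, 19]
noisy = private_count(ages, lambda a: a >= 30, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy; the true count here is 5, and each call returns a different value near it.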
Regulatory landscape (2026):
├─ GDPR (EU, 2018): Right to erasure, data minimization, purpose limitation
├─ CCPA (California, 2020): Consumer privacy rights
├─ EU AI Act (2024): Transparency, risk management, bias-aware data governance for high-risk AI systems
└─ Data Governance Act (EU): data sharing frameworks
Interview Response Templates
When Asked About Data System Architecture
“I’d start with the principle that the event log (Kafka) is the backbone. The authoritative source writes to Kafka; derived systems (search, cache, analytics) subscribe and build their own optimized views. This avoids point-to-point synchronization between systems and gives us replay capability — if a derived view is corrupted, we rebuild from the Kafka log. The key trade-off is operational complexity vs. performance optimization.”
When Asked About Eventual Consistency in Data Systems
“In an unbundled architecture, derived stores (Elasticsearch, Redis) are eventually consistent with the source database. For most reads this is fine: users can usually tolerate a short delay (for example, around 100 ms) before a search index reflects an update. For reads that must reflect the user’s own recent writes (read-your-writes), route those reads to the source DB or use a consistency protocol. The key is to identify which operations require strong consistency and which can tolerate eventual consistency, rather than paying the latency cost of strong consistency everywhere.”
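One way to implement the "route reads to the source DB" idea from this answer is a small router that remembers each user's last write. This is a sketch, not a prescribed design: `ReadRouter` is a hypothetical name, and `lag_window_s` is an assumed upper bound on how long derived stores take to catch up.

```python
class ReadRouter:
    """Send a user's reads to the source DB shortly after their own write."""

    def __init__(self, lag_window_s: float = 1.0):
        self.lag_window_s = lag_window_s  # assumed max replication lag
        self._last_write = {}             # user_id -> timestamp of last write

    def record_write(self, user_id: str, now: float) -> None:
        self._last_write[user_id] = now

    def choose(self, user_id: str, now: float) -> str:
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.lag_window_s:
            return "source_db"      # derived store may not have the write yet
        return "derived_store"      # fast path for everyone else


router = ReadRouter(lag_window_s=1.0)
router.record_write("alice", now=10.0)

soon = router.choose("alice", now=10.2)    # within the lag window
later = router.choose("alice", now=11.5)   # derived store has caught up
other = router.choose("bob", now=10.2)     # bob made no recent write
```

Timestamps are passed in explicitly to keep the example deterministic; a real service would read a clock and would size the window from measured replication lag.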
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13