Chapter 12: The Future of Data Systems

Overview

Chapter 12 is the most philosophical chapter in DDIA. Rather than introducing new techniques, it synthesizes the rest of the book into a vision for how data systems should be designed. Kleppmann argues for unbundling databases: composing specialized tools, each suited to one purpose, instead of relying on one monolithic system, with a reliable event log (such as Kafka) as the integration backbone. The chapter also addresses the ethics of data systems, a topic that has only grown in importance by 2026.

Core thesis: Rather than trying to fit all data needs into one database, combine specialized tools (stream processing, search, caches, OLAP) with a durable, ordered log of events as the ground truth. Build systems where derived data is continuously recomputed from the event log.

Key Concepts

Data Integration Challenge

The problem: Modern applications use many different data systems (MySQL, Elasticsearch, Redis, Kafka, Snowflake) that all need to stay in sync with each other.

Bad approach: Point-to-point synchronization (on the order of N² connections for N systems, since every pair of systems may need its own pipeline)
Better approach: Publish all changes to a single log; all systems subscribe
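
To make the scaling difference concrete, here is a small sketch that counts integration links. It assumes that every ordered pair of systems needs its own sync pipeline in the point-to-point case, and that every system needs exactly one connection to the shared log. Both function names are illustrative.

```python
def point_to_point_links(n: int) -> int:
    """Directed sync pipelines if every system feeds every other system."""
    return n * (n - 1)


def log_based_links(n: int) -> int:
    """Connections if every system only talks to one shared log."""
    return n


for n in (3, 5, 10):
    print(n, point_to_point_links(n), log_based_links(n))
```

Under these assumptions, 10 systems need 90 pipelines point-to-point but only 10 connections to a log.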

Key insight from DDIA: The job of data systems is to move data from where it’s produced to where it’s consumed, transforming it along the way. The event log is the reliable medium for this transport.

Derived data principle:

One system is the authoritative source of truth (the write side)
All other representations (search index, cache, OLAP store) are derived data — computable from the source
If derived data is corrupted, re-derive it from the source
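
A minimal sketch of the derived data principle, assuming a tiny in-memory product table as the source of truth and an inverted search index as the derived view. The data and the `build_search_index` name are made up for illustration.

```python
from collections import defaultdict

# Source of truth: hypothetical product records keyed by ID.
source_of_truth = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red rain jacket",
}


def build_search_index(records: dict[int, str]) -> dict[str, set[int]]:
    """Derive an inverted index (word -> record IDs) purely from the source."""
    index: dict[str, set[int]] = defaultdict(set)
    for record_id, text in records.items():
        for word in text.split():
            index[word].add(record_id)
    return dict(index)


search_index = build_search_index(source_of_truth)

# Simulate corruption of the derived data, then repair it by re-deriving.
search_index["red"] = {999}
search_index = build_search_index(source_of_truth)
```

Because the index is a pure function of the source, repairing it never requires patching the index by hand.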

Unbundling Databases

Traditional monolithic database: One system handles storage, indexing, query processing, transactions, caching, full-text search, analytics, etc.

Unbundling: Use purpose-built tools for each concern and compose them:

PostgreSQL: OLTP, relational queries, ACID transactions
Elasticsearch: Full-text search, faceted filtering
Redis: Low-latency key-value cache
Kafka: Durable ordered event log, message streaming
Snowflake/BigQuery: OLAP analytics, column-oriented queries
Flink/Spark: Stream/batch processing, complex transformations
Neo4j: Graph traversal

How to keep them in sync: change data capture (CDC, e.g. via Debezium) → Kafka → each system subscribes and updates itself.
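
The sync flow can be sketched without any real infrastructure. The example below uses an in-memory list as a stand-in for a Kafka topic and does not use the Kafka or Debezium APIs. `ChangeEvent`, `ChangeLog`, and `Subscriber` are hypothetical names. The point is that each derived store keeps its own offset and applies changes in log order.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                      # "upsert" or "delete"
    key: str
    value: Optional[str] = None


@dataclass
class ChangeLog:
    """In-memory stand-in for a durable, ordered log such as a Kafka topic."""
    events: list = field(default_factory=list)

    def append(self, event: ChangeEvent) -> None:
        self.events.append(event)

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


class Subscriber:
    """A derived store that tracks its own offset and applies changes in order."""

    def __init__(self, log: ChangeLog) -> None:
        self.log = log
        self.offset = 0
        self.state: dict = {}

    def poll(self) -> None:
        for event in self.log.read_from(self.offset):
            if event.op == "upsert":
                self.state[event.key] = event.value
            elif event.op == "delete":
                self.state.pop(event.key, None)
            self.offset += 1


log = ChangeLog()
cache, search = Subscriber(log), Subscriber(log)

log.append(ChangeEvent("upsert", "user:1", "Ada"))
cache.poll()                       # the cache is up to date, search lags behind
log.append(ChangeEvent("upsert", "user:1", "Ada Lovelace"))
log.append(ChangeEvent("delete", "user:2"))
cache.poll()
search.poll()                      # search catches up from offset 0
```

The two subscribers consume at different speeds yet converge on the same state, which is the eventual consistency trade-off listed under the disadvantages.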

Advantages of unbundling:

Each tool optimized for its specific workload
Can swap tools without affecting others
Clear ownership: event log is the ground truth

Disadvantages:

Operational complexity (more systems to manage)
Consistency challenges (derived stores are eventually consistent)
Need expertise in each tool

The Lambda Architecture Problem

Lambda architecture (Nathan Marz):

Batch layer: Recompute everything periodically (Hadoop/Spark)
Speed layer: Process new events in real-time (Storm/Flink)
Serving layer: Merge batch results + speed layer results
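
A toy sketch of the three layers, assuming page-view counting as the computation and a fixed cutoff timestamp for the last batch run. The event data and function names are illustrative and not taken from any Lambda implementation.

```python
from collections import Counter

# Hypothetical page-view events as (timestamp, page) pairs.
events = [(1, "/home"), (2, "/docs"), (3, "/home"), (11, "/home"), (12, "/pricing")]
BATCH_CUTOFF = 10  # the last batch run covered events with timestamp <= 10


def batch_layer(all_events, cutoff):
    """Periodic full recomputation over everything up to the cutoff."""
    return Counter(page for ts, page in all_events if ts <= cutoff)


def speed_layer(all_events, cutoff):
    """Incremental counts for events that arrived after the last batch run."""
    view = Counter()
    for ts, page in all_events:
        if ts > cutoff:
            view[page] += 1
    return view


def serving_layer(batch_view, realtime_view):
    """Answer queries by merging the two views."""
    return batch_view + realtime_view


merged = serving_layer(
    batch_layer(events, BATCH_CUTOFF), speed_layer(events, BATCH_CUTOFF)
)
```

Even in this tiny example the counting logic is written twice, once per layer.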

Problem with Lambda:

Maintaining two code paths (batch + streaming) for same computation
Hard to keep the two implementations equivalent; a bug in one path may not show up in the other, so their results silently diverge
Complex to debug and operate

Kappa architecture (Jay Kreps):

Only one layer: stream processing
Historical reprocessing: replay event log through same stream processor
Simpler: one code path for all processing

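DDIA’s view: where possible, prefer a single system that handles both live processing and replay of historical events (the idea behind Kappa), with the event log as the replayable, immutable source of truth.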
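
A minimal sketch of the single-code-path idea, assuming events are plain dictionaries and the derived view is a page-view counter. `apply_event` and `replay` are hypothetical names. The same function serves both live processing and historical reprocessing.

```python
from collections import Counter


def apply_event(view: Counter, event: dict) -> None:
    """The single code path: fold one event into the derived view."""
    if event["type"] == "page_view":
        view[event["page"]] += 1


def replay(log: list) -> Counter:
    """Historical reprocessing: run the same function over the whole log."""
    view = Counter()
    for event in log:
        apply_event(view, event)
    return view


log = [
    {"type": "page_view", "page": "/home"},
    {"type": "page_view", "page": "/docs"},
]

live_view = replay(log)                      # state built so far

new_event = {"type": "page_view", "page": "/home"}
log.append(new_event)
apply_event(live_view, new_event)            # live path handles the new event

rebuilt_view = replay(log)                   # rebuilding from scratch agrees
```

Since there is only one implementation of the logic, the live view and a full rebuild cannot drift apart the way separate batch and speed layers can.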

Building Correct Applications

The fundamental tension:

Distributed systems cannot provide perfect guarantees (CAP theorem, partial failures)
Applications need to provide correct behavior to users

End-to-end correctness:

Low-level guarantees (ACID, exactly-once) don’t automatically mean the application is correct
Correctness must be designed end to end, including the application logic, not just the database

Techniques for correctness:

Immutable events + derived state:

Never delete source events (append-only)
All state derived from events (can re-derive if wrong)
Mistakes are compensated by appending new events rather than corrected retroactively in place

Separation of concerns:

Data flows from authoritative source → event log → derived views
Each derivation is explicit and auditable

Idempotency:

Design all operations to be safe to retry
Unique request IDs; exactly-once semantics at application level
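
A sketch of request-ID deduplication, assuming the client generates a unique ID per logical request and resends the same ID on retry. The `Account` class is a toy. In a real system the deduplication record and the state change would need to be committed atomically, for example in the same database transaction.

```python
class Account:
    """Toy account that deduplicates requests by a client-supplied request ID."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._results: dict[str, int] = {}   # request_id -> balance after applying

    def withdraw(self, request_id: str, amount: int) -> int:
        # A retried request returns the stored result instead of applying twice.
        if request_id in self._results:
            return self._results[request_id]
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        self._results[request_id] = self.balance
        return self.balance


account = Account(balance=100)
first = account.withdraw("req-42", 30)
retry = account.withdraw("req-42", 30)    # e.g. the client timed out and resent
```

The retry is safe because the operation's effect is keyed by the request ID, not by how many times the message arrives.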

Fault-tolerant message ordering:

Total order of events in the log provides a foundation for correctness
Downstream systems can detect and handle ordering violations
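
One way a downstream consumer might detect ordering problems, assuming every event carries a gap-free, increasing sequence number (similar to a log offset within one partition). `OrderedConsumer` is a hypothetical name, and the policy for handling a gap (here: refuse and wait for redelivery) is just one option.

```python
class OrderedConsumer:
    """Applies events in sequence order and reports duplicates and gaps."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.applied: list[str] = []

    def handle(self, seq: int, payload: str) -> str:
        if seq < self.next_seq:
            return "duplicate"            # already applied; safe to ignore
        if seq > self.next_seq:
            return "gap"                  # something is missing; do not apply yet
        self.applied.append(payload)
        self.next_seq += 1
        return "applied"


consumer = OrderedConsumer()
results = [
    consumer.handle(0, "create"),
    consumer.handle(0, "create"),     # redelivery
    consumer.handle(2, "delete"),     # arrived before seq 1
    consumer.handle(1, "update"),
    consumer.handle(2, "delete"),     # retried after the gap is filled
]
```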

Append-only data:

Don’t update/delete records; add new events (event sourcing style)
Mistakes: add correcting events; old records preserved for audit
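
A small sketch of correcting a mistake in an append-only ledger, assuming the current balance is derived by summing all events. The event fields and the `record` helper are illustrative.

```python
ledger = []  # append-only list of events; entries are never modified or removed


def record(event_type: str, amount: int, note: str = "") -> None:
    ledger.append({"type": event_type, "amount": amount, "note": note})


def balance() -> int:
    """Current state is derived by folding over the full history."""
    return sum(e["amount"] for e in ledger)


record("deposit", 100)
record("deposit", 500, note="typo: should have been 50")
# Compensate with a correcting event instead of editing the bad record.
record("correction", -450, note="fixes deposit at index 1")
```

The derived balance is correct after the correction, and the erroneous entry stays in the history for audit.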

Doing the Right Thing (Ethics)

Surveillance systems:

Data collected for one purpose can be repurposed (behavioral profiling, law enforcement)
No technical barrier to aggregating data across systems
Design decision: what data to collect, how long to retain, who can access

Bias and discrimination:

ML models trained on historical data perpetuate historical biases
Example: Hiring algorithms trained on past hires (predominantly male) can learn to discriminate against women
Systems can discriminate without being designed to discriminate

Privacy by design:

Minimize data collection (only what’s needed)
Purpose limitation (only use for stated purpose)
Retention limits (delete after no longer needed)
User control (access, portability, deletion)

DDIA’s challenge to engineers:

We are responsible for the systems we build
Technical decisions have social and ethical consequences
Regulations (GDPR, CCPA) are a floor, not a ceiling
“Just following orders” (specifications) is not an excuse

The principle of data minimalism: Collect less data, and it can’t be misused.

Key Concepts Summary (Synthesis)

The data flow view:

1. Sources: Users, sensors, APIs generate events
2. Durable ordered log: Kafka captures events as immutable, replayable stream
3. Derived stores: Stream/batch processors compute derived views (search, caches, OLAP)
4. Serving: Applications serve from derived views (low-latency reads)
5. Feedback loop: System actions may generate new events → back to step 1

Key properties to design for:

Immutability: Events are facts; don’t delete history
Derivability: All state is derivable from event log
Auditability: Can trace any state back to the events that caused it
Evolvability: New derived views can be built by replaying old events

Important Points

No single database is the right tool for everything: Unbundled, specialized systems connected by an event log.
The event log is the source of truth: All derived views should be derivable from it.
Lambda architecture is costly to maintain: Prefer one code path (Kappa) and replay the stream for historical reprocessing.
Correctness requires end-to-end thinking: Low-level guarantees don’t automatically make applications correct.
Ethics are not optional: The systems we build have social consequences; we have responsibility.
Data minimalism: Don’t collect data you don’t need; you can’t be responsible for data you don’t have.

Examples & Case Studies

LinkedIn’s Samza / Kafka Architecture
- Profile updates, connection requests, page views → Kafka
- Stream processors keep secondary systems (search, feed, notifications) up-to-date
- Kafka as “nervous system” of LinkedIn’s data infrastructure

Netflix Event-Driven Architecture
- All user interactions published as events to Kafka
- Derived systems (recommendations, analytics, A/B testing) all consume from Kafka
- Can retroactively build new features by replaying old events

Google Dremel (BigQuery predecessor)
- Example of unbundling: specialized column-store query engine
- Separate from operational DB; fed by batch jobs
- Demonstrates power of purpose-built systems

GDPR “Right to be Forgotten”
- Immutable event logs create complications: can’t delete events
- Solution: Encrypt user data with a per-user key; “forget” = delete the encryption key
- Demonstrates how data design decisions have legal/ethical consequences
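
The per-user key approach (often called crypto-shredding) can be sketched as follows. Loud assumption: the cipher here is a toy XOR keystream built from `hashlib.shake_256`, used only so the example runs with the standard library. It is not authenticated and must not be used in production, where a vetted cipher such as AES-GCM would take its place. The key store, log layout, and function names are all hypothetical.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """TOY cipher for illustration only: XOR with a SHAKE-256 keystream."""
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


user_keys: dict[str, bytes] = {}     # mutable key store, kept outside the log
event_log: list[dict] = []           # immutable, append-only log


def append_event(user_id: str, payload: str) -> None:
    key = user_keys.setdefault(user_id, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(key, nonce, payload.encode())
    event_log.append({"user": user_id, "nonce": nonce, "data": ciphertext})


def read_event(event: dict):
    key = user_keys.get(event["user"])
    if key is None:
        return None                  # key shredded: payload is unrecoverable
    return _keystream_xor(key, event["nonce"], event["data"]).decode()


def forget_user(user_id: str) -> None:
    """'Forget' a user by deleting the key; the log itself is untouched."""
    user_keys.pop(user_id, None)


append_event("alice", "alice@example.com signed up")
append_event("bob", "bob@example.com signed up")
before = [read_event(e) for e in event_log]
forget_user("alice")
after = [read_event(e) for e in event_log]
```

The log keeps both entries, but Alice's payload can no longer be decrypted. This sketch leaves the user ID itself in plaintext, and whether key deletion counts as erasure in a given jurisdiction is a legal question the code cannot settle.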

Questions

What is the “unbundling” of databases and what are its trade-offs?
What is the Lambda architecture and why is Kappa architecture preferred?
How does treating derived data as derivable from an event log improve fault tolerance?
What ethical responsibilities do data engineers have?
How should you handle GDPR’s right to erasure in an immutable event log?
What does “end-to-end” correctness mean for a distributed system?
Why is data minimalism an important design principle?
How does idempotency contribute to system correctness?

Modern Context (2026)

AI/ML and data systems:

LLMs consuming vast amounts of training data raise new questions about data provenance
Training data curation: what data should AI systems be trained on?
Retrieval-Augmented Generation (RAG): real-time data access patterns for LLMs
AI governance: documenting what data systems do and why (model cards, datasheets)

Data Mesh (2021-2026):

Organizational approach to data systems: domain ownership of data
Each domain team owns their data as a “product” (with SLAs, contracts)
Federated governance: common standards (schema registry, data catalog)
Complements unbundling: technical unbundling + organizational data mesh

Privacy-Preserving Techniques:

Differential privacy: add calibrated random noise to query results so that any one individual's presence or absence has a provably bounded effect on the output
Federated learning: train ML models without centralizing user data
Homomorphic encryption: compute on encrypted data (still slow in 2026)
Synthetic data generation: replace sensitive data with realistic synthetic data
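
As an illustration of the differential privacy item above, here is a sketch of the Laplace mechanism for a counting query. It assumes a sensitivity of 1 (adding or removing one person changes the count by at most 1), so the noise scale is 1/epsilon. The dataset and function names are made up, and Python's `random` module is not a cryptographically secure noise source, so treat this as a teaching sketch only.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query with sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 52, 29, 64, 38]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon gives more accurate answers with a weaker guarantee.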

Data Governance (2026 landscape):

OpenLineage, OpenMetadata: track data lineage and provenance
Great Expectations, dbt tests: data quality as code
GDPR (2018), CCPA (2020), AI Act (EU 2024): regulatory environment forcing good practices
Data contracts: formal agreements between data producers and consumers
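
A data contract can be as simple as a machine-checkable schema that producers validate against before publishing. The sketch below assumes a hypothetical "orders" dataset and a contract expressed as a field-to-type mapping. It does not use the API of Great Expectations, dbt, or any other tool.

```python
# Hypothetical contract for an "orders" dataset: field name -> required type.
ORDER_CONTRACT = {"order_id": str, "amount_cents": int, "currency": str}


def contract_violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable violations; an empty list means the record conforms."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            problems.append(
                f"{field_name}: expected {expected_type.__name__}, "
                f"got {type(record[field_name]).__name__}"
            )
    return problems


good = {"order_id": "o-1", "amount_cents": 1299, "currency": "EUR"}
bad = {"order_id": "o-2", "amount_cents": "12.99"}
```

Running such a check in the producer's pipeline surfaces breaking changes before consumers see them.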

The role of AI in data systems:

Natural language to SQL (Text2SQL): democratizing data access
Automated schema inference and data cataloging
AI-generated ETL/data pipeline code (dbt model generation)
LLM-powered data quality monitoring

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI
