Chapter 13 Flashcards — A Philosophy of Streaming Systems

flashcards ddia-2e chapter-13 streaming philosophy


Core Philosophy

Why does DDIA 2E argue that streaming is a “fundamental paradigm” rather than just an optimization over batch processing?
?

  • Events are the primary representation of how data actually exists in the world — discrete occurrences at specific moments in time
  • Databases (current state) are a derived artifact: the fold of all historical events from the beginning of time
  • The inversion: event log is the source of truth; database tables are materialized snapshots of that log
  • This is not just architectural — it is philosophical: facts (events) are immutable; state (rows) is ephemeral and derivable
  • Practical consequence: if a derived view is lost or corrupted, replay the event log to rebuild it. You cannot do the reverse.
  • Batch processing is a special case of streaming: a stream with a finite, known end. Flink’s “everything is streaming” unifies both under one model.

What is the core philosophical problem with the Lambda architecture?
?

  • Lambda architecture (Nathan Marz, 2011) runs two parallel pipelines for the same computation:
    • Batch layer (Spark): full historical recompute; accurate but stale
    • Speed layer (Storm/Flink): recent events only; fresh but approximate
    • Serving layer: merges both
  • Philosophical problems:
    1. Two code paths for identical logic: same computation implemented twice; bugs in one don’t surface in the other
    2. It enshrines a workaround: Lambda was invented because 2011 stream processors were weak; modern processors (Flink) are strong enough to make Lambda unnecessary
    3. Correctness by averaging out two imperfect systems: epistemically unsatisfying and dangerous
  • Kappa architecture (Jay Kreps, 2014): single stream processor + log replay for historical data. One code path, simpler, correct.
  • Lambda is now an antipattern — avoid in any new system

Kappa Architecture

What is the Kappa architecture and how does historical reprocessing work in it?
?

  • Kappa architecture: single stream processing layer; no separate batch layer
    • Input → Kafka (durable log) → Stream Processor (Flink) → Serving Layer
  • Historical reprocessing: replay the Kafka log from offset 0 through the same stream processor code
    • The processor is running the same code; only the input changes (bounded historical data vs. unbounded live data)
    • Enables retroactive bug fixes: fix the code, replay → rebuilt materialized view
  • Requirements for Kappa:
    1. Long enough Kafka retention (or archival to object storage like S3/Iceberg)
    2. Stream processor efficient enough for catch-up replay
    3. Mechanism to cut over from old to new materialized view without downtime
  • When Kappa is harder: very long-term reprocessing (years) may require Kafka + cold storage integration (sometimes called “Kappa with cold storage”)
  • By 2026: Kappa is the default architecture; Lambda is legacy

Time

What is the difference between event time and processing time, and why does it matter?
?

  • Processing time: wall clock when the stream processor receives the event
    • Easy to obtain (System.currentTimeMillis())
    • Reflects system behavior, not real-world events
  • Event time: timestamp embedded in the event, recording when it actually occurred
    • Requires correctly set clocks on originating devices
    • Reflects real-world semantics
  • Why it matters: mobile user makes purchases at 11:58 PM, 11:59 PM while offline; events arrive at processor at 12:20 AM
    • Processing time: all events go in 12 AM bucket
    • Event time: first two events belong in 11 PM bucket (correct for business reporting)
  • Rule: for business semantics (revenue, user behavior), use event time. For operational metrics (pipeline health, throughput), processing time is acceptable.
  • The challenge: event-time processing requires handling out-of-order and late events, which processing time does not

What is a watermark in stream processing, and what trade-off does it represent?
?

  • Watermark W(t): an assertion that “all events with event time ≤ t have been received”
    • When watermark advances past the end of a window, the window can be closed and results emitted
    • Without watermarks, you would never know when a window is “complete” (events could always arrive later)
  • Generating watermarks: watermark = max(event_time_seen) - allowed_lateness
    • If latest event time seen is 12:00:50 and allowed lateness is 10s → watermark = 12:00:40
  • The trade-off:
    • Conservative (large allowed lateness): high completeness, high latency — windows close late
    • Aggressive (small allowed lateness): low latency, but late events are dropped or trigger retractions
  • Late events: events arriving after the watermark has passed their window
    • Options: discard, accept into a side output, trigger a window retraction (correction of already-emitted result)
  • Core insight: correctness in streaming is inherently time-bounded and probabilistic — you are betting on when “complete enough” is

Exactly-Once

What does “exactly-once” processing actually mean in a stream processor?
?

  • Common misconception: the processor runs the code exactly once per message
  • Correct definition: the observable effect in the output appears exactly once
    • The processor may internally re-deliver and re-process a message due to failures and retries
    • What matters is that the output is indistinguishable from having processed it exactly once
  • Mechanisms:
    • Idempotent producer: Kafka assigns sequence numbers; broker deduplicates retried sends
    • Kafka transactions: atomically commit offset advancement + output records; consumer only sees committed batches
    • Idempotent sink: assign unique dedup key to each output record; sink discards duplicates
    • At-least-once + idempotent operations: simplest approach; design all operations to be safe to repeat
  • End-to-end principle: exactly-once in Flink does not mean exactly-once application behavior
    • If the “charge payment” API call is not idempotent, you can double-charge even with Flink exactly-once
    • Must design idempotency at every external interaction boundary

What is the “end-to-end argument” applied to exactly-once semantics?
?

  • End-to-end argument: a guarantee at a lower layer (stream processor) does not automatically provide the same guarantee at higher layers (application behavior)
  • Applied to exactly-once:
    • Flink provides exactly-once within its processing topology
    • But external systems (databases, APIs, email services) are outside that topology
    • An operation that is not idempotent will produce incorrect results when the stream processor retries
  • Examples of the gap:
    • INCREMENT credit BY $10 — not idempotent (second execution adds another $10)
    • SET credit TO $X_final IF event_id NOT IN processed_events — idempotent
    • Sending an email notification: not idempotent (two sends = two emails)
  • Practical solution: design all external operations to be idempotent by carrying a unique deduplication key
  • Idempotency is the practical philosophy: easier to design idempotent operations than to guarantee exactly-once delivery across system boundaries

State

What are the types of state in a stream processor and how do they scale?
?

  • Operator state: bound to a specific operator instance (e.g., Kafka consumer offset)
    • Redistributed when operator parallelism changes
    • Example: offset per partition
  • Keyed state: most common; stream partitioned by key, each key maintains independent state
    • Scales horizontally: adding parallelism shards the key space
    • Example: per-user session state, running totals, aggregations
  • Window state: accumulated within a time window (tumbling, sliding, session)
    • Cleared automatically when window closes
    • Example: click counts in a 5-minute window per user
  • Broadcast state: distributed to all parallel instances (not partitioned)
    • Used for lookup tables that all operators need
    • Example: currency conversion rates, fraud rules table
  • Key insight: keyed state is the scalability primitive — partition by key, scale by adding key space shards

How does Flink’s distributed snapshot algorithm (checkpoint) provide fault tolerance?
?

  • Goal: take a consistent snapshot of all operator state without pausing processing
  • Algorithm (inspired by Chandy-Lamport):
    1. Checkpoint coordinator injects a barrier event into each source
    2. When an operator receives a barrier on all inputs: snapshot its state → write to durable storage → forward barrier downstream
    3. Sink acknowledges the barrier → checkpoint complete
    4. On crash: restore all operators to last complete snapshot → replay events from the checkpoint’s Kafka offset
  • RocksDB state backend (for large state):
    • State stored in RocksDB on local SSD (fast access)
    • Checkpoints are incremental — only upload changed SST files to S3
    • Recovery: restore RocksDB from S3 + replay delta since last checkpoint
  • The invariant: a complete checkpoint represents a globally consistent state; recovery from it is deterministic
  • Checkpoint interval trade-off: frequent = faster recovery, more overhead; infrequent = less overhead, longer replay on recovery

Materialized Views and Stream-Table Duality

What is the stream-table duality in Kafka Streams?
?

  • Core insight: streams and tables are two representations of the same underlying data
  • Stream → Table (materialization): process stream events and accumulate into a KV store; the table is the current snapshot
    • Kafka Streams: KTable = continuously updated from a topic’s latest values per key
  • Table → Stream (change data capture): emit an event for every INSERT, UPDATE, DELETE
    • CDC (Debezium, Kafka Connect JDBC Source) converts a database table into a stream of changes
  • Kafka topic variants:
    • Compacted topic: retains only the latest value per key — behaves like a table
    • Non-compacted topic: retains all messages — behaves like a stream (full history)
  • Practical consequence: you can join a stream with a table (lookup at each event’s key), join two tables (compare their latest values), or join two streams (temporal join within a time window)
  • Philosophical implication: “the database” is just a materialized view of the event stream; the stream is the primary artifact

What is a materialized view in the context of stream processing, and how does it differ from a database materialized view?
?

  • Materialized view: a pre-computed query result stored for fast retrieval, updated as new data arrives
  • Database materialized view: computed by batch refresh (rebuild periodically) or incremental update (apply delta); limited by database engine capacity; updated synchronously
  • Stream-maintained materialized view: the stream processor continuously applies new events to maintain the view
    • Latency: sub-second updates (vs. minutes/hours for batch refresh)
    • Scale: horizontally scalable (Flink with keyed state, sharded by key)
    • Decoupled from serving: view stored in Redis, RocksDB, or Elasticsearch — not inside the database
  • CQRS pattern: separate the write model (commands in Kafka) from the read model (materialized views in serving stores)
    • Stream processor is the bridge: consumes commands, maintains views
  • The streaming database vision: Materialize, Flink SQL, RisingWave implement SQL queries that continuously update as new events arrive — blurring the boundary between databases and stream processors

Fault Tolerance and Immutability

Why is replayability the philosophical foundation of streaming fault tolerance?
?

  • Key principle: if the event log is immutable and retained, any downstream processor can recover from any failure by replaying from a checkpoint
  • Contrast with traditional queuing (RabbitMQ, SQS without replay):
    • Messages deleted after consumption
    • Consumer crash between processing and committing = message lost or double-processed
    • No recovery path exists
  • Kafka’s log model:
    • Events retained for a configurable period (or indefinitely with archival to S3)
    • Consumer position (offset) is separate from the message; can be reset to any point
    • Recovery = restore checkpoint + replay from that offset
  • Secondary benefit: auditability
    • Can always answer “what did the system know, and when?” by replaying to any point in time
    • Invaluable for debugging (“why did this user get charged?”), compliance, data quality investigation
  • The GDPR tension: immutable log + right to erasure
    • Solution: cryptographic erasure — encrypt personal data with per-user key; “erase” = delete the key
    • Events remain in log but are indecipherable → functionally erased
    • Must be designed in from the start; retrofitting is very hard

Correctness by Construction

What does “correctness by construction” mean in the context of streaming systems?
?

  • Correctness by construction: design choices that make incorrect behavior structurally impossible, rather than testing for correctness after the fact
  • The five design principles:
    1. Explicit state transitions: represent all state changes as events; never mutate state silently — every change leaves a trace
    2. Reproducible derivations: any derived view can be rebuilt from the source events — derivations are pure functions
    3. Idempotent side effects: any external operation (API call, payment, notification) is safe to repeat
    4. Explicit time: event time embedded in the event itself; don’t rely on processing time for business semantics
    5. Failure as first-class concern: design for recovery from the beginning, not as an afterthought
  • Contrast with imperative mutation:
    • Traditional: UPDATE orders SET status = 'shipped' — history lost; bug in shipping logic → corrupt state
    • Event-sourced: emit OrderShipped event → derive current state by processing events — bug fix = replay with corrected logic
  • The functional programming parallel: pure functions applied to events produce deterministic, reproducible results. Side effects are isolated to the output boundary where they can be controlled.

Architecture and Modern Context

Compare Lambda, Kappa, and “Kappa with cold storage” architectures.
?

  • Lambda architecture:
    • Batch layer (Spark) + Speed layer (Flink) + Serving layer (merge)
    • Two code paths; maintenance burden; serves as antipattern in 2026
  • Kappa architecture:
    • Single stream processor; historical reprocessing via Kafka log replay
    • Weakness: Kafka retention can be expensive for years of data
  • Kappa with cold storage (2026 standard):
    • Kafka for recent events (days to weeks)
    • Apache Iceberg / Delta Lake on S3 for historical archival (years)
    • Flink reads from both with the same SQL/API — same processing logic
    • Solves Kappa’s retention cost problem without adding a second code path
  • Why this matters: you get the architectural simplicity of Kappa with the storage economics of batch systems

What streaming and data processing developments define the 2026 landscape?
?

  • Apache Flink dominates: unified batch-streaming, exactly-once, rich stateful API, Flink SQL for accessible streaming
  • Streaming SQL is mainstream: Flink SQL, Materialize, RisingWave, ksqlDB allow SQL queries that continuously update — lowers barrier to stream processing
  • Iceberg integration: “Kappa with cold storage” pattern well-supported; Flink reads from Kafka (recent) and Iceberg on S3 (historical) uniformly
  • ML on streams: Flink jobs compute embeddings and write to vector databases (Pinecone, Qdrant); real-time feature stores (Feast, Tecton) replace batch feature pipelines
  • Kafka KRaft mode: ZooKeeper removed from Kafka 3.x; internal Raft consensus simplifies operations, improves scalability; ZooKeeper-based Kafka is fully deprecated
  • Vector search as a stream operation: approximate nearest neighbor search over embeddings is a new access pattern alongside OLTP and OLAP

Synthesis

How does Chapter 13 connect to the broader DDIA 2E narrative?
?

  • Chapter 12 (Stream Processing) teaches the mechanics: Kafka internals, watermarks, stream joins, stream processing APIs
  • Chapter 13 provides the WHY behind what Chapter 12 describes mechanically:
    • Why event logs are the source of truth (immutability + replayability)
    • Why Kappa supersedes Lambda (philosophical, not just pragmatic)
    • Why exactly-once requires end-to-end thinking (the end-to-end argument)
    • Why event time matters (business semantics vs. system behavior)
  • Connection to Chapter 10 (Consensus): exactly-once in Kafka requires Raft consensus at the broker level; the same theoretical foundations underlie both
  • Connection to Chapter 14 (Ethics): immutable event logs create the “right to erasure” tension; this chapter establishes the architectural tools (cryptographic erasure) that Chapter 14’s privacy requirements demand
  • The overarching theme: streaming is not just a performance optimization. It is an epistemically honest model of how data exists in time, and it produces systems that are more correct, more debuggable, and more auditable than mutable-state alternatives.

Total Cards: 20
Review Time: ~20 minutes
Priority: HIGH
Last Updated: 2026-05-29