Chapter 12: Stream Processing
ddia-2e stream-processing kafka cdc watermarks exactly-once flink
Status: Notes complete
Overview
Stream processing is the paradigm for working with unbounded (infinite) data — events that arrive continuously and never form a complete, finite dataset. Where batch processing waits to collect all the data before processing, stream processing acts on events as they arrive. Chapter 12 covers the full stack: the infrastructure for transmitting events (messaging systems, log-based brokers), the patterns for keeping systems in sync (Change Data Capture, event sourcing), and the mechanics of stream processing (windowing, joins, fault tolerance, exactly-once semantics).
The 2nd edition significantly updates the 1st edition’s Chapter 11, adding concrete CDC tooling (Debezium + Kafka + PostgreSQL), Kafka Streams and ksqlDB, Flink as the dominant stream processor, and expanded treatment of watermarking strategies for late-arriving data.
Why stream processing matters: Modern systems cannot afford to wait hours for a batch job. Fraud detection, real-time recommendations, alerting, operational dashboards, and CDC-based data synchronization all require processing events within seconds. Stream processing is batch processing applied to the present moment.
Key tension in stream processing: Correctness (waiting for all data) vs latency (acting now). Every design decision in stream processing is navigating this tension.
Key Concepts
Transmitting Event Streams
Messaging Systems
When a producer generates events faster than consumers can process them, a messaging system (broker) is needed to buffer the events and decouple producers from consumers.
Two fundamental delivery models:
Message Queue (traditional: RabbitMQ, ActiveMQ, AWS SQS):
- Messages typically pushed to consumers (SQS is the exception: consumers poll for messages)
- Message deleted after acknowledged by one consumer
- Competing consumers: multiple consumers compete for the same messages (load-balanced)
- No replay: once consumed, a message is gone
- Good for: task queues, work distribution, job scheduling
Message Log / Event Log (Kafka, Amazon Kinesis, Azure Event Hubs):
- Append-only log; messages retained by time or size policy
- Consumers pull and track their own offset (position in log)
- Multiple independent consumer groups each read the full log independently
- Replay: consumers can re-read from any past offset
- Good for: event sourcing, audit log, stream processing, data integration, replay
Comparison:
| Property | Message Queue (RabbitMQ) | Log-Based Broker (Kafka) |
|---|---|---|
| Retention | Deleted after consumed | Retained for configurable period |
| Multiple consumers | Competing (one gets each message) | Fan-out (each group gets all messages) |
| Replay | No | Yes (seek to any offset) |
| Ordering | Per-queue (limited) | Per-partition (total order) |
| Throughput | Medium | Very high (sequential disk I/O) |
| Consumer pacing | Push (broker controls rate) | Pull (consumer controls rate) |
| Use case | Task queues, work distribution | Event streaming, audit, CDC, replay |
Kafka architecture fundamentals:
Topic: "user-events"
Partition 0: [offset 0][offset 1][offset 2]...[offset N]
Partition 1: [offset 0][offset 1][offset 2]...[offset M]
Partition 2: [offset 0][offset 1][offset 2]...[offset K]
Consumer Group "analytics":
Consumer A → reads Partition 0 (tracks offset 0–N)
Consumer B → reads Partition 1 (tracks offset 0–M)
Consumer C → reads Partition 2 (tracks offset 0–K)
Consumer Group "fraud-detection":
Consumer X → reads ALL 3 partitions from their own offsets
Key properties:
- Within a partition: messages are totally ordered
- Across partitions: no ordering guarantee
- Maximum consumer parallelism within a group = number of partitions in the topic (consumers beyond that count sit idle)
- Producer routes by key hash (key → same partition → ordered)
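A minimal sketch of key-based routing, assuming a CRC32 hash as a stand-in (Kafka's default partitioner uses murmur2 for keyed records). The function name `partition_for` and the sample events are hypothetical, chosen only for illustration:

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Route a few keyed events into three in-memory "partitions".
events = [("user-42", "login"), ("user-7", "click"), ("user-42", "purchase")]
partitions = {p: [] for p in range(3)}
for key, payload in events:
    partitions[partition_for(key, 3)].append((key, payload))

# Every event for user-42 lands in one partition, in the order it was produced.
user_42_partition = partition_for("user-42", 3)
user_42_events = [e for e in partitions[user_42_partition] if e[0] == "user-42"]
```

Because the hash is deterministic, per-key ordering follows from per-partition ordering. Changing the partition count changes the mapping, which is why repartitioning a keyed topic needs care.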
Log-Based Message Brokers
Why log-based beats traditional queues for stream processing:
- Replay: Consumer crash at offset 500? Re-read from offset 500. No data loss.
- Fan-out: Spin up a new consumer group (analytics, fraud, monitoring) — each gets all messages independently. Traditional queue: each message consumed by only one consumer.
- Backpressure handled naturally: Consumers read at their own pace. If consumer slows, messages wait in the log. No push overwhelming the consumer.
- Audit and debugging: Every event persisted for inspection. Can replay past events to debug processing logic.
- Bootstrapping new consumers: A new consumer can replay from the beginning of the log to build initial state.
Kafka log compaction (essential for CDC state):
- For a compacted topic: Kafka retains only the most recent value for each key
- Deletes tracked via tombstone messages (key with null value)
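- Reading a compacted log from offset 0 yields at least the latest value for every key (older values may linger until the cleaner runs), so replaying it reconstructs the current state of all keys (like a key-value snapshot)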
- This enables CDC consumers to restart and rebuild current state without reading all history
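The end result of compaction can be sketched as a fold over the log that keeps the last value per key and treats a null value as a tombstone. This models only the outcome, not Kafka's background cleaner; `compact` and the sample log are hypothetical:

```python
from typing import Optional


def compact(log: list[tuple[str, Optional[str]]]) -> dict[str, str]:
    """Return the latest value per key; a None value is a tombstone (delete)."""
    state: dict[str, str] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)  # tombstone: forget the key
        else:
            state[key] = value    # later values overwrite earlier ones
    return state


log = [
    ("order-1", "pending"),
    ("order-2", "pending"),
    ("order-1", "shipped"),
    ("order-2", None),  # tombstone for order-2
]
snapshot = compact(log)
```

A restarting CDC consumer does the same thing when it replays a compacted topic: the replay converges on the current state no matter how much history was already cleaned up.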
Databases and Streams
Keeping Systems in Sync
The problem: An operational database (PostgreSQL) is the source of truth. Multiple derived systems need to stay in sync: search index (Elasticsearch), analytics warehouse, caches (Redis), microservice databases.
Approach 1 — Dual writes (anti-pattern):
Application writes to PostgreSQL AND Elasticsearch simultaneously
Problem: What if PostgreSQL write succeeds but Elasticsearch write fails?
→ Systems diverge
Problem: Two concurrent writes to different systems → race condition
→ Systems end up in different states depending on timing
Approach 2 — Change Data Capture (CDC) (correct approach):
Application writes to PostgreSQL only (single source of truth)
CDC reads PostgreSQL WAL (Write-Ahead Log) → publishes changes to Kafka
All derived systems subscribe to Kafka → update themselves from log
Change Data Capture (CDC)
What CDC captures: Every INSERT, UPDATE, DELETE from the database’s replication log, as a stream of events sent to a message broker.
Concrete CDC setup: PostgreSQL → Debezium → Kafka → Flink → Elasticsearch:
CDC pipeline architecture:
PostgreSQL (WAL / replication log; e.g. INSERT INTO orders ...)
  → Debezium (Kafka Connect plugin; reads WAL, transforms to JSON/Avro events)
  → Kafka (topic: postgres.public.orders, carrying JSON CDC events)
      → Apache Flink (transform, enrich, filter) → Elasticsearch (search index up to date within seconds)
      → Data Warehouse (Snowflake/BigQuery)
Debezium configuration snippet (PostgreSQL source connector):
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres-db",
"database.port": "5432",
"database.user": "debezium",
"database.password": "secret",
"database.dbname": "myapp",
"table.include.list": "public.orders,public.customers",
"topic.prefix": "postgres",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication"
}
CDC event structure (Debezium JSON):
{
"before": {"id": 1, "status": "pending", "amount": 99.99},
"after": {"id": 1, "status": "shipped", "amount": 99.99},
"op": "u", // u=update, c=create, d=delete, r=read(snapshot)
"ts_ms": 1716998400000,
"source": {"table": "orders", "lsn": "0/15D3F8"}
}
Benefits of CDC:
- Single source of truth (database remains authoritative)
- Derived systems updated eventually-consistently via log
- Replay: re-build Elasticsearch index from scratch by replaying Kafka topic
- Zero downtime migrations: new system subscribes to topic and catches up
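A derived system consuming the Debezium envelope shown earlier might apply each event to its local copy roughly as follows. This is a sketch under assumptions: the view is a plain dict keyed by `id`, and `apply_cdc_event` is a hypothetical helper, not part of Debezium:

```python
def apply_cdc_event(view: dict, event: dict) -> None:
    """Apply one Debezium-style change event to an in-memory view keyed by id."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert the new row
        row = event["after"]
        view[row["id"]] = row
    elif op == "d":            # delete: remove by the old row's key
        view.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "status": "pending", "amount": 99.99}},
    {"op": "u", "before": {"id": 1, "status": "pending", "amount": 99.99},
     "after": {"id": 1, "status": "shipped", "amount": 99.99}},
    {"op": "c", "before": None,
     "after": {"id": 2, "status": "pending", "amount": 10.0}},
    {"op": "d", "before": {"id": 2, "status": "pending", "amount": 10.0},
     "after": None},
]

view: dict = {}
for event in events:
    apply_cdc_event(view, event)
```

Because every operation is an upsert or a keyed delete, replaying the same topic from the start produces the same view, which is what makes rebuilding an index from scratch safe.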
State, Streams, and Immutability
Fundamental duality: A database table and a stream of changes are two representations of the same information.
Stream of changes: INSERT(Alice), UPDATE(Alice→Alice2), DELETE(Alice2)
↕
Table (current state): {Alice2} (fold/reduce over the stream)
Table → Stream: CDC (export each write as a changelog event)
Stream → Table: Stream processor materializes a view by applying each event
Event sourcing:
- Store all state changes as an immutable sequence of events (not mutable rows)
- Current state = replay/fold all events for an entity
- Events are facts: “user placed order #99 at 14:00” (never changes)
- State is derived: current order status = reduce all order events
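As an illustration of "current state = fold over events", the sketch below reduces a list of domain events into an order's state. The event types (`OrderPlaced`, `ItemAdded`, `OrderShipped`) and the `apply_event` function are hypothetical names, not taken from any particular framework:

```python
from functools import reduce


def apply_event(state: dict, event: dict) -> dict:
    """Pure transition function: (old state, event) -> new state."""
    kind = event["type"]
    if kind == "OrderPlaced":
        return {"order_id": event["order_id"], "status": "placed",
                "items": list(event["items"])}
    if kind == "ItemAdded":
        return {**state, "items": state["items"] + [event["item"]]}
    if kind == "OrderShipped":
        return {**state, "status": "shipped"}
    return state  # unknown event types leave the state unchanged


events = [
    {"type": "OrderPlaced", "order_id": 99, "items": ["book"]},
    {"type": "ItemAdded", "order_id": 99, "item": "pen"},
    {"type": "OrderShipped", "order_id": 99},
]

current = reduce(apply_event, events, {})
```

The events are never modified; only the derived `current` value changes as more events are appended. Replaying a prefix of the list gives the state as of any earlier point.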
Event sourcing vs CDC:
| | CDC | Event Sourcing |
|---|---|---|
| Level | Low-level DB changes (SQL rows) | High-level domain events (business facts) |
| Designed for | Infrastructure sync | Application architecture |
| Consumer access | Any CDC consumer | Application-specific consumers |
| Reprocessing | Re-read WAL / replay Kafka topic | Replay event store |
Processing Streams
Uses of Stream Processing
1. Complex event processing (CEP): Detect patterns across multiple events within a time window.
- Example: Detect fraud — 3+ failed card transactions within 5 minutes → alert
2. Stream analytics: Compute running metrics, rolling aggregates, windowed counts.
- Example: Count page views per minute; compute P99 latency over last 5 minutes
3. Maintaining materialized views: Apply CDC events to keep a derived table (or search index) up to date.
- Example: Debezium → Kafka → Flink → Elasticsearch (orders searchable within seconds)
4. Stream-to-stream joins: Correlate events from two streams within a time window.
- Example: Match ad impression events with ad click events within 30 minutes
5. Enrichment: Join each stream event with a reference table to add context.
- Example: Enrich order events with customer tier from customer profile table
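The CEP fraud example (3+ failed card transactions within 5 minutes) can be sketched with a per-card deque of recent failure timestamps. This assumes events arrive in event-time order and uses integer seconds for timestamps; `detect_fraud` is a hypothetical name:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60
THRESHOLD = 3


def detect_fraud(events):
    """events: (event_time_seconds, card_id, succeeded) tuples in time order.

    Returns a list of (card_id, event_time) alerts, one per failure that
    brings the card to THRESHOLD or more failures inside the window.
    """
    recent_failures = defaultdict(deque)
    alerts = []
    for ts, card, succeeded in events:
        if succeeded:
            continue
        window = recent_failures[card]
        window.append(ts)
        # Evict failures that fell out of the (ts - WINDOW_SECONDS, ts] range.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        if len(window) >= THRESHOLD:
            alerts.append((card, ts))
    return alerts


sample = [
    (0, "A", False),
    (60, "B", False),
    (100, "A", False),
    (200, "A", False),  # third failure for card A within 5 minutes
    (900, "A", False),  # earlier failures have expired by now
]
alerts = detect_fraud(sample)
```

The per-card deque is exactly the kind of keyed state a stream processor would keep in its state store; handling out-of-order input would additionally require watermarks.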
Reasoning About Time
Event time vs processing time — the most important distinction in stream processing:
| | Event Time | Processing Time |
|---|---|---|
| Definition | When the event actually occurred (in the event payload) | When the event is processed by the stream processor |
| Source | Event producer’s timestamp | Stream processor’s system clock |
| Consistency | Consistent across retries | Changes on retry |
| Problem | Late arrivals; out-of-order | Doesn’t reflect when event happened |
| Use for | Business logic (revenue per hour, session duration) | SLA monitoring, job performance |
Why event time ≠ processing time:
- Mobile app buffers events offline for 4 hours → arrives in burst
- Network delay between producer and broker
- Clock skew: device clock is wrong
- Kafka consumer paused → resumes and processes 1-hour-old messages
Windowing strategies:
TUMBLING WINDOW (fixed, non-overlapping):
──[──────────][──────────][──────────]──►
00:00 01:00 02:00 03:00 04:00
Each event belongs to exactly one window
Use: "counts per hour", "revenue per day"
SLIDING WINDOW (fixed size, moves by step):
──[──────────]──►
────[──────────]──►
──────[──────────]──►
Window size: 5 min, step: 1 min
Events can appear in multiple windows
Use: "rolling 5-minute error rate"
SESSION WINDOW (variable size, gap-based):
[event1, event2, event3] ...gap... [ev4] ...gap... [ev5, ev6, ev7, ev8]
session 1: 3 events; session 2: 1 event; session 3: 4 events
(a session window closes after 30 min of inactivity)
Use: "user session duration", "clickstream session"
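The three window types above differ only in how an event timestamp maps to windows. A minimal sketch, assuming integer-second timestamps and half-open `[start, end)` windows aligned to multiples of the size or step (function names are hypothetical):

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """The single [start, end) window of length `size` that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: int, size: int, step: int) -> list[tuple[int, int]]:
    """Every [start, end) window of length `size`, starting on a multiple of
    `step`, that contains ts (size // step windows when step divides size)."""
    last_start = ts - (ts % step)
    starts = range(last_start, ts - size, -step)
    return [(s, s + size) for s in reversed(starts)]


def session_windows(timestamps: list[int], gap: int) -> list[list[int]]:
    """Group timestamps into sessions separated by at least `gap` of silence."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


hour = tumbling_window(3700, size=3600)
rolling = sliding_windows(600, size=300, step=60)
sessions = session_windows([0, 10, 20, 5000, 9000, 9010], gap=1800)
```

Tumbling and sliding assignment is stateless arithmetic on the timestamp, while session windows depend on neighboring events, which is why they need buffered state and may merge when a late event fills a gap.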
Watermark Strategies
The watermark problem: When aggregating by event time (e.g., counts per hour), how long do you wait for late-arriving events before closing the window and emitting the result?
Watermark = the stream processor’s estimate of “all events at or before this timestamp have likely arrived.”
- When watermark advances past time T: the window for time T is closed and results emitted
Watermark strategies:
1. Fixed-lag watermark:
Watermark = MAX(event_time seen) - fixed_lag
Example: fixed_lag = 2 minutes
If latest event = 14:10:00, watermark = 14:08:00
Windows before 14:08 are safe to close
✅ Simple; predictable; low latency
❌ If events arrive more than 2 minutes late → silently dropped or ignored
2. Heuristic watermark:
Observe actual arrival delay distribution dynamically
Watermark = f(historical P99 late arrival delay)
If 99% of events arrive within 45 seconds, set watermark at 50s lag
✅ Adapts to actual data patterns
❌ Complex; still a heuristic (never 100% correct)
3. Perfect watermark (rare):
Only possible when you KNOW all input sources have sent all events
up to a certain time (e.g., batch files with known completion)
✅ No late events possible; results always correct
❌ Only achievable with coordinated producers; not general
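The fixed-lag strategy is simple enough to sketch directly. The class below is a hypothetical illustration (not Flink's API) that tracks the maximum event time seen and reports whether each new event falls behind the current watermark; times are seconds since midnight:

```python
class FixedLagWatermark:
    """Watermark = max event time seen so far minus a fixed lag."""

    def __init__(self, lag_seconds: int):
        self.lag = lag_seconds
        self.max_event_time = None

    def current(self):
        """Current watermark, or None before any event has been observed."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lag

    def observe(self, event_time: int) -> bool:
        """Record an event; return True if it arrived behind the watermark."""
        watermark = self.current()
        late = watermark is not None and event_time < watermark
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return late


wm = FixedLagWatermark(lag_seconds=120)
wm.observe(14 * 3600 + 10 * 60)                # event at 14:10:00
watermark_now = wm.current()                   # 14:08:00, as seconds
on_time = wm.observe(14 * 3600 + 9 * 60)       # 14:09:00 is out of order but not late
too_late = wm.observe(14 * 3600 + 5 * 60)      # 14:05:00 is behind the watermark
```

The watermark never moves backwards, because it is driven by the maximum event time rather than the most recent one. What to do with events flagged as late is a separate policy decision.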
Handling late events after watermark:
| Strategy | Behavior | Trade-off |
|---|---|---|
| Drop | Discard late events | Simple; loses data |
| Grace period | Window stays open for extra N seconds after watermark | Small extra latency; handles most late arrivals |
| Side output | Route late events to separate stream for special handling | Flexible; more complex |
| Update result | Re-emit updated window aggregate when late event arrives | Downstream must handle corrections; complex |
Late data trade-off:
[Chart: completeness (y-axis, 50% to 100%) vs. latency (x-axis: 0s, 1s, 5s, 1min, 10min). Completeness rises with latency; the practical zone is 95-99% of events within the watermark, and 100% is reached only by waiting forever.]
You cannot have both 100% completeness AND zero latency.
Watermark = the trade-off knob.
Stream Joins
Three types of stream joins:
1. Stream-Table Join (enrichment):
- Enrich each stream event with data from a lookup table
- Table = slowly changing reference data (user profiles, product catalog)
- Implementation: load table into state store (RocksDB in Flink); look up at processing time
Stream: [order_id=1, user_id=42, amount=99] ──► enriched with user tier
Table: {user_id=42 → {name="Alice", tier="gold", country="US"}}
Output: [order_id=1, user_id=42, amount=99, tier="gold", country="US"]
Challenge: Table can change. Solution: Use CDC to keep state store fresh (stream-table join becomes effectively stream-stream with one side slowly changing).
2. Stream-Stream Join (event correlation):
- Join events from two streams that are related within a time window
- Both streams buffered in state store; join matching events within the window
- Unmatched events expire after window duration
Stream A (ad impressions): [user=42, ad_id=9, impression_time=14:00:05]
Stream B (ad clicks): [user=42, ad_id=9, click_time=14:07:33]
Join: impression → click within 30 minutes, same user + ad_id
Window state: buffer all impressions; for each click, look up matching impression
Output: [user=42, ad_id=9, impression_time=14:00:05, click_time=14:07:33,
time_to_click=7m28s]
3. Table-Table Join (incremental materialized view):
- Both inputs are tables (or CDC streams of tables)
- Maintains join result as incrementally updated materialized view
Table A: customer updates (via CDC)
Table B: order updates (via CDC)
Materialized view: JOIN of customer + order
When customer row changes → recompute join for all their orders
When order row changes → recompute join for that order's customer
Used in: ksqlDB, Kafka Streams, Flink Table API
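The first two join types can be sketched with plain dictionaries standing in for the state store. Everything here is illustrative: the function names are hypothetical, timestamps are seconds since midnight, input is assumed to be in timestamp order, and the stream-stream join buffers only impressions (a click that arrives before its impression is not matched):

```python
JOIN_WINDOW = 30 * 60  # seconds


def enrich_orders(records):
    """Stream-table join. records: ('customer', row) table updates interleaved
    with ('order', row) stream events, in arrival order."""
    customers = {}  # state store: user_id -> latest profile (kept fresh by CDC)
    enriched = []
    for kind, row in records:
        if kind == "customer":
            customers[row["user_id"]] = row
        elif kind == "order":
            profile = customers.get(row["user_id"], {})
            enriched.append({**row, "tier": profile.get("tier"),
                             "country": profile.get("country")})
    return enriched


def join_clicks(events):
    """Stream-stream join. events: (kind, user, ad_id, ts) with kind being
    'impression' or 'click'."""
    impressions = {}  # (user, ad_id) -> latest impression timestamp
    joined = []
    for kind, user, ad_id, ts in events:
        # Expire buffered impressions that are older than the join window.
        for key in [k for k, t in impressions.items() if ts - t > JOIN_WINDOW]:
            del impressions[key]
        if kind == "impression":
            impressions[(user, ad_id)] = ts
        elif kind == "click" and (user, ad_id) in impressions:
            imp_ts = impressions[(user, ad_id)]
            joined.append({"user": user, "ad_id": ad_id,
                           "impression_time": imp_ts, "click_time": ts,
                           "time_to_click": ts - imp_ts})
    return joined


orders = enrich_orders([
    ("customer", {"user_id": 42, "name": "Alice", "tier": "gold", "country": "US"}),
    ("order", {"order_id": 1, "user_id": 42, "amount": 99}),
    ("customer", {"user_id": 42, "name": "Alice", "tier": "platinum", "country": "US"}),
    ("order", {"order_id": 2, "user_id": 42, "amount": 10}),
])
clicks = join_clicks([
    ("impression", 42, 9, 14 * 3600 + 5),           # 14:00:05
    ("click", 42, 9, 14 * 3600 + 7 * 60 + 33),      # 14:07:33
    ("impression", 7, 3, 50900),
    ("click", 7, 3, 50900 + JOIN_WINDOW + 1),       # outside the window
])
```

The stream-table join sees whatever profile is current when the order arrives, so replaying the same orders later can give different results unless the table updates are replayed in the same interleaving. The stream-stream join has bounded state because expired impressions are dropped.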
Stream join comparison:
| Join Type | State stored | Expiration | Use case |
|---|---|---|---|
| Stream-table | Lookup table (one side) | No expiry (table is persistent) | Enrichment with reference data |
| Stream-stream | Both streams buffered | Events expire after window | Event correlation, funnel analysis |
| Table-table | Full join result | No expiry (maintained incrementally) | Incremental materialized views |
Fault Tolerance
The three delivery guarantees:
AT-MOST-ONCE:
Process event; don't retry on failure
Risk: events dropped on failure
When: acceptable for non-critical metrics (logging, analytics)
AT-LEAST-ONCE:
Retry on failure; might process event multiple times
Requirement: processing must be idempotent (process twice = same result)
When: acceptable for most use cases with idempotent writes
EXACTLY-ONCE:
Each event processed exactly once, even with failures
Hardest to implement; highest overhead
When: financial transactions, billing, inventory updates
Exactly-once is harder than it sounds:
Problem: Consumer crashes after processing but before committing offset
→ On restart, re-reads and re-processes the event
→ Output written twice
Solution options:
1. Idempotent writes: same event written twice = same result (via dedup ID)
2. Kafka transactions: atomic commit of output + offset
3. Flink checkpoints + rollback: snapshot state + input offset; rollback on failure
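Option 1 is the simplest to illustrate. The sketch below assumes every event carries a unique ID and that the sink can remember which IDs it has applied; `IdempotentSink` is a hypothetical class, and a real sink would persist the ID set together with the data:

```python
class IdempotentSink:
    """Applies each event at most once by remembering the IDs it has seen."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def write(self, event_id: str, amount: int) -> bool:
        """Apply the event; return False if it was a duplicate delivery."""
        if event_id in self.seen_ids:
            return False
        self.seen_ids.add(event_id)
        self.total += amount
        return True


sink = IdempotentSink()
first = sink.write("e1", 10)
second = sink.write("e2", 5)
redelivered = sink.write("e2", 5)  # consumer crashed before commit; event replayed
```

With a sink like this, at-least-once delivery produces the same observable result as exactly-once, which is why the combination is often the pragmatic choice.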
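Kafka exactly-once semantics (transactions):
Producer API (consume-transform-produce loop):
producer.initTransactions() // once at startup; requires transactional.id
producer.beginTransaction()
producer.send(new ProducerRecord<>(outputTopic, result)) // write output
producer.sendOffsetsToTransaction(offsetsToCommit, consumer.groupMetadata()) // commit input offsets as part of the transaction (not consumer.commitSync)
producer.commitTransaction() // atomic: both or neither
On failure: producer.abortTransaction() → consumer re-reads from the last committed offset; the event is reprocessed, but the aborted output is never visible, so there are no duplicate results
Downstream consumers must use isolation.level=read_committed (skips uncommitted msgs)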
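Why committing output and input offset together prevents duplicates can be shown with a toy in-memory model. This is not the Kafka client API: `TransactionalStore` and `process` are hypothetical, and atomicity is trivially true here because the sketch is single-threaded:

```python
class TransactionalStore:
    """Toy stand-in for a broker that commits output and input offset together."""

    def __init__(self):
        self.output = []
        self.committed_offset = 0

    def commit(self, staged_output, new_offset):
        # In this single-threaded sketch the two updates act as one atomic step.
        self.output.extend(staged_output)
        self.committed_offset = new_offset


def process(store, log, batch_size=2, crash_at=None):
    """Double each input value, committing results and offset once per batch."""
    offset = store.committed_offset
    while offset < len(log):
        end = min(offset + batch_size, len(log))
        staged = []  # uncommitted output, discarded if we crash (aborted txn)
        for i in range(offset, end):
            if i == crash_at:
                raise RuntimeError("simulated crash")
            staged.append(log[i] * 2)
        store.commit(staged, end)
        offset = end


log = [1, 2, 3, 4, 5]
store = TransactionalStore()
try:
    process(store, log, crash_at=3)  # crashes midway through the second batch
except RuntimeError:
    pass
after_crash = (list(store.output), store.committed_offset)
process(store, log)                  # restart resumes from the committed offset
```

The value at index 2 is processed twice, once before the crash and once after, but its result reaches the output only once because the first attempt was never committed.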
Flink exactly-once semantics (asynchronous barrier snapshots, a variant of Chandy-Lamport):
Step 1: The JobManager's checkpoint coordinator triggers the sources, which inject BARRIER markers into their output streams
[event1][event2][BARRIER_1][event3]...
Step 2: When an operator receives BARRIER from all inputs:
- Save operator state to durable storage (S3, HDFS)
- Record: "input offset at checkpoint 1 = partition_offset"
- Forward BARRIER downstream
Step 3: All operators saved → checkpoint complete
On failure:
- Roll back all operator states to last checkpoint
- Re-read input from checkpoint's recorded offset
- Re-process events after checkpoint → same results (deterministic)
Flink’s exactly-once requires:
- Idempotent or transactional sinks (can’t duplicate-write to Elasticsearch without dedup)
- All operators must checkpoint their state
- Deterministic processing (same input → same output on replay)
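The checkpoint-and-rollback idea can be illustrated with a single operator. This is a deliberate simplification with assumptions: there are no barriers or multiple inputs, the checkpoint lives in memory, and `CountingOperator` and `run_with_checkpoints` are hypothetical names rather than Flink APIs:

```python
import copy


class CountingOperator:
    """Stateful operator: counts occurrences of each key."""

    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


def run_with_checkpoints(log, checkpoint_every, crash_at=None):
    """Process the log, snapshotting (state, offset) every N events.

    If crash_at is set, simulate one failure just before that offset:
    restore the last snapshot and re-read input from its recorded offset.
    """
    op = CountingOperator()
    checkpoint = {"offset": 0, "state": {}}
    offset = 0
    crashed = False
    while offset < len(log):
        if crash_at is not None and offset == crash_at and not crashed:
            crashed = True
            op.counts = copy.deepcopy(checkpoint["state"])  # roll back state
            offset = checkpoint["offset"]                   # rewind input
            continue
        op.process(log[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = {"offset": offset, "state": copy.deepcopy(op.counts)}
    return op.counts


log = ["a", "b", "a", "c", "a", "b", "c"]
clean = run_with_checkpoints(log, checkpoint_every=3)
recovered = run_with_checkpoints(log, checkpoint_every=3, crash_at=5)
```

State and input position are restored together, so the events after the checkpoint are counted again against the rolled-back state, not on top of the pre-crash state. The result matches the failure-free run only because the processing is deterministic.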
Comparison Tables
Kafka vs Kinesis vs Pulsar vs RabbitMQ
| Property | Apache Kafka | Amazon Kinesis | Apache Pulsar | RabbitMQ |
|---|---|---|---|---|
| Type | Log-based (pull) | Log-based (pull) | Log-based + queue | Message queue (push) |
| Retention | Configurable (days to forever) | 1–365 days | Tiered (S3 for old msgs) | Until consumed |
| Replay | Yes (any offset) | Yes (within retention) | Yes (tiered storage) | No |
| Ordering | Per partition | Per shard | Per partition | Per queue |
| Throughput | Very high | High | High | Medium |
| Latency | Low (ms) | Low (ms) | Low (ms) | Very low (sub-ms) |
| Multi-consumer | Yes (consumer groups) | Yes (enhanced fan-out) | Yes (subscriptions) | Limited |
| Managed option | Confluent Cloud | AWS native | StreamNative | CloudAMQP |
| Ecosystem | Richest (Connect, Streams, ksqlDB) | AWS-integrated | Growing | Enterprise legacy |
| 2026 market | Dominant | AWS shops | Niche/growing | Legacy |
Decision guide:
- AWS-native stack → Kinesis (tight IAM integration, no ops)
- Maximum ecosystem / open source → Kafka (Debezium, Flink, Kafka Connect, ksqlDB)
- Tiered storage (cost-sensitive long retention) → Pulsar
- Simple task queue, no replay needed → RabbitMQ or SQS
CDC Tools Comparison
| Tool | Source DBs | Output | Architecture | Managed? |
|---|---|---|---|---|
| Debezium | PostgreSQL, MySQL, MongoDB, Oracle, SQL Server | Kafka | Kafka Connect plugin | No (OSS) |
| Maxwell | MySQL only | Kafka, Kinesis, stdout | Standalone daemon | No (OSS) |
| AWS DMS | Most major DBs | Kafka, Kinesis, S3 | Managed AWS service | Yes (AWS) |
| Striim | Most major DBs | Kafka, various | Commercial | Yes |
| Fivetran CDC | PostgreSQL, MySQL | Fivetran destinations | Managed SaaS | Yes |
Watermark Strategies Comparison
| Strategy | Implementation | Completeness | Latency | Use when |
|---|---|---|---|---|
| Fixed lag | watermark = max_event_time - lag | ~P95 of events | Predictable | Bounded network delay |
| Heuristic | Track distribution of arrival delays dynamically | ~P99 adaptive | Variable | Variable delay patterns |
| Bounded out-of-orderness | Flink’s built-in; fixed lag with max out-of-order bound | Configurable | Configurable | Most Flink jobs |
| Monotonically increasing | No late events; watermark = latest event time | 100% | Minimal | Sources with strict ordering (batch files) |
Important Points Summary
- Log-based brokers enable replay: Kafka’s retention policy is what makes CDC, event sourcing, and stream-processing fault tolerance practical. A message queue (RabbitMQ) that deletes messages after consumption cannot provide these properties.
- CDC solves the dual-write problem: Writing to multiple systems simultaneously creates race conditions and partial failures. CDC reads the database’s replication log — atomic at the source, eventually consistent at derived systems.
- Watermarks are heuristics, not guarantees: You can never know with certainty when all events for a time window have arrived. The watermark is a bet; late data arriving after the watermark must be handled explicitly.
- Event time for business logic; processing time for SLAs: Always window by event time when computing revenue, session duration, or user activity. Use processing time only for operational metrics like “job latency.”
- Exactly-once is expensive: Flink’s checkpoint protocol and Kafka transactions both add overhead. For many use cases, idempotent processing + at-least-once delivery is the right trade-off.
- Three stream join types serve different needs: Stream-table (enrichment, no expiry), stream-stream (event correlation, windowed state), table-table (incremental materialized views).
- Flink has become the dominant stream processor: Lower latency than Spark Structured Streaming, better state management, stronger exactly-once guarantees. ksqlDB (Confluent) provides SQL-over-Kafka for simpler use cases.
- Kafka log compaction is the bridge between streams and databases: A compacted topic retains one value per key → reading from offset 0 gives current state → bootstraps new CDC consumers without replaying all history.
- State is the hard part of stream processing: Stateless stream processing (filter, map) is trivial. Stateful operations (windows, joins, deduplication) require durable, fault-tolerant state stores (RocksDB, backed by Kafka or S3).
- CDC + streaming lakehouse is the 2026 standard: PostgreSQL → Debezium → Kafka → Flink → Iceberg/Delta on S3 → query with Trino/DuckDB. Replaces ETL batch loading with continuous, low-latency data pipelines.
Modern Context (2026)
Apache Flink dominates stream processing:
- Flink 1.18+: unified batch + stream (same runtime); Table API + Flink SQL
- Lower latency than Spark Structured Streaming (true streaming vs micro-batch)
- Stronger exactly-once semantics via Chandy-Lamport checkpointing
- Deployed on Kubernetes (Flink Kubernetes Operator) or managed (Amazon Managed Service for Apache Flink, formerly Kinesis Data Analytics; Confluent Cloud)
ksqlDB (Confluent Platform):
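- SQL interface directly over Kafka topics; runs as its own ksqlDB server processes (built on Kafka Streams), with no separate stream-processing framework such as Flink required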
- CREATE STREAM, CREATE TABLE, SELECT ... EMIT CHANGES syntax
- Materialized views as Kafka topics; query results streamed continuously
- Used for: event filtering, enrichment, simple aggregations — not suitable for complex stateful joins
Kafka ecosystem maturity:
- KRaft mode: production-ready since Kafka 3.3, with ZooKeeper removed entirely in Kafka 4.0; simpler operations
- Kafka Connect: 100+ connectors (Debezium, JDBC, S3, Elasticsearch)
- Schema Registry (Confluent): Avro/Protobuf schema management and evolution
- Kafka Streams: embedded Java stream processing library (no separate cluster)
Streaming Lakehouse (2026 standard):
- Apache Iceberg + Flink: stream writes to Iceberg tables on S3 with ACID guarantees
- Delta Lake + Spark Structured Streaming: micro-batch streaming to Delta tables
- Real-time BI: BI tools (Superset, Looker) query Iceberg/Delta directly — sub-minute data freshness without a separate OLTP/OLAP system
Real-time feature stores:
- Feast, Tecton, Hopsworks: compute streaming features for online ML
- Stream processing jobs compute features (e.g., rolling fraud signals) → feature store → online inference
- Bridges stream processing and ML serving: fresh features within seconds, not hours
Apache Pulsar growing:
- Tiered storage: old messages auto-offloaded to S3 (infinite retention at low cost)
- Multi-tenancy and geo-replication built in
- Gaining adoption in cost-sensitive environments; Kafka still dominates in 2026
Questions for Reflection
- A startup is choosing between RabbitMQ and Kafka for their messaging infrastructure. Their use case is: (a) distributing background jobs, and (b) building a real-time analytics pipeline. What do you recommend and why?
- Walk through exactly how Debezium captures changes from PostgreSQL. What happens if Debezium crashes and restarts? What happens if it misses some WAL entries?
- Your stream processing job computes hourly revenue by event time. You discover that 5% of mobile events arrive 10+ minutes late. How do you configure your watermark strategy, and what do you do with the late events?
- Explain the difference between stream-stream join and stream-table join. Give a concrete example of when you would use each.
- Your team has achieved at-least-once delivery for a financial transaction aggregation pipeline. What specific steps would you take to upgrade to exactly-once? What overhead does this add?
- Compare ksqlDB and Apache Flink for implementing a real-time fraud detection system that: detects 3 failed transactions within 5 minutes for the same card. What are the trade-offs?
Related Resources
- ch11-batch-processing — Batch processing: the bounded counterpart; same dataflow concepts
- ch06-replication — Replication fundamentals; CDC reads the replication log (WAL)
- ch10-consistency-and-consensus — Exactly-once semantics requires consensus protocols underneath
- ch13-philosophy-of-streaming — Philosophical treatment of streaming-first system design
- ch11-stream-processing — 1st edition Ch11 (good baseline; 2E significantly updated)
Last Updated: 2026-05-29