Chapter 12: Stream Processing

ddia-2e stream-processing kafka cdc watermarks exactly-once flink

Status: Notes complete


Overview

Stream processing is the paradigm for working with unbounded (infinite) data — events that arrive continuously and never form a complete, finite dataset. Where batch processing waits to collect all the data before processing, stream processing acts on events as they arrive. Chapter 12 covers the full stack: the infrastructure for transmitting events (messaging systems, log-based brokers), the patterns for keeping systems in sync (Change Data Capture, event sourcing), and the mechanics of stream processing (windowing, joins, fault tolerance, exactly-once semantics).

The 2nd edition significantly updates the 1st edition’s Chapter 11, adding concrete CDC tooling (Debezium + Kafka + PostgreSQL), Kafka Streams and ksqlDB, Flink as the dominant stream processor, and expanded treatment of watermarking strategies for late-arriving data.

Why stream processing matters: Modern systems cannot afford to wait hours for a batch job. Fraud detection, real-time recommendations, alerting, operational dashboards, and CDC-based data synchronization all require processing events within seconds. Stream processing is batch processing applied to the present moment.

Key tension in stream processing: Correctness (waiting for all data) vs latency (acting now). Every design decision in stream processing is navigating this tension.


Key Concepts

Transmitting Event Streams

Messaging Systems

When a producer generates events faster than consumers can process them, a messaging system (broker) is needed to buffer the events and decouple producers from consumers.

Two fundamental delivery models:

Message Queue (traditional: RabbitMQ, ActiveMQ, AWS SQS):

  • Messages typically pushed to consumers (SQS is the exception: consumers poll for messages)
  • Message deleted after acknowledged by one consumer
  • Competing consumers: multiple consumers compete for the same messages (load-balanced)
  • No replay: once consumed, a message is gone
  • Good for: task queues, work distribution, job scheduling

Message Log / Event Log (Kafka, Amazon Kinesis, Azure Event Hubs):

  • Append-only log; messages retained by time or size policy
  • Consumers pull and track their own offset (position in log)
  • Multiple independent consumer groups each read the full log independently
  • Replay: consumers can re-read from any past offset
  • Good for: event sourcing, audit log, stream processing, data integration, replay

Comparison:

| Property | Message Queue (RabbitMQ) | Log-Based Broker (Kafka) |
|---|---|---|
| Retention | Deleted after consumed | Retained for configurable period |
| Multiple consumers | Competing (one gets each message) | Fan-out (each group gets all messages) |
| Replay | No | Yes (seek to any offset) |
| Ordering | Per-queue (limited) | Per-partition (total order) |
| Throughput | Medium | Very high (sequential disk I/O) |
| Consumer pacing | Push (broker controls rate) | Pull (consumer controls rate) |
| Use case | Task queues, work distribution | Event streaming, audit, CDC, replay |

Kafka architecture fundamentals:

Topic: "user-events"
  Partition 0: [offset 0][offset 1][offset 2]...[offset N]
  Partition 1: [offset 0][offset 1][offset 2]...[offset M]
  Partition 2: [offset 0][offset 1][offset 2]...[offset K]

Consumer Group "analytics":
  Consumer A → reads Partition 0 (tracks offset 0–N)
  Consumer B → reads Partition 1 (tracks offset 0–M)
  Consumer C → reads Partition 2 (tracks offset 0–K)

Consumer Group "fraud-detection":
  Consumer X → reads ALL 3 partitions from their own offsets

Key properties:
  - Within a partition: messages are totally ordered
  - Across partitions: no ordering guarantee
  - Max parallelism within a consumer group = number of partitions in the topic (extra consumers sit idle)
  - Producer routes by key hash (key → same partition → ordered)

Log-Based Message Brokers

Why log-based beats traditional queues for stream processing:

  1. Replay: Consumer crashes with last committed offset 500? On restart it resumes from offset 500. No data loss, though events processed after the last commit are read again.
  2. Fan-out: Spin up a new consumer group (analytics, fraud, monitoring) — each gets all messages independently. Traditional queue: each message consumed by only one consumer.
  3. Backpressure handled naturally: Consumers read at their own pace. If consumer slows, messages wait in the log. No push overwhelming the consumer.
  4. Audit and debugging: Every event persisted for inspection. Can replay past events to debug processing logic.
  5. Bootstrapping new consumers: A new consumer can replay from the beginning of the log to build initial state.

Kafka log compaction (essential for CDC state):

  • For a compacted topic: Kafka retains only the most recent value for each key
  • Deletes tracked via tombstone messages (key with null value)
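  • Reading a compacted log from offset 0 and keeping the last value seen per key = current state of all keys (like a key-value snapshot); compaction runs in the background, so the not-yet-compacted tail can still hold several values per key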
  • This enables CDC consumers to restart and rebuild current state without reading all history
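
A minimal sketch of what a consumer does when it rebuilds state from a compacted topic, assuming the log is a list of (key, value) pairs and `None` marks a tombstone (the `compact` helper is hypothetical, not a Kafka API):

```python
def compact(log):
    """Return the latest value per key; a None value is a tombstone that removes the key."""
    state = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)   # tombstone: the key was deleted at the source
        else:
            state[key] = value     # later records overwrite earlier ones
    return state

log = [
    ("order-1", "pending"),
    ("order-2", "pending"),
    ("order-1", "shipped"),
    ("order-2", None),             # order-2 deleted
]
snapshot = compact(log)
```

The same fold works whether or not the broker has already compacted the segment, which is why consumers should not rely on seeing exactly one record per key.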

Databases and Streams

Keeping Systems in Sync

The problem: An operational database (PostgreSQL) is the source of truth. Multiple derived systems need to stay in sync: search index (Elasticsearch), analytics warehouse, caches (Redis), microservice databases.

Approach 1 — Dual writes (anti-pattern):

Application writes to PostgreSQL AND Elasticsearch simultaneously
Problem: What if PostgreSQL write succeeds but Elasticsearch write fails?
  → Systems diverge
Problem: Two concurrent writes to different systems → race condition
  → Systems end up in different states depending on timing

Approach 2 — Change Data Capture (CDC) (correct approach):

Application writes to PostgreSQL only (single source of truth)
CDC reads PostgreSQL WAL (Write-Ahead Log) → publishes changes to Kafka
All derived systems subscribe to Kafka → update themselves from log

Change Data Capture (CDC)

What CDC captures: Every INSERT, UPDATE, DELETE from the database’s replication log, as a stream of events sent to a message broker.

Concrete CDC setup: PostgreSQL → Debezium → Kafka → Flink → Elasticsearch:

┌─────────────────────────────────────────────────────────────────────────┐
│                  CDC Pipeline Architecture                              │
│                                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │  PostgreSQL  │    │   Debezium   │    │    Kafka     │              │
│  │              │    │  (Kafka      │    │              │              │
│  │  WAL         │───►│   Connect    │───►│  Topic:      │              │
│  │  (replication│    │   plugin)    │    │  postgres.   │              │
│  │   log)       │    │              │    │  public.     │              │
│  │              │    │  Reads WAL,  │    │  orders      │              │
│  │  INSERT INTO │    │  transforms  │    │              │              │
│  │  orders ...  │    │  to JSON/    │    │  [JSON CDC   │              │
│  └──────────────┘    │  Avro events │    │   events]    │              │
│                      └──────────────┘    └──────┬───────┘              │
│                                                 │                      │
│                             ┌───────────────────┤                      │
│                             │                   │                      │
│                    ┌────────▼──────┐    ┌────────▼──────┐             │
│                    │   Apache      │    │  Data         │             │
│                    │   Flink       │    │  Warehouse    │             │
│                    │               │    │  (Snowflake/  │             │
│                    │  Transform,   │    │   BigQuery)   │             │
│                    │  enrich,      │    └───────────────┘             │
│                    │  filter       │                                   │
│                    └───────┬───────┘                                   │
│                            │                                           │
│                    ┌───────▼───────┐                                   │
│                    │ Elasticsearch │                                   │
│                    │ (search index │                                   │
│                    │  up to date   │                                   │
│                    │  within secs) │                                   │
│                    └───────────────┘                                   │
└─────────────────────────────────────────────────────────────────────────┘

Debezium configuration snippet (PostgreSQL source connector):

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "postgres-db",
  "database.port": "5432",
  "database.user": "debezium",
  "database.password": "secret",
  "database.dbname": "myapp",
  "table.include.list": "public.orders,public.customers",
  "topic.prefix": "postgres",
  "plugin.name": "pgoutput",
  "publication.name": "dbz_publication"
}

CDC event structure (Debezium JSON):

{
  "before": {"id": 1, "status": "pending", "amount": 99.99},
  "after":  {"id": 1, "status": "shipped", "amount": 99.99},
  "op": "u",
  "ts_ms": 1716998400000,
  "source": {"table": "orders", "lsn": "0/15D3F8"}
}

op codes: u=update, c=create, d=delete, r=read (snapshot)
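
A derived system consumes these events by applying each one to its own copy of the data. The sketch below assumes events shaped like the example above, with `id` as the primary key; `apply_change` and the in-memory `index` are hypothetical stand-ins for a real sink such as a search index.

```python
def apply_change(index: dict, event: dict) -> None:
    """Apply one Debezium-style change event to an in-memory index keyed by row id."""
    op = event["op"]
    if op in ("c", "u", "r"):          # create, update, snapshot read: upsert the new row
        row = event["after"]
        index[row["id"]] = row
    elif op == "d":                    # delete: remove using the old row image
        index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")

index = {}
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "status": "pending", "amount": 99.99}},
    {"op": "u", "before": {"id": 1, "status": "pending", "amount": 99.99},
     "after": {"id": 1, "status": "shipped", "amount": 99.99}},
    {"op": "c", "before": None,
     "after": {"id": 2, "status": "pending", "amount": 10.0}},
    {"op": "d", "before": {"id": 2, "status": "pending", "amount": 10.0},
     "after": None},
]
for event in events:
    apply_change(index, event)
```

Upserts and deletes by primary key are idempotent, so replaying the same events after a consumer restart leaves the index unchanged.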

Benefits of CDC:

  • Single source of truth (database remains authoritative)
  • Derived systems updated eventually-consistently via log
  • Replay: re-build Elasticsearch index from scratch by replaying Kafka topic
  • Zero downtime migrations: new system subscribes to topic and catches up

State, Streams, and Immutability

Fundamental duality: A database table and a stream of changes are two representations of the same information.

Stream of changes:  INSERT(Alice), UPDATE(Alice→Alice2), DELETE(Alice2)
                                                        ↕
Table (current state):  {Alice2} after the UPDATE; {} (empty) after the DELETE  (fold/reduce over the stream)

Table → Stream:  CDC (export each write as a changelog event)
Stream → Table:  Stream processor materializes a view by applying each event

Event sourcing:

  • Store all state changes as an immutable sequence of events (not mutable rows)
  • Current state = replay/fold all events for an entity
  • Events are facts: “user placed order #99 at 14:00” (never changes)
  • State is derived: current order status = reduce all order events
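
"Current state = fold over events" can be made concrete with `functools.reduce`. The event types and fields below are invented for illustration; a real event store would define its own schema.

```python
from functools import reduce

def apply_event(state, event):
    """Pure function: previous state + one immutable event -> next state."""
    kind = event["type"]
    if kind == "OrderPlaced":
        return {"order_id": event["order_id"], "status": "placed",
                "items": list(event["items"])}
    if kind == "ItemAdded":
        return {**state, "items": state["items"] + [event["item"]]}
    if kind == "OrderShipped":
        return {**state, "status": "shipped"}
    raise ValueError(f"unknown event type: {kind!r}")

events = [
    {"type": "OrderPlaced", "order_id": 99, "items": ["book"]},
    {"type": "ItemAdded", "item": "pen"},
    {"type": "OrderShipped"},
]

# Replaying the full history yields the current state; a prefix yields a past state.
current = reduce(apply_event, events, None)
as_of_second_event = reduce(apply_event, events[:2], None)
```

Because `apply_event` never mutates its input, any prefix of the log can be folded to answer "what did this order look like at that point?".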

Event sourcing vs CDC:

| Aspect | CDC | Event Sourcing |
|---|---|---|
| Level | Low-level DB changes (SQL rows) | High-level domain events (business facts) |
| Designed for | Infrastructure sync | Application architecture |
| Consumer access | Any CDC consumer | Application-specific consumers |
| Reprocessing | Re-read WAL / replay Kafka topic | Replay event store |

Processing Streams

Uses of Stream Processing

1. Complex event processing (CEP): Detect patterns across multiple events within a time window.

  • Example: Detect fraud — 3+ failed card transactions within 5 minutes → alert

2. Stream analytics: Compute running metrics, rolling aggregates, windowed counts.

  • Example: Count page views per minute; compute P99 latency over last 5 minutes

3. Maintaining materialized views: Apply CDC events to keep a derived table (or search index) up to date.

  • Example: Debezium → Kafka → Flink → Elasticsearch (orders searchable within seconds)

4. Stream-to-stream joins: Correlate events from two streams within a time window.

  • Example: Match ad impression events with ad click events within 30 minutes

5. Enrichment: Join each stream event with a reference table to add context.

  • Example: Enrich order events with customer tier from customer profile table
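
The fraud rule from the CEP example (3+ failed card transactions within 5 minutes) can be sketched as per-card state holding recent failure timestamps. This assumes events arrive in event-time order and uses plain seconds for timestamps; the names are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60
THRESHOLD = 3

def detect_fraud(events):
    """events: (event_time_seconds, card_id, succeeded) tuples in event-time order.

    Returns (card_id, event_time) for every failure that brings a card to
    THRESHOLD or more failures within the trailing window.
    """
    recent_failures = defaultdict(deque)   # card_id -> failure timestamps in window
    alerts = []
    for ts, card, succeeded in events:
        if succeeded:
            continue
        window = recent_failures[card]
        window.append(ts)
        while ts - window[0] > WINDOW_SECONDS:   # evict failures older than 5 minutes
            window.popleft()
        if len(window) >= THRESHOLD:
            alerts.append((card, ts))
    return alerts

events = [
    (0, "card-A", False),
    (60, "card-A", False),
    (100, "card-B", False),
    (200, "card-A", False),    # third failure for card-A within 5 minutes
    (900, "card-A", False),    # earlier failures have expired by now
]
alerts = detect_fraud(events)
```

The per-card deque is exactly the kind of keyed state a stream processor has to checkpoint to survive a restart.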

Reasoning About Time

Event time vs processing time — the most important distinction in stream processing:

| Aspect | Event Time | Processing Time |
|---|---|---|
| Definition | When the event actually occurred (in the event payload) | When the event is processed by the stream processor |
| Source | Event producer’s timestamp | Stream processor’s system clock |
| Consistency | Consistent across retries | Changes on retry |
| Problem | Late arrivals; out-of-order | Doesn’t reflect when event happened |
| Use for | Business logic (revenue per hour, session duration) | SLA monitoring, job performance |

Why event time ≠ processing time:

  • Mobile app buffers events offline for 4 hours → arrives in burst
  • Network delay between producer and broker
  • Clock skew: device clock is wrong
  • Kafka consumer paused → resumes and processes 1-hour-old messages

Windowing strategies:

TUMBLING WINDOW (fixed, non-overlapping):
  ──[──────────][──────────][──────────]──►
  00:00  01:00  02:00  03:00  04:00
  
  Each event belongs to exactly one window
  Use: "counts per hour", "revenue per day"

SLIDING WINDOW (fixed size, moves by step):
  ──[──────────]──►
      ──[──────────]──►
          ──[──────────]──►
  Window size: 5 min, step: 1 min
  Events can appear in multiple windows
  Use: "rolling 5-minute error rate"

SESSION WINDOW (variable size, gap-based):
  [event1, event2, event3]   [ev4]   [ev5, ev6, ev7, ev8]
  session: 3 events           gap     session: 4 events
  (window closes after 30min inactivity)
  Use: "user session duration", "clickstream session"
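
Window assignment for the three strategies can be sketched as pure functions over integer timestamps (seconds). Windows are half-open intervals [start, end); the function names are hypothetical and do not correspond to any particular framework's API.

```python
def tumbling_window(ts, size):
    """The single [start, end) window of the given size that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, step):
    """All [start, start + size) windows, with starts aligned to step, that contain ts."""
    last_start = ts - ts % step
    starts = range(last_start, ts - size, -step)   # keep starts where start > ts - size
    return sorted((s, s + size) for s in starts)

def session_windows(timestamps, gap):
    """Group timestamps into sessions; a session closes after `gap` seconds of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

one_window = tumbling_window(430, size=300)
five_windows = sliding_windows(430, size=300, step=60)
sessions = session_windows([0, 600, 700, 5000, 5100], gap=30 * 60)
```

With a 5-minute size and 1-minute step, each event falls into size / step = 5 sliding windows, which is the state-size cost of that strategy.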

Watermark Strategies

The watermark problem: When aggregating by event time (e.g., counts per hour), how long do you wait for late-arriving events before closing the window and emitting the result?

Watermark = the stream processor’s estimate of “all events at or before this timestamp have likely arrived.”

  • When watermark advances past time T: windows ending at or before T are closed and results emitted

Watermark strategies:

1. Fixed-lag watermark:

Watermark = MAX(event_time seen) - fixed_lag
Example: fixed_lag = 2 minutes
If latest event = 14:10:00, watermark = 14:08:00
Windows before 14:08 are safe to close

✅ Simple; predictable; low latency
❌ If events arrive more than 2 minutes late → silently dropped or ignored
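
A minimal sketch of the fixed-lag rule, reproducing the 14:10:00 / 14:08:00 example above. The class and helper names are hypothetical, and timestamps are seconds since midnight.

```python
def hms(hours, minutes, seconds):
    """Seconds since midnight, to keep the example readable."""
    return hours * 3600 + minutes * 60 + seconds

class FixedLagWatermark:
    """Watermark = max event time seen so far, minus a fixed lag."""

    def __init__(self, lag):
        self.lag = lag
        self.max_event_time = None

    def observe(self, event_time):
        # The watermark never moves backward: only a new maximum advances it.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.current()

    def current(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lag

    def is_late(self, event_time):
        watermark = self.current()
        return watermark is not None and event_time < watermark

wm = FixedLagWatermark(lag=120)
wm.observe(hms(14, 10, 0))
wm.observe(hms(14, 9, 30))     # out-of-order event: the watermark stays put
```

Here `wm.current()` is 14:08:00, so an event stamped 14:07:00 is late while one stamped 14:09:00 is still on time.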

2. Heuristic watermark:

Observe actual arrival delay distribution dynamically
Watermark = f(historical P99 late arrival delay)
If 99% of events arrive within 45 seconds, set watermark at 50s lag

✅ Adapts to actual data patterns
❌ Complex; still a heuristic (never 100% correct)
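
One way to derive the lag from observed delays is a nearest-rank percentile plus a safety margin. This is a sketch under assumed inputs: the delay samples are made up to match the "99% within 45 seconds, 50s lag" example, and the 5-second margin is an illustrative parameter.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)   # integer ceil(p * n / 100)
    return ordered[rank - 1]

def heuristic_lag(observed_delays, p=99, safety_margin=5):
    """Watermark lag in seconds = P-th percentile of arrival delay plus a margin."""
    return percentile(observed_delays, p) + safety_margin

# 100 hypothetical arrival delays in seconds: most are fast, one straggler at 4 minutes.
delays = [10] * 90 + [30] * 8 + [45] + [240]
lag = heuristic_lag(delays)
```

The 240-second straggler falls outside P99, so it would arrive after the watermark and needs one of the late-event strategies.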

3. Perfect watermark (rare):

Only possible when you KNOW all input sources have sent all events
up to a certain time (e.g., batch files with known completion)

✅ No late events possible; results always correct
❌ Only achievable with coordinated producers; not general

Handling late events after watermark:

| Strategy | Behavior | Trade-off |
|---|---|---|
| Drop | Discard late events | Simple; loses data |
| Grace period | Window stays open for extra N seconds after watermark | Small extra latency; handles most late arrivals |
| Side output | Route late events to separate stream for special handling | Flexible; more complex |
| Update result | Re-emit updated window aggregate when late event arrives | Downstream must handle corrections; complex |
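
The grace-period and side-output strategies can be combined: count an event if the watermark has not passed its window end plus the grace period, otherwise divert it. The sketch assumes tumbling windows and takes the watermark at arrival as an input; all names and numbers are illustrative.

```python
def classify(event_time, watermark, window_size=60, grace=0):
    """Classify an event relative to the tumbling window that contains it."""
    window_end = event_time - event_time % window_size + window_size
    if watermark < window_end:
        return "on_time"
    if watermark < window_end + grace:
        return "within_grace"
    return "late"

def process(arrivals, window_size=60, grace=30):
    """arrivals: (event_time, watermark_at_arrival) pairs.

    Returns (counts per window start, late events routed to the side output).
    """
    counts = {}
    side_output = []
    for event_time, watermark in arrivals:
        if classify(event_time, watermark, window_size, grace) == "late":
            side_output.append(event_time)
        else:
            start = event_time - event_time % window_size
            counts[start] = counts.get(start, 0) + 1
    return counts, side_output

arrivals = [
    (50, 55),   # window [0, 60) still open
    (58, 70),   # watermark passed 60, but inside the 30s grace period
    (59, 95),   # grace period over: goes to the side output
    (65, 95),   # belongs to window [60, 120), still open
]
counts, side_output = process(arrivals)
```

Setting `grace=0` and discarding `side_output` gives the plain "drop" strategy.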

Late data trade-off:

                    Completeness
                        │
                   100% ┤ ........ (wait forever)
                        │        ↗
                        │      ↗  (practical zone: 95-99% within watermark)
                    50% ┤    ↗
                        │  ↗
                        └────────────────────────────► Latency
                        0s    1s    5s    1min  10min
                        
You cannot have both 100% completeness AND zero latency.
Watermark = the trade-off knob.

Stream Joins

Three types of stream joins:

1. Stream-Table Join (enrichment):

  • Enrich each stream event with data from a lookup table
  • Table = slowly changing reference data (user profiles, product catalog)
  • Implementation: load table into state store (RocksDB in Flink); look up at processing time

Stream: [order_id=1, user_id=42, amount=99]  ──► enriched with user tier
Table:  {user_id=42 → {name="Alice", tier="gold", country="US"}}
Output: [order_id=1, user_id=42, amount=99, tier="gold", country="US"]

Challenge: Table can change. Solution: Use CDC to keep state store fresh (stream-table join becomes effectively stream-stream with one side slowly changing).
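
A sketch of the enrichment join with a CDC-refreshed lookup table, assuming a single interleaved feed of profile changes and order events (the tuple format and function name are hypothetical):

```python
def enrich_orders(inputs):
    """Process an interleaved feed of table changes and order events.

    inputs: ("profile", user_id, profile_dict) or ("order", order_dict) tuples.
    Returns the enriched orders in arrival order.
    """
    profiles = {}    # local state store: user_id -> latest profile, kept fresh by CDC
    enriched = []
    for item in inputs:
        if item[0] == "profile":
            _, user_id, profile = item
            profiles[user_id] = profile
        else:
            order = item[1]
            profile = profiles.get(order["user_id"], {})
            enriched.append({**order,
                             "tier": profile.get("tier"),
                             "country": profile.get("country")})
    return enriched

inputs = [
    ("profile", 42, {"name": "Alice", "tier": "gold", "country": "US"}),
    ("order", {"order_id": 1, "user_id": 42, "amount": 99}),
    ("profile", 42, {"name": "Alice", "tier": "platinum", "country": "US"}),
    ("order", {"order_id": 2, "user_id": 42, "amount": 15}),
]
result = enrich_orders(inputs)
```

Each order is enriched with whatever the table held when the order was processed, so the result depends on the relative arrival order of the two feeds.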

2. Stream-Stream Join (event correlation):

  • Join events from two streams that are related within a time window
  • Both streams buffered in state store; join matching events within the window
  • Unmatched events expire after window duration

Stream A (ad impressions):  [user=42, ad_id=9, impression_time=14:00:05]
Stream B (ad clicks):       [user=42, ad_id=9, click_time=14:07:33]

Join: impression → click within 30 minutes, same user + ad_id
Window state: buffer all impressions; for each click, look up matching impression

Output: [user=42, ad_id=9, impression_time=14:00:05, click_time=14:07:33,
         time_to_click=7m28s]
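
The impression/click join can be sketched with a buffer of impressions keyed by (user, ad_id). As a simplification this buffers only the impression side and assumes events arrive in event-time order with clicks after their impressions; a general stream-stream join buffers both sides. Names are hypothetical.

```python
JOIN_WINDOW = 30 * 60   # seconds

def hms(hours, minutes, seconds):
    return hours * 3600 + minutes * 60 + seconds

def join_impressions_clicks(events):
    """events: (kind, user, ad_id, event_time) tuples in event-time order."""
    impressions = {}    # (user, ad_id) -> impression time, the buffered join state
    joined = []
    for kind, user, ad_id, ts in events:
        # Expire buffered impressions that can no longer match within the window.
        expired = [key for key, seen in impressions.items() if ts - seen > JOIN_WINDOW]
        for key in expired:
            del impressions[key]
        if kind == "impression":
            impressions[(user, ad_id)] = ts
        elif (user, ad_id) in impressions:
            shown = impressions[(user, ad_id)]
            joined.append({"user": user, "ad_id": ad_id,
                           "impression_time": shown, "click_time": ts,
                           "time_to_click": ts - shown})
    return joined

events = [
    ("impression", 42, 9, hms(14, 0, 5)),
    ("impression", 7, 3, hms(14, 1, 0)),
    ("click", 42, 9, hms(14, 7, 33)),
    ("click", 7, 3, hms(14, 40, 0)),     # 39 minutes after its impression: no match
]
joined = join_impressions_clicks(events)
```

The matched pair has `time_to_click` of 448 seconds, which is the 7m28s from the example above.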

3. Table-Table Join (incremental materialized view):

  • Both inputs are tables (or CDC streams of tables)
  • Maintains join result as incrementally updated materialized view

Table A: customer updates (via CDC)
Table B: order updates (via CDC)

Materialized view: JOIN of customer + order
  When customer row changes → recompute join for all their orders
  When order row changes → recompute join for that order's customer

Used in: ksqlDB, Kafka Streams, Flink Table API
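
The two recompute rules above can be sketched as an incrementally maintained inner join. The class, field names, and sample rows are hypothetical; a real engine would index orders by customer instead of scanning.

```python
class CustomerOrderView:
    """Materialized inner join of customers and orders, updated per change event."""

    def __init__(self):
        self.customers = {}   # customer_id -> customer row
        self.orders = {}      # order_id -> order row
        self.rows = {}        # order_id -> joined row (the materialized view)

    def _join(self, order_id):
        order = self.orders[order_id]
        customer = self.customers.get(order["customer_id"])
        if customer is None:
            self.rows.pop(order_id, None)      # inner join: no customer, no row
        else:
            self.rows[order_id] = {**order, "customer_name": customer["name"]}

    def on_customer_change(self, customer_id, row):
        self.customers[customer_id] = row
        # Recompute the join for all of this customer's orders.
        for order_id, order in self.orders.items():
            if order["customer_id"] == customer_id:
                self._join(order_id)

    def on_order_change(self, order_id, row):
        self.orders[order_id] = row
        self._join(order_id)                   # recompute for this order only

mv = CustomerOrderView()
mv.on_order_change(1, {"order_id": 1, "customer_id": 7, "status": "pending"})
rows_before_customer = dict(mv.rows)           # empty: the customer is not known yet
mv.on_customer_change(7, {"name": "Alice"})
mv.on_order_change(1, {"order_id": 1, "customer_id": 7, "status": "shipped"})
mv.on_customer_change(7, {"name": "Alice B."})
```

Both input tables are held in full as state, which is why table-table joins have no expiry.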

Stream join comparison:

| Join Type | State stored | Expiration | Use case |
|---|---|---|---|
| Stream-table | Lookup table (one side) | No expiry (table is persistent) | Enrichment with reference data |
| Stream-stream | Both streams buffered | Events expire after window | Event correlation, funnel analysis |
| Table-table | Full join result | No expiry (maintained incrementally) | Incremental materialized views |

Fault Tolerance

The three delivery guarantees:

AT-MOST-ONCE:
  Process event; don't retry on failure
  Risk: events dropped on failure
  When: acceptable for non-critical metrics (logging, analytics)

AT-LEAST-ONCE:
  Retry on failure; might process event multiple times
  Requirement: processing must be idempotent (process twice = same result)
  When: acceptable for most use cases with idempotent writes

EXACTLY-ONCE:
  Each event processed exactly once, even with failures
  Hardest to implement; highest overhead
  When: financial transactions, billing, inventory updates

Exactly-once is harder than it sounds:

Problem: Consumer crashes after processing but before committing offset
  → On restart, re-reads and re-processes the event
  → Output written twice

Solution options:
  1. Idempotent writes: same event written twice = same result (via dedup ID)
  2. Kafka transactions: atomic commit of output + offset
  3. Flink checkpoints + rollback: snapshot state + input offset; rollback on failure
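
Option 1 can be sketched as a sink that remembers which event IDs it has applied. The class and event IDs are hypothetical; in a real system the dedup record and the write itself must be committed atomically (for example in one database transaction), which this in-memory sketch does not model.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self):
        self.applied_ids = set()
        self.balance = 0

    def write(self, event_id, amount):
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self.applied_ids:
            return False
        self.applied_ids.add(event_id)
        self.balance += amount
        return True

# At-least-once delivery: evt-2 is redelivered after a simulated consumer restart.
deliveries = [("evt-1", 100), ("evt-2", -30), ("evt-2", -30), ("evt-3", 5)]
sink = IdempotentSink()
results = [sink.write(event_id, amount) for event_id, amount in deliveries]
```

At-least-once delivery plus this kind of deduplication gives an effectively-once result without transactions in the broker.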

Kafka exactly-once semantics (transactions):

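Producer API (Java client, consume-transform-produce loop):
  producer.initTransactions()                                 // once, at startup
  producer.beginTransaction()
  producer.send(new ProducerRecord<>(outputTopic, result))    // write output
  producer.sendOffsetsToTransaction(offsetsToCommit, consumer.groupMetadata())
                                                              // input offsets join the transaction
  producer.commitTransaction()                                // atomic: output and offsets commit together, or neither

On failure: producer.abortTransaction() → consumer re-reads from the last committed offset; aborted output is never visible to read_committed consumers
Downstream consumers must use isolation.level=read_committed (skips uncommitted and aborted msgs)
Note: committing offsets with consumer.commitSync() would happen outside the transaction and break atomicity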

Flink exactly-once semantics (Chandy-Lamport checkpoints):

Step 1: JobManager's checkpoint coordinator triggers checkpoint 1; each source
        records its input offset ("input offset at checkpoint 1 = partition_offset")
        and injects a BARRIER marker into its output stream
        [event1][event2][BARRIER_1][event3]...

Step 2: When an operator receives BARRIER from all inputs:
        - Save operator state to durable storage (S3, HDFS)
        - Forward BARRIER downstream

Step 3: All operators saved → checkpoint complete

On failure:
  - Roll back all operator states to last checkpoint
  - Re-read input from checkpoint's recorded offset
  - Re-process events after checkpoint → same results (deterministic)

Flink’s exactly-once requires:

  • Idempotent or transactional sinks (can’t duplicate-write to Elasticsearch without dedup)
  • All operators must checkpoint their state
  • Deterministic processing (same input → same output on replay)
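
The checkpoint-and-rollback idea can be illustrated with a single-operator toy job: snapshot the state together with the input offset, and on failure restore both and replay. This is a deliberately simplified sketch with hypothetical names, not Flink's distributed barrier protocol.

```python
import copy

class CountingJob:
    """Counts keys from a replayable source, with checkpoint and recovery."""

    def __init__(self, source):
        self.source = source          # replayable input, like a Kafka partition
        self.offset = 0
        self.counts = {}
        self.checkpoint = (0, {})     # (input offset, state snapshot)

    def process_next(self):
        key = self.source[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def take_checkpoint(self):
        # State and input offset are saved together, so they stay consistent.
        self.checkpoint = (self.offset, copy.deepcopy(self.counts))

    def recover(self):
        offset, counts = self.checkpoint
        self.offset = offset
        self.counts = copy.deepcopy(counts)

    def run_to_end(self):
        while self.offset < len(self.source):
            self.process_next()

source = ["a", "b", "a", "c", "a", "b"]

job = CountingJob(source)
for _ in range(3):
    job.process_next()
job.take_checkpoint()                 # snapshot at offset 3
for _ in range(2):
    job.process_next()                # this progress is lost in the crash
job.recover()                         # roll back to the checkpoint
job.run_to_end()                      # replay from offset 3

reference = CountingJob(source)
reference.run_to_end()                # uninterrupted run for comparison
```

Each event affects the final state exactly once even though two events were processed twice, which is why the sinks still need to be idempotent or transactional.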

Comparison Tables

Kafka vs Kinesis vs Pulsar vs RabbitMQ

| Property | Apache Kafka | Amazon Kinesis | Apache Pulsar | RabbitMQ |
|---|---|---|---|---|
| Type | Log-based (pull) | Log-based (pull) | Log-based + queue | Message queue (push) |
| Retention | Configurable (days to forever) | 1–365 days | Tiered (S3 for old msgs) | Until consumed |
| Replay | Yes (any offset) | Yes (within retention) | Yes (tiered storage) | No |
| Ordering | Per partition | Per shard | Per partition | Per queue |
| Throughput | Very high | High | High | Medium |
| Latency | Low (ms) | Low (ms) | Low (ms) | Very low (sub-ms) |
| Multi-consumer | Yes (consumer groups) | Yes (enhanced fan-out) | Yes (subscriptions) | Limited |
| Managed option | Confluent Cloud | AWS native | StreamNative | CloudAMQP |
| Ecosystem | Richest (Connect, Streams, ksqlDB) | AWS-integrated | Growing | Enterprise legacy |
| 2026 market | Dominant | AWS shops | Niche/growing | Legacy |

Decision guide:

  • AWS-native stack → Kinesis (tight IAM integration, no ops)
  • Maximum ecosystem / open source → Kafka (Debezium, Kafka Connect, Flink, ksqlDB)
  • Tiered storage (cost-sensitive long retention) → Pulsar
  • Simple task queue, no replay needed → RabbitMQ or SQS

CDC Tools Comparison

| Tool | Source DBs | Output | Architecture | Managed? |
|---|---|---|---|---|
| Debezium | PostgreSQL, MySQL, MongoDB, Oracle, SQL Server | Kafka | Kafka Connect plugin | No (OSS) |
| Maxwell | MySQL only | Kafka, Kinesis, stdout | Standalone daemon | No (OSS) |
| AWS DMS | Most major DBs | Kafka, Kinesis, S3 | Managed AWS service | Yes (AWS) |
| Striim | Most major DBs | Kafka, various | Commercial | Yes |
| Fivetran CDC | PostgreSQL, MySQL | Fivetran destinations | Managed SaaS | Yes |

Watermark Strategies Comparison

| Strategy | Implementation | Completeness | Latency | Use when |
|---|---|---|---|---|
| Fixed lag | watermark = max_event_time - lag | ~P95 of events | Predictable | Bounded network delay |
| Heuristic | Track distribution of arrival delays dynamically | ~P99 adaptive | Variable | Variable delay patterns |
| Bounded out-of-orderness | Flink’s built-in; fixed lag with max out-of-order bound | Configurable | Configurable | Most Flink jobs |
| Monotonically increasing | No late events; watermark = latest event time | 100% | Minimal | Sources with strict ordering (batch files) |

Important Points Summary

  • Log-based brokers enable replay: Kafka’s retention policy is what makes CDC, event sourcing, and stream-processing fault tolerance practical. A message queue (RabbitMQ) that deletes messages after consumption cannot provide these properties.
  • CDC solves the dual-write problem: Writing to multiple systems simultaneously creates race conditions and partial failures. CDC reads the database’s replication log — atomic at the source, eventually consistent at derived systems.
  • Watermarks are heuristics, not guarantees: You can never know with certainty when all events for a time window have arrived. The watermark is a bet; late data arriving after the watermark must be handled explicitly.
  • Event time for business logic; processing time for SLAs: Always window by event time when computing revenue, session duration, or user activity. Use processing time only for operational metrics like “job latency.”
  • Exactly-once is expensive: Flink’s checkpoint protocol and Kafka transactions both add overhead. For many use cases, idempotent processing + at-least-once delivery is the right trade-off.
  • Three stream join types serve different needs: Stream-table (enrichment, no expiry), stream-stream (event correlation, windowed state), table-table (incremental materialized views).
  • Flink has become the dominant stream processor: Per-event (true streaming) execution gives it lower latency than the default micro-batch mode of Spark Structured Streaming, and its checkpointed state backends are built for large keyed state. ksqlDB (Confluent) provides SQL-over-Kafka for simpler use cases.
  • Kafka log compaction is the bridge between streams and databases: A compacted topic retains one value per key → reading from offset 0 gives current state → bootstraps new CDC consumers without replaying all history.
  • State is the hard part of stream processing: Stateless stream processing (filter, map) is trivial. Stateful operations (windows, joins, deduplication) require durable, fault-tolerant state stores (e.g., RocksDB, backed by changelog topics in Kafka Streams or by checkpoints on S3 in Flink).
  • CDC + streaming lakehouse is the 2026 standard: PostgreSQL → Debezium → Kafka → Flink → Iceberg/Delta on S3 → query with Trino/DuckDB. Replaces ETL batch loading with continuous, low-latency data pipelines.

Modern Context (2026)

Apache Flink dominates stream processing:

  • Flink 1.18+: unified batch + stream (same runtime); Table API + Flink SQL
  • Lower latency than Spark Structured Streaming (true streaming vs micro-batch)
  • Exactly-once state consistency via Chandy-Lamport-style barrier checkpointing
  • Deployed on Kubernetes (Flink Kubernetes Operator) or managed (Amazon Managed Service for Apache Flink, formerly Kinesis Data Analytics; Confluent Cloud)

ksqlDB (Confluent Platform):

  • SQL interface directly over Kafka topics; runs as its own ksqlDB server processes (built on Kafka Streams), with no separate stream processing framework to operate
  • CREATE STREAM, CREATE TABLE, SELECT ... EMIT CHANGES syntax
  • Materialized views as Kafka topics; query results streamed continuously
  • Used for: event filtering, enrichment, simple aggregations and joins; less suited to complex event processing or custom stateful logic

Kafka ecosystem maturity:

  • Kafka 3.x (KRaft mode, production-ready since 3.3): ZooKeeper becomes optional, and Kafka 4.0 removes it entirely; simpler operations
  • Kafka Connect: 100+ connectors (Debezium, JDBC, S3, Elasticsearch)
  • Schema Registry (Confluent): Avro/Protobuf schema management and evolution
  • Kafka Streams: embedded Java stream processing library (no separate cluster)

Streaming Lakehouse (2026 standard):

  • Apache Iceberg + Flink: stream writes to Iceberg tables on S3 with ACID guarantees
  • Delta Lake + Spark Structured Streaming: micro-batch streaming to Delta tables
  • Real-time BI: BI tools (Superset, Looker) query Iceberg/Delta directly — sub-minute data freshness without a separate OLTP/OLAP system

Real-time feature stores:

  • Feast, Tecton, Hopsworks: compute streaming features for online ML
  • Stream processing jobs compute features (e.g., rolling fraud signals) → feature store → online inference
  • Bridges stream processing and ML serving: fresh features within seconds, not hours

Apache Pulsar growing:

  • Tiered storage: old messages auto-offloaded to S3 (infinite retention at low cost)
  • Multi-tenancy and geo-replication built in
  • Gaining adoption in cost-sensitive environments; Kafka still dominates in 2026

Questions for Reflection

  1. A startup is choosing between RabbitMQ and Kafka for their messaging infrastructure. Their use case is: (a) distributing background jobs, and (b) building a real-time analytics pipeline. What do you recommend and why?
  2. Walk through exactly how Debezium captures changes from PostgreSQL. What happens if Debezium crashes and restarts? What happens if it misses some WAL entries?
  3. Your stream processing job computes hourly revenue by event time. You discover that 5% of mobile events arrive 10+ minutes late. How do you configure your watermark strategy, and what do you do with the late events?
  4. Explain the difference between stream-stream join and stream-table join. Give a concrete example of when you would use each.
  5. Your team has achieved at-least-once delivery for a financial transaction aggregation pipeline. What specific steps would you take to upgrade to exactly-once? What overhead does this add?
  6. Compare ksqlDB and Apache Flink for implementing a real-time fraud detection system that: detects 3 failed transactions within 5 minutes for the same card. What are the trade-offs?

Last Updated: 2026-05-29