Chapter 11: Stream Processing

Overview

Stream processing is batch processing over unbounded (never-ending) data. Rather than waiting to collect a complete dataset before processing, stream processing works on events as they arrive. Chapter 11 covers the infrastructure for messaging systems (message brokers), the patterns of stream processing (transformations, aggregations, joins), and the unique challenges of event time, watermarks, and exactly-once semantics.

Key distinction: Batch processing works on bounded (finite) data; stream processing works on unbounded (infinite) data flowing continuously.

Key Concepts

Messaging Systems

Two types of message delivery:

Message queues (RabbitMQ, ActiveMQ, SQS):

Messages pushed to consumers (push model)
Message deleted after consumed
One consumer per message (competing consumers)
Good for: task queues, work distribution

Message logs / Event logs (Apache Kafka, Amazon Kinesis):

Append-only log; messages retained by time or size
Consumers pull and track their own offset (position in log)
Multiple independent consumer groups each read full log
Good for: event sourcing, audit log, stream processing, replay

Log-based messaging advantages:

Replay: Consumers can re-read from any offset (re-process historical events)
Fan-out: Multiple consumer groups, each processing all events independently
Backpressure: Consumer reads at its own pace (no push overwhelming it)
Durability: Events retained for configurable period

Kafka partitions:

Topic divided into partitions for parallelism
Within each partition: messages totally ordered
Across partitions: no ordering guarantee
Producer chooses partition (via key hash or round-robin)
Consumer group: one consumer per partition (parallelism = partition count)

Change Data Capture (CDC)

Problem: Your database is the source of truth. Other systems (search indexes, caches, data warehouses) need to stay in sync.

Traditional approach: Dual writes (write to DB and to Kafka). Problem: race conditions, partial failures.

Change Data Capture (CDC):

Stream changes from the database’s replication log
Changes (inserts, updates, deletes) captured as events
Written to a message log (Kafka) for consumers

Debezium (CDC platform): Reads MySQL binlog, PostgreSQL WAL, MongoDB oplog → sends to Kafka

Benefits of CDC:

Single source of truth remains the database
Derived systems (Elasticsearch, Redis, data warehouse) updated via log consumer
Replay possible: re-build derived view from the beginning of the log

Log compaction (Kafka’s feature for CDC):

For each key, keep only the most recent value
Allows “snapshot” of current state by reading compacted log from beginning
Deletes tracked via “tombstone” messages (null value for a key)

Event Sourcing

Traditional approach: Store current state (mutable rows updated in-place)

Event sourcing: Store immutable sequence of events that led to current state

Application state = fold/reduce over event log
Events are facts: “user added item to cart” (never changes)
State is derived: current cart = reduce events for that user

Event sourcing advantages:

Complete audit trail (every change recorded)
Time travel: derive state at any past point
Reprocessing: change business logic → reprocess event log → new derived state
Microservices: events as integration contracts

Difference from CDC:

CDC: captures low-level DB changes (SQL-level)
Event sourcing: high-level domain events (business-level)
CDC is often used to implement event sourcing infrastructure

Stream Processing Patterns

1. Stateless transformations (simple):

Map: transform each event (e.g., parse JSON → structured record)
Filter: keep only events matching criteria
Enrichment: join stream with static lookup table
Each event processed independently; no state needed

2. Stateful transformations:

Aggregation: count, sum, average over time window
Join: stream-stream or stream-table
Deduplication: track seen event IDs
State stored in local storage (RocksDB in Flink/Kafka Streams)

3. Windows (time-based aggregation):

Tumbling window: Fixed-size, non-overlapping (e.g., counts per 1-minute window)
Sliding window: Overlap allowed (e.g., counts over last 5 minutes, updated every 1 minute)
Session window: Variable-size, ends after inactivity gap (e.g., user session until 30 min idle)
Global window: All time (usually with trigger)

Event Time vs Processing Time

Processing time: When the event is processed by the stream processor
Event time: When the event actually occurred (timestamp in the event itself)

Why they differ:

Network delays: event created at t=100ms, arrives at processor at t=500ms
Mobile apps: events buffered offline, arrive hours later in burst
Clock skew: event device clock wrong

Problem: If you aggregate by event time but events arrive late, how long do you wait?

Watermarks:

Estimate of “how far behind in event time is the data stream”
Watermark = current event time - maximum expected late arrival
When watermark passes time T: assume all events before T have arrived (safe to close window)
Heuristic: Derived from observing actual event arrival patterns

Handling late events:

Drop late events: Simple but loses data
Reemit with update: Update aggregate when late event arrives; downstream receives correction
Grace period: Window stays open for grace period after watermark; then closes

Stream-Table Joins

Three types of stream joins:

1. Stream-table join (enrichment):

Enrich each stream event with data from a database/lookup table
Table = static reference data (user profiles, product catalog)
Implementation: cache the lookup table in memory; join at point of processing
Problem: table can change; use CDC to keep cache fresh

2. Stream-stream join (event correlation):

Join two event streams based on a shared key and time window
Both streams buffered in state store; join events that arrive within the window
Example: Match “search query” events with “click” events within 30 minutes

3. Table-table join (incremental materialized view):

Both inputs are tables (CDC streams from two databases)
Maintains join result as a materialized view, updated incrementally

Fault Tolerance and Exactly-Once

At-least-once processing:

If consumer crashes, replays from last checkpoint → might process events again
Acceptable if processing is idempotent (writing same event twice = same result)

At-most-once processing:

Don’t retry on failure; might miss events
Not acceptable for most business use cases

Exactly-once processing:

Each event processed exactly once, even in the presence of failures
Hardest to implement; most important for financial transactions, billing

Approaches:

Idempotent writes: Processing same event multiple times = same result (write with unique ID; dedup)
Transactional writes (Kafka Transactions): Producer writes multiple records atomically; consumer commits offset + processes transactionally
Two-phase commit: Mark processed in source + write output in atomic transaction (expensive)
Checkpoint + rollback (Flink): Periodic state snapshots; on failure roll back to last checkpoint and replay

Flink’s exactly-once semantics:

Chandy-Lamport algorithm adapted: barrier markers sent through data stream
All operators take consistent snapshot when barrier passes through
On failure: roll back to last barrier snapshot; re-read input from that checkpoint’s input offset

Important Points

Stream processing = batch on unbounded data: Same dataflow concepts; different time model.
Event time ≠ processing time: Always use event time for business logic; processing time for SLAs.
Watermarks are heuristics: Cannot be perfectly precise; must trade off completeness vs latency.
Kafka log compaction is key for CDC state: Allows bootstrapping derived views without full replay from beginning.
Exactly-once is expensive: Idempotent processing + at-least-once is often the right trade-off.
State management is the hard part: Stateless stream processing is easy; stateful requires careful fault tolerance.

Examples & Case Studies

Kafka Streams (Kafka’s built-in stream processor)
- Embedded in application; no separate cluster
- State stored in local RocksDB, backed up to Kafka
- Exactly-once via Kafka transactions
- Used by LinkedIn for real-time analytics
Apache Flink at Scale
- AirBnB: fraud detection (stream-stream join of bookings + fraud signals)
- Uber: real-time price surge detection
- Netflix: real-time recommendation updates
- State: checkpointed to HDFS/S3 via Chandy-Lamport snapshots
Change Data Capture in Practice
- LinkedIn: Databus (CDC from Oracle to Espresso, search, other stores)
- Airbnb: Debezium + Kafka → search index (Elasticsearch), analytics (Snowflake)
- Standard pattern in 2026 for keeping derived stores in sync
Event Sourcing at CQRS-based systems
- EventStoreDB: purpose-built event store
- Axon Framework: Java event sourcing + CQRS framework
- Apache Kafka as event log for microservice event sourcing

Questions

What are the trade-offs between message queues and message logs?
What is change data capture and how is it different from dual writes?
What is the difference between event time and processing time?
How do watermarks work and what are their limitations?
What are the three types of windows in stream processing?
How does Flink achieve exactly-once semantics?
What is event sourcing and how does it differ from CDC?
When would you use stream-stream join vs stream-table join?

Modern Context (2026)

Flink SQL and Kafka SQL:

Apache Flink SQL: write stream processing as SQL (with window functions)
ksqlDB (Confluent): SQL for Kafka streams directly
Both abstract away the Flink/Kafka API for stream processing

Streaming Lakehouses:

Apache Iceberg + Flink: stream writes to Iceberg tables on S3 (ACID streaming ingest)
Delta Lake + Spark Structured Streaming: micro-batch streaming to Delta tables
Enables low-latency analytics without separate OLAP system

Real-time ML inference:

Feature stores (Feast, Tecton, Hopsworks) serve features for online ML
Stream processing computes real-time features from event streams
Feature freshness: seconds vs batch’s hours

Apache Kafka evolution:

Kafka 3.x+: KRaft mode (no ZooKeeper), improved geo-replication
Confluent Platform: managed Kafka + Schema Registry + Kafka Connect + ksqlDB
AWS Kinesis: managed alternative to Kafka; tight AWS integration

Apache Pulsar:

Alternative to Kafka: tiered storage (S3 for old messages), multi-tenancy, geo-replication
Growing adoption but Kafka still dominates (2026)

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch11-stream-processing