Chapter 11: Stream Processing
Overview
Stream processing is batch processing over unbounded (never-ending) data. Rather than waiting to collect a complete dataset before processing, stream processing works on events as they arrive. Chapter 11 covers the infrastructure for messaging systems (message brokers), the patterns of stream processing (transformations, aggregations, joins), and the unique challenges of event time, watermarks, and exactly-once semantics.
Key distinction: Batch processing works on bounded (finite) data; stream processing works on unbounded (infinite) data flowing continuously.
Key Concepts
Messaging Systems
Two types of message delivery:
Message queues (RabbitMQ, ActiveMQ, SQS):
- Messages pushed to consumers (push model)
- Message deleted after consumed
- One consumer per message (competing consumers)
- Good for: task queues, work distribution
Message logs / Event logs (Apache Kafka, Amazon Kinesis):
- Append-only log; messages retained by time or size
- Consumers pull and track their own offset (position in log)
- Multiple independent consumer groups each read full log
- Good for: event sourcing, audit log, stream processing, replay
Log-based messaging advantages:
- Replay: Consumers can re-read from any offset (re-process historical events)
- Fan-out: Multiple consumer groups, each processing all events independently
- Backpressure: Consumer reads at its own pace (no push overwhelming it)
- Durability: Events retained for configurable period
Kafka partitions:
- Topic divided into partitions for parallelism
- Within each partition: messages totally ordered
- Across partitions: no ordering guarantee
- Producer chooses partition (via key hash or round-robin)
- Consumer group: one consumer per partition (parallelism = partition count)
Change Data Capture (CDC)
Problem: Your database is the source of truth. Other systems (search indexes, caches, data warehouses) need to stay in sync.
Traditional approach: Dual writes (write to DB and to Kafka). Problem: race conditions, partial failures.
Change Data Capture (CDC):
- Stream changes from the database’s replication log
- Changes (inserts, updates, deletes) captured as events
- Written to a message log (Kafka) for consumers
Debezium (CDC platform): Reads MySQL binlog, PostgreSQL WAL, MongoDB oplog → sends to Kafka
Benefits of CDC:
- Single source of truth remains the database
- Derived systems (Elasticsearch, Redis, data warehouse) updated via log consumer
- Replay possible: re-build derived view from the beginning of the log
Log compaction (Kafka’s feature for CDC):
- For each key, keep only the most recent value
- Allows “snapshot” of current state by reading compacted log from beginning
- Deletes tracked via “tombstone” messages (null value for a key)
Event Sourcing
Traditional approach: Store current state (mutable rows updated in-place)
Event sourcing: Store immutable sequence of events that led to current state
- Application state = fold/reduce over event log
- Events are facts: “user added item to cart” (never changes)
- State is derived: current cart = reduce events for that user
Event sourcing advantages:
- Complete audit trail (every change recorded)
- Time travel: derive state at any past point
- Reprocessing: change business logic → reprocess event log → new derived state
- Microservices: events as integration contracts
Difference from CDC:
- CDC: captures low-level DB changes (SQL-level)
- Event sourcing: high-level domain events (business-level)
- CDC is often used to implement event sourcing infrastructure
Stream Processing Patterns
1. Stateless transformations (simple):
- Map: transform each event (e.g., parse JSON → structured record)
- Filter: keep only events matching criteria
- Enrichment: join stream with static lookup table
- Each event processed independently; no state needed
2. Stateful transformations:
- Aggregation: count, sum, average over time window
- Join: stream-stream or stream-table
- Deduplication: track seen event IDs
- State stored in local storage (RocksDB in Flink/Kafka Streams)
3. Windows (time-based aggregation):
- Tumbling window: Fixed-size, non-overlapping (e.g., counts per 1-minute window)
- Sliding window: Overlap allowed (e.g., counts over last 5 minutes, updated every 1 minute)
- Session window: Variable-size, ends after inactivity gap (e.g., user session until 30 min idle)
- Global window: All time (usually with trigger)
Event Time vs Processing Time
Processing time: When the event is processed by the stream processor
Event time: When the event actually occurred (timestamp in the event itself)
Why they differ:
- Network delays: event created at t=100ms, arrives at processor at t=500ms
- Mobile apps: events buffered offline, arrive hours later in burst
- Clock skew: event device clock wrong
Problem: If you aggregate by event time but events arrive late, how long do you wait?
Watermarks:
- Estimate of “how far behind in event time is the data stream”
- Watermark = current event time - maximum expected late arrival
- When watermark passes time T: assume all events before T have arrived (safe to close window)
- Heuristic: Derived from observing actual event arrival patterns
Handling late events:
- Drop late events: Simple but loses data
- Reemit with update: Update aggregate when late event arrives; downstream receives correction
- Grace period: Window stays open for grace period after watermark; then closes
Stream-Table Joins
Three types of stream joins:
1. Stream-table join (enrichment):
- Enrich each stream event with data from a database/lookup table
- Table = static reference data (user profiles, product catalog)
- Implementation: cache the lookup table in memory; join at point of processing
- Problem: table can change; use CDC to keep cache fresh
2. Stream-stream join (event correlation):
- Join two event streams based on a shared key and time window
- Both streams buffered in state store; join events that arrive within the window
- Example: Match “search query” events with “click” events within 30 minutes
3. Table-table join (incremental materialized view):
- Both inputs are tables (CDC streams from two databases)
- Maintains join result as a materialized view, updated incrementally
Fault Tolerance and Exactly-Once
At-least-once processing:
- If consumer crashes, replays from last checkpoint → might process events again
- Acceptable if processing is idempotent (writing same event twice = same result)
At-most-once processing:
- Don’t retry on failure; might miss events
- Not acceptable for most business use cases
Exactly-once processing:
- Each event processed exactly once, even in the presence of failures
- Hardest to implement; most important for financial transactions, billing
Approaches:
- Idempotent writes: Processing same event multiple times = same result (write with unique ID; dedup)
- Transactional writes (Kafka Transactions): Producer writes multiple records atomically; consumer commits offset + processes transactionally
- Two-phase commit: Mark processed in source + write output in atomic transaction (expensive)
- Checkpoint + rollback (Flink): Periodic state snapshots; on failure roll back to last checkpoint and replay
Flink’s exactly-once semantics:
- Chandy-Lamport algorithm adapted: barrier markers sent through data stream
- All operators take consistent snapshot when barrier passes through
- On failure: roll back to last barrier snapshot; re-read input from that checkpoint’s input offset
Important Points
- Stream processing = batch on unbounded data: Same dataflow concepts; different time model.
- Event time ≠ processing time: Always use event time for business logic; processing time for SLAs.
- Watermarks are heuristics: Cannot be perfectly precise; must trade off completeness vs latency.
- Kafka log compaction is key for CDC state: Allows bootstrapping derived views without full replay from beginning.
- Exactly-once is expensive: Idempotent processing + at-least-once is often the right trade-off.
- State management is the hard part: Stateless stream processing is easy; stateful requires careful fault tolerance.
Examples & Case Studies
-
Kafka Streams (Kafka’s built-in stream processor)
- Embedded in application; no separate cluster
- State stored in local RocksDB, backed up to Kafka
- Exactly-once via Kafka transactions
- Used by LinkedIn for real-time analytics
-
Apache Flink at Scale
- AirBnB: fraud detection (stream-stream join of bookings + fraud signals)
- Uber: real-time price surge detection
- Netflix: real-time recommendation updates
- State: checkpointed to HDFS/S3 via Chandy-Lamport snapshots
-
Change Data Capture in Practice
- LinkedIn: Databus (CDC from Oracle to Espresso, search, other stores)
- Airbnb: Debezium + Kafka → search index (Elasticsearch), analytics (Snowflake)
- Standard pattern in 2026 for keeping derived stores in sync
-
Event Sourcing at CQRS-based systems
- EventStoreDB: purpose-built event store
- Axon Framework: Java event sourcing + CQRS framework
- Apache Kafka as event log for microservice event sourcing
Questions
- What are the trade-offs between message queues and message logs?
- What is change data capture and how is it different from dual writes?
- What is the difference between event time and processing time?
- How do watermarks work and what are their limitations?
- What are the three types of windows in stream processing?
- How does Flink achieve exactly-once semantics?
- What is event sourcing and how does it differ from CDC?
- When would you use stream-stream join vs stream-table join?
Modern Context (2026)
Flink SQL and Kafka SQL:
- Apache Flink SQL: write stream processing as SQL (with window functions)
- ksqlDB (Confluent): SQL for Kafka streams directly
- Both abstract away the Flink/Kafka API for stream processing
Streaming Lakehouses:
- Apache Iceberg + Flink: stream writes to Iceberg tables on S3 (ACID streaming ingest)
- Delta Lake + Spark Structured Streaming: micro-batch streaming to Delta tables
- Enables low-latency analytics without separate OLAP system
Real-time ML inference:
- Feature stores (Feast, Tecton, Hopsworks) serve features for online ML
- Stream processing computes real-time features from event streams
- Feature freshness: seconds vs batch’s hours
Apache Kafka evolution:
- Kafka 3.x+: KRaft mode (no ZooKeeper), improved geo-replication
- Confluent Platform: managed Kafka + Schema Registry + Kafka Connect + ksqlDB
- AWS Kinesis: managed alternative to Kafka; tight AWS integration
Apache Pulsar:
- Alternative to Kafka: tiered storage (S3 for old messages), multi-tenancy, geo-replication
- Growing adoption but Kafka still dominates (2026)
Status: Notes complete
Last Updated: 2026-04-13