Chapter 12: Stream Processing

ddia-2e stream-processing kafka cdc watermarks exactly-once flink

Status: Notes complete

Overview

Stream processing is the paradigm for working with unbounded (infinite) data — events that arrive continuously and never form a complete, finite dataset. Where batch processing waits to collect all the data before processing, stream processing acts on events as they arrive. Chapter 12 covers the full stack: the infrastructure for transmitting events (messaging systems, log-based brokers), the patterns for keeping systems in sync (Change Data Capture, event sourcing), and the mechanics of stream processing (windowing, joins, fault tolerance, exactly-once semantics).

The 2nd edition significantly updates the 1st edition’s Chapter 11, adding concrete CDC tooling (Debezium + Kafka + PostgreSQL), Kafka Streams and ksqlDB, Flink as the dominant stream processor, and expanded treatment of watermarking strategies for late-arriving data.

Why stream processing matters: Modern systems cannot afford to wait hours for a batch job. Fraud detection, real-time recommendations, alerting, operational dashboards, and CDC-based data synchronization all require processing events within seconds. Stream processing is batch processing applied to the present moment.

Key tension in stream processing: Correctness (waiting for all data) vs latency (acting now). Every design decision in stream processing is navigating this tension.

Key Concepts

Transmitting Event Streams

Messaging Systems

When a producer generates events faster than consumers can process them, a messaging system (broker) is needed to buffer the events and decouple producers from consumers.

Two fundamental delivery models:

Message Queue (traditional: RabbitMQ, ActiveMQ, AWS SQS):

Messages pushed to consumers
Message deleted after acknowledged by one consumer
Competing consumers: multiple consumers compete for the same messages (load-balanced)
No replay: once consumed, a message is gone
Good for: task queues, work distribution, job scheduling

Message Log / Event Log (Kafka, Amazon Kinesis, Azure Event Hubs):

Append-only log; messages retained by time or size policy
Consumers pull and track their own offset (position in log)
Multiple independent consumer groups each read the full log independently
Replay: consumers can re-read from any past offset
Good for: event sourcing, audit log, stream processing, data integration, replay

Comparison:

Property	Message Queue (RabbitMQ)	Log-Based Broker (Kafka)
Retention	Deleted after consumed	Retained for configurable period
Multiple consumers	Competing (one gets each message)	Fan-out (each group gets all messages)
Replay	No	Yes (seek to any offset)
Ordering	Per-queue (limited)	Per-partition (total order)
Throughput	Medium	Very high (sequential disk I/O)
Consumer pacing	Push (broker controls rate)	Pull (consumer controls rate)
Use case	Task queues, work distribution	Event streaming, audit, CDC, replay

Kafka architecture fundamentals:

Topic: "user-events"
  Partition 0: [offset 0][offset 1][offset 2]...[offset N]
  Partition 1: [offset 0][offset 1][offset 2]...[offset M]
  Partition 2: [offset 0][offset 1][offset 2]...[offset K]

Consumer Group "analytics":
  Consumer A → reads Partition 0 (tracks offset 0–N)
  Consumer B → reads Partition 1 (tracks offset 0–M)
  Consumer C → reads Partition 2 (tracks offset 0–K)

Consumer Group "fraud-detection":
  Consumer X → reads ALL 3 partitions from their own offsets

Key properties:
  - Within a partition: messages are totally ordered
  - Across partitions: no ordering guarantee
  - Parallelism = number of partitions per topic
  - Producer routes by key hash (key → same partition → ordered)

Log-Based Message Brokers

Why log-based beats traditional queues for stream processing:

Replay: Consumer crash at offset 500? Re-read from offset 500. No data loss.
Fan-out: Spin up a new consumer group (analytics, fraud, monitoring) — each gets all messages independently. Traditional queue: each message consumed by only one consumer.
Backpressure handled naturally: Consumers read at their own pace. If consumer slows, messages wait in the log. No push overwhelming the consumer.
Audit and debugging: Every event persisted for inspection. Can replay past events to debug processing logic.
Bootstrapping new consumers: A new consumer can replay from the beginning of the log to build initial state.

Kafka log compaction (essential for CDC state):

For a compacted topic: Kafka retains only the most recent value for each key
Deletes tracked via tombstone messages (key with null value)
Reading a compacted log from offset 0 = current state of all keys (like a key-value snapshot)
This enables CDC consumers to restart and rebuild current state without reading all history

Databases and Streams

Keeping Systems in Sync

The problem: An operational database (PostgreSQL) is the source of truth. Multiple derived systems need to stay in sync: search index (Elasticsearch), analytics warehouse, caches (Redis), microservice databases.

Approach 1 — Dual writes (anti-pattern):

Application writes to PostgreSQL AND Elasticsearch simultaneously
Problem: What if PostgreSQL write succeeds but Elasticsearch write fails?
  → Systems diverge
Problem: Two concurrent writes to different systems → race condition
  → Systems end up in different states depending on timing

Approach 2 — Change Data Capture (CDC) (correct approach):

Application writes to PostgreSQL only (single source of truth)
CDC reads PostgreSQL WAL (Write-Ahead Log) → publishes changes to Kafka
All derived systems subscribe to Kafka → update themselves from log

Change Data Capture (CDC)

What CDC captures: Every INSERT, UPDATE, DELETE from the database’s replication log, as a stream of events sent to a message broker.

Concrete CDC setup: PostgreSQL → Debezium → Kafka → Flink → Elasticsearch:

┌─────────────────────────────────────────────────────────────────────────┐
│                  CDC Pipeline Architecture                              │
│                                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │  PostgreSQL  │    │   Debezium   │    │    Kafka     │              │
│  │              │    │  (Kafka      │    │              │              │
│  │  WAL         │───►│   Connect    │───►│  Topic:      │              │
│  │  (replication│    │   plugin)    │    │  postgres.   │              │
│  │   log)       │    │              │    │  public.     │              │
│  │              │    │  Reads WAL,  │    │  orders      │              │
│  │  INSERT INTO │    │  transforms  │    │              │              │
│  │  orders ...  │    │  to JSON/    │    │  [JSON CDC   │              │
│  └──────────────┘    │  Avro events │    │   events]    │              │
│                      └──────────────┘    └──────┬───────┘              │
│                                                 │                      │
│                             ┌───────────────────┤                      │
│                             │                   │                      │
│                    ┌────────▼──────┐    ┌────────▼──────┐             │
│                    │   Apache      │    │  Data         │             │
│                    │   Flink       │    │  Warehouse    │             │
│                    │               │    │  (Snowflake/  │             │
│                    │  Transform,   │    │   BigQuery)   │             │
│                    │  enrich,      │    └───────────────┘             │
│                    │  filter       │                                   │
│                    └───────┬───────┘                                   │
│                            │                                           │
│                    ┌───────▼───────┐                                   │
│                    │ Elasticsearch │                                   │
│                    │ (search index │                                   │
│                    │  up to date   │                                   │
│                    │  within secs) │                                   │
│                    └───────────────┘                                   │
└─────────────────────────────────────────────────────────────────────────┘

Debezium configuration snippet (PostgreSQL source connector):

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "postgres-db",
  "database.port": "5432",
  "database.user": "debezium",
  "database.password": "secret",
  "database.dbname": "myapp",
  "table.include.list": "public.orders,public.customers",
  "topic.prefix": "postgres",
  "plugin.name": "pgoutput",
  "publication.name": "dbz_publication"
}

CDC event structure (Debezium JSON):

{
  "before": {"id": 1, "status": "pending", "amount": 99.99},
  "after":  {"id": 1, "status": "shipped", "amount": 99.99},
  "op": "u",          // u=update, c=create, d=delete, r=read(snapshot)
  "ts_ms": 1716998400000,
  "source": {"table": "orders", "lsn": "0/15D3F8"}
}

Benefits of CDC:

Single source of truth (database remains authoritative)
Derived systems updated eventually-consistently via log
Replay: re-build Elasticsearch index from scratch by replaying Kafka topic
Zero downtime migrations: new system subscribes to topic and catches up

State, Streams, and Immutability

Fundamental duality: A database table and a stream of changes are two representations of the same information.

Stream of changes:  INSERT(Alice), UPDATE(Alice→Alice2), DELETE(Alice2)
                                                        ↕
Table (current state):  {Alice2}  (fold/reduce over the stream)

Table → Stream:  CDC (export each write as a changelog event)
Stream → Table:  Stream processor materializes a view by applying each event

Event sourcing:

Store all state changes as an immutable sequence of events (not mutable rows)
Current state = replay/fold all events for an entity
Events are facts: “user placed order #99 at 14:00” (never changes)
State is derived: current order status = reduce all order events

Event sourcing vs CDC:

	CDC	Event Sourcing
Level	Low-level DB changes (SQL rows)	High-level domain events (business facts)
Designed for	Infrastructure sync	Application architecture
Consumer access	Any CDC consumer	Application-specific consumers
Reprocessing	Re-read WAL / replay Kafka topic	Replay event store

Processing Streams

Uses of Stream Processing

1. Complex event processing (CEP): Detect patterns across multiple events within a time window.

Example: Detect fraud — 3+ failed card transactions within 5 minutes → alert

2. Stream analytics: Compute running metrics, rolling aggregates, windowed counts.

Example: Count page views per minute; compute P99 latency over last 5 minutes

3. Maintaining materialized views: Apply CDC events to keep a derived table (or search index) up to date.

Example: Debezium → Kafka → Flink → Elasticsearch (orders searchable within seconds)

4. Stream-to-stream joins: Correlate events from two streams within a time window.

Example: Match ad impression events with ad click events within 30 minutes

5. Enrichment: Join each stream event with a reference table to add context.

Example: Enrich order events with customer tier from customer profile table

Reasoning About Time

Event time vs processing time — the most important distinction in stream processing:

	Event Time	Processing Time
Definition	When the event actually occurred (in the event payload)	When the event is processed by the stream processor
Source	Event producer’s timestamp	Stream processor’s system clock
Consistency	Consistent across retries	Changes on retry
Problem	Late arrivals; out-of-order	Doesn’t reflect when event happened
Use for	Business logic (revenue per hour, session duration)	SLA monitoring, job performance

Why event time ≠ processing time:

Mobile app buffers events offline for 4 hours → arrives in burst
Network delay between producer and broker
Clock skew: device clock is wrong
Kafka consumer paused → resumes and processes 1-hour-old messages

Windowing strategies:

TUMBLING WINDOW (fixed, non-overlapping):
  ──[──────────][──────────][──────────]──►
  00:00  01:00  02:00  03:00  04:00
  
  Each event belongs to exactly one window
  Use: "counts per hour", "revenue per day"

SLIDING WINDOW (fixed size, moves by step):
  ──[──────────]──►
      ──[──────────]──►
          ──[──────────]──►
  Window size: 5 min, step: 1 min
  Events can appear in multiple windows
  Use: "rolling 5-minute error rate"

SESSION WINDOW (variable size, gap-based):
  [event1, event2, event3]   [ev4]   [ev5, ev6, ev7, ev8]
  session: 3 events           gap     session: 4 events
  (window closes after 30min inactivity)
  Use: "user session duration", "clickstream session"

Watermark Strategies

The watermark problem: When aggregating by event time (e.g., counts per hour), how long do you wait for late-arriving events before closing the window and emitting the result?

Watermark = the stream processor’s estimate of “all events at or before this timestamp have likely arrived.”

When watermark advances past time T: the window for time T is closed and results emitted

Watermark strategies:

1. Fixed-lag watermark:

Watermark = MAX(event_time seen) - fixed_lag
Example: fixed_lag = 2 minutes
If latest event = 14:10:00, watermark = 14:08:00
Windows before 14:08 are safe to close

✅ Simple; predictable; low latency
❌ If events arrive more than 2 minutes late → silently dropped or ignored

2. Heuristic watermark:

Observe actual arrival delay distribution dynamically
Watermark = f(historical P99 late arrival delay)
If 99% of events arrive within 45 seconds, set watermark at 50s lag

✅ Adapts to actual data patterns
❌ Complex; still a heuristic (never 100% correct)

3. Perfect watermark (rare):

Only possible when you KNOW all input sources have sent all events
up to a certain time (e.g., batch files with known completion)

✅ No late events possible; results always correct
❌ Only achievable with coordinated producers; not general

Handling late events after watermark:

Strategy	Behavior	Trade-off
Drop	Discard late events	Simple; loses data
Grace period	Window stays open for extra N seconds after watermark	Small extra latency; handles most late arrivals
Side output	Route late events to separate stream for special handling	Flexible; more complex
Update result	Re-emit updated window aggregate when late event arrives	Downstream must handle corrections; complex

Late data trade-off:

                    Completeness
                        │
                   100% ┤ ........ (wait forever)
                        │        ↗
                        │      ↗  (practical zone: 95-99% within watermark)
                    50% ┤    ↗
                        │  ↗
                        └────────────────────────────► Latency
                        0s    1s    5s    1min  10min
                        
You cannot have both 100% completeness AND zero latency.
Watermark = the trade-off knob.

Stream Joins

Three types of stream joins:

1. Stream-Table Join (enrichment):

Enrich each stream event with data from a lookup table
Table = slowly changing reference data (user profiles, product catalog)
Implementation: load table into state store (RocksDB in Flink); look up at processing time

Stream: [order_id=1, user_id=42, amount=99]  ──► enriched with user tier
Table:  {user_id=42 → {name="Alice", tier="gold", country="US"}}
Output: [order_id=1, user_id=42, amount=99, tier="gold", country="US"]

Challenge: Table can change. Solution: Use CDC to keep state store fresh (stream-table join becomes effectively stream-stream with one side slowly changing).

2. Stream-Stream Join (event correlation):

Join events from two streams that are related within a time window
Both streams buffered in state store; join matching events within the window
Unmatched events expire after window duration

Stream A (ad impressions):  [user=42, ad_id=9, impression_time=14:00:05]
Stream B (ad clicks):       [user=42, ad_id=9, click_time=14:07:33]

Join: impression → click within 30 minutes, same user + ad_id
Window state: buffer all impressions; for each click, look up matching impression

Output: [user=42, ad_id=9, impression_time=14:00:05, click_time=14:07:33,
         time_to_click=7m28s]

3. Table-Table Join (incremental materialized view):

Both inputs are tables (or CDC streams of tables)
Maintains join result as incrementally updated materialized view

Table A: customer updates (via CDC)
Table B: order updates (via CDC)

Materialized view: JOIN of customer + order
  When customer row changes → recompute join for all their orders
  When order row changes → recompute join for that order's customer

Used in: ksqlDB, Kafka Streams, Flink Table API

Stream join comparison:

Join Type	State stored	Expiration	Use case
Stream-table	Lookup table (one side)	No expiry (table is persistent)	Enrichment with reference data
Stream-stream	Both streams buffered	Events expire after window	Event correlation, funnel analysis
Table-table	Full join result	No expiry (maintained incrementally)	Incremental materialized views

Fault Tolerance

The three delivery guarantees:

AT-MOST-ONCE:
  Process event; don't retry on failure
  Risk: events dropped on failure
  When: acceptable for non-critical metrics (logging, analytics)

AT-LEAST-ONCE:
  Retry on failure; might process event multiple times
  Requirement: processing must be idempotent (process twice = same result)
  When: acceptable for most use cases with idempotent writes

EXACTLY-ONCE:
  Each event processed exactly once, even with failures
  Hardest to implement; highest overhead
  When: financial transactions, billing, inventory updates

Exactly-once is harder than it sounds:

Problem: Consumer crashes after processing but before committing offset
  → On restart, re-reads and re-processes the event
  → Output written twice

Solution options:
  1. Idempotent writes: same event written twice = same result (via dedup ID)
  2. Kafka transactions: atomic commit of output + offset
  3. Flink checkpoints + rollback: snapshot state + input offset; rollback on failure

Kafka exactly-once semantics (transactions):

Producer API:
  producer.beginTransaction()
  producer.send(outputTopic, result)         // write output
  consumer.commitSync(offsetsToCommit)       // commit input offset
  producer.commitTransaction()               // atomic: both or neither

On failure: Transaction aborted → consumer re-reads; no double processing
Consumer must use isolation.level=read_committed (skips uncommitted msgs)

Flink exactly-once semantics (Chandy-Lamport checkpoints):

Step 1: JobManager injects BARRIER markers into all input streams
        [event1][event2][BARRIER_1][event3]...

Step 2: When an operator receives BARRIER from all inputs:
        - Save operator state to durable storage (S3, HDFS)
        - Record: "input offset at checkpoint 1 = partition_offset"
        - Forward BARRIER downstream

Step 3: All operators saved → checkpoint complete

On failure:
  - Roll back all operator states to last checkpoint
  - Re-read input from checkpoint's recorded offset
  - Re-process events after checkpoint → same results (deterministic)

Flink’s exactly-once requires:

Idempotent or transactional sinks (can’t duplicate-write to Elasticsearch without dedup)
All operators must checkpoint their state
Deterministic processing (same input → same output on replay)

Comparison Tables

Kafka vs Kinesis vs Pulsar vs RabbitMQ

Property	Apache Kafka	Amazon Kinesis	Apache Pulsar	RabbitMQ
Type	Log-based (pull)	Log-based (pull)	Log-based + queue	Message queue (push)
Retention	Configurable (days to forever)	1–365 days	Tiered (S3 for old msgs)	Until consumed
Replay	Yes (any offset)	Yes (within retention)	Yes (tiered storage)	No
Ordering	Per partition	Per shard	Per partition	Per queue
Throughput	Very high	High	High	Medium
Latency	Low (ms)	Low (ms)	Low (ms)	Very low (sub-ms)
Multi-consumer	Yes (consumer groups)	Yes (enhanced fan-out)	Yes (subscriptions)	Limited
Managed option	Confluent Cloud	AWS native	StreamNative	CloudAMQP
Ecosystem	Richest (Connect, Streams, ksqlDB)	AWS-integrated	Growing	Enterprise legacy
2026 market	Dominant	AWS shops	Niche/growing	Legacy

Decision guide:

AWS-native stack → Kinesis (tight IAM integration, no ops)
Maximum ecosystem / open source → Kafka (Debezium, Flink Connect, ksqlDB)
Tiered storage (cost-sensitive long retention) → Pulsar
Simple task queue, no replay needed → RabbitMQ or SQS

CDC Tools Comparison

Tool	Source DBs	Output	Architecture	Managed?
Debezium	PostgreSQL, MySQL, MongoDB, Oracle, SQL Server	Kafka	Kafka Connect plugin	No (OSS)
Maxwell	MySQL only	Kafka, Kinesis, stdout	Standalone daemon	No (OSS)
AWS DMS	Most major DBs	Kafka, Kinesis, S3	Managed AWS service	Yes (AWS)
Striim	Most major DBs	Kafka, various	Commercial	Yes
Fivetran CDC	PostgreSQL, MySQL	Fivetran destinations	Managed SaaS	Yes

Watermark Strategies Comparison

Strategy	Implementation	Completeness	Latency	Use when
Fixed lag	`watermark = max_event_time - lag`	~P95 of events	Predictable	Bounded network delay
Heuristic	Track distribution of arrival delays dynamically	~P99 adaptive	Variable	Variable delay patterns
Bounded out-of-orderness	Flink’s built-in; fixed lag with max out-of-order bound	Configurable	Configurable	Most Flink jobs
Monotonically increasing	No late events; watermark = latest event time	100%	Minimal	Sources with strict ordering (batch files)

Important Points Summary

Log-based brokers enable replay: Kafka’s retention policy is what makes CDC, event sourcing, and stream-processing fault tolerance practical. A message queue (RabbitMQ) that deletes messages after consumption cannot provide these properties.
CDC solves the dual-write problem: Writing to multiple systems simultaneously creates race conditions and partial failures. CDC reads the database’s replication log — atomic at the source, eventually consistent at derived systems.
Watermarks are heuristics, not guarantees: You can never know with certainty when all events for a time window have arrived. The watermark is a bet; late data arriving after the watermark must be handled explicitly.
Event time for business logic; processing time for SLAs: Always window by event time when computing revenue, session duration, or user activity. Use processing time only for operational metrics like “job latency.”
Exactly-once is expensive: Flink’s checkpoint protocol and Kafka transactions both add overhead. For many use cases, idempotent processing + at-least-once delivery is the right trade-off.
Three stream join types serve different needs: Stream-table (enrichment, no expiry), stream-stream (event correlation, windowed state), table-table (incremental materialized views).
Flink has become the dominant stream processor: Lower latency than Spark Structured Streaming, better state management, stronger exactly-once guarantees. ksqlDB (Confluent) provides SQL-over-Kafka for simpler use cases.
Kafka log compaction is the bridge between streams and databases: A compacted topic retains one value per key → reading from offset 0 gives current state → bootstraps new CDC consumers without replaying all history.
State is the hard part of stream processing: Stateless stream processing (filter, map) is trivial. Stateful operations (windows, joins, deduplication) require durable, fault-tolerant state stores (RocksDB, backed by Kafka or S3).
CDC + streaming lakehouse is the 2026 standard: PostgreSQL → Debezium → Kafka → Flink → Iceberg/Delta on S3 → query with Trino/DuckDB. Replaces ETL batch loading with continuous, low-latency data pipelines.

Modern Context (2026)

Apache Flink dominates stream processing:

Flink 1.18+: unified batch + stream (same runtime); Table API + Flink SQL
Lower latency than Spark Structured Streaming (true streaming vs micro-batch)
Stronger exactly-once semantics via Chandy-Lamport checkpointing
Deployed on Kubernetes (Flink Kubernetes Operator) or managed (AWS KDA, Confluent Cloud)

ksqlDB (Confluent Platform):

SQL interface directly over Kafka topics — no separate cluster needed
CREATE STREAM, CREATE TABLE, SELECT ... EMIT CHANGES syntax
Materialized views as Kafka topics; query results streamed continuously
Used for: event filtering, enrichment, simple aggregations — not suitable for complex stateful joins

Kafka ecosystem maturity:

Kafka 3.x (KRaft mode): ZooKeeper dependency removed; simpler operations
Kafka Connect: 100+ connectors (Debezium, JDBC, S3, Elasticsearch)
Schema Registry (Confluent): Avro/Protobuf schema management and evolution
Kafka Streams: embedded Java stream processing library (no separate cluster)

Streaming Lakehouse (2026 standard):

Apache Iceberg + Flink: stream writes to Iceberg tables on S3 with ACID guarantees
Delta Lake + Spark Structured Streaming: micro-batch streaming to Delta tables
Real-time BI: BI tools (Superset, Looker) query Iceberg/Delta directly — sub-minute data freshness without a separate OLTP/OLAP system

Real-time feature stores:

Feast, Tecton, Hopsworks: compute streaming features for online ML
Stream processing jobs compute features (e.g., rolling fraud signals) → feature store → online inference
Bridges stream processing and ML serving: fresh features within seconds, not hours

Apache Pulsar growing:

Tiered storage: old messages auto-offloaded to S3 (infinite retention at low cost)
Multi-tenancy and geo-replication built in
Gaining adoption in cost-sensitive environments; Kafka still dominates in 2026

Questions for Reflection

A startup is choosing between RabbitMQ and Kafka for their messaging infrastructure. Their use case is: (a) distributing background jobs, and (b) building a real-time analytics pipeline. What do you recommend and why?
Walk through exactly how Debezium captures changes from PostgreSQL. What happens if Debezium crashes and restarts? What happens if it misses some WAL entries?
Your stream processing job computes hourly revenue by event time. You discover that 5% of mobile events arrive 10+ minutes late. How do you configure your watermark strategy, and what do you do with the late events?
Explain the difference between stream-stream join and stream-table join. Give a concrete example of when you would use each.
Your team has achieved at-least-once delivery for a financial transaction aggregation pipeline. What specific steps would you take to upgrade to exactly-once? What overhead does this add?
Compare ksqlDB and Apache Flink for implementing a real-time fraud detection system that: detects 3 failed transactions within 5 minutes for the same card. What are the trade-offs?

ch11-batch-processing — Batch processing: the bounded counterpart; same dataflow concepts
ch06-replication — Replication fundamentals; CDC reads the replication log (WAL)
ch10-consistency-and-consensus — Exactly-once semantics requires consensus protocols underneath
ch13-philosophy-of-streaming — Philosophical treatment of streaming-first system design
ch11-stream-processing — 1st edition Ch11 (good baseline; 2E significantly updated)

Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch12-stream-processing

Chapter 12: Stream Processing

Overview

Key Concepts

Transmitting Event Streams

Messaging Systems

Log-Based Message Brokers

Databases and Streams

Keeping Systems in Sync

Change Data Capture (CDC)

State, Streams, and Immutability

Processing Streams

Uses of Stream Processing

Reasoning About Time

Watermark Strategies

Stream Joins

Fault Tolerance

Comparison Tables

Kafka vs Kinesis vs Pulsar vs RabbitMQ

CDC Tools Comparison

Watermark Strategies Comparison

Important Points Summary

Modern Context (2026)

Questions for Reflection

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch12-stream-processing

Chapter 12: Stream Processing

Overview

Key Concepts

Transmitting Event Streams

Messaging Systems

Log-Based Message Brokers

Databases and Streams

Keeping Systems in Sync

Change Data Capture (CDC)

State, Streams, and Immutability

Processing Streams

Uses of Stream Processing

Reasoning About Time

Watermark Strategies

Stream Joins

Fault Tolerance

Comparison Tables

Kafka vs Kinesis vs Pulsar vs RabbitMQ

CDC Tools Comparison

Watermark Strategies Comparison

Important Points Summary

Modern Context (2026)

Questions for Reflection

Related Resources

Graph View

Table of Contents

Backlinks