Chapter 12 Flashcards — Stream Processing

flashcards ddia-2e chapter12 stream-processing

Core Concepts

What is stream processing and how does it differ from batch processing?
?
Stream processing: Process an unbounded (infinite) stream of events as they arrive; the job never terminates.

Latency: milliseconds to seconds
Input is never complete (new events keep arriving)
State is maintained across events (windows, joins, aggregations)

Batch processing: Process a bounded (finite) dataset; job starts, processes all data, terminates.

Latency: minutes to hours
Input is complete before processing begins

The unified view: Stream processing is batch processing applied to unbounded data. The same dataflow concepts (map, filter, group, join) apply — the difference is the time model. Batch uses event-time windows; stream must handle out-of-order arrival and watermarks.

Key tension: Correctness (wait for all events) vs latency (act now). Every stream processing design decision navigates this tension.

What is the key difference between a message queue (RabbitMQ) and a log-based broker (Kafka)?
?
Message Queue (RabbitMQ, SQS):

Message deleted after consumed by one consumer
No replay: once consumed, message is gone
Competing consumers: multiple consumers compete for each message (load-balanced)
Push-based: broker pushes to consumer
Good for: task queues, work distribution

Log-Based Broker (Kafka, Kinesis):

Messages retained for configurable period (not deleted on consumption)
Replay: any consumer can seek to any past offset and re-read
Fan-out: multiple independent consumer groups each read ALL messages
Pull-based: consumer controls its own read pace (natural backpressure)
Good for: event streaming, CDC, audit, stream processing, replay

The difference that matters: With Kafka, a consumer crash at offset 500 means re-read from offset 500 on restart — no data loss. With RabbitMQ, if message is consumed but processing crashes before ack, it can be lost.

Messaging Systems

What is the role of Kafka partitions, and how does partition count affect parallelism?
?
Partitions divide a topic for parallelism and ordering.

Within a partition: Messages totally ordered (sequence guaranteed)
Across partitions: No ordering guarantee

Producer routing:

With key: hash(key) % num_partitions → same key always goes to same partition (ordering per entity)
Without key: round-robin (load-balanced across partitions)

Consumer parallelism:

One consumer per partition (within a consumer group)
If 3 partitions + 3 consumers → all work in parallel
If 3 partitions + 5 consumers → 2 consumers sit idle
If 3 partitions + 2 consumers → one consumer reads 2 partitions

Practical rule: Set partition count = max expected consumer parallelism. Can only increase partitions (not decrease); key-based routing changes on repartition.

Why it matters for ordering: If you need all events for a given user to be processed in order, route user_id as the Kafka key → all events for that user go to the same partition → processed in order by one consumer.

What is Kafka log compaction and why is it essential for CDC?
?
Log compaction: Kafka retains only the most recent value for each key in a compacted topic.

Before compaction:
  [key=A: v1] [key=B: v2] [key=A: v3] [key=A: v4] [key=C: v5]

After compaction:
  [key=B: v2] [key=A: v4] [key=C: v5]
  (earlier versions of A removed; only latest kept)

Tombstone messages: A message with key=X, value=null means “delete key X”. After compaction, tombstone retained briefly then removed.

Why essential for CDC:

CDC topic: each key = a row in the database; value = current row state
Reading compacted topic from offset 0 = current snapshot of all rows
A new Elasticsearch indexer consumer can bootstrap from offset 0 (no need to replay full history)
Without compaction: consumer must replay all events from the beginning (potentially years) to build current state

Normal topic vs compacted topic:

Normal: retention by time or size; old messages expire
Compacted: retention by key; latest value always available; no time-based expiry for latest values

Change Data Capture (CDC)

What is Change Data Capture (CDC) and why is it better than dual writes?
?
Dual writes (anti-pattern):

Application: write to PostgreSQL AND Elasticsearch simultaneously
Problem 1: If PostgreSQL succeeds but Elasticsearch fails → diverged state
Problem 2: Two concurrent writers → race condition → different ordering in each system

CDC (correct approach):

Application: write to PostgreSQL only (single source of truth)
Debezium: reads PostgreSQL WAL → publishes changes to Kafka
Elasticsearch: consumes Kafka → applies changes

How CDC works:

Every INSERT/UPDATE/DELETE in PostgreSQL writes to the Write-Ahead Log (WAL)
Debezium reads the WAL (as a replication client)
Debezium transforms each change into a JSON/Avro event with before/after state
Event published to Kafka topic (one topic per table)
All derived systems subscribe to the Kafka topic independently

Benefits:

Single source of truth (DB remains authoritative)
Replay: rebuild Elasticsearch from scratch by replaying the topic
New derived systems can catch up from any point in history
No application code changes needed

CDC event structure (Debezium):

{"before": {...old values...}, "after": {...new values...},
 "op": "u",  // c=create, u=update, d=delete, r=read
 "ts_ms": 1716998400000}

Walk through the concrete setup of a Debezium CDC pipeline from PostgreSQL to Elasticsearch.
?
Components: PostgreSQL → Debezium (Kafka Connect) → Kafka → Flink → Elasticsearch

Step 1: Configure PostgreSQL:

-- Enable logical replication
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 4;
-- Create replication user
CREATE USER debezium REPLICATION LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
-- Create publication
CREATE PUBLICATION dbz_publication FOR ALL TABLES;

Step 2: Deploy Debezium Kafka Connect connector:

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "postgres", "database.port": "5432",
  "database.user": "debezium", "database.dbname": "myapp",
  "table.include.list": "public.orders",
  "topic.prefix": "postgres",
  "plugin.name": "pgoutput"
}

Step 3: Kafka topic created: postgres.public.orders — receives all CDC events

Step 4: Flink job consumes topic, transforms events, writes to Elasticsearch with op routing:

c (create) → Elasticsearch index
u (update) → Elasticsearch update
d (delete) → Elasticsearch delete

Step 5: Elasticsearch search index updated within seconds of DB write

On Debezium restart: Debezium persists its WAL position (LSN offset) in Kafka → on restart, continues from last processed position. No events lost.

Reasoning About Time

What is the difference between event time and processing time, and which should you use for business logic?
?
Event time: When the event actually occurred (timestamp embedded in the event by the producer).
Processing time: When the event is processed by the stream processor (processor’s system clock).

Why they differ:

Mobile app buffers events offline 4 hours → event_time=10:00, arrives at Kafka at 14:00
Network delay: event at 14:00:00, arrives at processor at 14:00:35
Clock skew: device clock 5 minutes ahead of real time
Consumer backlog: processor falls behind; processes yesterday’s events today

Use event time for:

Revenue per hour (was the sale made during business hours on Tuesday?)
User session duration (how long was the user actually active?)
Daily/weekly active users (was the user active on that calendar day?)
Any business metric that cares about WHEN something happened

Use processing time for:

“Is our pipeline keeping up?” (lag monitoring)
“How fast is our stream processing?” (throughput SLAs)
Anything that doesn’t care about the actual moment of occurrence

Rule: Default to event time for all business logic. Only use processing time for operational metrics about the pipeline itself.

What is a watermark in stream processing, and what are the trade-offs between fixed-lag and heuristic watermarks?
?
Watermark: The stream processor’s estimate of “all events with event_time <= T have likely arrived.”

When watermark advances past time T → window for time T is closed → results emitted
Expressed as: watermark = max_event_time_seen - lag

Fixed-lag watermark:

watermark = max_event_time_seen - fixed_lag (e.g., 2 minutes)
If latest event = 14:10:00 → watermark = 14:08:00
Windows for 14:00–14:08 can be closed

✅ Simple; predictable latency
❌ Events arriving > 2 minutes late are dropped (or go to side output)
❌ Not adaptive to actual data patterns

Heuristic watermark:

Observe actual late arrival distribution (P95, P99)
Set lag = P99 observed delay + buffer
Adapt as patterns change

✅ Better completeness; adapts to actual data
❌ Complex to configure; still an estimate
❌ Past patterns may not predict future delays

The fundamental trade-off:

Larger lag → more completeness (wait longer for late events) → more latency
Smaller lag → less latency → more late events dropped
You CANNOT have 100% completeness + zero latency simultaneously
The watermark is the knob you turn to choose your position on this curve

Handling late events (after watermark):

Drop (simple; data loss)
Grace period (keep window open N extra seconds; Flink .allowedLateness())
Side output (route late events to separate stream for special handling)

Stream Joins

Explain the three types of stream joins and give a concrete use case for each.
?
1. Stream-Table Join (enrichment):

Join each stream event with a lookup table loaded into state store
Table side: slowly-changing reference data (user profiles, product catalog)
No expiration of table state; table is persistent

Stream: [order_id=1, user_id=42, amount=99]
Table:  {user_id=42 → tier="gold", country="US"}
Output: [order_id=1, user_id=42, amount=99, tier="gold"]

Use case: Enrich order events with customer tier; enrich click events with page metadata
Keep fresh: subscribe CDC events to update the state store

2. Stream-Stream Join (event correlation):

Correlate events from two streams within a time window
Both streams buffered in state; events expire after window duration
Unmatched events are dropped when window expires

Stream A: [user=42, ad_id=9, impression_t=14:00:05]
Stream B: [user=42, ad_id=9, click_t=14:07:33]
Join condition: same user+ad; click within 30min of impression
Output: [user=42, ad_id=9, time_to_click=7m28s]

Use case: Ad impression → click attribution; search query → purchase within 1 hour

3. Table-Table Join (incremental materialized view):

Both inputs are tables (CDC streams from two databases)
Maintains join result as incrementally updated materialized view

CDC stream A: customer row changes
CDC stream B: order row changes
Materialized view: customers JOIN orders
Updated on each CDC event → always reflects current DB state

Use case: ksqlDB materialized views; Kafka Streams global KTables

Fault Tolerance and Exactly-Once

What are the three delivery guarantees in stream processing? When do you use each?
?
At-most-once:

Process event; if failure, do NOT retry
Risk: events lost permanently on failure
Overhead: minimal (no tracking, no retry)
Use when: non-critical metrics, sampling, logging where some loss is acceptable

At-least-once:

Retry on failure; an event may be processed multiple times
Requirement: processing must be idempotent (processing same event twice = same result)
Overhead: low (track consumer offset; retry on failure)
Use when: most production use cases — acceptable if writes are idempotent (e.g., key-value upserts)

Exactly-once:

Each event processed exactly once, even with failures
Overhead: significant (transactions or checkpoints; idempotent sinks required)
Use when: financial transactions, billing, inventory counts, deduplication

Idempotency trick: Even with at-least-once, you can achieve exactly-once semantics if your sink deduplicates by event ID:

Event ID included in each message →
Elasticsearch/database: "if event_id already exists, skip"
→ Effectively exactly-once even with at-least-once delivery

How does Apache Flink achieve exactly-once semantics using the Chandy-Lamport checkpoint algorithm?
?
Core idea: Periodically snapshot all operator state + input offsets atomically → on failure, rollback to snapshot and replay.

Step 1: JobManager injects BARRIER markers into all input partitions:

Kafka partition 0: [event1][event2][BARRIER_1][event3]...
Kafka partition 1: [eventA][eventB][BARRIER_1][eventC]...

Step 2: Each operator processes BARRIER:

When BARRIER arrives from ALL input partitions:
- Operator saves its state (e.g., count-per-window HashMap) to durable storage (S3, HDFS)
- Records input offset at this checkpoint
- Forwards BARRIER to downstream operators
- Continues processing (no pause during checkpoint)

Step 3: Checkpoint complete:

All operators have saved state
JobManager records: “checkpoint 42 is consistent”

On failure:

Restart all operators
Restore each operator’s state from checkpoint 42
Re-read Kafka from the offsets recorded at checkpoint 42
Re-process events after checkpoint → deterministic → same results
Downstream receives same output as before failure (exactly-once, if sink is idempotent/transactional)

Requirements for true exactly-once:

Flink state checkpointed (covered above)
Idempotent or transactional sinks: If Elasticsearch receives same document twice → upsert (idempotent). If writing to Kafka → use Kafka transactions.

How do Kafka transactions enable exactly-once between Kafka topics?
?
Problem without transactions:

1. Consumer reads event from input topic at offset 500
2. Processes event → produces result to output topic ✅
3. Crashes before committing offset 500
4. Restart: re-reads offset 500 → produces result AGAIN → duplicate in output

Kafka transactions solution:

producer.initTransactions();
 
// For each batch of events:
producer.beginTransaction();
try {
    // Write output record
    producer.send(new ProducerRecord<>("output-topic", key, result));
    // Commit input offset AS PART OF transaction
    producer.sendOffsetsToTransaction(offsetsToCommit, consumerGroupId);
    // Commit: both output write + offset update are atomic
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();  // Rolls back both output write AND offset
}

Consumer must use: isolation.level=read_committed

Skips records that are part of an uncommitted (or aborted) transaction
Only sees records from committed transactions → no duplicates visible

Overhead: ~5–10% throughput reduction compared to non-transactional.
Use case: Kafka-to-Kafka stream processing pipelines (Kafka Streams, Flink-to-Kafka).

Kafka vs Kinesis vs Pulsar vs RabbitMQ

Compare Kafka, Kinesis, Pulsar, and RabbitMQ. When would you choose each?
?

Property	Kafka	Kinesis	Pulsar	RabbitMQ
Type	Log-based (pull)	Log-based (pull)	Log + tiered storage	Message queue (push)
Retention	Configurable	1–365 days	Tiered (S3 cheap)	Until consumed
Replay	Yes	Yes	Yes	No
Ordering	Per partition	Per shard	Per partition	Per queue
Ops burden	Medium (self-managed)	Low (AWS managed)	Medium	Low
Ecosystem	Richest (Debezium, Connect, Flink, ksqlDB)	AWS-native	Growing	Enterprise legacy

Choose Kafka:

Open source, maximum ecosystem (Debezium CDC, Kafka Connect, Flink, ksqlDB)
Need Kafka Streams embedded in application
Not locked into a single cloud

Choose Kinesis:

AWS-native stack (tight IAM, Lambda triggers, MSK managed Kafka alternative)
Simplicity over flexibility; AWS handles operations

Choose Pulsar:

Long message retention at low cost (tiered storage: recent in memory, old in S3)
Multi-tenancy (multiple teams, isolated namespaces on shared cluster)
Geo-replication built in

Choose RabbitMQ (or SQS):

Simple task queues with no need for replay
AMQP protocol required
Very low latency (sub-millisecond)

State Management

What is the difference between stateless and stateful stream processing? What makes stateful hard?
?
Stateless stream processing:

Each event processed independently; no memory of past events
Examples: parse JSON, filter events, enrich from a static table
Trivial fault tolerance: drop the event and retry
Horizontal scaling: any worker can process any event

Stateful stream processing:

Processing requires memory of past events (accumulated state)
Examples: count events per user in 5-minute windows; detect 3 failed transactions within 5 minutes; deduplicate by event ID
State stored in: RocksDB (Flink, Kafka Streams), embedded KV store
State must be durable: if worker fails, state must be recoverable

Why stateful is hard:

State recovery: On failure, state must be restored (Flink checkpoints to S3)
State size: Window state grows unbounded without expiry policies
Repartitioning: Adding workers requires redistributing state (expensive)
Correctness: Out-of-order events can invalidate already-emitted window results
Exactly-once: Stateful exactly-once requires both state checkpoint AND idempotent output

RocksDB in Flink:

Default state backend for large state
Writes state to local SSD; checkpoints serialized state to S3
Can handle state much larger than RAM (spills to disk)

Modern Tools

What is ksqlDB and how does it differ from Apache Flink for stream processing?
?
ksqlDB (Confluent):

SQL interface directly over Kafka topics
Runs inside Kafka cluster (no separate processing cluster)
Continuous queries: SELECT ... FROM topic EMIT CHANGES
Materialized views as Kafka topics (queryable via REST)

ksqlDB strengths:

No separate cluster; simpler ops
Fast to prototype; SQL familiar
Built-in Kafka integration (Kafka Connect sources/sinks)

ksqlDB limitations:

Less flexible than Flink for complex stateful logic
No Python API; SQL only
Less efficient for very high throughput
Limited windowed join options

Apache Flink strengths:

Full-featured stream processor; handles complex stateful pipelines
Exactly-once with strong guarantees
Python, Java, SQL APIs
Better performance for complex stateful jobs at scale
Batch + stream unified (same runtime)

Decision:

Simple filtering, enrichment, basic aggregations on Kafka → ksqlDB
Complex stateful joins, ML feature computation, high-throughput, exactly-once → Apache Flink
Unified batch + stream, Kubernetes deployment → Apache Flink

Architecture and Design

What is a streaming lakehouse and how does it replace traditional Lambda architecture?
?
Traditional Lambda architecture:

Raw data → Batch layer (Spark, hours lag) ──┐
         → Speed layer (Flink, seconds lag) ──┤→ Serving layer (merge both)

Problem: Two codebases; two sets of bugs; complex merge at query time.

Streaming Lakehouse (2026 standard):

Events → Flink → Apache Iceberg tables on S3 → Trino/DuckDB queries
          ↑ writes ACID transactions to Iceberg
          ↑ query tools read directly (no separate serving layer)

How Flink + Iceberg replaces Lambda:

Flink writes streaming events to Iceberg tables with ACID guarantees
Historical reprocessing: replay Kafka topic, write to new Iceberg table version
BI tools (Superset, Looker, Redash) query Iceberg directly via Trino or Athena
Sub-minute data freshness without a separate OLAP system
Time travel: query data as of any past checkpoint

Why better than Lambda:

Single codebase (Flink handles both real-time and historical)
No serving layer to maintain (BI tools query lakehouse directly)
ACID guarantees (no dirty reads from partially-written batch outputs)
Lower operational complexity

What is event sourcing and how does it relate to CDC?
?
Event sourcing: Store ALL state changes as an immutable sequence of domain events; current state is derived by replaying events.

Traditional: Store current state (mutable row)
  orders table: {id=1, status="shipped", amount=99}  ← last write wins

Event sourcing: Store all events (immutable append)
  order_events: [OrderPlaced(99), PaymentConfirmed, ItemShipped]
  Current state = reduce(order_events) = {status="shipped", ...}

Event sourcing properties:

Complete audit trail (every state transition recorded)
Time travel: derive state at any past point by replaying to that point
Reprocessing: change business logic → replay all events → new derived state
Events are facts; state is derived

Difference from CDC:

	CDC	Event Sourcing
Level	Infrastructure (DB row changes)	Application (domain events)
Designed for	Sync derived systems	Application architecture
Event granularity	SQL row-level (INSERT/UPDATE/DELETE)	Business events (“OrderShipped”)
Who uses	Platform/infrastructure team	Application developers
Relationship	CDC reads DB writes	Event sourcing stores domain events; CDC can stream them

In practice: Event sourcing systems often USE CDC infrastructure. The event store (Kafka or EventStoreDB) can be treated as a durable log; CDC pipelines can consume it to sync derived stores (Elasticsearch, data warehouse).

Total Cards: 16
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch12-flashcards