Chapter 12 Flashcards — Stream Processing
flashcards ddia-2e chapter12 stream-processing
Core Concepts
What is stream processing and how does it differ from batch processing?
?
Stream processing: Process an unbounded (infinite) stream of events as they arrive; the job never terminates.
- Latency: milliseconds to seconds
- Input is never complete (new events keep arriving)
- State is maintained across events (windows, joins, aggregations)
Batch processing: Process a bounded (finite) dataset; job starts, processes all data, terminates.
- Latency: minutes to hours
- Input is complete before processing begins
The unified view: Stream processing is batch processing applied to unbounded data. The same dataflow concepts (map, filter, group, join) apply — the difference is the time model. Batch uses event-time windows; stream must handle out-of-order arrival and watermarks.
Key tension: Correctness (wait for all events) vs latency (act now). Every stream processing design decision navigates this tension.
What is the key difference between a message queue (RabbitMQ) and a log-based broker (Kafka)?
?
Message Queue (RabbitMQ, SQS):
- Message deleted after consumed by one consumer
- No replay: once consumed, message is gone
- Competing consumers: multiple consumers compete for each message (load-balanced)
- Push-based: broker pushes to consumer
- Good for: task queues, work distribution
Log-Based Broker (Kafka, Kinesis):
- Messages retained for configurable period (not deleted on consumption)
- Replay: any consumer can seek to any past offset and re-read
- Fan-out: multiple independent consumer groups each read ALL messages
- Pull-based: consumer controls its own read pace (natural backpressure)
- Good for: event streaming, CDC, audit, stream processing, replay
The difference that matters: With Kafka, a consumer crash at offset 500 means re-read from offset 500 on restart — no data loss. With RabbitMQ, if message is consumed but processing crashes before ack, it can be lost.
Messaging Systems
What is the role of Kafka partitions, and how does partition count affect parallelism?
?
Partitions divide a topic for parallelism and ordering.
Within a partition: Messages totally ordered (sequence guaranteed)
Across partitions: No ordering guarantee
Producer routing:
- With key:
hash(key) % num_partitions→ same key always goes to same partition (ordering per entity) - Without key: round-robin (load-balanced across partitions)
Consumer parallelism:
- One consumer per partition (within a consumer group)
- If 3 partitions + 3 consumers → all work in parallel
- If 3 partitions + 5 consumers → 2 consumers sit idle
- If 3 partitions + 2 consumers → one consumer reads 2 partitions
Practical rule: Set partition count = max expected consumer parallelism. Can only increase partitions (not decrease); key-based routing changes on repartition.
Why it matters for ordering: If you need all events for a given user to be processed in order, route user_id as the Kafka key → all events for that user go to the same partition → processed in order by one consumer.
What is Kafka log compaction and why is it essential for CDC?
?
Log compaction: Kafka retains only the most recent value for each key in a compacted topic.
Before compaction:
[key=A: v1] [key=B: v2] [key=A: v3] [key=A: v4] [key=C: v5]
After compaction:
[key=B: v2] [key=A: v4] [key=C: v5]
(earlier versions of A removed; only latest kept)
Tombstone messages: A message with key=X, value=null means “delete key X”. After compaction, tombstone retained briefly then removed.
Why essential for CDC:
- CDC topic: each key = a row in the database; value = current row state
- Reading compacted topic from offset 0 = current snapshot of all rows
- A new Elasticsearch indexer consumer can bootstrap from offset 0 (no need to replay full history)
- Without compaction: consumer must replay all events from the beginning (potentially years) to build current state
Normal topic vs compacted topic:
- Normal: retention by time or size; old messages expire
- Compacted: retention by key; latest value always available; no time-based expiry for latest values
Change Data Capture (CDC)
What is Change Data Capture (CDC) and why is it better than dual writes?
?
Dual writes (anti-pattern):
Application: write to PostgreSQL AND Elasticsearch simultaneously
Problem 1: If PostgreSQL succeeds but Elasticsearch fails → diverged state
Problem 2: Two concurrent writers → race condition → different ordering in each system
CDC (correct approach):
Application: write to PostgreSQL only (single source of truth)
Debezium: reads PostgreSQL WAL → publishes changes to Kafka
Elasticsearch: consumes Kafka → applies changes
How CDC works:
- Every INSERT/UPDATE/DELETE in PostgreSQL writes to the Write-Ahead Log (WAL)
- Debezium reads the WAL (as a replication client)
- Debezium transforms each change into a JSON/Avro event with before/after state
- Event published to Kafka topic (one topic per table)
- All derived systems subscribe to the Kafka topic independently
Benefits:
- Single source of truth (DB remains authoritative)
- Replay: rebuild Elasticsearch from scratch by replaying the topic
- New derived systems can catch up from any point in history
- No application code changes needed
CDC event structure (Debezium):
{"before": {...old values...}, "after": {...new values...},
"op": "u", // c=create, u=update, d=delete, r=read
"ts_ms": 1716998400000}Walk through the concrete setup of a Debezium CDC pipeline from PostgreSQL to Elasticsearch.
?
Components: PostgreSQL → Debezium (Kafka Connect) → Kafka → Flink → Elasticsearch
Step 1: Configure PostgreSQL:
-- Enable logical replication
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 4;
-- Create replication user
CREATE USER debezium REPLICATION LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
-- Create publication
CREATE PUBLICATION dbz_publication FOR ALL TABLES;Step 2: Deploy Debezium Kafka Connect connector:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres", "database.port": "5432",
"database.user": "debezium", "database.dbname": "myapp",
"table.include.list": "public.orders",
"topic.prefix": "postgres",
"plugin.name": "pgoutput"
}Step 3: Kafka topic created: postgres.public.orders — receives all CDC events
Step 4: Flink job consumes topic, transforms events, writes to Elasticsearch with op routing:
c(create) → Elasticsearch indexu(update) → Elasticsearch updated(delete) → Elasticsearch delete
Step 5: Elasticsearch search index updated within seconds of DB write
On Debezium restart: Debezium persists its WAL position (LSN offset) in Kafka → on restart, continues from last processed position. No events lost.
Reasoning About Time
What is the difference between event time and processing time, and which should you use for business logic?
?
Event time: When the event actually occurred (timestamp embedded in the event by the producer).
Processing time: When the event is processed by the stream processor (processor’s system clock).
Why they differ:
- Mobile app buffers events offline 4 hours → event_time=10:00, arrives at Kafka at 14:00
- Network delay: event at 14:00:00, arrives at processor at 14:00:35
- Clock skew: device clock 5 minutes ahead of real time
- Consumer backlog: processor falls behind; processes yesterday’s events today
Use event time for:
- Revenue per hour (was the sale made during business hours on Tuesday?)
- User session duration (how long was the user actually active?)
- Daily/weekly active users (was the user active on that calendar day?)
- Any business metric that cares about WHEN something happened
Use processing time for:
- “Is our pipeline keeping up?” (lag monitoring)
- “How fast is our stream processing?” (throughput SLAs)
- Anything that doesn’t care about the actual moment of occurrence
Rule: Default to event time for all business logic. Only use processing time for operational metrics about the pipeline itself.
What is a watermark in stream processing, and what are the trade-offs between fixed-lag and heuristic watermarks?
?
Watermark: The stream processor’s estimate of “all events with event_time <= T have likely arrived.”
- When watermark advances past time T → window for time T is closed → results emitted
- Expressed as:
watermark = max_event_time_seen - lag
Fixed-lag watermark:
watermark = max_event_time_seen - fixed_lag (e.g., 2 minutes)
If latest event = 14:10:00 → watermark = 14:08:00
Windows for 14:00–14:08 can be closed
✅ Simple; predictable latency
❌ Events arriving > 2 minutes late are dropped (or go to side output)
❌ Not adaptive to actual data patterns
Heuristic watermark:
Observe actual late arrival distribution (P95, P99)
Set lag = P99 observed delay + buffer
Adapt as patterns change
✅ Better completeness; adapts to actual data
❌ Complex to configure; still an estimate
❌ Past patterns may not predict future delays
The fundamental trade-off:
- Larger lag → more completeness (wait longer for late events) → more latency
- Smaller lag → less latency → more late events dropped
- You CANNOT have 100% completeness + zero latency simultaneously
- The watermark is the knob you turn to choose your position on this curve
Handling late events (after watermark):
- Drop (simple; data loss)
- Grace period (keep window open N extra seconds; Flink
.allowedLateness()) - Side output (route late events to separate stream for special handling)
Stream Joins
Explain the three types of stream joins and give a concrete use case for each.
?
1. Stream-Table Join (enrichment):
- Join each stream event with a lookup table loaded into state store
- Table side: slowly-changing reference data (user profiles, product catalog)
- No expiration of table state; table is persistent
Stream: [order_id=1, user_id=42, amount=99]
Table: {user_id=42 → tier="gold", country="US"}
Output: [order_id=1, user_id=42, amount=99, tier="gold"]
- Use case: Enrich order events with customer tier; enrich click events with page metadata
- Keep fresh: subscribe CDC events to update the state store
2. Stream-Stream Join (event correlation):
- Correlate events from two streams within a time window
- Both streams buffered in state; events expire after window duration
- Unmatched events are dropped when window expires
Stream A: [user=42, ad_id=9, impression_t=14:00:05]
Stream B: [user=42, ad_id=9, click_t=14:07:33]
Join condition: same user+ad; click within 30min of impression
Output: [user=42, ad_id=9, time_to_click=7m28s]
- Use case: Ad impression → click attribution; search query → purchase within 1 hour
3. Table-Table Join (incremental materialized view):
- Both inputs are tables (CDC streams from two databases)
- Maintains join result as incrementally updated materialized view
CDC stream A: customer row changes
CDC stream B: order row changes
Materialized view: customers JOIN orders
Updated on each CDC event → always reflects current DB state
- Use case: ksqlDB materialized views; Kafka Streams global KTables
Fault Tolerance and Exactly-Once
What are the three delivery guarantees in stream processing? When do you use each?
?
At-most-once:
- Process event; if failure, do NOT retry
- Risk: events lost permanently on failure
- Overhead: minimal (no tracking, no retry)
- Use when: non-critical metrics, sampling, logging where some loss is acceptable
At-least-once:
- Retry on failure; an event may be processed multiple times
- Requirement: processing must be idempotent (processing same event twice = same result)
- Overhead: low (track consumer offset; retry on failure)
- Use when: most production use cases — acceptable if writes are idempotent (e.g., key-value upserts)
Exactly-once:
- Each event processed exactly once, even with failures
- Overhead: significant (transactions or checkpoints; idempotent sinks required)
- Use when: financial transactions, billing, inventory counts, deduplication
Idempotency trick: Even with at-least-once, you can achieve exactly-once semantics if your sink deduplicates by event ID:
Event ID included in each message →
Elasticsearch/database: "if event_id already exists, skip"
→ Effectively exactly-once even with at-least-once delivery
How does Apache Flink achieve exactly-once semantics using the Chandy-Lamport checkpoint algorithm?
?
Core idea: Periodically snapshot all operator state + input offsets atomically → on failure, rollback to snapshot and replay.
Step 1: JobManager injects BARRIER markers into all input partitions:
Kafka partition 0: [event1][event2][BARRIER_1][event3]...
Kafka partition 1: [eventA][eventB][BARRIER_1][eventC]...
Step 2: Each operator processes BARRIER:
- When BARRIER arrives from ALL input partitions:
- Operator saves its state (e.g., count-per-window HashMap) to durable storage (S3, HDFS)
- Records input offset at this checkpoint
- Forwards BARRIER to downstream operators
- Continues processing (no pause during checkpoint)
Step 3: Checkpoint complete:
- All operators have saved state
- JobManager records: “checkpoint 42 is consistent”
On failure:
- Restart all operators
- Restore each operator’s state from checkpoint 42
- Re-read Kafka from the offsets recorded at checkpoint 42
- Re-process events after checkpoint → deterministic → same results
- Downstream receives same output as before failure (exactly-once, if sink is idempotent/transactional)
Requirements for true exactly-once:
- Flink state checkpointed (covered above)
- Idempotent or transactional sinks: If Elasticsearch receives same document twice → upsert (idempotent). If writing to Kafka → use Kafka transactions.
How do Kafka transactions enable exactly-once between Kafka topics?
?
Problem without transactions:
1. Consumer reads event from input topic at offset 500
2. Processes event → produces result to output topic ✅
3. Crashes before committing offset 500
4. Restart: re-reads offset 500 → produces result AGAIN → duplicate in output
Kafka transactions solution:
producer.initTransactions();
// For each batch of events:
producer.beginTransaction();
try {
// Write output record
producer.send(new ProducerRecord<>("output-topic", key, result));
// Commit input offset AS PART OF transaction
producer.sendOffsetsToTransaction(offsetsToCommit, consumerGroupId);
// Commit: both output write + offset update are atomic
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction(); // Rolls back both output write AND offset
}Consumer must use: isolation.level=read_committed
- Skips records that are part of an uncommitted (or aborted) transaction
- Only sees records from committed transactions → no duplicates visible
Overhead: ~5–10% throughput reduction compared to non-transactional.
Use case: Kafka-to-Kafka stream processing pipelines (Kafka Streams, Flink-to-Kafka).
Kafka vs Kinesis vs Pulsar vs RabbitMQ
Compare Kafka, Kinesis, Pulsar, and RabbitMQ. When would you choose each?
?
| Property | Kafka | Kinesis | Pulsar | RabbitMQ |
|---|---|---|---|---|
| Type | Log-based (pull) | Log-based (pull) | Log + tiered storage | Message queue (push) |
| Retention | Configurable | 1–365 days | Tiered (S3 cheap) | Until consumed |
| Replay | Yes | Yes | Yes | No |
| Ordering | Per partition | Per shard | Per partition | Per queue |
| Ops burden | Medium (self-managed) | Low (AWS managed) | Medium | Low |
| Ecosystem | Richest (Debezium, Connect, Flink, ksqlDB) | AWS-native | Growing | Enterprise legacy |
Choose Kafka:
- Open source, maximum ecosystem (Debezium CDC, Kafka Connect, Flink, ksqlDB)
- Need Kafka Streams embedded in application
- Not locked into a single cloud
Choose Kinesis:
- AWS-native stack (tight IAM, Lambda triggers, MSK managed Kafka alternative)
- Simplicity over flexibility; AWS handles operations
Choose Pulsar:
- Long message retention at low cost (tiered storage: recent in memory, old in S3)
- Multi-tenancy (multiple teams, isolated namespaces on shared cluster)
- Geo-replication built in
Choose RabbitMQ (or SQS):
- Simple task queues with no need for replay
- AMQP protocol required
- Very low latency (sub-millisecond)
State Management
What is the difference between stateless and stateful stream processing? What makes stateful hard?
?
Stateless stream processing:
- Each event processed independently; no memory of past events
- Examples: parse JSON, filter events, enrich from a static table
- Trivial fault tolerance: drop the event and retry
- Horizontal scaling: any worker can process any event
Stateful stream processing:
- Processing requires memory of past events (accumulated state)
- Examples: count events per user in 5-minute windows; detect 3 failed transactions within 5 minutes; deduplicate by event ID
- State stored in: RocksDB (Flink, Kafka Streams), embedded KV store
- State must be durable: if worker fails, state must be recoverable
Why stateful is hard:
- State recovery: On failure, state must be restored (Flink checkpoints to S3)
- State size: Window state grows unbounded without expiry policies
- Repartitioning: Adding workers requires redistributing state (expensive)
- Correctness: Out-of-order events can invalidate already-emitted window results
- Exactly-once: Stateful exactly-once requires both state checkpoint AND idempotent output
RocksDB in Flink:
- Default state backend for large state
- Writes state to local SSD; checkpoints serialized state to S3
- Can handle state much larger than RAM (spills to disk)
Modern Tools
What is ksqlDB and how does it differ from Apache Flink for stream processing?
?
ksqlDB (Confluent):
- SQL interface directly over Kafka topics
- Runs inside Kafka cluster (no separate processing cluster)
- Continuous queries:
SELECT ... FROM topic EMIT CHANGES - Materialized views as Kafka topics (queryable via REST)
ksqlDB strengths:
- No separate cluster; simpler ops
- Fast to prototype; SQL familiar
- Built-in Kafka integration (Kafka Connect sources/sinks)
ksqlDB limitations:
- Less flexible than Flink for complex stateful logic
- No Python API; SQL only
- Less efficient for very high throughput
- Limited windowed join options
Apache Flink strengths:
- Full-featured stream processor; handles complex stateful pipelines
- Exactly-once with strong guarantees
- Python, Java, SQL APIs
- Better performance for complex stateful jobs at scale
- Batch + stream unified (same runtime)
Decision:
- Simple filtering, enrichment, basic aggregations on Kafka → ksqlDB
- Complex stateful joins, ML feature computation, high-throughput, exactly-once → Apache Flink
- Unified batch + stream, Kubernetes deployment → Apache Flink
Architecture and Design
What is a streaming lakehouse and how does it replace traditional Lambda architecture?
?
Traditional Lambda architecture:
Raw data → Batch layer (Spark, hours lag) ──┐
→ Speed layer (Flink, seconds lag) ──┤→ Serving layer (merge both)
Problem: Two codebases; two sets of bugs; complex merge at query time.
Streaming Lakehouse (2026 standard):
Events → Flink → Apache Iceberg tables on S3 → Trino/DuckDB queries
↑ writes ACID transactions to Iceberg
↑ query tools read directly (no separate serving layer)
How Flink + Iceberg replaces Lambda:
- Flink writes streaming events to Iceberg tables with ACID guarantees
- Historical reprocessing: replay Kafka topic, write to new Iceberg table version
- BI tools (Superset, Looker, Redash) query Iceberg directly via Trino or Athena
- Sub-minute data freshness without a separate OLAP system
- Time travel: query data as of any past checkpoint
Why better than Lambda:
- Single codebase (Flink handles both real-time and historical)
- No serving layer to maintain (BI tools query lakehouse directly)
- ACID guarantees (no dirty reads from partially-written batch outputs)
- Lower operational complexity
What is event sourcing and how does it relate to CDC?
?
Event sourcing: Store ALL state changes as an immutable sequence of domain events; current state is derived by replaying events.
Traditional: Store current state (mutable row)
orders table: {id=1, status="shipped", amount=99} ← last write wins
Event sourcing: Store all events (immutable append)
order_events: [OrderPlaced(99), PaymentConfirmed, ItemShipped]
Current state = reduce(order_events) = {status="shipped", ...}
Event sourcing properties:
- Complete audit trail (every state transition recorded)
- Time travel: derive state at any past point by replaying to that point
- Reprocessing: change business logic → replay all events → new derived state
- Events are facts; state is derived
Difference from CDC:
| CDC | Event Sourcing | |
|---|---|---|
| Level | Infrastructure (DB row changes) | Application (domain events) |
| Designed for | Sync derived systems | Application architecture |
| Event granularity | SQL row-level (INSERT/UPDATE/DELETE) | Business events (“OrderShipped”) |
| Who uses | Platform/infrastructure team | Application developers |
| Relationship | CDC reads DB writes | Event sourcing stores domain events; CDC can stream them |
In practice: Event sourcing systems often USE CDC infrastructure. The event store (Kafka or EventStoreDB) can be treated as a durable log; CDC pipelines can consume it to sync derived stores (Elasticsearch, data warehouse).
Total Cards: 16
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29