Chapter 11 Cheat Sheet - Stream Processing
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Stream processing | Continuous processing of unbounded (never-ending) data streams, in contrast to batch jobs over a fixed input |
| Message queue | Broker hands each message to one consumer; message deleted once that consumer acknowledges it |
| Message log | Append-only; consumers track their own offset; replay supported |
| CDC | Change Data Capture — stream DB changes from replication log to Kafka |
| Event sourcing | Store events, not state; state = fold over event log |
| Event time | When the event actually occurred (timestamp in event) |
| Processing time | When the stream processor received/processed the event |
| Watermark | Assertion that no more events with event time earlier than T are expected to arrive |
| Tumbling window | Fixed-size, non-overlapping time window |
| Sliding window | Overlapping windows (e.g., last 5 minutes, updated each minute) |
| Session window | Ends after inactivity; variable duration |
| Exactly-once | The effect of each event is applied exactly once, despite failures and retries |
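
To make the "state = fold over event log" row concrete, here is a minimal sketch. The bank-account event types (`deposited`, `withdrawn`) and the function names are hypothetical, chosen only for illustration.

```python
from functools import reduce

# Hypothetical event log for a single account (illustrative shapes only).
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def apply_event(balance: int, event: dict) -> int:
    """Pure transition function: old state + one event -> new state."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance  # unknown event types leave the state unchanged

def current_balance(log: list) -> int:
    """Current state is a left fold of apply_event over the whole log."""
    return reduce(apply_event, log, 0)

print(current_balance(events))
```

Because the log is never mutated, replaying a prefix of it (for example `events[:2]`) reconstructs the state as of that point in history.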
Message Queue vs Message Log
| | Message Queue (RabbitMQ, SQS) | Message Log (Kafka, Kinesis) |
|---|---|---|
| Delivery | Broker pushes to consumer (RabbitMQ) or consumer polls (SQS) | PULL: consumer reads at its own pace |
| Retention | Message DELETED after acknowledged consumption | Messages RETAINED (configurable time) |
| Consumers | One consumer per message | Multiple consumer groups (each reads all) |
| Replay | No replay | REPLAY from any retained offset |
| Good for | Task queues, work items | Events, CDC, stream processing |

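
The difference in semantics can be sketched with two toy in-memory classes. This is an illustration under simplifying assumptions (single process, no acknowledgments, no retention limit); `MessageQueue` and `MessageLog` are hypothetical names, not any broker's API.

```python
from collections import deque

class MessageQueue:
    """Queue semantics: a message is gone once one consumer takes it."""
    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)

    def receive(self):
        return self._messages.popleft() if self._messages else None

class MessageLog:
    """Log semantics: append-only; each consumer group tracks its own offset."""
    def __init__(self):
        self._entries = []
        self._offsets = {}  # consumer group -> next offset to read

    def append(self, msg):
        self._entries.append(msg)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._entries):
            return None
        self._offsets[group] = offset + 1
        return self._entries[offset]

    def seek(self, group, offset):
        """Replay: move a group's offset back to re-read old entries."""
        self._offsets[group] = offset
```

With the log, `poll("analytics")` and `poll("billing")` each see every entry independently, and `seek("analytics", 0)` replays from the start; the queue offers neither.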
Kafka Architecture
Topic: user-events
┌─────────────────────────────────────────────────────────────────────┐
│ Partition 0: [event_1, event_2, event_5, event_8, ...] │
│ Partition 1: [event_3, event_6, event_9, ...] │
│ Partition 2: [event_4, event_7, event_10, ...] │
└─────────────────────────────────────────────────────────────────────┘
↑
Consumer Group A (analytics):
  Consumer A1 → Partition 0
  Consumer A2 → Partition 1
  Consumer A3 → Partition 2
Consumer Group B (billing):
  Consumer B1 → Partition 0, 1, 2
Each consumer group reads all events independently
Offset: consumer tracks where it left off (per partition)
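
A rough sketch of how keyed events end up in partitions, and why ordering holds only per partition. `zlib.crc32` is a stand-in chosen for this example (Kafka's default partitioner for keyed records uses murmur2), and `partition_for` and `assign` are hypothetical helpers.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash for illustration; any stable hash gives the same property:
    # the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(events):
    """Append each (key, payload) to its partition, preserving arrival order."""
    partitions = {p: [] for p in range(NUM_PARTITIONS)}
    for key, payload in events:
        partitions[partition_for(key)].append((key, payload))
    return partitions

parts = assign([("user-1", "login"), ("user-2", "login"), ("user-1", "purchase")])
# All "user-1" events share one partition, so their relative order is kept.
print(parts[partition_for("user-1")])
```

Events for different keys may land in different partitions, and nothing orders one partition relative to another.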
CDC Flow
PostgreSQL (source of truth) ──WAL──→ Debezium (CDC tool) ──→ Kafka (events) ──→ Multiple consumers:
  ├─ Elasticsearch (search index)
  ├─ Snowflake (analytics)
  ├─ Redis (cache)
  └─ Other microservices
vs Dual-write (WRONG):
App → DB
App → Kafka   ←─ two independent writes: race condition risk, partial failure
CDC guarantees: the app writes only to the DB; everything else is derived from the DB's log
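
A downstream consumer keeps its derived view in sync by applying change events in log order. The sketch below uses a simplified envelope loosely modeled on Debezium's `op` / `before` / `after` fields; the real event format carries more metadata, and the row shape (`id`, `name`) is an assumption for this example.

```python
def apply_change(view: dict, change: dict) -> None:
    """Apply one change event to a derived key-value view (e.g. a cache)."""
    op = change["op"]
    if op in ("c", "u", "r"):      # create, update, snapshot read
        row = change["after"]
        view[row["id"]] = row
    elif op == "d":                # delete
        view.pop(change["before"]["id"], None)

def build_view(changes: list) -> dict:
    view = {}
    for change in changes:         # order matters: apply in log order
        apply_change(view, change)
    return view

changes = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]
print(build_view(changes))
```

Since every consumer applies the same ordered log, the search index, cache, and warehouse converge on the same state as the database.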
Event Time vs Processing Time
Timeline of one event (mobile app offline when the user clicks):
10:00:00  User clicks (event time, stamped on the device)
          Mobile app buffers the event
          User comes online
10:00:45  Kafka receives the event
10:00:48  Stream processor processes the event
Event time = 10:00:00
Processing time = 10:00:48
LAG = 48 seconds!
For comparison, an online client: click at 10:00:00 → Kafka receives at 10:00:05 → stream processor processes at 10:00:07
Always use EVENT TIME for business logic:
✅ "Events in the 10:00 minute" = events with timestamps 10:00:00-10:00:59
❌ "Events in the 10:00 minute" = events processed during 10:00 (wrongly includes late arrivals from earlier minutes, e.g., an event stamped 09:59:50 that is processed at 10:00:38)
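
The effect of the two choices can be seen by bucketing the same events both ways. The timestamps below are made up for illustration; each tuple is (event time, processing time).

```python
from collections import Counter
from datetime import datetime

def minute_bucket(ts: datetime) -> str:
    return ts.strftime("%H:%M")

# (event_time, processing_time); the first event was buffered on the device.
events = [
    (datetime(2026, 1, 1, 9, 59, 50), datetime(2026, 1, 1, 10, 0, 38)),
    (datetime(2026, 1, 1, 10, 0, 0), datetime(2026, 1, 1, 10, 0, 5)),
    (datetime(2026, 1, 1, 10, 0, 30), datetime(2026, 1, 1, 10, 1, 2)),
]

by_event_time = Counter(minute_bucket(e) for e, _ in events)
by_processing_time = Counter(minute_bucket(p) for _, p in events)

print(dict(by_event_time))       # counts per minute in which events happened
print(dict(by_processing_time))  # counts per minute in which events were handled
```

Both bucketings count three events in total, but they distribute them across different minutes; only the event-time version answers "what happened during 10:00".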
Window Types
Tumbling Window (size=5min):
|──5min──|──5min──|──5min──|──5min──|
[10:00 ][10:05 ][10:10 ][10:15 ]
Non-overlapping; each event in exactly one window
Sliding Window (size=10min, slide=5min):
|────────10min─────────|
           |────────10min─────────|
                      |────────10min─────────|
[10:00-10:10][10:05-10:15][10:10-10:20]
Overlapping; each event appears in size/slide windows (here 2)
Session Window (gap=30min):
|──── user active ────| |── active ──|
^30min gap
Variable size; ends when user inactive 30min
One session per user activity burst
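
The three window types boil down to different assignment rules. A minimal sketch, assuming timestamps are plain integers (minutes since midnight here) and half-open windows `[start, end)`; the function names are hypothetical.

```python
def tumbling_window(ts: int, size: int) -> tuple:
    """The single [start, end) window of length `size` containing ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts: int, size: int, slide: int) -> list:
    """All [start, end) windows of length `size`, starting on multiples
    of `slide`, that contain ts."""
    windows = []
    start = ts - ts % slide            # latest window start <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps: list, gap: int) -> list:
    """Group timestamps into sessions; a pause of `gap` or more starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts       # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    return [tuple(s) for s in sessions]

t = 10 * 60 + 7                        # 10:07 as minutes since midnight
print(tumbling_window(t, 5))           # the 10:05-10:10 window
print(sliding_windows(t, 10, 5))       # the 10:00-10:10 and 10:05-10:15 windows
print(session_windows([0, 10, 25, 70, 80], gap=30))
```

Note that the session function needs all of a user's timestamps, which is why session state is harder to bound than tumbling or sliding state.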
Watermarks and Late Events
Event stream (event time in brackets):
[10:00] [10:01] [10:03] [09:59 late!] [10:04] ...
Watermark at time T:
"I believe all events with timestamp < T have arrived"
Watermark = max(event_time_seen) - max_lateness_expected
If max lateness = 2 minutes:
Current max event time = 10:05 → watermark = 10:03
When watermark passes the END of a window (e.g., 10:01:00 for the one-minute 10:00 window) → close that window → emit aggregate
Late event options:
1. Drop: simple; loses data
2. Side output: emit late events separately for special handling
3. Update window: reemit corrected aggregate (downstream must handle updates)
4. Grace period (allowed lateness): keep window state until the watermark passes window_end + grace_period
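
The watermark rule and the "side output" option can be sketched together. This is a toy single-threaded model, not a stream processor's API: `WatermarkedCounter` is a hypothetical class, timestamps are integer minutes, and windows are half-open tumbling windows that close once the watermark reaches their end.

```python
class WatermarkedCounter:
    """Counts events per tumbling window; closes a window once the
    watermark reaches its end, and routes late events to a side output."""

    def __init__(self, window_size: int, max_lateness: int):
        self.window_size = window_size
        self.max_lateness = max_lateness
        self.max_event_time = None
        self.open_windows = {}   # window start -> running count
        self.emitted = []        # (window start, final count)
        self.late = []           # side output for events that missed their window

    @property
    def watermark(self):
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_lateness

    def on_event(self, event_time: int) -> None:
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark:
            self.late.append(event_time)          # window already closed
            return
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        for s in sorted(self.open_windows):       # close windows the watermark passed
            if s + self.window_size <= self.watermark:
                self.emitted.append((s, self.open_windows.pop(s)))

counter = WatermarkedCounter(window_size=1, max_lateness=2)
for t in [600, 601, 603, 599, 604]:   # 10:00, 10:01, 10:03, 09:59 (late), 10:04
    counter.on_event(t)
print(counter.emitted, counter.late, counter.watermark)
```

A larger `max_lateness` drops fewer events but delays every result by the same amount, which is the core trade-off when tuning watermarks.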
Exactly-Once Semantics
| | At-most-once | At-least-once | Exactly-once |
|---|---|---|---|
| Strategy | Don't retry | Retry on failure | Retry + deduplicate |
| Outcome | Might miss events | Might duplicate events | Each event processed once |

APPROACHES:
1. Idempotent output (easiest):
Write with unique event_id; deduplicate at sink
Retry is safe — same ID → same result
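2. Kafka Transactions:
Producer writes output + commits input offsets in one atomic transaction
Both take effect or neither does; consumers using isolation.level=read_committed never see aborted output
3. Flink Checkpointing (asynchronous barrier snapshots, a Chandy-Lamport variant):
Barrier markers flow through stream
Each operator snapshots its state when the barrier reaches it
On failure: roll back to the last checkpoint + replay input from there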
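
Approach 1 is the simplest to sketch. The class below is a hypothetical in-memory sink that deduplicates by `event_id`; in a real sink the "seen" check and the update would need to happen atomically (for example in one database transaction), and the set of seen IDs would need some expiry policy.

```python
class IdempotentSink:
    """Applies each event_id at most once, so redelivery is harmless."""

    def __init__(self):
        self.totals = {}        # account -> running total
        self.seen_ids = set()   # event IDs already applied

    def write(self, event_id: str, account: str, amount: int) -> bool:
        if event_id in self.seen_ids:
            return False        # duplicate delivery: ignore
        self.seen_ids.add(event_id)
        self.totals[account] = self.totals.get(account, 0) + amount
        return True

sink = IdempotentSink()
deliveries = [
    ("evt-1", "acct-A", 50),
    ("evt-2", "acct-A", 25),
    ("evt-1", "acct-A", 50),    # retry after a timeout: same ID, same payload
]
for event_id, account, amount in deliveries:
    sink.write(event_id, account, amount)
print(sink.totals)
```

With a sink like this, at-least-once delivery upstream yields exactly-once effects downstream.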
Stream Join Types
Stream-Table Join (enrichment):
Orders stream ──┐
├──→ Enrich with user profile ──→ enriched orders
Users table ────┘ (loaded into memory / cached from CDC)
Stream-Stream Join (correlation):
Searches stream ──┐
├──→ Match searches to clicks ──→ search sessions
Clicks stream ────┘ (both buffered in state; time window for match)
Table-Table Join (materialized view):
Orders CDC ──┐
├──→ Join orders + users ──→ materialized view
Users CDC ───┘ (incremental updates when either changes)
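
The first two join types can be sketched as generators over in-memory streams. Both are simplified illustrations with hypothetical record shapes: the stream-table join keeps the latest user row per key, and the stream-stream join assumes events arrive in timestamp order, so only the searches side needs buffering.

```python
def enrich_orders(events):
    """Stream-table join. `events` interleaves ("user", row) upserts from CDC
    with ("order", row) records; each order is enriched with the user row
    known at the moment the order is processed."""
    users = {}  # local copy of the users table, kept fresh by CDC
    for kind, record in events:
        if kind == "user":
            users[record["user_id"]] = record
        elif kind == "order":
            profile = users.get(record["user_id"])
            yield {**record, "user_name": profile["name"] if profile else None}

def join_clicks_to_searches(events, window):
    """Stream-stream join. `events` are time-ordered (kind, query_id, ts);
    a click matches a buffered search with the same query_id that is at
    most `window` time units old."""
    searches = {}  # query_id -> search timestamp (the join state)
    for kind, query_id, ts in events:
        for old in [q for q, t in searches.items() if ts - t > window]:
            del searches[old]   # expire state so it stays bounded
        if kind == "search":
            searches[query_id] = ts
        elif kind == "click" and query_id in searches:
            yield (query_id, searches[query_id], ts)

orders = list(enrich_orders([
    ("user", {"user_id": 1, "name": "Ada"}),
    ("order", {"order_id": 10, "user_id": 1}),
    ("order", {"order_id": 11, "user_id": 2}),   # user 2 not seen yet
]))
matches = list(join_clicks_to_searches(
    [("search", "q1", 0), ("search", "q2", 5), ("click", "q1", 20), ("click", "q2", 100)],
    window=60,
))
print(orders, matches)
```

The expiry loop in the second function is what keeps join state from growing without bound.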
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| Message queue | Simple, push-based | No replay, one consumer | Task queues |
| Message log | Replay, fan-out | More complex | Events, CDC, streaming |
| Tumbling window | Simple, no overlap | Boundary effects (event at edge) | Hourly/daily aggregates |
| Session window | Natural user behavior | Unpredictable state size | User activity tracking |
| Event time | Correct business semantics | Late arrival complexity | Business metrics |
| Processing time | Simple, always available | Wrong for out-of-order | SLA monitoring |
| Exactly-once | Correct counts/sums | Expensive; lower throughput | Financial, billing |
| At-least-once + idempotent | Simple, faster | Requires idempotent design | Most analytics |
Red Flags
❌ Using processing time for business metrics (event order not guaranteed)
❌ Assuming Kafka messages are globally ordered (only within a partition)
❌ Dual writes to DB + Kafka (race condition; use CDC instead)
❌ Accumulating unbounded state without expiry (will OOM)
❌ Not handling late events (silent wrong results for windowed aggregates)
Green Flags
✅ Use event time for business logic; processing time for operational metrics
✅ Set watermarks based on observed event arrival patterns
✅ Use CDC (Debezium) to sync DB changes to derived systems
✅ Design sinks to be idempotent (accept at-least-once easily)
✅ Monitor consumer lag per partition (key operational metric)
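
Consumer lag per partition is the gap between the newest offset in the log and the offset the group has committed. A minimal sketch with made-up offsets; `consumer_lag` is a hypothetical helper, and treating a missing commit as offset 0 is an assumption of this example.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition = log end offset - committed offset.
    A partition with no committed offset is treated as starting from 0."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }

end_offsets = {0: 120, 1: 98, 2: 45}
committed = {0: 118, 1: 98, 2: 10}
lag = consumer_lag(end_offsets, committed)
print(lag, sum(lag.values()))
```

Tracking lag per partition rather than only in total helps spot a single hot or stuck partition that an aggregate number would hide.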
Modern Additions (2026)
Flink SQL:
├─ Stream processing as SQL (windows, joins, aggregates)
├─ Same SQL for batch and stream (unified API)
└─ Example (assuming a time attribute column named event_time): SELECT window_start, user_id, COUNT(*) FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))...
Streaming Lakehouse:
├─ Flink → Apache Iceberg on S3 (ACID streaming writes)
├─ Delta Lake + Spark Structured Streaming
└─ Low-latency analytics without separate systems
Feature stores (ML):
├─ Feast, Tecton: serve online features for ML inference
├─ Stream processing computes real-time features
└─ Feature freshness: seconds (vs batch's hours)
Interview Response Templates
When Asked to Design a Real-Time Analytics System
“I’d use Kafka as the event log — durable, replayable, fan-out to multiple consumers. Events would include timestamps (event time) rather than relying on when they arrive. A stream processor (Flink or Kafka Streams) would aggregate events using tumbling windows over event time, with watermarks to handle late arrivals. The output would go to a serving store (Redis for counts, Druid/ClickHouse for ad-hoc queries). I’d design all writes to be idempotent so at-least-once delivery is safe.”
When Asked About CDC vs Dual Writes
“Dual writes have two fundamental problems. Partial failure: if the app writes to the DB and then to Kafka, a crash between the two leaves them inconsistent. Race conditions: two concurrent writers can reach the DB and Kafka in different orders, so the two systems disagree on the final value. CDC solves both by making the DB the single source of truth and deriving all other systems from the DB’s replication log. Debezium reads the PostgreSQL WAL or MySQL binlog and emits changes to Kafka, and all downstream systems (search, cache, analytics) consume from Kafka. Kafka may lag the DB slightly, but it cannot permanently diverge from it, because it is derived from the DB’s own log.”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13