Chapter 11 Cheat Sheet - Stream Processing

One-Line Summaries

Concept	One-Liner
Stream processing	Batch processing over unbounded (infinite) data streams
Message queue	Push-based; message deleted after one consumer receives it
Message log	Append-only; consumers track their own offset; replay supported
CDC	Change Data Capture — stream DB changes from replication log to Kafka
Event sourcing	Store events, not state; state = fold over event log
Event time	When the event actually occurred (timestamp in event)
Processing time	When the stream processor received/processed the event
Watermark	Estimate of how far behind in event time the stream is
Tumbling window	Fixed-size, non-overlapping time window
Sliding window	Overlapping windows (e.g., last 5 minutes, updated each minute)
Session window	Ends after inactivity; variable duration
Exactly-once	Each event processed exactly once despite failures

Message Queue vs Message Log

Message Queue (RabbitMQ, SQS):        Message Log (Kafka, Kinesis):
──────────────────────────────         ────────────────────────────────────
PUSH: broker pushes to consumer        PULL: consumer reads at own pace
Message DELETED after consumption      Messages RETAINED (configurable time)
One consumer per message               Multiple consumer groups (each reads all)
No replay                              REPLAY from any offset
Good for: task queues, work items      Good for: events, CDC, stream processing

Kafka Architecture

Topic: user-events
┌─────────────────────────────────────────────────────────────────────┐
│ Partition 0: [event_1, event_2, event_5, event_8, ...]              │
│ Partition 1: [event_3, event_6, event_9, ...]                       │
│ Partition 2: [event_4, event_7, event_10, ...]                      │
└─────────────────────────────────────────────────────────────────────┘
                              ↑
Consumer Group A (analytics):          Consumer Group B (billing):
  Consumer A1 → Partition 0             Consumer B1 → Partition 0, 1, 2
  Consumer A2 → Partition 1
  Consumer A3 → Partition 2

Each consumer group reads all events independently
Offset: consumer tracks where it left off (per partition)

CDC Flow

PostgreSQL ──WAL──→ Debezium ──→ Kafka ──→ Multiple consumers:
(source of truth)  (CDC tool)  (events)     ├─ Elasticsearch (search index)
                                             ├─ Snowflake (analytics)
                                             ├─ Redis (cache)
                                             └─ Other microservices

vs Dual-write (WRONG):
App → DB  ←─ race condition risk, partial failure
App → Kafka

CDC guarantees: only one write to DB; everything else derived from that

Event Time vs Processing Time

Event timeline:                         Processing timeline:
─────────────────────────────────       ─────────────────────────────────────
User clicks at t=10:00:00               Kafka receives event at t=10:00:05
Mobile app buffers                      Stream processor processes at t=10:00:07
User comes online
Kafka receives at t=10:00:45            Event time = 10:00:00
Stream processor at t=10:00:48          Processing time = 10:00:48
                                        LAG = 48 seconds!

Always use EVENT TIME for business logic:
✅ "Events in the 10:00 minute" = events with timestamps 10:00:00-10:00:59
❌ "Events in the 10:00 minute" = events processed during 10:00 (includes late arrivals at 10:00:48)

Window Types

Tumbling Window (size=5min):
  |──5min──|──5min──|──5min──|──5min──|
  [10:00   ][10:05  ][10:10  ][10:15  ]
  Non-overlapping; each event in exactly one window

Sliding Window (size=10min, slide=5min):
  |────────10min──────────|
          |────────10min─────────|
                  |────────10min──────────|
  [10:00-10:10][10:05-10:15][10:10-10:20]
  Overlapping; each event appears in multiple windows

Session Window (gap=30min):
  |──── user active ────|   |── active ──|
                            ^30min gap
  Variable size; ends when user inactive 30min
  One session per user activity burst

Watermarks and Late Events

Event stream (event time in brackets):
  [10:00] [10:01] [10:03] [09:59 late!] [10:04] ...

Watermark at time T:
  "I believe all events with timestamp < T have arrived"
  
Watermark = max(event_time_seen) - max_lateness_expected
  If max lateness = 2 minutes:
  Current max event time = 10:05 → watermark = 10:03
  
When watermark passes 10:00:00 → close 10:00 window → emit aggregate

Late event options:
  1. Drop: simple; loses data
  2. Side output: emit late events separately for special handling
  3. Update window: reemit corrected aggregate (downstream must handle updates)
  4. Grace period: keep window open until watermark + grace_period

Exactly-Once Semantics

At-most-once:        At-least-once:         Exactly-once:
─────────────────    ──────────────────────  ──────────────────────────────
Don't retry          Retry on failure        Retry + deduplicate
Might miss events    Might duplicate events  Each event processed once

APPROACHES:
1. Idempotent output (easiest):
   Write with unique event_id; deduplicate at sink
   Retry is safe — same ID → same result

2. Kafka Transactions:
   Producer writes output + commits input offset atomically
   Consumer sees both or neither

3. Flink Chandy-Lamport Checkpointing:
   Barrier markers flow through stream
   All operators snapshot state at barrier
   On failure: rollback + replay from checkpoint

Stream Join Types

Stream-Table Join (enrichment):
  Orders stream ──┐
                  ├──→ Enrich with user profile ──→ enriched orders
  Users table ────┘ (loaded into memory / cached from CDC)

Stream-Stream Join (correlation):
  Searches stream ──┐
                    ├──→ Match searches to clicks ──→ search sessions
  Clicks stream ────┘ (both buffered in state; time window for match)

Table-Table Join (materialized view):
  Orders CDC ──┐
               ├──→ Join orders + users ──→ materialized view
  Users CDC ───┘ (incremental updates when either changes)

Key Trade-offs

Decision	Pro	Con	When to Use
Message queue	Simple, push-based	No replay, one consumer	Task queues
Message log	Replay, fan-out	More complex	Events, CDC, streaming
Tumbling window	Simple, no overlap	Boundary effects (event at edge)	Hourly/daily aggregates
Session window	Natural user behavior	Unpredictable state size	User activity tracking
Event time	Correct business semantics	Late arrival complexity	Business metrics
Processing time	Simple, always available	Wrong for out-of-order	SLA monitoring
Exactly-once	Correct counts/sums	Expensive; lower throughput	Financial, billing
At-least-once + idempotent	Simple, faster	Requires idempotent design	Most analytics

Red Flags

❌ Using processing time for business metrics (event order not guaranteed)
❌ Assuming Kafka messages are globally ordered (only within a partition)
❌ Dual writes to DB + Kafka (race condition; use CDC instead)
❌ Accumulating unbounded state without expiry (will OOM)
❌ Not handling late events (silent wrong results for windowed aggregates)

Green Flags

✅ Use event time for business logic; processing time for operational metrics
✅ Set watermarks based on observed event arrival patterns
✅ Use CDC (Debezium) to sync DB changes to derived systems
✅ Design sinks to be idempotent (accept at-least-once easily)
✅ Monitor consumer lag per partition (key operational metric)

Modern Additions (2026)

Flink SQL:
├─ Stream processing as SQL (windows, joins, aggregates)
├─ Same SQL for batch and stream (unified API)
└─ Example: SELECT window_start, user_id, COUNT(*) FROM TABLE(TUMBLE(events, INTERVAL '1' MINUTE))...

Streaming Lakehouse:
├─ Flink → Apache Iceberg on S3 (ACID streaming writes)
├─ Delta Lake + Spark Structured Streaming
└─ Low-latency analytics without separate systems

Feature stores (ML):
├─ Feast, Tecton: serve online features for ML inference
├─ Stream processing computes real-time features
└─ Feature freshness: seconds (vs batch's hours)

Interview Response Templates

When Asked to Design a Real-Time Analytics System

“I’d use Kafka as the event log — durable, replayable, fan-out to multiple consumers. Events would include timestamps (event time) rather than relying on when they arrive. A stream processor (Flink or Kafka Streams) would aggregate events using tumbling windows over event time, with watermarks to handle late arrivals. The output would go to a serving store (Redis for counts, Druid/ClickHouse for ad-hoc queries). I’d design all writes to be idempotent so at-least-once delivery is safe.”

When Asked About CDC vs Dual Writes

“Dual writes have a fundamental race condition: if the app writes to the DB and then to Kafka, a crash between the two leaves them inconsistent. CDC solves this by making the DB the single source of truth and deriving all other systems from the DB’s replication log. Debezium reads the PostgreSQL WAL or MySQL binlog and emits changes to Kafka — all downstream systems (search, cache, analytics) consume from Kafka. The DB is never out of sync with Kafka because Kafka is derived from the DB.”

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch11-cheatsheet