Chapter 12 Cheat Sheet — Stream Processing

One-Line Summaries

Concept	One-Liner
Stream processing	Batch processing on unbounded (infinite) data; act on events as they arrive
Message log (Kafka)	Append-only log with offset tracking; enables replay, fan-out, backpressure
Message queue (RabbitMQ)	Push-based; message deleted after consumed; no replay
CDC	Capture DB changes from replication log → publish to Kafka → sync derived systems
Debezium	Open-source CDC platform; reads PostgreSQL WAL → Kafka Connect → Kafka topics
Watermark	Stream processor’s estimate of “events before time T have likely arrived”
Event time	When event actually occurred (use for business logic)
Processing time	When event arrives at processor (use for operational SLAs)
Stream-table join	Enrich stream events with lookup table data; no expiry; for enrichment
Stream-stream join	Correlate two event streams within a time window; both sides expire
Exactly-once	Each event processed exactly once; requires idempotent sinks + checkpoints
Log compaction	Retain only latest value per key in Kafka; enables CDC state bootstrapping

Log-Based vs Message Queue

MESSAGE QUEUE (RabbitMQ, SQS):
  ┌──────────┐    push    ┌──────────────────────────────┐
  │  Queue   │ ─────────► │ Consumer A (gets message,    │
  │          │            │            deletes it)        │
  └──────────┘            └──────────────────────────────┘
  
  After Consumer A processes: message GONE
  Consumer B: cannot read it (already deleted)
  On crash: message lost unless acknowledged before crash
  Good for: task queues, job distribution

LOG-BASED BROKER (Kafka, Kinesis):
  ┌─────────────────────────────────────────────────────┐
  │  LOG: [offset 0][offset 1][offset 2]...[offset N]   │
  └──┬────────────────────────────────────────────────┘
     │           │           │
     │           │           │ (each group reads independently)
  Consumer    Consumer     Consumer
  Group A     Group B      Group C
  (at off.7)  (at off.3)   (at off.N-1)
  
  Message retained until log expiry policy (not deleted on consumption)
  Any consumer group can replay from any offset
  Good for: event streaming, CDC, audit, replay

CDC Pipeline: PostgreSQL → Debezium → Kafka → Flink → Elasticsearch

┌───────────────┐      ┌──────────────┐      ┌──────────────┐
│  PostgreSQL   │      │   Debezium   │      │    Kafka     │
│               │      │ (Kafka       │      │              │
│  WAL (Write-  │─────►│  Connect)    │─────►│  Topic:      │
│  Ahead Log):  │      │              │      │  pg.public.  │
│               │      │  Reads WAL   │      │  orders      │
│  INSERT/      │      │  Transforms  │      │              │
│  UPDATE/      │      │  to JSON/    │      │  [CDC events]│
│  DELETE       │      │  Avro events │      └──────┬───────┘
│               │      └──────────────┘             │
└───────────────┘                                   │
                                                    ▼
                                         ┌──────────────────┐
                                         │   Apache Flink   │
                                         │                  │
                                         │  - Filter events │
                                         │  - Enrich with   │
                                         │    other tables  │
                                         │  - Transform     │
                                         └────────┬─────────┘
                                                  │
                                         ┌────────▼─────────┐
                                         │  Elasticsearch   │
                                         │  (search index   │
                                         │   updated in     │
                                         │   seconds)       │
                                         └──────────────────┘

CDC event structure:
  {"before": {...old row...}, "after": {...new row...}, "op": "u"}
  op = c (create), u (update), d (delete), r (read/snapshot)

Event Time vs Processing Time

Producer                    Broker              Processor
────────────────────────────────────────────────────────
Event occurs at:            Received at:        Processed at:
  event_time = 14:00:05      14:00:07             14:00:35
                ↑ USE THIS for business logic      ↑ this is processing time

Why they differ:
  Mobile app buffered offline 4 hours → event_time=10:00, processing_time=14:00
  Network delay
  Consumer paused/slow
  Clock skew

Rule:
  event_time  → use for: revenue per hour, session duration, user activity
  processing_time → use for: job throughput metrics, "is job keeping up?"

Watermark Strategies

FIXED-LAG WATERMARK:
  watermark(t) = max_event_time_seen - lag
  Example: lag=2min → if latest event=14:10, watermark=14:08
  Windows before 14:08 are closed and emitted
  
  ✅ Simple, predictable
  ❌ Events > 2min late are silently dropped (or go to side output)

HEURISTIC WATERMARK:
  Observe actual P99 late arrival delay → set lag accordingly
  Adapts to data patterns; more accurate
  
  ✅ Better completeness for variable delay patterns
  ❌ Complex; still a heuristic

GRACE PERIOD:
  Window stays open for extra N seconds after watermark passes
  Late events within grace period included; after grace → dropped/side-output
  
  ✅ Catches most late arrivals without unbounded waiting
  Flink: .withIdleness() + .allowedLateness(Duration.ofSeconds(30))

TRADE-OFF:
  Larger lag → more completeness, more latency
  Smaller lag → less latency, more data dropped

Stream Join Types

1. STREAM-TABLE JOIN (enrichment):
   ─────────────────────────────────────────────────────────
   Stream: [order_id=1, user_id=42, amount=99] ──────────►
   Table:  {user_id=42 → tier="gold", country="US"}  (loaded in state store)
   Output: [order_id=1, user_id=42, amount=99, tier="gold", country="US"]
   
   State: only lookup table (no stream state)
   Expiry: table never expires (persistent ref data)
   Keep table fresh: CDC feed updates state store

2. STREAM-STREAM JOIN (event correlation):
   ─────────────────────────────────────────────────────────
   Stream A (impressions): [user=42, ad=9, t=14:00] ──────►
   Stream B (clicks):      [user=42, ad=9, t=14:07] ──────►
   
   Join: same user+ad, click within 30min of impression
   State: buffer both streams; expire after 30min
   Output: [user=42, ad=9, impression=14:00, click=14:07, delay=7min]

3. TABLE-TABLE JOIN (incremental materialized view):
   ─────────────────────────────────────────────────────────
   Table A (CDC): customer updates
   Table B (CDC): order updates
   
   Materialized view = JOIN(customers, orders), updated on each CDC event
   Used in: ksqlDB, Kafka Streams, Flink Table API

Exactly-Once Delivery: Three Approaches

APPROACH 1: Idempotent writes
  Process event → write with unique event_id → sink deduplicates
  If event re-processed: same event_id → overwrite (no duplicate)
  Requirement: sink must support deduplication by ID
  Overhead: low

APPROACH 2: Kafka Transactions
  producer.beginTransaction()
  producer.send(output_topic, result)
  consumer.commitSync(offsets_to_commit)     // commit input offset
  producer.commitTransaction()               // atomic: output + offset
  
  Consumer must use: isolation.level=read_committed
  Overhead: moderate (~5-10% throughput reduction)

APPROACH 3: Flink Chandy-Lamport Checkpoints
  Step 1: Inject BARRIER into all input streams
  Step 2: Each operator snapshots state to S3/HDFS on BARRIER
  Step 3: Record input offset at checkpoint
  On failure: rollback all operators to last checkpoint; replay from offset
  
  Overhead: checkpoint interval (e.g., every 5 minutes of extra state I/O)
  Requirement: deterministic processing; idempotent or transactional sinks

Kafka vs Kinesis vs Pulsar vs RabbitMQ

	Kafka	Kinesis	Pulsar	RabbitMQ
Type	Log (pull)	Log (pull)	Log + queue	Queue (push)
Replay	Yes	Yes	Yes	No
Retention	Configurable	1–365 days	Tiered (S3)	Until acked
Throughput	Very high	High	High	Medium
Ops burden	Medium	Low (managed)	Medium	Low
Ecosystem	Richest	AWS-native	Growing	Enterprise
Choose when	Open source, rich ecosystem	AWS-native stack	Cost-sensitive retention	Simple task queue

Delivery Guarantees

AT-MOST-ONCE:
  Fire and forget; don't retry
  ❌ Events lost on failure
  ✅ Zero latency overhead
  Use: non-critical logging, sampling

AT-LEAST-ONCE:
  Retry on failure; may duplicate
  ✅ No data loss
  ❌ Downstream must be idempotent
  Use: most production use cases with idempotent sinks

EXACTLY-ONCE:
  Each event processed exactly once
  ✅ Correct for financial/billing use cases
  ❌ Overhead: transactions, checkpoints
  Use: financial transactions, inventory, billing

Log Compaction for CDC State

Normal Kafka topic (time-based retention):
  [user=1: A][user=2: B][user=1: B][user=1: C]  →  [EXPIRY]→
                                              ↑ all retained until TTL
  
Compacted Kafka topic:
  [user=1: A][user=2: B][user=1: B][user=1: C]
                    ↓ compaction ↓
  [user=2: B][user=1: C]     ← only LATEST value per key retained
  
Reading compacted topic from offset 0 = current state of all keys
Tombstone: user=1 with null value = DELETE user=1 from state

CDC bootstrapping:
  New Elasticsearch index consumer:
  1. Read compacted Kafka topic from offset 0 → get current state
  2. Continue reading new events → stay up to date
  No need to replay all history!

Red Flags

❌  Dual writes to database + Kafka (race conditions → divergence)
❌  Windowing by processing time for business metrics (use event time!)
❌  No late data handling (silently losing mobile/offline events)
❌  Using RabbitMQ when you need replay (messages deleted on consumption)
❌  Exactly-once without idempotent sinks (Flink checkpoints don't help if Elasticsearch double-writes)
❌  Stream-stream join with unbounded window (state grows forever → OOM)
❌  Kafka partition count < consumer count (consumers idle, no work to do)

Green Flags

✅  Kafka with retention policy for replay capability
✅  Debezium CDC: single source of truth stays in DB
✅  Event time windowing + watermark with grace period
✅  Broadcast hash join for large stream × small table (no shuffle)
✅  Log compaction for CDC topics (bootstrapping new consumers)
✅  Flink checkpointing + idempotent sink for exactly-once
✅  Side outputs for late events (don't discard; handle separately)
✅  Consumer group per independent downstream system (fan-out)

Quick Revision Time: 5 minutes
Interview Prep: 20 minutes
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch12-cheatsheet

Chapter 12 Cheat Sheet — Stream Processing

One-Line Summaries

Log-Based vs Message Queue

CDC Pipeline: PostgreSQL → Debezium → Kafka → Flink → Elasticsearch

Event Time vs Processing Time

Watermark Strategies

Stream Join Types

Exactly-Once Delivery: Three Approaches

Kafka vs Kinesis vs Pulsar vs RabbitMQ

Delivery Guarantees

Log Compaction for CDC State

Red Flags

Green Flags

Graph View

Table of Contents