Chapter 12 Cheat Sheet — Stream Processing

One-Line Summaries

ConceptOne-Liner
Stream processingBatch processing on unbounded (infinite) data; act on events as they arrive
Message log (Kafka)Append-only log with offset tracking; enables replay, fan-out, backpressure
Message queue (RabbitMQ)Push-based; message deleted after consumed; no replay
CDCCapture DB changes from replication log → publish to Kafka → sync derived systems
DebeziumOpen-source CDC platform; reads PostgreSQL WAL → Kafka Connect → Kafka topics
WatermarkStream processor’s estimate of “events before time T have likely arrived”
Event timeWhen event actually occurred (use for business logic)
Processing timeWhen event arrives at processor (use for operational SLAs)
Stream-table joinEnrich stream events with lookup table data; no expiry; for enrichment
Stream-stream joinCorrelate two event streams within a time window; both sides expire
Exactly-onceEach event processed exactly once; requires idempotent sinks + checkpoints
Log compactionRetain only latest value per key in Kafka; enables CDC state bootstrapping

Log-Based vs Message Queue

MESSAGE QUEUE (RabbitMQ, SQS):
  ┌──────────┐    push    ┌──────────────────────────────┐
  │  Queue   │ ─────────► │ Consumer A (gets message,    │
  │          │            │            deletes it)        │
  └──────────┘            └──────────────────────────────┘
  
  After Consumer A processes: message GONE
  Consumer B: cannot read it (already deleted)
  On crash: message lost unless acknowledged before crash
  Good for: task queues, job distribution

LOG-BASED BROKER (Kafka, Kinesis):
  ┌─────────────────────────────────────────────────────┐
  │  LOG: [offset 0][offset 1][offset 2]...[offset N]   │
  └──┬────────────────────────────────────────────────┘
     │           │           │
     │           │           │ (each group reads independently)
  Consumer    Consumer     Consumer
  Group A     Group B      Group C
  (at off.7)  (at off.3)   (at off.N-1)
  
  Message retained until log expiry policy (not deleted on consumption)
  Any consumer group can replay from any offset
  Good for: event streaming, CDC, audit, replay
┌───────────────┐      ┌──────────────┐      ┌──────────────┐
│  PostgreSQL   │      │   Debezium   │      │    Kafka     │
│               │      │ (Kafka       │      │              │
│  WAL (Write-  │─────►│  Connect)    │─────►│  Topic:      │
│  Ahead Log):  │      │              │      │  pg.public.  │
│               │      │  Reads WAL   │      │  orders      │
│  INSERT/      │      │  Transforms  │      │              │
│  UPDATE/      │      │  to JSON/    │      │  [CDC events]│
│  DELETE       │      │  Avro events │      └──────┬───────┘
│               │      └──────────────┘             │
└───────────────┘                                   │
                                                    ▼
                                         ┌──────────────────┐
                                         │   Apache Flink   │
                                         │                  │
                                         │  - Filter events │
                                         │  - Enrich with   │
                                         │    other tables  │
                                         │  - Transform     │
                                         └────────┬─────────┘
                                                  │
                                         ┌────────▼─────────┐
                                         │  Elasticsearch   │
                                         │  (search index   │
                                         │   updated in     │
                                         │   seconds)       │
                                         └──────────────────┘

CDC event structure:
  {"before": {...old row...}, "after": {...new row...}, "op": "u"}
  op = c (create), u (update), d (delete), r (read/snapshot)

Event Time vs Processing Time

Producer                    Broker              Processor
────────────────────────────────────────────────────────
Event occurs at:            Received at:        Processed at:
  event_time = 14:00:05      14:00:07             14:00:35
                ↑ USE THIS for business logic      ↑ this is processing time

Why they differ:
  Mobile app buffered offline 4 hours → event_time=10:00, processing_time=14:00
  Network delay
  Consumer paused/slow
  Clock skew

Rule:
  event_time  → use for: revenue per hour, session duration, user activity
  processing_time → use for: job throughput metrics, "is job keeping up?"

Watermark Strategies

FIXED-LAG WATERMARK:
  watermark(t) = max_event_time_seen - lag
  Example: lag=2min → if latest event=14:10, watermark=14:08
  Windows before 14:08 are closed and emitted
  
  ✅ Simple, predictable
  ❌ Events > 2min late are silently dropped (or go to side output)

HEURISTIC WATERMARK:
  Observe actual P99 late arrival delay → set lag accordingly
  Adapts to data patterns; more accurate
  
  ✅ Better completeness for variable delay patterns
  ❌ Complex; still a heuristic

GRACE PERIOD:
  Window stays open for extra N seconds after watermark passes
  Late events within grace period included; after grace → dropped/side-output
  
  ✅ Catches most late arrivals without unbounded waiting
  Flink: .withIdleness() + .allowedLateness(Duration.ofSeconds(30))

TRADE-OFF:
  Larger lag → more completeness, more latency
  Smaller lag → less latency, more data dropped

Stream Join Types

1. STREAM-TABLE JOIN (enrichment):
   ─────────────────────────────────────────────────────────
   Stream: [order_id=1, user_id=42, amount=99] ──────────►
   Table:  {user_id=42 → tier="gold", country="US"}  (loaded in state store)
   Output: [order_id=1, user_id=42, amount=99, tier="gold", country="US"]
   
   State: only lookup table (no stream state)
   Expiry: table never expires (persistent ref data)
   Keep table fresh: CDC feed updates state store

2. STREAM-STREAM JOIN (event correlation):
   ─────────────────────────────────────────────────────────
   Stream A (impressions): [user=42, ad=9, t=14:00] ──────►
   Stream B (clicks):      [user=42, ad=9, t=14:07] ──────►
   
   Join: same user+ad, click within 30min of impression
   State: buffer both streams; expire after 30min
   Output: [user=42, ad=9, impression=14:00, click=14:07, delay=7min]

3. TABLE-TABLE JOIN (incremental materialized view):
   ─────────────────────────────────────────────────────────
   Table A (CDC): customer updates
   Table B (CDC): order updates
   
   Materialized view = JOIN(customers, orders), updated on each CDC event
   Used in: ksqlDB, Kafka Streams, Flink Table API

Exactly-Once Delivery: Three Approaches

APPROACH 1: Idempotent writes
  Process event → write with unique event_id → sink deduplicates
  If event re-processed: same event_id → overwrite (no duplicate)
  Requirement: sink must support deduplication by ID
  Overhead: low

APPROACH 2: Kafka Transactions
  producer.beginTransaction()
  producer.send(output_topic, result)
  consumer.commitSync(offsets_to_commit)     // commit input offset
  producer.commitTransaction()               // atomic: output + offset
  
  Consumer must use: isolation.level=read_committed
  Overhead: moderate (~5-10% throughput reduction)

APPROACH 3: Flink Chandy-Lamport Checkpoints
  Step 1: Inject BARRIER into all input streams
  Step 2: Each operator snapshots state to S3/HDFS on BARRIER
  Step 3: Record input offset at checkpoint
  On failure: rollback all operators to last checkpoint; replay from offset
  
  Overhead: checkpoint interval (e.g., every 5 minutes of extra state I/O)
  Requirement: deterministic processing; idempotent or transactional sinks

Kafka vs Kinesis vs Pulsar vs RabbitMQ

KafkaKinesisPulsarRabbitMQ
TypeLog (pull)Log (pull)Log + queueQueue (push)
ReplayYesYesYesNo
RetentionConfigurable1–365 daysTiered (S3)Until acked
ThroughputVery highHighHighMedium
Ops burdenMediumLow (managed)MediumLow
EcosystemRichestAWS-nativeGrowingEnterprise
Choose whenOpen source, rich ecosystemAWS-native stackCost-sensitive retentionSimple task queue

Delivery Guarantees

AT-MOST-ONCE:
  Fire and forget; don't retry
  ❌ Events lost on failure
  ✅ Zero latency overhead
  Use: non-critical logging, sampling

AT-LEAST-ONCE:
  Retry on failure; may duplicate
  ✅ No data loss
  ❌ Downstream must be idempotent
  Use: most production use cases with idempotent sinks

EXACTLY-ONCE:
  Each event processed exactly once
  ✅ Correct for financial/billing use cases
  ❌ Overhead: transactions, checkpoints
  Use: financial transactions, inventory, billing

Log Compaction for CDC State

Normal Kafka topic (time-based retention):
  [user=1: A][user=2: B][user=1: B][user=1: C]  →  [EXPIRY]→
                                              ↑ all retained until TTL
  
Compacted Kafka topic:
  [user=1: A][user=2: B][user=1: B][user=1: C]
                    ↓ compaction ↓
  [user=2: B][user=1: C]     ← only LATEST value per key retained
  
Reading compacted topic from offset 0 = current state of all keys
Tombstone: user=1 with null value = DELETE user=1 from state

CDC bootstrapping:
  New Elasticsearch index consumer:
  1. Read compacted Kafka topic from offset 0 → get current state
  2. Continue reading new events → stay up to date
  No need to replay all history!

Red Flags

❌  Dual writes to database + Kafka (race conditions → divergence)
❌  Windowing by processing time for business metrics (use event time!)
❌  No late data handling (silently losing mobile/offline events)
❌  Using RabbitMQ when you need replay (messages deleted on consumption)
❌  Exactly-once without idempotent sinks (Flink checkpoints don't help if Elasticsearch double-writes)
❌  Stream-stream join with unbounded window (state grows forever → OOM)
❌  Kafka partition count < consumer count (consumers idle, no work to do)

Green Flags

✅  Kafka with retention policy for replay capability
✅  Debezium CDC: single source of truth stays in DB
✅  Event time windowing + watermark with grace period
✅  Broadcast hash join for large stream × small table (no shuffle)
✅  Log compaction for CDC topics (bootstrapping new consumers)
✅  Flink checkpointing + idempotent sink for exactly-once
✅  Side outputs for late events (don't discard; handle separately)
✅  Consumer group per independent downstream system (fan-out)

Quick Revision Time: 5 minutes
Interview Prep: 20 minutes
Last Updated: 2026-05-29