Chapter 13 Cheat Sheet — A Philosophy of Streaming Systems

One-Line Summaries

Concept	One-Liner
Streaming as fundamental paradigm	Events are primary; databases are derived materialized views
Kappa architecture	Single stream processor; replay log for historical reprocessing; Lambda is dead
Event time vs processing time	Event time = when it happened (truth); processing time = when processor saw it (approximation)
Watermarks	Heuristic assertion: “all events with time ≤ T have arrived”; enables window closure
Exactly-once	Observable output appears exactly once; internally may re-process; idempotency is the practical tool
Materialized views	Pre-computed query results updated continuously as new events arrive
Stream-table duality	A stream is a table’s changelog; a table is a stream’s materialized snapshot
Correctness by construction	Design state as events, derivations as pure functions, side effects as idempotent operations
Fault tolerance = replayability	Immutable log + checkpoints = recovery from any failure by replay

Lambda vs. Kappa — Side by Side

LAMBDA (antipattern as of 2026):
                                   ┌─ Batch Layer (Spark) ──────────────────┐
Input ──────────────────────────── ┤                                         ├──▶ Serving Layer
                                   └─ Speed Layer (Flink/Storm) ─────────────┘

Problems:
  Two code paths (same computation implemented twice)
  Batch and speed outputs must be merged — complex
  Bugs in one don't surface in the other
  Doubles operational complexity

KAPPA (correct model):

Input ──▶ Kafka (durable log) ──▶ Stream Processor (Flink) ──▶ Serving Layer
                  │
                  └── Historical reprocessing: replay from offset 0
                      Same code, same operator, just bounded input

Benefits:
  One code path — one thing to test, debug, operate
  Historical + real-time handled identically
  Replay enables retroactive bug fixes
  Kafka is the recovery mechanism

Time Concepts Quick Reference

Event timeline (mobile user, offline for 20 minutes):

Real world:  [11:58 PM] ──── [11:59 PM] ──── [12:01 AM]
             Purchase A       Purchase B       Purchase C
                    ↓               ↓               ↓
Phone buffers events while offline ...
                    ↓
             [12:20 AM] — events arrive at stream processor

Processing time:  All three events arrive at 12:20 AM → go in 12 AM bucket
Event time:       A and B belong in 11 PM bucket; C belongs in 12 AM bucket

For revenue reporting, event time is correct.
For monitoring pipeline health, processing time is appropriate.

Watermark Decision Framework

Choosing watermark strategy:

  What is the max acceptable latency for window results?
  │
  ├─ Sub-second (real-time dashboard)
  │    └─▶ Aggressive watermark (small allowed lateness)
  │         Accept some late event dropping
  │
  ├─ Seconds to minutes (operational metrics)
  │    └─▶ Moderate watermark (e.g., 30s allowed lateness)
  │         Emit preliminary result + corrections via retractions
  │
  └─ Batch-style (revenue reporting, daily aggregates)
       └─▶ Conservative watermark (minutes of allowed lateness)
            Wait for stragglers; completeness over speed

Formula: watermark = max(event_time_seen) - allowed_lateness

Exactly-Once: What It Means and How to Get It

MISCONCEPTION:
  "Exactly-once means the processor runs the code exactly once per message"
  ─ FALSE. Messages can be re-delivered and re-processed internally.

CORRECT DEFINITION:
  "The observable effect in the output appears exactly once"
  ─ The output is indistinguishable from having processed each message once.

MECHANISMS (choose based on output target):

  Output to Kafka:
    ┌─ Idempotent producer (PID + sequence number)
    └─ Kafka transactions (atomic offset commit + record write)
    Result: exactly-once within Kafka topology

  Output to external database:
    └─ Idempotent write with dedup key
       Application generates UUID per event; sink deduplicates
    Result: exactly-once if sink supports deduplication

  Output to external API (email, payment):
    └─ Cannot use Kafka transactions across the boundary
       Must design the operation to be idempotent:
         Bad:  "charge $10 on event X" (not idempotent)
         Good: "set balance to $Y if event X not yet applied" (idempotent)
    Result: exactly-once requires idempotent operation design

END-TO-END RULE:
  Exactly-once in the stream processor does not automatically mean
  exactly-once application behavior. Must design idempotency at
  every external interaction boundary.

Streaming System State Types

State Type	Scope	Example	Scale Pattern
Operator state	Single operator instance	Kafka consumer offset	Redistributed on rescale
Keyed state	Per key, per operator	User’s running total, session data	Shards with key space
Window state	Within a time window	Events in last 5-minute window	Cleared on window close
Broadcast state	All instances	Currency lookup table	Replicated to all

Fault Tolerance Checkpoint Flow

Flink Distributed Snapshot (Chandy-Lamport inspired):

1. Checkpoint coordinator sends BARRIER to source operators
2. Source: snapshot local state → emit BARRIER into stream
3. Operator: receive BARRIER on all inputs →
              snapshot state → forward BARRIER downstream
4. Sink: acknowledge BARRIER → checkpoint complete

On failure:
  1. Restore all operators to last complete checkpoint state
  2. Replay events from Kafka at the offset saved in checkpoint
  3. Normal processing resumes

RocksDB state backend (large state):
  ├─ State stored on local SSD (RocksDB)
  ├─ Checkpoint = upload delta SST files to S3
  └─ Recovery = restore RocksDB from S3 + replay delta since checkpoint

Stream-Table Duality

STREAM → TABLE (materialization):
  Process stream events, accumulate in KV store
  Key: user_id
  Value: latest event or aggregated state
  Table = current snapshot

TABLE → STREAM (change data capture):
  Emit an event for every INSERT, UPDATE, DELETE
  CDC (Debezium, Kafka Connect)
  Stream = history of changes

CONSEQUENCE:
  Kafka topic with compaction = table (only latest value per key survives)
  Kafka topic without compaction = stream (full history)

In Kafka Streams:
  KTable = automatically materialized from topic (latest value per key)
  KStream = unbounded sequence of events (no materialization)

Materialized View Architecture (CQRS Pattern)

WRITE PATH:                              READ PATH:
─────────────────────                    ─────────────────────────────
User action / command                    Application query
        │                                        │
        ▼                                        ▼
  Kafka topic                           Materialized view store
  (event log /                          ├─ Redis (low-latency KV)
   source of truth)                     ├─ RocksDB (embedded, high throughput)
        │                               ├─ Elasticsearch (full-text)
        ▼                               └─ ClickHouse (OLAP aggregates)
  Stream Processor                               ▲
  (Flink / Kafka Streams)               ─────────┘
        │                               Stream processor maintains the view
        └──────────────────────────────▶ continuously as new events arrive

Key property: If the materialized view is lost or corrupted,
              replay the event log to rebuild it.
              The event log is the ground truth.

Correctness by Construction: Checklist

State design:
  □ All state changes are explicit events (not silent mutations)
  □ State is derived from events, not directly updated
  □ Events are immutable once written

Time handling:
  □ Event timestamp embedded in the event itself
  □ Watermark strategy documented and justified
  □ Late event policy defined (drop / accept / retract)

Fault tolerance:
  □ Checkpoints configured (interval, retention)
  □ Recovery tested: verify reprocessing produces correct output
  □ Kafka retention sufficient for reprocessing window

Idempotency:
  □ Every external write has a deduplication key
  □ Every API call is designed to be safe to retry
  □ Exactly-once semantics verified end-to-end (not just within Flink)

Privacy:
  □ Personal data encrypted with per-user key before writing to log
  □ Right-to-erasure workflow: delete key → events become indecipherable
  □ Derived views (search index, OLAP) also updated on erasure

Quick Comparison: Stream Processing Engines (2026)

Feature	Flink	Kafka Streams	Spark Structured Streaming
Exactly-once	Yes	Yes	Yes
Event time + watermarks	Full support	Limited	Full support
Stateful processing	Rich (keyed, windowed, broadcast)	Yes (KTable)	Limited (state store)
SQL interface	Flink SQL (streaming)	KSQL via ksqlDB	Spark SQL (streaming)
Deployment	Standalone / Kubernetes	Embedded in app	Spark cluster
State backend	RocksDB / heap / Paimon	RocksDB	Memory / checkpoint
Latency	Sub-second	Sub-second	Micro-batch (~100ms)
Best for	Complex stateful, event-time, exactly-once	Simple stream processing, Kafka-native	Unified batch + streaming, ML pipelines

The Philosophy in One Diagram

WORLD VIEW: Events are primary; state is derived

   Real World                 Data System
   ──────────                 ─────────────────────────────
   Fact occurs ──────────▶   Event logged (immutable)
   (user pays,                       │
    sensor fires,                    ▼
    button clicked)          Stream processor
                             (pure function: events → derived state)
                                      │
                             ┌────────▼──────────────────┐
                             │  Materialized views        │
                             │  (databases, caches,       │
                             │   search indexes)          │
                             └───────────────────────────┘
                             
   If derived view is wrong → replay events → rebuild view
   Events are never wrong   → they record what happened
   State can always be fixed → re-derive from events

Quick Revision Time: 6 minutes
Interview Prep: 15 minutes
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch13-cheatsheet

Chapter 13 Cheat Sheet — A Philosophy of Streaming Systems

One-Line Summaries

Lambda vs. Kappa — Side by Side

Time Concepts Quick Reference

Watermark Decision Framework

Exactly-Once: What It Means and How to Get It

Streaming System State Types

Fault Tolerance Checkpoint Flow

Stream-Table Duality

Materialized View Architecture (CQRS Pattern)

Correctness by Construction: Checklist

Quick Comparison: Stream Processing Engines (2026)

The Philosophy in One Diagram

Graph View

Table of Contents