Chapter 13 Cheat Sheet — A Philosophy of Streaming Systems
One-Line Summaries
Concept                           | One-Liner
----------------------------------|------------------------------------------------------------
Streaming as fundamental paradigm | Events are primary; databases are derived materialized views
Kappa architecture                | Single stream processor; replay log for historical reprocessing; Lambda is dead
Event time vs processing time     | Event time = when it happened (truth); processing time = when processor saw it (approximation)
Watermarks                        | Heuristic assertion: “all events with time ≤ T have arrived”; enables window closure
Exactly-once                      | Observable output appears exactly once; internally may re-process; idempotency is the practical tool
Materialized views                | Pre-computed query results updated continuously as new events arrive
Stream-table duality              | A stream is a table’s changelog; a table is a stream’s materialized snapshot
Correctness by construction       | Design state as events, derivations as pure functions, side effects as idempotent operations
Fault tolerance = replayability   | Immutable log + checkpoints = recovery from any failure by replay
Lambda vs. Kappa — Side by Side
LAMBDA (antipattern as of 2026):
          ┌─ Batch Layer (Spark) ──────────────────┐
Input ────┤                                        ├──▶ Serving Layer
          └─ Speed Layer (Flink/Storm) ────────────┘
Problems:
Two code paths (same computation implemented twice)
Batch and speed outputs must be merged — complex
Bugs in one don't surface in the other
Doubles operational complexity
KAPPA (correct model):
Input ──▶ Kafka (durable log) ──▶ Stream Processor (Flink) ──▶ Serving Layer
                │
                └── Historical reprocessing: replay from offset 0
                    Same code, same operator, just bounded input
Benefits:
One code path — one thing to test, debug, operate
Historical + real-time handled identically
Replay enables retroactive bug fixes
Kafka is the recovery mechanism
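The "one code path" benefit can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list stands in for the Kafka log; the function `revenue_per_user` and the `(user_id, amount_cents)` event shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[str, int]  # (user_id, amount_cents): hypothetical event shape


def revenue_per_user(
    events: Iterable[Event], state: Optional[Dict[str, int]] = None
) -> Dict[str, int]:
    """The single processing function, used for both replay and live input."""
    totals = defaultdict(int, state or {})
    for user_id, amount_cents in events:
        totals[user_id] += amount_cents
    return dict(totals)


# An in-memory list stands in for the durable log (assumption).
log: List[Event] = [("alice", 500), ("bob", 250), ("alice", 100)]

# Historical reprocessing: bounded input, replayed from offset 0.
rebuilt = revenue_per_user(log)

# Live processing: the same function, fed incrementally.
live = revenue_per_user(log[:2])
live = revenue_per_user(log[2:], state=live)

print(rebuilt == live)  # True
```

Because replay and live processing share one function, a bug fix in `revenue_per_user` applies to both by construction.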
Time Concepts Quick Reference
Event timeline (mobile user, offline for 20 minutes):
Real world:  [11:58 PM] ──── [11:59 PM] ──── [12:01 AM]
             Purchase A      Purchase B      Purchase C
                 ↓               ↓               ↓
             Phone buffers events while offline ...
                                 ↓
             [12:20 AM]: events arrive at stream processor
Processing time: All three events arrive at 12:20 AM → go in 12 AM bucket
Event time: A and B belong in 11 PM bucket; C belongs in 12 AM bucket
For revenue reporting, event time is correct.
For monitoring pipeline health, processing time is appropriate.
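The bucketing difference can be reproduced directly. A minimal sketch using the three purchases from the timeline above; the calendar date and the `hour_bucket` helper are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# (name, event_time, processing_time), following the timeline above.
# The calendar date is an arbitrary assumption.
events = [
    ("A", datetime(2026, 1, 1, 23, 58), datetime(2026, 1, 2, 0, 20)),
    ("B", datetime(2026, 1, 1, 23, 59), datetime(2026, 1, 2, 0, 20)),
    ("C", datetime(2026, 1, 2, 0, 1), datetime(2026, 1, 2, 0, 20)),
]


def hour_bucket(ts: datetime) -> str:
    """Truncate a timestamp to its hourly bucket (hypothetical helper)."""
    return ts.strftime("%Y-%m-%d %H:00")


by_event_time = Counter(hour_bucket(event_ts) for _, event_ts, _ in events)
by_processing_time = Counter(hour_bucket(proc_ts) for _, _, proc_ts in events)

print(dict(by_event_time))       # A and B in the 23:00 bucket, C in 00:00
print(dict(by_processing_time))  # all three in the 00:00 bucket
```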
Watermark Decision Framework
Choosing watermark strategy:
What is the max acceptable latency for window results?
│
├─ Sub-second (real-time dashboard)
│    └─▶ Aggressive watermark (small allowed lateness)
│        Accept that some late events are dropped
│
├─ Seconds to minutes (operational metrics)
│    └─▶ Moderate watermark (e.g., 30s allowed lateness)
│        Emit preliminary result + corrections via retractions
│
└─ Batch-style (revenue reporting, daily aggregates)
     └─▶ Conservative watermark (minutes of allowed lateness)
         Wait for stragglers; completeness over speed
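Formula: watermark = max(event_time_seen) - allowed_lateness
(Flink calls this bound "max out-of-orderness"; its window-level allowedLateness is a separate setting that keeps window state around after the watermark has passed.)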
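A minimal sketch of the formula driving window closure, assuming integer event times, tumbling windows, and a "drop" policy for late events. The function `run_tumbling_windows` is hypothetical and not any framework's API.

```python
from collections import defaultdict


def run_tumbling_windows(event_times, window_size, allowed_lateness):
    """Count events per tumbling window, closing windows as the watermark advances.

    Returns (closed, dropped): closed is a list of (window_start, count) in
    closing order; dropped lists events that arrived after their window closed.
    """
    max_event_time = None
    open_windows = defaultdict(int)  # window_start -> event count
    closed, closed_starts, dropped = [], set(), []

    for t in event_times:
        start = (t // window_size) * window_size
        if start in closed_starts:
            dropped.append(t)  # late-event policy in this sketch: drop
        else:
            open_windows[start] += 1

        # watermark = max(event_time_seen) - allowed_lateness
        max_event_time = t if max_event_time is None else max(max_event_time, t)
        watermark = max_event_time - allowed_lateness

        # Close every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                closed.append((s, open_windows.pop(s)))
                closed_starts.add(s)

    return closed, dropped


# Event times in arbitrary integer units, arriving out of order.
closed, dropped = run_tumbling_windows(
    [1, 4, 12, 3, 15, 9, 27], window_size=10, allowed_lateness=5
)
print(closed)   # [(0, 3), (10, 2)]
print(dropped)  # [9]
```

The event at time 3 arrives out of order but is still counted, because its window is open. The event at time 9 arrives after the watermark has reached 10, so it is dropped. A larger `allowed_lateness` would have kept it, at the cost of later results.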
Exactly-Once: What It Means and How to Get It
MISCONCEPTION:
"Exactly-once means the processor runs the code exactly once per message"
─ FALSE. Messages can be re-delivered and re-processed internally.
CORRECT DEFINITION:
"The observable effect in the output appears exactly once"
─ The output is indistinguishable from having processed each message once.
MECHANISMS (choose based on output target):
Output to Kafka:
  ┌─ Idempotent producer (PID + sequence number)
  └─ Kafka transactions (atomic offset commit + record write)
  Result: exactly-once within Kafka topology
Output to external database:
  └─ Idempotent write with dedup key
     UUID is assigned once when the event is created and travels with it,
     so a replayed event carries the same key; sink deduplicates
     Result: exactly-once if sink supports deduplication
Output to external API (email, payment):
  └─ Cannot use Kafka transactions across the boundary
     Must design the operation to be idempotent:
       Bad:  "charge $10 on event X" (not idempotent)
       Good: "set balance to $Y if event X not yet applied" (idempotent)
     Result: exactly-once requires idempotent operation design
END-TO-END RULE:
Exactly-once in the stream processor does not automatically mean
exactly-once application behavior. Must design idempotency at
every external interaction boundary.
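The Bad/Good contrast above can be sketched as follows. This is an illustrative in-memory sink, not a real database client; the `Ledger` class and its method names are hypothetical.

```python
class Ledger:
    """Toy external sink; `applied_event_ids` plays the role of a dedup table."""

    def __init__(self):
        self.balances = {}
        self.applied_event_ids = set()

    def charge_naive(self, account, amount_cents):
        # Not idempotent: every redelivery charges again.
        self.balances[account] = self.balances.get(account, 0) - amount_cents

    def charge_idempotent(self, event_id, account, amount_cents):
        # Idempotent: the dedup key turns a redelivered event into a no-op.
        # A real sink would perform the check and the write in one transaction.
        if event_id in self.applied_event_ids:
            return False
        self.balances[account] = self.balances.get(account, 0) - amount_cents
        self.applied_event_ids.add(event_id)
        return True


ledger = Ledger()
# The same event is delivered twice, as can happen after a replay.
for _ in range(2):
    ledger.charge_naive("naive-account", 1000)
    ledger.charge_idempotent("event-X", "safe-account", 1000)

print(ledger.balances)  # {'naive-account': -2000, 'safe-account': -1000}
```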
Streaming System State Types
State Type      | Scope                    | Example                            | Scale Pattern
----------------|--------------------------|------------------------------------|-------------------------
Operator state  | Single operator instance | Kafka consumer offset              | Redistributed on rescale
Keyed state     | Per key, per operator    | User’s running total, session data | Shards with key space
Window state    | Within a time window     | Events in last 5-minute window     | Cleared on window close
Broadcast state | All instances            | Currency lookup table              | Replicated to all
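How keyed state "shards with key space" can be sketched with a two-step mapping: key to key group, then key group to operator instance. This is loosely modeled on key-group style assignment. The constant `MAX_KEY_GROUPS`, the CRC32 hash, and both function names are assumptions for illustration, not a framework API.

```python
import zlib

MAX_KEY_GROUPS = 128  # fixed upper bound on parallelism (assumed value)


def key_group(key: str) -> int:
    """Stable hash of the key into a fixed number of key groups."""
    return zlib.crc32(key.encode("utf-8")) % MAX_KEY_GROUPS


def owner(key: str, parallelism: int) -> int:
    """Operator instance that holds this key's state at a given parallelism."""
    return key_group(key) * parallelism // MAX_KEY_GROUPS


keys = ["alice", "bob", "carol", "dave"]

# Rescaling from 2 to 4 instances moves whole key groups, never single keys:
# a key's group never changes, only the group's owner does.
before = {k: owner(k, parallelism=2) for k in keys}
after = {k: owner(k, parallelism=4) for k in keys}
print(before)
print(after)
```

Keeping the key-to-group mapping fixed means a rescale only has to hand contiguous ranges of key groups to new owners, rather than rehash every key.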
Fault Tolerance Checkpoint Flow
Flink Distributed Snapshot (Chandy-Lamport inspired):
1. Checkpoint coordinator triggers a checkpoint at the source operators
2. Source: snapshot local state (e.g., Kafka offsets) → emit BARRIER into stream
3. Operator: receive BARRIER on all inputs (alignment) →
   snapshot state → forward BARRIER downstream
4. Every operator, sinks included, acknowledges its snapshot to the coordinator → checkpoint complete
On failure:
1. Restore all operators to last complete checkpoint state
2. Replay events from Kafka at the offset saved in checkpoint
3. Normal processing resumes
RocksDB state backend (large state):
├─ State stored on local SSD (RocksDB)
├─ Checkpoint = upload delta SST files to S3
└─ Recovery = restore RocksDB from S3 + replay events since checkpoint
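The restore-then-replay recovery path can be sketched for a single operator. This is a simplified model, assuming an in-memory list as the log and omitting barriers and multi-operator alignment entirely; `CountingOperator` is a hypothetical name.

```python
class CountingOperator:
    """Counts occurrences per key; its state is the counts plus the next offset."""

    def __init__(self):
        self.counts = {}
        self.offset = 0  # next log offset to read

    def process(self, log, upto):
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def snapshot(self):
        # State and input offset are captured together, so they stay consistent.
        return {"counts": dict(self.counts), "offset": self.offset}

    @classmethod
    def restore(cls, snap):
        op = cls()
        op.counts = dict(snap["counts"])
        op.offset = snap["offset"]
        return op


log = ["a", "b", "a", "c", "a", "b"]  # stands in for a Kafka partition

op = CountingOperator()
op.process(log, upto=3)
checkpoint = op.snapshot()   # last complete checkpoint
op.process(log, upto=5)      # progress made after the checkpoint...
del op                       # ...is lost when the operator fails

recovered = CountingOperator.restore(checkpoint)
recovered.process(log, upto=len(log))  # replay from the saved offset

print(recovered.counts)  # {'a': 3, 'b': 2, 'c': 1}
```

The recovered counts match a run that never failed because the offset is stored in the same snapshot as the state. Replay therefore neither skips nor double-counts any event.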
Stream-Table Duality
STREAM → TABLE (materialization):
Process stream events, accumulate in KV store
Key: user_id
Value: latest event or aggregated state
Table = current snapshot
TABLE → STREAM (change data capture):
Emit an event for every INSERT, UPDATE, DELETE
CDC (Debezium, Kafka Connect)
Stream = history of changes
CONSEQUENCE:
Kafka topic with compaction = table (eventually, only the latest value per key survives)
Kafka topic without compaction = stream (full history)
In Kafka Streams:
KTable = automatically materialized from topic (latest value per key)
KStream = unbounded sequence of events (no materialization)
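Both directions of the duality fit in a short sketch. It assumes a changelog of `(key, value)` pairs where a `None` value is a delete marker (tombstone); `materialize` and `compact` are hypothetical helpers, not Kafka APIs.

```python
def materialize(changelog):
    """STREAM -> TABLE: fold the changelog into the latest value per key."""
    table = {}
    for key, value in changelog:
        if value is None:  # tombstone: the key was deleted
            table.pop(key, None)
        else:
            table[key] = value
    return table


def compact(changelog):
    """Keep only the last record per key, preserving the survivors' order."""
    last_index = {key: i for i, (key, _) in enumerate(changelog)}
    return [rec for i, rec in enumerate(changelog) if last_index[rec[0]] == i]


# TABLE -> STREAM: every INSERT, UPDATE, DELETE becomes a record.
changelog = [
    ("u1", "a@example.com"),  # INSERT u1
    ("u2", "b@example.com"),  # INSERT u2
    ("u1", "c@example.com"),  # UPDATE u1
    ("u2", None),             # DELETE u2
]

print(materialize(changelog))  # {'u1': 'c@example.com'}
print(compact(changelog))      # [('u1', 'c@example.com'), ('u2', None)]
```

Materializing the compacted log yields the same table as materializing the full history, which is why a compacted topic can stand in for the table.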
Materialized View Architecture (CQRS Pattern)
WRITE PATH:                  READ PATH:
─────────────────────        ─────────────────────────────
User action / command        Application query
        │                        │
        ▼                        ▼
Kafka topic                  Materialized view store
(event log /                   ├─ Redis (low-latency KV)
 source of truth)              ├─ RocksDB (embedded, high throughput)
        │                      ├─ Elasticsearch (full-text)
        ▼                      └─ ClickHouse (OLAP aggregates)
Stream Processor                 ▲
(Flink / Kafka Streams) ─────────┘
        │                                Stream processor maintains the view
        └──────────────────────────────▶ continuously as new events arrive
Key property: If the materialized view is lost or corrupted,
replay the event log to rebuild it.
The event log is the ground truth.
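A minimal sketch of the write path, the read path, and the rebuild-by-replay property. The `OrderService` class, its method names, and the order/status event shape are hypothetical. In a real deployment the log would be a Kafka topic and the view a separate store.

```python
class OrderService:
    def __init__(self):
        self.log = []       # write path: append-only event log (source of truth)
        self.view = {}      # read path: order_id -> latest status
        self._applied = 0   # index of the next log entry to project

    def handle_command(self, order_id, status):
        # Commands only append events; they never touch the view directly.
        self.log.append({"order_id": order_id, "status": status})

    def project(self):
        # Stream-processor role: fold new events into the materialized view.
        for event in self.log[self._applied:]:
            self.view[event["order_id"]] = event["status"]
        self._applied = len(self.log)

    def query(self, order_id):
        return self.view.get(order_id)

    def rebuild_view(self):
        # The view is disposable: discard it and replay the log from the start.
        self.view, self._applied = {}, 0
        self.project()


svc = OrderService()
svc.handle_command("o-1", "placed")
svc.handle_command("o-2", "placed")
svc.handle_command("o-1", "shipped")
svc.project()
healthy = dict(svc.view)

svc.view = {"o-1": "garbage"}  # simulate a corrupted view
svc.rebuild_view()

print(svc.view == healthy)  # True
```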
Correctness by Construction: Checklist
State design:
□ All state changes are explicit events (not silent mutations)
□ State is derived from events, not directly updated
□ Events are immutable once written
Time handling:
□ Event timestamp embedded in the event itself
□ Watermark strategy documented and justified
□ Late event policy defined (drop / accept / retract)
Fault tolerance:
□ Checkpoints configured (interval, retention)
□ Recovery tested: verify reprocessing produces correct output
□ Kafka retention sufficient for reprocessing window
Idempotency:
□ Every external write has a deduplication key
□ Every API call is designed to be safe to retry
□ Exactly-once semantics verified end-to-end (not just within Flink)
Privacy:
□ Personal data encrypted with per-user key before writing to log
□ Right-to-erasure workflow: delete key → events become indecipherable
□ Derived views (search index, OLAP) also updated on erasure
WORLD VIEW: Events are primary; state is derived
Real World                Data System
──────────                ─────────────────────────────
Fact occurs ──────────▶   Event logged (immutable)
(user pays,                      │
 sensor fires,                   ▼
 button clicked)          Stream processor
                          (pure function: events → derived state)
                                 │
                        ┌────────▼──────────────────┐
                        │ Materialized views        │
                        │ (databases, caches,       │
                        │  search indexes)          │
                        └───────────────────────────┘
If derived view is wrong → replay events → rebuild view
Events are never wrong → they record what happened
State can always be fixed → re-derive from events