Chapter 13 Cheat Sheet — A Philosophy of Streaming Systems

One-Line Summaries

ConceptOne-Liner
Streaming as fundamental paradigmEvents are primary; databases are derived materialized views
Kappa architectureSingle stream processor; replay log for historical reprocessing; Lambda is dead
Event time vs processing timeEvent time = when it happened (truth); processing time = when processor saw it (approximation)
WatermarksHeuristic assertion: “all events with time ≤ T have arrived”; enables window closure
Exactly-onceObservable output appears exactly once; internally may re-process; idempotency is the practical tool
Materialized viewsPre-computed query results updated continuously as new events arrive
Stream-table dualityA stream is a table’s changelog; a table is a stream’s materialized snapshot
Correctness by constructionDesign state as events, derivations as pure functions, side effects as idempotent operations
Fault tolerance = replayabilityImmutable log + checkpoints = recovery from any failure by replay

Lambda vs. Kappa — Side by Side

LAMBDA (antipattern as of 2026):
                                   ┌─ Batch Layer (Spark) ──────────────────┐
Input ──────────────────────────── ┤                                         ├──▶ Serving Layer
                                   └─ Speed Layer (Flink/Storm) ─────────────┘

Problems:
  Two code paths (same computation implemented twice)
  Batch and speed outputs must be merged — complex
  Bugs in one don't surface in the other
  Doubles operational complexity

KAPPA (correct model):

Input ──▶ Kafka (durable log) ──▶ Stream Processor (Flink) ──▶ Serving Layer
                  │
                  └── Historical reprocessing: replay from offset 0
                      Same code, same operator, just bounded input

Benefits:
  One code path — one thing to test, debug, operate
  Historical + real-time handled identically
  Replay enables retroactive bug fixes
  Kafka is the recovery mechanism

Time Concepts Quick Reference

Event timeline (mobile user, offline for 20 minutes):

Real world:  [11:58 PM] ──── [11:59 PM] ──── [12:01 AM]
             Purchase A       Purchase B       Purchase C
                    ↓               ↓               ↓
Phone buffers events while offline ...
                    ↓
             [12:20 AM] — events arrive at stream processor

Processing time:  All three events arrive at 12:20 AM → go in 12 AM bucket
Event time:       A and B belong in 11 PM bucket; C belongs in 12 AM bucket

For revenue reporting, event time is correct.
For monitoring pipeline health, processing time is appropriate.

Watermark Decision Framework

Choosing watermark strategy:

  What is the max acceptable latency for window results?
  │
  ├─ Sub-second (real-time dashboard)
  │    └─▶ Aggressive watermark (small allowed lateness)
  │         Accept some late event dropping
  │
  ├─ Seconds to minutes (operational metrics)
  │    └─▶ Moderate watermark (e.g., 30s allowed lateness)
  │         Emit preliminary result + corrections via retractions
  │
  └─ Batch-style (revenue reporting, daily aggregates)
       └─▶ Conservative watermark (minutes of allowed lateness)
            Wait for stragglers; completeness over speed

Formula: watermark = max(event_time_seen) - allowed_lateness

Exactly-Once: What It Means and How to Get It

MISCONCEPTION:
  "Exactly-once means the processor runs the code exactly once per message"
  ─ FALSE. Messages can be re-delivered and re-processed internally.

CORRECT DEFINITION:
  "The observable effect in the output appears exactly once"
  ─ The output is indistinguishable from having processed each message once.

MECHANISMS (choose based on output target):

  Output to Kafka:
    ┌─ Idempotent producer (PID + sequence number)
    └─ Kafka transactions (atomic offset commit + record write)
    Result: exactly-once within Kafka topology

  Output to external database:
    └─ Idempotent write with dedup key
       Application generates UUID per event; sink deduplicates
    Result: exactly-once if sink supports deduplication

  Output to external API (email, payment):
    └─ Cannot use Kafka transactions across the boundary
       Must design the operation to be idempotent:
         Bad:  "charge $10 on event X" (not idempotent)
         Good: "set balance to $Y if event X not yet applied" (idempotent)
    Result: exactly-once requires idempotent operation design

END-TO-END RULE:
  Exactly-once in the stream processor does not automatically mean
  exactly-once application behavior. Must design idempotency at
  every external interaction boundary.

Streaming System State Types

State TypeScopeExampleScale Pattern
Operator stateSingle operator instanceKafka consumer offsetRedistributed on rescale
Keyed statePer key, per operatorUser’s running total, session dataShards with key space
Window stateWithin a time windowEvents in last 5-minute windowCleared on window close
Broadcast stateAll instancesCurrency lookup tableReplicated to all

Fault Tolerance Checkpoint Flow

Flink Distributed Snapshot (Chandy-Lamport inspired):

1. Checkpoint coordinator sends BARRIER to source operators
2. Source: snapshot local state → emit BARRIER into stream
3. Operator: receive BARRIER on all inputs →
              snapshot state → forward BARRIER downstream
4. Sink: acknowledge BARRIER → checkpoint complete

On failure:
  1. Restore all operators to last complete checkpoint state
  2. Replay events from Kafka at the offset saved in checkpoint
  3. Normal processing resumes

RocksDB state backend (large state):
  ├─ State stored on local SSD (RocksDB)
  ├─ Checkpoint = upload delta SST files to S3
  └─ Recovery = restore RocksDB from S3 + replay delta since checkpoint

Stream-Table Duality

STREAM → TABLE (materialization):
  Process stream events, accumulate in KV store
  Key: user_id
  Value: latest event or aggregated state
  Table = current snapshot

TABLE → STREAM (change data capture):
  Emit an event for every INSERT, UPDATE, DELETE
  CDC (Debezium, Kafka Connect)
  Stream = history of changes

CONSEQUENCE:
  Kafka topic with compaction = table (only latest value per key survives)
  Kafka topic without compaction = stream (full history)

In Kafka Streams:
  KTable = automatically materialized from topic (latest value per key)
  KStream = unbounded sequence of events (no materialization)

Materialized View Architecture (CQRS Pattern)

WRITE PATH:                              READ PATH:
─────────────────────                    ─────────────────────────────
User action / command                    Application query
        │                                        │
        ▼                                        ▼
  Kafka topic                           Materialized view store
  (event log /                          ├─ Redis (low-latency KV)
   source of truth)                     ├─ RocksDB (embedded, high throughput)
        │                               ├─ Elasticsearch (full-text)
        ▼                               └─ ClickHouse (OLAP aggregates)
  Stream Processor                               ▲
  (Flink / Kafka Streams)               ─────────┘
        │                               Stream processor maintains the view
        └──────────────────────────────▶ continuously as new events arrive

Key property: If the materialized view is lost or corrupted,
              replay the event log to rebuild it.
              The event log is the ground truth.

Correctness by Construction: Checklist

State design:
  □ All state changes are explicit events (not silent mutations)
  □ State is derived from events, not directly updated
  □ Events are immutable once written

Time handling:
  □ Event timestamp embedded in the event itself
  □ Watermark strategy documented and justified
  □ Late event policy defined (drop / accept / retract)

Fault tolerance:
  □ Checkpoints configured (interval, retention)
  □ Recovery tested: verify reprocessing produces correct output
  □ Kafka retention sufficient for reprocessing window

Idempotency:
  □ Every external write has a deduplication key
  □ Every API call is designed to be safe to retry
  □ Exactly-once semantics verified end-to-end (not just within Flink)

Privacy:
  □ Personal data encrypted with per-user key before writing to log
  □ Right-to-erasure workflow: delete key → events become indecipherable
  □ Derived views (search index, OLAP) also updated on erasure

Quick Comparison: Stream Processing Engines (2026)

FeatureFlinkKafka StreamsSpark Structured Streaming
Exactly-onceYesYesYes
Event time + watermarksFull supportLimitedFull support
Stateful processingRich (keyed, windowed, broadcast)Yes (KTable)Limited (state store)
SQL interfaceFlink SQL (streaming)KSQL via ksqlDBSpark SQL (streaming)
DeploymentStandalone / KubernetesEmbedded in appSpark cluster
State backendRocksDB / heap / PaimonRocksDBMemory / checkpoint
LatencySub-secondSub-secondMicro-batch (~100ms)
Best forComplex stateful, event-time, exactly-onceSimple stream processing, Kafka-nativeUnified batch + streaming, ML pipelines

The Philosophy in One Diagram

WORLD VIEW: Events are primary; state is derived

   Real World                 Data System
   ──────────                 ─────────────────────────────
   Fact occurs ──────────▶   Event logged (immutable)
   (user pays,                       │
    sensor fires,                    ▼
    button clicked)          Stream processor
                             (pure function: events → derived state)
                                      │
                             ┌────────▼──────────────────┐
                             │  Materialized views        │
                             │  (databases, caches,       │
                             │   search indexes)          │
                             └───────────────────────────┘
                             
   If derived view is wrong → replay events → rebuild view
   Events are never wrong   → they record what happened
   State can always be fixed → re-derive from events

Quick Revision Time: 6 minutes
Interview Prep: 15 minutes
Last Updated: 2026-05-29