Chapter 11 Flashcards - Stream Processing

flashcards chapter-11 ddia


Basic Concepts

What is stream processing and how does it differ from batch processing?
?

  • Batch processing: Processes bounded (finite) dataset; waits for all data to arrive; processes in bulk

    • Input: complete dataset; Output: complete result
    • Latency: minutes to hours
    • Example: Nightly job to compute yesterday’s DAU
  • Stream processing: Processes unbounded (infinite) continuous stream of events

    • Input: events arriving continuously; Output: continuously updated results
    • Latency: milliseconds to seconds
    • Example: Real-time fraud detection on every card transaction
  • Relationship: Stream processing is batch processing where the batch is never “complete”

    • Same operators (map, filter, join, aggregate)
    • Different time semantics (windows, watermarks)
    • Different fault tolerance (checkpointing + replay vs rerun from scratch)

What are the key differences between a message queue and a message log?
?
Message queue (RabbitMQ, SQS, ActiveMQ):

  • Push model: Broker pushes messages to consumers
  • Message deleted after consumed by one consumer
  • Competing consumers: Multiple consumers share the load (each message consumed once)
  • No replay: Can’t re-read consumed messages
  • Good for: task queues, job dispatch, work distribution

Message log (Kafka, Kinesis, Pulsar):

  • Pull model: Consumers pull and track their own offset
  • Messages retained on disk (configurable: 7 days, forever, etc.)
  • Fan-out: Multiple independent consumer groups each read all messages
  • Replay: Consumer can re-read from any offset
  • Good for: event streams, CDC, audit log, stream processing, data integration

Messaging and CDC

What is Change Data Capture (CDC) and why is it better than dual writes?
?

  • CDC: Stream changes from a database’s internal replication log (binlog/WAL) to a message queue

  • Dual-write problem: App writes to DB, then to Kafka separately → race condition, partial failure

    • If app crashes between DB write and Kafka write → DB and Kafka out of sync
    • No atomic guarantee across two systems
  • CDC solution:

    • App writes ONLY to the DB (single source of truth)
    • CDC tool (Debezium) reads DB’s replication log → emits events to Kafka
    • All derived systems (Elasticsearch, Redis, Snowflake) consume from Kafka
  • Guarantee: DB and Kafka are always consistent (Kafka is derived from DB, not separately written)

  • Replay: Can rebuild any derived system from the beginning of the CDC log

  • Tools: Debezium (open-source, MySQL/PostgreSQL/MongoDB), AWS DMS, GoldenGate

What is event sourcing and how does it differ from CDC?
?

  • Traditional storage: Store current state (mutable rows); history lost on update

  • Event sourcing: Store immutable sequence of events; current state = fold/reduce over events

  • Example:

    • Traditional: accounts table with balance=900 (the $100 withdrawal is gone)
    • Event sourcing: Event log: [AccountOpened{balance=1000}, MoneyWithdrawn{amount=100}]
    • Current balance = apply all events = 900
  • Differences from CDC:

    • CDC: Low-level DB changes (INSERT/UPDATE/DELETE at SQL level)
    • Event sourcing: High-level domain events (“OrderPlaced”, “ItemShipped”)
    • CDC is infrastructure; Event sourcing is an architectural pattern
  • Advantages of event sourcing:

    • Complete audit trail
    • Time travel: replay to any point in history
    • Multiple projections: derive different views from same event log
    • Append-only log: no concurrent update conflicts

Event Time and Watermarks

What is the difference between event time and processing time in stream processing?
?

  • Event time: When the event actually occurred (timestamp embedded in the event payload)

    • Example: User clicked “buy” at 10:00:00 UTC
    • Accurate; reflects real-world timing
  • Processing time: When the stream processor receives/processes the event

    • Example: Stream processor processed the event at 10:00:07 UTC (7 second delay)
    • Can be much later than event time (network delay, mobile offline sync)
  • When they diverge:

    • Mobile apps buffer events offline; send in bulk hours later
    • Network congestion; events queued in Kafka for minutes
    • Multiple sources with different latencies
  • Best practice:

    • Event time for business metrics (correct, independent of processing delay)
    • Processing time for operational metrics (SLA monitoring, system health)
  • Example problem: “Events in the 10:00 window” — use event time or wrong events included

What is a watermark in stream processing and what are its limitations?
?

  • Watermark: An estimate of the current event time progress — “I believe all events with timestamp < W have arrived”

  • Purpose: Tell the stream processor when it’s safe to close a time window and emit results

  • How calculated:

    • Track maximum event timestamp seen in the stream
    • Subtract expected maximum late-arrival delay: watermark = max_event_time - max_lateness
    • Example: if max lateness = 2 min, and latest event was at 10:05 → watermark = 10:03
  • Window closure: When watermark passes window’s end time → window results emitted

  • Limitations:

    • Heuristic: Can’t know for certain if events will still arrive
    • Trade-off: Low watermark lag = correct but high latency; High watermark lag = fast but may miss late events
    • Adversarial events: One very delayed event can keep watermark from advancing (need max-lateness cap)
    • Different sources: Multiple partitions may have different lags; watermark = min across all

Windows

What are the three main window types in stream processing?
?
1. Tumbling window:

  • Fixed-size, non-overlapping windows
  • Each event belongs to exactly one window
  • Example: Count clicks per 1-minute period (10:00-10:01, 10:01-10:02, …)
  • Use case: Hourly/daily aggregates, rate limiting

2. Sliding window:

  • Overlapping windows of fixed size, advancing by slide interval
  • Each event may appear in multiple windows
  • Example: “Last 10 minutes” metric updated every 1 minute
    • Windows: [10:00-10:10], [10:01-10:11], [10:02-10:12], …
  • Use case: Moving averages, anomaly detection

3. Session window:

  • Variable-size window; ends after a gap in activity
  • Groups events from the same user/key until they’re inactive for gap duration
  • Example: User session ends after 30 minutes of inactivity
  • Use case: User session analytics, activity grouping

Stream Processing Patterns

What are the three types of stream joins?
?
1. Stream-table join (enrichment):

  • Enrich stream events with reference data from a table
  • Table: relatively static (user profiles, product catalog)
  • Implementation: cache table in memory; join at processing time
  • CDC keeps table fresh: Subscribe to table’s CDC stream to update local cache
  • Example: Enrich “order placed” event with user’s VIP status

2. Stream-stream join (event correlation):

  • Correlate events from two different streams within a time window
  • Both streams buffered in state store; match on key + time proximity
  • State grows: must expire old unmatched events
  • Example: Match “search query” events with “click” events within 30 minutes

3. Table-table join (incremental materialized view):

  • Both inputs are change streams from databases (CDC)
  • Maintains join as a continuously updated materialized view
  • Example: Continuously join orders CDC + customers CDC → enriched orders view

Fault Tolerance

What does “exactly-once” mean in stream processing and how is it achieved?
?

  • Exactly-once: Every input event contributes to the output exactly once, even if failures occur and processing is retried

  • Why hard: If processor crashes and replays from checkpoint → events between checkpoint and crash are reprocessed → output may be duplicated

  • Three approaches:

  1. Idempotent writes (simplest):

    • Process event → write output with unique event_id
    • If reprocessed: same event_id → deduplicate at sink → same result
    • Works when output is idempotent (setting a value) or deduplicatable
  2. Kafka Transactions (for Kafka-to-Kafka):

    • Atomically write output records + commit input offsets in one transaction
    • Consumer sees both or neither (atomic)
    • Built into Kafka producer API
  3. Flink Chandy-Lamport Checkpointing:

    • Stream special “barrier” markers alongside data
    • Every operator takes consistent snapshot when barrier passes through
    • Checkpoint = state snapshot + input offset at that point
    • On failure: rollback state + replay input from checkpoint’s offset

How does Flink’s Chandy-Lamport checkpointing achieve exactly-once semantics?
?

  • Algorithm: Adapted Chandy-Lamport distributed snapshot algorithm

  • Process:

    1. JobManager triggers checkpoint; sends barrier message to all sources
    2. Source writes current input offset to checkpoint; forwards barrier downstream
    3. Each operator: when barrier arrives on all inputs:
      • Snapshot its local state (to durable storage: S3/HDFS)
      • Forward barrier to its outputs
    4. Sink: when barrier arrives → register with checkpoint
    5. JobManager: once all operators confirmed → checkpoint complete
  • On failure:

    1. Roll back ALL operators to their last checkpoint state
    2. Sources re-read input from the offset recorded in that checkpoint
    3. Processing resumes from checkpoint; events between checkpoint and failure are replayed
    4. Sinks must accept duplicate writes idempotently OR use 2PC with sink
  • Result: Input offsets + operator state form a consistent snapshot; failure recovery is safe

Modern Context (2026)

How is the streaming lakehouse changing data architecture in 2026?
?

  • Traditional: Separate systems for streaming (Kafka) + batch analytics (Snowflake/Spark)

    • Two data copies, ETL jobs to move between them
    • Analytics always lagged behind real-time by hours
  • Streaming Lakehouse (2026 approach):

    • Flink → Apache Iceberg: Stream processor writes directly to Iceberg tables on S3
      • ACID streaming writes; immediately queryable
    • Spark Structured Streaming → Delta Lake: Micro-batch streaming to Delta tables
    • Single data store: Same Iceberg/Delta tables read by both streaming and batch consumers
  • Benefits:

    • Near-real-time analytics (seconds lag) without separate real-time DB
    • One data format (Parquet + transaction log) for all consumers
    • Streaming and batch jobs use same format; no separate ETL
  • Use case example: Flink writes user events to Iceberg; Trino/DuckDB queries events within minutes of occurrence; same table used for overnight batch aggregation

What are feature stores and how do they connect stream processing with ML?
?

  • Feature store: Centralized storage for ML features used in model training and inference

    • Offline feature store: Historical features for training (Parquet files, data warehouse)
    • Online feature store: Low-latency feature serving for real-time inference (Redis, DynamoDB)
  • Stream processing’s role:

    1. Events arrive in Kafka (user actions, transactions, etc.)
    2. Stream processor (Flink) computes real-time features
      • Example: “user’s last 10 purchases”, “average click-through rate last hour”
    3. Features written to online store (Redis) for low-latency serving
    4. Features also written to offline store (S3) for model retraining
  • Why important: Without feature store:

    • Training features computed differently from serving features → training-serving skew
    • Real-time features not available without implementing separately in each model
  • Products: Feast (open-source), Tecton, Hopsworks, Databricks Feature Store, AWS SageMaker Feature Store

Interview Scenarios

Design a real-time fraud detection system for credit card transactions.
?
Event flow: Transaction → Kafka → Fraud detector → Allow/Deny

Stream processing architecture (Apache Flink):

Features (stateful aggregations over event time):

  • Last 1-hour spending velocity per card
  • Count of transactions per merchant per hour
  • Count of failed attempts per card per day
  • Geographic distance between consecutive transactions
  • Average transaction amount deviation

State management:

  • Per-card state: rolling window aggregates (RocksDB in Flink)
  • State TTL: expire stale card state after 30 days of inactivity

Stream-table join:

  • Enrich transaction with cardholder profile (risk score, usual spending patterns)
  • Table updated via CDC from customer database

Decision:

  • Score transaction based on features
  • If score > threshold: decline immediately (synchronous response needed → very low latency)
  • Alternatively: approve + flag for async review (async fraud detection)

Latency target: <100ms for real-time scoring (within payment auth flow)

Fault tolerance: Flink with exactly-once checkpointing; checkpoint every 30s

When should you use CDC vs event sourcing for data integration across services?
?
Use CDC when:

  • You have an existing database-first application
  • You need to keep derived systems (search, cache, analytics) in sync with the DB
  • You don’t want to change application code significantly
  • Database is the source of truth; other systems are consumers
  • Example: Sync MySQL users table → Elasticsearch for user search

Use event sourcing when:

  • Building a new system from scratch (or major rewrite)
  • You need complete audit trail + time travel as core feature
  • Business domain is naturally event-driven (orders, payments, reservations)
  • You need multiple different “projections” from the same history
  • Strong eventual consistency is acceptable
  • Example: Order management system where every state change is a domain event

Hybrid (common in practice):

  • Core business logic uses event sourcing (domain events in EventStoreDB or Kafka)
  • Infrastructure integration uses CDC where needed
  • Materialized views (read models) derived from event stream

Key decision: “Is the database or the event log the primary source of truth?”

Quick Facts

What does Kafka consumer lag measure and why does it matter?
?

  • Consumer lag: Difference between producer’s latest offset and consumer’s last committed offset

    • Lag = 0: consumer is up-to-date
    • Lag = 10,000: consumer is 10,000 messages behind
  • Why it matters:

    • High lag = stream processing is not keeping up with incoming data
    • For real-time systems: lag translates directly to latency (processing events from 5 min ago)
    • Unbounded lag growth = consumer will never catch up → need more resources
  • Monitoring:

    • Alert when lag exceeds a threshold (e.g., 1 million messages or 5 minutes of lag)
    • Track lag per partition (one slow partition can block the whole consumer group)
    • Metrics: kafka_consumer_group_lag in Prometheus/Grafana
  • Causes of growing lag:

    • Consumer too slow: increase parallelism (add more consumer instances)
    • Processing too expensive: optimize code or reduce work per event
    • Spiky input: auto-scale consumer instances; provision for peak load

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13