Chapter 11 Flashcards - Stream Processing

flashcards chapter-11 ddia

Basic Concepts

What is stream processing and how does it differ from batch processing?
?

Batch processing: Processes bounded (finite) dataset; waits for all data to arrive; processes in bulk
- Input: complete dataset; Output: complete result
- Latency: minutes to hours
- Example: Nightly job to compute yesterday’s DAU
Stream processing: Processes unbounded (infinite) continuous stream of events
- Input: events arriving continuously; Output: continuously updated results
- Latency: milliseconds to seconds
- Example: Real-time fraud detection on every card transaction
Relationship: Stream processing is batch processing where the batch is never “complete”
- Same operators (map, filter, join, aggregate)
- Different time semantics (windows, watermarks)
- Different fault tolerance (checkpointing + replay vs rerun from scratch)
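
A minimal sketch of the "same operators" point, assuming events are plain user IDs: the aggregation logic is identical, but the batch version returns one final result while the streaming version emits an updated result after every event. The function names are illustrative only.

```python
def batch_count(events):
    """Batch: consume the whole bounded dataset, then return one result."""
    counts = {}
    for user in events:
        counts[user] = counts.get(user, 0) + 1
    return counts


def stream_count(events):
    """Stream: same aggregation, but emit an updated result per event."""
    counts = {}
    for user in events:
        counts[user] = counts.get(user, 0) + 1
        yield dict(counts)  # snapshot of the continuously updated result


events = ["u1", "u2", "u1"]
batch_result = batch_count(events)
stream_results = list(stream_count(events))
```

On a bounded input the last streamed snapshot equals the batch result; on an unbounded input there is no "last" snapshot, which is why windows and watermarks are needed.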

What are the key differences between a message queue and a message log?
?
Message queue (RabbitMQ, SQS, ActiveMQ):

Push model (typical): Broker delivers messages to consumers (RabbitMQ and ActiveMQ push; SQS is polled by consumers)
Message deleted once a consumer acknowledges it
Competing consumers: Multiple consumers share the load (each message consumed once)
No replay: Can’t re-read consumed messages
Good for: task queues, job dispatch, work distribution

Message log (Kafka, Kinesis, Pulsar):

Pull model: Consumers pull and track their own offset
Messages retained on disk (configurable: 7 days, forever, etc.)
Fan-out: Multiple independent consumer groups each read all messages
Replay: Consumer can re-read from any offset
Good for: event streams, CDC, audit log, stream processing, data integration
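
The contrast can be sketched with two toy in-memory classes. This is an illustration under simplifying assumptions (single process, no acknowledgements, no retention limits), and the class and group names are hypothetical, not any broker API.

```python
from collections import deque


class MessageQueue:
    """Queue semantics: a message is gone once one consumer has taken it."""

    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)

    def receive(self):
        # Competing consumers all call this; each message is delivered once.
        return self._messages.popleft() if self._messages else None


class MessageLog:
    """Log semantics: records are retained; each group tracks its own offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group -> next offset to read

    def append(self, msg):
        self._records.append(msg)

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        # Replay: rewind a group to any retained offset.
        self._offsets[group] = offset


queue, log = MessageQueue(), MessageLog()
for event in ["a", "b", "c"]:
    queue.publish(event)
    log.append(event)

worker_1 = queue.receive()                 # one worker gets "a"
worker_2 = queue.receive()                 # another worker gets "b"
search_batch = log.poll("search-indexer")  # group 1 reads everything
analytics_batch = log.poll("analytics")    # group 2 also reads everything
log.seek("search-indexer", 1)
replayed = log.poll("search-indexer")      # re-read from offset 1
```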

Messaging and CDC

What is Change Data Capture (CDC) and why is it better than dual writes?
?

CDC: Stream changes from a database’s internal replication log (binlog/WAL) to a message log such as Kafka
Dual-write problem: App writes to DB, then to Kafka separately → race condition, partial failure
- If app crashes between DB write and Kafka write → DB and Kafka out of sync
- No atomic guarantee across two systems
CDC solution:
- App writes ONLY to the DB (single source of truth)
- CDC tool (Debezium) reads DB’s replication log → emits events to Kafka
- All derived systems (Elasticsearch, Redis, Snowflake) consume from Kafka
Guarantee: Kafka never diverges from the DB, because it is derived from the DB’s log rather than written separately; it may lag slightly behind (eventually consistent)
Replay: Can rebuild any derived system from the beginning of the CDC log
Tools: Debezium (open-source, MySQL/PostgreSQL/MongoDB), AWS DMS, GoldenGate
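
A minimal sketch of how a derived system consumes a change stream. The event shape (`op`, `key`, `after`) is a simplified assumption for illustration and is not the exact Debezium message format.

```python
def apply_change(view, change):
    """Apply one change event to a derived key-value view."""
    if change["op"] in ("insert", "update"):
        view[change["key"]] = change["after"]   # upsert is idempotent
    elif change["op"] == "delete":
        view.pop(change["key"], None)
    return view


def rebuild(changes):
    """Rebuild a derived view from the beginning of the change log."""
    view = {}
    for change in changes:
        apply_change(view, change)
    return view


change_log = [
    {"op": "insert", "key": 1, "after": {"name": "Ada"}},
    {"op": "insert", "key": 2, "after": {"name": "Bob"}},
    {"op": "update", "key": 1, "after": {"name": "Ada L."}},
    {"op": "delete", "key": 2, "after": None},
]

search_index = rebuild(change_log)   # stands in for Elasticsearch
cache = rebuild(change_log)          # stands in for Redis
```

Because every derived view is computed from the same ordered log, two consumers that have read to the same offset hold the same data.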

What is event sourcing and how does it differ from CDC?
?

Traditional storage: Store current state (mutable rows); history lost on update
Event sourcing: Store immutable sequence of events; current state = fold/reduce over events
Example:
- Traditional: accounts table with balance=900 (the $100 withdrawal is gone)
- Event sourcing: Event log: [AccountOpened{balance=1000}, MoneyWithdrawn{amount=100}]
- Current balance = apply all events = 900
Differences from CDC:
- CDC: Low-level DB changes (INSERT/UPDATE/DELETE at SQL level)
- Event sourcing: High-level domain events (“OrderPlaced”, “ItemShipped”)
- CDC is infrastructure; Event sourcing is an architectural pattern
Advantages of event sourcing:
- Complete audit trail
- Time travel: replay to any point in history
- Multiple projections: derive different views from same event log
- Append-only log: no concurrent update conflicts
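
The "current state = fold/reduce over events" idea can be sketched directly with the account example above. The tuple-based event representation is an assumption made for brevity.

```python
from functools import reduce


def apply_event(balance, event):
    """Fold step: previous state + one event -> next state."""
    kind, payload = event
    if kind == "AccountOpened":
        return payload["balance"]
    if kind == "MoneyWithdrawn":
        return balance - payload["amount"]
    raise ValueError(f"unknown event type: {kind}")


events = [
    ("AccountOpened", {"balance": 1000}),
    ("MoneyWithdrawn", {"amount": 100}),
]

current_balance = reduce(apply_event, events, 0)
# Time travel: replay only a prefix of the log.
balance_before_withdrawal = reduce(apply_event, events[:1], 0)
```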

Event Time and Watermarks

What is the difference between event time and processing time in stream processing?
?

Event time: When the event actually occurred (timestamp embedded in the event payload)
- Example: User clicked “buy” at 10:00:00 UTC
- Accurate; reflects real-world timing
Processing time: When the stream processor receives/processes the event
- Example: Stream processor processed the event at 10:00:07 UTC (7 second delay)
- Can be much later than event time (network delay, mobile offline sync)
When they diverge:
- Mobile apps buffer events offline; send in bulk hours later
- Network congestion; events queued in Kafka for minutes
- Multiple sources with different latencies
Best practice:
- Event time for business metrics (correct, independent of processing delay)
- Processing time for operational metrics (SLA monitoring, system health)
Example problem: Counting “events in the 10:00 window” requires event time; with processing time, late-arriving events are counted in the wrong window
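
A small sketch of how the two notions of time disagree, assuming each event carries both timestamps. The sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def minute_bucket(ts):
    """Truncate a timestamp to its 1-minute window start."""
    return ts.replace(second=0, microsecond=0)


# (event_time, processing_time) pairs
events = [
    (datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 0, 7)),
    (datetime(2024, 1, 1, 9, 59, 50), datetime(2024, 1, 1, 10, 0, 5)),   # network delay
    (datetime(2024, 1, 1, 9, 58, 30), datetime(2024, 1, 1, 10, 0, 40)),  # offline buffer
]

by_event_time = Counter(minute_bucket(e) for e, _ in events)
by_processing_time = Counter(minute_bucket(p) for _, p in events)
```

Grouping by processing time puts all three events into the 10:00 window, while grouping by event time shows that only one of them actually happened in that minute.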

What is a watermark in stream processing and what are its limitations?
?

Watermark: An estimate of the current event time progress — “I believe all events with timestamp < W have arrived”
Purpose: Tell the stream processor when it’s safe to close a time window and emit results
How calculated:
- Track maximum event timestamp seen in the stream
- Subtract expected maximum late-arrival delay: watermark = max_event_time - max_lateness
- Example: if max lateness = 2 min, and latest event was at 10:05 → watermark = 10:03
Window closure: When watermark passes window’s end time → window results emitted
Limitations:
- Heuristic: Can’t know for certain if events will still arrive
- Trade-off: Small max-lateness allowance = low latency but more late events missed; large allowance = more complete results but higher latency
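- Outlier timestamps and idle sources: One event with a far-future timestamp pushes the watermark ahead so on-time events look late; an idle partition holds the minimum back so windows never close (cap timestamp skew, mark idle sources)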
- Different sources: Multiple partitions may have different lags; watermark = min across all
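
A minimal sketch of the heuristic described above (max event time minus max lateness, minimum across partitions). Times are integers in minutes since midnight for readability, and the class is hypothetical rather than any framework API; a partition that has not reported yet is simply ignored here, which a real system would need to handle.

```python
class WatermarkTracker:
    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = {}  # partition -> largest event time seen

    def observe(self, partition, event_time):
        current = self.max_event_time.get(partition, event_time)
        self.max_event_time[partition] = max(current, event_time)

    def watermark(self):
        if not self.max_event_time:
            return None
        # The slowest partition determines overall progress.
        return min(self.max_event_time.values()) - self.max_lateness

    def can_close(self, window_end):
        wm = self.watermark()
        return wm is not None and wm >= window_end


tracker = WatermarkTracker(max_lateness=2)
tracker.observe("p0", 605)                 # event at 10:05
wm_single = tracker.watermark()            # 10:03, as in the example above
tracker.observe("p1", 600)                 # slower partition, event at 10:00
wm_two = tracker.watermark()               # held back by p1
closable_early = tracker.can_close(600)    # window ending 10:00
tracker.observe("p1", 604)
closable_later = tracker.can_close(600)
```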

Windows

What are the three main window types in stream processing?
?
1. Tumbling window:

Fixed-size, non-overlapping windows
Each event belongs to exactly one window
Example: Count clicks per 1-minute period (10:00-10:01, 10:01-10:02, …)
Use case: Hourly/daily aggregates, rate limiting

2. Sliding window:

Overlapping windows of fixed size, advancing by slide interval
Each event may appear in multiple windows
Example: “Last 10 minutes” metric updated every 1 minute
- Windows: [10:00-10:10], [10:01-10:11], [10:02-10:12], …
Use case: Moving averages, anomaly detection

3. Session window:

Variable-size window; ends after a gap in activity
Groups events from the same user/key until they’re inactive for gap duration
Example: User session ends after 30 minutes of inactivity
Use case: User session analytics, activity grouping
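
The three window types can be sketched as pure window-assignment functions over integer timestamps. This is an illustration under simple assumptions (windows are half-open `[start, end)` ranges aligned to zero; session input is a complete list for one key), not a framework API.

```python
def tumbling_window(ts, size):
    """The single fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Every window of length `size`, starting on a multiple of `slide`, that contains ts."""
    windows = []
    start = ts - ts % slide          # latest aligned start <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group one key's events into sessions separated by >= gap of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts      # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    return [tuple(s) for s in sessions]


one_minute = tumbling_window(125, size=60)            # seconds
last_ten = sliding_windows(12, size=10, slide=1)      # minutes
sessions = session_windows([0, 600, 1500, 4000], gap=1800)  # 30-minute gap
```

An event belongs to exactly one tumbling window, to `size / slide` sliding windows, and to whichever session its neighbours place it in.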

Stream Processing Patterns

What are the three types of stream joins?
?
1. Stream-table join (enrichment):

Enrich stream events with reference data from a table
Table: relatively static (user profiles, product catalog)
Implementation: cache table in memory; join at processing time
CDC keeps table fresh: Subscribe to table’s CDC stream to update local cache
Example: Enrich “order placed” event with user’s VIP status

2. Stream-stream join (event correlation):

Correlate events from two different streams within a time window
Both streams buffered in state store; match on key + time proximity
State grows: must expire old unmatched events
Example: Match “search query” events with “click” events within 30 minutes

3. Table-table join (incremental materialized view):

Both inputs are change streams from databases (CDC)
Maintains join as a continuously updated materialized view
Example: Continuously join orders CDC + customers CDC → enriched orders view
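
A sketch of the first two join types with toy in-memory state. It assumes events arrive in timestamp order and that the click timestamp can serve as the clock for expiry; the class, field, and key names are hypothetical.

```python
class StreamTableJoin:
    """Enrich order events with a local table copy kept fresh by CDC."""

    def __init__(self):
        self.users = {}  # user_id -> latest row

    def on_table_change(self, user_id, row):
        if row is None:
            self.users.pop(user_id, None)   # delete
        else:
            self.users[user_id] = row       # insert / update

    def on_event(self, order):
        profile = self.users.get(order["user_id"], {})
        return {**order, "vip": profile.get("vip", False)}


class StreamStreamJoin:
    """Match clicks to earlier searches from the same session within a window."""

    def __init__(self, window):
        self.window = window
        self.searches = {}  # session_id -> (timestamp, query)

    def on_search(self, session_id, ts, query):
        self.searches[session_id] = (ts, query)

    def on_click(self, session_id, ts, url):
        self._expire(ts)
        match = self.searches.get(session_id)
        return {"query": match[1], "url": url} if match else None

    def _expire(self, now):
        # Bound state size: drop searches older than the join window.
        self.searches = {k: v for k, v in self.searches.items()
                         if now - v[0] <= self.window}


enricher = StreamTableJoin()
enricher.on_table_change(7, {"vip": True})
enriched = enricher.on_event({"order_id": 1, "user_id": 7})

correlator = StreamStreamJoin(window=30 * 60)
correlator.on_search("s1", 0, "shoes")
matched = correlator.on_click("s1", 600, "/p/1")
too_late = correlator.on_click("s1", 4000, "/p/2")
```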

Fault Tolerance

What does “exactly-once” mean in stream processing and how is it achieved?
?

Exactly-once: Every input event contributes to the output exactly once, even if failures occur and processing is retried
Why hard: If processor crashes and replays from checkpoint → events between checkpoint and crash are reprocessed → output may be duplicated
Three approaches:

Idempotent writes (simplest):
- Process event → write output with unique event_id
- If reprocessed: same event_id → deduplicate at sink → same result
- Works when output is idempotent (setting a value) or deduplicatable
Kafka Transactions (for Kafka-to-Kafka):
- Atomically write output records + commit input offsets in one transaction
- Consumers reading with read_committed isolation see both or neither (atomic)
- Built into Kafka producer API
Flink Chandy-Lamport Checkpointing:
- Stream special “barrier” markers alongside data
- Every operator takes consistent snapshot when barrier passes through
- Checkpoint = state snapshot + input offset at that point
- On failure: rollback state + replay input from checkpoint’s offset
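
A minimal sketch of the idempotent-write approach, assuming every event carries a unique `event_id` and the sink can remember which IDs it has applied. The sink class is a stand-in for a real datastore.

```python
class IdempotentSink:
    def __init__(self):
        self.seen_event_ids = set()
        self.total = 0

    def write(self, event_id, amount):
        """Apply the event once; return False if it was a duplicate."""
        if event_id in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event_id)
        self.total += amount
        return True


events = [("e1", 10), ("e2", 20), ("e3", 30)]
sink = IdempotentSink()
for event_id, amount in events:
    sink.write(event_id, amount)

# Simulated crash: the processor restarts from a checkpoint taken after e1
# and replays e2 and e3.
replay_results = [sink.write(event_id, amount) for event_id, amount in events[1:]]
```

The replay is at-least-once delivery, but the deduplication makes the effect on the sink exactly-once.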

How does Flink’s Chandy-Lamport checkpointing achieve exactly-once semantics?
?

Algorithm: Adapted Chandy-Lamport distributed snapshot algorithm
Process:
1. JobManager triggers checkpoint; sends barrier message to all sources
2. Source writes current input offset to checkpoint; forwards barrier downstream
3. Each operator: when barrier arrives on all inputs:
  - Snapshot its local state (to durable storage: S3/HDFS)
  - Forward barrier to its outputs
4. Sink: when barrier arrives → acknowledge the checkpoint (a transactional sink pre-commits its pending output at this point)
5. JobManager: once all operators confirmed → checkpoint complete
On failure:
1. Roll back ALL operators to their last checkpoint state
2. Sources re-read input from the offset recorded in that checkpoint
3. Processing resumes from checkpoint; events between checkpoint and failure are replayed
4. Sinks must accept duplicate writes idempotently OR use 2PC with sink
Result: Input offsets + operator state form a consistent snapshot; failure recovery is safe
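
The recovery property can be sketched with a single-operator toy job. With only one operator there is nothing to align, so the barrier mechanics collapse into "snapshot offset and state together"; this is an illustration of the consistency idea, not Flink code.

```python
import copy


class CountingJob:
    def __init__(self, source):
        self.source = source          # replayable input, e.g. a log partition
        self.offset = 0               # next input position to read
        self.counts = {}              # operator state
        self.checkpoint = (0, {})     # (input offset, state snapshot)

    def process(self, n):
        for key in self.source[self.offset:self.offset + n]:
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def take_checkpoint(self):
        # Offset and state are captured together, so they are consistent.
        self.checkpoint = (self.offset, copy.deepcopy(self.counts))

    def recover(self):
        offset, counts = self.checkpoint
        self.offset = offset
        self.counts = copy.deepcopy(counts)


job = CountingJob(["a", "b", "a", "c", "a"])
job.process(2)
job.take_checkpoint()
job.process(2)          # this work is lost in the crash
job.recover()           # roll back state, rewind input
job.process(3)          # replay from the checkpointed offset
```

After recovery the state is the same as in a failure-free run, because the events processed after the checkpoint were rolled back before being replayed.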

Modern Context (2026)

How is the streaming lakehouse changing data architecture in 2026?
?

Traditional: Separate systems for streaming (Kafka) + batch analytics (Snowflake/Spark)
- Two data copies, ETL jobs to move between them
- Analytics typically lagged behind real-time by hours
Streaming Lakehouse (2026 approach):
- Flink → Apache Iceberg: Stream processor writes directly to Iceberg tables on S3
  - ACID streaming writes; immediately queryable
- Spark Structured Streaming → Delta Lake: Micro-batch streaming to Delta tables
- Single data store: Same Iceberg/Delta tables read by both streaming and batch consumers
Benefits:
- Near-real-time analytics (seconds to minutes of lag, depending on commit interval) without a separate real-time DB
- One data format (Parquet + transaction log) for all consumers
- Streaming and batch jobs use same format; no separate ETL
Use case example: Flink writes user events to Iceberg; Trino/DuckDB queries events within minutes of occurrence; same table used for overnight batch aggregation

What are feature stores and how do they connect stream processing with ML?
?

Feature store: Centralized storage for ML features used in model training and inference
- Offline feature store: Historical features for training (Parquet files, data warehouse)
- Online feature store: Low-latency feature serving for real-time inference (Redis, DynamoDB)
Stream processing’s role:
1. Events arrive in Kafka (user actions, transactions, etc.)
2. Stream processor (Flink) computes real-time features
  - Example: “user’s last 10 purchases”, “average click-through rate last hour”
3. Features written to online store (Redis) for low-latency serving
4. Features also written to offline store (S3) for model retraining
Why important: Without feature store:
- Training features computed differently from serving features → training-serving skew
- Real-time features not available without implementing separately in each model
Products: Feast (open-source), Tecton, Hopsworks, Databricks Feature Store, AWS SageMaker Feature Store
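
A small sketch of how one feature definition can feed both stores, which is what prevents training-serving skew. The dict and list stand in for an online store (Redis) and an offline store (S3); the feature name `ctr_1h` and the event fields are assumptions for illustration.

```python
def avg_ctr_last_hour(events, now):
    """Click-through rate over the last hour (timestamps in seconds)."""
    recent = [e for e in events if now - 3600 < e["ts"] <= now]
    impressions = sum(1 for e in recent if e["type"] == "impression")
    clicks = sum(1 for e in recent if e["type"] == "click")
    return clicks / impressions if impressions else 0.0


online_store = {}    # latest value per (user, feature), for inference
offline_store = []   # timestamped history, for training


def publish_feature(user_id, events, now):
    value = avg_ctr_last_hour(events, now)   # computed once, written twice
    online_store[(user_id, "ctr_1h")] = value
    offline_store.append(
        {"user_id": user_id, "feature": "ctr_1h", "ts": now, "value": value}
    )
    return value


user_events = [
    {"ts": 100, "type": "click"},        # older than one hour, ignored
    {"ts": 2000, "type": "impression"},
    {"ts": 2500, "type": "impression"},
    {"ts": 3000, "type": "impression"},
    {"ts": 3000, "type": "click"},
    {"ts": 4500, "type": "impression"},
]
ctr = publish_feature("u1", user_events, now=5000)
```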

Interview Scenarios

Design a real-time fraud detection system for credit card transactions.
?
Event flow: Transaction → Kafka → Fraud detector → Allow/Deny

Stream processing architecture (Apache Flink):

Features (stateful aggregations over event time):

Last 1-hour spending velocity per card
Count of transactions per merchant per hour
Count of failed attempts per card per day
Geographic distance between consecutive transactions
Average transaction amount deviation

State management:

Per-card state: rolling window aggregates (RocksDB in Flink)
State TTL: expire stale card state after 30 days of inactivity

Stream-table join:

Enrich transaction with cardholder profile (risk score, usual spending patterns)
Table updated via CDC from customer database

Decision:

Score transaction based on features
If score > threshold: decline immediately (synchronous response needed → very low latency)
Alternatively: approve + flag for async review (async fraud detection)

Latency target: <100ms for real-time scoring (within payment auth flow)

Fault tolerance: Flink with exactly-once checkpointing; checkpoint every 30s
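
One of the features above, the 1-hour spending velocity per card, can be sketched as keyed rolling state. It assumes transactions for a card arrive in event-time order, and the decline limit of 1000 is a made-up threshold; a production job would keep this state in the stream processor state backend instead of a Python dict.

```python
from collections import defaultdict, deque


class SpendingVelocity:
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.history = defaultdict(deque)  # card_id -> deque of (event_time, amount)

    def update(self, card_id, event_time, amount):
        """Record a transaction and return the card's spend in the window."""
        txns = self.history[card_id]
        txns.append((event_time, amount))
        # Evict transactions that fell out of the rolling window.
        while txns and txns[0][0] <= event_time - self.window:
            txns.popleft()
        return sum(a for _, a in txns)


def decide(velocity, limit=1000):
    return "decline" if velocity > limit else "allow"


velocity = SpendingVelocity()
decisions = [
    decide(velocity.update("card-1", 0, 400)),
    decide(velocity.update("card-1", 1800, 500)),
    decide(velocity.update("card-1", 3000, 200)),   # 1100 within the hour
    decide(velocity.update("card-1", 4000, 50)),    # first txn has expired
]
```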

When should you use CDC vs event sourcing for data integration across services?
?
Use CDC when:

You have an existing database-first application
You need to keep derived systems (search, cache, analytics) in sync with the DB
You don’t want to change application code significantly
Database is the source of truth; other systems are consumers
Example: Sync MySQL users table → Elasticsearch for user search

Use event sourcing when:

Building a new system from scratch (or major rewrite)
You need complete audit trail + time travel as core feature
Business domain is naturally event-driven (orders, payments, reservations)
You need multiple different “projections” from the same history
Eventual consistency between the event log and its read models is acceptable
Example: Order management system where every state change is a domain event

Hybrid (common in practice):

Core business logic uses event sourcing (domain events in EventStoreDB or Kafka)
Infrastructure integration uses CDC where needed
Materialized views (read models) derived from event stream

Key decision: “Is the database or the event log the primary source of truth?”

Quick Facts

What does Kafka consumer lag measure and why does it matter?
?

Consumer lag: Difference between a partition’s log end offset (latest produced message) and the consumer group’s last committed offset
- Lag = 0: consumer is up-to-date
- Lag = 10,000: consumer is 10,000 messages behind
Why it matters:
- High lag = stream processing is not keeping up with incoming data
- For real-time systems: lag translates directly to latency (processing events from 5 min ago)
- Unbounded lag growth = consumer will never catch up → need more resources
Monitoring:
- Alert when lag exceeds a threshold (e.g., 1 million messages or 5 minutes of lag)
- Track lag per partition (one slow or hot partition can fall far behind while the group’s total looks healthy)
- Metrics: consumer-group lag gauge in Prometheus/Grafana (metric name depends on the exporter, e.g. kafka_consumergroup_lag in kafka_exporter)
Causes of growing lag:
- Consumer too slow: increase parallelism (add consumer instances, up to the number of partitions; beyond that, add partitions)
- Processing too expensive: optimize code or reduce work per event
- Spiky input: auto-scale consumer instances; provision for peak load
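
The lag arithmetic can be sketched per partition. The offsets below are invented sample values, and treating a missing committed offset as 0 is an assumption of this sketch.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log end offset - last committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


log_end = {0: 1500, 1: 1480, 2: 90_000}
committed = {0: 1500, 1: 1400, 2: 10_000}

lag = partition_lag(log_end, committed)
total_lag = sum(lag.values())
worst_partition = max(lag, key=lag.get)
```

Here almost all of the lag sits in partition 2, which a group-level total alone would not reveal.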

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI
