Chapter 11 Flashcards - Stream Processing
Basic Concepts
What is stream processing and how does it differ from batch processing?
?
-
Batch processing: Processes bounded (finite) dataset; waits for all data to arrive; processes in bulk
- Input: complete dataset; Output: complete result
- Latency: minutes to hours
- Example: Nightly job to compute yesterday’s DAU
-
Stream processing: Processes unbounded (infinite) continuous stream of events
- Input: events arriving continuously; Output: continuously updated results
- Latency: milliseconds to seconds
- Example: Real-time fraud detection on every card transaction
-
Relationship: Stream processing is batch processing where the batch is never “complete”
- Same operators (map, filter, join, aggregate)
- Different time semantics (windows, watermarks)
- Different fault tolerance (checkpointing + replay vs rerun from scratch)
What are the key differences between a message queue and a message log?
?
Message queue (RabbitMQ, SQS, ActiveMQ):
- Push model: Broker pushes messages to consumers
- Message deleted after consumed by one consumer
- Competing consumers: Multiple consumers share the load (each message consumed once)
- No replay: Can’t re-read consumed messages
- Good for: task queues, job dispatch, work distribution
Message log (Kafka, Kinesis, Pulsar):
- Pull model: Consumers pull and track their own offset
- Messages retained on disk (configurable: 7 days, forever, etc.)
- Fan-out: Multiple independent consumer groups each read all messages
- Replay: Consumer can re-read from any offset
- Good for: event streams, CDC, audit log, stream processing, data integration
Messaging and CDC
What is Change Data Capture (CDC) and why is it better than dual writes?
?
-
CDC: Stream changes from a database’s internal replication log (binlog/WAL) to a message queue
-
Dual-write problem: App writes to DB, then to Kafka separately → race condition, partial failure
- If app crashes between DB write and Kafka write → DB and Kafka out of sync
- No atomic guarantee across two systems
-
CDC solution:
- App writes ONLY to the DB (single source of truth)
- CDC tool (Debezium) reads DB’s replication log → emits events to Kafka
- All derived systems (Elasticsearch, Redis, Snowflake) consume from Kafka
-
Guarantee: DB and Kafka are always consistent (Kafka is derived from DB, not separately written)
-
Replay: Can rebuild any derived system from the beginning of the CDC log
-
Tools: Debezium (open-source, MySQL/PostgreSQL/MongoDB), AWS DMS, GoldenGate
What is event sourcing and how does it differ from CDC?
?
-
Traditional storage: Store current state (mutable rows); history lost on update
-
Event sourcing: Store immutable sequence of events; current state = fold/reduce over events
-
Example:
- Traditional:
accountstable withbalance=900(the $100 withdrawal is gone) - Event sourcing: Event log:
[AccountOpened{balance=1000}, MoneyWithdrawn{amount=100}] - Current balance = apply all events = 900
- Traditional:
-
Differences from CDC:
- CDC: Low-level DB changes (INSERT/UPDATE/DELETE at SQL level)
- Event sourcing: High-level domain events (“OrderPlaced”, “ItemShipped”)
- CDC is infrastructure; Event sourcing is an architectural pattern
-
Advantages of event sourcing:
- Complete audit trail
- Time travel: replay to any point in history
- Multiple projections: derive different views from same event log
- Append-only log: no concurrent update conflicts
Event Time and Watermarks
What is the difference between event time and processing time in stream processing?
?
-
Event time: When the event actually occurred (timestamp embedded in the event payload)
- Example: User clicked “buy” at 10:00:00 UTC
- Accurate; reflects real-world timing
-
Processing time: When the stream processor receives/processes the event
- Example: Stream processor processed the event at 10:00:07 UTC (7 second delay)
- Can be much later than event time (network delay, mobile offline sync)
-
When they diverge:
- Mobile apps buffer events offline; send in bulk hours later
- Network congestion; events queued in Kafka for minutes
- Multiple sources with different latencies
-
Best practice:
- Event time for business metrics (correct, independent of processing delay)
- Processing time for operational metrics (SLA monitoring, system health)
-
Example problem: “Events in the 10:00 window” — use event time or wrong events included
What is a watermark in stream processing and what are its limitations?
?
-
Watermark: An estimate of the current event time progress — “I believe all events with timestamp < W have arrived”
-
Purpose: Tell the stream processor when it’s safe to close a time window and emit results
-
How calculated:
- Track maximum event timestamp seen in the stream
- Subtract expected maximum late-arrival delay:
watermark = max_event_time - max_lateness - Example: if max lateness = 2 min, and latest event was at 10:05 → watermark = 10:03
-
Window closure: When watermark passes window’s end time → window results emitted
-
Limitations:
- Heuristic: Can’t know for certain if events will still arrive
- Trade-off: Low watermark lag = correct but high latency; High watermark lag = fast but may miss late events
- Adversarial events: One very delayed event can keep watermark from advancing (need max-lateness cap)
- Different sources: Multiple partitions may have different lags; watermark = min across all
Windows
What are the three main window types in stream processing?
?
1. Tumbling window:
- Fixed-size, non-overlapping windows
- Each event belongs to exactly one window
- Example: Count clicks per 1-minute period (10:00-10:01, 10:01-10:02, …)
- Use case: Hourly/daily aggregates, rate limiting
2. Sliding window:
- Overlapping windows of fixed size, advancing by slide interval
- Each event may appear in multiple windows
- Example: “Last 10 minutes” metric updated every 1 minute
- Windows: [10:00-10:10], [10:01-10:11], [10:02-10:12], …
- Use case: Moving averages, anomaly detection
3. Session window:
- Variable-size window; ends after a gap in activity
- Groups events from the same user/key until they’re inactive for gap duration
- Example: User session ends after 30 minutes of inactivity
- Use case: User session analytics, activity grouping
Stream Processing Patterns
What are the three types of stream joins?
?
1. Stream-table join (enrichment):
- Enrich stream events with reference data from a table
- Table: relatively static (user profiles, product catalog)
- Implementation: cache table in memory; join at processing time
- CDC keeps table fresh: Subscribe to table’s CDC stream to update local cache
- Example: Enrich “order placed” event with user’s VIP status
2. Stream-stream join (event correlation):
- Correlate events from two different streams within a time window
- Both streams buffered in state store; match on key + time proximity
- State grows: must expire old unmatched events
- Example: Match “search query” events with “click” events within 30 minutes
3. Table-table join (incremental materialized view):
- Both inputs are change streams from databases (CDC)
- Maintains join as a continuously updated materialized view
- Example: Continuously join
ordersCDC +customersCDC → enriched orders view
Fault Tolerance
What does “exactly-once” mean in stream processing and how is it achieved?
?
-
Exactly-once: Every input event contributes to the output exactly once, even if failures occur and processing is retried
-
Why hard: If processor crashes and replays from checkpoint → events between checkpoint and crash are reprocessed → output may be duplicated
-
Three approaches:
-
Idempotent writes (simplest):
- Process event → write output with unique event_id
- If reprocessed: same event_id → deduplicate at sink → same result
- Works when output is idempotent (setting a value) or deduplicatable
-
Kafka Transactions (for Kafka-to-Kafka):
- Atomically write output records + commit input offsets in one transaction
- Consumer sees both or neither (atomic)
- Built into Kafka producer API
-
Flink Chandy-Lamport Checkpointing:
- Stream special “barrier” markers alongside data
- Every operator takes consistent snapshot when barrier passes through
- Checkpoint = state snapshot + input offset at that point
- On failure: rollback state + replay input from checkpoint’s offset
How does Flink’s Chandy-Lamport checkpointing achieve exactly-once semantics?
?
-
Algorithm: Adapted Chandy-Lamport distributed snapshot algorithm
-
Process:
- JobManager triggers checkpoint; sends barrier message to all sources
- Source writes current input offset to checkpoint; forwards barrier downstream
- Each operator: when barrier arrives on all inputs:
- Snapshot its local state (to durable storage: S3/HDFS)
- Forward barrier to its outputs
- Sink: when barrier arrives → register with checkpoint
- JobManager: once all operators confirmed → checkpoint complete
-
On failure:
- Roll back ALL operators to their last checkpoint state
- Sources re-read input from the offset recorded in that checkpoint
- Processing resumes from checkpoint; events between checkpoint and failure are replayed
- Sinks must accept duplicate writes idempotently OR use 2PC with sink
-
Result: Input offsets + operator state form a consistent snapshot; failure recovery is safe
Modern Context (2026)
How is the streaming lakehouse changing data architecture in 2026?
?
-
Traditional: Separate systems for streaming (Kafka) + batch analytics (Snowflake/Spark)
- Two data copies, ETL jobs to move between them
- Analytics always lagged behind real-time by hours
-
Streaming Lakehouse (2026 approach):
- Flink → Apache Iceberg: Stream processor writes directly to Iceberg tables on S3
- ACID streaming writes; immediately queryable
- Spark Structured Streaming → Delta Lake: Micro-batch streaming to Delta tables
- Single data store: Same Iceberg/Delta tables read by both streaming and batch consumers
- Flink → Apache Iceberg: Stream processor writes directly to Iceberg tables on S3
-
Benefits:
- Near-real-time analytics (seconds lag) without separate real-time DB
- One data format (Parquet + transaction log) for all consumers
- Streaming and batch jobs use same format; no separate ETL
-
Use case example: Flink writes user events to Iceberg; Trino/DuckDB queries events within minutes of occurrence; same table used for overnight batch aggregation
What are feature stores and how do they connect stream processing with ML?
?
-
Feature store: Centralized storage for ML features used in model training and inference
- Offline feature store: Historical features for training (Parquet files, data warehouse)
- Online feature store: Low-latency feature serving for real-time inference (Redis, DynamoDB)
-
Stream processing’s role:
- Events arrive in Kafka (user actions, transactions, etc.)
- Stream processor (Flink) computes real-time features
- Example: “user’s last 10 purchases”, “average click-through rate last hour”
- Features written to online store (Redis) for low-latency serving
- Features also written to offline store (S3) for model retraining
-
Why important: Without feature store:
- Training features computed differently from serving features → training-serving skew
- Real-time features not available without implementing separately in each model
-
Products: Feast (open-source), Tecton, Hopsworks, Databricks Feature Store, AWS SageMaker Feature Store
Interview Scenarios
Design a real-time fraud detection system for credit card transactions.
?
Event flow: Transaction → Kafka → Fraud detector → Allow/Deny
Stream processing architecture (Apache Flink):
Features (stateful aggregations over event time):
- Last 1-hour spending velocity per card
- Count of transactions per merchant per hour
- Count of failed attempts per card per day
- Geographic distance between consecutive transactions
- Average transaction amount deviation
State management:
- Per-card state: rolling window aggregates (RocksDB in Flink)
- State TTL: expire stale card state after 30 days of inactivity
Stream-table join:
- Enrich transaction with cardholder profile (risk score, usual spending patterns)
- Table updated via CDC from customer database
Decision:
- Score transaction based on features
- If score > threshold: decline immediately (synchronous response needed → very low latency)
- Alternatively: approve + flag for async review (async fraud detection)
Latency target: <100ms for real-time scoring (within payment auth flow)
Fault tolerance: Flink with exactly-once checkpointing; checkpoint every 30s
When should you use CDC vs event sourcing for data integration across services?
?
Use CDC when:
- You have an existing database-first application
- You need to keep derived systems (search, cache, analytics) in sync with the DB
- You don’t want to change application code significantly
- Database is the source of truth; other systems are consumers
- Example: Sync MySQL users table → Elasticsearch for user search
Use event sourcing when:
- Building a new system from scratch (or major rewrite)
- You need complete audit trail + time travel as core feature
- Business domain is naturally event-driven (orders, payments, reservations)
- You need multiple different “projections” from the same history
- Strong eventual consistency is acceptable
- Example: Order management system where every state change is a domain event
Hybrid (common in practice):
- Core business logic uses event sourcing (domain events in EventStoreDB or Kafka)
- Infrastructure integration uses CDC where needed
- Materialized views (read models) derived from event stream
Key decision: “Is the database or the event log the primary source of truth?”
Quick Facts
What does Kafka consumer lag measure and why does it matter?
?
-
Consumer lag: Difference between producer’s latest offset and consumer’s last committed offset
- Lag = 0: consumer is up-to-date
- Lag = 10,000: consumer is 10,000 messages behind
-
Why it matters:
- High lag = stream processing is not keeping up with incoming data
- For real-time systems: lag translates directly to latency (processing events from 5 min ago)
- Unbounded lag growth = consumer will never catch up → need more resources
-
Monitoring:
- Alert when lag exceeds a threshold (e.g., 1 million messages or 5 minutes of lag)
- Track lag per partition (one slow partition can block the whole consumer group)
- Metrics:
kafka_consumer_group_lagin Prometheus/Grafana
-
Causes of growing lag:
- Consumer too slow: increase parallelism (add more consumer instances)
- Processing too expensive: optimize code or reduce work per event
- Spiky input: auto-scale consumer instances; provision for peak load
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13