Chapter 11 Flashcards — Batch Processing
flashcards ddia-2e chapter11 batch-processing
Core Concepts
What is batch processing, and how does it differ from stream processing and request/response?
?
Batch processing: Process a bounded (finite) dataset in bulk; job starts, processes all data, terminates.
- Latency: minutes to hours
- Triggered by scheduler (not user request)
- Input is complete before processing starts
Stream processing: Process an unbounded (infinite) stream of events as they arrive.
- Latency: milliseconds to seconds
- Never terminates (long-running job)
Request/response (OLTP): Handle one user request at a time.
- Latency: milliseconds
- Processes tiny amount of data per request
Batch’s core principle: Immutable inputs → deterministic transformations → derived outputs. If job fails: re-run from the same input.
What is the “shuffle” in distributed batch processing, and why is it expensive?
?
Shuffle: The network redistribution of records so that all records with the same key end up on the same worker.
Why it’s needed: Operations like groupBy, join, and distinct require all same-key records to be co-located on one worker.
Why it’s expensive:
- Network I/O: All data may cross the network (potentially all mappers to all reducers)
- Sort: Records sorted by key before grouping
- Disk spill: Intermediate data written to disk when RAM exhausted
- Global barrier: Next stage cannot start until ALL upstream workers finish
Minimizing shuffle:
- Broadcast join: Replicate small table; no shuffle for large side
- Combiner/partial aggregation: Pre-aggregate locally before shuffle (works for associative, commutative operations such as sum, count, max)
- Co-partitioning: Pre-partition both inputs by the same key
- Salting: Split skewed key across multiple workers
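A minimal single-process sketch of the shuffle and the combiner optimization described above. The function names (`partition_for`, `shuffle`, `combine_locally`) and the two-mapper input are hypothetical, and `zlib.crc32` stands in for whatever partitioner a real framework uses.

```python
import zlib
from collections import Counter, defaultdict

def partition_for(key: str, num_reducers: int) -> int:
    # Stable hash so the same key always maps to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) record to the reducer that owns its key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    records_sent = 0
    for records in mapper_outputs:
        for key, value in records:
            reducers[partition_for(key, num_reducers)][key].append(value)
            records_sent += 1
    return reducers, records_sent

def combine_locally(records):
    """Combiner: pre-sum counts per key on the mapper before the shuffle."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return list(partial.items())

# Two hypothetical mappers, each emitting (event_type, 1) per event.
mapper_outputs = [
    [("view", 1), ("view", 1), ("click", 1), ("view", 1)],
    [("click", 1), ("view", 1), ("view", 1)],
]

_, sent_raw = shuffle(mapper_outputs, num_reducers=2)
reducers, sent_combined = shuffle(
    [combine_locally(r) for r in mapper_outputs], num_reducers=2
)

# Reduce step: each reducer sums the values it received for each key.
totals = {k: sum(v) for r in reducers for k, v in r.items()}
print(sent_raw, sent_combined, totals)
```

With the combiner, each mapper sends at most one record per distinct key, so the number of shuffled records depends on key cardinality rather than on event volume.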
What are the three main batch join strategies and when do you use each?
?
1. Broadcast Hash Join (map-side join):
- Load small table into RAM hash table on every worker
- Large table processed locally against hash table
- No shuffle for large side
- Use when: one side fits in each worker's RAM (rule of thumb: under roughly 1–10 GB, depending on worker memory)
- Spark:
large_df.join(broadcast(small_df), on="key")
2. Sort-Merge Join (reduce-side join):
- Both inputs shuffled and sorted by join key
- Merge two sorted streams in one pass
- Works for any input size (large × large)
- Use when: both sides are large; no broadcast option
3. Partitioned Hash Join (bucketed join):
- Both inputs pre-bucketed by the same key
- Each worker joins its local bucket pair (no cross-network shuffle)
- Use when: datasets are pre-bucketed and joined repeatedly
- Setup cost: one-time bucketing; then all joins are fast
Rule: Broadcast join whenever possible; sort-merge join for large × large.
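The first two strategies can be sketched in plain Python on a single machine. This is an illustration of the algorithms only, not how Spark implements them; `events` and `users` are hypothetical inputs of `(key, value)` tuples.

```python
def broadcast_hash_join(large, small):
    """Inner join: `small` is loaded into a hash table, `large` is streamed."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    for key, value in large:
        for match in lookup.get(key, []):
            yield key, value, match

def sort_merge_join(left, right):
    """Inner join: sort both sides by key, then merge them in one pass."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on each side, emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    yield lkey, lval, rval
            i, j = i_end, j_end

events = [(2, "click"), (1, "view"), (2, "view"), (3, "view")]
users = [(1, "alice"), (2, "bob")]

broadcast_result = sorted(broadcast_hash_join(events, users))
merge_result = sorted(sort_merge_join(events, users))
print(broadcast_result == merge_result)
```

Both functions produce the same rows. The difference is cost: the hash join never reorders the large side, while the merge join pays for a sort (and, in a cluster, a shuffle) on both sides.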
DataFrame API and Query Languages
What is the difference between the RDD API, DataFrame API, and SQL in Spark? When do you use each?
?
RDD API (low-level):
- Direct manipulation of distributed collections; no schema
- No query optimizer; code is verbose
- Use when: custom operators; exact execution control needed
- Almost never used for new code in 2026
DataFrame API:
df.groupBy("event_type").agg(count("*").alias("n")).orderBy("n", ascending=False)
- Schema-aware; lazy (builds logical plan)
- Catalyst optimizer: predicate pushdown, column pruning, join reordering
- Use when: data engineers, ML engineers; complex transformations
SQL (Spark SQL / Flink SQL / dbt):
SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type ORDER BY n DESC
- Highest-level abstraction; most optimizer opportunities
- CBO (Cost-Based Optimizer) chooses join strategy using statistics
- Use when: analysts, BI engineers, dbt models, standard analytics
Key insight: DataFrame and SQL both compile to the same Catalyst logical plan — same performance; different authoring style.
What are narrow vs wide transformations in Spark, and why does the distinction matter?
?
Narrow transformations (no shuffle):
- Each output partition depends on only one input partition
- Can be pipelined together (fused) with no network transfer
- Examples:
select, filter, withColumn, map, flatMap, union
Wide transformations (require shuffle):
- Output partitions may depend on many input partitions
- Force a shuffle stage → data redistributed across network
- Examples:
groupBy, join, distinct, repartition, sort/orderBy
Why it matters:
- Shuffle is the dominant cost in batch jobs
- Minimizing wide transformations = minimizing job runtime
- Query optimizer fuses adjacent narrow transformations into single stage
- Each wide transformation = new “stage” in Spark DAG (visible in Spark UI)
Practical tip: Chain as many narrow transformations as possible before a wide one; let Catalyst fuse them.
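A rough single-process sketch of the distinction, using Python generators to stand in for pipelined narrow transformations. The partition contents and the function names `narrow_stage` and `wide_stage` are hypothetical.

```python
from collections import defaultdict

# Two hypothetical input partitions of latency events.
partitions = [
    [{"type": "view", "ms": 120}, {"type": "click", "ms": 30}],
    [{"type": "view", "ms": 80}, {"type": "view", "ms": 200}],
]

def narrow_stage(partition):
    # filter + map fused into one lazy pass; touches only this partition.
    slow = (e for e in partition if e["ms"] >= 100)
    return ((e["type"], e["ms"] / 1000) for e in slow)

def wide_stage(mapped_partitions):
    # groupBy needs every record for a key, so it must read all partitions.
    groups = defaultdict(list)
    for part in mapped_partitions:
        for key, seconds in part:
            groups[key].append(seconds)
    return dict(groups)

grouped = wide_stage(narrow_stage(p) for p in partitions)
print(grouped)
```

`narrow_stage` could run independently on each worker with no data exchange. `wide_stage` is the point where, in a cluster, records would have to cross the network.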
ETL and ELT
What is the difference between ETL and ELT? When do you use each?
?
ETL (Extract → Transform → Load):
- Transform data before loading into the warehouse
- Transform happens in external compute (Spark, Informatica)
- Warehouse receives clean, schema-enforced data
- Raw data may be discarded after transform
ELT (Extract → Load → Transform):
- Load raw data first into warehouse/lakehouse
- Transform inside the warehouse using SQL (dbt)
- Raw data preserved for re-transformation
- Standard pattern in 2026: Fivetran/Airbyte (ingest) → dbt (transform)
When to use ETL:
- Transforms require custom code (Python, Scala) not expressible in SQL
- Heavy preprocessing (ML feature engineering, NLP, image processing)
- Data too large to load raw (very unusual with modern cloud warehouses)
When to use ELT:
- Transformations are SQL-expressible (the majority of analytics)
- You want to preserve raw data for future re-transformation
- You use a cloud warehouse (Snowflake, BigQuery, Databricks, Redshift)
- Analytics engineering with dbt (2026 standard)
What is dbt (data build tool) and what problem does it solve?
?
dbt = SQL-based transformation framework for data warehouses.
Problem it solves: Before dbt, SQL transformation pipelines were ad-hoc scripts with no version control, no tests, no documentation, no dependency management.
What dbt provides:
- Each model = one SQL SELECT statement → materializes as table or view
- DAG management: {{ ref('model_name') }} creates dependency; dbt builds in order
- Tests: Built-in assertions (not_null, unique, accepted_values, relationships for referential integrity)
- Documentation: Auto-generates data catalog from model definitions
- Incremental models: Only process new data (not full recompute)
- Version control: SQL models live in git → reviewable, auditable
Incremental pattern:
{{ config(materialized='incremental', unique_key='event_id') }}
SELECT * FROM raw_events
{% if is_incremental() %}
WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
2026 status: Industry standard for analytics engineering; 50,000+ organizations; integrates with Snowflake, BigQuery, Databricks, DuckDB.
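The incremental pattern above boils down to a high-water-mark filter plus an upsert on the unique key. A small sketch of that logic using Python's built-in `sqlite3`, not dbt itself: the target table `events_model` and the function `run_incremental` are hypothetical names, and `INSERT OR REPLACE` is only a stand-in for whatever merge strategy the warehouse adapter uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (event_id INTEGER PRIMARY KEY, occurred_at TEXT);
    CREATE TABLE events_model (event_id INTEGER PRIMARY KEY, occurred_at TEXT);
""")

def run_incremental(conn):
    """Copy only rows newer than the target's current high-water mark."""
    (watermark,) = conn.execute(
        "SELECT MAX(occurred_at) FROM events_model"
    ).fetchone()
    if watermark is None:
        # First run: the target is empty, so this behaves like a full refresh.
        rows = conn.execute("SELECT event_id, occurred_at FROM raw_events")
    else:
        rows = conn.execute(
            "SELECT event_id, occurred_at FROM raw_events WHERE occurred_at > ?",
            (watermark,),
        )
    batch = rows.fetchall()
    # INSERT OR REPLACE plays the role of unique_key='event_id'.
    conn.executemany("INSERT OR REPLACE INTO events_model VALUES (?, ?)", batch)
    return len(batch)

conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, "2026-01-01T00:00:00"), (2, "2026-01-02T00:00:00")],
)
first_run = run_incremental(conn)   # processes both existing rows

conn.execute("INSERT INTO raw_events VALUES (3, '2026-01-03T00:00:00')")
second_run = run_incremental(conn)  # processes only the new row
print(first_run, second_run)
```

One caveat of the strict `>` filter: an event that arrives late with an `occurred_at` at or below the current maximum is skipped, so pipelines with late data usually filter on a lookback window or a load timestamp instead.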
Machine Learning Pipelines
Why must you use a temporal (time-based) split instead of a random split when evaluating ML models on time-series data?
?
Random split problem (data leakage):
- A random split mixes events from all time periods across train and test
- A feature computed at time T might depend on events that happened AFTER T
- The model “learns” from future information it won’t have in production
- Result: inflated offline metrics that don’t match production performance
Temporal split (correct approach):
Timeline: ─────────────────────────────────────────────►
                            │
[───────── TRAIN ───────────│──── TEST ────]
 All events before cutoff cutoff  Events after cutoff
Examples of leakage via random split:
- Feature: “user’s total purchases in the next 7 days” (literally future)
- Feature: “user’s average rating” computed from all their ratings (includes future ones)
- Target: “did user churn?” where future data tells you who stayed
Rule: Train on the past; test on the future. Always mirror how the model will be used in production.
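A minimal sketch of a temporal split and a point-in-time feature, assuming each event is a dict with a `ts` timestamp. The helper names `temporal_split` and `purchases_before` and the sample events are hypothetical.

```python
from datetime import datetime

def temporal_split(events, cutoff):
    """Train on events strictly before `cutoff`, test on the rest."""
    train = [e for e in events if e["ts"] < cutoff]
    test = [e for e in events if e["ts"] >= cutoff]
    return train, test

def purchases_before(events, user_id, as_of):
    """Point-in-time feature: count only purchases visible at `as_of`."""
    return sum(1 for e in events if e["user_id"] == user_id and e["ts"] < as_of)

# Hypothetical purchase events.
events = [
    {"user_id": 1, "ts": datetime(2026, 1, 5)},
    {"user_id": 1, "ts": datetime(2026, 2, 10)},
    {"user_id": 2, "ts": datetime(2026, 3, 1)},
    {"user_id": 1, "ts": datetime(2026, 3, 20)},
]
cutoff = datetime(2026, 3, 1)

train, test = temporal_split(events, cutoff)
feature = purchases_before(events, user_id=1, as_of=cutoff)
print(len(train), len(test), feature)
```

Counting all of user 1's purchases would give 3 and leak the March 20 event into a training feature. Restricting the feature to `ts < as_of` gives 2, which is what the model would actually see at the cutoff.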
What is the difference between batch scoring and online inference? When do you use each?
?
Batch scoring (offline, pre-computed):
- Run a batch job (Spark) to score all users/items overnight
- Write predictions to a fast lookup store (Redis, DynamoDB, Cassandra)
- At request time: look up pre-computed prediction from cache
- Latency at serving: <1ms (cache lookup)
- Staleness: predictions are hours old (until next batch run)
Online inference (real-time):
- Model server (TorchServe, BentoML, Triton) loads model in memory
- At request time: compute features → call model → return prediction
- Latency at serving: 10–200ms (feature fetch + model compute)
- Freshness: uses latest features + model
When to use batch scoring:
- Predictions needed for all users/items (recommendations, ranking)
- Serving latency is the binding requirement (<1ms lookup); freshness is secondary
- Model is large; inference is expensive
- Features are not request-dependent
When to use online inference:
- Prediction needs request-time context (fraud detection: transaction details)
- Freshness matters (recent user actions must affect prediction)
- Personalization that varies per request
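A toy sketch of the two serving paths. The `model` function is a hypothetical stand-in for a trained model, and a plain dict stands in for the lookup store (Redis, DynamoDB, Cassandra).

```python
def model(features):
    # Hypothetical stand-in for a trained model: a fixed linear score.
    return 0.1 * features["purchases"] + 0.5 * features.get("cart_value", 0.0)

def batch_score(user_features):
    """Nightly job: score every user and write results to the lookup store."""
    return {user_id: model(f) for user_id, f in user_features.items()}

def serve_batch(store, user_id, default=0.0):
    """Request time (batch scoring): a key lookup, no model call."""
    return store.get(user_id, default)

def serve_online(user_features, user_id, request_context):
    """Request time (online inference): merge request context, call the model."""
    features = {**user_features[user_id], **request_context}
    return model(features)

# Hypothetical stored features, refreshed by the batch pipeline.
user_features = {1: {"purchases": 10}, 2: {"purchases": 0}}

store = batch_score(user_features)
batch_prediction = serve_batch(store, 1)
online_prediction = serve_online(user_features, 1, {"cart_value": 4.0})
print(batch_prediction, online_prediction)
```

The batch path cannot see `cart_value` because that value only exists at request time, which is the case where online inference is needed.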
Architecture Patterns
What are the Lambda and Kappa architectures? Compare their trade-offs.
?
Lambda Architecture:
Raw data ──┬──► Batch layer (Spark, accurate, hours lag) ─┬──► Query
└──► Speed layer (Flink, approximate, seconds lag) ┘ (merge)
- Batch layer: Recomputes correct views over all history; hours of lag
- Speed layer: Handles recent data; approximate; eventually replaced by batch
- Serving layer: Merges results from both layers at query time
Lambda trade-offs:
- Pro: Batch provides accuracy; stream provides low latency
- Con: Two codebases (batch + stream) for same logic → divergence bugs
- Con: Complex merge logic at query time
Kappa Architecture:
Raw events ──► Single stream processor ──► Derived views
(handles real-time AND historical replay)
- Single codebase; replay from beginning of Kafka log for reprocessing
- Pro: Simpler; one system to maintain
- Con: Reprocessing may be slow; requires long log retention
2026 trend: Streaming Lakehouse (Flink → Iceberg/Delta on S3) is replacing both. Stream processing writes ACID tables queryable directly by BI tools without a separate serving layer.
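The Lambda merge step and the Kappa replay can be contrasted in a few lines. This is a simplified sketch: the events, the integer timestamps, and the function names are hypothetical, and the speed layer here is exact, whereas a real one is often approximate.

```python
from collections import Counter

# Hypothetical page-view events; `ts` is an integer timestamp for brevity.
events = [
    {"page": "/home", "ts": 1},
    {"page": "/home", "ts": 2},
    {"page": "/docs", "ts": 3},
    {"page": "/home", "ts": 4},
]
BATCH_CUTOFF = 3  # the last batch run covered everything with ts < 3

def batch_layer(events, cutoff):
    """Lambda batch layer: counts over all history up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] < cutoff)

def speed_layer(events, cutoff):
    """Lambda speed layer: counts only events the batch has not covered yet."""
    return Counter(e["page"] for e in events if e["ts"] >= cutoff)

def lambda_query(page, batch_view, speed_view):
    """Serving layer: merge both views at query time."""
    return batch_view[page] + speed_view[page]

def kappa_view(event_log):
    """Kappa: one code path; reprocessing is a replay of the whole log."""
    view = Counter()
    for e in event_log:
        view[e["page"]] += 1
    return view

batch_view = batch_layer(events, BATCH_CUTOFF)
speed_view = speed_layer(events, BATCH_CUTOFF)
home_lambda = lambda_query("/home", batch_view, speed_view)
home_kappa = kappa_view(events)["/home"]
print(home_lambda, home_kappa)
```

The two results agree here only because both layers apply the same counting rule. In a real Lambda deployment that rule lives in two codebases, which is where the divergence bugs come from.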
How does the Lakehouse architecture change the batch processing workflow?
?
Problem with data lake: Raw Parquet files → no ACID, no schema enforcement, slow concurrent writes
Problem with data warehouse: Expensive proprietary formats; hard to use ML tools on warehouse data
Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):
- Open format: Parquet + transaction log on object storage (S3, GCS)
- ACID transactions: Concurrent batch writers don’t corrupt data
- Time travel: Query data as of any past point (audit, rollback)
- Schema evolution: Add columns without full rewrite
- Streaming + batch unified: Same table written by Flink (streaming) and Spark (batch)
Batch workflow with Lakehouse:
1. Ingest: Fivetran/Debezium → Kafka → raw zone (Parquet/Iceberg on S3)
2. Transform: dbt or Spark → reads raw zone → writes refined zone (Delta/Iceberg)
3. Serve: Trino, DuckDB, Tableau → query refined zone directly
Key benefit: Batch output immediately queryable by any SQL tool without an export step. Time travel enables “re-run yesterday’s job” without keeping extra copies.
Modern Tools
What is Apache Beam and what problem does it solve?
?
Apache Beam: A portable pipeline programming model for both batch and stream processing.
Problem: Different batch/stream frameworks have different APIs. Writing a pipeline for Spark means rewriting it for Flink or Google Dataflow.
How Beam solves it:
- Write one pipeline using a Beam SDK (Java, Python, or Go)
- A runner translates the Beam pipeline into framework-specific execution
- Supported runners include Apache Flink, Apache Spark, and Google Cloud Dataflow
When to use Beam:
- Multi-cloud or multi-runner requirements
- Teams using Google Cloud Dataflow (Beam is its native API)
- Portability requirements (write once, run anywhere)
Trade-off: Beam abstraction is slightly less flexible than native Spark/Flink APIs; some framework-specific optimizations not accessible through Beam.
What is data skew in batch processing and how do you handle it with salting?
?
Data skew: One key has far more records than others → one worker processes most of the data while others are idle.
Example: Join on user_id where user_id=99 has 100M events (celebrity/bot account).
Detection: Look at Spark UI → one task takes 100x longer than others.
Salting solution (two-phase aggregation):
Step 1: Add random salt prefix to skewed key (0 to N-1)
Original: (user_id=99, event_type="view", count=1)
Salted: (user_id=99_7, event_type="view", count=1)
(user_id=99_3, event_type="view", count=1)
→ Distributes key 99 across N workers
Step 2: Partial aggregate within each salt bucket
Worker 7: SUM(count where user_id=99_7) = 12M
Worker 3: SUM(count where user_id=99_3) = 8M
Step 3: Final aggregate across salt buckets
SUM across all user_id=99_* = 100M
Trade-off: Two-pass aggregation adds a second reduce phase; worth it when skew ratio > 10x.
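Alternative for joins: Salt the skewed key on the large side and replicate (“explode”) the matching rows of the other side once per salt value; or split out the hot keys and join them separately with a broadcast join.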
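The two-phase salted aggregation can be sketched in plain Python. The function name `salted_count`, the salt count of 8, and the record volumes are assumptions chosen for illustration, and the seeded `random.Random` is only there to make the run repeatable.

```python
import random
from collections import Counter

def salted_count(records, num_salts, seed=0):
    """Two-phase aggregation: sum per (key, salt), then sum per key."""
    rng = random.Random(seed)
    # Phase 1: a random salt spreads each key over up to `num_salts` buckets,
    # so no single worker has to aggregate all records of a hot key.
    partial = Counter()
    for user_id, count in records:
        salt = rng.randrange(num_salts)
        partial[(user_id, salt)] += count
    # Phase 2: drop the salt and combine the partial sums.
    final = Counter()
    for (user_id, _salt), subtotal in partial.items():
        final[user_id] += subtotal
    return partial, final

# Hypothetical skewed input: user 99 dominates the dataset.
records = [(99, 1)] * 1000 + [(7, 1)] * 10

partial, final = salted_count(records, num_salts=8)
hot_buckets = [n for (user_id, _), n in partial.items() if user_id == 99]
print(len(hot_buckets), max(hot_buckets), final[99])
```

Phase 2 only has to combine at most `num_salts` partial sums per key, so the second reduce is cheap compared with the work it spreads out in phase 1.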
Comparisons and Trade-offs
Compare Spark, Flink (batch mode), dbt, and DuckDB for batch processing.
?
| Framework | Scale | API | Lazy? | Shuffle | Best for |
|---|---|---|---|---|---|
| Spark | Cluster, 100GB–PB | DataFrame/SQL/Python | Yes | Memory+disk | General batch; ML; ETL at scale |
| Flink batch | Cluster, 100GB–PB | Table API/SQL | Yes | Pipelined | Unified batch+stream; low-latency |
| dbt | Warehouse-scale | SQL only | Via SQL | None (SQL engine handles) | ELT transforms inside warehouse |
| DuckDB | Single machine, <500GB | SQL + DataFrame | Via SQL | N/A | Local analytics; notebook; fast SQL |
Decision guide:
- Need Python/distributed cluster → Spark
- Already using Flink for streaming → Flink batch (same runtime)
- All transforms expressible in SQL, have a warehouse → dbt
- Single machine, <500GB, speed matters → DuckDB
- Multi-cloud portability → Apache Beam (on top of Flink or Spark)
What is the Catalyst optimizer in Spark and what optimizations does it apply?
?
Catalyst: Spark’s query optimizer; converts a logical plan into an optimized physical plan.
Optimization phases:
- Analysis: Resolve column names, types; validate schema
- Logical optimization: Apply rule-based rewrites
  - Predicate pushdown: Push WHERE filters as close to source as possible (read less data)
  - Column pruning: Only read columns referenced in query
  - Constant folding: Evaluate constant expressions at plan time
  - Null propagation: Simplify expressions with known nulls
- Physical planning: Choose physical operators
  - Join strategy: Broadcast hash join vs sort-merge join based on table size
  - Partition pruning: Skip partitions that can’t match filter predicates
- Code generation: Generate JVM bytecode for tight CPU execution loops (Tungsten)
Why DataFrames outperform RDDs:
- RDDs are opaque functions; optimizer cannot see inside
- DataFrames have a schema → optimizer can apply all the above rules
- DataFrames automatically get broadcast join detection, predicate pushdown, column pruning
AQE (Adaptive Query Execution, Spark 3.x+): Applies optimizations at runtime based on actual data statistics observed during execution (e.g., dynamically changes join strategy mid-job).
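To make "rule-based rewrite" concrete, here is a toy constant-folding rule over a tiny expression tree. This is not Catalyst code: the tuple encoding (`"lit"`, `"col"`, `"add"`, `"mul"`) and the function `fold_constants` are invented for this sketch.

```python
# Operators this toy optimizer knows how to evaluate at plan time.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def fold_constants(expr):
    """Replace any operator whose inputs are both literals with its result."""
    op = expr[0]
    if op in ("lit", "col"):
        return expr  # leaves: nothing to fold
    left = fold_constants(expr[1])
    right = fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", OPS[op](left[1], right[1]))
    return (op, left, right)

# price * (1 + 2): the inner addition does not depend on any row.
plan = ("mul", ("col", "price"), ("add", ("lit", 1), ("lit", 2)))
optimized = fold_constants(plan)
print(optimized)
```

The rewrite is only possible because the expression is data the optimizer can inspect. An opaque function passed to an RDD `map` gives it nothing to rewrite.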
Serving Derived Data
What is derived data and why does immutability matter for batch processing?
?
Derived data: Any dataset that can be computed from an authoritative (immutable) source.
- Examples: search indexes, recommendation scores, aggregated metrics, ML model predictions, materialized views, caches
The immutability principle:
- Source data (user events, transactions) is never modified
- Batch jobs read from immutable source → write to new derived output location
- If derived data is corrupted or deleted: just re-run the batch job
- No need for complex recovery procedures
Why this matters:
- Fault tolerance: Job failure = re-run from unchanged input
- Debugging: Can inspect any intermediate step (nothing was modified)
- Historical reprocessing: Change business logic → reprocess all history → new derived view
- A/B testing logic: Run old + new logic on same input; compare outputs before deploying
- Auditability: Derived data can always be traced to its source
Contrast with mutable state: If a batch job modifies its input (anti-pattern), re-running it produces different results → non-deterministic, non-reproducible.
What are the stages of a complete batch ML pipeline, and what are the key pitfalls at each stage?
?
Stage 1: Feature Engineering
- Transform raw events into numerical feature vectors
- Pitfall: Data leakage — features that encode future information
- Pitfall: Training-serving skew — feature computed differently in batch vs online
Stage 2: Temporal Train/Test Split
- Split at a time cutoff, not randomly
- Pitfall: Random split leaks future events into training set
- Pitfall: Cutoff too recent — insufficient test data
Stage 3: Model Training
- Spark MLlib for distributed tree models; PyTorch DDP for deep learning
- Pitfall: Wrong metric — optimizing accuracy on imbalanced dataset
- Pitfall: Overfitting to train period — model doesn’t generalize to new time periods
Stage 4: Evaluation
- Evaluate on temporal holdout; slice by cohort/geography/platform
- Pitfall: Aggregate metrics only — model may fail badly on important subgroups
Stage 5: Serving
- Batch scoring: write predictions to cache (Redis); serve at <1ms
- Online inference: compute at request time from fresh features
- Pitfall: Stale predictions — batch scoring without fresh feature refresh
- Pitfall: Serving-training skew — model server uses different preprocessing than training
Total Cards: 17
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29