Chapter 11 Flashcards — Batch Processing

flashcards ddia-2e chapter11 batch-processing

Core Concepts

What is batch processing, and how does it differ from stream processing and request/response?
?
Batch processing: Process a bounded (finite) dataset in bulk; job starts, processes all data, terminates.

Latency: minutes to hours
Triggered by scheduler (not user request)
Input is complete before processing starts

Stream processing: Process an unbounded (infinite) stream of events as they arrive.

Latency: milliseconds to seconds
Never terminates (long-running job)

Request/response (OLTP): Handle one user request at a time.

Latency: milliseconds
Processes tiny amount of data per request

Batch’s core principle: Immutable inputs → deterministic transformations → derived outputs. If job fails: re-run from the same input.

What is the “shuffle” in distributed batch processing, and why is it expensive?
?
Shuffle: The network redistribution of records so that all records with the same key end up on the same worker.

Why it’s needed: Operations like groupBy, join, and distinct require all same-key records to be co-located on one worker.

Why it’s expensive:

Network I/O: All data may cross the network (potentially all mappers to all reducers)
Sort: Records sorted by key before grouping
Disk spill: Intermediate data written to disk when RAM exhausted
Global barrier: Next stage cannot start until ALL upstream workers finish

Minimizing shuffle:

Broadcast join: Replicate small table; no shuffle for large side
Combiner/partial aggregation: Pre-aggregate locally before shuffle (works for sum, count, max)
Co-partitioning: Pre-partition both inputs by the same key
Salting: Split skewed key across multiple workers

What are the three main batch join strategies and when do you use each?
?
1. Broadcast Hash Join (map-side join):

Load small table into RAM hash table on every worker
Large table processed locally against hash table
No shuffle for large side
Use when: one side fits in RAM (typically < 1GB to 10GB)
Spark: large_df.join(broadcast(small_df), on="key")

2. Sort-Merge Join (reduce-side join):

Both inputs shuffled and sorted by join key
Merge two sorted streams in one pass
Works for any input size (large × large)
Use when: both sides are large; no broadcast option

3. Partitioned Hash Join (bucketed join):

Both inputs pre-bucketed by the same key
Each worker joins its local bucket pair (no cross-network shuffle)
Use when: datasets are pre-bucketed and joined repeatedly
Setup cost: one-time bucketing; then all joins are fast

Rule: Broadcast join whenever possible; sort-merge join for large × large.

DataFrame API and Query Languages

What is the difference between the RDD API, DataFrame API, and SQL in Spark? When do you use each?
?
RDD API (low-level):

Direct manipulation of distributed collections; no schema
No query optimizer; code is verbose
Use when: custom operators; exact execution control needed
Almost never used for new code in 2026

DataFrame API:

df.groupBy("event_type").agg(count("*").alias("n")).orderBy("n", ascending=False)

Schema-aware; lazy (builds logical plan)
Catalyst optimizer: predicate pushdown, column pruning, join reordering
Use when: data engineers, ML engineers; complex transformations

SQL (Spark SQL / Flink SQL / dbt):

SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type ORDER BY n DESC

Highest-level abstraction; most optimizer opportunities
CBO (Cost-Based Optimizer) chooses join strategy using statistics
Use when: analysts, BI engineers, dbt models, standard analytics

Key insight: DataFrame and SQL both compile to the same Catalyst logical plan — same performance; different authoring style.

What are narrow vs wide transformations in Spark, and why does the distinction matter?
?
Narrow transformations (no shuffle):

Each output partition depends on only one input partition
Can be pipelined together (fused) with no network transfer
Examples: select, filter, withColumn, map, flatMap, union

Wide transformations (require shuffle):

Output partitions may depend on many input partitions
Force a shuffle stage → data redistributed across network
Examples: groupBy, join, distinct, repartition, sort/orderBy

Why it matters:

Shuffle is the dominant cost in batch jobs
Minimizing wide transformations = minimizing job runtime
Query optimizer fuses adjacent narrow transformations into single stage
Each wide transformation = new “stage” in Spark DAG (visible in Spark UI)

Practical tip: Chain as many narrow transformations as possible before a wide one; let Catalyst fuse them.

ETL and ELT

What is the difference between ETL and ELT? When do you use each?
?
ETL (Extract → Transform → Load):

Transform data before loading into the warehouse
Transform happens in external compute (Spark, Informatica)
Warehouse receives clean, schema-enforced data
Raw data may be discarded after transform

ELT (Extract → Load → Transform):

Load raw data first into warehouse/lakehouse
Transform inside the warehouse using SQL (dbt)
Raw data preserved for re-transformation
Standard pattern in 2026: Fivetran/Airbyte (ingest) → dbt (transform)

When to use ETL:

Transforms require custom code (Python, Scala) not expressible in SQL
Heavy preprocessing (ML feature engineering, NLP, image processing)
Data too large to load raw (very unusual with modern cloud warehouses)

When to use ELT:

Transformations are SQL-expressible (the majority of analytics)
You want to preserve raw data for future re-transformation
You use a cloud warehouse (Snowflake, BigQuery, Databricks, Redshift)
Analytics engineering with dbt (2026 standard)

What is dbt (data build tool) and what problem does it solve?
?
dbt = SQL-based transformation framework for data warehouses.

Problem it solves: Before dbt, SQL transformation pipelines were ad-hoc scripts with no version control, no tests, no documentation, no dependency management.

What dbt provides:

Each model = one SQL SELECT statement → materializes as table or view
DAG management: {{ ref('model_name') }} creates dependency; dbt builds in order
Tests: Built-in assertions (not_null, unique, accepted_values, referential_integrity)
Documentation: Auto-generates data catalog from model definitions
Incremental models: Only process new data (not full recompute)
Version control: SQL models live in git → reviewable, auditable

Incremental pattern:

{{ config(materialized='incremental', unique_key='event_id') }}
SELECT * FROM raw_events
{% if is_incremental() %}
WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}

2026 status: Industry standard for analytics engineering; 50,000+ organizations; integrates with Snowflake, BigQuery, Databricks, DuckDB.

Machine Learning Pipelines

Why must you use a temporal (time-based) split instead of a random split when evaluating ML models on time-series data?
?
Random split problem (data leakage):

A random split mixes events from all time periods across train and test
A feature computed at time T might depend on events that happened AFTER T
The model “learns” from future information it won’t have in production
Result: inflated offline metrics that don’t match production performance

Temporal split (correct approach):

Timeline: ─────────────────────────────────────────────►
                              │
  [───────── TRAIN ───────────│──── TEST ────]
  All events before cutoff   cutoff  Events after cutoff

Examples of leakage via random split:

Feature: “user’s total purchases in the next 7 days” (literally future)
Feature: “user’s average rating” computed from all their ratings (includes future ones)
Target: “did user churn?” where future data tells you who stayed

Rule: Train on the past; test on the future. Always mirror how the model will be used in production.

What is the difference between batch scoring and online inference? When do you use each?
?
Batch scoring (offline, pre-computed):

Run a batch job (Spark) to score all users/items overnight
Write predictions to a fast lookup store (Redis, DynamoDB, Cassandra)
At request time: look up pre-computed prediction from cache
Latency at serving: <1ms (cache lookup)
Staleness: predictions are hours old (until next batch run)

Online inference (real-time):

Model server (TorchServe, BentoML, Triton) loads model in memory
At request time: compute features → call model → return prediction
Latency at serving: 10–200ms (feature fetch + model compute)
Freshness: uses latest features + model

When to use batch scoring:

Predictions needed for all users/items (recommendations, ranking)
Latency requirement is SLA (<1ms), not freshness
Model is large; inference is expensive
Features are not request-dependent

When to use online inference:

Prediction needs request-time context (fraud detection: transaction details)
Freshness matters (recent user actions must affect prediction)
Personalization that varies per request

Architecture Patterns

What are the Lambda and Kappa architectures? Compare their trade-offs.
?
Lambda Architecture:

Raw data ──┬──► Batch layer  (Spark, accurate, hours lag)     ─┬──► Query
           └──► Speed layer  (Flink, approximate, seconds lag) ┘     (merge)

Batch layer: Recomputes correct views over all history; hours of lag
Speed layer: Handles recent data; approximate; eventually replaced by batch
Serving layer: Merges results from both layers at query time

Lambda trade-offs:

Pro: Batch provides accuracy; stream provides low latency
Con: Two codebases (batch + stream) for same logic → divergence bugs
Con: Complex merge logic at query time

Kappa Architecture:

Raw events ──► Single stream processor ──► Derived views
              (handles real-time AND historical replay)

Single codebase; replay from beginning of Kafka log for reprocessing
Pro: Simpler; one system to maintain
Con: Reprocessing may be slow; requires long log retention

2026 trend: Streaming Lakehouse (Flink → Iceberg/Delta on S3) is replacing both. Stream processing writes ACID tables queryable directly by BI tools without a separate serving layer.

How does the Lakehouse architecture change the batch processing workflow?
?
Problem with data lake: Raw Parquet files → no ACID, no schema enforcement, slow concurrent writes

Problem with data warehouse: Expensive proprietary formats; hard to use ML tools on warehouse data

Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):

Open format: Parquet + transaction log on object storage (S3, GCS)
ACID transactions: Concurrent batch writers don’t corrupt data
Time travel: Query data as of any past point (audit, rollback)
Schema evolution: Add columns without full rewrite
Streaming + batch unified: Same table written by Flink (streaming) and Spark (batch)

Batch workflow with Lakehouse:

1. Ingest:  Fivetran/Debezium → Kafka → raw zone (Parquet/Iceberg on S3)
2. Transform: dbt or Spark → reads raw zone → writes refined zone (Delta/Iceberg)
3. Serve:   Trino, DuckDB, Tableau → query refined zone directly

Key benefit: Batch output immediately queryable by any SQL tool without an export step. Time travel enables “re-run yesterday’s job” without keeping extra copies.

Modern Tools

What is Apache Beam and what problem does it solve?
?
Apache Beam: A portable pipeline programming model for both batch and stream processing.

Problem: Different batch/stream frameworks have different APIs. Writing a pipeline for Spark means rewriting it for Flink or Google Dataflow.

How Beam solves it:

Write one pipeline using the Beam API (Python or Java)
Runner translates Beam pipeline to framework-specific execution
Supported runners: Apache Flink, Apache Spark, Google Cloud Dataflow

When to use Beam:

Multi-cloud or multi-runner requirements
Teams using Google Cloud Dataflow (Beam is its native API)
Portability requirements (write once, run anywhere)

Trade-off: Beam abstraction is slightly less flexible than native Spark/Flink APIs; some framework-specific optimizations not accessible through Beam.

What is data skew in batch processing and how do you handle it with salting?
?
Data skew: One key has far more records than others → one worker processes most of the data while others are idle.

Example: Join on user_id where user_id=99 has 100M events (celebrity/bot account).

Detection: Look at Spark UI → one task takes 100x longer than others.

Salting solution (two-phase aggregation):

Step 1: Add random salt prefix to skewed key (0 to N-1)
  Original: (user_id=99, event_type="view", count=1)
  Salted:   (user_id=99_7, event_type="view", count=1)
            (user_id=99_3, event_type="view", count=1)
  → Distributes key 99 across N workers

Step 2: Partial aggregate within each salt bucket
  Worker 7: SUM(count where user_id=99_7) = 12M
  Worker 3: SUM(count where user_id=99_3) = 8M

Step 3: Final aggregate across salt buckets
  SUM across all user_id=99_* = 100M

Trade-off: Two-pass aggregation adds a second reduce phase; worth it when skew ratio > 10x.

Alternative for joins: Broadcast the large-key records explicitly; handle “exploded” records separately.

Comparisons and Trade-offs

Compare Spark, Flink (batch mode), dbt, and DuckDB for batch processing.
?

Framework	Scale	API	Lazy?	Shuffle	Best for
Spark	Cluster, 100GB–PB	DataFrame/SQL/Python	Yes	Memory+disk	General batch; ML; ETL at scale
Flink batch	Cluster, 100GB–PB	Table API/SQL	Yes	Pipelined	Unified batch+stream; low-latency
dbt	Warehouse-scale	SQL only	Via SQL	None (SQL engine handles)	ELT transforms inside warehouse
DuckDB	Single machine, <500GB	SQL + DataFrame	Via SQL	N/A	Local analytics; notebook; fast SQL

Decision guide:

Need Python/distributed cluster → Spark
Already using Flink for streaming → Flink batch (same runtime)
All transforms expressible in SQL, have a warehouse → dbt
Single machine, <500GB, speed matters → DuckDB
Multi-cloud portability → Apache Beam (on top of Flink or Spark)

What is the Catalyst optimizer in Spark and what optimizations does it apply?
?
Catalyst: Spark’s query optimizer; converts a logical plan into an optimized physical plan.

Optimization phases:

Analysis: Resolve column names, types; validate schema
Logical optimization: Apply rule-based rewrites
- Predicate pushdown: Push WHERE filters as close to source as possible (read less data)
- Column pruning: Only read columns referenced in query
- Constant folding: Evaluate constant expressions at plan time
- Null propagation: Simplify expressions with known nulls
Physical planning: Choose physical operators
- Join strategy: Broadcast hash join vs sort-merge join based on table size
- Partition pruning: Skip partitions that can’t match filter predicates
Code generation: Generate JVM bytecode for tight CPU execution loops (Tungsten)

Why DataFrames outperform RDDs:

RDDs are opaque functions; optimizer cannot see inside
DataFrames have a schema → optimizer can apply all the above rules
DataFrames automatically get broadcast join detection, predicate pushdown, column pruning

AQE (Adaptive Query Execution, Spark 3.x+): Applies optimizations at runtime based on actual data statistics observed during execution (e.g., dynamically changes join strategy mid-job).

Serving Derived Data

What is derived data and why does immutability matter for batch processing?
?
Derived data: Any dataset that can be computed from an authoritative (immutable) source.

Examples: search indexes, recommendation scores, aggregated metrics, ML model predictions, materialized views, caches

The immutability principle:

Source data (user events, transactions) is never modified
Batch jobs read from immutable source → write to new derived output location
If derived data is corrupted or deleted: just re-run the batch job
No need for complex recovery procedures

Why this matters:

Fault tolerance: Job failure = re-run from unchanged input
Debugging: Can inspect any intermediate step (nothing was modified)
Historical reprocessing: Change business logic → reprocess all history → new derived view
A/B testing logic: Run old + new logic on same input; compare outputs before deploying
Auditability: Derived data can always be traced to its source

Contrast with mutable state: If a batch job modifies its input (anti-pattern), re-running it produces different results → non-deterministic, non-reproducible.

What are the stages of a complete batch ML pipeline, and what are the key pitfalls at each stage?
?
Stage 1: Feature Engineering

Transform raw events into numerical feature vectors
Pitfall: Data leakage — features that encode future information
Pitfall: Training-serving skew — feature computed differently in batch vs online

Stage 2: Temporal Train/Test Split

Split at a time cutoff, not randomly
Pitfall: Random split leaks future events into training set
Pitfall: Cutoff too recent — insufficient test data

Stage 3: Model Training

Spark MLlib for distributed tree models; PyTorch DDP for deep learning
Pitfall: Wrong metric — optimizing accuracy on imbalanced dataset
Pitfall: Overfitting to train period — model doesn’t generalize to new time periods

Stage 4: Evaluation

Evaluate on temporal holdout; slice by cohort/geography/platform
Pitfall: Aggregate metrics only — model may fail badly on important subgroups

Stage 5: Serving

Batch scoring: write predictions to cache (Redis); serve at <1ms
Online inference: compute at request time from fresh features
Pitfall: Stale predictions — batch scoring without fresh feature refresh
Pitfall: Serving-training skew — model server uses different preprocessing than training

Total Cards: 16
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch11-flashcards