Chapter 11 Flashcards — Batch Processing

flashcards ddia-2e chapter11 batch-processing


Core Concepts

What is batch processing, and how does it differ from stream processing and request/response?
?
Batch processing: Process a bounded (finite) dataset in bulk; job starts, processes all data, terminates.

  • Latency: minutes to hours
  • Triggered by scheduler (not user request)
  • Input is complete before processing starts

Stream processing: Process an unbounded (infinite) stream of events as they arrive.

  • Latency: milliseconds to seconds
  • Never terminates (long-running job)

Request/response (OLTP): Handle one user request at a time.

  • Latency: milliseconds
  • Processes tiny amount of data per request

Batch’s core principle: Immutable inputs → deterministic transformations → derived outputs. If job fails: re-run from the same input.

What is the “shuffle” in distributed batch processing, and why is it expensive?
?
Shuffle: The network redistribution of records so that all records with the same key end up on the same worker.

Why it’s needed: Operations like groupBy, join, and distinct require all same-key records to be co-located on one worker.

Why it’s expensive:

  1. Network I/O: All data may cross the network (potentially all mappers to all reducers)
  2. Sort: Records sorted by key before grouping
  3. Disk spill: Intermediate data written to disk when RAM exhausted
  4. Global barrier: Next stage cannot start until ALL upstream workers finish

Minimizing shuffle:

  • Broadcast join: Replicate small table; no shuffle for large side
  • Combiner/partial aggregation: Pre-aggregate locally before shuffle (works for sum, count, max)
  • Co-partitioning: Pre-partition both inputs by the same key
  • Salting: Split skewed key across multiple workers

What are the three main batch join strategies and when do you use each?
?
1. Broadcast Hash Join (map-side join):

  • Load small table into RAM hash table on every worker
  • Large table processed locally against hash table
  • No shuffle for large side
  • Use when: one side fits in RAM (typically < 1GB to 10GB)
  • Spark: large_df.join(broadcast(small_df), on="key")

2. Sort-Merge Join (reduce-side join):

  • Both inputs shuffled and sorted by join key
  • Merge two sorted streams in one pass
  • Works for any input size (large × large)
  • Use when: both sides are large; no broadcast option

3. Partitioned Hash Join (bucketed join):

  • Both inputs pre-bucketed by the same key
  • Each worker joins its local bucket pair (no cross-network shuffle)
  • Use when: datasets are pre-bucketed and joined repeatedly
  • Setup cost: one-time bucketing; then all joins are fast

Rule: Broadcast join whenever possible; sort-merge join for large × large.

DataFrame API and Query Languages

What is the difference between the RDD API, DataFrame API, and SQL in Spark? When do you use each?
?
RDD API (low-level):

  • Direct manipulation of distributed collections; no schema
  • No query optimizer; code is verbose
  • Use when: custom operators; exact execution control needed
  • Almost never used for new code in 2026

DataFrame API:

df.groupBy("event_type").agg(count("*").alias("n")).orderBy("n", ascending=False)
  • Schema-aware; lazy (builds logical plan)
  • Catalyst optimizer: predicate pushdown, column pruning, join reordering
  • Use when: data engineers, ML engineers; complex transformations

SQL (Spark SQL / Flink SQL / dbt):

SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type ORDER BY n DESC
  • Highest-level abstraction; most optimizer opportunities
  • CBO (Cost-Based Optimizer) chooses join strategy using statistics
  • Use when: analysts, BI engineers, dbt models, standard analytics

Key insight: DataFrame and SQL both compile to the same Catalyst logical plan — same performance; different authoring style.

What are narrow vs wide transformations in Spark, and why does the distinction matter?
?
Narrow transformations (no shuffle):

  • Each output partition depends on only one input partition
  • Can be pipelined together (fused) with no network transfer
  • Examples: select, filter, withColumn, map, flatMap, union

Wide transformations (require shuffle):

  • Output partitions may depend on many input partitions
  • Force a shuffle stage → data redistributed across network
  • Examples: groupBy, join, distinct, repartition, sort/orderBy

Why it matters:

  • Shuffle is the dominant cost in batch jobs
  • Minimizing wide transformations = minimizing job runtime
  • Query optimizer fuses adjacent narrow transformations into single stage
  • Each wide transformation = new “stage” in Spark DAG (visible in Spark UI)

Practical tip: Chain as many narrow transformations as possible before a wide one; let Catalyst fuse them.

ETL and ELT

What is the difference between ETL and ELT? When do you use each?
?
ETL (Extract → Transform → Load):

  • Transform data before loading into the warehouse
  • Transform happens in external compute (Spark, Informatica)
  • Warehouse receives clean, schema-enforced data
  • Raw data may be discarded after transform

ELT (Extract → Load → Transform):

  • Load raw data first into warehouse/lakehouse
  • Transform inside the warehouse using SQL (dbt)
  • Raw data preserved for re-transformation
  • Standard pattern in 2026: Fivetran/Airbyte (ingest) → dbt (transform)

When to use ETL:

  • Transforms require custom code (Python, Scala) not expressible in SQL
  • Heavy preprocessing (ML feature engineering, NLP, image processing)
  • Data too large to load raw (very unusual with modern cloud warehouses)

When to use ELT:

  • Transformations are SQL-expressible (the majority of analytics)
  • You want to preserve raw data for future re-transformation
  • You use a cloud warehouse (Snowflake, BigQuery, Databricks, Redshift)
  • Analytics engineering with dbt (2026 standard)

What is dbt (data build tool) and what problem does it solve?
?
dbt = SQL-based transformation framework for data warehouses.

Problem it solves: Before dbt, SQL transformation pipelines were ad-hoc scripts with no version control, no tests, no documentation, no dependency management.

What dbt provides:

  • Each model = one SQL SELECT statement → materializes as table or view
  • DAG management: {{ ref('model_name') }} creates dependency; dbt builds in order
  • Tests: Built-in assertions (not_null, unique, accepted_values, referential_integrity)
  • Documentation: Auto-generates data catalog from model definitions
  • Incremental models: Only process new data (not full recompute)
  • Version control: SQL models live in git → reviewable, auditable

Incremental pattern:

{{ config(materialized='incremental', unique_key='event_id') }}
SELECT * FROM raw_events
{% if is_incremental() %}
WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}

2026 status: Industry standard for analytics engineering; 50,000+ organizations; integrates with Snowflake, BigQuery, Databricks, DuckDB.

Machine Learning Pipelines

Why must you use a temporal (time-based) split instead of a random split when evaluating ML models on time-series data?
?
Random split problem (data leakage):

  • A random split mixes events from all time periods across train and test
  • A feature computed at time T might depend on events that happened AFTER T
  • The model “learns” from future information it won’t have in production
  • Result: inflated offline metrics that don’t match production performance

Temporal split (correct approach):

Timeline: ─────────────────────────────────────────────►
                              │
  [───────── TRAIN ───────────│──── TEST ────]
  All events before cutoff   cutoff  Events after cutoff

Examples of leakage via random split:

  • Feature: “user’s total purchases in the next 7 days” (literally future)
  • Feature: “user’s average rating” computed from all their ratings (includes future ones)
  • Target: “did user churn?” where future data tells you who stayed

Rule: Train on the past; test on the future. Always mirror how the model will be used in production.

What is the difference between batch scoring and online inference? When do you use each?
?
Batch scoring (offline, pre-computed):

  • Run a batch job (Spark) to score all users/items overnight
  • Write predictions to a fast lookup store (Redis, DynamoDB, Cassandra)
  • At request time: look up pre-computed prediction from cache
  • Latency at serving: <1ms (cache lookup)
  • Staleness: predictions are hours old (until next batch run)

Online inference (real-time):

  • Model server (TorchServe, BentoML, Triton) loads model in memory
  • At request time: compute features → call model → return prediction
  • Latency at serving: 10–200ms (feature fetch + model compute)
  • Freshness: uses latest features + model

When to use batch scoring:

  • Predictions needed for all users/items (recommendations, ranking)
  • Latency requirement is SLA (<1ms), not freshness
  • Model is large; inference is expensive
  • Features are not request-dependent

When to use online inference:

  • Prediction needs request-time context (fraud detection: transaction details)
  • Freshness matters (recent user actions must affect prediction)
  • Personalization that varies per request

Architecture Patterns

What are the Lambda and Kappa architectures? Compare their trade-offs.
?
Lambda Architecture:

Raw data ──┬──► Batch layer  (Spark, accurate, hours lag)     ─┬──► Query
           └──► Speed layer  (Flink, approximate, seconds lag) ┘     (merge)
  • Batch layer: Recomputes correct views over all history; hours of lag
  • Speed layer: Handles recent data; approximate; eventually replaced by batch
  • Serving layer: Merges results from both layers at query time

Lambda trade-offs:

  • Pro: Batch provides accuracy; stream provides low latency
  • Con: Two codebases (batch + stream) for same logic → divergence bugs
  • Con: Complex merge logic at query time

Kappa Architecture:

Raw events ──► Single stream processor ──► Derived views
              (handles real-time AND historical replay)
  • Single codebase; replay from beginning of Kafka log for reprocessing
  • Pro: Simpler; one system to maintain
  • Con: Reprocessing may be slow; requires long log retention

2026 trend: Streaming Lakehouse (Flink → Iceberg/Delta on S3) is replacing both. Stream processing writes ACID tables queryable directly by BI tools without a separate serving layer.

How does the Lakehouse architecture change the batch processing workflow?
?
Problem with data lake: Raw Parquet files → no ACID, no schema enforcement, slow concurrent writes

Problem with data warehouse: Expensive proprietary formats; hard to use ML tools on warehouse data

Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):

  • Open format: Parquet + transaction log on object storage (S3, GCS)
  • ACID transactions: Concurrent batch writers don’t corrupt data
  • Time travel: Query data as of any past point (audit, rollback)
  • Schema evolution: Add columns without full rewrite
  • Streaming + batch unified: Same table written by Flink (streaming) and Spark (batch)

Batch workflow with Lakehouse:

1. Ingest:  Fivetran/Debezium → Kafka → raw zone (Parquet/Iceberg on S3)
2. Transform: dbt or Spark → reads raw zone → writes refined zone (Delta/Iceberg)
3. Serve:   Trino, DuckDB, Tableau → query refined zone directly

Key benefit: Batch output immediately queryable by any SQL tool without an export step. Time travel enables “re-run yesterday’s job” without keeping extra copies.

Modern Tools

What is Apache Beam and what problem does it solve?
?
Apache Beam: A portable pipeline programming model for both batch and stream processing.

Problem: Different batch/stream frameworks have different APIs. Writing a pipeline for Spark means rewriting it for Flink or Google Dataflow.

How Beam solves it:

  • Write one pipeline using the Beam API (Python or Java)
  • Runner translates Beam pipeline to framework-specific execution
  • Supported runners: Apache Flink, Apache Spark, Google Cloud Dataflow

When to use Beam:

  • Multi-cloud or multi-runner requirements
  • Teams using Google Cloud Dataflow (Beam is its native API)
  • Portability requirements (write once, run anywhere)

Trade-off: Beam abstraction is slightly less flexible than native Spark/Flink APIs; some framework-specific optimizations not accessible through Beam.

What is data skew in batch processing and how do you handle it with salting?
?
Data skew: One key has far more records than others → one worker processes most of the data while others are idle.

Example: Join on user_id where user_id=99 has 100M events (celebrity/bot account).

Detection: Look at Spark UI → one task takes 100x longer than others.

Salting solution (two-phase aggregation):

Step 1: Add random salt prefix to skewed key (0 to N-1)
  Original: (user_id=99, event_type="view", count=1)
  Salted:   (user_id=99_7, event_type="view", count=1)
            (user_id=99_3, event_type="view", count=1)
  → Distributes key 99 across N workers

Step 2: Partial aggregate within each salt bucket
  Worker 7: SUM(count where user_id=99_7) = 12M
  Worker 3: SUM(count where user_id=99_3) = 8M

Step 3: Final aggregate across salt buckets
  SUM across all user_id=99_* = 100M

Trade-off: Two-pass aggregation adds a second reduce phase; worth it when skew ratio > 10x.

Alternative for joins: Broadcast the large-key records explicitly; handle “exploded” records separately.

Comparisons and Trade-offs

Compare Spark, Flink (batch mode), dbt, and DuckDB for batch processing.
?

FrameworkScaleAPILazy?ShuffleBest for
SparkCluster, 100GB–PBDataFrame/SQL/PythonYesMemory+diskGeneral batch; ML; ETL at scale
Flink batchCluster, 100GB–PBTable API/SQLYesPipelinedUnified batch+stream; low-latency
dbtWarehouse-scaleSQL onlyVia SQLNone (SQL engine handles)ELT transforms inside warehouse
DuckDBSingle machine, <500GBSQL + DataFrameVia SQLN/ALocal analytics; notebook; fast SQL

Decision guide:

  • Need Python/distributed cluster → Spark
  • Already using Flink for streaming → Flink batch (same runtime)
  • All transforms expressible in SQL, have a warehouse → dbt
  • Single machine, <500GB, speed matters → DuckDB
  • Multi-cloud portability → Apache Beam (on top of Flink or Spark)

What is the Catalyst optimizer in Spark and what optimizations does it apply?
?
Catalyst: Spark’s query optimizer; converts a logical plan into an optimized physical plan.

Optimization phases:

  1. Analysis: Resolve column names, types; validate schema
  2. Logical optimization: Apply rule-based rewrites
    • Predicate pushdown: Push WHERE filters as close to source as possible (read less data)
    • Column pruning: Only read columns referenced in query
    • Constant folding: Evaluate constant expressions at plan time
    • Null propagation: Simplify expressions with known nulls
  3. Physical planning: Choose physical operators
    • Join strategy: Broadcast hash join vs sort-merge join based on table size
    • Partition pruning: Skip partitions that can’t match filter predicates
  4. Code generation: Generate JVM bytecode for tight CPU execution loops (Tungsten)

Why DataFrames outperform RDDs:

  • RDDs are opaque functions; optimizer cannot see inside
  • DataFrames have a schema → optimizer can apply all the above rules
  • DataFrames automatically get broadcast join detection, predicate pushdown, column pruning

AQE (Adaptive Query Execution, Spark 3.x+): Applies optimizations at runtime based on actual data statistics observed during execution (e.g., dynamically changes join strategy mid-job).

Serving Derived Data

What is derived data and why does immutability matter for batch processing?
?
Derived data: Any dataset that can be computed from an authoritative (immutable) source.

  • Examples: search indexes, recommendation scores, aggregated metrics, ML model predictions, materialized views, caches

The immutability principle:

  • Source data (user events, transactions) is never modified
  • Batch jobs read from immutable source → write to new derived output location
  • If derived data is corrupted or deleted: just re-run the batch job
  • No need for complex recovery procedures

Why this matters:

  1. Fault tolerance: Job failure = re-run from unchanged input
  2. Debugging: Can inspect any intermediate step (nothing was modified)
  3. Historical reprocessing: Change business logic → reprocess all history → new derived view
  4. A/B testing logic: Run old + new logic on same input; compare outputs before deploying
  5. Auditability: Derived data can always be traced to its source

Contrast with mutable state: If a batch job modifies its input (anti-pattern), re-running it produces different results → non-deterministic, non-reproducible.

What are the stages of a complete batch ML pipeline, and what are the key pitfalls at each stage?
?
Stage 1: Feature Engineering

  • Transform raw events into numerical feature vectors
  • Pitfall: Data leakage — features that encode future information
  • Pitfall: Training-serving skew — feature computed differently in batch vs online

Stage 2: Temporal Train/Test Split

  • Split at a time cutoff, not randomly
  • Pitfall: Random split leaks future events into training set
  • Pitfall: Cutoff too recent — insufficient test data

Stage 3: Model Training

  • Spark MLlib for distributed tree models; PyTorch DDP for deep learning
  • Pitfall: Wrong metric — optimizing accuracy on imbalanced dataset
  • Pitfall: Overfitting to train period — model doesn’t generalize to new time periods

Stage 4: Evaluation

  • Evaluate on temporal holdout; slice by cohort/geography/platform
  • Pitfall: Aggregate metrics only — model may fail badly on important subgroups

Stage 5: Serving

  • Batch scoring: write predictions to cache (Redis); serve at <1ms
  • Online inference: compute at request time from fresh features
  • Pitfall: Stale predictions — batch scoring without fresh feature refresh
  • Pitfall: Serving-training skew — model server uses different preprocessing than training

Total Cards: 16
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29