Chapter 11 Cheat Sheet — Batch Processing

One-Line Summaries

ConceptOne-Liner
Batch processingProcess bounded (finite) dataset; produce derived output; re-runnable from immutable source
ShuffleNetwork redistribution of records by key so same-key records reach same worker
Broadcast hash joinReplicate small table to every worker; join without shuffling large side
Sort-merge joinSort both inputs by join key; merge in one pass (large × large)
DataFrame APISchema-aware, lazy-evaluated table abstraction; optimized by Catalyst/CBO
ETLExtract → Transform (outside) → Load; transform before loading
ELTExtract → Load (raw) → Transform (inside warehouse); dbt standard pattern
Lambda architectureBatch layer (accurate) + speed layer (recent) merged at query time
Kappa architectureSingle stream processor handles both historical reprocessing and real-time
Batch scoringPre-compute ML predictions for all users overnight; serve from cache
dbt modelSQL SELECT statement materialized as table/view; versioned, tested, DAG-aware
Data skewOne key has disproportionate records → one worker overwhelmed; fix with salting

Shuffle: The Core Cost

Before shuffle:                      After shuffle:
Worker 1: [A=1, B=3, A=2]           Worker 1 (key A): [A=1, A=2, A=5]
Worker 2: [B=1, A=5, C=2]           Worker 2 (key B): [B=3, B=1, B=9]
Worker 3: [B=9, C=7, A=3]           Worker 3 (key C): [C=2, C=7]
           ↑                                    ↑ (+ A=3 from Worker 3)
     Records on 3 workers               Same-key records co-located
     (randomly distributed)             (after network transfer + sort)

Cost:
  - All data may cross the network (N × M connections)
  - Sort by key (O(n log n))
  - Disk spill if memory exhausted
  - Global barrier: next stage waits for ALL workers to finish shuffle

Join Strategy Decision Tree

                    Is one side small enough for RAM?
                           │
              ┌────────────┴────────────┐
             YES                        NO
              │                         │
   BROADCAST HASH JOIN          Are both sides pre-bucketed
   (no shuffle for large)       by the same key?
                                         │
                              ┌──────────┴──────────┐
                             YES                     NO
                              │                       │
                   PARTITIONED HASH JOIN    SORT-MERGE JOIN
                   (join local partitions)  (shuffle + sort both sides)

Approximate size threshold for broadcast: < 1GB–10GB (depends on cluster RAM)

Batch Frameworks Comparison

             Hadoop MR      Spark         Flink (batch)    dbt
─────────────────────────────────────────────────────────────────────
API          Java map/reduce DataFrame/SQL  Table API/SQL   SQL only
Scale        Petabytes      100GB–PB       100GB–PB        Warehouse
Shuffle      To disk always  Memory+disk    Pipelined       N/A (SQL)
Lazy eval    No             Yes (RDD/DF)   Yes             Yes (SQL)
Fault tol.   Disk checkpts  Lineage        Checkpoints     N/A
Speed        Slow (disk I/O) Fast           Fast            Fast (in DB)
Use 2026?    Legacy only     YES            YES             YES (ELT)

ETL vs ELT

ETL (Traditional):
  [Source DB] → [Informatica/Spark] → transform → [Warehouse star schema]
                 ↑ transform here                  (arrives clean)
  Problem: Transform logic lives outside warehouse; hard to debug; fixed schema

ELT (Modern):
  [Source DB] → [Fivetran/Airbyte] → [Warehouse raw layer] → [dbt SQL] → [Refined]
                  ↑ load raw                                    ↑ transform here
  Benefit: Raw data preserved; re-transform anytime; SQL in warehouse; dbt tests

ML Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    BATCH ML PIPELINE                                │
│                                                                     │
│  Raw Events           Feature Engineering    Model Training         │
│  (S3/Lakehouse)  ──►  (Spark / dbt)      ──► (Spark MLlib /        │
│                        - encode categories    PyTorch DDP /         │
│                        - aggregate history    XGBoost)              │
│                        - temporal features                          │
│                               │                                     │
│                        Feature Store                                │
│                       (Feast / Tecton)                              │
│                               │                                     │
│                    ┌──────────┴──────────┐                          │
│              Batch Scoring          Online Inference                │
│              (Spark, nightly)       (TorchServe, real-time)        │
│              → Redis/DynamoDB       → Model server                 │
│              (pre-computed)         (on-demand)                    │
└─────────────────────────────────────────────────────────────────────┘

CRITICAL: Always use TEMPORAL split (not random):
  Train: events before cutoff date
  Test:  events after cutoff date
  Never allow future → past leakage!

Lambda vs Kappa Architecture

LAMBDA:
  Raw ──┬──► Batch layer  (Spark, nightly)    ──┐
        │    accurate, complete, hours lag       ├──► Serving layer
        └──► Speed layer  (Flink, seconds lag) ──┘    (merges both views)

  + Accurate historical + Low-latency recent
  - Two codebases; two sets of bugs; complex merge logic

KAPPA:
  Raw ──► Stream processor (Flink, Kafka Streams)  ──► Derived views
          (handles real-time AND historical replay)

  + Single codebase; simpler operations
  - Reprocessing can be slow; log storage expensive

2026 TREND: Streaming Lakehouse replaces both:
  Stream → Iceberg tables on S3 (ACID, queryable by BI tools directly)

DataFrame Operation Types

NARROW (no shuffle — each partition processed independently):
  select, filter, withColumn, map, flatMap, union, dropDuplicates*
  
WIDE (require shuffle — data moves between workers):
  groupBy, join, distinct, repartition, coalesce, sort/orderBy
  
  * dropDuplicates without repartition can be narrow (per-partition dedup)
  
ACTION (trigger execution — lazy plan runs):
  count(), show(), collect(), write(), toPandas(), save()

Skew Handling (Salting)

Problem:  user_id=BIGSTAR → 10M records → Worker 7 gets all of them
                                           (other 99 workers idle)

Solution (salting):
  Step 1: Add random salt 0-9 to skewed key
          (user_id=BIGSTAR, salt=3) → worker 13
          (user_id=BIGSTAR, salt=7) → worker 47
          
  Step 2: Partial aggregate within each salt group
          Worker 13: count(BIGSTAR_3) = 1.2M
          Worker 47: count(BIGSTAR_7) = 0.8M
          
  Step 3: Final aggregate across salt groups
          SUM(counts for BIGSTAR) = 10M

dbt Incremental Model Pattern

-- incremental model: only process NEW data each run
{{ config(materialized='incremental', unique_key='event_id') }}
 
SELECT
    event_id,
    user_id,
    event_type,
    occurred_at
FROM {{ ref('raw_events') }}
 
{% if is_incremental() %}
  -- Only process events since last run
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}

Key Numbers (Rule of Thumb, 2026)

MetricValue
Broadcast join threshold (Spark default)10MB (tune to 1–10GB)
DuckDB practical limit~500GB single machine
Spark partition target size~128MB per partition
dbt incremental run (vs full refresh)10–100x faster
Shuffle overhead vs broadcast5–20x slower for same join
MapReduce vs Spark (multi-step jobs)Spark 5–10x faster

Red Flags

❌  Using MapReduce for new projects (Spark/Flink are 5-10x faster)
❌  Random train/test split for time-series ML (leaks future → past)
❌  .toPandas() on a large Spark DataFrame (brings PB to driver node)
❌  Large × large join without checking for skew (straggler workers)
❌  Lambda architecture for new systems (operational complexity)
❌  ETL that discards raw data (can't reprocess with new logic)
❌  Non-deterministic batch jobs (random seeds, unordered ops) — breaks reruns

Green Flags

✅  Broadcast join for large × small (no shuffle overhead)
✅  Store raw data in immutable Lakehouse; transform with dbt
✅  Temporal split for ML feature engineering
✅  Salt skewed keys before groupBy/join
✅  Partition Parquet/Iceberg by query predicates (date, region)
✅  dbt incremental models for daily batch jobs (skip unchanged data)
✅  Cache frequently reused DataFrames in Spark (spark.catalog.cacheTable)
✅  Columnar format (Parquet/ORC) for analytics — 10x smaller, faster scans

Quick Revision Time: 5 minutes
Interview Prep: 20 minutes
Last Updated: 2026-05-29