Chapter 11 Cheat Sheet — Batch Processing

One-Line Summaries

Concept	One-Liner
Batch processing	Process bounded (finite) dataset; produce derived output; re-runnable from immutable source
Shuffle	Network redistribution of records by key so same-key records reach same worker
Broadcast hash join	Replicate small table to every worker; join without shuffling large side
Sort-merge join	Sort both inputs by join key; merge in one pass (large × large)
DataFrame API	Schema-aware, lazy-evaluated table abstraction; optimized by Catalyst/CBO
ETL	Extract → Transform (outside) → Load; transform before loading
ELT	Extract → Load (raw) → Transform (inside warehouse); dbt standard pattern
Lambda architecture	Batch layer (accurate) + speed layer (recent) merged at query time
Kappa architecture	Single stream processor handles both historical reprocessing and real-time
Batch scoring	Pre-compute ML predictions for all users overnight; serve from cache
dbt model	SQL SELECT statement materialized as table/view; versioned, tested, DAG-aware
Data skew	One key has disproportionate records → one worker overwhelmed; fix with salting

Shuffle: The Core Cost

Before shuffle:                      After shuffle:
Worker 1: [A=1, B=3, A=2]           Worker 1 (key A): [A=1, A=2, A=5]
Worker 2: [B=1, A=5, C=2]           Worker 2 (key B): [B=3, B=1, B=9]
Worker 3: [B=9, C=7, A=3]           Worker 3 (key C): [C=2, C=7]
           ↑                                    ↑ (+ A=3 from Worker 3)
     Records on 3 workers               Same-key records co-located
     (randomly distributed)             (after network transfer + sort)

Cost:
  - All data may cross the network (N × M connections)
  - Sort by key (O(n log n))
  - Disk spill if memory exhausted
  - Global barrier: next stage waits for ALL workers to finish shuffle

Join Strategy Decision Tree

                    Is one side small enough for RAM?
                           │
              ┌────────────┴────────────┐
             YES                        NO
              │                         │
   BROADCAST HASH JOIN          Are both sides pre-bucketed
   (no shuffle for large)       by the same key?
                                         │
                              ┌──────────┴──────────┐
                             YES                     NO
                              │                       │
                   PARTITIONED HASH JOIN    SORT-MERGE JOIN
                   (join local partitions)  (shuffle + sort both sides)

Approximate size threshold for broadcast: < 1GB–10GB (depends on cluster RAM)

Batch Frameworks Comparison

             Hadoop MR      Spark         Flink (batch)    dbt
─────────────────────────────────────────────────────────────────────
API          Java map/reduce DataFrame/SQL  Table API/SQL   SQL only
Scale        Petabytes      100GB–PB       100GB–PB        Warehouse
Shuffle      To disk always  Memory+disk    Pipelined       N/A (SQL)
Lazy eval    No             Yes (RDD/DF)   Yes             Yes (SQL)
Fault tol.   Disk checkpts  Lineage        Checkpoints     N/A
Speed        Slow (disk I/O) Fast           Fast            Fast (in DB)
Use 2026?    Legacy only     YES            YES             YES (ELT)

ETL vs ELT

ETL (Traditional):
  [Source DB] → [Informatica/Spark] → transform → [Warehouse star schema]
                 ↑ transform here                  (arrives clean)
  Problem: Transform logic lives outside warehouse; hard to debug; fixed schema

ELT (Modern):
  [Source DB] → [Fivetran/Airbyte] → [Warehouse raw layer] → [dbt SQL] → [Refined]
                  ↑ load raw                                    ↑ transform here
  Benefit: Raw data preserved; re-transform anytime; SQL in warehouse; dbt tests

ML Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    BATCH ML PIPELINE                                │
│                                                                     │
│  Raw Events           Feature Engineering    Model Training         │
│  (S3/Lakehouse)  ──►  (Spark / dbt)      ──► (Spark MLlib /        │
│                        - encode categories    PyTorch DDP /         │
│                        - aggregate history    XGBoost)              │
│                        - temporal features                          │
│                               │                                     │
│                        Feature Store                                │
│                       (Feast / Tecton)                              │
│                               │                                     │
│                    ┌──────────┴──────────┐                          │
│              Batch Scoring          Online Inference                │
│              (Spark, nightly)       (TorchServe, real-time)        │
│              → Redis/DynamoDB       → Model server                 │
│              (pre-computed)         (on-demand)                    │
└─────────────────────────────────────────────────────────────────────┘

CRITICAL: Always use TEMPORAL split (not random):
  Train: events before cutoff date
  Test:  events after cutoff date
  Never allow future → past leakage!

Lambda vs Kappa Architecture

LAMBDA:
  Raw ──┬──► Batch layer  (Spark, nightly)    ──┐
        │    accurate, complete, hours lag       ├──► Serving layer
        └──► Speed layer  (Flink, seconds lag) ──┘    (merges both views)

  + Accurate historical + Low-latency recent
  - Two codebases; two sets of bugs; complex merge logic

KAPPA:
  Raw ──► Stream processor (Flink, Kafka Streams)  ──► Derived views
          (handles real-time AND historical replay)

  + Single codebase; simpler operations
  - Reprocessing can be slow; log storage expensive

2026 TREND: Streaming Lakehouse replaces both:
  Stream → Iceberg tables on S3 (ACID, queryable by BI tools directly)

DataFrame Operation Types

NARROW (no shuffle — each partition processed independently):
  select, filter, withColumn, map, flatMap, union, dropDuplicates*
  
WIDE (require shuffle — data moves between workers):
  groupBy, join, distinct, repartition, coalesce, sort/orderBy
  
  * dropDuplicates without repartition can be narrow (per-partition dedup)
  
ACTION (trigger execution — lazy plan runs):
  count(), show(), collect(), write(), toPandas(), save()

Skew Handling (Salting)

Problem:  user_id=BIGSTAR → 10M records → Worker 7 gets all of them
                                           (other 99 workers idle)

Solution (salting):
  Step 1: Add random salt 0-9 to skewed key
          (user_id=BIGSTAR, salt=3) → worker 13
          (user_id=BIGSTAR, salt=7) → worker 47
          
  Step 2: Partial aggregate within each salt group
          Worker 13: count(BIGSTAR_3) = 1.2M
          Worker 47: count(BIGSTAR_7) = 0.8M
          
  Step 3: Final aggregate across salt groups
          SUM(counts for BIGSTAR) = 10M

dbt Incremental Model Pattern

-- incremental model: only process NEW data each run
{{ config(materialized='incremental', unique_key='event_id') }}
 
SELECT
    event_id,
    user_id,
    event_type,
    occurred_at
FROM {{ ref('raw_events') }}
 
{% if is_incremental() %}
  -- Only process events since last run
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}

Key Numbers (Rule of Thumb, 2026)

Metric	Value
Broadcast join threshold (Spark default)	10MB (tune to 1–10GB)
DuckDB practical limit	~500GB single machine
Spark partition target size	~128MB per partition
dbt incremental run (vs full refresh)	10–100x faster
Shuffle overhead vs broadcast	5–20x slower for same join
MapReduce vs Spark (multi-step jobs)	Spark 5–10x faster

Red Flags

❌  Using MapReduce for new projects (Spark/Flink are 5-10x faster)
❌  Random train/test split for time-series ML (leaks future → past)
❌  .toPandas() on a large Spark DataFrame (brings PB to driver node)
❌  Large × large join without checking for skew (straggler workers)
❌  Lambda architecture for new systems (operational complexity)
❌  ETL that discards raw data (can't reprocess with new logic)
❌  Non-deterministic batch jobs (random seeds, unordered ops) — breaks reruns

Green Flags

✅  Broadcast join for large × small (no shuffle overhead)
✅  Store raw data in immutable Lakehouse; transform with dbt
✅  Temporal split for ML feature engineering
✅  Salt skewed keys before groupBy/join
✅  Partition Parquet/Iceberg by query predicates (date, region)
✅  dbt incremental models for daily batch jobs (skip unchanged data)
✅  Cache frequently reused DataFrames in Spark (spark.catalog.cacheTable)
✅  Columnar format (Parquet/ORC) for analytics — 10x smaller, faster scans

Quick Revision Time: 5 minutes
Interview Prep: 20 minutes
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch11-cheatsheet

Chapter 11 Cheat Sheet — Batch Processing

One-Line Summaries

Shuffle: The Core Cost

Join Strategy Decision Tree

Batch Frameworks Comparison

ETL vs ELT

ML Pipeline Architecture

Lambda vs Kappa Architecture

DataFrame Operation Types

Skew Handling (Salting)

dbt Incremental Model Pattern

Key Numbers (Rule of Thumb, 2026)

Red Flags

Green Flags

Graph View

Table of Contents