Chapter 10 Cheat Sheet - Batch Processing

One-Line Summaries

Concept	One-Liner
Batch processing	Process bounded (finite) dataset in bulk; produce derived output
MapReduce	Map (extract KV pairs) + Shuffle (group by key) + Reduce (aggregate)
Shuffle	MapReduce phase that redistributes data by key across network
Dataflow engine	DAG-based batch execution (Spark/Flink); avoids unnecessary disk writes
Derived data	Any output that can be recomputed from immutable source inputs
Broadcast hash join	Load small table in memory on each worker; join without shuffle
Sort-merge join	Sort both inputs by key; merge-join in single pass
BSP model	Bulk Synchronous Parallel — iterative graph algorithm framework (Pregel)
Lineage	Spark’s record of how data was derived; enables recomputation on failure
Immutability	Input data never modified; all outputs are derived transformations

MapReduce Execution Flow

HDFS Input Files
       │
       ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Mapper 1   │     │  Mapper 2   │     │  Mapper 3   │
│ (KV pairs)  │     │ (KV pairs)  │     │ (KV pairs)  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                    SHUFFLE PHASE
              (sort + partition by key hash)
                           │
       ┌───────────────────┴───────────────────┐
       │                                       │
┌──────▼──────┐                       ┌────────▼────────┐
│  Reducer 1  │                       │   Reducer 2     │
│(key A→E)    │                       │ (key F→Z)       │
└──────┬──────┘                       └────────┬────────┘
       │                                       │
       └───────────────────────────────────────┘
                           │
                    HDFS Output Files

Fault tolerance: each mapper/reducer writes to local disk;
if task fails, only that task re-runs from its input split

MapReduce vs Dataflow Engines

MapReduce:                              Spark/Flink (Dataflow):
──────────────────────────              ────────────────────────────────────
Disk after every stage                  Pipeline stages in memory
map → disk → sort → disk → reduce       op1 → op2 → op3 (in-memory pipelining)

Fixed 2-phase structure                 Arbitrary DAG of operators
(map + reduce only)                     (filter, join, sort, aggregate, etc.)

Each job independent                    Full job as single DAG
(complex workflows = many jobs)         (optimizer sees entire computation)

Re-reads disk between jobs              Lazy evaluation + caching
                                        (reuse shuffled data across operators)

Result: 5-10x slower for               Result: Much faster for
multi-step computations                 multi-step computations

Join Strategies Comparison

Sort-Merge Join:
  Both inputs sorted by join key → one-pass merge
  ✅ Works for any size inputs
  ❌ Requires sorting both (expensive if not already sorted)
  Use when: large-large join

Broadcast Hash Join (map-side):
  Small table → broadcast to every worker → hash table in RAM
  Large table → each record looked up in hash table
  ✅ No shuffle! Much faster
  ❌ Small table must fit in RAM (typically <100MB-10GB)
  Use when: large-small join

Partitioned Hash Join:
  Both inputs partitioned by join key
  Each partition joined independently in parallel
  ✅ Parallel; no full shuffle needed
  ❌ Both inputs must be partitioned same way
  Use when: both inputs large but organized by same key

Unix Philosophy → MapReduce → Spark

UNIX PIPES:
  cat log.txt | awk '{print $7}' | sort | uniq -c | sort -rn | head -5
  ↑              ↑                  ↑        ↑          ↑          ↑
  Source       Mapper             Sort    Count       Sort       Limit
  (immutable)  (transform)
  
MAPREDUCE:
  Read HDFS → Map (transform) → Shuffle (sort+group by key) → Reduce (aggregate)
  (immutable)    (parallel)       (distributed sort)           (per-key)

SPARK:
  val result = sc.textFile("log.txt")    // Source (immutable RDD)
    .map(line => line.split(" ")(6))      // Map (transform)
    .countByValue()                       // Count (aggregate)
    .toSeq.sortBy(-_._2).take(5)         // Sort + Limit

Same philosophy at each level: immutable input, derived output, composable transforms

Spark Lineage and Fault Tolerance

RDD Lineage Graph:
  rawLog (from HDFS)
      │ map(extractURL)
      ▼
  urls (derived RDD)
      │ filter(isValidURL)
      ▼
  validUrls (derived RDD)
      │ map(normalize)
      ▼
  normalizedUrls (derived RDD)

Worker node fails while computing normalizedUrls:
  1. Spark detects failure
  2. Traces lineage: normalizedUrls ← validUrls ← urls ← rawLog
  3. Re-reads rawLog from HDFS (immutable source)
  4. Re-applies transformations on surviving workers
  5. Only the lost partitions are recomputed

Key: No need for expensive checkpointing for most transformations
Exception: Long lineages or expensive operations → checkpoint to disk

BSP Graph Processing (Pregel)

Graph: A --1--> B --2--> C
               |
               1
               ↓
               D

PageRank Computation (simplified):
Initial: all nodes rank = 1/N

Superstep 1:
  Each node sends rank/out-degree to neighbors:
  A → B: 0.25        B → C: 0.167
  A → C: 0.25        B → D: 0.167
  
Superstep 2:
  Each node sums received ranks + dampening factor
  B_rank = damping * (A/out_A + ...) + (1-damping)/N
  
Repeat until convergence (change < epsilon)

Used for: PageRank, shortest paths, connected components,
          graph clustering, community detection

Key Trade-offs

Decision	Pro	Con	When to Use
MapReduce	Battle-tested, fault-tolerant	Slow (disk I/O), inflexible	Legacy Hadoop; very large fault domains
Spark	Fast in-memory, flexible	Memory pressure, complex	Most batch jobs today
Flink	Low-latency batch/stream unified	Complex state management	Streaming + batch unified
DuckDB	Very fast for single node	No distributed processing	<100GB datasets, local analytics
Broadcast join	No shuffle overhead	Small table must fit in RAM	Dimension table joins
Sort-merge join	Works for all sizes	Requires sorting	Large-large joins

Red Flags

❌ Using MapReduce for new projects (Spark is ~10x faster for most workloads)
❌ Large-large joins without shuffle optimization (will be very slow)
❌ Non-deterministic batch jobs (random seeds, unordered operations) — breaks re-run reproducibility
❌ Writing batch output to same location as input (no immutability; bad on failure)
❌ Not partitioning input data — results in full table scans for every query

Green Flags

✅ Immutable inputs; write outputs to new location each run
✅ Use broadcast hash join for large-small table joins
✅ Partition data by query predicates (date, user_id) to enable partition pruning
✅ Cache frequently reused RDDs/DataFrames in Spark
✅ Use Parquet/ORC columnar format for analytics (10x smaller, faster scans)

Modern Additions (2026)

Lakehouse (Delta Lake, Iceberg):
├─ ACID transactions over Parquet on S3
├─ Batch jobs write incrementally (not full rewrites)
└─ Time travel: query data as of any point in history

dbt (data build tool):
├─ SQL-based batch transformation pipelines
├─ Version-controlled, tested, documented SQL models
└─ Industry standard for analytics engineering in 2026

DuckDB:
├─ In-process column store; reads Parquet/Delta from S3
├─ No cluster needed for <100GB workloads
└─ Replaces Spark for many analytics tasks

Interview Response Templates

When Asked to Design a Batch Pipeline

“I’d use a dataflow engine like Spark or Flink rather than raw MapReduce — they’re much more efficient for multi-step computations since they pipeline stages in memory rather than writing to disk between each step. I’d treat input data as immutable and write outputs to new locations; this gives free fault tolerance (just re-run) and historical data preservation. For storage format, I’d use Parquet with appropriate partitioning to enable efficient downstream queries.”

When Asked About Join Strategies

“The choice depends on table sizes. If one table is small enough to fit in memory (typically under a few GB), I’d use a broadcast hash join — no shuffle needed, extremely fast. For two large tables, I’d use sort-merge join, ideally co-partitioned by the join key to minimize shuffle. If both tables are pre-bucketed by the same key, a bucketed join can join local partitions without shuffling at all.”

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch10-cheatsheet