Chapter 10 Cheat Sheet - Batch Processing
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Batch processing | Process bounded (finite) dataset in bulk; produce derived output |
| MapReduce | Map (extract KV pairs) + Shuffle (group by key) + Reduce (aggregate) |
| Shuffle | MapReduce phase that redistributes data by key across network |
| Dataflow engine | DAG-based batch execution (Spark/Flink); avoids unnecessary disk writes |
| Derived data | Any output that can be recomputed from immutable source inputs |
| Broadcast hash join | Load small table in memory on each worker; join without shuffle |
| Sort-merge join | Sort both inputs by key; merge-join in single pass |
| BSP model | Bulk Synchronous Parallel — iterative graph algorithm framework (Pregel) |
| Lineage | Spark’s record of how data was derived; enables recomputation on failure |
| Immutability | Input data never modified; all outputs are derived transformations |
MapReduce Execution Flow
HDFS Input Files
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Mapper 1 │ │ Mapper 2 │ │ Mapper 3 │
│ (KV pairs) │ │ (KV pairs) │ │ (KV pairs) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└───────────────────┴───────────────────┘
│
SHUFFLE PHASE
(sort + partition by key hash)
│
┌───────────────────┴───────────────────┐
│ │
┌──────▼──────┐ ┌────────▼────────┐
│ Reducer 1 │ │ Reducer 2 │
│(key A→E) │ │ (key F→Z) │
└──────┬──────┘ └────────┬────────┘
│ │
└───────────────────────────────────────┘
│
HDFS Output Files
Fault tolerance: each mapper/reducer writes to local disk;
if task fails, only that task re-runs from its input split
MapReduce vs Dataflow Engines
MapReduce: Spark/Flink (Dataflow):
────────────────────────── ────────────────────────────────────
Disk after every stage Pipeline stages in memory
map → disk → sort → disk → reduce op1 → op2 → op3 (in-memory pipelining)
Fixed 2-phase structure Arbitrary DAG of operators
(map + reduce only) (filter, join, sort, aggregate, etc.)
Each job independent Full job as single DAG
(complex workflows = many jobs) (optimizer sees entire computation)
Re-reads disk between jobs Lazy evaluation + caching
(reuse shuffled data across operators)
Result: 5-10x slower for Result: Much faster for
multi-step computations multi-step computations
Join Strategies Comparison
Sort-Merge Join:
Both inputs sorted by join key → one-pass merge
✅ Works for any size inputs
❌ Requires sorting both (expensive if not already sorted)
Use when: large-large join
Broadcast Hash Join (map-side):
Small table → broadcast to every worker → hash table in RAM
Large table → each record looked up in hash table
✅ No shuffle! Much faster
❌ Small table must fit in RAM (typically <100MB-10GB)
Use when: large-small join
Partitioned Hash Join:
Both inputs partitioned by join key
Each partition joined independently in parallel
✅ Parallel; no full shuffle needed
❌ Both inputs must be partitioned same way
Use when: both inputs large but organized by same key
Unix Philosophy → MapReduce → Spark
UNIX PIPES:
cat log.txt | awk '{print $7}' | sort | uniq -c | sort -rn | head -5
↑ ↑ ↑ ↑ ↑ ↑
Source Mapper Sort Count Sort Limit
(immutable) (transform)
MAPREDUCE:
Read HDFS → Map (transform) → Shuffle (sort+group by key) → Reduce (aggregate)
(immutable) (parallel) (distributed sort) (per-key)
SPARK:
val result = sc.textFile("log.txt") // Source (immutable RDD)
.map(line => line.split(" ")(6)) // Map (transform)
.countByValue() // Count (aggregate)
.toSeq.sortBy(-_._2).take(5) // Sort + Limit
Same philosophy at each level: immutable input, derived output, composable transforms
Spark Lineage and Fault Tolerance
RDD Lineage Graph:
rawLog (from HDFS)
│ map(extractURL)
▼
urls (derived RDD)
│ filter(isValidURL)
▼
validUrls (derived RDD)
│ map(normalize)
▼
normalizedUrls (derived RDD)
Worker node fails while computing normalizedUrls:
1. Spark detects failure
2. Traces lineage: normalizedUrls ← validUrls ← urls ← rawLog
3. Re-reads rawLog from HDFS (immutable source)
4. Re-applies transformations on surviving workers
5. Only the lost partitions are recomputed
Key: No need for expensive checkpointing for most transformations
Exception: Long lineages or expensive operations → checkpoint to disk
BSP Graph Processing (Pregel)
Graph: A --1--> B --2--> C
|
1
↓
D
PageRank Computation (simplified):
Initial: all nodes rank = 1/N
Superstep 1:
Each node sends rank/out-degree to neighbors:
A → B: 0.25 B → C: 0.167
A → C: 0.25 B → D: 0.167
Superstep 2:
Each node sums received ranks + dampening factor
B_rank = damping * (A/out_A + ...) + (1-damping)/N
Repeat until convergence (change < epsilon)
Used for: PageRank, shortest paths, connected components,
graph clustering, community detection
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| MapReduce | Battle-tested, fault-tolerant | Slow (disk I/O), inflexible | Legacy Hadoop; very large fault domains |
| Spark | Fast in-memory, flexible | Memory pressure, complex | Most batch jobs today |
| Flink | Low-latency batch/stream unified | Complex state management | Streaming + batch unified |
| DuckDB | Very fast for single node | No distributed processing | <100GB datasets, local analytics |
| Broadcast join | No shuffle overhead | Small table must fit in RAM | Dimension table joins |
| Sort-merge join | Works for all sizes | Requires sorting | Large-large joins |
Red Flags
❌ Using MapReduce for new projects (Spark is ~10x faster for most workloads)
❌ Large-large joins without shuffle optimization (will be very slow)
❌ Non-deterministic batch jobs (random seeds, unordered operations) — breaks re-run reproducibility
❌ Writing batch output to same location as input (no immutability; bad on failure)
❌ Not partitioning input data — results in full table scans for every query
Green Flags
✅ Immutable inputs; write outputs to new location each run
✅ Use broadcast hash join for large-small table joins
✅ Partition data by query predicates (date, user_id) to enable partition pruning
✅ Cache frequently reused RDDs/DataFrames in Spark
✅ Use Parquet/ORC columnar format for analytics (10x smaller, faster scans)
Modern Additions (2026)
Lakehouse (Delta Lake, Iceberg):
├─ ACID transactions over Parquet on S3
├─ Batch jobs write incrementally (not full rewrites)
└─ Time travel: query data as of any point in history
dbt (data build tool):
├─ SQL-based batch transformation pipelines
├─ Version-controlled, tested, documented SQL models
└─ Industry standard for analytics engineering in 2026
DuckDB:
├─ In-process column store; reads Parquet/Delta from S3
├─ No cluster needed for <100GB workloads
└─ Replaces Spark for many analytics tasks
Interview Response Templates
When Asked to Design a Batch Pipeline
“I’d use a dataflow engine like Spark or Flink rather than raw MapReduce — they’re much more efficient for multi-step computations since they pipeline stages in memory rather than writing to disk between each step. I’d treat input data as immutable and write outputs to new locations; this gives free fault tolerance (just re-run) and historical data preservation. For storage format, I’d use Parquet with appropriate partitioning to enable efficient downstream queries.”
When Asked About Join Strategies
“The choice depends on table sizes. If one table is small enough to fit in memory (typically under a few GB), I’d use a broadcast hash join — no shuffle needed, extremely fast. For two large tables, I’d use sort-merge join, ideally co-partitioned by the join key to minimize shuffle. If both tables are pre-bucketed by the same key, a bucketed join can join local partitions without shuffling at all.”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13