Chapter 10: Batch Processing
Overview
Batch processing is the paradigm of processing large amounts of bounded (finite) data in bulk, producing derived outputs. Chapter 10 traces the evolution from Unix pipes to MapReduce to modern dataflow engines, revealing that they all share a common design philosophy: immutable inputs, derived outputs, declarative transformations. Understanding batch processing is foundational to understanding modern data systems—stream processing (Ch11) is essentially batch processing on unbounded data.
Core philosophy: Treat data as immutable inputs; derive all outputs by transformation. If the job fails, re-run it from scratch—the input is unchanged.
Key Concepts
Unix Tools Philosophy
Unix pipes as batch processing:
cat /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -r -n | head -5This pipeline: reads log → extracts URL field → sorts → counts unique → sorts by count → shows top 5
Unix design principles (that also apply to MapReduce and Spark):
- One tool does one thing well (vs monolithic programs)
- Connect programs via pipes (uniform interface: stdin/stdout of text)
- Write programs to handle text streams (composable)
- Immutable inputs: Programs read from stdin; don’t modify it
The power of uniform interfaces: Because every Unix tool reads from stdin and writes to stdout (text), any tool can be composed with any other.
MapReduce as “Unix pipes for clusters”: MapReduce took the same immutable-input, derived-output philosophy and applied it to distributed computing across many machines.
MapReduce
Invented by: Google (Jeffrey Dean & Sanjay Ghemawat, 2004 paper)
Based on: Functional programming (map and fold/reduce operations)
Two phases:
Map phase:
- Input: Records from HDFS (or S3) files
- Mapper function: applied to each record; emits key-value pairs
- Output: Sorted key-value pairs, partitioned by key hash (one per reducer)
Reduce phase:
- Input: All values for one key (from all mappers), sorted by key
- Reducer function: processes all values for each key; emits output records
- Output: Written back to HDFS/S3
Shuffle (between map and reduce):
- All mapper outputs sorted
- Partitioned by key hash (all values for key X go to the same reducer)
- Most expensive phase (large data movement over network)
Example: Word count
- Mapper: for each word in document, emit (word, 1)
- Shuffle: group all (word, 1) by word
- Reducer: for each word, sum all 1s → emit (word, total_count)
MapReduce characteristics:
- Fault tolerance: Write intermediate results to disk; re-run failed tasks from checkpoint
- Scalability: Can run on thousands of nodes
- Batch-only: Processes bounded (finite) datasets
MapReduce limitations:
- Multiple MapReduce jobs create complex dataflow (job B depends on A, C on B and A, etc.)
- Each MapReduce job writes to disk and reads back → high I/O overhead
- SQL-like queries that should be simple require many MapReduce jobs
- Pig Latin, Hive created as higher-level abstractions over MapReduce
Dataflow Engines (Spark, Flink, Tez)
Problem with MapReduce: Forced every computation into map → shuffle → reduce structure; intermediate outputs always written to disk.
Dataflow engines:
- Represent the entire computation as a DAG (Directed Acyclic Graph) of operators
- Each operator is an arbitrary function (not limited to map/reduce)
- Intermediate data pipelined between operators in memory (not written to disk unless needed)
- Much more efficient than MapReduce for multi-step computations
Apache Spark:
- In-memory dataflow engine (resilient distributed dataset / DataFrame API)
- Lazy evaluation: build up a DAG of transformations; execute when action needed
- Fault tolerance: lineage-based (recompute lost data from lineage graph)
- Spark SQL, MLlib, GraphX, Structured Streaming built on same engine
Apache Flink:
- Designed for both batch and stream processing (treats batch as bounded stream)
- Lower latency than Spark for streaming; better state management
- SQL-first design (Table API) for batch; native streaming with exactly-once semantics
Key advantage of dataflow over MapReduce:
- No forced disk writes between stages (pipelining)
- More flexible operators (join, filter, group, sort, etc.)
- DAG optimizer can reorganize computation order
- 5-10x faster than MapReduce for multi-step jobs
Joins in Batch Processing
Three types of batch joins:
1. Sort-merge join:
- Sort records from both inputs by join key
- Merge-join the two sorted streams (walk through both in parallel)
- Efficient when both inputs need to be sorted anyway
- Used in MapReduce: both mapper outputs sorted; reducer sees all values for same key
2. Broadcast hash join (map-side join):
- One input is small enough to fit in memory
- Load small input into in-memory hash table on each mapper
- Each mapper looks up large input’s records against the hash table
- No reducer needed! No shuffle!
- Example: Join large log table with small lookup table
3. Partitioned hash join (bucketed join):
- Both inputs partitioned by join key
- Each partition of A joined only with corresponding partition of B
- Enables parallel joins without full shuffle
- Used in Spark, Hive bucketed joins
Graph Processing
Bulk Synchronous Parallel (BSP) model (Pregel, Apache Giraph, GraphX):
- Computation in synchronous iterations (“supersteps”)
- Each superstep: each vertex executes same function, sends messages to neighbors
- Next superstep: each vertex receives messages from previous superstep
- Repeat until convergence
PageRank (classic example):
- Each vertex’s rank = sum of (rank/out-degree) for each incoming vertex
- Multiple supersteps until convergence
When to use graph processing:
- Graph algorithms (PageRank, shortest path, connected components)
- When data structure is inherently a graph
Alternative: Process graphs in MapReduce via repeated iterations (less efficient, but works)
Key Design Principles
Immutability:
- Input data is never modified
- Derived data is recomputed when source changes
- Fault tolerance: just re-run the job
- Debugging: inspect intermediate results
- Historical analysis: reprocess old data with new algorithms
Derived data:
- Caches, indexes, materialized views, recommendations, search indexes — all derived data
- Can be regenerated from source if corrupted or lost
- Batch processing is the tool for generating derived data at scale
Determinism:
- For re-runs to produce identical output, transformations must be deterministic
- Non-determinism breaks reproducibility (same data, different results on re-run)
- Example: Non-deterministic: random sampling without seed; sorting with inconsistent comparators
High-level languages over raw MapReduce:
- Hive SQL, Pig Latin, Spark SQL, Flink SQL
- Higher-level = more optimization opportunities (query planner)
- Raw MapReduce code: hard to maintain, limited optimizer opportunities
Important Points
- MapReduce is Unix pipes for clusters: Same philosophy; different scale.
- Dataflow engines replaced MapReduce: Spark/Flink are 5-10x faster for multi-step jobs; MapReduce is now largely legacy.
- Batch processing produces derived data: All outputs can be regenerated from immutable inputs.
- Joins are expensive in batch: Shuffle is the bottleneck; map-side joins (broadcast) avoid it when one side is small.
- The BSP model for graphs is important: Understand Pregel/PageRank for interviews.
- Fault tolerance via recomputation: Write lineage (Spark) or write intermediate results (MapReduce); re-run failed tasks.
Examples & Case Studies
-
Google Web Crawl Analysis (MapReduce original use case)
- Map: process each web document, emit (word, 1) or (URL, outlinks)
- Reduce: count words across corpus; build link graph
- Used to build early Google Search index
-
Facebook Hive Data Warehouse
- PB-scale analytics over user interactions, ad clicks, etc.
- Hive compiles SQL to MapReduce/Tez/Spark
- Analysts query data without knowing MapReduce internals
-
Netflix Recommendations (Spark)
- Collaborative filtering (matrix factorization) via Spark MLlib
- Process billions of watch events to compute recommendations
- Run nightly batch job; write results to serving store
-
Wikipedia (Hadoop MapReduce)
- Process complete revision history (3TB compressed)
- Compute contributor statistics, article quality metrics, etc.
Questions
- What are the Unix philosophy principles and how do they apply to batch processing?
- How does MapReduce achieve fault tolerance?
- Why are dataflow engines (Spark/Flink) faster than MapReduce for multi-step jobs?
- What is the difference between sort-merge join, broadcast hash join, and partitioned hash join?
- When would you use BSP (Pregel) vs MapReduce for graph processing?
- What does it mean for batch processing to produce “derived data”?
- How does Spark’s lineage-based fault tolerance work?
- What problems does the shuffle phase in MapReduce solve and what are its costs?
Modern Context (2026)
Spark is the standard (2026):
- Spark on Kubernetes (via Spark Operator or Apache Spark on K8s): standard deployment
- Delta Lake as the data format (ACID, time travel, schema enforcement over Parquet)
- Spark 4.x: improved performance with adaptive query execution (AQE), better GPU support
- PySpark dominant for data engineering; Scala for performance-critical paths
Flink for streaming-first:
- Apache Flink 1.18+: unified batch and stream processing
- Better for real-time batch (micro-batch approach) vs Spark’s streaming
- Table API + Flink SQL: SQL-based batch and stream processing
DuckDB for in-process analytics:
- Single-node column-oriented database for analytical SQL
- Reads Parquet, Delta Lake, Iceberg directly from S3
- 10-100x faster than Spark for small-medium datasets (<100GB)
- Replaces Spark for many workloads that don’t need distributed processing
dbt (data build tool):
- SQL-based transformation pipeline (batch) for data warehouses
- Version-controlled SQL models → materialized as tables/views
- Industry-standard for analytics engineering (2026)
- Works with Snowflake, BigQuery, Databricks, DuckDB
Lakehouse as batch target:
- Delta Lake, Apache Iceberg, Apache Hudi
- Batch jobs write to Lakehouse (Parquet + transaction log on S3)
- Downstream consumers (Spark, DuckDB, Trino) read from Lakehouse
Status: Notes complete
Last Updated: 2026-04-13