Chapter 10: Batch Processing

Overview

Batch processing is the paradigm of processing large amounts of bounded (finite) data in bulk, producing derived outputs. Chapter 10 traces the evolution from Unix pipes to MapReduce to modern dataflow engines, revealing that they all share a common design philosophy: immutable inputs, derived outputs, declarative transformations. Understanding batch processing is foundational to understanding modern data systems—stream processing (Ch11) is essentially batch processing on unbounded data.

Core philosophy: Treat data as immutable inputs; derive all outputs by transformation. If the job fails, re-run it from scratch—the input is unchanged.

Key Concepts

Unix Tools Philosophy

Unix pipes as batch processing:

cat /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -r -n | head -5

This pipeline: reads log → extracts URL field → sorts → counts unique → sorts by count → shows top 5

Unix design principles (that also apply to MapReduce and Spark):

One tool does one thing well (vs monolithic programs)
Connect programs via pipes (uniform interface: stdin/stdout of text)
Write programs to handle text streams (composable)
Immutable inputs: Programs read from stdin; don’t modify it

The power of uniform interfaces: Because every Unix tool reads from stdin and writes to stdout (text), any tool can be composed with any other.

MapReduce as “Unix pipes for clusters”: MapReduce took the same immutable-input, derived-output philosophy and applied it to distributed computing across many machines.

MapReduce

Invented by: Google (Jeffrey Dean & Sanjay Ghemawat, 2004 paper)
Based on: Functional programming (map and fold/reduce operations)

Two phases:

Map phase:

Input: Records from HDFS (or S3) files
Mapper function: applied to each record; emits key-value pairs
Output: Sorted key-value pairs, partitioned by key hash (one per reducer)

Reduce phase:

Input: All values for one key (from all mappers), sorted by key
Reducer function: processes all values for each key; emits output records
Output: Written back to HDFS/S3

Shuffle (between map and reduce):

All mapper outputs sorted
Partitioned by key hash (all values for key X go to the same reducer)
Most expensive phase (large data movement over network)

Example: Word count

Mapper: for each word in document, emit (word, 1)
Shuffle: group all (word, 1) by word
Reducer: for each word, sum all 1s → emit (word, total_count)

MapReduce characteristics:

Fault tolerance: Write intermediate results to disk; re-run failed tasks from checkpoint
Scalability: Can run on thousands of nodes
Batch-only: Processes bounded (finite) datasets

MapReduce limitations:

Multiple MapReduce jobs create complex dataflow (job B depends on A, C on B and A, etc.)
Each MapReduce job writes to disk and reads back → high I/O overhead
SQL-like queries that should be simple require many MapReduce jobs
Pig Latin, Hive created as higher-level abstractions over MapReduce

Dataflow Engines (Spark, Flink, Tez)

Problem with MapReduce: Forced every computation into map → shuffle → reduce structure; intermediate outputs always written to disk.

Dataflow engines:

Represent the entire computation as a DAG (Directed Acyclic Graph) of operators
Each operator is an arbitrary function (not limited to map/reduce)
Intermediate data pipelined between operators in memory (not written to disk unless needed)
Much more efficient than MapReduce for multi-step computations

Apache Spark:

In-memory dataflow engine (resilient distributed dataset / DataFrame API)
Lazy evaluation: build up a DAG of transformations; execute when action needed
Fault tolerance: lineage-based (recompute lost data from lineage graph)
Spark SQL, MLlib, GraphX, Structured Streaming built on same engine

Apache Flink:

Designed for both batch and stream processing (treats batch as bounded stream)
Lower latency than Spark for streaming; better state management
SQL-first design (Table API) for batch; native streaming with exactly-once semantics

Key advantage of dataflow over MapReduce:

No forced disk writes between stages (pipelining)
More flexible operators (join, filter, group, sort, etc.)
DAG optimizer can reorganize computation order
5-10x faster than MapReduce for multi-step jobs

Joins in Batch Processing

Three types of batch joins:

1. Sort-merge join:

Sort records from both inputs by join key
Merge-join the two sorted streams (walk through both in parallel)
Efficient when both inputs need to be sorted anyway
Used in MapReduce: both mapper outputs sorted; reducer sees all values for same key

2. Broadcast hash join (map-side join):

One input is small enough to fit in memory
Load small input into in-memory hash table on each mapper
Each mapper looks up large input’s records against the hash table
No reducer needed! No shuffle!
Example: Join large log table with small lookup table

3. Partitioned hash join (bucketed join):

Both inputs partitioned by join key
Each partition of A joined only with corresponding partition of B
Enables parallel joins without full shuffle
Used in Spark, Hive bucketed joins

Graph Processing

Bulk Synchronous Parallel (BSP) model (Pregel, Apache Giraph, GraphX):

Computation in synchronous iterations (“supersteps”)
Each superstep: each vertex executes same function, sends messages to neighbors
Next superstep: each vertex receives messages from previous superstep
Repeat until convergence

PageRank (classic example):

Each vertex’s rank = sum of (rank/out-degree) for each incoming vertex
Multiple supersteps until convergence

When to use graph processing:

Graph algorithms (PageRank, shortest path, connected components)
When data structure is inherently a graph

Alternative: Process graphs in MapReduce via repeated iterations (less efficient, but works)

Key Design Principles

Immutability:

Input data is never modified
Derived data is recomputed when source changes
Fault tolerance: just re-run the job
Debugging: inspect intermediate results
Historical analysis: reprocess old data with new algorithms

Derived data:

Caches, indexes, materialized views, recommendations, search indexes — all derived data
Can be regenerated from source if corrupted or lost
Batch processing is the tool for generating derived data at scale

Determinism:

For re-runs to produce identical output, transformations must be deterministic
Non-determinism breaks reproducibility (same data, different results on re-run)
Example: Non-deterministic: random sampling without seed; sorting with inconsistent comparators

High-level languages over raw MapReduce:

Hive SQL, Pig Latin, Spark SQL, Flink SQL
Higher-level = more optimization opportunities (query planner)
Raw MapReduce code: hard to maintain, limited optimizer opportunities

Important Points

MapReduce is Unix pipes for clusters: Same philosophy; different scale.
Dataflow engines replaced MapReduce: Spark/Flink are 5-10x faster for multi-step jobs; MapReduce is now largely legacy.
Batch processing produces derived data: All outputs can be regenerated from immutable inputs.
Joins are expensive in batch: Shuffle is the bottleneck; map-side joins (broadcast) avoid it when one side is small.
The BSP model for graphs is important: Understand Pregel/PageRank for interviews.
Fault tolerance via recomputation: Write lineage (Spark) or write intermediate results (MapReduce); re-run failed tasks.

Examples & Case Studies

Google Web Crawl Analysis (MapReduce original use case)
- Map: process each web document, emit (word, 1) or (URL, outlinks)
- Reduce: count words across corpus; build link graph
- Used to build early Google Search index
Facebook Hive Data Warehouse
- PB-scale analytics over user interactions, ad clicks, etc.
- Hive compiles SQL to MapReduce/Tez/Spark
- Analysts query data without knowing MapReduce internals
Netflix Recommendations (Spark)
- Collaborative filtering (matrix factorization) via Spark MLlib
- Process billions of watch events to compute recommendations
- Run nightly batch job; write results to serving store
Wikipedia (Hadoop MapReduce)
- Process complete revision history (3TB compressed)
- Compute contributor statistics, article quality metrics, etc.

Questions

What are the Unix philosophy principles and how do they apply to batch processing?
How does MapReduce achieve fault tolerance?
Why are dataflow engines (Spark/Flink) faster than MapReduce for multi-step jobs?
What is the difference between sort-merge join, broadcast hash join, and partitioned hash join?
When would you use BSP (Pregel) vs MapReduce for graph processing?
What does it mean for batch processing to produce “derived data”?
How does Spark’s lineage-based fault tolerance work?
What problems does the shuffle phase in MapReduce solve and what are its costs?

Modern Context (2026)

Spark is the standard (2026):

Spark on Kubernetes (via Spark Operator or Apache Spark on K8s): standard deployment
Delta Lake as the data format (ACID, time travel, schema enforcement over Parquet)
Spark 4.x: improved performance with adaptive query execution (AQE), better GPU support
PySpark dominant for data engineering; Scala for performance-critical paths

Flink for streaming-first:

Apache Flink 1.18+: unified batch and stream processing
Better for real-time batch (micro-batch approach) vs Spark’s streaming
Table API + Flink SQL: SQL-based batch and stream processing

DuckDB for in-process analytics:

Single-node column-oriented database for analytical SQL
Reads Parquet, Delta Lake, Iceberg directly from S3
10-100x faster than Spark for small-medium datasets (<100GB)
Replaces Spark for many workloads that don’t need distributed processing

dbt (data build tool):

SQL-based transformation pipeline (batch) for data warehouses
Version-controlled SQL models → materialized as tables/views
Industry-standard for analytics engineering (2026)
Works with Snowflake, BigQuery, Databricks, DuckDB

Lakehouse as batch target:

Delta Lake, Apache Iceberg, Apache Hudi
Batch jobs write to Lakehouse (Parquet + transaction log on S3)
Downstream consumers (Spark, DuckDB, Trino) read from Lakehouse

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch10-batch-processing