Chapter 10 Flashcards - Batch Processing
Basic Concepts
What is batch processing and how does it differ from online (OLTP) processing?
?
-
Batch processing: Process a large amount of bounded (finite) data in bulk; produce a derived output
- Input is complete; processing starts and finishes
- Results used downstream (recommendations, search indexes, analytics)
- Latency: minutes to hours (not real-time)
-
OLTP (online processing): Handle user requests in real-time; process small amounts of data per request
- Low latency (ms); user waits for response
- Triggered by user actions
-
Key batch philosophy:
- Inputs are immutable (never modified)
- Outputs are derived (can be regenerated)
- Fault tolerance: re-run the job (input unchanged)
-
Historical parallel: Unix pipes = batch processing for single machine; MapReduce = Unix pipes for cluster
What are the Unix design principles that apply to MapReduce and Spark?
?
- One tool, one thing well: Each program does one operation; compose via pipes
- Uniform interface: stdin/stdout (text) allows any program to connect to any other
- Immutable inputs: Programs read from stdin; don’t modify it
- Composability: Any output can be piped as input to the next stage
How these translate:
- MapReduce: mapper reads records, emits KV; reducer reads KV, emits output
- Spark: chain of transformations (map, filter, groupBy) form a DAG
- Input HDFS/S3 files = immutable; output = new files
- Fault tolerance = re-run; input is unchanged
Key insight: MapReduce is “Unix pipes for clusters” — same philosophy, different execution environment
MapReduce
Describe the three phases of MapReduce and what happens in each.
?
Map phase:
- Mapper function called for each input record (from HDFS/S3 split)
- Emits zero or more (key, value) pairs
- Each mapper writes output to local disk, sorted by key
Shuffle phase (between map and reduce):
- MapReduce framework moves mapper outputs to reducers
- All (key, value) pairs with the same key go to the same reducer
- Data sorted by key; partitioned by
hash(key) % num_reducers - Most expensive phase (large data movement over network)
Reduce phase:
- Reducer function receives all values for one key in order
- Aggregates, filters, or transforms as needed
- Writes output back to HDFS/S3
Example (word count):
- Map:
(doc, "the quick brown fox")→(the, 1),(quick, 1),(brown, 1),(fox, 1) - Shuffle: all
(the, 1)from all mappers → Reducer 1; all(quick, 1)→ Reducer 2 - Reduce:
(the, [1,1,1,...])→(the, 47)
How does MapReduce achieve fault tolerance?
?
-
Principle: Inputs are immutable (on HDFS/S3); intermediate and final outputs are deterministic
-
Implementation:
- Mapper/reducer writes output to its local disk (not in-memory; survives task crash)
- If a mapper task fails: ONLY that mapper re-runs (other mappers’ output preserved)
- If a reducer fails: reads mapper outputs again (still on mapper local disks)
- Final output written to HDFS/S3 atomically (rename)
-
Granularity: Task-level restart, not job-level restart
-
Speculative execution: Run duplicate copies of slow tasks; use whichever finishes first
-
Cost of fault tolerance: High disk I/O (write intermediate results to disk after each phase)
-
MapReduce’s main disadvantage: This disk I/O overhead is why Spark is faster
Dataflow Engines
Why are Spark and Flink faster than MapReduce for multi-step computations?
?
MapReduce limitation: Every stage (map → reduce → map → reduce) writes to disk and reads back
- Multi-step job with 10 stages = 10 disk writes + 10 disk reads
Spark/Flink advantages:
- In-memory pipelining: Stages are connected directly; data flows from one to next in memory
- Only write to disk when: checkpointing, shuffle (some), or output
- Arbitrary DAG: Not limited to map-shuffle-reduce structure; any operator (filter, join, sort, window)
- Optimizer: Sees entire computation graph → can reorder, fuse, eliminate stages
- Result: 5-100x faster than MapReduce for typical multi-step analytical jobs
Spark’s fault tolerance (different from MapReduce):
- Lineage graph: tracks how each RDD was derived
- On failure: recompute only lost partitions by following lineage from last stable data
- Checkpoint: can persist lineage to disk for very long computations
What is Spark’s lazy evaluation model and why does it matter?
?
-
Lazy evaluation: Spark transformations (map, filter, join, groupBy) are NOT executed immediately
- They build up a logical plan (DAG of transformations)
- Only executed when an action is called (count(), collect(), save(), show())
-
Why it matters:
- Query optimization: Optimizer sees entire plan before execution; can reorder, push down filters
- No wasted computation: Intermediate results only computed when needed
- Pipeline fusion: Adjacent stages can be fused into a single pass over data
-
Transformations (lazy):
map(),filter(),flatMap(),groupBy(),join(),select() -
Actions (trigger execution):
count(),collect(),show(),save(),write() -
Example:
rdd.filter(x > 10).map(x * 2).take(5)— filter and map fused into one pass; stops after 5 results
Joins in Batch
When would you use a broadcast hash join vs a sort-merge join?
?
Broadcast hash join (map-side join):
- One table small enough to fit in RAM (broadcast to all workers)
- Join executed locally at each worker (no shuffle!)
- ✅ Very fast; no network data movement
- ✅ Works for large × small table joins
- ❌ Small table must fit in RAM (~100MB to few GB typical)
- Example: Join 1TB log table with 10MB country lookup table
Sort-merge join:
- Both tables sorted by join key; merged in one pass
- Requires shuffling one or both inputs by join key
- ✅ Works for any size inputs (including large × large)
- ❌ Requires sorting (expensive) if data not already sorted
- ✅ Efficient when data is pre-sorted or pre-partitioned
- Example: Join two 500GB tables by user_id
Rule of thumb: Use broadcast hash join whenever one side is “small”; fall back to sort-merge for large × large
Design Principles
What is “derived data” in the context of batch processing?
?
-
Derived data: Any data that was computed from an authoritative source (input data)
- Can be regenerated if lost or corrupted
- No unique state — source is the ground truth
-
Examples of derived data:
- Search indexes (derived from crawled pages)
- Recommendation models (derived from user interaction history)
- Analytics aggregates (derived from raw events)
- Materialized views (derived from base tables)
- Read caches (derived from write-side database)
-
Key property: If derived data is lost → just rerun the batch job from the immutable source
-
Why this matters: Simplifies fault tolerance; no need for complex recovery procedures
-
Contrast with source-of-truth data: The user’s writes, events, transactions — this is not derived; it must be preserved
What is the BSP (Bulk Synchronous Parallel) model and how is it used for graph processing?
?
-
BSP (Valiant 1990): Parallel computation in synchronous “supersteps”
-
Each superstep:
- Every vertex executes its function (e.g., PageRank calculation)
- Vertices send messages to neighbors
- Barrier: all vertices wait until all messages delivered
- Next superstep: vertices process received messages
-
Used by: Apache Giraph, Pregel (Google), GraphX (Spark), Gelly (Flink)
-
PageRank example:
- Init: all vertices get rank 1/N
- Each superstep: vertex sends
rank / out_degreeto each neighbor - Each vertex receives from all in-neighbors: new rank = damping * sum + (1-damping)/N
- Repeat until convergence
-
When to use: Graph algorithms (PageRank, shortest path, connected components, community detection)
-
Alternative: Process graphs in MapReduce (iterative, less efficient)
Modern Context (2026)
What is the role of dbt (data build tool) in modern batch data engineering?
?
-
dbt = data build tool; a transformation framework for data warehouses/lakehouses
-
What it does: Defines batch transformations as SQL SELECT statements (“models”)
- Each model = one table or view in the warehouse
- Models can depend on other models (dbt manages the DAG)
dbt runexecutes all models in dependency order
-
Key features:
- Version control: SQL models in git (reviewable, testable)
- Tests: Assert business logic (unique, not null, referential integrity)
- Documentation: Auto-generates data catalog from model definitions
- Incremental models: Only process new data (not full recompute each run)
-
2026 status: Industry standard for analytics engineering
- Works with: Snowflake, BigQuery, Redshift, Databricks, DuckDB
- Over 40,000 users (2024); ubiquitous in data teams
-
Replaces: Hand-written SQL scripts, Informatica ETL, complex Spark jobs for transformations
How does the Lakehouse architecture change batch processing workflows?
?
-
Problem with pure data warehouse: Expensive proprietary format; hard to use ML tools on warehouse data
-
Problem with pure data lake: Raw Parquet files; no ACID, no schema enforcement, slow queries
-
Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):
- Open format: Parquet files + transaction log on object storage (S3/GCS)
- ACID transactions: concurrent writers don’t corrupt data
- Schema enforcement and evolution
- Time travel: query data as of any point (audit, rollback)
- Streaming + batch unified (can read latest data or historical snapshots)
-
Batch workflow:
- Source (Kafka, databases) → raw zone (Parquet on S3)
- dbt or Spark → transforms raw → refined zone (Delta/Iceberg tables)
- BI tools (Tableau, Looker), DuckDB, Trino → query refined zone
-
Why it matters: Batch output is now queryable immediately by any SQL tool; no separate export step
Interview Scenarios
Design a batch pipeline to compute weekly active users (WAU) for a social media app.
?
Input: Raw event stream (Kafka) → landing zone (Parquet on S3, partitioned by date)
Output: WAU count per user cohort, refreshed daily
Batch pipeline:
- Extract: Each night, batch job reads yesterday’s events from S3 (date-partitioned)
- Transform (Spark or dbt):
SELECT DATE_TRUNC('week', event_date) as week, user_cohort, COUNT(DISTINCT user_id) as wau FROM events WHERE event_type = 'active' AND event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE GROUP BY week, user_cohort - Load: Write to Delta Lake table (or Snowflake/BigQuery); overwrite current week’s partition
Key decisions:
- Partitioning: Partition events by date → only read last 7 days (partition pruning)
- Idempotent: Re-running overwrites same output partition (no duplicate counts)
- Incrementally: Don’t reprocess all history — only add yesterday’s events
- Storage: Delta Lake for ACID writes + time travel (audit previous WAU calculations)
A data scientist wants to train a recommendation model on 6 months of user interactions (500GB). How do you set up the batch pipeline?
?
Architecture: Lakehouse → Feature extraction → Model training
Step 1: Data preparation (Spark on Databricks/EMR):
interactions = spark.read.format("delta").load("s3://data/interactions/")
.filter(col("date") >= "2025-10-01")
.filter(col("event_type").isin(["view", "purchase", "rate"]))
.select("user_id", "item_id", "event_type", "timestamp", "rating")Step 2: Feature engineering (Spark):
- User features: watch history, genre preferences, time-of-day patterns
- Item features: metadata, aggregate statistics, embeddings
- Join user + item features by user_id, item_id
Step 3: Train-test split:
- Temporal split: data before cutoff = train; after = test
- NO random split (would leak future to past)
Step 4: Model training (Spark MLlib or PyTorch distributed):
- Matrix factorization (ALS) for collaborative filtering: fits in Spark
- Neural models: need GPU cluster + PyTorch/TorchDistributed
Step 5: Serve predictions:
- Batch score all user-item pairs overnight
- Write top-N recommendations per user to Redis/DynamoDB
- Serve from cache at request time (batch-computed, low-latency serving)
Quick Facts
What is the shuffle phase in MapReduce and why is it the bottleneck?
?
-
Shuffle: The network data transfer between mappers and reducers
- All (key, value) pairs with key K must reach the same reducer
- Requires network transfer of (potentially all) mapper output
-
Why it’s the bottleneck:
- Network bandwidth is much slower than local disk I/O
- Shuffle involves all mappers sending to all reducers (N × M connections)
- Sort-by-key required before reduce can start
- Cannot start reducing until all mappers finish (blocking barrier)
-
Size of shuffle: Can be as large as the total input data size
- If input = 1TB, shuffle can be ~1TB of network traffic
-
Optimizations:
- Combiner (mini-reducer at mapper): Pre-aggregate locally before shuffle (reduces data sent)
- Compression: Compress shuffle data
- Broadcast join: Avoid shuffle entirely when one input is small
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13