Chapter 10 Flashcards - Batch Processing

flashcards chapter-10 ddia

Basic Concepts

What is batch processing and how does it differ from online (OLTP) processing?
?

Batch processing: Process a large amount of bounded (finite) data in bulk; produce a derived output
- Input is complete; processing starts and finishes
- Results used downstream (recommendations, search indexes, analytics)
- Latency: minutes to hours (not real-time)
OLTP (online processing): Handle user requests in real-time; process small amounts of data per request
- Low latency (ms); user waits for response
- Triggered by user actions
Key batch philosophy:
- Inputs are immutable (never modified)
- Outputs are derived (can be regenerated)
- Fault tolerance: re-run the job (input unchanged)
Historical parallel: Unix pipes = batch processing for single machine; MapReduce = Unix pipes for cluster

What are the Unix design principles that apply to MapReduce and Spark?
?

One tool, one thing well: Each program does one operation; compose via pipes
Uniform interface: stdin/stdout (text) allows any program to connect to any other
Immutable inputs: Programs read from stdin; don’t modify it
Composability: Any output can be piped as input to the next stage

How these translate:

MapReduce: mapper reads records, emits KV; reducer reads KV, emits output
Spark: chain of transformations (map, filter, groupBy) form a DAG
Input HDFS/S3 files = immutable; output = new files
Fault tolerance = re-run; input is unchanged

Key insight: MapReduce is “Unix pipes for clusters” — same philosophy, different execution environment

MapReduce

Describe the three phases of MapReduce and what happens in each.
?
Map phase:

Mapper function called for each input record (from HDFS/S3 split)
Emits zero or more (key, value) pairs
Each mapper writes output to local disk, sorted by key

Shuffle phase (between map and reduce):

MapReduce framework moves mapper outputs to reducers
All (key, value) pairs with the same key go to the same reducer
Data sorted by key; partitioned by hash(key) % num_reducers
Most expensive phase (large data movement over network)

Reduce phase:

Reducer function receives all values for one key in order
Aggregates, filters, or transforms as needed
Writes output back to HDFS/S3

Example (word count):

Map: (doc, "the quick brown fox") → (the, 1), (quick, 1), (brown, 1), (fox, 1)
Shuffle: all (the, 1) from all mappers → Reducer 1; all (quick, 1) → Reducer 2
Reduce: (the, [1,1,1,...]) → (the, 47)

How does MapReduce achieve fault tolerance?
?

Principle: Inputs are immutable (on HDFS/S3); intermediate and final outputs are deterministic
Implementation:
1. Mapper/reducer writes output to its local disk (not in-memory; survives task crash)
2. If a mapper task fails: ONLY that mapper re-runs (other mappers’ output preserved)
3. If a reducer fails: reads mapper outputs again (still on mapper local disks)
4. Final output written to HDFS/S3 atomically (rename)
Granularity: Task-level restart, not job-level restart
Speculative execution: Run duplicate copies of slow tasks; use whichever finishes first
Cost of fault tolerance: High disk I/O (write intermediate results to disk after each phase)
MapReduce’s main disadvantage: This disk I/O overhead is why Spark is faster

Dataflow Engines

Why are Spark and Flink faster than MapReduce for multi-step computations?
?
MapReduce limitation: Every stage (map → reduce → map → reduce) writes to disk and reads back

Multi-step job with 10 stages = 10 disk writes + 10 disk reads

Spark/Flink advantages:

In-memory pipelining: Stages are connected directly; data flows from one to next in memory
- Only write to disk when: checkpointing, shuffle (some), or output
Arbitrary DAG: Not limited to map-shuffle-reduce structure; any operator (filter, join, sort, window)
Optimizer: Sees entire computation graph → can reorder, fuse, eliminate stages
Result: 5-100x faster than MapReduce for typical multi-step analytical jobs

Spark’s fault tolerance (different from MapReduce):

Lineage graph: tracks how each RDD was derived
On failure: recompute only lost partitions by following lineage from last stable data
Checkpoint: can persist lineage to disk for very long computations

What is Spark’s lazy evaluation model and why does it matter?
?

Lazy evaluation: Spark transformations (map, filter, join, groupBy) are NOT executed immediately
- They build up a logical plan (DAG of transformations)
- Only executed when an action is called (count(), collect(), save(), show())
Why it matters:
1. Query optimization: Optimizer sees entire plan before execution; can reorder, push down filters
2. No wasted computation: Intermediate results only computed when needed
3. Pipeline fusion: Adjacent stages can be fused into a single pass over data
Transformations (lazy): map(), filter(), flatMap(), groupBy(), join(), select()
Actions (trigger execution): count(), collect(), show(), save(), write()
Example: rdd.filter(x > 10).map(x * 2).take(5) — filter and map fused into one pass; stops after 5 results

Joins in Batch

When would you use a broadcast hash join vs a sort-merge join?
?
Broadcast hash join (map-side join):

One table small enough to fit in RAM (broadcast to all workers)
Join executed locally at each worker (no shuffle!)
✅ Very fast; no network data movement
✅ Works for large × small table joins
❌ Small table must fit in RAM (~100MB to few GB typical)
Example: Join 1TB log table with 10MB country lookup table

Sort-merge join:

Both tables sorted by join key; merged in one pass
Requires shuffling one or both inputs by join key
✅ Works for any size inputs (including large × large)
❌ Requires sorting (expensive) if data not already sorted
✅ Efficient when data is pre-sorted or pre-partitioned
Example: Join two 500GB tables by user_id

Rule of thumb: Use broadcast hash join whenever one side is “small”; fall back to sort-merge for large × large

Design Principles

What is “derived data” in the context of batch processing?
?

Derived data: Any data that was computed from an authoritative source (input data)
- Can be regenerated if lost or corrupted
- No unique state — source is the ground truth
Examples of derived data:
- Search indexes (derived from crawled pages)
- Recommendation models (derived from user interaction history)
- Analytics aggregates (derived from raw events)
- Materialized views (derived from base tables)
- Read caches (derived from write-side database)
Key property: If derived data is lost → just rerun the batch job from the immutable source
Why this matters: Simplifies fault tolerance; no need for complex recovery procedures
Contrast with source-of-truth data: The user’s writes, events, transactions — this is not derived; it must be preserved

What is the BSP (Bulk Synchronous Parallel) model and how is it used for graph processing?
?

BSP (Valiant 1990): Parallel computation in synchronous “supersteps”
Each superstep:
1. Every vertex executes its function (e.g., PageRank calculation)
2. Vertices send messages to neighbors
3. Barrier: all vertices wait until all messages delivered
4. Next superstep: vertices process received messages
Used by: Apache Giraph, Pregel (Google), GraphX (Spark), Gelly (Flink)
PageRank example:
- Init: all vertices get rank 1/N
- Each superstep: vertex sends rank / out_degree to each neighbor
- Each vertex receives from all in-neighbors: new rank = damping * sum + (1-damping)/N
- Repeat until convergence
When to use: Graph algorithms (PageRank, shortest path, connected components, community detection)
Alternative: Process graphs in MapReduce (iterative, less efficient)

Modern Context (2026)

What is the role of dbt (data build tool) in modern batch data engineering?
?

dbt = data build tool; a transformation framework for data warehouses/lakehouses
What it does: Defines batch transformations as SQL SELECT statements (“models”)
- Each model = one table or view in the warehouse
- Models can depend on other models (dbt manages the DAG)
- dbt run executes all models in dependency order
Key features:
- Version control: SQL models in git (reviewable, testable)
- Tests: Assert business logic (unique, not null, referential integrity)
- Documentation: Auto-generates data catalog from model definitions
- Incremental models: Only process new data (not full recompute each run)
2026 status: Industry standard for analytics engineering
- Works with: Snowflake, BigQuery, Redshift, Databricks, DuckDB
- Over 40,000 users (2024); ubiquitous in data teams
Replaces: Hand-written SQL scripts, Informatica ETL, complex Spark jobs for transformations

How does the Lakehouse architecture change batch processing workflows?
?

Problem with pure data warehouse: Expensive proprietary format; hard to use ML tools on warehouse data
Problem with pure data lake: Raw Parquet files; no ACID, no schema enforcement, slow queries
Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):
- Open format: Parquet files + transaction log on object storage (S3/GCS)
- ACID transactions: concurrent writers don’t corrupt data
- Schema enforcement and evolution
- Time travel: query data as of any point (audit, rollback)
- Streaming + batch unified (can read latest data or historical snapshots)
Batch workflow:
1. Source (Kafka, databases) → raw zone (Parquet on S3)
2. dbt or Spark → transforms raw → refined zone (Delta/Iceberg tables)
3. BI tools (Tableau, Looker), DuckDB, Trino → query refined zone
Why it matters: Batch output is now queryable immediately by any SQL tool; no separate export step

Interview Scenarios

Design a batch pipeline to compute weekly active users (WAU) for a social media app.
?
Input: Raw event stream (Kafka) → landing zone (Parquet on S3, partitioned by date)
Output: WAU count per user cohort, refreshed daily

Batch pipeline:

Extract: Each night, batch job reads yesterday’s events from S3 (date-partitioned)

Transform (Spark or dbt):

SELECT
  DATE_TRUNC('week', event_date) as week,
  user_cohort,
  COUNT(DISTINCT user_id) as wau
FROM events
WHERE event_type = 'active'
  AND event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE
GROUP BY week, user_cohort

Load: Write to Delta Lake table (or Snowflake/BigQuery); overwrite current week’s partition

Key decisions:

Partitioning: Partition events by date → only read last 7 days (partition pruning)
Idempotent: Re-running overwrites same output partition (no duplicate counts)
Incrementally: Don’t reprocess all history — only add yesterday’s events
Storage: Delta Lake for ACID writes + time travel (audit previous WAU calculations)

A data scientist wants to train a recommendation model on 6 months of user interactions (500GB). How do you set up the batch pipeline?
?
Architecture: Lakehouse → Feature extraction → Model training

Step 1: Data preparation (Spark on Databricks/EMR):

interactions = spark.read.format("delta").load("s3://data/interactions/")
  .filter(col("date") >= "2025-10-01")
  .filter(col("event_type").isin(["view", "purchase", "rate"]))
  .select("user_id", "item_id", "event_type", "timestamp", "rating")

Step 2: Feature engineering (Spark):

User features: watch history, genre preferences, time-of-day patterns
Item features: metadata, aggregate statistics, embeddings
Join user + item features by user_id, item_id

Step 3: Train-test split:

Temporal split: data before cutoff = train; after = test
NO random split (would leak future to past)

Step 4: Model training (Spark MLlib or PyTorch distributed):

Matrix factorization (ALS) for collaborative filtering: fits in Spark
Neural models: need GPU cluster + PyTorch/TorchDistributed

Step 5: Serve predictions:

Batch score all user-item pairs overnight
Write top-N recommendations per user to Redis/DynamoDB
Serve from cache at request time (batch-computed, low-latency serving)

Quick Facts

What is the shuffle phase in MapReduce and why is it the bottleneck?
?

Shuffle: The network data transfer between mappers and reducers
- All (key, value) pairs with key K must reach the same reducer
- Requires network transfer of (potentially all) mapper output
Why it’s the bottleneck:
- Network bandwidth is much slower than local disk I/O
- Shuffle involves all mappers sending to all reducers (N × M connections)
- Sort-by-key required before reduce can start
- Cannot start reducing until all mappers finish (blocking barrier)
Size of shuffle: Can be as large as the total input data size
- If input = 1TB, shuffle can be ~1TB of network traffic
Optimizations:
- Combiner (mini-reducer at mapper): Pre-aggregate locally before shuffle (reduces data sent)
- Compression: Compress shuffle data
- Broadcast join: Avoid shuffle entirely when one input is small

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch10-flashcards