Chapter 10 Flashcards - Batch Processing

flashcards chapter-10 ddia


Basic Concepts

What is batch processing and how does it differ from online (OLTP) processing?
?

  • Batch processing: Process a large amount of bounded (finite) data in bulk; produce a derived output

    • Input is complete; processing starts and finishes
    • Results used downstream (recommendations, search indexes, analytics)
    • Latency: minutes to hours (not real-time)
  • OLTP (online processing): Handle user requests in real-time; process small amounts of data per request

    • Low latency (ms); user waits for response
    • Triggered by user actions
  • Key batch philosophy:

    • Inputs are immutable (never modified)
    • Outputs are derived (can be regenerated)
    • Fault tolerance: re-run the job (input unchanged)
  • Historical parallel: Unix pipes = batch processing for single machine; MapReduce = Unix pipes for cluster

What are the Unix design principles that apply to MapReduce and Spark?
?

  1. One tool, one thing well: Each program does one operation; compose via pipes
  2. Uniform interface: stdin/stdout (text) allows any program to connect to any other
  3. Immutable inputs: Programs read from stdin; don’t modify it
  4. Composability: Any output can be piped as input to the next stage

How these translate:

  • MapReduce: mapper reads records, emits KV; reducer reads KV, emits output
  • Spark: chain of transformations (map, filter, groupBy) form a DAG
  • Input HDFS/S3 files = immutable; output = new files
  • Fault tolerance = re-run; input is unchanged

Key insight: MapReduce is “Unix pipes for clusters” — same philosophy, different execution environment

MapReduce

Describe the three phases of MapReduce and what happens in each.
?
Map phase:

  • Mapper function called for each input record (from HDFS/S3 split)
  • Emits zero or more (key, value) pairs
  • Each mapper writes output to local disk, sorted by key

Shuffle phase (between map and reduce):

  • MapReduce framework moves mapper outputs to reducers
  • All (key, value) pairs with the same key go to the same reducer
  • Data sorted by key; partitioned by hash(key) % num_reducers
  • Most expensive phase (large data movement over network)

Reduce phase:

  • Reducer function receives all values for one key in order
  • Aggregates, filters, or transforms as needed
  • Writes output back to HDFS/S3

Example (word count):

  • Map: (doc, "the quick brown fox")(the, 1), (quick, 1), (brown, 1), (fox, 1)
  • Shuffle: all (the, 1) from all mappers → Reducer 1; all (quick, 1) → Reducer 2
  • Reduce: (the, [1,1,1,...])(the, 47)

How does MapReduce achieve fault tolerance?
?

  • Principle: Inputs are immutable (on HDFS/S3); intermediate and final outputs are deterministic

  • Implementation:

    1. Mapper/reducer writes output to its local disk (not in-memory; survives task crash)
    2. If a mapper task fails: ONLY that mapper re-runs (other mappers’ output preserved)
    3. If a reducer fails: reads mapper outputs again (still on mapper local disks)
    4. Final output written to HDFS/S3 atomically (rename)
  • Granularity: Task-level restart, not job-level restart

  • Speculative execution: Run duplicate copies of slow tasks; use whichever finishes first

  • Cost of fault tolerance: High disk I/O (write intermediate results to disk after each phase)

  • MapReduce’s main disadvantage: This disk I/O overhead is why Spark is faster

Dataflow Engines

Why are Spark and Flink faster than MapReduce for multi-step computations?
?
MapReduce limitation: Every stage (map → reduce → map → reduce) writes to disk and reads back

  • Multi-step job with 10 stages = 10 disk writes + 10 disk reads

Spark/Flink advantages:

  1. In-memory pipelining: Stages are connected directly; data flows from one to next in memory
    • Only write to disk when: checkpointing, shuffle (some), or output
  2. Arbitrary DAG: Not limited to map-shuffle-reduce structure; any operator (filter, join, sort, window)
  3. Optimizer: Sees entire computation graph → can reorder, fuse, eliminate stages
  4. Result: 5-100x faster than MapReduce for typical multi-step analytical jobs

Spark’s fault tolerance (different from MapReduce):

  • Lineage graph: tracks how each RDD was derived
  • On failure: recompute only lost partitions by following lineage from last stable data
  • Checkpoint: can persist lineage to disk for very long computations

What is Spark’s lazy evaluation model and why does it matter?
?

  • Lazy evaluation: Spark transformations (map, filter, join, groupBy) are NOT executed immediately

    • They build up a logical plan (DAG of transformations)
    • Only executed when an action is called (count(), collect(), save(), show())
  • Why it matters:

    1. Query optimization: Optimizer sees entire plan before execution; can reorder, push down filters
    2. No wasted computation: Intermediate results only computed when needed
    3. Pipeline fusion: Adjacent stages can be fused into a single pass over data
  • Transformations (lazy): map(), filter(), flatMap(), groupBy(), join(), select()

  • Actions (trigger execution): count(), collect(), show(), save(), write()

  • Example: rdd.filter(x > 10).map(x * 2).take(5) — filter and map fused into one pass; stops after 5 results

Joins in Batch

When would you use a broadcast hash join vs a sort-merge join?
?
Broadcast hash join (map-side join):

  • One table small enough to fit in RAM (broadcast to all workers)
  • Join executed locally at each worker (no shuffle!)
  • ✅ Very fast; no network data movement
  • ✅ Works for large × small table joins
  • ❌ Small table must fit in RAM (~100MB to few GB typical)
  • Example: Join 1TB log table with 10MB country lookup table

Sort-merge join:

  • Both tables sorted by join key; merged in one pass
  • Requires shuffling one or both inputs by join key
  • ✅ Works for any size inputs (including large × large)
  • ❌ Requires sorting (expensive) if data not already sorted
  • ✅ Efficient when data is pre-sorted or pre-partitioned
  • Example: Join two 500GB tables by user_id

Rule of thumb: Use broadcast hash join whenever one side is “small”; fall back to sort-merge for large × large

Design Principles

What is “derived data” in the context of batch processing?
?

  • Derived data: Any data that was computed from an authoritative source (input data)

    • Can be regenerated if lost or corrupted
    • No unique state — source is the ground truth
  • Examples of derived data:

    • Search indexes (derived from crawled pages)
    • Recommendation models (derived from user interaction history)
    • Analytics aggregates (derived from raw events)
    • Materialized views (derived from base tables)
    • Read caches (derived from write-side database)
  • Key property: If derived data is lost → just rerun the batch job from the immutable source

  • Why this matters: Simplifies fault tolerance; no need for complex recovery procedures

  • Contrast with source-of-truth data: The user’s writes, events, transactions — this is not derived; it must be preserved

What is the BSP (Bulk Synchronous Parallel) model and how is it used for graph processing?
?

  • BSP (Valiant 1990): Parallel computation in synchronous “supersteps”

  • Each superstep:

    1. Every vertex executes its function (e.g., PageRank calculation)
    2. Vertices send messages to neighbors
    3. Barrier: all vertices wait until all messages delivered
    4. Next superstep: vertices process received messages
  • Used by: Apache Giraph, Pregel (Google), GraphX (Spark), Gelly (Flink)

  • PageRank example:

    • Init: all vertices get rank 1/N
    • Each superstep: vertex sends rank / out_degree to each neighbor
    • Each vertex receives from all in-neighbors: new rank = damping * sum + (1-damping)/N
    • Repeat until convergence
  • When to use: Graph algorithms (PageRank, shortest path, connected components, community detection)

  • Alternative: Process graphs in MapReduce (iterative, less efficient)

Modern Context (2026)

What is the role of dbt (data build tool) in modern batch data engineering?
?

  • dbt = data build tool; a transformation framework for data warehouses/lakehouses

  • What it does: Defines batch transformations as SQL SELECT statements (“models”)

    • Each model = one table or view in the warehouse
    • Models can depend on other models (dbt manages the DAG)
    • dbt run executes all models in dependency order
  • Key features:

    • Version control: SQL models in git (reviewable, testable)
    • Tests: Assert business logic (unique, not null, referential integrity)
    • Documentation: Auto-generates data catalog from model definitions
    • Incremental models: Only process new data (not full recompute each run)
  • 2026 status: Industry standard for analytics engineering

    • Works with: Snowflake, BigQuery, Redshift, Databricks, DuckDB
    • Over 40,000 users (2024); ubiquitous in data teams
  • Replaces: Hand-written SQL scripts, Informatica ETL, complex Spark jobs for transformations

How does the Lakehouse architecture change batch processing workflows?
?

  • Problem with pure data warehouse: Expensive proprietary format; hard to use ML tools on warehouse data

  • Problem with pure data lake: Raw Parquet files; no ACID, no schema enforcement, slow queries

  • Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):

    • Open format: Parquet files + transaction log on object storage (S3/GCS)
    • ACID transactions: concurrent writers don’t corrupt data
    • Schema enforcement and evolution
    • Time travel: query data as of any point (audit, rollback)
    • Streaming + batch unified (can read latest data or historical snapshots)
  • Batch workflow:

    1. Source (Kafka, databases) → raw zone (Parquet on S3)
    2. dbt or Spark → transforms raw → refined zone (Delta/Iceberg tables)
    3. BI tools (Tableau, Looker), DuckDB, Trino → query refined zone
  • Why it matters: Batch output is now queryable immediately by any SQL tool; no separate export step

Interview Scenarios

Design a batch pipeline to compute weekly active users (WAU) for a social media app.
?
Input: Raw event stream (Kafka) → landing zone (Parquet on S3, partitioned by date)
Output: WAU count per user cohort, refreshed daily

Batch pipeline:

  1. Extract: Each night, batch job reads yesterday’s events from S3 (date-partitioned)
  2. Transform (Spark or dbt):
    SELECT
      DATE_TRUNC('week', event_date) as week,
      user_cohort,
      COUNT(DISTINCT user_id) as wau
    FROM events
    WHERE event_type = 'active'
      AND event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE
    GROUP BY week, user_cohort
  3. Load: Write to Delta Lake table (or Snowflake/BigQuery); overwrite current week’s partition

Key decisions:

  • Partitioning: Partition events by date → only read last 7 days (partition pruning)
  • Idempotent: Re-running overwrites same output partition (no duplicate counts)
  • Incrementally: Don’t reprocess all history — only add yesterday’s events
  • Storage: Delta Lake for ACID writes + time travel (audit previous WAU calculations)

A data scientist wants to train a recommendation model on 6 months of user interactions (500GB). How do you set up the batch pipeline?
?
Architecture: Lakehouse → Feature extraction → Model training

Step 1: Data preparation (Spark on Databricks/EMR):

interactions = spark.read.format("delta").load("s3://data/interactions/")
  .filter(col("date") >= "2025-10-01")
  .filter(col("event_type").isin(["view", "purchase", "rate"]))
  .select("user_id", "item_id", "event_type", "timestamp", "rating")

Step 2: Feature engineering (Spark):

  • User features: watch history, genre preferences, time-of-day patterns
  • Item features: metadata, aggregate statistics, embeddings
  • Join user + item features by user_id, item_id

Step 3: Train-test split:

  • Temporal split: data before cutoff = train; after = test
  • NO random split (would leak future to past)

Step 4: Model training (Spark MLlib or PyTorch distributed):

  • Matrix factorization (ALS) for collaborative filtering: fits in Spark
  • Neural models: need GPU cluster + PyTorch/TorchDistributed

Step 5: Serve predictions:

  • Batch score all user-item pairs overnight
  • Write top-N recommendations per user to Redis/DynamoDB
  • Serve from cache at request time (batch-computed, low-latency serving)

Quick Facts

What is the shuffle phase in MapReduce and why is it the bottleneck?
?

  • Shuffle: The network data transfer between mappers and reducers

    • All (key, value) pairs with key K must reach the same reducer
    • Requires network transfer of (potentially all) mapper output
  • Why it’s the bottleneck:

    • Network bandwidth is much slower than local disk I/O
    • Shuffle involves all mappers sending to all reducers (N × M connections)
    • Sort-by-key required before reduce can start
    • Cannot start reducing until all mappers finish (blocking barrier)
  • Size of shuffle: Can be as large as the total input data size

    • If input = 1TB, shuffle can be ~1TB of network traffic
  • Optimizations:

    • Combiner (mini-reducer at mapper): Pre-aggregate locally before shuffle (reduces data sent)
    • Compression: Compress shuffle data
    • Broadcast join: Avoid shuffle entirely when one input is small

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13