Batch layer (accurate) + speed layer (recent) merged at query time
Kappa architecture
Single stream processor handles both historical reprocessing and real-time
Batch scoring
Pre-compute ML predictions for all users overnight; serve from cache
dbt model
SQL SELECT statement materialized as table/view; versioned, tested, DAG-aware
Data skew
One key has disproportionate records → one worker overwhelmed; fix with salting
Shuffle: The Core Cost
Before shuffle: After shuffle:
Worker 1: [A=1, B=3, A=2] Worker 1 (key A): [A=1, A=2, A=5]
Worker 2: [B=1, A=5, C=2] Worker 2 (key B): [B=3, B=1, B=9]
Worker 3: [B=9, C=7, A=3] Worker 3 (key C): [C=2, C=7]
↑ ↑ (+ A=3 from Worker 3)
Records on 3 workers Same-key records co-located
(randomly distributed) (after network transfer + sort)
Cost:
- All data may cross the network (N × M connections)
- Sort by key (O(n log n))
- Disk spill if memory exhausted
- Global barrier: next stage waits for ALL workers to finish shuffle
Join Strategy Decision Tree
Is one side small enough for RAM?
│
┌────────────┴────────────┐
YES NO
│ │
BROADCAST HASH JOIN Are both sides pre-bucketed
(no shuffle for large) by the same key?
│
┌──────────┴──────────┐
YES NO
│ │
PARTITIONED HASH JOIN SORT-MERGE JOIN
(join local partitions) (shuffle + sort both sides)
Approximate size threshold for broadcast: < 1GB–10GB (depends on cluster RAM)
Batch Frameworks Comparison
Hadoop MR Spark Flink (batch) dbt
─────────────────────────────────────────────────────────────────────
API Java map/reduce DataFrame/SQL Table API/SQL SQL only
Scale Petabytes 100GB–PB 100GB–PB Warehouse
Shuffle To disk always Memory+disk Pipelined N/A (SQL)
Lazy eval No Yes (RDD/DF) Yes Yes (SQL)
Fault tol. Disk checkpts Lineage Checkpoints N/A
Speed Slow (disk I/O) Fast Fast Fast (in DB)
Use 2026? Legacy only YES YES YES (ELT)
ETL vs ELT
ETL (Traditional):
[Source DB] → [Informatica/Spark] → transform → [Warehouse star schema]
↑ transform here (arrives clean)
Problem: Transform logic lives outside warehouse; hard to debug; fixed schema
ELT (Modern):
[Source DB] → [Fivetran/Airbyte] → [Warehouse raw layer] → [dbt SQL] → [Refined]
↑ load raw ↑ transform here
Benefit: Raw data preserved; re-transform anytime; SQL in warehouse; dbt tests
ML Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ BATCH ML PIPELINE │
│ │
│ Raw Events Feature Engineering Model Training │
│ (S3/Lakehouse) ──► (Spark / dbt) ──► (Spark MLlib / │
│ - encode categories PyTorch DDP / │
│ - aggregate history XGBoost) │
│ - temporal features │
│ │ │
│ Feature Store │
│ (Feast / Tecton) │
│ │ │
│ ┌──────────┴──────────┐ │
│ Batch Scoring Online Inference │
│ (Spark, nightly) (TorchServe, real-time) │
│ → Redis/DynamoDB → Model server │
│ (pre-computed) (on-demand) │
└─────────────────────────────────────────────────────────────────────┘
CRITICAL: Always use TEMPORAL split (not random):
Train: events before cutoff date
Test: events after cutoff date
Never allow future → past leakage!
Lambda vs Kappa Architecture
LAMBDA:
Raw ──┬──► Batch layer (Spark, nightly) ──┐
│ accurate, complete, hours lag ├──► Serving layer
└──► Speed layer (Flink, seconds lag) ──┘ (merges both views)
+ Accurate historical + Low-latency recent
- Two codebases; two sets of bugs; complex merge logic
KAPPA:
Raw ──► Stream processor (Flink, Kafka Streams) ──► Derived views
(handles real-time AND historical replay)
+ Single codebase; simpler operations
- Reprocessing can be slow; log storage expensive
2026 TREND: Streaming Lakehouse replaces both:
Stream → Iceberg tables on S3 (ACID, queryable by BI tools directly)
DataFrame Operation Types
NARROW (no shuffle — each partition processed independently):
select, filter, withColumn, map, flatMap, union, dropDuplicates*
WIDE (require shuffle — data moves between workers):
groupBy, join, distinct, repartition, coalesce, sort/orderBy
* dropDuplicates without repartition can be narrow (per-partition dedup)
ACTION (trigger execution — lazy plan runs):
count(), show(), collect(), write(), toPandas(), save()
Skew Handling (Salting)
Problem: user_id=BIGSTAR → 10M records → Worker 7 gets all of them
(other 99 workers idle)
Solution (salting):
Step 1: Add random salt 0-9 to skewed key
(user_id=BIGSTAR, salt=3) → worker 13
(user_id=BIGSTAR, salt=7) → worker 47
Step 2: Partial aggregate within each salt group
Worker 13: count(BIGSTAR_3) = 1.2M
Worker 47: count(BIGSTAR_7) = 0.8M
Step 3: Final aggregate across salt groups
SUM(counts for BIGSTAR) = 10M
dbt Incremental Model Pattern
-- incremental model: only process NEW data each run{{ config(materialized='incremental', unique_key='event_id') }}SELECT event_id, user_id, event_type, occurred_atFROM {{ ref('raw_events') }}{% if is_incremental() %} -- Only process events since last run WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }}){% endif %}
Key Numbers (Rule of Thumb, 2026)
Metric
Value
Broadcast join threshold (Spark default)
10MB (tune to 1–10GB)
DuckDB practical limit
~500GB single machine
Spark partition target size
~128MB per partition
dbt incremental run (vs full refresh)
10–100x faster
Shuffle overhead vs broadcast
5–20x slower for same join
MapReduce vs Spark (multi-step jobs)
Spark 5–10x faster
Red Flags
❌ Using MapReduce for new projects (Spark/Flink are 5-10x faster)
❌ Random train/test split for time-series ML (leaks future → past)
❌ .toPandas() on a large Spark DataFrame (brings PB to driver node)
❌ Large × large join without checking for skew (straggler workers)
❌ Lambda architecture for new systems (operational complexity)
❌ ETL that discards raw data (can't reprocess with new logic)
❌ Non-deterministic batch jobs (random seeds, unordered ops) — breaks reruns
Green Flags
✅ Broadcast join for large × small (no shuffle overhead)
✅ Store raw data in immutable Lakehouse; transform with dbt
✅ Temporal split for ML feature engineering
✅ Salt skewed keys before groupBy/join
✅ Partition Parquet/Iceberg by query predicates (date, region)
✅ dbt incremental models for daily batch jobs (skip unchanged data)
✅ Cache frequently reused DataFrames in Spark (spark.catalog.cacheTable)
✅ Columnar format (Parquet/ORC) for analytics — 10x smaller, faster scans