Chapter 3 Cheat Sheet - Storage and Retrieval
One-Line Summaries
| Concept | One-Liner |
|---|---|
| LSM-Tree | Append writes to memtable → flush to SSTables → merge via compaction |
| B-Tree | Balanced tree of fixed-size pages; update in-place; standard for OLTP |
| SSTable | Sorted String Table — sorted key-value file segment used in LSM-Trees |
| Write-Ahead Log (WAL) | Log changes before applying to B-tree pages; enables crash recovery |
| Bloom filter | Probabilistic structure — tells you if a key is definitely NOT in an SSTable |
| OLTP | Transactional workload: small, fast reads/writes by primary key |
| OLAP | Analytical workload: aggregate queries scanning millions of rows |
| Column store | Data stored column-by-column — optimal for analytics (read few columns) |
| Data cube | Pre-aggregated OLAP cube across all combinations of dimensions |
| Compaction | Background process merging SSTables, removing old versions of keys |
Storage Engine Decision Tree
What's your workload?
│
├─ OLTP (low-latency, small reads/writes)?
│ ├─ Write-heavy (append-intensive)?
│ │ └─ → LSM-Tree (Cassandra, RocksDB, LevelDB)
│ └─ Read-heavy / mixed / need strong consistency?
│ └─ → B-Tree (PostgreSQL, MySQL/InnoDB)
│
└─ OLAP (analytics, aggregations, large scans)?
├─ Batch/historical analytics?
│ └─ → Column store (Snowflake, BigQuery, Redshift)
└─ Real-time analytics on live data?
└─ → HTAP (TiDB, SingleStore) or Lakehouse (Delta Lake)
LSM-Tree Write and Read Paths
WRITE PATH:
Write → Memtable (in-memory sorted tree)
→ (when full) Flush to L0 SSTable on disk
→ Background: Compact L0→L1→L2 (merge sorted SSTables)
READ PATH:
Read key K → Check Memtable
→ Check Bloom filter for each SSTable level
→ If might exist: binary search in SSTable
→ Return most recent value found
B-Tree Structure
Root Page
├── [keys: 1-100] → Child Page A
│ ├── [1–33] → Leaf (row data)
│ ├── [34–66] → Leaf (row data)
│ └── [67–100] → Leaf (row data)
├── [keys: 101-200] → Child Page B
└── ...
Write:
1. Find leaf page
2. Update value in-place (or split if full)
3. Write change to WAL first (crash safety)
4. Write updated page to disk
LSM-Tree vs B-Tree
| Dimension | B-Tree | LSM-Tree |
|---|---|---|
| Write throughput | Lower (random I/O) | Higher (sequential I/O) |
| Read throughput | Higher (one place) | Lower (check multiple levels) |
| Space efficiency | Lower (fragmentation) | Higher (compression) |
| Compaction impact | None | Can saturate disk I/O |
| Crash recovery | WAL needed | Memtable WAL + SSTable immutability |
| Transaction support | Easier | Harder |
| Best for | Mixed OLTP, strong reads | Write-heavy, append workloads |
OLTP vs OLAP
OLTP OLAP
───────────────────────── ───────────────────────
Small records, by key Aggregate over many records
Read < 100 rows Read millions of rows
Random access Full column scans
Low latency (ms) High throughput (minutes OK)
Updated continuously Batch loaded (ETL)
MySQL, PostgreSQL Snowflake, BigQuery, Redshift
GB to TB TB to PB
Indexed by primary key Columnar, partitioned
Star Schema
┌─────────────────┐
│ dim_date │
│ date_key (PK) │
│ year, month... │
└────────┬────────┘
│
┌─────────────┐ ┌─────────┴────────────┐ ┌──────────────────┐
│ dim_product │ │ fact_sales │ │ dim_customer │
│ product_key ├────┤ date_key (FK) ├────┤ customer_key │
│ name, brand │ │ product_key (FK) │ │ name, city... │
└─────────────┘ │ customer_key (FK) │ └──────────────────┘
│ quantity, price │
└──────────────────────┘
Each row in fact table = one event/transaction
Queries join fact → dimension tables
Column Store vs Row Store
Row Store (PostgreSQL):
Row 1: [user_id=1, name="Alice", email="a@b.com", age=30, ...]
Row 2: [user_id=2, name="Bob", email="b@b.com", age=25, ...]
─ Good: load one row fast
─ Bad: must load all columns to get one
Column Store (Snowflake):
Col user_id: [1, 2, 3, 4, ...]
Col name: ["Alice", "Bob", "Charlie", ...]
Col age: [30, 25, 35, ...]
─ Good: read only columns needed, compress same-type data
─ Bad: write one row = write to every column file
Column Compression
Original column: [active, active, active, inactive, active, inactive, inactive]
Bitmap encoding:
active: [1, 1, 1, 0, 1, 0, 0]
inactive: [0, 0, 0, 1, 0, 1, 1]
Run-length encoding:
active×3, inactive×1, active×1, inactive×2
Query "WHERE status = active AND region = west":
→ Bitwise AND on compressed bitmaps (very fast)
Bloom Filter Mental Model
"Is key K in SSTable S?"
Bloom filter says: Meaning:
───────────────────────── ──────────────────────────────────────
DEFINITELY NOT in S → Skip SSTable entirely (big win!)
MAYBE in S → Check SSTable (might be a false positive)
No false negatives: if key exists, bloom filter always says "MAYBE"
False positives possible: bloom filter may say "MAYBE" for missing keys
Configurable FP rate: ~1% typical (trade space for fewer disk reads)
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| LSM-Tree | High write throughput | Compaction I/O, slower reads | Write-heavy workloads |
| B-Tree | Fast reads, simple | Lower write throughput | Read-heavy, mixed OLTP |
| In-memory DB | Fastest possible | Limited by RAM size | Hot datasets, caching |
| Column store | Fast analytics | Slow individual row writes | OLAP, BI |
| Row store | Fast row access | Slow full-column scans | OLTP, transactional |
| Data cube | Instant aggregations | Pre-computed, inflexible | Known aggregate queries |
Red Flags
❌ Running analytics directly on OLTP production database (kills performance)
❌ No compaction tuning on LSM-tree under heavy write load
❌ Forgetting Bloom filters → LSM-tree reads check every level
❌ Using B-tree for append-only log storage (write amplification)
❌ Missing indexes on frequently queried columns
Green Flags
✅ Use write-ahead log for crash recovery
✅ Bloom filters on all LSM-tree SSTables
✅ Separate OLTP database from data warehouse with ETL
✅ Column store for any analytical workload > 1M rows
✅ Measure actual I/O patterns before choosing storage engine
Modern Additions (2026)
NVMe SSD Impact:
├─ Random I/O ~10x faster (B-tree penalty reduced)
├─ LSM-tree sequential write advantage less dominant
└─ WiscKey: separate keys from values to reduce compaction overhead
Lakehouse (Delta Lake, Iceberg, Hudi):
├─ ACID transactions over Parquet on S3
├─ Time travel queries (version history)
└─ Merge OLAP performance with data lake flexibility
HTAP (TiDB, SingleStore):
├─ Row store (OLTP) + column store (OLAP) in one DB
├─ Real-time analytics without ETL
└─ Raft-based replication for consistency
Interview Response Templates
When Asked About Storage Engine Choice
“I’d start by understanding the access pattern. Write-heavy workloads benefit from LSM-trees (like RocksDB or Cassandra) because sequential writes are much faster than random writes. Read-heavy or mixed transactional workloads often do better with B-trees (PostgreSQL/InnoDB) for their predictable read performance. For analytics, I’d separate the concern entirely and use a columnar warehouse.”
When Asked About Slow Read Performance in LSM-tree
“LSM-tree reads can be slow because they must check multiple SSTable levels. The key mitigations are: first, Bloom filters to skip SSTables that don’t contain the key; second, compaction to reduce the number of levels; third, a good memtable that absorbs recent writes. If reads are consistently slow, it often means compaction is falling behind writes.”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13