Chapter 3 Cheat Sheet - Storage and Retrieval

One-Line Summaries

Concept	One-Liner
LSM-Tree	Append writes to memtable → flush to SSTables → merge via compaction
B-Tree	Balanced tree of fixed-size pages; update in-place; standard for OLTP
SSTable	Sorted String Table — sorted key-value file segment used in LSM-Trees
Write-Ahead Log (WAL)	Log changes before applying to B-tree pages; enables crash recovery
Bloom filter	Probabilistic structure — tells you if a key is definitely NOT in an SSTable
OLTP	Transactional workload: small, fast reads/writes by primary key
OLAP	Analytical workload: aggregate queries scanning millions of rows
Column store	Data stored column-by-column — optimal for analytics (read few columns)
Data cube	Pre-aggregated OLAP cube across all combinations of dimensions
Compaction	Background process merging SSTables, removing old versions of keys

Storage Engine Decision Tree

What's your workload?
│
├─ OLTP (low-latency, small reads/writes)?
│  ├─ Write-heavy (append-intensive)?
│  │  └─ → LSM-Tree (Cassandra, RocksDB, LevelDB)
│  └─ Read-heavy / mixed / need strong consistency?
│     └─ → B-Tree (PostgreSQL, MySQL/InnoDB)
│
└─ OLAP (analytics, aggregations, large scans)?
   ├─ Batch/historical analytics?
   │  └─ → Column store (Snowflake, BigQuery, Redshift)
   └─ Real-time analytics on live data?
      └─ → HTAP (TiDB, SingleStore) or Lakehouse (Delta Lake)

LSM-Tree Write and Read Paths

WRITE PATH:
  Write → Memtable (in-memory sorted tree)
       → (when full) Flush to L0 SSTable on disk
       → Background: Compact L0→L1→L2 (merge sorted SSTables)

READ PATH:
  Read key K → Check Memtable
            → Check Bloom filter for each SSTable level
            → If might exist: binary search in SSTable
            → Return most recent value found

B-Tree Structure

Root Page
├── [keys: 1-100]  → Child Page A
│   ├── [1–33] → Leaf (row data)
│   ├── [34–66] → Leaf (row data)
│   └── [67–100] → Leaf (row data)
├── [keys: 101-200] → Child Page B
└── ...

Write:
  1. Find leaf page
  2. Update value in-place (or split if full)
  3. Write change to WAL first (crash safety)
  4. Write updated page to disk

LSM-Tree vs B-Tree

Dimension	B-Tree	LSM-Tree
Write throughput	Lower (random I/O)	Higher (sequential I/O)
Read throughput	Higher (one place)	Lower (check multiple levels)
Space efficiency	Lower (fragmentation)	Higher (compression)
Compaction impact	None	Can saturate disk I/O
Crash recovery	WAL needed	Memtable WAL + SSTable immutability
Transaction support	Easier	Harder
Best for	Mixed OLTP, strong reads	Write-heavy, append workloads

OLTP vs OLAP

OLTP                                    OLAP
─────────────────────────               ───────────────────────
Small records, by key                   Aggregate over many records
Read < 100 rows                         Read millions of rows
Random access                           Full column scans
Low latency (ms)                        High throughput (minutes OK)
Updated continuously                    Batch loaded (ETL)
MySQL, PostgreSQL                       Snowflake, BigQuery, Redshift
GB to TB                                TB to PB
Indexed by primary key                  Columnar, partitioned

Star Schema

                    ┌─────────────────┐
                    │  dim_date       │
                    │  date_key (PK)  │
                    │  year, month... │
                    └────────┬────────┘
                             │
┌─────────────┐    ┌─────────┴────────────┐    ┌──────────────────┐
│ dim_product │    │   fact_sales         │    │   dim_customer   │
│ product_key ├────┤   date_key (FK)      ├────┤   customer_key   │
│ name, brand │    │   product_key (FK)   │    │   name, city...  │
└─────────────┘    │   customer_key (FK)  │    └──────────────────┘
                   │   quantity, price    │
                   └──────────────────────┘

Each row in fact table = one event/transaction
Queries join fact → dimension tables

Column Store vs Row Store

Row Store (PostgreSQL):
Row 1: [user_id=1, name="Alice", email="a@b.com", age=30, ...]
Row 2: [user_id=2, name="Bob",   email="b@b.com", age=25, ...]
─ Good: load one row fast
─ Bad: must load all columns to get one

Column Store (Snowflake):
Col user_id: [1, 2, 3, 4, ...]
Col name:    ["Alice", "Bob", "Charlie", ...]
Col age:     [30, 25, 35, ...]
─ Good: read only columns needed, compress same-type data
─ Bad: write one row = write to every column file

Column Compression

Original column: [active, active, active, inactive, active, inactive, inactive]

Bitmap encoding:
  active:   [1, 1, 1, 0, 1, 0, 0]
  inactive: [0, 0, 0, 1, 0, 1, 1]

Run-length encoding:
  active×3, inactive×1, active×1, inactive×2

Query "WHERE status = active AND region = west":
  → Bitwise AND on compressed bitmaps (very fast)

Bloom Filter Mental Model

"Is key K in SSTable S?"

Bloom filter says:            Meaning:
─────────────────────────     ──────────────────────────────────────
DEFINITELY NOT in S     →     Skip SSTable entirely (big win!)
MAYBE in S              →     Check SSTable (might be a false positive)

No false negatives: if key exists, bloom filter always says "MAYBE"
False positives possible: bloom filter may say "MAYBE" for missing keys
Configurable FP rate: ~1% typical (trade space for fewer disk reads)

Key Trade-offs

Decision	Pro	Con	When to Use
LSM-Tree	High write throughput	Compaction I/O, slower reads	Write-heavy workloads
B-Tree	Fast reads, simple	Lower write throughput	Read-heavy, mixed OLTP
In-memory DB	Fastest possible	Limited by RAM size	Hot datasets, caching
Column store	Fast analytics	Slow individual row writes	OLAP, BI
Row store	Fast row access	Slow full-column scans	OLTP, transactional
Data cube	Instant aggregations	Pre-computed, inflexible	Known aggregate queries

Red Flags

❌ Running analytics directly on OLTP production database (kills performance)
❌ No compaction tuning on LSM-tree under heavy write load
❌ Forgetting Bloom filters → LSM-tree reads check every level
❌ Using B-tree for append-only log storage (write amplification)
❌ Missing indexes on frequently queried columns

Green Flags

✅ Use write-ahead log for crash recovery
✅ Bloom filters on all LSM-tree SSTables
✅ Separate OLTP database from data warehouse with ETL
✅ Column store for any analytical workload > 1M rows
✅ Measure actual I/O patterns before choosing storage engine

Modern Additions (2026)

NVMe SSD Impact:
├─ Random I/O ~10x faster (B-tree penalty reduced)
├─ LSM-tree sequential write advantage less dominant
└─ WiscKey: separate keys from values to reduce compaction overhead

Lakehouse (Delta Lake, Iceberg, Hudi):
├─ ACID transactions over Parquet on S3
├─ Time travel queries (version history)
└─ Merge OLAP performance with data lake flexibility

HTAP (TiDB, SingleStore):
├─ Row store (OLTP) + column store (OLAP) in one DB
├─ Real-time analytics without ETL
└─ Raft-based replication for consistency

Interview Response Templates

When Asked About Storage Engine Choice

“I’d start by understanding the access pattern. Write-heavy workloads benefit from LSM-trees (like RocksDB or Cassandra) because sequential writes are much faster than random writes. Read-heavy or mixed transactional workloads often do better with B-trees (PostgreSQL/InnoDB) for their predictable read performance. For analytics, I’d separate the concern entirely and use a columnar warehouse.”

When Asked About Slow Read Performance in LSM-tree

“LSM-tree reads can be slow because they must check multiple SSTable levels. The key mitigations are: first, Bloom filters to skip SSTables that don’t contain the key; second, compaction to reduce the number of levels; third, a good memtable that absorbs recent writes. If reads are consistently slow, it often means compaction is falling behind writes.”

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch03-cheatsheet