Chapter 3 Cheat Sheet - Storage and Retrieval

One-Line Summaries

ConceptOne-Liner
LSM-TreeAppend writes to memtable → flush to SSTables → merge via compaction
B-TreeBalanced tree of fixed-size pages; update in-place; standard for OLTP
SSTableSorted String Table — sorted key-value file segment used in LSM-Trees
Write-Ahead Log (WAL)Log changes before applying to B-tree pages; enables crash recovery
Bloom filterProbabilistic structure — tells you if a key is definitely NOT in an SSTable
OLTPTransactional workload: small, fast reads/writes by primary key
OLAPAnalytical workload: aggregate queries scanning millions of rows
Column storeData stored column-by-column — optimal for analytics (read few columns)
Data cubePre-aggregated OLAP cube across all combinations of dimensions
CompactionBackground process merging SSTables, removing old versions of keys

Storage Engine Decision Tree

What's your workload?
│
├─ OLTP (low-latency, small reads/writes)?
│  ├─ Write-heavy (append-intensive)?
│  │  └─ → LSM-Tree (Cassandra, RocksDB, LevelDB)
│  └─ Read-heavy / mixed / need strong consistency?
│     └─ → B-Tree (PostgreSQL, MySQL/InnoDB)
│
└─ OLAP (analytics, aggregations, large scans)?
   ├─ Batch/historical analytics?
   │  └─ → Column store (Snowflake, BigQuery, Redshift)
   └─ Real-time analytics on live data?
      └─ → HTAP (TiDB, SingleStore) or Lakehouse (Delta Lake)

LSM-Tree Write and Read Paths

WRITE PATH:
  Write → Memtable (in-memory sorted tree)
       → (when full) Flush to L0 SSTable on disk
       → Background: Compact L0→L1→L2 (merge sorted SSTables)

READ PATH:
  Read key K → Check Memtable
            → Check Bloom filter for each SSTable level
            → If might exist: binary search in SSTable
            → Return most recent value found

B-Tree Structure

Root Page
├── [keys: 1-100]  → Child Page A
│   ├── [1–33] → Leaf (row data)
│   ├── [34–66] → Leaf (row data)
│   └── [67–100] → Leaf (row data)
├── [keys: 101-200] → Child Page B
└── ...

Write:
  1. Find leaf page
  2. Update value in-place (or split if full)
  3. Write change to WAL first (crash safety)
  4. Write updated page to disk

LSM-Tree vs B-Tree

DimensionB-TreeLSM-Tree
Write throughputLower (random I/O)Higher (sequential I/O)
Read throughputHigher (one place)Lower (check multiple levels)
Space efficiencyLower (fragmentation)Higher (compression)
Compaction impactNoneCan saturate disk I/O
Crash recoveryWAL neededMemtable WAL + SSTable immutability
Transaction supportEasierHarder
Best forMixed OLTP, strong readsWrite-heavy, append workloads

OLTP vs OLAP

OLTP                                    OLAP
─────────────────────────               ───────────────────────
Small records, by key                   Aggregate over many records
Read < 100 rows                         Read millions of rows
Random access                           Full column scans
Low latency (ms)                        High throughput (minutes OK)
Updated continuously                    Batch loaded (ETL)
MySQL, PostgreSQL                       Snowflake, BigQuery, Redshift
GB to TB                                TB to PB
Indexed by primary key                  Columnar, partitioned

Star Schema

                    ┌─────────────────┐
                    │  dim_date       │
                    │  date_key (PK)  │
                    │  year, month... │
                    └────────┬────────┘
                             │
┌─────────────┐    ┌─────────┴────────────┐    ┌──────────────────┐
│ dim_product │    │   fact_sales         │    │   dim_customer   │
│ product_key ├────┤   date_key (FK)      ├────┤   customer_key   │
│ name, brand │    │   product_key (FK)   │    │   name, city...  │
└─────────────┘    │   customer_key (FK)  │    └──────────────────┘
                   │   quantity, price    │
                   └──────────────────────┘

Each row in fact table = one event/transaction
Queries join fact → dimension tables

Column Store vs Row Store

Row Store (PostgreSQL):
Row 1: [user_id=1, name="Alice", email="a@b.com", age=30, ...]
Row 2: [user_id=2, name="Bob",   email="b@b.com", age=25, ...]
─ Good: load one row fast
─ Bad: must load all columns to get one

Column Store (Snowflake):
Col user_id: [1, 2, 3, 4, ...]
Col name:    ["Alice", "Bob", "Charlie", ...]
Col age:     [30, 25, 35, ...]
─ Good: read only columns needed, compress same-type data
─ Bad: write one row = write to every column file

Column Compression

Original column: [active, active, active, inactive, active, inactive, inactive]

Bitmap encoding:
  active:   [1, 1, 1, 0, 1, 0, 0]
  inactive: [0, 0, 0, 1, 0, 1, 1]

Run-length encoding:
  active×3, inactive×1, active×1, inactive×2

Query "WHERE status = active AND region = west":
  → Bitwise AND on compressed bitmaps (very fast)

Bloom Filter Mental Model

"Is key K in SSTable S?"

Bloom filter says:            Meaning:
─────────────────────────     ──────────────────────────────────────
DEFINITELY NOT in S     →     Skip SSTable entirely (big win!)
MAYBE in S              →     Check SSTable (might be a false positive)

No false negatives: if key exists, bloom filter always says "MAYBE"
False positives possible: bloom filter may say "MAYBE" for missing keys
Configurable FP rate: ~1% typical (trade space for fewer disk reads)

Key Trade-offs

DecisionProConWhen to Use
LSM-TreeHigh write throughputCompaction I/O, slower readsWrite-heavy workloads
B-TreeFast reads, simpleLower write throughputRead-heavy, mixed OLTP
In-memory DBFastest possibleLimited by RAM sizeHot datasets, caching
Column storeFast analyticsSlow individual row writesOLAP, BI
Row storeFast row accessSlow full-column scansOLTP, transactional
Data cubeInstant aggregationsPre-computed, inflexibleKnown aggregate queries

Red Flags

❌ Running analytics directly on OLTP production database (kills performance)
❌ No compaction tuning on LSM-tree under heavy write load
❌ Forgetting Bloom filters → LSM-tree reads check every level
❌ Using B-tree for append-only log storage (write amplification)
❌ Missing indexes on frequently queried columns

Green Flags

✅ Use write-ahead log for crash recovery
✅ Bloom filters on all LSM-tree SSTables
✅ Separate OLTP database from data warehouse with ETL
✅ Column store for any analytical workload > 1M rows
✅ Measure actual I/O patterns before choosing storage engine

Modern Additions (2026)

NVMe SSD Impact:
├─ Random I/O ~10x faster (B-tree penalty reduced)
├─ LSM-tree sequential write advantage less dominant
└─ WiscKey: separate keys from values to reduce compaction overhead

Lakehouse (Delta Lake, Iceberg, Hudi):
├─ ACID transactions over Parquet on S3
├─ Time travel queries (version history)
└─ Merge OLAP performance with data lake flexibility

HTAP (TiDB, SingleStore):
├─ Row store (OLTP) + column store (OLAP) in one DB
├─ Real-time analytics without ETL
└─ Raft-based replication for consistency

Interview Response Templates

When Asked About Storage Engine Choice

“I’d start by understanding the access pattern. Write-heavy workloads benefit from LSM-trees (like RocksDB or Cassandra) because sequential writes are much faster than random writes. Read-heavy or mixed transactional workloads often do better with B-trees (PostgreSQL/InnoDB) for their predictable read performance. For analytics, I’d separate the concern entirely and use a columnar warehouse.”

When Asked About Slow Read Performance in LSM-tree

“LSM-tree reads can be slow because they must check multiple SSTable levels. The key mitigations are: first, Bloom filters to skip SSTables that don’t contain the key; second, compaction to reduce the number of levels; third, a good memtable that absorbs recent writes. If reads are consistently slow, it often means compaction is falling behind writes.”


Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13