Chapter 4 Cheat Sheet — Storage and Retrieval

ddia-2e storage retrieval cheatsheet


One-Line Summaries

ConceptOne-Liner
LSM-TreeAppend-only memtable → SSTables → compaction; optimized for writes
B-TreeIn-place updates on sorted pages; optimized for reads, dominant in OLTP
SSTableSorted String Table: sorted key-value log segment; mergeable, compressible
MemtableIn-memory sorted tree (red-black/AVL); flushed to SSTable when full
CompactionBackground process merging SSTables, discarding old versions
Bloom filterProbabilistic filter: “definitely not here” vs “maybe here”; avoids disk reads
Write amplificationMultiple physical writes per one logical write (compaction cost)
Clustered indexRow data stored inside the index; primary key lookup = one I/O
Covering indexIndex includes extra columns; query served without heap access
Column-orientedStore each column separately; fast for analytical scans, compresses well
Vectorized executionProcess batches of values (1024 rows) per SIMD instruction
Query compilationTranslate query plan → machine code via LLVM; eliminates interpretation overhead
Materialized viewPre-computed query result stored physically; instant access
OLAP cubePre-aggregated results across all dimension combinations
Vector embeddingHigh-dimensional float vector encoding semantic meaning; supports similarity search
HNSWHierarchical Navigable Small World; O(log N) approximate nearest-neighbor graph
RAGRetrieval-Augmented Generation; embed query → find similar docs → feed to LLM
LakehouseACID transactions + schema evolution on open Parquet files in object storage

LSM-Tree Write and Read Path

WRITE PATH
──────────
Client write
     │
     ▼
┌──────────────┐
│  Memtable    │  ← in-memory sorted tree (red-black/AVL)
│  (in RAM)    │  ← also written to WAL on disk for crash recovery
└──────┬───────┘
       │ flush when full (~4MB)
       ▼
┌──────────────┐
│  SSTable L0  │  ← immutable, sorted, on disk
│  SSTable L0  │
└──────┬───────┘
       │ background compaction
       ▼
┌──────────────┐
│  SSTable L1  │  ← larger, merged, deduplicated
│  SSTable L2  │
│  ...         │
└──────────────┘

READ PATH
─────────
1. Check memtable (most recent writes)
2. Check Bloom filter for each SSTable level (skip if "definitely not here")
3. Check L0 SSTables (most recent on disk)
4. Check L1, L2, ... until found or exhausted

B-Tree Structure

                    [30 | 70]                ← root (branch page)
                   /    |    \
            [10|20]  [40|60]  [80|90]        ← branch pages
           /  |  \    ...       ...
         ...  ...  [21|25|28]               ← leaf pages (contain actual values)

Key properties:
• Each page: 4KB-64KB (depends on storage)
• Branching factor: ~500 child pointers per page
• 4 levels × 500 branching = 62.5 trillion keys addressable
• Update: find leaf → modify in place → write WAL → if split needed, update parent

LSM-Tree vs B-Tree Quick Reference

PropertyLSM-TreeB-TreeWhen It Matters
Write throughputHigher (sequential)Lower (random I/O)Write-heavy workloads
Write amplificationHigher (compaction)Lower (1 write/change)SSD wear, cost
Read latencyHigher (multi-level)Lower (O(log N))OLTP read performance
Space amplificationHigher (compaction in-flight)LowerStorage cost
Range queriesGood (sorted)Excellent (linked leaves)Scan-heavy queries
Crash recoverySimple (replay WAL)ComplexOps simplicity
ConcurrencySimple (append-only)Complex (latches)Multi-threaded writes
CompressionBetter (sorted data)WorseStorage cost
SSD fitExcellentGoodModern hardware
Best useCassandra, RocksDBMySQL, PostgreSQLMatch to workload

Amplification Factors (The Three Amps)

Write Amplification: 1 logical write → N physical writes
  LSM-Tree: compaction rewrites data multiple times across levels
  B-Tree: typically 1-2x (WAL + in-place page write)
  High write amp → SSD wear, more I/O bandwidth consumed

Read Amplification: 1 logical read → N physical reads
  LSM-Tree: check memtable + Bloom filter + N SSTable levels
  B-Tree: traverse log(N) pages in one path
  High read amp → slower reads, more I/O

Space Amplification: actual disk used / logical data size
  LSM-Tree: up to 10x during compaction (old + new copies exist simultaneously)
  B-Tree: ~2x (fragmentation, free pages)
  High space amp → more storage cost

Column-Oriented Storage: The Key Insight

Row-oriented (OLTP layout):
[id | name | region | revenue | category | date | status | ...]  ← 100 columns
[id | name | region | revenue | category | date | status | ...]
...

Query: SELECT SUM(revenue) WHERE region = 'EU'
Must read ALL 100 columns to get just 2

Column-oriented (OLAP layout):
revenue_file: [125.0, 890.0, 45.0, 1200.0, ...]   ← only this file read
region_file:  ['US',  'EU',  'EU', 'US',   ...]   ← and this one

Query reads 2% of data. Before compression.

After compression (dictionary + RLE on region):
region_file compressed: { 'EU': [bits 0,1,0,0,1,1,...] }
Operate directly on compressed bitmaps — no decompression needed!

Compression Techniques in Column Stores

Dictionary Encoding:
Before: ['pending','shipped','pending','delivered','shipped']
After:  {0:'pending', 1:'shipped', 2:'delivered'}, [0,1,0,2,1]
Savings: ~5x for string columns with low cardinality

Run-Length Encoding (RLE) on sorted data:
Before: ['EU','EU','EU','EU','US','US','US']
After:  [('EU',4), ('US',3)]
Savings: extreme for sorted low-cardinality columns

Delta Encoding (for timestamps):
Before: [1704067200, 1704067260, 1704067320, 1704067380]
After:  [1704067200, +60, +60, +60]
Savings: ~4x compression on dense time series

OLTP vs OLAP Quick Comparison

DimensionOLTPOLAP
UsersEnd-users (application)Analysts, data scientists
Query typeShort, many transactionsLong scans, aggregations
DatasetGB, normalized (3NF)TB-PB, star/snowflake schema
Write patternIndividual low-latency writesBulk ETL/ELT loads
Index typeB-Tree (random access)Column + bitmap + sort key
ExamplesPostgreSQL, MySQL, DynamoDBSnowflake, BigQuery, Redshift

Cloud Data Warehouse Decision Tree

Need serverless (no cluster management)?
  └─ Yes → Google BigQuery (pay-per-query)

Need best separation of compute/storage + zero-copy cloning?
  └─ Yes → Snowflake

Already heavily invested in AWS?
  └─ Yes → Amazon Redshift (RA3 nodes)

Need unified batch + streaming + ML in one platform?
  └─ Yes → Databricks Lakehouse (Delta Lake)

Need embedded analytics (no server, runs in-process)?
  └─ Yes → DuckDB

Need extreme real-time insert throughput for analytics?
  └─ Yes → ClickHouse

Need open table format (no vendor lock-in)?
  └─ Yes → Iceberg + Trino/Athena/Spark

Vectorized Execution vs Volcano Model

Volcano model (tuple-at-a-time):
for each row in table:
  if filter(row):           ← function call per row
    result = project(row)   ← function call per row
    aggregate(result)       ← function call per row
→ 3 × 1,000,000,000 function calls for 1B rows

Vectorized model (batch-at-a-time):
for each batch of 1024 rows:
  filtered_batch = filter(batch)     ← 1 call, SIMD processes 8-32 values
  projected_batch = project(filtered_batch)
  aggregate(projected_batch)
→ 3 × (1,000,000,000 / 1024) = 2.9M function calls
→ Each call also 8-32x faster (SIMD)
→ Net: ~100-300x fewer instructions

Used by: DuckDB, Apache Arrow DataFusion, ClickHouse, Velox (Meta)

Traditional keyword search:
query: "database performance"
match: documents containing exact words "database" AND "performance"

Semantic vector search:
query → embed → [0.12, -0.45, 0.89, ..., 0.03]  (1536 dims)
compare cosine similarity to all document embeddings
match: "storage engine optimization" (semantically close, no keyword match)
       "DB query speed tuning" (semantically close)

RAG Pattern:
User question → embed → find top-5 similar chunks in vector DB
              → inject chunks into LLM prompt
              → LLM generates answer grounded in retrieved context

ANN Algorithm Comparison

AlgorithmTypeQuery SpeedRecallMemoryBest For
HNSWGraphO(log N)95-99%HighGeneral purpose, low latency
IVF-FlatClusteringO(K×D)90-98%MediumLarge-scale batch queries
PQ (Product Quantization)CompressionFast85-95%LowBillion-scale, memory-constrained
IVF-PQCluster+compressVery fast85-92%Very lowBillion-scale, fast
ScaNNAsymmetric hashingVery fast95-99%MediumGoogle-scale production
Flat (exact)Brute forceO(N×D)100%NoneSmall datasets (<100K)

Index Type Quick Reference

Index TypeStructureBest ForNot For
B-TreeSorted tree pagesEquality + range queriesFull-text, vector
HashHash mapExact equality onlyRange queries
BitmapBit arraysLow-cardinality columns, OLAPHigh-cardinality, OLTP
InvertedTerm → doc listFull-text searchNumeric, vector
R-TreeSpatial tree2D/3D geospatialNon-spatial
HNSWProximity graphVector similarity (ANN)Exact search
GIN/GiSTGeneralizedArrays, full-text, JSONSimple equality

Storage Engine Taxonomy

OLTP Storage Engines:
├─ LSM-Tree: RocksDB, Cassandra, LevelDB, HBase, ScyllaDB
└─ B-Tree: MySQL InnoDB, PostgreSQL, SQLite, Oracle

OLAP Storage Engines:
├─ Cloud DW: Snowflake, BigQuery, Redshift, Databricks
├─ Embedded: DuckDB (columnar, in-process)
├─ Self-hosted: ClickHouse, Apache Druid, Pinot
└─ Open formats: Parquet + Iceberg/Delta on S3

Specialized:
├─ In-memory: Redis, Memcached, VoltDB
├─ Vector: pgvector, Pinecone, Weaviate, Qdrant, Milvus
├─ Full-text: Elasticsearch, Typesense, Meilisearch
└─ Time-series: InfluxDB, TimescaleDB, QuestDB

Modern Additions (2026)

Lakehouse Architecture:
├─ Delta Lake (Databricks) + Apache Iceberg (Netflix/Apple)
├─ ACID transactions on Parquet files in S3/GCS
├─ Time travel, schema evolution, concurrent read/write
└─ Multiple engines (Spark, Trino, Flink, Athena) read same data

DuckDB Revolution:
├─ Columnar analytics without a server (embedded)
├─ Reads Parquet, Arrow, CSV, JSON natively
├─ 10-100x faster than pandas for medium-scale analytics
└─ Powers MotherDuck, Observable, many analytics tools

Vector Search Mainstream:
├─ pgvector + HNSW available in all managed PostgreSQL services
├─ Pinecone, Weaviate, Qdrant for dedicated vector workloads
├─ Hybrid search (vector + BM25) standard for RAG
└─ Every cloud provider now has native vector DB support

RocksDB as Universal Substrate:
├─ Underlying engine for TiKV, TiDB, CockroachDB, Kafka storage
└─ Purpose-tuned for SSD write-heavy workloads

Quick Revision Time: 8 minutes
Interview Prep: 25 minutes
Last Updated: 2026-05-29