Chapter 4 Cheat Sheet — Storage and Retrieval

One-Line Summaries

Concept	One-Liner
LSM-Tree	Append-only memtable → SSTables → compaction; optimized for writes
B-Tree	In-place updates on sorted pages; optimized for reads, dominant in OLTP
SSTable	Sorted String Table: sorted key-value log segment; mergeable, compressible
Memtable	In-memory sorted tree (red-black/AVL); flushed to SSTable when full
Compaction	Background process merging SSTables, discarding old versions
Bloom filter	Probabilistic filter: “definitely not here” vs “maybe here”; avoids disk reads
Write amplification	Multiple physical writes per one logical write (compaction cost)
Clustered index	Row data stored inside the index; primary key lookup = one I/O
Covering index	Index includes extra columns; query served without heap access
Column-oriented	Store each column separately; fast for analytical scans, compresses well
Vectorized execution	Process batches of values (1024 rows) per SIMD instruction
Query compilation	Translate query plan → machine code via LLVM; eliminates interpretation overhead
Materialized view	Pre-computed query result stored physically; instant access
OLAP cube	Pre-aggregated results across all dimension combinations
Vector embedding	High-dimensional float vector encoding semantic meaning; supports similarity search
HNSW	Hierarchical Navigable Small World; O(log N) approximate nearest-neighbor graph
RAG	Retrieval-Augmented Generation; embed query → find similar docs → feed to LLM
Lakehouse	ACID transactions + schema evolution on open Parquet files in object storage

LSM-Tree Write and Read Path

WRITE PATH
──────────
Client write
     │
     ▼
┌──────────────┐
│  Memtable    │  ← in-memory sorted tree (red-black/AVL)
│  (in RAM)    │  ← also written to WAL on disk for crash recovery
└──────┬───────┘
       │ flush when full (~4MB)
       ▼
┌──────────────┐
│  SSTable L0  │  ← immutable, sorted, on disk
│  SSTable L0  │
└──────┬───────┘
       │ background compaction
       ▼
┌──────────────┐
│  SSTable L1  │  ← larger, merged, deduplicated
│  SSTable L2  │
│  ...         │
└──────────────┘

READ PATH
─────────
1. Check memtable (most recent writes)
2. Check Bloom filter for each SSTable level (skip if "definitely not here")
3. Check L0 SSTables (most recent on disk)
4. Check L1, L2, ... until found or exhausted

B-Tree Structure

                    [30 | 70]                ← root (branch page)
                   /    |    \
            [10|20]  [40|60]  [80|90]        ← branch pages
           /  |  \    ...       ...
         ...  ...  [21|25|28]               ← leaf pages (contain actual values)

Key properties:
• Each page: 4KB-64KB (depends on storage)
• Branching factor: ~500 child pointers per page
• 4 levels × 500 branching = 62.5 trillion keys addressable
• Update: find leaf → modify in place → write WAL → if split needed, update parent

LSM-Tree vs B-Tree Quick Reference

Property	LSM-Tree	B-Tree	When It Matters
Write throughput	Higher (sequential)	Lower (random I/O)	Write-heavy workloads
Write amplification	Higher (compaction)	Lower (1 write/change)	SSD wear, cost
Read latency	Higher (multi-level)	Lower (O(log N))	OLTP read performance
Space amplification	Higher (compaction in-flight)	Lower	Storage cost
Range queries	Good (sorted)	Excellent (linked leaves)	Scan-heavy queries
Crash recovery	Simple (replay WAL)	Complex	Ops simplicity
Concurrency	Simple (append-only)	Complex (latches)	Multi-threaded writes
Compression	Better (sorted data)	Worse	Storage cost
SSD fit	Excellent	Good	Modern hardware
Best use	Cassandra, RocksDB	MySQL, PostgreSQL	Match to workload

Amplification Factors (The Three Amps)

Write Amplification: 1 logical write → N physical writes
  LSM-Tree: compaction rewrites data multiple times across levels
  B-Tree: typically 1-2x (WAL + in-place page write)
  High write amp → SSD wear, more I/O bandwidth consumed

Read Amplification: 1 logical read → N physical reads
  LSM-Tree: check memtable + Bloom filter + N SSTable levels
  B-Tree: traverse log(N) pages in one path
  High read amp → slower reads, more I/O

Space Amplification: actual disk used / logical data size
  LSM-Tree: up to 10x during compaction (old + new copies exist simultaneously)
  B-Tree: ~2x (fragmentation, free pages)
  High space amp → more storage cost

Column-Oriented Storage: The Key Insight

Row-oriented (OLTP layout):
[id | name | region | revenue | category | date | status | ...]  ← 100 columns
[id | name | region | revenue | category | date | status | ...]
...

Query: SELECT SUM(revenue) WHERE region = 'EU'
Must read ALL 100 columns to get just 2

Column-oriented (OLAP layout):
revenue_file: [125.0, 890.0, 45.0, 1200.0, ...]   ← only this file read
region_file:  ['US',  'EU',  'EU', 'US',   ...]   ← and this one

Query reads 2% of data. Before compression.

After compression (dictionary + RLE on region):
region_file compressed: { 'EU': [bits 0,1,0,0,1,1,...] }
Operate directly on compressed bitmaps — no decompression needed!

Compression Techniques in Column Stores

Dictionary Encoding:
Before: ['pending','shipped','pending','delivered','shipped']
After:  {0:'pending', 1:'shipped', 2:'delivered'}, [0,1,0,2,1]
Savings: ~5x for string columns with low cardinality

Run-Length Encoding (RLE) on sorted data:
Before: ['EU','EU','EU','EU','US','US','US']
After:  [('EU',4), ('US',3)]
Savings: extreme for sorted low-cardinality columns

Delta Encoding (for timestamps):
Before: [1704067200, 1704067260, 1704067320, 1704067380]
After:  [1704067200, +60, +60, +60]
Savings: ~4x compression on dense time series

OLTP vs OLAP Quick Comparison

Dimension	OLTP	OLAP
Users	End-users (application)	Analysts, data scientists
Query type	Short, many transactions	Long scans, aggregations
Dataset	GB, normalized (3NF)	TB-PB, star/snowflake schema
Write pattern	Individual low-latency writes	Bulk ETL/ELT loads
Index type	B-Tree (random access)	Column + bitmap + sort key
Examples	PostgreSQL, MySQL, DynamoDB	Snowflake, BigQuery, Redshift

Cloud Data Warehouse Decision Tree

Need serverless (no cluster management)?
  └─ Yes → Google BigQuery (pay-per-query)

Need best separation of compute/storage + zero-copy cloning?
  └─ Yes → Snowflake

Already heavily invested in AWS?
  └─ Yes → Amazon Redshift (RA3 nodes)

Need unified batch + streaming + ML in one platform?
  └─ Yes → Databricks Lakehouse (Delta Lake)

Need embedded analytics (no server, runs in-process)?
  └─ Yes → DuckDB

Need extreme real-time insert throughput for analytics?
  └─ Yes → ClickHouse

Need open table format (no vendor lock-in)?
  └─ Yes → Iceberg + Trino/Athena/Spark

Vectorized Execution vs Volcano Model

Volcano model (tuple-at-a-time):
for each row in table:
  if filter(row):           ← function call per row
    result = project(row)   ← function call per row
    aggregate(result)       ← function call per row
→ 3 × 1,000,000,000 function calls for 1B rows

Vectorized model (batch-at-a-time):
for each batch of 1024 rows:
  filtered_batch = filter(batch)     ← 1 call, SIMD processes 8-32 values
  projected_batch = project(filtered_batch)
  aggregate(projected_batch)
→ 3 × (1,000,000,000 / 1024) = 2.9M function calls
→ Each call also 8-32x faster (SIMD)
→ Net: ~100-300x fewer instructions

Used by: DuckDB, Apache Arrow DataFusion, ClickHouse, Velox (Meta)

Vector Embeddings: Semantic Search

Traditional keyword search:
query: "database performance"
match: documents containing exact words "database" AND "performance"

Semantic vector search:
query → embed → [0.12, -0.45, 0.89, ..., 0.03]  (1536 dims)
compare cosine similarity to all document embeddings
match: "storage engine optimization" (semantically close, no keyword match)
       "DB query speed tuning" (semantically close)

RAG Pattern:
User question → embed → find top-5 similar chunks in vector DB
              → inject chunks into LLM prompt
              → LLM generates answer grounded in retrieved context

ANN Algorithm Comparison

Algorithm	Type	Query Speed	Recall	Memory	Best For
HNSW	Graph	O(log N)	95-99%	High	General purpose, low latency
IVF-Flat	Clustering	O(K×D)	90-98%	Medium	Large-scale batch queries
PQ (Product Quantization)	Compression	Fast	85-95%	Low	Billion-scale, memory-constrained
IVF-PQ	Cluster+compress	Very fast	85-92%	Very low	Billion-scale, fast
ScaNN	Asymmetric hashing	Very fast	95-99%	Medium	Google-scale production
Flat (exact)	Brute force	O(N×D)	100%	None	Small datasets (<100K)

Index Type Quick Reference

Index Type	Structure	Best For	Not For
B-Tree	Sorted tree pages	Equality + range queries	Full-text, vector
Hash	Hash map	Exact equality only	Range queries
Bitmap	Bit arrays	Low-cardinality columns, OLAP	High-cardinality, OLTP
Inverted	Term → doc list	Full-text search	Numeric, vector
R-Tree	Spatial tree	2D/3D geospatial	Non-spatial
HNSW	Proximity graph	Vector similarity (ANN)	Exact search
GIN/GiST	Generalized	Arrays, full-text, JSON	Simple equality

Storage Engine Taxonomy

OLTP Storage Engines:
├─ LSM-Tree: RocksDB, Cassandra, LevelDB, HBase, ScyllaDB
└─ B-Tree: MySQL InnoDB, PostgreSQL, SQLite, Oracle

OLAP Storage Engines:
├─ Cloud DW: Snowflake, BigQuery, Redshift, Databricks
├─ Embedded: DuckDB (columnar, in-process)
├─ Self-hosted: ClickHouse, Apache Druid, Pinot
└─ Open formats: Parquet + Iceberg/Delta on S3

Specialized:
├─ In-memory: Redis, Memcached, VoltDB
├─ Vector: pgvector, Pinecone, Weaviate, Qdrant, Milvus
├─ Full-text: Elasticsearch, Typesense, Meilisearch
└─ Time-series: InfluxDB, TimescaleDB, QuestDB

Modern Additions (2026)

Lakehouse Architecture:
├─ Delta Lake (Databricks) + Apache Iceberg (Netflix/Apple)
├─ ACID transactions on Parquet files in S3/GCS
├─ Time travel, schema evolution, concurrent read/write
└─ Multiple engines (Spark, Trino, Flink, Athena) read same data

DuckDB Revolution:
├─ Columnar analytics without a server (embedded)
├─ Reads Parquet, Arrow, CSV, JSON natively
├─ 10-100x faster than pandas for medium-scale analytics
└─ Powers MotherDuck, Observable, many analytics tools

Vector Search Mainstream:
├─ pgvector + HNSW available in all managed PostgreSQL services
├─ Pinecone, Weaviate, Qdrant for dedicated vector workloads
├─ Hybrid search (vector + BM25) standard for RAG
└─ Every cloud provider now has native vector DB support

RocksDB as Universal Substrate:
├─ Underlying engine for TiKV, TiDB, CockroachDB, Kafka storage
└─ Purpose-tuned for SSD write-heavy workloads

Quick Revision Time: 8 minutes
Interview Prep: 25 minutes
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch04-cheatsheet

Chapter 4 Cheat Sheet — Storage and Retrieval

One-Line Summaries

LSM-Tree Write and Read Path

B-Tree Structure

LSM-Tree vs B-Tree Quick Reference

Amplification Factors (The Three Amps)

Column-Oriented Storage: The Key Insight

Compression Techniques in Column Stores

OLTP vs OLAP Quick Comparison

Cloud Data Warehouse Decision Tree

Vectorized Execution vs Volcano Model

Vector Embeddings: Semantic Search

ANN Algorithm Comparison

Index Type Quick Reference

Storage Engine Taxonomy

Modern Additions (2026)

Graph View

Table of Contents