Chapter 4: Storage and Retrieval

ddia-2e storage retrieval lsm-tree b-tree column-store vector-embeddings oltp olap

Status: Notes complete

Overview

Databases perform two fundamental tasks: storing data when you write it, and retrieving it efficiently when you need it. This chapter explains how storage engines actually work — not just what API they expose, but the data structures and algorithms underneath. The key insight is that there is no single best storage engine: log-structured engines (LSM-Trees) optimize for write throughput, B-Trees optimize for read performance, column-oriented stores optimize for analytical scans, and vector indexes optimize for similarity search. The 2nd edition adds significant new material: cloud data warehouses (Snowflake, BigQuery), query compilation and vectorization (LLVM, Apache Arrow), and vector embeddings for AI/ML workloads — all of which reflect the massive shift in how data is stored and queried in 2026.

Key Concepts

Log-Structured Storage (LSM-Trees, SSTables, Compaction)

The simplest possible database: an append-only log.

# Primitive key-value store
db_set() { echo "$1,$2" >> database.log; }
db_get() { grep "^$1," database.log | tail -n 1 | cut -d',' -f2; }

db_set is O(1): just append to the end of the file
db_get is O(n): must scan the entire file

To make reads efficient, add an index — a data structure that maintains additional metadata to help you find data quickly. Every index trades write overhead for read speed.

Hash indexes: The simplest index is an in-memory hash map from key to byte offset in the log file. Bitcask (the Riak storage engine) uses exactly this.

Reads: O(1) hash lookup → seek to offset → read value
Writes: O(1) append + update hash map
Limitation: All keys must fit in RAM; range queries not supported

The compaction problem: An append-only log grows forever. Solution: compaction — merge log segments, throwing away old values and keeping only the latest.

SSTables and LSM-Trees

SSTable (Sorted String Table): A log segment where keys are sorted. Benefits over unsorted hash logs:

Merging segments is efficient (like merge sort — similar to merge in mergesort)
No longer need to keep all keys in memory — a sparse index is sufficient (find the range, scan within it)
Range queries are efficient (keys are sorted)

LSM-Tree (Log-Structured Merge-Tree): The full write path using SSTables:

Write path:
1. Write to in-memory sorted tree (memtable — usually a red-black tree or AVL tree)
2. When memtable exceeds threshold (~4MB), flush to SSTable on disk (Level 0)
3. Background compaction merges Level 0 SSTables into Level 1, Level 1 into Level 2, etc.

Read path:
1. Check memtable first
2. Check Level 0 SSTables (most recent)
3. Check Level 1, Level 2, ... (progressively older)
4. Use Bloom filter to skip SSTables that definitely don't contain the key

Compaction strategies:

Size-tiered (Cassandra default): When several small SSTables accumulate, merge them into one larger one. Simple but can cause space amplification.
Leveled (RocksDB, LevelDB default): Each level has a size limit. When a level overflows, some SSTables are merged into the next level. Better space and read amplification, higher write amplification.

Bloom filters: A probabilistic data structure that answers “definitely not in this SSTable” vs “maybe in this SSTable” — avoids reading SSTables that don’t contain the key.

Systems using LSM-Trees: LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, Bigtable (Google). Also TiKV (underlying store for TiDB).

B-Trees

B-Tree: The dominant index structure for relational databases, introduced by Bayer and McCreight in 1972. Unlike LSM-Trees, B-Trees are update-in-place — they modify pages on disk directly.

Structure:

Data organized in pages (traditionally 4KB, modern SSDs use 16KB-64KB pages)
Each page either contains a range of keys with child pointers (branch page) or contains actual values (leaf page)
The branching factor — number of child pointers per page — is typically several hundred
A 4-level tree with branching factor 500 can address 500^4 = 62.5 trillion key-value pairs

Write path (in-place update):

Find the leaf page containing the key
Modify the value in place (overwrite that page on disk)
If the page is too full, split it into two pages and update the parent

Write-ahead log (WAL): Before modifying a page, write the change to an append-only WAL on disk. If a crash occurs mid-write, the WAL allows recovery to a consistent state.

Concurrency: Latches (lightweight locks) protect pages during writes. LSM-Trees have a simpler concurrency model because they are append-only.

Comparing B-Trees and LSM-Trees

The three key amplification metrics:

Write amplification: How many times is one logical write amplified into physical disk writes?
Read amplification: How many disk reads are needed for one logical read?
Space amplification: How much disk space is used relative to the actual data size?

Dimension	LSM-Tree	B-Tree	Notes
Write throughput	Higher	Lower	LSM appends sequentially; B-Tree random writes
Write amplification	Medium (compaction)	Lower (1 write per change)	But compaction happens in background
Read performance	Slower (multi-level)	Faster (predictable O(log n))	Bloom filters help LSM
Read amplification	Higher (check multiple levels)	Lower (single tree traversal)	LSM must check multiple SSTables
Space amplification	Higher (multiple copies during compaction)	Lower	LSM can have 10x data size during merge
Range queries	Good (SSTables are sorted)	Excellent (B-Tree leaf pages linked)	Both support range queries
Crash recovery	Simple (replay memtable WAL)	Complex (WAL + tree repair)	LSM easier to recover
Compaction impact	Can spike read/write latency	None (no background compaction)	LSM needs careful tuning
Concurrency	Simpler (append-only)	Complex (page-level latches)	LSM easier to implement correctly
Best for	Write-heavy workloads	Read-heavy, OLTP	Match to your access pattern

Key insight from 2nd edition: SSDs have changed the relative performance of B-Trees and LSM-Trees. SSDs handle random writes much better than spinning disks, which reduces one of LSM-Trees’ primary advantages. However, LSM-Trees still win on write throughput for write-heavy workloads.

Multicolumn and Secondary Indexes

Primary index (clustered index): The table is sorted by the primary key. Lookups by primary key are fast.

Secondary indexes: Indexes on non-primary-key columns. Can be sparse (key → row location) or dense (key → value inline).

Multicolumn indexes (composite indexes): Index on multiple columns together.

(last_name, first_name): efficient for WHERE last_name = 'Smith' or WHERE last_name = 'Smith' AND first_name = 'John'; NOT efficient for WHERE first_name = 'John' alone
Leftmost prefix rule: A composite index on (A, B, C) can satisfy queries on A; A+B; or A+B+C — but not B alone or C alone

Geospatial multicolumn indexes: (latitude, longitude) searches don’t fit B-Tree model well. Solutions:

R-Trees: Space-partitioning tree (used by PostGIS)
Space-filling curves (Hilbert curves): Map 2D coordinates to 1D key that preserves locality
Bounding box filters: Pre-filter with a B-Tree on lat/lon ranges, then refine

Full-text indexes: Inverted indexes map each word to the list of documents containing it (see Full-Text Search section).

Storing Values Within the Index

Two approaches for secondary indexes:

Heap file approach: Index stores the row location (file offset or primary key). To read the full row, follow the pointer to the heap file. Avoids duplicating data; but requires two lookups per read.
Clustered index: Row data stored directly within the index. In MySQL InnoDB, the primary key is always a clustered index. Very fast for reads that access most columns.
Covering index (index with included columns): A secondary index that includes additional columns beyond the index key, so the query can be satisfied without touching the heap file (“index-only scan”).

-- Index covering a common query pattern
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)
INCLUDE (created_at, total_amount);
-- Query served entirely from index (no heap access needed)
SELECT created_at, total_amount FROM orders WHERE customer_id = 42 AND status = 'pending';

Keeping Everything in Memory

In-memory databases keep all data in RAM and write to disk only for durability:

Redis: Key-value store with rich data structures (lists, sets, sorted sets, hashes). Persistence via snapshots (RDB) or append-only file (AOF)
Memcached: Pure cache, no persistence
VoltDB, MemSQL (SingleStore): In-memory relational databases with full SQL
Druid: In-memory analytics on time-series data

Why in-memory is faster (counterintuitive insight): Not just because avoiding disk — in-memory data structures can be more flexible. Hash tables and linked lists are impractical on disk but trivial in memory. An in-memory engine can use richer data structures than what a disk-based engine can afford.

Durability approaches:

Write-ahead log to disk + periodic snapshots
Disk as backup: Load from disk on startup, operate entirely in memory
Replication: Use replicas as durability (if all replicas fail, data is lost)

Data Storage for Analytics

Cloud Data Warehouses

OLTP (Online Transaction Processing) vs OLAP (Online Analytical Processing):

Dimension	OLTP	OLAP
Primary users	End-users via application	Analysts, data scientists
Access pattern	Small reads/writes per query	Large scans, aggregations
Dataset size	GB	TB-PB
Query pattern	Random access by key	Sequential scan, GROUP BY, SUM
Write pattern	Low-latency individual writes	Bulk import (ETL/ELT)
Schema	Highly normalized (3NF)	Denormalized star/snowflake
Examples	PostgreSQL, MySQL, DynamoDB	Snowflake, BigQuery, Redshift

Data warehouse: A separate database optimized for analytical queries. Data is loaded from OLTP systems via ETL (Extract-Transform-Load) or ELT pipelines.

Major cloud data warehouses (2026):

System	Provider	Key Differentiator
Snowflake	Multi-cloud	Separation of compute/storage; virtual warehouses; zero-copy cloning
BigQuery	Google Cloud	Serverless (no cluster management); pay-per-query; native ML features
Amazon Redshift	AWS	Columnar storage; tight AWS ecosystem integration; RA3 nodes (managed storage)
Databricks Lakehouse	Databricks	Delta Lake (open format); unified batch + streaming; ML-native
Azure Synapse	Microsoft	Integration with Power BI; serverless SQL pools; Spark integration
DuckDB	Open source	Embedded analytics (runs in-process); columnar; no server required
ClickHouse	Open source	Extreme insert throughput; real-time analytics; very fast GROUP BY

Snowflake architecture (influential model): Storage (S3) is decoupled from compute (virtual warehouses). Multiple warehouses can read the same data concurrently. Queries run on ephemeral compute nodes that read cold data from S3 and cache hot data locally. This separation enables:

Independent scaling of storage and compute
Multiple teams use the same data without contention
Pay only for compute time used

The Lakehouse pattern: Combining data lake (cheap object storage: S3, GCS) with data warehouse query capabilities. Delta Lake (Databricks), Apache Iceberg (Netflix, Apple), and Apache Hudi provide ACID transactions, schema evolution, and time travel on top of Parquet files in object storage.

Column-Oriented Storage

Row-oriented storage: All values for one row are stored together. Reading one column requires loading all rows.

Column-oriented storage: All values for one column are stored together. Reading one column touches only that column’s data.

Row-oriented (one row per page):
page 1: [row1_col1, row1_col2, ..., row1_col100]
page 2: [row2_col1, row2_col2, ..., row2_col100]

Column-oriented (one column per file):
col1_file: [row1_val, row2_val, row3_val, ..., rowN_val]
col2_file: [row1_val, row2_val, row3_val, ..., rowN_val]

Why column storage is faster for analytics:

Read only needed columns: An analytical query on a 100-column table might touch only 3 columns. Column storage reads 3% of the data; row storage reads 100%.
Compression is dramatically more effective: A column of integers or repeated values (like a status field with 5 distinct values) compresses far better than a mixed row. Run-length encoding (RLE) on sorted columns is extremely efficient.
Vectorized processing: CPU SIMD instructions process 8-32 values in a single instruction. Dense arrays of the same type (a column) are ideal for SIMD.

Column compression:

Dictionary encoding: Map distinct values to integers ({'pending': 0, 'shipped': 1, 'delivered': 2}). Compress with bitpacking.
Run-length encoding: AAABBBBCC → A:3, B:4, C:2
Delta encoding: Store differences between consecutive values instead of absolute values (great for timestamps)
Bitmap encoding: One bit per row per distinct value — efficient for low-cardinality columns

Sort order in column stores: Data can be sorted by one or more columns. Sorted order enables:

Better RLE compression (repeated values cluster)
Efficient range predicates
“Sort key” in Snowflake/Redshift controls this

Materialized aggregates: Pre-compute common aggregate expressions (SUM, COUNT, AVG) and cache results. A data cube (OLAP cube) stores pre-aggregated results across all dimension combinations. Dramatic speedup for known query patterns but requires upfront definition and storage overhead.

Query Execution: Compilation and Vectorization

Traditional database query execution: Volcano/Iterator model (tuple-at-a-time). Each operator (scan, filter, join, aggregate) implements a next() method that returns one row. Simple to implement but many function calls per row — poor CPU cache utilization.

Vectorized execution (column-at-a-time): Operators process a batch of values (e.g., 1024 rows) at a time. Benefits:

Far fewer function calls (amortize overhead across batch)
SIMD instructions apply to the entire batch
Better CPU cache utilization
DuckDB, Apache Arrow DataFusion, ClickHouse, and modern Snowflake/BigQuery all use vectorized execution

Volcano model: for each row { filter(row) → project(row) → aggregate(row) }
Vectorized:    filter(batch) → project(batch) → aggregate(batch)
               ← 1 function call per batch, not per row →

Query compilation (code generation): Translate a query plan into optimized machine code at runtime. The generated code is specialized for the specific query — no generic operator overhead.

LLVM: Most common approach (used by Apache Spark’s Tungsten engine, Amazon Redshift, Snowflake)
Whole-stage code generation: Combine multiple operators into a single compiled function, eliminating intermediate materialization
Benefits: 10-100x faster than interpreted execution for CPU-bound queries
Cost: Compilation adds latency (10ms-1s), less beneficial for tiny queries

Apache Arrow DataFusion: Rust-based query engine using Arrow columnar format + vectorized execution. Used in InfluxDB, DuckDB-Arrow integration, and the emerging “ADBC” (Arrow Database Connectivity) standard.

Materialized Views and Data Cubes

Materialized view: A pre-computed query result stored on disk. Unlike a regular view (which is a stored query executed on demand), a materialized view is a physical copy of data.

-- Materialized view: pre-compute daily revenue per product
CREATE MATERIALIZED VIEW daily_product_revenue AS
SELECT product_id, date_trunc('day', created_at) AS day, SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY 1, 2;

Refresh strategies: On demand (REFRESH MATERIALIZED VIEW), scheduled, or incremental (only process new data)
Used heavily in dbt for intermediate transformation layers

OLAP cube (data cube): A multi-dimensional pre-aggregation structure. For a fact table with dimensions (product, region, date), an OLAP cube pre-computes all combinations:

Total sales by product
Total sales by region
Total sales by date
Total sales by product × region
Total sales by product × date
Total sales by region × date
Total sales by product × region × date

Benefits: Any aggregate query over pre-computed dimensions is instant.
Cost: Storage grows exponentially with number of dimensions (2^N combinations). Not feasible for high-cardinality dimensions.

Multidimensional and Full-Text Indexes

Full-Text Search

Full-text search (inverted index): Maps each word/token to the list of documents containing it.

Inverted index:
"database" → [doc1, doc3, doc7, doc12]
"storage"  → [doc1, doc4, doc7]
"retrieval" → [doc3, doc7, doc11]

Query: "database AND storage" → [doc1] ∩ [doc4] = [doc1, doc7]

Tokenization and normalization: Before indexing, text is:

Tokenized: Split into words/tokens
Normalized: Lowercase, remove punctuation
Stemmed/lemmatized: running → run, databases → database
Stop words removed: the, a, is excluded from index

Relevance ranking: Inverted indexes also store term frequencies and document frequencies for TF-IDF (Term Frequency-Inverse Document Frequency) ranking:

TF: How often the term appears in this document
IDF: Inverse of how often the term appears across all documents (rarer = more relevant)

Systems: Elasticsearch, Apache Solr (both based on Apache Lucene), PostgreSQL tsvector/tsquery, Typesense, Meilisearch

Fuzzy search: Edit-distance-based matching allows typos and misspellings. Computed via finite automata or SIMD-accelerated Levenshtein distance.

Vector Embeddings

Vector embeddings are the most important new topic in the 2nd edition — a data model that didn’t exist at production scale when the 1st edition was written.

What is a vector embedding?
A vector embedding is a list of floating-point numbers (a high-dimensional vector) that represents the semantic meaning of an object — text, image, audio, or code. Objects with similar meaning have vectors that are close together in vector space.

# Example: text embedding using OpenAI/sentence-transformers
embedding_1 = embed("The cat sat on the mat")   # → [0.12, -0.45, 0.89, ..., 0.03]  # 1536 dims
embedding_2 = embed("A feline rested on a rug") # → [0.11, -0.47, 0.91, ..., 0.02]  # 1536 dims
embedding_3 = embed("The stock market crashed")  # → [-0.8,  0.23, -0.5, ..., 0.67]  # 1536 dims
 
# cosine_similarity(embedding_1, embedding_2) ≈ 0.97  (semantically similar)
# cosine_similarity(embedding_1, embedding_3) ≈ 0.12  (semantically different)

Similarity metrics:

Cosine similarity: Angle between vectors (most common for text)
Euclidean distance: Geometric distance (for image embeddings)
Dot product: Used when vectors are normalized (fastest to compute)

Use cases:

Semantic search: Find documents semantically similar to a query, not just keyword-matching
RAG (Retrieval-Augmented Generation): Retrieve relevant context for LLM prompts
Recommendation systems: Find similar items or users based on embedding proximity
Image search: Find visually similar images
Anomaly detection: Identify vectors that are outliers from normal clusters

RAG pattern (critical for 2026 LLM applications):

User query → embed query → find K nearest vectors in vector DB
           → retrieve source documents → inject into LLM prompt
           → LLM generates answer grounded in retrieved documents

Approximate Nearest Neighbor (ANN) algorithms: Exact nearest-neighbor search in high dimensions is O(N × D) where D is dimensionality (768-4096 for modern embeddings). Too slow for large corpora. ANN algorithms trade small accuracy loss for much faster search:

HNSW (Hierarchical Navigable Small World): Graph-based; builds a multi-layer proximity graph. O(log N) queries. Default in most vector databases. Excellent recall/latency trade-off.
IVF (Inverted File Index): Cluster vectors (k-means); search only the nearest clusters. Fast but lower recall.
Product Quantization (PQ): Compress vectors to reduce memory; approximate distances. Used for very large datasets.
ScaNN (Google): Asymmetric hashing; state-of-the-art recall-latency trade-off.

Vector DB	Approach	Notes
Pinecone	Managed, cloud-native	Fully managed; easy to use; expensive
Weaviate	HNSW + multimodal	Hybrid vector+keyword; GraphQL API
Qdrant	HNSW (Rust)	Very fast; payload filtering; open source
Milvus/Zilliz	IVF + HNSW	Large-scale; GPU acceleration support
pgvector	HNSW + IVF-Flat	PostgreSQL extension; easy for existing PG users
ChromaDB	In-memory HNSW	Local development; Python-first

Hybrid search: Most production systems combine vector search (semantic similarity) with keyword search (exact/fuzzy match) and re-rank results. This is called hybrid retrieval:

Query → [vector search top-K] + [BM25 keyword search top-K]
      → merge and re-rank with cross-encoder
      → return final results

Comparison Tables

LSM-Tree vs B-Tree: Complete Comparison

Property	LSM-Tree	B-Tree	Winner
Write throughput	High (sequential appends)	Lower (random in-place)	LSM
Write amplification	Higher (compaction rewrites)	Lower (1 write per change)	B-Tree
Read performance	Slower (multi-SSTable check)	Faster (single tree path)	B-Tree
Read amplification	Higher (check all levels)	Lower (log N pages)	B-Tree
Space amplification	Higher (multiple copies)	Lower (in-place update)	B-Tree
Range queries	Good (sorted SSTables)	Excellent (linked leaves)	B-Tree (slight edge)
Crash recovery	Simple (replay memtable WAL)	Complex (WAL + tree)	LSM
Compaction overhead	Background CPU/IO impact	None	B-Tree
Concurrency control	Simpler (append-only)	Complex (page latches)	LSM
SSD friendliness	Very good (sequential writes)	Good (SSDs handle random)	Tie
Compression	Better (sorted data compresses well)	Worse (fragmented pages)	LSM
Best workload	Write-heavy (Cassandra, RocksDB)	Read-heavy OLTP (MySQL InnoDB)	Depends

Storage Engine Taxonomy

Category	Structure	Examples	Best For
Log-structured	LSM-Tree + SSTable	RocksDB, Cassandra, LevelDB	High write throughput
Update-in-place	B-Tree	MySQL InnoDB, PostgreSQL, SQLite	OLTP reads, range queries
Column-oriented	Column files + compression	Parquet, Snowflake, BigQuery	Analytical scans, aggregation
In-memory	Hash table, sorted tree	Redis, Memcached, VoltDB	Lowest latency, small dataset
Vector index	HNSW, IVF	pgvector, Pinecone, Qdrant	Similarity search, ML workloads
Full-text	Inverted index	Lucene, Elasticsearch	Text search, relevance ranking
Time-series	TSM (time-series merge tree)	InfluxDB, TimescaleDB	Sequential time-based data

Important Points Summary

Indexes trade write speed for read speed: Every index slows writes but speeds reads — choose indexes for your actual query patterns, not hypothetical ones.
LSM-Trees win on write throughput; B-Trees win on read performance: Both remain relevant depending on workload.
Column storage is transformative for analytics: Reading 3 out of 100 columns means reading 3% of the data — a 33x reduction in IO before any compression.
Compression in column stores is not just space savings: Run-length encoding enables operating on compressed data without decompressing — multiply performance gains.
Vectorized execution changes the CPU equation: Processing 1024 values per SIMD instruction vs 1 per function call is a 100-1000x difference for scan-heavy queries.
Cloud DWs separate storage and compute: Snowflake’s architecture allows independent scaling; you pay only for compute time used.
Vector embeddings introduce semantic similarity as a query primitive: This is a fundamentally new access pattern — “most similar” rather than “exact match.”
HNSW is the dominant ANN algorithm: Graph-based, O(log N) query, excellent recall — the default in most vector databases.
Hybrid search (vector + keyword) outperforms either alone: Production RAG systems combine semantic and lexical matching.
The Lakehouse pattern is winning: Delta Lake + Iceberg + open Parquet on S3 is replacing proprietary DW formats.

Modern Context (2026)

The Lakehouse has won:

Delta Lake (Databricks), Apache Iceberg (Netflix/Apple), Apache Hudi (Uber) provide ACID transactions + schema evolution + time travel on Parquet files in object storage
Most new data warehouses are lakehouses — avoid vendor lock-in while gaining DW features
Open table formats allow multiple engines (Spark, Trino, Flink, Athena) to read the same data

DuckDB: a paradigm shift for analytics:

An embedded columnar analytical database — runs in-process (no server)
Reads Parquet, Arrow, CSV, and JSON natively
Used for local data analysis, replacing pandas for medium-scale work
Powers MotherDuck (cloud DuckDB), Observable (notebooks), and many analytics tools
Demonstrates that columnar vectorized query execution doesn’t require a massive cluster

RocksDB as universal building block:

RocksDB (LSM-Tree engine by Facebook/Meta) is the underlying storage engine for: TiKV, PingCAP, CockroachDB, Cassandra (optionally), Kafka log storage, MyRocks (MySQL)
Purpose-built for SSD workloads; used for billions of records in production

Vector search is now table stakes:

Every major cloud provider now has a managed vector DB or pgvector support
Amazon Aurora, Azure Cosmos DB, Google AlloyDB all support vector search natively
pgvector (PostgreSQL) is the default choice for teams already on PostgreSQL
HNSW indexes added to pgvector 0.5.0 (2023) dramatically improved performance

Query compilation via LLVM:

Apache Spark’s Tungsten engine uses code generation (WholeStageCodegen)
DuckDB uses runtime vectorization without LLVM (simpler, faster startup)
Amazon Redshift CodeGen uses LLVM for select operators
Trade-off: LLVM compilation adds 100ms-2s startup; worth it only for long-running queries

The end of ETL, rise of ELT:

Traditional ETL: Extract from source → Transform in pipeline → Load into DW
Modern ELT: Extract → Load into DW raw → Transform with SQL (dbt) inside the DW
Cloud DWs are cheap enough to store raw data; SQL is fast enough to transform it on demand
dbt (data build tool) is the dominant transformation layer

Questions for Reflection

Why does an LSM-Tree perform better for write-heavy workloads than a B-Tree, even though both eventually write to disk? What specific mechanism explains the difference?
In column-oriented storage, data for different columns is stored in separate files. How does the database maintain the correspondence between rows across column files (i.e., how does it know that value 47 in the revenue column belongs to the same row as value 2024-01-15 in the date column)?
What makes vectorized query execution faster than the traditional Volcano (tuple-at-a-time) model? In which scenarios does the difference matter most?
Why is approximate nearest neighbor (ANN) search necessary for vector embeddings, when B-Trees provide exact results for traditional queries? What fundamental property of high-dimensional space makes exact search impractical?
Snowflake’s architecture separates storage (S3) from compute (virtual warehouses). What are the performance implications of this separation, and how does local caching on compute nodes mitigate them?
If you were building a RAG (Retrieval-Augmented Generation) system for a legal document corpus with 10 million documents, what storage components would you use and why?

ch03-data-models-query-languages — The logical models that these storage engines implement
ch05-encoding-and-evolution — How data is serialized for storage (Parquet, Avro, Protobuf)
ch06-replication — How storage engines are replicated across machines
ch07-sharding — How data is partitioned across storage nodes
ch03-storage-and-retrieval — 1st edition coverage for comparison

Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch04-storage-and-retrieval

Chapter 4: Storage and Retrieval

Overview

Key Concepts

Log-Structured Storage (LSM-Trees, SSTables, Compaction)

SSTables and LSM-Trees

B-Trees

Comparing B-Trees and LSM-Trees

Multicolumn and Secondary Indexes

Storing Values Within the Index

Keeping Everything in Memory

Data Storage for Analytics

Cloud Data Warehouses

Column-Oriented Storage

Query Execution: Compilation and Vectorization

Materialized Views and Data Cubes

Multidimensional and Full-Text Indexes

Full-Text Search

Vector Embeddings

Comparison Tables

LSM-Tree vs B-Tree: Complete Comparison

Storage Engine Taxonomy

Important Points Summary

Modern Context (2026)

Questions for Reflection

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch04-storage-and-retrieval

Chapter 4: Storage and Retrieval

Overview

Key Concepts

Log-Structured Storage (LSM-Trees, SSTables, Compaction)

SSTables and LSM-Trees

B-Trees

Comparing B-Trees and LSM-Trees

Multicolumn and Secondary Indexes

Storing Values Within the Index

Keeping Everything in Memory

Data Storage for Analytics

Cloud Data Warehouses

Column-Oriented Storage

Query Execution: Compilation and Vectorization

Materialized Views and Data Cubes

Multidimensional and Full-Text Indexes

Full-Text Search

Vector Embeddings

Comparison Tables

LSM-Tree vs B-Tree: Complete Comparison

Storage Engine Taxonomy

Important Points Summary

Modern Context (2026)

Questions for Reflection

Related Resources

Graph View

Table of Contents

Backlinks