Chapter 4: Storage and Retrieval
ddia-2e storage retrieval lsm-tree b-tree column-store vector-embeddings oltp olap
Status: Notes complete
Overview
Databases perform two fundamental tasks: storing data when you write it, and retrieving it efficiently when you need it. This chapter explains how storage engines actually work — not just what API they expose, but the data structures and algorithms underneath. The key insight is that there is no single best storage engine: log-structured engines (LSM-Trees) optimize for write throughput, B-Trees optimize for read performance, column-oriented stores optimize for analytical scans, and vector indexes optimize for similarity search. The 2nd edition adds significant new material: cloud data warehouses (Snowflake, BigQuery), query compilation and vectorization (LLVM, Apache Arrow), and vector embeddings for AI/ML workloads — all of which reflect the massive shift in how data is stored and queried in 2026.
Key Concepts
Log-Structured Storage (LSM-Trees, SSTables, Compaction)
The simplest possible database: an append-only log.
# Primitive key-value store
db_set() { echo "$1,$2" >> database.log; }
db_get() { grep "^$1," database.log | tail -n 1 | cut -d',' -f2; }db_setis O(1): just append to the end of the filedb_getis O(n): must scan the entire file
To make reads efficient, add an index — a data structure that maintains additional metadata to help you find data quickly. Every index trades write overhead for read speed.
Hash indexes: The simplest index is an in-memory hash map from key to byte offset in the log file. Bitcask (the Riak storage engine) uses exactly this.
- Reads: O(1) hash lookup → seek to offset → read value
- Writes: O(1) append + update hash map
- Limitation: All keys must fit in RAM; range queries not supported
The compaction problem: An append-only log grows forever. Solution: compaction — merge log segments, throwing away old values and keeping only the latest.
SSTables and LSM-Trees
SSTable (Sorted String Table): A log segment where keys are sorted. Benefits over unsorted hash logs:
- Merging segments is efficient (like merge sort — similar to merge in mergesort)
- No longer need to keep all keys in memory — a sparse index is sufficient (find the range, scan within it)
- Range queries are efficient (keys are sorted)
LSM-Tree (Log-Structured Merge-Tree): The full write path using SSTables:
Write path:
1. Write to in-memory sorted tree (memtable — usually a red-black tree or AVL tree)
2. When memtable exceeds threshold (~4MB), flush to SSTable on disk (Level 0)
3. Background compaction merges Level 0 SSTables into Level 1, Level 1 into Level 2, etc.
Read path:
1. Check memtable first
2. Check Level 0 SSTables (most recent)
3. Check Level 1, Level 2, ... (progressively older)
4. Use Bloom filter to skip SSTables that definitely don't contain the key
Compaction strategies:
- Size-tiered (Cassandra default): When several small SSTables accumulate, merge them into one larger one. Simple but can cause space amplification.
- Leveled (RocksDB, LevelDB default): Each level has a size limit. When a level overflows, some SSTables are merged into the next level. Better space and read amplification, higher write amplification.
Bloom filters: A probabilistic data structure that answers “definitely not in this SSTable” vs “maybe in this SSTable” — avoids reading SSTables that don’t contain the key.
Systems using LSM-Trees: LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, Bigtable (Google). Also TiKV (underlying store for TiDB).
B-Trees
B-Tree: The dominant index structure for relational databases, introduced by Bayer and McCreight in 1972. Unlike LSM-Trees, B-Trees are update-in-place — they modify pages on disk directly.
Structure:
- Data organized in pages (traditionally 4KB, modern SSDs use 16KB-64KB pages)
- Each page either contains a range of keys with child pointers (branch page) or contains actual values (leaf page)
- The branching factor — number of child pointers per page — is typically several hundred
- A 4-level tree with branching factor 500 can address 500^4 = 62.5 trillion key-value pairs
Write path (in-place update):
- Find the leaf page containing the key
- Modify the value in place (overwrite that page on disk)
- If the page is too full, split it into two pages and update the parent
Write-ahead log (WAL): Before modifying a page, write the change to an append-only WAL on disk. If a crash occurs mid-write, the WAL allows recovery to a consistent state.
Concurrency: Latches (lightweight locks) protect pages during writes. LSM-Trees have a simpler concurrency model because they are append-only.
Comparing B-Trees and LSM-Trees
The three key amplification metrics:
- Write amplification: How many times is one logical write amplified into physical disk writes?
- Read amplification: How many disk reads are needed for one logical read?
- Space amplification: How much disk space is used relative to the actual data size?
| Dimension | LSM-Tree | B-Tree | Notes |
|---|---|---|---|
| Write throughput | Higher | Lower | LSM appends sequentially; B-Tree random writes |
| Write amplification | Medium (compaction) | Lower (1 write per change) | But compaction happens in background |
| Read performance | Slower (multi-level) | Faster (predictable O(log n)) | Bloom filters help LSM |
| Read amplification | Higher (check multiple levels) | Lower (single tree traversal) | LSM must check multiple SSTables |
| Space amplification | Higher (multiple copies during compaction) | Lower | LSM can have 10x data size during merge |
| Range queries | Good (SSTables are sorted) | Excellent (B-Tree leaf pages linked) | Both support range queries |
| Crash recovery | Simple (replay memtable WAL) | Complex (WAL + tree repair) | LSM easier to recover |
| Compaction impact | Can spike read/write latency | None (no background compaction) | LSM needs careful tuning |
| Concurrency | Simpler (append-only) | Complex (page-level latches) | LSM easier to implement correctly |
| Best for | Write-heavy workloads | Read-heavy, OLTP | Match to your access pattern |
Key insight from 2nd edition: SSDs have changed the relative performance of B-Trees and LSM-Trees. SSDs handle random writes much better than spinning disks, which reduces one of LSM-Trees’ primary advantages. However, LSM-Trees still win on write throughput for write-heavy workloads.
Multicolumn and Secondary Indexes
Primary index (clustered index): The table is sorted by the primary key. Lookups by primary key are fast.
Secondary indexes: Indexes on non-primary-key columns. Can be sparse (key → row location) or dense (key → value inline).
Multicolumn indexes (composite indexes): Index on multiple columns together.
(last_name, first_name): efficient forWHERE last_name = 'Smith'orWHERE last_name = 'Smith' AND first_name = 'John'; NOT efficient forWHERE first_name = 'John'alone- Leftmost prefix rule: A composite index on (A, B, C) can satisfy queries on A; A+B; or A+B+C — but not B alone or C alone
Geospatial multicolumn indexes: (latitude, longitude) searches don’t fit B-Tree model well. Solutions:
- R-Trees: Space-partitioning tree (used by PostGIS)
- Space-filling curves (Hilbert curves): Map 2D coordinates to 1D key that preserves locality
- Bounding box filters: Pre-filter with a B-Tree on lat/lon ranges, then refine
Full-text indexes: Inverted indexes map each word to the list of documents containing it (see Full-Text Search section).
Storing Values Within the Index
Two approaches for secondary indexes:
-
Heap file approach: Index stores the row location (file offset or primary key). To read the full row, follow the pointer to the heap file. Avoids duplicating data; but requires two lookups per read.
-
Clustered index: Row data stored directly within the index. In MySQL InnoDB, the primary key is always a clustered index. Very fast for reads that access most columns.
-
Covering index (index with included columns): A secondary index that includes additional columns beyond the index key, so the query can be satisfied without touching the heap file (“index-only scan”).
-- Index covering a common query pattern
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)
INCLUDE (created_at, total_amount);
-- Query served entirely from index (no heap access needed)
SELECT created_at, total_amount FROM orders WHERE customer_id = 42 AND status = 'pending';Keeping Everything in Memory
In-memory databases keep all data in RAM and write to disk only for durability:
- Redis: Key-value store with rich data structures (lists, sets, sorted sets, hashes). Persistence via snapshots (RDB) or append-only file (AOF)
- Memcached: Pure cache, no persistence
- VoltDB, MemSQL (SingleStore): In-memory relational databases with full SQL
- Druid: In-memory analytics on time-series data
Why in-memory is faster (counterintuitive insight): Not just because avoiding disk — in-memory data structures can be more flexible. Hash tables and linked lists are impractical on disk but trivial in memory. An in-memory engine can use richer data structures than what a disk-based engine can afford.
Durability approaches:
- Write-ahead log to disk + periodic snapshots
- Disk as backup: Load from disk on startup, operate entirely in memory
- Replication: Use replicas as durability (if all replicas fail, data is lost)
Data Storage for Analytics
Cloud Data Warehouses
OLTP (Online Transaction Processing) vs OLAP (Online Analytical Processing):
| Dimension | OLTP | OLAP |
|---|---|---|
| Primary users | End-users via application | Analysts, data scientists |
| Access pattern | Small reads/writes per query | Large scans, aggregations |
| Dataset size | GB | TB-PB |
| Query pattern | Random access by key | Sequential scan, GROUP BY, SUM |
| Write pattern | Low-latency individual writes | Bulk import (ETL/ELT) |
| Schema | Highly normalized (3NF) | Denormalized star/snowflake |
| Examples | PostgreSQL, MySQL, DynamoDB | Snowflake, BigQuery, Redshift |
Data warehouse: A separate database optimized for analytical queries. Data is loaded from OLTP systems via ETL (Extract-Transform-Load) or ELT pipelines.
Major cloud data warehouses (2026):
| System | Provider | Key Differentiator |
|---|---|---|
| Snowflake | Multi-cloud | Separation of compute/storage; virtual warehouses; zero-copy cloning |
| BigQuery | Google Cloud | Serverless (no cluster management); pay-per-query; native ML features |
| Amazon Redshift | AWS | Columnar storage; tight AWS ecosystem integration; RA3 nodes (managed storage) |
| Databricks Lakehouse | Databricks | Delta Lake (open format); unified batch + streaming; ML-native |
| Azure Synapse | Microsoft | Integration with Power BI; serverless SQL pools; Spark integration |
| DuckDB | Open source | Embedded analytics (runs in-process); columnar; no server required |
| ClickHouse | Open source | Extreme insert throughput; real-time analytics; very fast GROUP BY |
Snowflake architecture (influential model): Storage (S3) is decoupled from compute (virtual warehouses). Multiple warehouses can read the same data concurrently. Queries run on ephemeral compute nodes that read cold data from S3 and cache hot data locally. This separation enables:
- Independent scaling of storage and compute
- Multiple teams use the same data without contention
- Pay only for compute time used
The Lakehouse pattern: Combining data lake (cheap object storage: S3, GCS) with data warehouse query capabilities. Delta Lake (Databricks), Apache Iceberg (Netflix, Apple), and Apache Hudi provide ACID transactions, schema evolution, and time travel on top of Parquet files in object storage.
Column-Oriented Storage
Row-oriented storage: All values for one row are stored together. Reading one column requires loading all rows.
Column-oriented storage: All values for one column are stored together. Reading one column touches only that column’s data.
Row-oriented (one row per page):
page 1: [row1_col1, row1_col2, ..., row1_col100]
page 2: [row2_col1, row2_col2, ..., row2_col100]
Column-oriented (one column per file):
col1_file: [row1_val, row2_val, row3_val, ..., rowN_val]
col2_file: [row1_val, row2_val, row3_val, ..., rowN_val]
Why column storage is faster for analytics:
- Read only needed columns: An analytical query on a 100-column table might touch only 3 columns. Column storage reads 3% of the data; row storage reads 100%.
- Compression is dramatically more effective: A column of integers or repeated values (like a status field with 5 distinct values) compresses far better than a mixed row. Run-length encoding (RLE) on sorted columns is extremely efficient.
- Vectorized processing: CPU SIMD instructions process 8-32 values in a single instruction. Dense arrays of the same type (a column) are ideal for SIMD.
Column compression:
- Dictionary encoding: Map distinct values to integers (
{'pending': 0, 'shipped': 1, 'delivered': 2}). Compress with bitpacking. - Run-length encoding:
AAABBBBCC→A:3, B:4, C:2 - Delta encoding: Store differences between consecutive values instead of absolute values (great for timestamps)
- Bitmap encoding: One bit per row per distinct value — efficient for low-cardinality columns
Sort order in column stores: Data can be sorted by one or more columns. Sorted order enables:
- Better RLE compression (repeated values cluster)
- Efficient range predicates
- “Sort key” in Snowflake/Redshift controls this
Materialized aggregates: Pre-compute common aggregate expressions (SUM, COUNT, AVG) and cache results. A data cube (OLAP cube) stores pre-aggregated results across all dimension combinations. Dramatic speedup for known query patterns but requires upfront definition and storage overhead.
Query Execution: Compilation and Vectorization
Traditional database query execution: Volcano/Iterator model (tuple-at-a-time). Each operator (scan, filter, join, aggregate) implements a next() method that returns one row. Simple to implement but many function calls per row — poor CPU cache utilization.
Vectorized execution (column-at-a-time): Operators process a batch of values (e.g., 1024 rows) at a time. Benefits:
- Far fewer function calls (amortize overhead across batch)
- SIMD instructions apply to the entire batch
- Better CPU cache utilization
- DuckDB, Apache Arrow DataFusion, ClickHouse, and modern Snowflake/BigQuery all use vectorized execution
Volcano model: for each row { filter(row) → project(row) → aggregate(row) }
Vectorized: filter(batch) → project(batch) → aggregate(batch)
← 1 function call per batch, not per row →
Query compilation (code generation): Translate a query plan into optimized machine code at runtime. The generated code is specialized for the specific query — no generic operator overhead.
- LLVM: Most common approach (used by Apache Spark’s Tungsten engine, Amazon Redshift, Snowflake)
- Whole-stage code generation: Combine multiple operators into a single compiled function, eliminating intermediate materialization
- Benefits: 10-100x faster than interpreted execution for CPU-bound queries
- Cost: Compilation adds latency (10ms-1s), less beneficial for tiny queries
Apache Arrow DataFusion: Rust-based query engine using Arrow columnar format + vectorized execution. Used in InfluxDB, DuckDB-Arrow integration, and the emerging “ADBC” (Arrow Database Connectivity) standard.
Materialized Views and Data Cubes
Materialized view: A pre-computed query result stored on disk. Unlike a regular view (which is a stored query executed on demand), a materialized view is a physical copy of data.
-- Materialized view: pre-compute daily revenue per product
CREATE MATERIALIZED VIEW daily_product_revenue AS
SELECT product_id, date_trunc('day', created_at) AS day, SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY 1, 2;- Refresh strategies: On demand (
REFRESH MATERIALIZED VIEW), scheduled, or incremental (only process new data) - Used heavily in dbt for intermediate transformation layers
OLAP cube (data cube): A multi-dimensional pre-aggregation structure. For a fact table with dimensions (product, region, date), an OLAP cube pre-computes all combinations:
- Total sales by product
- Total sales by region
- Total sales by date
- Total sales by product × region
- Total sales by product × date
- Total sales by region × date
- Total sales by product × region × date
Benefits: Any aggregate query over pre-computed dimensions is instant.
Cost: Storage grows exponentially with number of dimensions (2^N combinations). Not feasible for high-cardinality dimensions.
Multidimensional and Full-Text Indexes
Full-Text Search
Full-text search (inverted index): Maps each word/token to the list of documents containing it.
Inverted index:
"database" → [doc1, doc3, doc7, doc12]
"storage" → [doc1, doc4, doc7]
"retrieval" → [doc3, doc7, doc11]
Query: "database AND storage" → [doc1] ∩ [doc4] = [doc1, doc7]
Tokenization and normalization: Before indexing, text is:
- Tokenized: Split into words/tokens
- Normalized: Lowercase, remove punctuation
- Stemmed/lemmatized:
running→run,databases→database - Stop words removed:
the,a,isexcluded from index
Relevance ranking: Inverted indexes also store term frequencies and document frequencies for TF-IDF (Term Frequency-Inverse Document Frequency) ranking:
- TF: How often the term appears in this document
- IDF: Inverse of how often the term appears across all documents (rarer = more relevant)
Systems: Elasticsearch, Apache Solr (both based on Apache Lucene), PostgreSQL tsvector/tsquery, Typesense, Meilisearch
Fuzzy search: Edit-distance-based matching allows typos and misspellings. Computed via finite automata or SIMD-accelerated Levenshtein distance.
Vector Embeddings
Vector embeddings are the most important new topic in the 2nd edition — a data model that didn’t exist at production scale when the 1st edition was written.
What is a vector embedding?
A vector embedding is a list of floating-point numbers (a high-dimensional vector) that represents the semantic meaning of an object — text, image, audio, or code. Objects with similar meaning have vectors that are close together in vector space.
# Example: text embedding using OpenAI/sentence-transformers
embedding_1 = embed("The cat sat on the mat") # → [0.12, -0.45, 0.89, ..., 0.03] # 1536 dims
embedding_2 = embed("A feline rested on a rug") # → [0.11, -0.47, 0.91, ..., 0.02] # 1536 dims
embedding_3 = embed("The stock market crashed") # → [-0.8, 0.23, -0.5, ..., 0.67] # 1536 dims
# cosine_similarity(embedding_1, embedding_2) ≈ 0.97 (semantically similar)
# cosine_similarity(embedding_1, embedding_3) ≈ 0.12 (semantically different)Similarity metrics:
- Cosine similarity: Angle between vectors (most common for text)
- Euclidean distance: Geometric distance (for image embeddings)
- Dot product: Used when vectors are normalized (fastest to compute)
Use cases:
- Semantic search: Find documents semantically similar to a query, not just keyword-matching
- RAG (Retrieval-Augmented Generation): Retrieve relevant context for LLM prompts
- Recommendation systems: Find similar items or users based on embedding proximity
- Image search: Find visually similar images
- Anomaly detection: Identify vectors that are outliers from normal clusters
RAG pattern (critical for 2026 LLM applications):
User query → embed query → find K nearest vectors in vector DB
→ retrieve source documents → inject into LLM prompt
→ LLM generates answer grounded in retrieved documents
Approximate Nearest Neighbor (ANN) algorithms: Exact nearest-neighbor search in high dimensions is O(N × D) where D is dimensionality (768-4096 for modern embeddings). Too slow for large corpora. ANN algorithms trade small accuracy loss for much faster search:
- HNSW (Hierarchical Navigable Small World): Graph-based; builds a multi-layer proximity graph. O(log N) queries. Default in most vector databases. Excellent recall/latency trade-off.
- IVF (Inverted File Index): Cluster vectors (k-means); search only the nearest clusters. Fast but lower recall.
- Product Quantization (PQ): Compress vectors to reduce memory; approximate distances. Used for very large datasets.
- ScaNN (Google): Asymmetric hashing; state-of-the-art recall-latency trade-off.
| Vector DB | Approach | Notes |
|---|---|---|
| Pinecone | Managed, cloud-native | Fully managed; easy to use; expensive |
| Weaviate | HNSW + multimodal | Hybrid vector+keyword; GraphQL API |
| Qdrant | HNSW (Rust) | Very fast; payload filtering; open source |
| Milvus/Zilliz | IVF + HNSW | Large-scale; GPU acceleration support |
| pgvector | HNSW + IVF-Flat | PostgreSQL extension; easy for existing PG users |
| ChromaDB | In-memory HNSW | Local development; Python-first |
Hybrid search: Most production systems combine vector search (semantic similarity) with keyword search (exact/fuzzy match) and re-rank results. This is called hybrid retrieval:
Query → [vector search top-K] + [BM25 keyword search top-K]
→ merge and re-rank with cross-encoder
→ return final results
Comparison Tables
LSM-Tree vs B-Tree: Complete Comparison
| Property | LSM-Tree | B-Tree | Winner |
|---|---|---|---|
| Write throughput | High (sequential appends) | Lower (random in-place) | LSM |
| Write amplification | Higher (compaction rewrites) | Lower (1 write per change) | B-Tree |
| Read performance | Slower (multi-SSTable check) | Faster (single tree path) | B-Tree |
| Read amplification | Higher (check all levels) | Lower (log N pages) | B-Tree |
| Space amplification | Higher (multiple copies) | Lower (in-place update) | B-Tree |
| Range queries | Good (sorted SSTables) | Excellent (linked leaves) | B-Tree (slight edge) |
| Crash recovery | Simple (replay memtable WAL) | Complex (WAL + tree) | LSM |
| Compaction overhead | Background CPU/IO impact | None | B-Tree |
| Concurrency control | Simpler (append-only) | Complex (page latches) | LSM |
| SSD friendliness | Very good (sequential writes) | Good (SSDs handle random) | Tie |
| Compression | Better (sorted data compresses well) | Worse (fragmented pages) | LSM |
| Best workload | Write-heavy (Cassandra, RocksDB) | Read-heavy OLTP (MySQL InnoDB) | Depends |
Storage Engine Taxonomy
| Category | Structure | Examples | Best For |
|---|---|---|---|
| Log-structured | LSM-Tree + SSTable | RocksDB, Cassandra, LevelDB | High write throughput |
| Update-in-place | B-Tree | MySQL InnoDB, PostgreSQL, SQLite | OLTP reads, range queries |
| Column-oriented | Column files + compression | Parquet, Snowflake, BigQuery | Analytical scans, aggregation |
| In-memory | Hash table, sorted tree | Redis, Memcached, VoltDB | Lowest latency, small dataset |
| Vector index | HNSW, IVF | pgvector, Pinecone, Qdrant | Similarity search, ML workloads |
| Full-text | Inverted index | Lucene, Elasticsearch | Text search, relevance ranking |
| Time-series | TSM (time-series merge tree) | InfluxDB, TimescaleDB | Sequential time-based data |
Important Points Summary
- Indexes trade write speed for read speed: Every index slows writes but speeds reads — choose indexes for your actual query patterns, not hypothetical ones.
- LSM-Trees win on write throughput; B-Trees win on read performance: Both remain relevant depending on workload.
- Column storage is transformative for analytics: Reading 3 out of 100 columns means reading 3% of the data — a 33x reduction in IO before any compression.
- Compression in column stores is not just space savings: Run-length encoding enables operating on compressed data without decompressing — multiply performance gains.
- Vectorized execution changes the CPU equation: Processing 1024 values per SIMD instruction vs 1 per function call is a 100-1000x difference for scan-heavy queries.
- Cloud DWs separate storage and compute: Snowflake’s architecture allows independent scaling; you pay only for compute time used.
- Vector embeddings introduce semantic similarity as a query primitive: This is a fundamentally new access pattern — “most similar” rather than “exact match.”
- HNSW is the dominant ANN algorithm: Graph-based, O(log N) query, excellent recall — the default in most vector databases.
- Hybrid search (vector + keyword) outperforms either alone: Production RAG systems combine semantic and lexical matching.
- The Lakehouse pattern is winning: Delta Lake + Iceberg + open Parquet on S3 is replacing proprietary DW formats.
Modern Context (2026)
The Lakehouse has won:
- Delta Lake (Databricks), Apache Iceberg (Netflix/Apple), Apache Hudi (Uber) provide ACID transactions + schema evolution + time travel on Parquet files in object storage
- Most new data warehouses are lakehouses — avoid vendor lock-in while gaining DW features
- Open table formats allow multiple engines (Spark, Trino, Flink, Athena) to read the same data
DuckDB: a paradigm shift for analytics:
- An embedded columnar analytical database — runs in-process (no server)
- Reads Parquet, Arrow, CSV, and JSON natively
- Used for local data analysis, replacing pandas for medium-scale work
- Powers MotherDuck (cloud DuckDB), Observable (notebooks), and many analytics tools
- Demonstrates that columnar vectorized query execution doesn’t require a massive cluster
RocksDB as universal building block:
- RocksDB (LSM-Tree engine by Facebook/Meta) is the underlying storage engine for: TiKV, PingCAP, CockroachDB, Cassandra (optionally), Kafka log storage, MyRocks (MySQL)
- Purpose-built for SSD workloads; used for billions of records in production
Vector search is now table stakes:
- Every major cloud provider now has a managed vector DB or pgvector support
- Amazon Aurora, Azure Cosmos DB, Google AlloyDB all support vector search natively
- pgvector (PostgreSQL) is the default choice for teams already on PostgreSQL
- HNSW indexes added to pgvector 0.5.0 (2023) dramatically improved performance
Query compilation via LLVM:
- Apache Spark’s Tungsten engine uses code generation (WholeStageCodegen)
- DuckDB uses runtime vectorization without LLVM (simpler, faster startup)
- Amazon Redshift CodeGen uses LLVM for select operators
- Trade-off: LLVM compilation adds 100ms-2s startup; worth it only for long-running queries
The end of ETL, rise of ELT:
- Traditional ETL: Extract from source → Transform in pipeline → Load into DW
- Modern ELT: Extract → Load into DW raw → Transform with SQL (dbt) inside the DW
- Cloud DWs are cheap enough to store raw data; SQL is fast enough to transform it on demand
- dbt (data build tool) is the dominant transformation layer
Questions for Reflection
- Why does an LSM-Tree perform better for write-heavy workloads than a B-Tree, even though both eventually write to disk? What specific mechanism explains the difference?
- In column-oriented storage, data for different columns is stored in separate files. How does the database maintain the correspondence between rows across column files (i.e., how does it know that value 47 in the
revenuecolumn belongs to the same row as value2024-01-15in thedatecolumn)? - What makes vectorized query execution faster than the traditional Volcano (tuple-at-a-time) model? In which scenarios does the difference matter most?
- Why is approximate nearest neighbor (ANN) search necessary for vector embeddings, when B-Trees provide exact results for traditional queries? What fundamental property of high-dimensional space makes exact search impractical?
- Snowflake’s architecture separates storage (S3) from compute (virtual warehouses). What are the performance implications of this separation, and how does local caching on compute nodes mitigate them?
- If you were building a RAG (Retrieval-Augmented Generation) system for a legal document corpus with 10 million documents, what storage components would you use and why?
Related Resources
- ch03-data-models-query-languages — The logical models that these storage engines implement
- ch05-encoding-and-evolution — How data is serialized for storage (Parquet, Avro, Protobuf)
- ch06-replication — How storage engines are replicated across machines
- ch07-sharding — How data is partitioned across storage nodes
- ch03-storage-and-retrieval — 1st edition coverage for comparison
Last Updated: 2026-05-29