Chapter 4: Storage and Retrieval

ddia-2e storage retrieval lsm-tree b-tree column-store vector-embeddings oltp olap

Status: Notes complete


Overview

Databases perform two fundamental tasks: storing data when you write it, and retrieving it efficiently when you need it. This chapter explains how storage engines actually work — not just what API they expose, but the data structures and algorithms underneath. The key insight is that there is no single best storage engine: log-structured engines (LSM-Trees) optimize for write throughput, B-Trees optimize for read performance, column-oriented stores optimize for analytical scans, and vector indexes optimize for similarity search. The 2nd edition adds significant new material: cloud data warehouses (Snowflake, BigQuery), query compilation and vectorization (LLVM, Apache Arrow), and vector embeddings for AI/ML workloads — all of which reflect the massive shift in how data is stored and queried in 2026.


Key Concepts

Log-Structured Storage (LSM-Trees, SSTables, Compaction)

The simplest possible database: an append-only log.

# Primitive key-value store
db_set() { echo "$1,$2" >> database.log; }
db_get() { grep "^$1," database.log | tail -n 1 | cut -d',' -f2; }
  • db_set is O(1): just append to the end of the file
  • db_get is O(n): must scan the entire file

To make reads efficient, add an index — a data structure that maintains additional metadata to help you find data quickly. Every index trades write overhead for read speed.

Hash indexes: The simplest index is an in-memory hash map from key to byte offset in the log file. Bitcask (the Riak storage engine) uses exactly this.

  • Reads: O(1) hash lookup → seek to offset → read value
  • Writes: O(1) append + update hash map
  • Limitation: All keys must fit in RAM; range queries not supported

The compaction problem: An append-only log grows forever. Solution: compaction — merge log segments, throwing away old values and keeping only the latest.

SSTables and LSM-Trees

SSTable (Sorted String Table): A log segment where keys are sorted. Benefits over unsorted hash logs:

  1. Merging segments is efficient (like merge sort — similar to merge in mergesort)
  2. No longer need to keep all keys in memory — a sparse index is sufficient (find the range, scan within it)
  3. Range queries are efficient (keys are sorted)

LSM-Tree (Log-Structured Merge-Tree): The full write path using SSTables:

Write path:
1. Write to in-memory sorted tree (memtable — usually a red-black tree or AVL tree)
2. When memtable exceeds threshold (~4MB), flush to SSTable on disk (Level 0)
3. Background compaction merges Level 0 SSTables into Level 1, Level 1 into Level 2, etc.

Read path:
1. Check memtable first
2. Check Level 0 SSTables (most recent)
3. Check Level 1, Level 2, ... (progressively older)
4. Use Bloom filter to skip SSTables that definitely don't contain the key

Compaction strategies:

  • Size-tiered (Cassandra default): When several small SSTables accumulate, merge them into one larger one. Simple but can cause space amplification.
  • Leveled (RocksDB, LevelDB default): Each level has a size limit. When a level overflows, some SSTables are merged into the next level. Better space and read amplification, higher write amplification.

Bloom filters: A probabilistic data structure that answers “definitely not in this SSTable” vs “maybe in this SSTable” — avoids reading SSTables that don’t contain the key.

Systems using LSM-Trees: LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, Bigtable (Google). Also TiKV (underlying store for TiDB).

B-Trees

B-Tree: The dominant index structure for relational databases, introduced by Bayer and McCreight in 1972. Unlike LSM-Trees, B-Trees are update-in-place — they modify pages on disk directly.

Structure:

  • Data organized in pages (traditionally 4KB, modern SSDs use 16KB-64KB pages)
  • Each page either contains a range of keys with child pointers (branch page) or contains actual values (leaf page)
  • The branching factor — number of child pointers per page — is typically several hundred
  • A 4-level tree with branching factor 500 can address 500^4 = 62.5 trillion key-value pairs

Write path (in-place update):

  1. Find the leaf page containing the key
  2. Modify the value in place (overwrite that page on disk)
  3. If the page is too full, split it into two pages and update the parent

Write-ahead log (WAL): Before modifying a page, write the change to an append-only WAL on disk. If a crash occurs mid-write, the WAL allows recovery to a consistent state.

Concurrency: Latches (lightweight locks) protect pages during writes. LSM-Trees have a simpler concurrency model because they are append-only.

Comparing B-Trees and LSM-Trees

The three key amplification metrics:

  • Write amplification: How many times is one logical write amplified into physical disk writes?
  • Read amplification: How many disk reads are needed for one logical read?
  • Space amplification: How much disk space is used relative to the actual data size?
DimensionLSM-TreeB-TreeNotes
Write throughputHigherLowerLSM appends sequentially; B-Tree random writes
Write amplificationMedium (compaction)Lower (1 write per change)But compaction happens in background
Read performanceSlower (multi-level)Faster (predictable O(log n))Bloom filters help LSM
Read amplificationHigher (check multiple levels)Lower (single tree traversal)LSM must check multiple SSTables
Space amplificationHigher (multiple copies during compaction)LowerLSM can have 10x data size during merge
Range queriesGood (SSTables are sorted)Excellent (B-Tree leaf pages linked)Both support range queries
Crash recoverySimple (replay memtable WAL)Complex (WAL + tree repair)LSM easier to recover
Compaction impactCan spike read/write latencyNone (no background compaction)LSM needs careful tuning
ConcurrencySimpler (append-only)Complex (page-level latches)LSM easier to implement correctly
Best forWrite-heavy workloadsRead-heavy, OLTPMatch to your access pattern

Key insight from 2nd edition: SSDs have changed the relative performance of B-Trees and LSM-Trees. SSDs handle random writes much better than spinning disks, which reduces one of LSM-Trees’ primary advantages. However, LSM-Trees still win on write throughput for write-heavy workloads.

Multicolumn and Secondary Indexes

Primary index (clustered index): The table is sorted by the primary key. Lookups by primary key are fast.

Secondary indexes: Indexes on non-primary-key columns. Can be sparse (key → row location) or dense (key → value inline).

Multicolumn indexes (composite indexes): Index on multiple columns together.

  • (last_name, first_name): efficient for WHERE last_name = 'Smith' or WHERE last_name = 'Smith' AND first_name = 'John'; NOT efficient for WHERE first_name = 'John' alone
  • Leftmost prefix rule: A composite index on (A, B, C) can satisfy queries on A; A+B; or A+B+C — but not B alone or C alone

Geospatial multicolumn indexes: (latitude, longitude) searches don’t fit B-Tree model well. Solutions:

  • R-Trees: Space-partitioning tree (used by PostGIS)
  • Space-filling curves (Hilbert curves): Map 2D coordinates to 1D key that preserves locality
  • Bounding box filters: Pre-filter with a B-Tree on lat/lon ranges, then refine

Full-text indexes: Inverted indexes map each word to the list of documents containing it (see Full-Text Search section).

Storing Values Within the Index

Two approaches for secondary indexes:

  1. Heap file approach: Index stores the row location (file offset or primary key). To read the full row, follow the pointer to the heap file. Avoids duplicating data; but requires two lookups per read.

  2. Clustered index: Row data stored directly within the index. In MySQL InnoDB, the primary key is always a clustered index. Very fast for reads that access most columns.

  3. Covering index (index with included columns): A secondary index that includes additional columns beyond the index key, so the query can be satisfied without touching the heap file (“index-only scan”).

-- Index covering a common query pattern
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)
INCLUDE (created_at, total_amount);
-- Query served entirely from index (no heap access needed)
SELECT created_at, total_amount FROM orders WHERE customer_id = 42 AND status = 'pending';

Keeping Everything in Memory

In-memory databases keep all data in RAM and write to disk only for durability:

  • Redis: Key-value store with rich data structures (lists, sets, sorted sets, hashes). Persistence via snapshots (RDB) or append-only file (AOF)
  • Memcached: Pure cache, no persistence
  • VoltDB, MemSQL (SingleStore): In-memory relational databases with full SQL
  • Druid: In-memory analytics on time-series data

Why in-memory is faster (counterintuitive insight): Not just because avoiding disk — in-memory data structures can be more flexible. Hash tables and linked lists are impractical on disk but trivial in memory. An in-memory engine can use richer data structures than what a disk-based engine can afford.

Durability approaches:

  • Write-ahead log to disk + periodic snapshots
  • Disk as backup: Load from disk on startup, operate entirely in memory
  • Replication: Use replicas as durability (if all replicas fail, data is lost)

Data Storage for Analytics

Cloud Data Warehouses

OLTP (Online Transaction Processing) vs OLAP (Online Analytical Processing):

DimensionOLTPOLAP
Primary usersEnd-users via applicationAnalysts, data scientists
Access patternSmall reads/writes per queryLarge scans, aggregations
Dataset sizeGBTB-PB
Query patternRandom access by keySequential scan, GROUP BY, SUM
Write patternLow-latency individual writesBulk import (ETL/ELT)
SchemaHighly normalized (3NF)Denormalized star/snowflake
ExamplesPostgreSQL, MySQL, DynamoDBSnowflake, BigQuery, Redshift

Data warehouse: A separate database optimized for analytical queries. Data is loaded from OLTP systems via ETL (Extract-Transform-Load) or ELT pipelines.

Major cloud data warehouses (2026):

SystemProviderKey Differentiator
SnowflakeMulti-cloudSeparation of compute/storage; virtual warehouses; zero-copy cloning
BigQueryGoogle CloudServerless (no cluster management); pay-per-query; native ML features
Amazon RedshiftAWSColumnar storage; tight AWS ecosystem integration; RA3 nodes (managed storage)
Databricks LakehouseDatabricksDelta Lake (open format); unified batch + streaming; ML-native
Azure SynapseMicrosoftIntegration with Power BI; serverless SQL pools; Spark integration
DuckDBOpen sourceEmbedded analytics (runs in-process); columnar; no server required
ClickHouseOpen sourceExtreme insert throughput; real-time analytics; very fast GROUP BY

Snowflake architecture (influential model): Storage (S3) is decoupled from compute (virtual warehouses). Multiple warehouses can read the same data concurrently. Queries run on ephemeral compute nodes that read cold data from S3 and cache hot data locally. This separation enables:

  • Independent scaling of storage and compute
  • Multiple teams use the same data without contention
  • Pay only for compute time used

The Lakehouse pattern: Combining data lake (cheap object storage: S3, GCS) with data warehouse query capabilities. Delta Lake (Databricks), Apache Iceberg (Netflix, Apple), and Apache Hudi provide ACID transactions, schema evolution, and time travel on top of Parquet files in object storage.

Column-Oriented Storage

Row-oriented storage: All values for one row are stored together. Reading one column requires loading all rows.

Column-oriented storage: All values for one column are stored together. Reading one column touches only that column’s data.

Row-oriented (one row per page):
page 1: [row1_col1, row1_col2, ..., row1_col100]
page 2: [row2_col1, row2_col2, ..., row2_col100]

Column-oriented (one column per file):
col1_file: [row1_val, row2_val, row3_val, ..., rowN_val]
col2_file: [row1_val, row2_val, row3_val, ..., rowN_val]

Why column storage is faster for analytics:

  1. Read only needed columns: An analytical query on a 100-column table might touch only 3 columns. Column storage reads 3% of the data; row storage reads 100%.
  2. Compression is dramatically more effective: A column of integers or repeated values (like a status field with 5 distinct values) compresses far better than a mixed row. Run-length encoding (RLE) on sorted columns is extremely efficient.
  3. Vectorized processing: CPU SIMD instructions process 8-32 values in a single instruction. Dense arrays of the same type (a column) are ideal for SIMD.

Column compression:

  • Dictionary encoding: Map distinct values to integers ({'pending': 0, 'shipped': 1, 'delivered': 2}). Compress with bitpacking.
  • Run-length encoding: AAABBBBCCA:3, B:4, C:2
  • Delta encoding: Store differences between consecutive values instead of absolute values (great for timestamps)
  • Bitmap encoding: One bit per row per distinct value — efficient for low-cardinality columns

Sort order in column stores: Data can be sorted by one or more columns. Sorted order enables:

  • Better RLE compression (repeated values cluster)
  • Efficient range predicates
  • “Sort key” in Snowflake/Redshift controls this

Materialized aggregates: Pre-compute common aggregate expressions (SUM, COUNT, AVG) and cache results. A data cube (OLAP cube) stores pre-aggregated results across all dimension combinations. Dramatic speedup for known query patterns but requires upfront definition and storage overhead.

Query Execution: Compilation and Vectorization

Traditional database query execution: Volcano/Iterator model (tuple-at-a-time). Each operator (scan, filter, join, aggregate) implements a next() method that returns one row. Simple to implement but many function calls per row — poor CPU cache utilization.

Vectorized execution (column-at-a-time): Operators process a batch of values (e.g., 1024 rows) at a time. Benefits:

  • Far fewer function calls (amortize overhead across batch)
  • SIMD instructions apply to the entire batch
  • Better CPU cache utilization
  • DuckDB, Apache Arrow DataFusion, ClickHouse, and modern Snowflake/BigQuery all use vectorized execution
Volcano model: for each row { filter(row) → project(row) → aggregate(row) }
Vectorized:    filter(batch) → project(batch) → aggregate(batch)
               ← 1 function call per batch, not per row →

Query compilation (code generation): Translate a query plan into optimized machine code at runtime. The generated code is specialized for the specific query — no generic operator overhead.

  • LLVM: Most common approach (used by Apache Spark’s Tungsten engine, Amazon Redshift, Snowflake)
  • Whole-stage code generation: Combine multiple operators into a single compiled function, eliminating intermediate materialization
  • Benefits: 10-100x faster than interpreted execution for CPU-bound queries
  • Cost: Compilation adds latency (10ms-1s), less beneficial for tiny queries

Apache Arrow DataFusion: Rust-based query engine using Arrow columnar format + vectorized execution. Used in InfluxDB, DuckDB-Arrow integration, and the emerging “ADBC” (Arrow Database Connectivity) standard.

Materialized Views and Data Cubes

Materialized view: A pre-computed query result stored on disk. Unlike a regular view (which is a stored query executed on demand), a materialized view is a physical copy of data.

-- Materialized view: pre-compute daily revenue per product
CREATE MATERIALIZED VIEW daily_product_revenue AS
SELECT product_id, date_trunc('day', created_at) AS day, SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY 1, 2;
  • Refresh strategies: On demand (REFRESH MATERIALIZED VIEW), scheduled, or incremental (only process new data)
  • Used heavily in dbt for intermediate transformation layers

OLAP cube (data cube): A multi-dimensional pre-aggregation structure. For a fact table with dimensions (product, region, date), an OLAP cube pre-computes all combinations:

  • Total sales by product
  • Total sales by region
  • Total sales by date
  • Total sales by product × region
  • Total sales by product × date
  • Total sales by region × date
  • Total sales by product × region × date

Benefits: Any aggregate query over pre-computed dimensions is instant.
Cost: Storage grows exponentially with number of dimensions (2^N combinations). Not feasible for high-cardinality dimensions.


Multidimensional and Full-Text Indexes

Full-text search (inverted index): Maps each word/token to the list of documents containing it.

Inverted index:
"database" → [doc1, doc3, doc7, doc12]
"storage"  → [doc1, doc4, doc7]
"retrieval" → [doc3, doc7, doc11]

Query: "database AND storage" → [doc1] ∩ [doc4] = [doc1, doc7]

Tokenization and normalization: Before indexing, text is:

  1. Tokenized: Split into words/tokens
  2. Normalized: Lowercase, remove punctuation
  3. Stemmed/lemmatized: runningrun, databasesdatabase
  4. Stop words removed: the, a, is excluded from index

Relevance ranking: Inverted indexes also store term frequencies and document frequencies for TF-IDF (Term Frequency-Inverse Document Frequency) ranking:

  • TF: How often the term appears in this document
  • IDF: Inverse of how often the term appears across all documents (rarer = more relevant)

Systems: Elasticsearch, Apache Solr (both based on Apache Lucene), PostgreSQL tsvector/tsquery, Typesense, Meilisearch

Fuzzy search: Edit-distance-based matching allows typos and misspellings. Computed via finite automata or SIMD-accelerated Levenshtein distance.

Vector Embeddings

Vector embeddings are the most important new topic in the 2nd edition — a data model that didn’t exist at production scale when the 1st edition was written.

What is a vector embedding?
A vector embedding is a list of floating-point numbers (a high-dimensional vector) that represents the semantic meaning of an object — text, image, audio, or code. Objects with similar meaning have vectors that are close together in vector space.

# Example: text embedding using OpenAI/sentence-transformers
embedding_1 = embed("The cat sat on the mat")   # → [0.12, -0.45, 0.89, ..., 0.03]  # 1536 dims
embedding_2 = embed("A feline rested on a rug") # → [0.11, -0.47, 0.91, ..., 0.02]  # 1536 dims
embedding_3 = embed("The stock market crashed")  # → [-0.8,  0.23, -0.5, ..., 0.67]  # 1536 dims
 
# cosine_similarity(embedding_1, embedding_2) ≈ 0.97  (semantically similar)
# cosine_similarity(embedding_1, embedding_3) ≈ 0.12  (semantically different)

Similarity metrics:

  • Cosine similarity: Angle between vectors (most common for text)
  • Euclidean distance: Geometric distance (for image embeddings)
  • Dot product: Used when vectors are normalized (fastest to compute)

Use cases:

  1. Semantic search: Find documents semantically similar to a query, not just keyword-matching
  2. RAG (Retrieval-Augmented Generation): Retrieve relevant context for LLM prompts
  3. Recommendation systems: Find similar items or users based on embedding proximity
  4. Image search: Find visually similar images
  5. Anomaly detection: Identify vectors that are outliers from normal clusters

RAG pattern (critical for 2026 LLM applications):

User query → embed query → find K nearest vectors in vector DB
           → retrieve source documents → inject into LLM prompt
           → LLM generates answer grounded in retrieved documents

Approximate Nearest Neighbor (ANN) algorithms: Exact nearest-neighbor search in high dimensions is O(N × D) where D is dimensionality (768-4096 for modern embeddings). Too slow for large corpora. ANN algorithms trade small accuracy loss for much faster search:

  • HNSW (Hierarchical Navigable Small World): Graph-based; builds a multi-layer proximity graph. O(log N) queries. Default in most vector databases. Excellent recall/latency trade-off.
  • IVF (Inverted File Index): Cluster vectors (k-means); search only the nearest clusters. Fast but lower recall.
  • Product Quantization (PQ): Compress vectors to reduce memory; approximate distances. Used for very large datasets.
  • ScaNN (Google): Asymmetric hashing; state-of-the-art recall-latency trade-off.
Vector DBApproachNotes
PineconeManaged, cloud-nativeFully managed; easy to use; expensive
WeaviateHNSW + multimodalHybrid vector+keyword; GraphQL API
QdrantHNSW (Rust)Very fast; payload filtering; open source
Milvus/ZillizIVF + HNSWLarge-scale; GPU acceleration support
pgvectorHNSW + IVF-FlatPostgreSQL extension; easy for existing PG users
ChromaDBIn-memory HNSWLocal development; Python-first

Hybrid search: Most production systems combine vector search (semantic similarity) with keyword search (exact/fuzzy match) and re-rank results. This is called hybrid retrieval:

Query → [vector search top-K] + [BM25 keyword search top-K]
      → merge and re-rank with cross-encoder
      → return final results

Comparison Tables

LSM-Tree vs B-Tree: Complete Comparison

PropertyLSM-TreeB-TreeWinner
Write throughputHigh (sequential appends)Lower (random in-place)LSM
Write amplificationHigher (compaction rewrites)Lower (1 write per change)B-Tree
Read performanceSlower (multi-SSTable check)Faster (single tree path)B-Tree
Read amplificationHigher (check all levels)Lower (log N pages)B-Tree
Space amplificationHigher (multiple copies)Lower (in-place update)B-Tree
Range queriesGood (sorted SSTables)Excellent (linked leaves)B-Tree (slight edge)
Crash recoverySimple (replay memtable WAL)Complex (WAL + tree)LSM
Compaction overheadBackground CPU/IO impactNoneB-Tree
Concurrency controlSimpler (append-only)Complex (page latches)LSM
SSD friendlinessVery good (sequential writes)Good (SSDs handle random)Tie
CompressionBetter (sorted data compresses well)Worse (fragmented pages)LSM
Best workloadWrite-heavy (Cassandra, RocksDB)Read-heavy OLTP (MySQL InnoDB)Depends

Storage Engine Taxonomy

CategoryStructureExamplesBest For
Log-structuredLSM-Tree + SSTableRocksDB, Cassandra, LevelDBHigh write throughput
Update-in-placeB-TreeMySQL InnoDB, PostgreSQL, SQLiteOLTP reads, range queries
Column-orientedColumn files + compressionParquet, Snowflake, BigQueryAnalytical scans, aggregation
In-memoryHash table, sorted treeRedis, Memcached, VoltDBLowest latency, small dataset
Vector indexHNSW, IVFpgvector, Pinecone, QdrantSimilarity search, ML workloads
Full-textInverted indexLucene, ElasticsearchText search, relevance ranking
Time-seriesTSM (time-series merge tree)InfluxDB, TimescaleDBSequential time-based data

Important Points Summary

  • Indexes trade write speed for read speed: Every index slows writes but speeds reads — choose indexes for your actual query patterns, not hypothetical ones.
  • LSM-Trees win on write throughput; B-Trees win on read performance: Both remain relevant depending on workload.
  • Column storage is transformative for analytics: Reading 3 out of 100 columns means reading 3% of the data — a 33x reduction in IO before any compression.
  • Compression in column stores is not just space savings: Run-length encoding enables operating on compressed data without decompressing — multiply performance gains.
  • Vectorized execution changes the CPU equation: Processing 1024 values per SIMD instruction vs 1 per function call is a 100-1000x difference for scan-heavy queries.
  • Cloud DWs separate storage and compute: Snowflake’s architecture allows independent scaling; you pay only for compute time used.
  • Vector embeddings introduce semantic similarity as a query primitive: This is a fundamentally new access pattern — “most similar” rather than “exact match.”
  • HNSW is the dominant ANN algorithm: Graph-based, O(log N) query, excellent recall — the default in most vector databases.
  • Hybrid search (vector + keyword) outperforms either alone: Production RAG systems combine semantic and lexical matching.
  • The Lakehouse pattern is winning: Delta Lake + Iceberg + open Parquet on S3 is replacing proprietary DW formats.

Modern Context (2026)

The Lakehouse has won:

  • Delta Lake (Databricks), Apache Iceberg (Netflix/Apple), Apache Hudi (Uber) provide ACID transactions + schema evolution + time travel on Parquet files in object storage
  • Most new data warehouses are lakehouses — avoid vendor lock-in while gaining DW features
  • Open table formats allow multiple engines (Spark, Trino, Flink, Athena) to read the same data

DuckDB: a paradigm shift for analytics:

  • An embedded columnar analytical database — runs in-process (no server)
  • Reads Parquet, Arrow, CSV, and JSON natively
  • Used for local data analysis, replacing pandas for medium-scale work
  • Powers MotherDuck (cloud DuckDB), Observable (notebooks), and many analytics tools
  • Demonstrates that columnar vectorized query execution doesn’t require a massive cluster

RocksDB as universal building block:

  • RocksDB (LSM-Tree engine by Facebook/Meta) is the underlying storage engine for: TiKV, PingCAP, CockroachDB, Cassandra (optionally), Kafka log storage, MyRocks (MySQL)
  • Purpose-built for SSD workloads; used for billions of records in production

Vector search is now table stakes:

  • Every major cloud provider now has a managed vector DB or pgvector support
  • Amazon Aurora, Azure Cosmos DB, Google AlloyDB all support vector search natively
  • pgvector (PostgreSQL) is the default choice for teams already on PostgreSQL
  • HNSW indexes added to pgvector 0.5.0 (2023) dramatically improved performance

Query compilation via LLVM:

  • Apache Spark’s Tungsten engine uses code generation (WholeStageCodegen)
  • DuckDB uses runtime vectorization without LLVM (simpler, faster startup)
  • Amazon Redshift CodeGen uses LLVM for select operators
  • Trade-off: LLVM compilation adds 100ms-2s startup; worth it only for long-running queries

The end of ETL, rise of ELT:

  • Traditional ETL: Extract from source → Transform in pipeline → Load into DW
  • Modern ELT: Extract → Load into DW raw → Transform with SQL (dbt) inside the DW
  • Cloud DWs are cheap enough to store raw data; SQL is fast enough to transform it on demand
  • dbt (data build tool) is the dominant transformation layer

Questions for Reflection

  1. Why does an LSM-Tree perform better for write-heavy workloads than a B-Tree, even though both eventually write to disk? What specific mechanism explains the difference?
  2. In column-oriented storage, data for different columns is stored in separate files. How does the database maintain the correspondence between rows across column files (i.e., how does it know that value 47 in the revenue column belongs to the same row as value 2024-01-15 in the date column)?
  3. What makes vectorized query execution faster than the traditional Volcano (tuple-at-a-time) model? In which scenarios does the difference matter most?
  4. Why is approximate nearest neighbor (ANN) search necessary for vector embeddings, when B-Trees provide exact results for traditional queries? What fundamental property of high-dimensional space makes exact search impractical?
  5. Snowflake’s architecture separates storage (S3) from compute (virtual warehouses). What are the performance implications of this separation, and how does local caching on compute nodes mitigate them?
  6. If you were building a RAG (Retrieval-Augmented Generation) system for a legal document corpus with 10 million documents, what storage components would you use and why?

Last Updated: 2026-05-29