Chapter 4 Flashcards — Storage and Retrieval

flashcards ddia-2e chapter4 storage retrieval lsm-tree b-tree column-store vector


Definitions and Mechanisms

What is an LSM-Tree and what are the key components of its write path?
?
LSM-Tree (Log-Structured Merge-Tree): A write-optimized storage engine built on:

Write path (3 stages):

  1. Memtable: Writes go to an in-memory sorted tree (red-black tree or AVL tree) + WAL (for crash recovery)
  2. Flush: When memtable exceeds threshold (~4MB), it is flushed to an immutable SSTable on disk (Level 0)
  3. Compaction: Background process merges SSTables across levels, discarding old values and maintaining sorted order

Key insight: All writes are sequential (append to WAL, sorted flush to SSTable) — no random writes to disk.

Bloom filter: Attached to each SSTable to quickly rule out “key definitely not in this SSTable” — avoids reading SSTables that don’t contain the key.

Systems: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable

What is an SSTable and why does sorting keys matter?
?
SSTable (Sorted String Table): An immutable on-disk log segment where keys are sorted.

Why sorting matters:

  1. Efficient merging: Merging multiple SSTables is like merge sort — O(N log K) where K is the number of SSTables
  2. Sparse index: Don’t need to keep all keys in memory; a sparse index (one entry per few KB) is enough to find a key range
  3. Range queries: Keys in sorted order → efficient range scans
  4. Better compression: Adjacent values of the same key/prefix compress better (delta encoding, prefix compression)

Contrast with unsorted hash log: Hash log requires all keys in memory + no range queries.

How does a B-Tree handle writes?
?
B-Tree write algorithm:

  1. Find the leaf page containing the target key (traverse tree from root)
  2. Modify in place: Update the value in the leaf page (overwrite that page on disk)
  3. Split if full: If the leaf page is full after modification, split it into two pages and update the parent page
  4. WAL (Write-Ahead Log): Before modifying any page, write the intended change to the WAL — enables crash recovery

Key property: B-Trees are update-in-place. The same page on disk is overwritten (unlike LSM-Trees which only append).

Branching factor: Typically several hundred child pointers per page. A 4-level B-Tree with branching factor 500 can address 500^4 = 62.5 trillion key-value pairs.

Crash recovery: The WAL allows the database to reconstruct the B-Tree to a consistent state after a crash.

What are the three amplification metrics for storage engines?
?
Three metrics for evaluating storage engine efficiency:

Write Amplification: How many physical disk writes per one logical write?

  • B-Tree: ~1-2x (WAL write + in-place page update)
  • LSM-Tree: higher (data rewritten multiple times during compaction across levels)
  • High write amp → SSD wear, more I/O bandwidth consumed

Read Amplification: How many physical disk reads per one logical read?

  • B-Tree: O(log N) pages along one tree path
  • LSM-Tree: check memtable + Bloom filters + potentially multiple SSTable levels
  • High read amp → slower reads

Space Amplification: Physical disk used / logical data size?

  • B-Tree: ~2x (page fragmentation, free pages)
  • LSM-Tree: up to 10x during active compaction (old + new SSTable copies coexist)
  • High space amp → storage cost

What is column-oriented storage and why is it faster for analytics?
?
Column-oriented storage: All values for a single column are stored together (in the same file or page), rather than storing all values for a row together.

Why it’s faster for analytics (3 reasons):

  1. Read only needed columns: A query touching 3 out of 100 columns reads 3% of the data. Row storage reads 100%.
  2. Compression is dramatically more effective: A column of integers or a status field with 5 distinct values compresses far better than a mixed row. Dictionary encoding + RLE can achieve 10-100x compression.
  3. Vectorized processing: A column is a dense array of the same type — ideal for CPU SIMD instructions that process 8-32 values per instruction. Row storage interleaves types, breaking SIMD efficiency.

Additional benefit: Operating on compressed data without decompressing (bitmap operations on dictionary-encoded columns).

Systems: Snowflake, BigQuery, Amazon Redshift, DuckDB, ClickHouse, Apache Parquet (file format)

What is vectorized execution and how does it differ from the Volcano model?
?
Volcano (iterator) model: Each operator implements next() returning one row at a time.

  • Calls: 1 function call per row × N operators × M rows = enormous call overhead
  • Poor CPU cache utilization (one row processed, then discarded)

Vectorized execution: Operators process a batch of rows (e.g., 1024) at a time.

  • Same operators, but each receives and returns a column-batch instead of one row
  • SIMD: CPU processes 8-32 values per instruction on the batch
  • Function call amortization: 1 function call per 1024 rows vs 1 per row

Example: 1 billion rows, 3 operators:

  • Volcano: 3B function calls
  • Vectorized: ~3M function calls, each processing 1024 values with SIMD
  • Net speedup: ~100-300x for scan-heavy queries

Used by: DuckDB, ClickHouse, Apache Arrow DataFusion, Velox (Meta), modern Snowflake/BigQuery

What is a vector embedding and what similarity metrics are used?
?
Vector embedding: A high-dimensional floating-point vector (typically 256–4096 dimensions) that encodes the semantic meaning of an object (text, image, code, audio).

Key property: Objects with similar meaning have vectors that are close together in the embedding space.

embed("cat sat on mat")   → [0.12, -0.45, 0.89, ...]
embed("feline on rug")    → [0.11, -0.47, 0.91, ...]  # close!
embed("stock market")     → [-0.8,  0.23, -0.5, ...]  # far away

Similarity metrics:

  • Cosine similarity: Angle between vectors. 1.0 = identical direction, 0.0 = orthogonal. Most common for text.
  • Euclidean distance: Geometric distance. Used for image embeddings.
  • Dot product: Fastest; equivalent to cosine when vectors are normalized. Used by OpenAI Ada.

Use cases: Semantic search, RAG (LLM retrieval), recommendation systems, image search, anomaly detection, duplicate detection

What is HNSW and why is approximate nearest neighbor (ANN) search necessary?
?
Why ANN is necessary: Exact nearest neighbor search in D-dimensional space requires computing distance to every vector (O(N × D)). For N=10M, D=1536, that’s 15 billion multiplications per query — too slow for real-time retrieval.

HNSW (Hierarchical Navigable Small World): A graph-based ANN algorithm:

  • Builds a multi-layer graph where each node connects to its nearest neighbors at multiple scales
  • Layer 0 (bottom): full graph with all vectors, each connected to nearby vectors
  • Higher layers: sparser graphs that allow “long jumps” for fast navigation
  • Query: start at top layer, greedily navigate toward query vector, descend to bottom layer

Properties:

  • Query complexity: O(log N) — dramatically faster than exact O(N)
  • Recall: 95-99% accuracy (finds nearly all true nearest neighbors)
  • Memory: High (graph edges stored in RAM for fast traversal)
  • Build time: Moderate (O(N log N))

Used by: pgvector (default in v0.5+), Weaviate, Qdrant, Chroma, Pinecone


Trade-offs and Comparisons

LSM-Tree vs B-Tree — when to use each?
?
Use LSM-Tree when:

  • Write-heavy workloads: High insert/update throughput is critical
  • SSD storage: Sequential writes extend SSD life (less random write wear)
  • Compression matters: LSM stores sorted data → excellent compression ratios
  • Example workloads: Time-series ingestion, event logging, wide-column stores
  • Systems: Cassandra, RocksDB, HBase, ScyllaDB

Use B-Tree when:

  • Read-heavy OLTP: Low-latency reads are critical (single tree path = predictable)
  • Range queries are frequent: B-Tree leaf pages are linked, range scans very efficient
  • Consistency/recovery: Simpler crash recovery story (well-understood, battle-tested)
  • Mixed read/write: Most OLTP apps have more reads than writes
  • Systems: MySQL InnoDB, PostgreSQL, SQLite, Oracle

Key trade-off: LSM wins on write throughput; B-Tree wins on read latency and predictability. Bloom filters partially close LSM’s read gap.

What are the trade-offs of column-oriented storage vs row-oriented storage?
?

DimensionRow-OrientedColumn-Oriented
Full row readFast (one I/O)Slow (N column files)
Single column scanSlow (reads all columns)Fast (reads only that file)
Write patternSimple (append row)Complex (update N files)
CompressionPoor (mixed types)Excellent (homogeneous type)
SIMD effectivenessPoor (mixed types)Excellent (uniform arrays)
OLTP queriesGoodPoor
OLAP aggregationsPoorExcellent

Key insight: Column stores shine for analytical queries that scan 3-10 columns out of 100. Row stores shine for OLTP that reads most columns of specific rows by key.

Modern convergence: Snowflake, BigQuery, and DuckDB are column-oriented but support SQL. PostgreSQL (row-oriented) added columnar storage extensions (pg_analytics, Hydra).

What are the trade-offs of in-memory vs disk-based databases?
?
In-memory advantages:

  • Much faster access: No I/O latency (RAM ~100ns vs SSD ~100µs = 1000x)
  • Richer data structures: Hash tables, linked lists, sorted sets — impractical on disk
  • No serialization overhead: Data stays in native in-memory format

In-memory limitations:

  • Volatile by default: Power loss = data loss (mitigated with WAL, replication, snapshots)
  • Dataset size limited: Must fit in RAM (expensive)
  • Not always faster for large datasets: Disk I/O cost amortized by OS page cache; hot data is cached anyway

Durability approaches for in-memory DBs:

  1. Write-ahead log + periodic snapshots (Redis AOF + RDB)
  2. Disk as a backup medium (load on startup, operate in RAM)
  3. Replication (treat replicas as durability mechanism)

Key insight (from DDIA): The performance advantage of in-memory DBs over disk-based DBs is not primarily about avoiding disk reads (which are cached by the OS). It is about using richer data structures that would be too expensive to serialize/deserialize.

What are the pros and cons of materialized views?
?
Materialized view: A query result pre-computed and stored physically on disk.

Pros:

  • Instant access for the pre-computed query (no recomputation)
  • Can index the materialized view for additional speed
  • Dramatically accelerates BI dashboards with known query patterns

Cons:

  • Storage overhead (duplicate data)
  • Staleness: data may be out of date between refreshes
  • Refresh cost: full refresh is expensive for large datasets; incremental refresh is complex
  • Inflexibility: must know query patterns in advance

When to use:

  • Fixed, frequently-run aggregate queries (e.g., daily revenue by product)
  • BI dashboards with predictable query patterns
  • dbt uses materialized views/tables extensively as intermediate transformation layers

When NOT to use:

  • Ad-hoc analytical queries (patterns unknown in advance)
  • Highly volatile source data (refresh cost > query savings)

Numbers and Precision

What are the key performance numbers for storage engine comparison?
?
B-Tree:

  • Page size: 4KB (traditional) to 64KB (modern SSDs)
  • Branching factor: ~500 child pointers per page
  • 4-level B-Tree at 500 branching factor: 500^4 = 62.5 trillion addressable key-value pairs
  • Read: O(log N) pages (typically 3-4 disk reads for large datasets)

LSM-Tree:

  • Memtable flush threshold: typically 4-256 MB
  • Level size ratio: typically 10x per level (L0: 256MB, L1: 2.56GB, L2: 25.6GB, …)
  • Bloom filter: ~10 bits per key achieves ~1% false positive rate
  • Space amplification during compaction: can reach ~10x temporarily

Vector indexes:

  • Typical embedding dimensions: 256 (small), 768 (BERT), 1536 (OpenAI Ada), 3072 (text-embedding-3-large)
  • HNSW query complexity: O(log N) with 95-99% recall
  • Exact brute-force: O(N × D) — prohibitive at N=10M, D=1536 (15B multiplications)
  • pgvector HNSW: can handle ~1M vectors effectively; dedicated DBs handle billions

What are the key characteristics of a data warehouse query vs OLTP query?
?
OLTP query characteristics:

  • Touches: 10-100 rows (small dataset per query)
  • Uses: primary key or secondary index lookup (random access)
  • Duration: < 10ms (often < 1ms)
  • Concurrency: thousands of simultaneous queries
  • Writes: individual row inserts/updates

OLAP query characteristics:

  • Touches: millions to billions of rows (sequential scan)
  • Uses: column scans + filters + aggregations
  • Duration: seconds to minutes
  • Concurrency: tens of simultaneous queries
  • Writes: batch ETL/ELT loads (bulk inserts)

Key numbers for Snowflake/BigQuery:

  • BigQuery: scans up to ~100GB/second per slot; prices by data scanned
  • Snowflake: virtual warehouses scale from XS (1 node) to 6XL (128+ nodes)
  • ClickHouse: can ingest >1 million rows/second per node; sub-second GROUP BY on billions of rows

Application and Failure Modes

How would you choose between a dedicated vector database and pgvector for a production system?
?
Choose pgvector when:

  • Already using PostgreSQL — avoid operational complexity of another database
  • Dataset is small-to-medium (<5M vectors)
  • Queries combine vector similarity with relational filters (e.g., WHERE user_id = 42 ORDER BY embedding <-> $query LIMIT 10)
  • ACID transactions needed alongside vector operations
  • Team wants single database system

Choose a dedicated vector DB (Pinecone, Weaviate, Qdrant) when:

  • Very large scale (>100M vectors)
  • Need maximum query performance (QPS > 10,000)
  • Need advanced features: multi-vector search, metadata filtering at scale, distributed sharding
  • No existing PostgreSQL dependency

Hybrid approach: Use pgvector for development and small production workloads; migrate to a dedicated vector DB if throughput or scale requirements exceed pgvector’s limits.

Warning signs pgvector is insufficient: Query latency >100ms at target load, dataset approaching 50M+ vectors, requiring complex metadata filtering on billions of vectors.

What are the failure modes of LSM-Tree compaction?
?

  1. Write stalls: If writes arrive faster than compaction can proceed, the memtable fills up and writes must be blocked (“write stall”). Mitigation: tune compaction concurrency, use rate limiting, monitor L0 file count.

  2. Compaction I/O spikes: Background compaction competes with foreground reads/writes for disk bandwidth. Result: unpredictable read/write latency spikes. Mitigation: prioritize foreground I/O, use dedicated compaction disks.

  3. Space amplification: During compaction, old and new SSTables coexist temporarily. Total disk usage can be 2-10x the logical data size. Mitigation: monitor disk usage carefully; size disks with headroom.

  4. Read amplification without Bloom filters: If Bloom filters are too small (low memory), false positive rate rises and the system reads many SSTables for each key lookup. Mitigation: allocate sufficient memory for Bloom filters (10 bits/key is typical).

  5. Tombstone accumulation: Deletes in LSM-Trees write a tombstone marker rather than removing data. Until compaction merges away the tombstone, reads must check for it. In write-heavy workloads with many deletes, tombstones accumulate and hurt read performance.

What are the failure modes of vector search systems?
?

  1. Index build time: HNSW index build for 100M vectors can take hours. Must be done offline; live re-indexing is expensive. Mitigation: plan index builds as batch operations; use delta indexes for new data.

  2. Memory pressure: HNSW stores the full proximity graph in RAM. 100M vectors × 1536 dims × 4 bytes = ~600GB of raw vectors, plus graph edges. Mitigation: use product quantization (PQ) to compress, or use IVF+PQ instead of HNSW.

  3. Recall degradation under filtering: When applying metadata filters with vector search, naively running filtered ANN can return fewer than K results if the filter is very restrictive. Mitigation: use systems that natively support pre-filtering (Qdrant’s payload indexing, Weaviate’s filtering).

  4. Embedding model drift: If the model used to generate stored embeddings changes, old and new embeddings are in different spaces — similarity scores become meaningless. Mitigation: version embeddings; plan for full re-embedding when model changes.

  5. Stale embeddings: Documents updated after embedding don’t reflect their new content until re-embedded. Mitigation: trigger re-embedding on document updates; maintain embedding freshness metadata.

How would you design the storage layer for a RAG (Retrieval-Augmented Generation) system?
?
Components needed:

  1. Document store: Raw documents (S3, PostgreSQL, or document DB). Stores original text and metadata. Needed for reconstructing context to pass to LLM.

  2. Chunk store: Documents split into 256-512 token chunks. Stored alongside their embeddings.

  3. Vector index: HNSW index over chunk embeddings.

    • Small scale (<1M chunks): pgvector
    • Medium scale (1M-50M): pgvector or Qdrant
    • Large scale (>50M): Pinecone, Weaviate, Milvus
  4. Keyword index (optional but recommended): Elasticsearch or Typesense for BM25 keyword search. Hybrid search (vector + keyword) significantly outperforms vector-only.

  5. Metadata filters: Store document metadata (date, source, category) alongside vectors for filtering before or after vector retrieval.

Query flow:

User query → embed (model call) → 
  → vector search top-K (HNSW) + keyword search top-K (BM25)
  → merge and deduplicate
  → optionally re-rank with cross-encoder
  → assemble context → LLM prompt → response

Key trade-off: More chunks = more context coverage but higher retrieval latency and LLM token cost. Typical chunk size: 256-512 tokens with 20% overlap.


Modern Context

What is the Lakehouse architecture and why has it become dominant?
?
Lakehouse = Data Lake (cheap object storage: S3/GCS/ADLS) + Data Warehouse (ACID, schema, queries)

Problem it solves: Traditional data lake (raw files on S3) had no ACID, no schema enforcement, no efficient updates — leading to “data swamps”. Traditional DW (Snowflake, Redshift) required copying data into proprietary formats, creating vendor lock-in and storage cost.

Key components:

  • Apache Iceberg (Netflix, Apple, AWS): Open table format providing ACID, schema evolution, time travel, partition pruning on Parquet files
  • Delta Lake (Databricks): Similar to Iceberg; tighter Spark integration; Delta Sharing for cross-org data
  • Apache Hudi (Uber): Optimized for upserts and incremental processing

Why it has won:

  1. Open format: Multiple engines (Spark, Flink, Trino, Athena, DuckDB) read the same Iceberg/Delta tables
  2. Cost: S3 storage is ~50x cheaper than Snowflake managed storage
  3. No data movement: Compute engines run queries directly on the lake
  4. ACID: Iceberg provides snapshot isolation, concurrent writes, rollback

2026 state: Iceberg is the emerging winner over Delta and Hudi; AWS, Google, and Azure all support it natively.

How has DuckDB changed local and embedded analytics?
?
DuckDB (open source, embedded OLAP engine):

  • Runs in-process — no server, no network, no cluster
  • Columnar vectorized execution (same model as Snowflake/BigQuery)
  • Reads Parquet, Arrow, CSV, JSON, and SQL databases natively
  • Full SQL support including window functions, CTEs, JOINs

What it replaces:

  • pandas for medium-scale data transformation (10GB-1TB)
  • SQLite for analytical queries (SQLite is row-oriented, slow for aggregations)
  • Spark for local development and small-to-medium ETL

Performance: DuckDB is typically 10-100x faster than pandas on group-by/aggregation queries, and comparable to Spark for datasets that fit on a single machine (< a few TB).

Ecosystem impact:

  • MotherDuck: Managed cloud DuckDB (hybrid local + cloud execution)
  • Observable: In-browser analytics using DuckDB-WASM
  • dbt: Native DuckDB adapter for local development
  • Arrow: Zero-copy data exchange between DuckDB, pandas, and Polars

2026 position: DuckDB is the default tool for local data analysis and embedded analytics in Python/R/JS. Replaces many use cases for Spark, Pandas, and SQLite.


Total Cards: 27
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29