Chapter 4 Flashcards — Storage and Retrieval

flashcards ddia-2e chapter4 storage retrieval lsm-tree b-tree column-store vector

Definitions and Mechanisms

What is an LSM-Tree and what are the key components of its write path?
?
LSM-Tree (Log-Structured Merge-Tree): A write-optimized storage engine built on:

Write path (3 stages):

Memtable: Writes go to an in-memory sorted tree (red-black tree or AVL tree) + WAL (for crash recovery)
Flush: When memtable exceeds threshold (~4MB), it is flushed to an immutable SSTable on disk (Level 0)
Compaction: Background process merges SSTables across levels, discarding old values and maintaining sorted order

Key insight: All writes are sequential (append to WAL, sorted flush to SSTable) — no random writes to disk.

Bloom filter: Attached to each SSTable to quickly rule out “key definitely not in this SSTable” — avoids reading SSTables that don’t contain the key.

Systems: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable

What is an SSTable and why does sorting keys matter?
?
SSTable (Sorted String Table): An immutable on-disk log segment where keys are sorted.

Why sorting matters:

Efficient merging: Merging multiple SSTables is like merge sort — O(N log K) where K is the number of SSTables
Sparse index: Don’t need to keep all keys in memory; a sparse index (one entry per few KB) is enough to find a key range
Range queries: Keys in sorted order → efficient range scans
Better compression: Adjacent values of the same key/prefix compress better (delta encoding, prefix compression)

Contrast with unsorted hash log: Hash log requires all keys in memory + no range queries.

How does a B-Tree handle writes?
?
B-Tree write algorithm:

Find the leaf page containing the target key (traverse tree from root)
Modify in place: Update the value in the leaf page (overwrite that page on disk)
Split if full: If the leaf page is full after modification, split it into two pages and update the parent page
WAL (Write-Ahead Log): Before modifying any page, write the intended change to the WAL — enables crash recovery

Key property: B-Trees are update-in-place. The same page on disk is overwritten (unlike LSM-Trees which only append).

Branching factor: Typically several hundred child pointers per page. A 4-level B-Tree with branching factor 500 can address 500^4 = 62.5 trillion key-value pairs.

Crash recovery: The WAL allows the database to reconstruct the B-Tree to a consistent state after a crash.

What are the three amplification metrics for storage engines?
?
Three metrics for evaluating storage engine efficiency:

Write Amplification: How many physical disk writes per one logical write?

B-Tree: ~1-2x (WAL write + in-place page update)
LSM-Tree: higher (data rewritten multiple times during compaction across levels)
High write amp → SSD wear, more I/O bandwidth consumed

Read Amplification: How many physical disk reads per one logical read?

B-Tree: O(log N) pages along one tree path
LSM-Tree: check memtable + Bloom filters + potentially multiple SSTable levels
High read amp → slower reads

Space Amplification: Physical disk used / logical data size?

B-Tree: ~2x (page fragmentation, free pages)
LSM-Tree: up to 10x during active compaction (old + new SSTable copies coexist)
High space amp → storage cost

What is column-oriented storage and why is it faster for analytics?
?
Column-oriented storage: All values for a single column are stored together (in the same file or page), rather than storing all values for a row together.

Why it’s faster for analytics (3 reasons):

Read only needed columns: A query touching 3 out of 100 columns reads 3% of the data. Row storage reads 100%.
Compression is dramatically more effective: A column of integers or a status field with 5 distinct values compresses far better than a mixed row. Dictionary encoding + RLE can achieve 10-100x compression.
Vectorized processing: A column is a dense array of the same type — ideal for CPU SIMD instructions that process 8-32 values per instruction. Row storage interleaves types, breaking SIMD efficiency.

Additional benefit: Operating on compressed data without decompressing (bitmap operations on dictionary-encoded columns).

Systems: Snowflake, BigQuery, Amazon Redshift, DuckDB, ClickHouse, Apache Parquet (file format)

What is vectorized execution and how does it differ from the Volcano model?
?
Volcano (iterator) model: Each operator implements next() returning one row at a time.

Calls: 1 function call per row × N operators × M rows = enormous call overhead
Poor CPU cache utilization (one row processed, then discarded)

Vectorized execution: Operators process a batch of rows (e.g., 1024) at a time.

Same operators, but each receives and returns a column-batch instead of one row
SIMD: CPU processes 8-32 values per instruction on the batch
Function call amortization: 1 function call per 1024 rows vs 1 per row

Example: 1 billion rows, 3 operators:

Volcano: 3B function calls
Vectorized: ~3M function calls, each processing 1024 values with SIMD
Net speedup: ~100-300x for scan-heavy queries

Used by: DuckDB, ClickHouse, Apache Arrow DataFusion, Velox (Meta), modern Snowflake/BigQuery

What is a vector embedding and what similarity metrics are used?
?
Vector embedding: A high-dimensional floating-point vector (typically 256–4096 dimensions) that encodes the semantic meaning of an object (text, image, code, audio).

Key property: Objects with similar meaning have vectors that are close together in the embedding space.

embed("cat sat on mat")   → [0.12, -0.45, 0.89, ...]
embed("feline on rug")    → [0.11, -0.47, 0.91, ...]  # close!
embed("stock market")     → [-0.8,  0.23, -0.5, ...]  # far away

Similarity metrics:

Cosine similarity: Angle between vectors. 1.0 = identical direction, 0.0 = orthogonal. Most common for text.
Euclidean distance: Geometric distance. Used for image embeddings.
Dot product: Fastest; equivalent to cosine when vectors are normalized. Used by OpenAI Ada.

Use cases: Semantic search, RAG (LLM retrieval), recommendation systems, image search, anomaly detection, duplicate detection

What is HNSW and why is approximate nearest neighbor (ANN) search necessary?
?
Why ANN is necessary: Exact nearest neighbor search in D-dimensional space requires computing distance to every vector (O(N × D)). For N=10M, D=1536, that’s 15 billion multiplications per query — too slow for real-time retrieval.

HNSW (Hierarchical Navigable Small World): A graph-based ANN algorithm:

Builds a multi-layer graph where each node connects to its nearest neighbors at multiple scales
Layer 0 (bottom): full graph with all vectors, each connected to nearby vectors
Higher layers: sparser graphs that allow “long jumps” for fast navigation
Query: start at top layer, greedily navigate toward query vector, descend to bottom layer

Properties:

Query complexity: O(log N) — dramatically faster than exact O(N)
Recall: 95-99% accuracy (finds nearly all true nearest neighbors)
Memory: High (graph edges stored in RAM for fast traversal)
Build time: Moderate (O(N log N))

Used by: pgvector (default in v0.5+), Weaviate, Qdrant, Chroma, Pinecone

Trade-offs and Comparisons

LSM-Tree vs B-Tree — when to use each?
?
Use LSM-Tree when:

Write-heavy workloads: High insert/update throughput is critical
SSD storage: Sequential writes extend SSD life (less random write wear)
Compression matters: LSM stores sorted data → excellent compression ratios
Example workloads: Time-series ingestion, event logging, wide-column stores
Systems: Cassandra, RocksDB, HBase, ScyllaDB

Use B-Tree when:

Read-heavy OLTP: Low-latency reads are critical (single tree path = predictable)
Range queries are frequent: B-Tree leaf pages are linked, range scans very efficient
Consistency/recovery: Simpler crash recovery story (well-understood, battle-tested)
Mixed read/write: Most OLTP apps have more reads than writes
Systems: MySQL InnoDB, PostgreSQL, SQLite, Oracle

Key trade-off: LSM wins on write throughput; B-Tree wins on read latency and predictability. Bloom filters partially close LSM’s read gap.

What are the trade-offs of column-oriented storage vs row-oriented storage?
?

Dimension	Row-Oriented	Column-Oriented
Full row read	Fast (one I/O)	Slow (N column files)
Single column scan	Slow (reads all columns)	Fast (reads only that file)
Write pattern	Simple (append row)	Complex (update N files)
Compression	Poor (mixed types)	Excellent (homogeneous type)
SIMD effectiveness	Poor (mixed types)	Excellent (uniform arrays)
OLTP queries	Good	Poor
OLAP aggregations	Poor	Excellent

Key insight: Column stores shine for analytical queries that scan 3-10 columns out of 100. Row stores shine for OLTP that reads most columns of specific rows by key.

Modern convergence: Snowflake, BigQuery, and DuckDB are column-oriented but support SQL. PostgreSQL (row-oriented) added columnar storage extensions (pg_analytics, Hydra).

What are the trade-offs of in-memory vs disk-based databases?
?
In-memory advantages:

Much faster access: No I/O latency (RAM ~100ns vs SSD ~100µs = 1000x)
Richer data structures: Hash tables, linked lists, sorted sets — impractical on disk
No serialization overhead: Data stays in native in-memory format

In-memory limitations:

Volatile by default: Power loss = data loss (mitigated with WAL, replication, snapshots)
Dataset size limited: Must fit in RAM (expensive)
Not always faster for large datasets: Disk I/O cost amortized by OS page cache; hot data is cached anyway

Durability approaches for in-memory DBs:

Write-ahead log + periodic snapshots (Redis AOF + RDB)
Disk as a backup medium (load on startup, operate in RAM)
Replication (treat replicas as durability mechanism)

Key insight (from DDIA): The performance advantage of in-memory DBs over disk-based DBs is not primarily about avoiding disk reads (which are cached by the OS). It is about using richer data structures that would be too expensive to serialize/deserialize.

What are the pros and cons of materialized views?
?
Materialized view: A query result pre-computed and stored physically on disk.

Pros:

Instant access for the pre-computed query (no recomputation)
Can index the materialized view for additional speed
Dramatically accelerates BI dashboards with known query patterns

Cons:

Storage overhead (duplicate data)
Staleness: data may be out of date between refreshes
Refresh cost: full refresh is expensive for large datasets; incremental refresh is complex
Inflexibility: must know query patterns in advance

When to use:

Fixed, frequently-run aggregate queries (e.g., daily revenue by product)
BI dashboards with predictable query patterns
dbt uses materialized views/tables extensively as intermediate transformation layers

When NOT to use:

Ad-hoc analytical queries (patterns unknown in advance)
Highly volatile source data (refresh cost > query savings)

Numbers and Precision

What are the key performance numbers for storage engine comparison?
?
B-Tree:

Page size: 4KB (traditional) to 64KB (modern SSDs)
Branching factor: ~500 child pointers per page
4-level B-Tree at 500 branching factor: 500^4 = 62.5 trillion addressable key-value pairs
Read: O(log N) pages (typically 3-4 disk reads for large datasets)

LSM-Tree:

Memtable flush threshold: typically 4-256 MB
Level size ratio: typically 10x per level (L0: 256MB, L1: 2.56GB, L2: 25.6GB, …)
Bloom filter: ~10 bits per key achieves ~1% false positive rate
Space amplification during compaction: can reach ~10x temporarily

Vector indexes:

Typical embedding dimensions: 256 (small), 768 (BERT), 1536 (OpenAI Ada), 3072 (text-embedding-3-large)
HNSW query complexity: O(log N) with 95-99% recall
Exact brute-force: O(N × D) — prohibitive at N=10M, D=1536 (15B multiplications)
pgvector HNSW: can handle ~1M vectors effectively; dedicated DBs handle billions

What are the key characteristics of a data warehouse query vs OLTP query?
?
OLTP query characteristics:

Touches: 10-100 rows (small dataset per query)
Uses: primary key or secondary index lookup (random access)
Duration: < 10ms (often < 1ms)
Concurrency: thousands of simultaneous queries
Writes: individual row inserts/updates

OLAP query characteristics:

Touches: millions to billions of rows (sequential scan)
Uses: column scans + filters + aggregations
Duration: seconds to minutes
Concurrency: tens of simultaneous queries
Writes: batch ETL/ELT loads (bulk inserts)

Key numbers for Snowflake/BigQuery:

BigQuery: scans up to ~100GB/second per slot; prices by data scanned
Snowflake: virtual warehouses scale from XS (1 node) to 6XL (128+ nodes)
ClickHouse: can ingest >1 million rows/second per node; sub-second GROUP BY on billions of rows

Application and Failure Modes

How would you choose between a dedicated vector database and pgvector for a production system?
?
Choose pgvector when:

Already using PostgreSQL — avoid operational complexity of another database
Dataset is small-to-medium (<5M vectors)
Queries combine vector similarity with relational filters (e.g., WHERE user_id = 42 ORDER BY embedding <-> $query LIMIT 10)
ACID transactions needed alongside vector operations
Team wants single database system

Choose a dedicated vector DB (Pinecone, Weaviate, Qdrant) when:

Very large scale (>100M vectors)
Need maximum query performance (QPS > 10,000)
Need advanced features: multi-vector search, metadata filtering at scale, distributed sharding
No existing PostgreSQL dependency

Hybrid approach: Use pgvector for development and small production workloads; migrate to a dedicated vector DB if throughput or scale requirements exceed pgvector’s limits.

Warning signs pgvector is insufficient: Query latency >100ms at target load, dataset approaching 50M+ vectors, requiring complex metadata filtering on billions of vectors.

What are the failure modes of LSM-Tree compaction?
?

Write stalls: If writes arrive faster than compaction can proceed, the memtable fills up and writes must be blocked (“write stall”). Mitigation: tune compaction concurrency, use rate limiting, monitor L0 file count.
Compaction I/O spikes: Background compaction competes with foreground reads/writes for disk bandwidth. Result: unpredictable read/write latency spikes. Mitigation: prioritize foreground I/O, use dedicated compaction disks.
Space amplification: During compaction, old and new SSTables coexist temporarily. Total disk usage can be 2-10x the logical data size. Mitigation: monitor disk usage carefully; size disks with headroom.
Read amplification without Bloom filters: If Bloom filters are too small (low memory), false positive rate rises and the system reads many SSTables for each key lookup. Mitigation: allocate sufficient memory for Bloom filters (10 bits/key is typical).
Tombstone accumulation: Deletes in LSM-Trees write a tombstone marker rather than removing data. Until compaction merges away the tombstone, reads must check for it. In write-heavy workloads with many deletes, tombstones accumulate and hurt read performance.

What are the failure modes of vector search systems?
?

Index build time: HNSW index build for 100M vectors can take hours. Must be done offline; live re-indexing is expensive. Mitigation: plan index builds as batch operations; use delta indexes for new data.
Memory pressure: HNSW stores the full proximity graph in RAM. 100M vectors × 1536 dims × 4 bytes = ~600GB of raw vectors, plus graph edges. Mitigation: use product quantization (PQ) to compress, or use IVF+PQ instead of HNSW.
Recall degradation under filtering: When applying metadata filters with vector search, naively running filtered ANN can return fewer than K results if the filter is very restrictive. Mitigation: use systems that natively support pre-filtering (Qdrant’s payload indexing, Weaviate’s filtering).
Embedding model drift: If the model used to generate stored embeddings changes, old and new embeddings are in different spaces — similarity scores become meaningless. Mitigation: version embeddings; plan for full re-embedding when model changes.
Stale embeddings: Documents updated after embedding don’t reflect their new content until re-embedded. Mitigation: trigger re-embedding on document updates; maintain embedding freshness metadata.

How would you design the storage layer for a RAG (Retrieval-Augmented Generation) system?
?
Components needed:

Document store: Raw documents (S3, PostgreSQL, or document DB). Stores original text and metadata. Needed for reconstructing context to pass to LLM.
Chunk store: Documents split into 256-512 token chunks. Stored alongside their embeddings.
Vector index: HNSW index over chunk embeddings.
- Small scale (<1M chunks): pgvector
- Medium scale (1M-50M): pgvector or Qdrant
- Large scale (>50M): Pinecone, Weaviate, Milvus
Keyword index (optional but recommended): Elasticsearch or Typesense for BM25 keyword search. Hybrid search (vector + keyword) significantly outperforms vector-only.
Metadata filters: Store document metadata (date, source, category) alongside vectors for filtering before or after vector retrieval.

Query flow:

User query → embed (model call) → 
  → vector search top-K (HNSW) + keyword search top-K (BM25)
  → merge and deduplicate
  → optionally re-rank with cross-encoder
  → assemble context → LLM prompt → response

Key trade-off: More chunks = more context coverage but higher retrieval latency and LLM token cost. Typical chunk size: 256-512 tokens with 20% overlap.

Modern Context

What is the Lakehouse architecture and why has it become dominant?
?
Lakehouse = Data Lake (cheap object storage: S3/GCS/ADLS) + Data Warehouse (ACID, schema, queries)

Problem it solves: Traditional data lake (raw files on S3) had no ACID, no schema enforcement, no efficient updates — leading to “data swamps”. Traditional DW (Snowflake, Redshift) required copying data into proprietary formats, creating vendor lock-in and storage cost.

Key components:

Apache Iceberg (Netflix, Apple, AWS): Open table format providing ACID, schema evolution, time travel, partition pruning on Parquet files
Delta Lake (Databricks): Similar to Iceberg; tighter Spark integration; Delta Sharing for cross-org data
Apache Hudi (Uber): Optimized for upserts and incremental processing

Why it has won:

Open format: Multiple engines (Spark, Flink, Trino, Athena, DuckDB) read the same Iceberg/Delta tables
Cost: S3 storage is ~50x cheaper than Snowflake managed storage
No data movement: Compute engines run queries directly on the lake
ACID: Iceberg provides snapshot isolation, concurrent writes, rollback

2026 state: Iceberg is the emerging winner over Delta and Hudi; AWS, Google, and Azure all support it natively.

How has DuckDB changed local and embedded analytics?
?
DuckDB (open source, embedded OLAP engine):

Runs in-process — no server, no network, no cluster
Columnar vectorized execution (same model as Snowflake/BigQuery)
Reads Parquet, Arrow, CSV, JSON, and SQL databases natively
Full SQL support including window functions, CTEs, JOINs

What it replaces:

pandas for medium-scale data transformation (10GB-1TB)
SQLite for analytical queries (SQLite is row-oriented, slow for aggregations)
Spark for local development and small-to-medium ETL

Performance: DuckDB is typically 10-100x faster than pandas on group-by/aggregation queries, and comparable to Spark for datasets that fit on a single machine (< a few TB).

Ecosystem impact:

MotherDuck: Managed cloud DuckDB (hybrid local + cloud execution)
Observable: In-browser analytics using DuckDB-WASM
dbt: Native DuckDB adapter for local development
Arrow: Zero-copy data exchange between DuckDB, pandas, and Polars

2026 position: DuckDB is the default tool for local data analysis and embedded analytics in Python/R/JS. Replaces many use cases for Spark, Pandas, and SQLite.

Total Cards: 27
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch04-flashcards