Chapter 4 Flashcards — Storage and Retrieval
flashcards ddia-2e chapter4 storage retrieval lsm-tree b-tree column-store vector
Definitions and Mechanisms
What is an LSM-Tree and what are the key components of its write path?
?
LSM-Tree (Log-Structured Merge-Tree): A write-optimized storage engine built on:
Write path (3 stages):
- Memtable: Writes go to an in-memory sorted tree (red-black tree or AVL tree) + WAL (for crash recovery)
- Flush: When memtable exceeds threshold (~4MB), it is flushed to an immutable SSTable on disk (Level 0)
- Compaction: Background process merges SSTables across levels, discarding old values and maintaining sorted order
Key insight: All writes are sequential (append to WAL, sorted flush to SSTable) — no random writes to disk.
Bloom filter: Attached to each SSTable to quickly rule out “key definitely not in this SSTable” — avoids reading SSTables that don’t contain the key.
Systems: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, Bigtable
What is an SSTable and why does sorting keys matter?
?
SSTable (Sorted String Table): An immutable on-disk log segment where keys are sorted.
Why sorting matters:
- Efficient merging: Merging multiple SSTables is like merge sort — O(N log K) where K is the number of SSTables
- Sparse index: Don’t need to keep all keys in memory; a sparse index (one entry per few KB) is enough to find a key range
- Range queries: Keys in sorted order → efficient range scans
- Better compression: Adjacent values of the same key/prefix compress better (delta encoding, prefix compression)
Contrast with unsorted hash log: Hash log requires all keys in memory + no range queries.
How does a B-Tree handle writes?
?
B-Tree write algorithm:
- Find the leaf page containing the target key (traverse tree from root)
- Modify in place: Update the value in the leaf page (overwrite that page on disk)
- Split if full: If the leaf page is full after modification, split it into two pages and update the parent page
- WAL (Write-Ahead Log): Before modifying any page, write the intended change to the WAL — enables crash recovery
Key property: B-Trees are update-in-place. The same page on disk is overwritten (unlike LSM-Trees which only append).
Branching factor: Typically several hundred child pointers per page. A 4-level B-Tree with branching factor 500 can address 500^4 = 62.5 trillion key-value pairs.
Crash recovery: The WAL allows the database to reconstruct the B-Tree to a consistent state after a crash.
What are the three amplification metrics for storage engines?
?
Three metrics for evaluating storage engine efficiency:
Write Amplification: How many physical disk writes per one logical write?
- B-Tree: ~1-2x (WAL write + in-place page update)
- LSM-Tree: higher (data rewritten multiple times during compaction across levels)
- High write amp → SSD wear, more I/O bandwidth consumed
Read Amplification: How many physical disk reads per one logical read?
- B-Tree: O(log N) pages along one tree path
- LSM-Tree: check memtable + Bloom filters + potentially multiple SSTable levels
- High read amp → slower reads
Space Amplification: Physical disk used / logical data size?
- B-Tree: ~2x (page fragmentation, free pages)
- LSM-Tree: up to 10x during active compaction (old + new SSTable copies coexist)
- High space amp → storage cost
What is column-oriented storage and why is it faster for analytics?
?
Column-oriented storage: All values for a single column are stored together (in the same file or page), rather than storing all values for a row together.
Why it’s faster for analytics (3 reasons):
- Read only needed columns: A query touching 3 out of 100 columns reads 3% of the data. Row storage reads 100%.
- Compression is dramatically more effective: A column of integers or a status field with 5 distinct values compresses far better than a mixed row. Dictionary encoding + RLE can achieve 10-100x compression.
- Vectorized processing: A column is a dense array of the same type — ideal for CPU SIMD instructions that process 8-32 values per instruction. Row storage interleaves types, breaking SIMD efficiency.
Additional benefit: Operating on compressed data without decompressing (bitmap operations on dictionary-encoded columns).
Systems: Snowflake, BigQuery, Amazon Redshift, DuckDB, ClickHouse, Apache Parquet (file format)
What is vectorized execution and how does it differ from the Volcano model?
?
Volcano (iterator) model: Each operator implements next() returning one row at a time.
- Calls: 1 function call per row × N operators × M rows = enormous call overhead
- Poor CPU cache utilization (one row processed, then discarded)
Vectorized execution: Operators process a batch of rows (e.g., 1024) at a time.
- Same operators, but each receives and returns a column-batch instead of one row
- SIMD: CPU processes 8-32 values per instruction on the batch
- Function call amortization: 1 function call per 1024 rows vs 1 per row
Example: 1 billion rows, 3 operators:
- Volcano: 3B function calls
- Vectorized: ~3M function calls, each processing 1024 values with SIMD
- Net speedup: ~100-300x for scan-heavy queries
Used by: DuckDB, ClickHouse, Apache Arrow DataFusion, Velox (Meta), modern Snowflake/BigQuery
What is a vector embedding and what similarity metrics are used?
?
Vector embedding: A high-dimensional floating-point vector (typically 256–4096 dimensions) that encodes the semantic meaning of an object (text, image, code, audio).
Key property: Objects with similar meaning have vectors that are close together in the embedding space.
embed("cat sat on mat") → [0.12, -0.45, 0.89, ...]
embed("feline on rug") → [0.11, -0.47, 0.91, ...] # close!
embed("stock market") → [-0.8, 0.23, -0.5, ...] # far awaySimilarity metrics:
- Cosine similarity: Angle between vectors. 1.0 = identical direction, 0.0 = orthogonal. Most common for text.
- Euclidean distance: Geometric distance. Used for image embeddings.
- Dot product: Fastest; equivalent to cosine when vectors are normalized. Used by OpenAI Ada.
Use cases: Semantic search, RAG (LLM retrieval), recommendation systems, image search, anomaly detection, duplicate detection
What is HNSW and why is approximate nearest neighbor (ANN) search necessary?
?
Why ANN is necessary: Exact nearest neighbor search in D-dimensional space requires computing distance to every vector (O(N × D)). For N=10M, D=1536, that’s 15 billion multiplications per query — too slow for real-time retrieval.
HNSW (Hierarchical Navigable Small World): A graph-based ANN algorithm:
- Builds a multi-layer graph where each node connects to its nearest neighbors at multiple scales
- Layer 0 (bottom): full graph with all vectors, each connected to nearby vectors
- Higher layers: sparser graphs that allow “long jumps” for fast navigation
- Query: start at top layer, greedily navigate toward query vector, descend to bottom layer
Properties:
- Query complexity: O(log N) — dramatically faster than exact O(N)
- Recall: 95-99% accuracy (finds nearly all true nearest neighbors)
- Memory: High (graph edges stored in RAM for fast traversal)
- Build time: Moderate (O(N log N))
Used by: pgvector (default in v0.5+), Weaviate, Qdrant, Chroma, Pinecone
Trade-offs and Comparisons
LSM-Tree vs B-Tree — when to use each?
?
Use LSM-Tree when:
- Write-heavy workloads: High insert/update throughput is critical
- SSD storage: Sequential writes extend SSD life (less random write wear)
- Compression matters: LSM stores sorted data → excellent compression ratios
- Example workloads: Time-series ingestion, event logging, wide-column stores
- Systems: Cassandra, RocksDB, HBase, ScyllaDB
Use B-Tree when:
- Read-heavy OLTP: Low-latency reads are critical (single tree path = predictable)
- Range queries are frequent: B-Tree leaf pages are linked, range scans very efficient
- Consistency/recovery: Simpler crash recovery story (well-understood, battle-tested)
- Mixed read/write: Most OLTP apps have more reads than writes
- Systems: MySQL InnoDB, PostgreSQL, SQLite, Oracle
Key trade-off: LSM wins on write throughput; B-Tree wins on read latency and predictability. Bloom filters partially close LSM’s read gap.
What are the trade-offs of column-oriented storage vs row-oriented storage?
?
| Dimension | Row-Oriented | Column-Oriented |
|---|---|---|
| Full row read | Fast (one I/O) | Slow (N column files) |
| Single column scan | Slow (reads all columns) | Fast (reads only that file) |
| Write pattern | Simple (append row) | Complex (update N files) |
| Compression | Poor (mixed types) | Excellent (homogeneous type) |
| SIMD effectiveness | Poor (mixed types) | Excellent (uniform arrays) |
| OLTP queries | Good | Poor |
| OLAP aggregations | Poor | Excellent |
Key insight: Column stores shine for analytical queries that scan 3-10 columns out of 100. Row stores shine for OLTP that reads most columns of specific rows by key.
Modern convergence: Snowflake, BigQuery, and DuckDB are column-oriented but support SQL. PostgreSQL (row-oriented) added columnar storage extensions (pg_analytics, Hydra).
What are the trade-offs of in-memory vs disk-based databases?
?
In-memory advantages:
- Much faster access: No I/O latency (RAM ~100ns vs SSD ~100µs = 1000x)
- Richer data structures: Hash tables, linked lists, sorted sets — impractical on disk
- No serialization overhead: Data stays in native in-memory format
In-memory limitations:
- Volatile by default: Power loss = data loss (mitigated with WAL, replication, snapshots)
- Dataset size limited: Must fit in RAM (expensive)
- Not always faster for large datasets: Disk I/O cost amortized by OS page cache; hot data is cached anyway
Durability approaches for in-memory DBs:
- Write-ahead log + periodic snapshots (Redis AOF + RDB)
- Disk as a backup medium (load on startup, operate in RAM)
- Replication (treat replicas as durability mechanism)
Key insight (from DDIA): The performance advantage of in-memory DBs over disk-based DBs is not primarily about avoiding disk reads (which are cached by the OS). It is about using richer data structures that would be too expensive to serialize/deserialize.
What are the pros and cons of materialized views?
?
Materialized view: A query result pre-computed and stored physically on disk.
Pros:
- Instant access for the pre-computed query (no recomputation)
- Can index the materialized view for additional speed
- Dramatically accelerates BI dashboards with known query patterns
Cons:
- Storage overhead (duplicate data)
- Staleness: data may be out of date between refreshes
- Refresh cost: full refresh is expensive for large datasets; incremental refresh is complex
- Inflexibility: must know query patterns in advance
When to use:
- Fixed, frequently-run aggregate queries (e.g., daily revenue by product)
- BI dashboards with predictable query patterns
- dbt uses materialized views/tables extensively as intermediate transformation layers
When NOT to use:
- Ad-hoc analytical queries (patterns unknown in advance)
- Highly volatile source data (refresh cost > query savings)
Numbers and Precision
What are the key performance numbers for storage engine comparison?
?
B-Tree:
- Page size: 4KB (traditional) to 64KB (modern SSDs)
- Branching factor: ~500 child pointers per page
- 4-level B-Tree at 500 branching factor: 500^4 = 62.5 trillion addressable key-value pairs
- Read: O(log N) pages (typically 3-4 disk reads for large datasets)
LSM-Tree:
- Memtable flush threshold: typically 4-256 MB
- Level size ratio: typically 10x per level (L0: 256MB, L1: 2.56GB, L2: 25.6GB, …)
- Bloom filter: ~10 bits per key achieves ~1% false positive rate
- Space amplification during compaction: can reach ~10x temporarily
Vector indexes:
- Typical embedding dimensions: 256 (small), 768 (BERT), 1536 (OpenAI Ada), 3072 (text-embedding-3-large)
- HNSW query complexity: O(log N) with 95-99% recall
- Exact brute-force: O(N × D) — prohibitive at N=10M, D=1536 (15B multiplications)
- pgvector HNSW: can handle ~1M vectors effectively; dedicated DBs handle billions
What are the key characteristics of a data warehouse query vs OLTP query?
?
OLTP query characteristics:
- Touches: 10-100 rows (small dataset per query)
- Uses: primary key or secondary index lookup (random access)
- Duration: < 10ms (often < 1ms)
- Concurrency: thousands of simultaneous queries
- Writes: individual row inserts/updates
OLAP query characteristics:
- Touches: millions to billions of rows (sequential scan)
- Uses: column scans + filters + aggregations
- Duration: seconds to minutes
- Concurrency: tens of simultaneous queries
- Writes: batch ETL/ELT loads (bulk inserts)
Key numbers for Snowflake/BigQuery:
- BigQuery: scans up to ~100GB/second per slot; prices by data scanned
- Snowflake: virtual warehouses scale from XS (1 node) to 6XL (128+ nodes)
- ClickHouse: can ingest >1 million rows/second per node; sub-second GROUP BY on billions of rows
Application and Failure Modes
How would you choose between a dedicated vector database and pgvector for a production system?
?
Choose pgvector when:
- Already using PostgreSQL — avoid operational complexity of another database
- Dataset is small-to-medium (<5M vectors)
- Queries combine vector similarity with relational filters (e.g.,
WHERE user_id = 42 ORDER BY embedding <-> $query LIMIT 10) - ACID transactions needed alongside vector operations
- Team wants single database system
Choose a dedicated vector DB (Pinecone, Weaviate, Qdrant) when:
- Very large scale (>100M vectors)
- Need maximum query performance (QPS > 10,000)
- Need advanced features: multi-vector search, metadata filtering at scale, distributed sharding
- No existing PostgreSQL dependency
Hybrid approach: Use pgvector for development and small production workloads; migrate to a dedicated vector DB if throughput or scale requirements exceed pgvector’s limits.
Warning signs pgvector is insufficient: Query latency >100ms at target load, dataset approaching 50M+ vectors, requiring complex metadata filtering on billions of vectors.
What are the failure modes of LSM-Tree compaction?
?
-
Write stalls: If writes arrive faster than compaction can proceed, the memtable fills up and writes must be blocked (“write stall”). Mitigation: tune compaction concurrency, use rate limiting, monitor L0 file count.
-
Compaction I/O spikes: Background compaction competes with foreground reads/writes for disk bandwidth. Result: unpredictable read/write latency spikes. Mitigation: prioritize foreground I/O, use dedicated compaction disks.
-
Space amplification: During compaction, old and new SSTables coexist temporarily. Total disk usage can be 2-10x the logical data size. Mitigation: monitor disk usage carefully; size disks with headroom.
-
Read amplification without Bloom filters: If Bloom filters are too small (low memory), false positive rate rises and the system reads many SSTables for each key lookup. Mitigation: allocate sufficient memory for Bloom filters (10 bits/key is typical).
-
Tombstone accumulation: Deletes in LSM-Trees write a tombstone marker rather than removing data. Until compaction merges away the tombstone, reads must check for it. In write-heavy workloads with many deletes, tombstones accumulate and hurt read performance.
What are the failure modes of vector search systems?
?
-
Index build time: HNSW index build for 100M vectors can take hours. Must be done offline; live re-indexing is expensive. Mitigation: plan index builds as batch operations; use delta indexes for new data.
-
Memory pressure: HNSW stores the full proximity graph in RAM. 100M vectors × 1536 dims × 4 bytes = ~600GB of raw vectors, plus graph edges. Mitigation: use product quantization (PQ) to compress, or use IVF+PQ instead of HNSW.
-
Recall degradation under filtering: When applying metadata filters with vector search, naively running filtered ANN can return fewer than K results if the filter is very restrictive. Mitigation: use systems that natively support pre-filtering (Qdrant’s payload indexing, Weaviate’s filtering).
-
Embedding model drift: If the model used to generate stored embeddings changes, old and new embeddings are in different spaces — similarity scores become meaningless. Mitigation: version embeddings; plan for full re-embedding when model changes.
-
Stale embeddings: Documents updated after embedding don’t reflect their new content until re-embedded. Mitigation: trigger re-embedding on document updates; maintain embedding freshness metadata.
How would you design the storage layer for a RAG (Retrieval-Augmented Generation) system?
?
Components needed:
-
Document store: Raw documents (S3, PostgreSQL, or document DB). Stores original text and metadata. Needed for reconstructing context to pass to LLM.
-
Chunk store: Documents split into 256-512 token chunks. Stored alongside their embeddings.
-
Vector index: HNSW index over chunk embeddings.
- Small scale (<1M chunks): pgvector
- Medium scale (1M-50M): pgvector or Qdrant
- Large scale (>50M): Pinecone, Weaviate, Milvus
-
Keyword index (optional but recommended): Elasticsearch or Typesense for BM25 keyword search. Hybrid search (vector + keyword) significantly outperforms vector-only.
-
Metadata filters: Store document metadata (date, source, category) alongside vectors for filtering before or after vector retrieval.
Query flow:
User query → embed (model call) →
→ vector search top-K (HNSW) + keyword search top-K (BM25)
→ merge and deduplicate
→ optionally re-rank with cross-encoder
→ assemble context → LLM prompt → response
Key trade-off: More chunks = more context coverage but higher retrieval latency and LLM token cost. Typical chunk size: 256-512 tokens with 20% overlap.
Modern Context
What is the Lakehouse architecture and why has it become dominant?
?
Lakehouse = Data Lake (cheap object storage: S3/GCS/ADLS) + Data Warehouse (ACID, schema, queries)
Problem it solves: Traditional data lake (raw files on S3) had no ACID, no schema enforcement, no efficient updates — leading to “data swamps”. Traditional DW (Snowflake, Redshift) required copying data into proprietary formats, creating vendor lock-in and storage cost.
Key components:
- Apache Iceberg (Netflix, Apple, AWS): Open table format providing ACID, schema evolution, time travel, partition pruning on Parquet files
- Delta Lake (Databricks): Similar to Iceberg; tighter Spark integration; Delta Sharing for cross-org data
- Apache Hudi (Uber): Optimized for upserts and incremental processing
Why it has won:
- Open format: Multiple engines (Spark, Flink, Trino, Athena, DuckDB) read the same Iceberg/Delta tables
- Cost: S3 storage is ~50x cheaper than Snowflake managed storage
- No data movement: Compute engines run queries directly on the lake
- ACID: Iceberg provides snapshot isolation, concurrent writes, rollback
2026 state: Iceberg is the emerging winner over Delta and Hudi; AWS, Google, and Azure all support it natively.
How has DuckDB changed local and embedded analytics?
?
DuckDB (open source, embedded OLAP engine):
- Runs in-process — no server, no network, no cluster
- Columnar vectorized execution (same model as Snowflake/BigQuery)
- Reads Parquet, Arrow, CSV, JSON, and SQL databases natively
- Full SQL support including window functions, CTEs, JOINs
What it replaces:
- pandas for medium-scale data transformation (10GB-1TB)
- SQLite for analytical queries (SQLite is row-oriented, slow for aggregations)
- Spark for local development and small-to-medium ETL
Performance: DuckDB is typically 10-100x faster than pandas on group-by/aggregation queries, and comparable to Spark for datasets that fit on a single machine (< a few TB).
Ecosystem impact:
- MotherDuck: Managed cloud DuckDB (hybrid local + cloud execution)
- Observable: In-browser analytics using DuckDB-WASM
- dbt: Native DuckDB adapter for local development
- Arrow: Zero-copy data exchange between DuckDB, pandas, and Polars
2026 position: DuckDB is the default tool for local data analysis and embedded analytics in Python/R/JS. Replaces many use cases for Spark, Pandas, and SQLite.
Total Cards: 27
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29