Chapter 3: Storage and Retrieval
Overview
To use databases effectively, you need to understand how they store and retrieve data internally. This chapter covers two families of storage engines: log-structured (append-only, LSM-trees) and page-oriented (B-trees), as well as the difference between OLTP (transactional) and OLAP (analytical) workloads. The core insight is that storage engines make radically different trade-offs depending on whether reads or writes are the bottleneck, and whether queries are point lookups or full-table analytics.
Fundamental Trade-off: Write-optimized storage (append-only logs, LSM-trees) vs read-optimized storage (B-trees, column stores) vs analytics-optimized storage (data warehouses, columnar formats).
Key Concepts
Log-Structured Storage: Hash Indexes
Simplest Database:
- Append-only log file:
echo "key,value" >> db.txt - Read: scan from beginning (O(n)) or maintain in-memory hash map
- Problem: Log grows forever → need compaction and segmentation
Bitcask (Riak’s storage engine):
- All writes append to log segment file
- In-memory hash map: key → byte offset in log file
- Read: look up offset in hash map, seek to position in file
- Fast writes (sequential append), fast reads (O(1) hash + one disk seek)
- Limitation: All keys must fit in RAM (hash map is in-memory only)
Log Compaction:
- Compaction: Throw away duplicate keys, keep only most recent value
- Segmentation: Split log into fixed-size segments
- Compaction runs on frozen segments in background, then merge segments
- Old segments deleted after merge; reads served from merged result
Append-only advantages:
- Sequential writes much faster than random writes (especially on HDDs)
- Crash recovery: partially written records detectable (checksums)
- No fragmentation: old values never updated in-place
- Concurrency: immutable segments are easy to handle
Hash index limitations:
- Keys must fit in RAM (no disk-based hash maps that work well)
- Range queries not efficient (must scan all keys)
SSTables and LSM-Trees
SSTable (Sorted String Table):
- Like segmented log, but keys sorted by key within each segment
- Merge segments efficiently: like merge sort (already sorted)
- Sparse in-memory index: only need offset of every few KB
- Range queries efficient: seek to start, scan forward
LSM-Tree (Log-Structured Merge-Tree):
Used by: LevelDB, RocksDB, Cassandra, HBase, Lucene
Write path:
- Write to in-memory sorted structure (memtable, often a red-black tree)
- When memtable exceeds threshold, flush to disk as an SSTable
- Background: merge and compact SSTables (compaction)
Read path:
- Check memtable (most recent writes)
- Check most recent on-disk SSTable
- Check older SSTables (in order from newest to oldest)
- Bloom filter: probabilistic structure to skip SSTables that don’t contain key
LSM advantages:
- Very high write throughput (sequential writes only)
- Compressed data (better than B-trees)
- Efficient for range queries (sorted order)
LSM disadvantages:
- Reads can be slow (multiple SSTable levels to check)
- Compaction can interfere with ongoing reads/writes (I/O competition)
- Write amplification: One logical write → many physical writes over lifetime
B-Trees
B-Tree structure:
- Fixed-size pages (blocks), typically 4KB
- Tree of pages: each page contains sorted keys + references to child pages
- Leaf pages: contain actual values (or references to values)
- Branching factor: Number of children per page (typically several hundred)
- Height: 4 levels supports ~256TB for 4KB pages with branching factor 500
Write path:
- Locate leaf page containing key range
- Update value in-place (or add new key, split page if full)
- Write modified page back to disk
B-Tree vs LSM-Tree:
| B-Tree | LSM-Tree | |
|---|---|---|
| Write performance | Slower (random I/O) | Faster (sequential I/O) |
| Read performance | Fast (predictable) | Slower (multiple levels) |
| Space | More overhead | Better compression |
| Write amplification | High (WAL + page writes) | High (compaction) |
| Transactions | Easier (pages in-place) | More complex |
Write-Ahead Log (WAL):
- Before modifying B-tree page, write change to WAL (append-only)
- If crash during write, WAL used to restore B-tree to consistent state
- Every write recorded twice: WAL + actual page
Optimizations:
- Copy-on-write (instead of WAL): Write new page, update parent reference atomically
- Abbreviate keys in interior nodes (save space)
- Lay out pages sequentially on disk for range scans
- Fractal tree indexes: absorb writes, pass down lazily
Other Indexing Structures
Secondary Indexes:
- Index on non-primary key column (e.g.,
CREATE INDEX ON users(email)) - Values can point to primary key (index covers: clustered index)
- Or values can be a list of row identifiers (non-covering)
Clustered Index: Stores actual row data in the index (InnoDB stores row data in PK index)
Covering Index (Index with Included Columns): Stores some columns in index to answer queries without heap lookup
Multi-Column Indexes:
- Concatenated index: combines multiple fields (e.g., last_name + first_name)
- Spatial indexes (R-trees): efficient 2D/3D range queries (PostGIS)
- Full-text search indexes (Lucene): fuzzy matching, linguistic analysis
In-Memory Databases (Memcached, Redis, VoltDB):
- Entire dataset in RAM
- No penalty for writing to durable format
- Redis: persistence via snapshotting or AOF (append-only file)
- Advantage: avoid overhead of encoding in-memory structures for disk
OLTP vs OLAP
OLTP (Online Transaction Processing):
- Low-latency reads/writes, small number of records per query
- Indexed queries by primary key
- Updates driven by user actions (orders, page views, etc.)
- Access pattern: Look up small number of records, update based on user input
OLAP (Online Analytical Processing):
- Large-scale aggregate queries over many records
- Full table scans, group by, aggregations
- Updates via bulk load (ETL) or stream
- Access pattern: Scan millions of records, read few columns, compute aggregates
| OLTP | OLAP | |
|---|---|---|
| Main reads | Small number of records by key | Aggregate over many records |
| Main writes | Random access, low latency | Bulk import (ETL) or event stream |
| Used by | End users, applications | Internal analysts, BI tools |
| Size | GB to TB | TB to PB |
| DB examples | MySQL, PostgreSQL | Redshift, BigQuery, Snowflake |
Data Warehouses
Motivation: Don’t run analytics on OLTP DB (impacts performance, different index structure needed)
ETL (Extract-Transform-Load):
- Extract from OLTP systems
- Transform into analysis-friendly schema
- Load into warehouse
Star Schema (Dimensional Modeling):
- Fact table: Each row = event (sale, page view, click)
- Dimension tables: Who, what, where, when, how, why of events
- Query joins fact table to dimension tables
- Queries tend to scan millions of fact rows, read few columns
Snowflake Schema: Dimension tables further normalized into sub-dimensions (more normalized than star)
Column-Oriented Storage
Row-oriented (traditional): Each row stored together → great for OLTP
Column-oriented: Each column stored together → great for OLAP aggregates
Column storage advantages for analytics:
- Only read columns needed by query (not entire row)
- Columns compress very well (repetitive values)
- Vectorized processing: operate on columns directly without decoding
Column Compression:
- Bitmap encoding: For columns with few distinct values (e.g., status = active/inactive)
- Run-length encoding:
active, active, active, inactive→3×active, 1×inactive - Compressed bitmaps enable fast bitwise AND/OR operations
Sort order in column stores:
- Can sort rows by most-queried column (like clustered index)
- Redundant copies with different sort orders (C-store, Vertica)
- Sort order chosen to help most common queries AND improve compression
Materialized Views and Data Cubes:
- Materialized view: Pre-computed query result stored on disk
- Data cube (OLAP cube): Aggregation by each combination of dimensions
- Fast for queries that match pre-computed aggregations
- Downside: not as flexible as raw data for ad-hoc queries
Important Points
- LSM-trees optimize writes; B-trees optimize reads: Choose based on your workload.
- Write amplification is real for both: B-trees write pages twice (WAL + page); LSM-trees rewrite data during compaction.
- Bloom filters are essential for LSM-tree reads: Without them, every read checks every SSTable level.
- Column storage is the key innovation in modern data warehouses: Row stores can’t compete for analytics.
- OLAP and OLTP need separate storage engines: Mixing them hurts both; use ETL to bridge them.
- In-memory DBs are not just about speed: They enable different data structures (priority queues, etc.) not efficient on disk.
- Compaction is LSM-tree’s Achilles’ heel at high write volumes: Can saturate I/O and impact reads.
Examples & Case Studies
-
LevelDB / RocksDB (LSM-Tree)
- Used as storage engine for Cassandra, HBase, TiKV, many others
- RocksDB: Facebook’s fork, optimized for NVMe SSDs
- Used inside: MyRocks (MySQL), MongoRocks (MongoDB)
-
InnoDB (B-Tree)
- MySQL’s default storage engine
- Clustered index: rows stored in primary key order
- Separate secondary indexes point back to primary key
-
Vertica / Snowflake (Column Store)
- Purpose-built for OLAP analytics
- Redundant column sorts, aggressive compression
- Query optimizer chooses best sort order at query time
-
Bitcask (Hash Index)
- Riak’s default storage engine
- All keys in RAM, values on disk
- Best for workloads with small key space, frequent updates to same keys
Questions
- Why are sequential writes faster than random writes, and how do LSM-trees exploit this?
- When would you use an LSM-tree storage engine vs a B-tree?
- How does compaction work in LSM-trees, and what are its downsides?
- What is write amplification and why does it matter?
- Why is column-oriented storage better for analytics than row-oriented?
- When should you use an OLAP warehouse vs querying your OLTP database directly?
- What is a Bloom filter and why is it critical for LSM-tree performance?
- How does a data cube differ from a materialized view?
Modern Context (2026)
NVMe SSDs Change Trade-offs:
- Random I/O penalty for B-trees largely eliminated on NVMe
- LSM-tree sequential write advantage matters less
- RocksDB tuned specifically for NVMe characteristics
- New storage engine research: LSMT variants (WiscKey: key-value separation)
Cloud-Native Storage Separation:
- Disaggregated storage: compute separated from storage (Aurora, Neon, TiDB)
- Shared storage layer replaces WAL-per-node architectures
- S3-compatible object storage as “infinite disk”
HTAP (Hybrid Transactional/Analytical Processing):
- Single database for both OLTP and OLAP workloads
- TiDB (row store + column store in one), SingleStore, YugabyteDB
- Eliminates ETL pipeline latency for real-time analytics
Lakehouse Architecture:
- Delta Lake, Apache Iceberg, Apache Hudi
- ACID transactions over Parquet files on S3
- Blur the line between data lake and data warehouse
- Merge OLAP query performance with raw data flexibility
Vector Storage:
- HNSW (Hierarchical Navigable Small World) index for approximate nearest neighbor
- Specialized storage for embedding vectors
- Integrated into PostgreSQL (pgvector), Elasticsearch, and dedicated systems
Status: Notes complete
Last Updated: 2026-04-13