Chapter 3: Storage and Retrieval

Overview

To use databases effectively, you need to understand how they store and retrieve data internally. This chapter covers two families of storage engines: log-structured (append-only, LSM-trees) and page-oriented (B-trees), as well as the difference between OLTP (transactional) and OLAP (analytical) workloads. The core insight is that storage engines make radically different trade-offs depending on whether reads or writes are the bottleneck, and whether queries are point lookups or full-table analytics.

Fundamental Trade-off: Write-optimized storage (append-only logs, LSM-trees) vs read-optimized storage (B-trees, column stores) vs analytics-optimized storage (data warehouses, columnar formats).

Key Concepts

Log-Structured Storage: Hash Indexes

Simplest Database:

  • Append-only log file: echo "key,value" >> db.txt
  • Read: scan from beginning (O(n)) or maintain in-memory hash map
  • Problem: Log grows forever → need compaction and segmentation

Bitcask (Riak’s storage engine):

  • All writes append to log segment file
  • In-memory hash map: key → byte offset in log file
  • Read: look up offset in hash map, seek to position in file
  • Fast writes (sequential append), fast reads (O(1) hash + one disk seek)
  • Limitation: All keys must fit in RAM (hash map is in-memory only)

Log Compaction:

  • Compaction: Throw away duplicate keys, keep only most recent value
  • Segmentation: Split log into fixed-size segments
  • Compaction runs on frozen segments in background, then merge segments
  • Old segments deleted after merge; reads served from merged result

Append-only advantages:

  • Sequential writes much faster than random writes (especially on HDDs)
  • Crash recovery: partially written records detectable (checksums)
  • No fragmentation: old values never updated in-place
  • Concurrency: immutable segments are easy to handle

Hash index limitations:

  • Keys must fit in RAM (no disk-based hash maps that work well)
  • Range queries not efficient (must scan all keys)

SSTables and LSM-Trees

SSTable (Sorted String Table):

  • Like segmented log, but keys sorted by key within each segment
  • Merge segments efficiently: like merge sort (already sorted)
  • Sparse in-memory index: only need offset of every few KB
  • Range queries efficient: seek to start, scan forward

LSM-Tree (Log-Structured Merge-Tree):
Used by: LevelDB, RocksDB, Cassandra, HBase, Lucene

Write path:

  1. Write to in-memory sorted structure (memtable, often a red-black tree)
  2. When memtable exceeds threshold, flush to disk as an SSTable
  3. Background: merge and compact SSTables (compaction)

Read path:

  1. Check memtable (most recent writes)
  2. Check most recent on-disk SSTable
  3. Check older SSTables (in order from newest to oldest)
  4. Bloom filter: probabilistic structure to skip SSTables that don’t contain key

LSM advantages:

  • Very high write throughput (sequential writes only)
  • Compressed data (better than B-trees)
  • Efficient for range queries (sorted order)

LSM disadvantages:

  • Reads can be slow (multiple SSTable levels to check)
  • Compaction can interfere with ongoing reads/writes (I/O competition)
  • Write amplification: One logical write → many physical writes over lifetime

B-Trees

B-Tree structure:

  • Fixed-size pages (blocks), typically 4KB
  • Tree of pages: each page contains sorted keys + references to child pages
  • Leaf pages: contain actual values (or references to values)
  • Branching factor: Number of children per page (typically several hundred)
  • Height: 4 levels supports ~256TB for 4KB pages with branching factor 500

Write path:

  1. Locate leaf page containing key range
  2. Update value in-place (or add new key, split page if full)
  3. Write modified page back to disk

B-Tree vs LSM-Tree:

B-TreeLSM-Tree
Write performanceSlower (random I/O)Faster (sequential I/O)
Read performanceFast (predictable)Slower (multiple levels)
SpaceMore overheadBetter compression
Write amplificationHigh (WAL + page writes)High (compaction)
TransactionsEasier (pages in-place)More complex

Write-Ahead Log (WAL):

  • Before modifying B-tree page, write change to WAL (append-only)
  • If crash during write, WAL used to restore B-tree to consistent state
  • Every write recorded twice: WAL + actual page

Optimizations:

  • Copy-on-write (instead of WAL): Write new page, update parent reference atomically
  • Abbreviate keys in interior nodes (save space)
  • Lay out pages sequentially on disk for range scans
  • Fractal tree indexes: absorb writes, pass down lazily

Other Indexing Structures

Secondary Indexes:

  • Index on non-primary key column (e.g., CREATE INDEX ON users(email))
  • Values can point to primary key (index covers: clustered index)
  • Or values can be a list of row identifiers (non-covering)

Clustered Index: Stores actual row data in the index (InnoDB stores row data in PK index)

Covering Index (Index with Included Columns): Stores some columns in index to answer queries without heap lookup

Multi-Column Indexes:

  • Concatenated index: combines multiple fields (e.g., last_name + first_name)
  • Spatial indexes (R-trees): efficient 2D/3D range queries (PostGIS)
  • Full-text search indexes (Lucene): fuzzy matching, linguistic analysis

In-Memory Databases (Memcached, Redis, VoltDB):

  • Entire dataset in RAM
  • No penalty for writing to durable format
  • Redis: persistence via snapshotting or AOF (append-only file)
  • Advantage: avoid overhead of encoding in-memory structures for disk

OLTP vs OLAP

OLTP (Online Transaction Processing):

  • Low-latency reads/writes, small number of records per query
  • Indexed queries by primary key
  • Updates driven by user actions (orders, page views, etc.)
  • Access pattern: Look up small number of records, update based on user input

OLAP (Online Analytical Processing):

  • Large-scale aggregate queries over many records
  • Full table scans, group by, aggregations
  • Updates via bulk load (ETL) or stream
  • Access pattern: Scan millions of records, read few columns, compute aggregates
OLTPOLAP
Main readsSmall number of records by keyAggregate over many records
Main writesRandom access, low latencyBulk import (ETL) or event stream
Used byEnd users, applicationsInternal analysts, BI tools
SizeGB to TBTB to PB
DB examplesMySQL, PostgreSQLRedshift, BigQuery, Snowflake

Data Warehouses

Motivation: Don’t run analytics on OLTP DB (impacts performance, different index structure needed)

ETL (Extract-Transform-Load):

  • Extract from OLTP systems
  • Transform into analysis-friendly schema
  • Load into warehouse

Star Schema (Dimensional Modeling):

  • Fact table: Each row = event (sale, page view, click)
  • Dimension tables: Who, what, where, when, how, why of events
  • Query joins fact table to dimension tables
  • Queries tend to scan millions of fact rows, read few columns

Snowflake Schema: Dimension tables further normalized into sub-dimensions (more normalized than star)

Column-Oriented Storage

Row-oriented (traditional): Each row stored together → great for OLTP
Column-oriented: Each column stored together → great for OLAP aggregates

Column storage advantages for analytics:

  • Only read columns needed by query (not entire row)
  • Columns compress very well (repetitive values)
  • Vectorized processing: operate on columns directly without decoding

Column Compression:

  • Bitmap encoding: For columns with few distinct values (e.g., status = active/inactive)
  • Run-length encoding: active, active, active, inactive3×active, 1×inactive
  • Compressed bitmaps enable fast bitwise AND/OR operations

Sort order in column stores:

  • Can sort rows by most-queried column (like clustered index)
  • Redundant copies with different sort orders (C-store, Vertica)
  • Sort order chosen to help most common queries AND improve compression

Materialized Views and Data Cubes:

  • Materialized view: Pre-computed query result stored on disk
  • Data cube (OLAP cube): Aggregation by each combination of dimensions
  • Fast for queries that match pre-computed aggregations
  • Downside: not as flexible as raw data for ad-hoc queries

Important Points

  • LSM-trees optimize writes; B-trees optimize reads: Choose based on your workload.
  • Write amplification is real for both: B-trees write pages twice (WAL + page); LSM-trees rewrite data during compaction.
  • Bloom filters are essential for LSM-tree reads: Without them, every read checks every SSTable level.
  • Column storage is the key innovation in modern data warehouses: Row stores can’t compete for analytics.
  • OLAP and OLTP need separate storage engines: Mixing them hurts both; use ETL to bridge them.
  • In-memory DBs are not just about speed: They enable different data structures (priority queues, etc.) not efficient on disk.
  • Compaction is LSM-tree’s Achilles’ heel at high write volumes: Can saturate I/O and impact reads.

Examples & Case Studies

  1. LevelDB / RocksDB (LSM-Tree)

    • Used as storage engine for Cassandra, HBase, TiKV, many others
    • RocksDB: Facebook’s fork, optimized for NVMe SSDs
    • Used inside: MyRocks (MySQL), MongoRocks (MongoDB)
  2. InnoDB (B-Tree)

    • MySQL’s default storage engine
    • Clustered index: rows stored in primary key order
    • Separate secondary indexes point back to primary key
  3. Vertica / Snowflake (Column Store)

    • Purpose-built for OLAP analytics
    • Redundant column sorts, aggressive compression
    • Query optimizer chooses best sort order at query time
  4. Bitcask (Hash Index)

    • Riak’s default storage engine
    • All keys in RAM, values on disk
    • Best for workloads with small key space, frequent updates to same keys

Questions

  1. Why are sequential writes faster than random writes, and how do LSM-trees exploit this?
  2. When would you use an LSM-tree storage engine vs a B-tree?
  3. How does compaction work in LSM-trees, and what are its downsides?
  4. What is write amplification and why does it matter?
  5. Why is column-oriented storage better for analytics than row-oriented?
  6. When should you use an OLAP warehouse vs querying your OLTP database directly?
  7. What is a Bloom filter and why is it critical for LSM-tree performance?
  8. How does a data cube differ from a materialized view?

Modern Context (2026)

NVMe SSDs Change Trade-offs:

  • Random I/O penalty for B-trees largely eliminated on NVMe
  • LSM-tree sequential write advantage matters less
  • RocksDB tuned specifically for NVMe characteristics
  • New storage engine research: LSMT variants (WiscKey: key-value separation)

Cloud-Native Storage Separation:

  • Disaggregated storage: compute separated from storage (Aurora, Neon, TiDB)
  • Shared storage layer replaces WAL-per-node architectures
  • S3-compatible object storage as “infinite disk”

HTAP (Hybrid Transactional/Analytical Processing):

  • Single database for both OLTP and OLAP workloads
  • TiDB (row store + column store in one), SingleStore, YugabyteDB
  • Eliminates ETL pipeline latency for real-time analytics

Lakehouse Architecture:

  • Delta Lake, Apache Iceberg, Apache Hudi
  • ACID transactions over Parquet files on S3
  • Blur the line between data lake and data warehouse
  • Merge OLAP query performance with raw data flexibility

Vector Storage:

  • HNSW (Hierarchical Navigable Small World) index for approximate nearest neighbor
  • Specialized storage for embedding vectors
  • Integrated into PostgreSQL (pgvector), Elasticsearch, and dedicated systems

Status: Notes complete
Last Updated: 2026-04-13