Chapter 3: Storage and Retrieval

Overview

To use databases effectively, you need to understand how they store and retrieve data internally. This chapter covers two families of storage engines: log-structured (append-only, LSM-trees) and page-oriented (B-trees), as well as the difference between OLTP (transactional) and OLAP (analytical) workloads. The core insight is that storage engines make radically different trade-offs depending on whether reads or writes are the bottleneck, and whether queries are point lookups or full-table analytics.

Fundamental Trade-off: Write-optimized storage (append-only logs, LSM-trees) vs read-optimized storage (B-trees, column stores) vs analytics-optimized storage (data warehouses, columnar formats).

Key Concepts

Log-Structured Storage: Hash Indexes

Simplest Database:

Append-only log file: echo "key,value" >> db.txt
Read: scan from beginning (O(n)) or maintain in-memory hash map
Problem: Log grows forever → need compaction and segmentation

Bitcask (Riak’s storage engine):

All writes append to log segment file
In-memory hash map: key → byte offset in log file
Read: look up offset in hash map, seek to position in file
Fast writes (sequential append), fast reads (O(1) hash + one disk seek)
Limitation: All keys must fit in RAM (hash map is in-memory only)

Log Compaction:

Compaction: Throw away duplicate keys, keep only most recent value
Segmentation: Split log into fixed-size segments
Compaction runs on frozen segments in background, then merge segments
Old segments deleted after merge; reads served from merged result

Append-only advantages:

Sequential writes much faster than random writes (especially on HDDs)
Crash recovery: partially written records detectable (checksums)
No fragmentation: old values never updated in-place
Concurrency: immutable segments are easy to handle

Hash index limitations:

Keys must fit in RAM (no disk-based hash maps that work well)
Range queries not efficient (must scan all keys)

SSTables and LSM-Trees

SSTable (Sorted String Table):

Like segmented log, but keys sorted by key within each segment
Merge segments efficiently: like merge sort (already sorted)
Sparse in-memory index: only need offset of every few KB
Range queries efficient: seek to start, scan forward

LSM-Tree (Log-Structured Merge-Tree):
Used by: LevelDB, RocksDB, Cassandra, HBase, Lucene

Write path:

Write to in-memory sorted structure (memtable, often a red-black tree)
When memtable exceeds threshold, flush to disk as an SSTable
Background: merge and compact SSTables (compaction)

Read path:

Check memtable (most recent writes)
Check most recent on-disk SSTable
Check older SSTables (in order from newest to oldest)
Bloom filter: probabilistic structure to skip SSTables that don’t contain key

LSM advantages:

Very high write throughput (sequential writes only)
Compressed data (better than B-trees)
Efficient for range queries (sorted order)

LSM disadvantages:

Reads can be slow (multiple SSTable levels to check)
Compaction can interfere with ongoing reads/writes (I/O competition)
Write amplification: One logical write → many physical writes over lifetime

B-Trees

B-Tree structure:

Fixed-size pages (blocks), typically 4KB
Tree of pages: each page contains sorted keys + references to child pages
Leaf pages: contain actual values (or references to values)
Branching factor: Number of children per page (typically several hundred)
Height: 4 levels supports ~256TB for 4KB pages with branching factor 500

Write path:

Locate leaf page containing key range
Update value in-place (or add new key, split page if full)
Write modified page back to disk

B-Tree vs LSM-Tree:

	B-Tree	LSM-Tree
Write performance	Slower (random I/O)	Faster (sequential I/O)
Read performance	Fast (predictable)	Slower (multiple levels)
Space	More overhead	Better compression
Write amplification	High (WAL + page writes)	High (compaction)
Transactions	Easier (pages in-place)	More complex

Write-Ahead Log (WAL):

Before modifying B-tree page, write change to WAL (append-only)
If crash during write, WAL used to restore B-tree to consistent state
Every write recorded twice: WAL + actual page

Optimizations:

Copy-on-write (instead of WAL): Write new page, update parent reference atomically
Abbreviate keys in interior nodes (save space)
Lay out pages sequentially on disk for range scans
Fractal tree indexes: absorb writes, pass down lazily

Other Indexing Structures

Secondary Indexes:

Index on non-primary key column (e.g., CREATE INDEX ON users(email))
Values can point to primary key (index covers: clustered index)
Or values can be a list of row identifiers (non-covering)

Clustered Index: Stores actual row data in the index (InnoDB stores row data in PK index)

Covering Index (Index with Included Columns): Stores some columns in index to answer queries without heap lookup

Multi-Column Indexes:

Concatenated index: combines multiple fields (e.g., last_name + first_name)
Spatial indexes (R-trees): efficient 2D/3D range queries (PostGIS)
Full-text search indexes (Lucene): fuzzy matching, linguistic analysis

In-Memory Databases (Memcached, Redis, VoltDB):

Entire dataset in RAM
No penalty for writing to durable format
Redis: persistence via snapshotting or AOF (append-only file)
Advantage: avoid overhead of encoding in-memory structures for disk

OLTP vs OLAP

OLTP (Online Transaction Processing):

Low-latency reads/writes, small number of records per query
Indexed queries by primary key
Updates driven by user actions (orders, page views, etc.)
Access pattern: Look up small number of records, update based on user input

OLAP (Online Analytical Processing):

Large-scale aggregate queries over many records
Full table scans, group by, aggregations
Updates via bulk load (ETL) or stream
Access pattern: Scan millions of records, read few columns, compute aggregates

	OLTP	OLAP
Main reads	Small number of records by key	Aggregate over many records
Main writes	Random access, low latency	Bulk import (ETL) or event stream
Used by	End users, applications	Internal analysts, BI tools
Size	GB to TB	TB to PB
DB examples	MySQL, PostgreSQL	Redshift, BigQuery, Snowflake

Data Warehouses

Motivation: Don’t run analytics on OLTP DB (impacts performance, different index structure needed)

ETL (Extract-Transform-Load):

Extract from OLTP systems
Transform into analysis-friendly schema
Load into warehouse

Star Schema (Dimensional Modeling):

Fact table: Each row = event (sale, page view, click)
Dimension tables: Who, what, where, when, how, why of events
Query joins fact table to dimension tables
Queries tend to scan millions of fact rows, read few columns

Snowflake Schema: Dimension tables further normalized into sub-dimensions (more normalized than star)

Column-Oriented Storage

Row-oriented (traditional): Each row stored together → great for OLTP
Column-oriented: Each column stored together → great for OLAP aggregates

Column storage advantages for analytics:

Only read columns needed by query (not entire row)
Columns compress very well (repetitive values)
Vectorized processing: operate on columns directly without decoding

Column Compression:

Bitmap encoding: For columns with few distinct values (e.g., status = active/inactive)
Run-length encoding: active, active, active, inactive → 3×active, 1×inactive
Compressed bitmaps enable fast bitwise AND/OR operations

Sort order in column stores:

Can sort rows by most-queried column (like clustered index)
Redundant copies with different sort orders (C-store, Vertica)
Sort order chosen to help most common queries AND improve compression

Materialized Views and Data Cubes:

Materialized view: Pre-computed query result stored on disk
Data cube (OLAP cube): Aggregation by each combination of dimensions
Fast for queries that match pre-computed aggregations
Downside: not as flexible as raw data for ad-hoc queries

Important Points

LSM-trees optimize writes; B-trees optimize reads: Choose based on your workload.
Write amplification is real for both: B-trees write pages twice (WAL + page); LSM-trees rewrite data during compaction.
Bloom filters are essential for LSM-tree reads: Without them, every read checks every SSTable level.
Column storage is the key innovation in modern data warehouses: Row stores can’t compete for analytics.
OLAP and OLTP need separate storage engines: Mixing them hurts both; use ETL to bridge them.
In-memory DBs are not just about speed: They enable different data structures (priority queues, etc.) not efficient on disk.
Compaction is LSM-tree’s Achilles’ heel at high write volumes: Can saturate I/O and impact reads.

Examples & Case Studies

LevelDB / RocksDB (LSM-Tree)
- Used as storage engine for Cassandra, HBase, TiKV, many others
- RocksDB: Facebook’s fork, optimized for NVMe SSDs
- Used inside: MyRocks (MySQL), MongoRocks (MongoDB)
InnoDB (B-Tree)
- MySQL’s default storage engine
- Clustered index: rows stored in primary key order
- Separate secondary indexes point back to primary key
Vertica / Snowflake (Column Store)
- Purpose-built for OLAP analytics
- Redundant column sorts, aggressive compression
- Query optimizer chooses best sort order at query time
Bitcask (Hash Index)
- Riak’s default storage engine
- All keys in RAM, values on disk
- Best for workloads with small key space, frequent updates to same keys

Questions

Why are sequential writes faster than random writes, and how do LSM-trees exploit this?
When would you use an LSM-tree storage engine vs a B-tree?
How does compaction work in LSM-trees, and what are its downsides?
What is write amplification and why does it matter?
Why is column-oriented storage better for analytics than row-oriented?
When should you use an OLAP warehouse vs querying your OLTP database directly?
What is a Bloom filter and why is it critical for LSM-tree performance?
How does a data cube differ from a materialized view?

Modern Context (2026)

NVMe SSDs Change Trade-offs:

Random I/O penalty for B-trees largely eliminated on NVMe
LSM-tree sequential write advantage matters less
RocksDB tuned specifically for NVMe characteristics
New storage engine research: LSMT variants (WiscKey: key-value separation)

Cloud-Native Storage Separation:

Disaggregated storage: compute separated from storage (Aurora, Neon, TiDB)
Shared storage layer replaces WAL-per-node architectures
S3-compatible object storage as “infinite disk”

HTAP (Hybrid Transactional/Analytical Processing):

Single database for both OLTP and OLAP workloads
TiDB (row store + column store in one), SingleStore, YugabyteDB
Eliminates ETL pipeline latency for real-time analytics

Lakehouse Architecture:

Delta Lake, Apache Iceberg, Apache Hudi
ACID transactions over Parquet files on S3
Blur the line between data lake and data warehouse
Merge OLAP query performance with raw data flexibility

Vector Storage:

HNSW (Hierarchical Navigable Small World) index for approximate nearest neighbor
Specialized storage for embedding vectors
Integrated into PostgreSQL (pgvector), Elasticsearch, and dedicated systems

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch03-storage-and-retrieval