Chapter 3 Flashcards - Storage and Retrieval

Basic Concepts

What are the two families of storage engines described in DDIA Chapter 3?
?

Log-structured storage engines: Append-only writes; examples: LSM-Trees (LevelDB, RocksDB, Cassandra), Bitcask
Page-oriented storage engines: Update pages in-place; example: B-Trees (MySQL InnoDB, PostgreSQL)

Key difference: Log-structured optimizes for writes; page-oriented balances reads and writes but favors read consistency

What is an SSTable and what makes it different from a plain append-only log?
?

SSTable = Sorted String Table
Keys are sorted by key within each segment file (unlike plain log)
Advantages over plain log:
- Merging segments is efficient (merge-sort algorithm, already sorted)
- Sparse in-memory index possible (one entry per few KB)
- Range queries efficient (seek to start key, scan forward)
Used in: LevelDB, RocksDB, HBase, Cassandra, Bigtable

What is a Bloom filter and why is it critical for LSM-tree read performance?
?

Definition: Probabilistic data structure that tells you if a key is DEFINITELY NOT in a set
Properties:
- No false negatives: if key exists, always returns “possibly present”
- Has false positives: may say “possibly present” for absent keys
- Very space-efficient (a few bits per key)
LSM-tree use: Before reading each SSTable, check Bloom filter
- “Definitely not”: skip entire SSTable (huge I/O savings)
- “Maybe present”: read SSTable (may be false positive)
Without Bloom filters, every read must check every level

What is write amplification and which storage engines suffer from it?
?

Definition: One logical write from the application results in multiple physical writes to disk over its lifetime
B-tree write amplification:
- WAL write + modified page write = 2x minimum
- Page splits cause additional writes
LSM-tree write amplification:
- Data written to memtable WAL → L0 SSTable → compacted to L1 → L2 → …
- Same data written many times as it moves through levels
- Can be 10-30x for leveled compaction
Impact: Reduces SSD lifespan, increases I/O utilization, limits throughput

Log-Structured Storage

Describe the full write and read path in an LSM-tree.
?
Write path:

Append to write-ahead log (WAL) on disk for crash recovery
Insert into memtable (in-memory sorted structure, usually red-black tree)
When memtable exceeds threshold (~a few MB), flush to disk as new SSTable (L0)
Background compaction: merge L0 SSTables into L1, L1 into L2, etc.

Read path:

Check memtable (most recent writes)
For each SSTable level (newest first): check Bloom filter → if “maybe present”, binary search in SSTable
Return the most recent version of the key found

What is compaction in LSM-trees and what are its trade-offs?
?

Definition: Background process that merges multiple SSTable files, discarding old/deleted versions of keys
Purpose:
- Keep number of SSTable files manageable
- Reclaim disk space from deleted/overwritten keys
- Maintain read performance (fewer files to check)
Trade-offs:
- ✅ Keeps read performance bounded
- ❌ Compaction I/O competes with foreground reads/writes
- ❌ If writes faster than compaction → levels grow → slower reads
- ❌ Write amplification (data rewritten multiple times)
Strategies: Size-tiered compaction (HBase), Leveled compaction (LevelDB, RocksDB)

What is the Bitcask storage engine and when is it appropriate?
?

Design: Append-only log + in-memory hash map (key → byte offset in log)
Write: Append key-value pair to log; update hash map
Read: Hash map lookup (O(1)) → seek to offset → read value
Compaction: Merge log segments, keep latest value per key
Constraints:
- All keys must fit in RAM (hash map is in-memory only)
- No range queries (hash map has no ordering)
Best for: High write volume with small key set (e.g., Riak’s default engine for most workloads)

B-Trees

How does a B-tree handle a write operation?
?

Find the leaf page containing (or that should contain) the key — traverse tree from root
Update the value in-place on the leaf page
If page is full: split into two half-full pages, update parent page
All modifications written to WAL (write-ahead log) before modifying pages
Modified page written back to disk

Key properties:

Always balanced: height O(log n)
Branching factor ~500 → 4 levels handles ~256TB with 4KB pages
WAL ensures crash recovery: if crash during page write, WAL used to redo

What is a clustered index vs a non-clustered index?
?

Clustered index: Row data stored directly in the index (one per table)
- InnoDB: rows stored in primary key order in B-tree
- Reading by PK = reading the data itself (fast)
- Secondary indexes store primary key values (double lookup)
Non-clustered index: Index stores pointers (heap file offsets) to actual rows
- PostgreSQL: heap-based; all indexes point to heap
- Updates can move row → update all index pointers
Covering index: Non-clustered index that includes extra columns to satisfy query without heap lookup
- Trades storage for read performance

OLTP vs OLAP

What distinguishes OLTP from OLAP workloads?
?

	OLTP	OLAP
Reads	Small number of records by key	Aggregate over millions of records
Writes	Random, low latency	Bulk ETL or streaming
Users	End users via application	Analysts, BI tools
Query shape	Index lookup, point queries	Full scans, GROUP BY, aggregations
Data size	GB to TB	TB to PB
DB examples	MySQL, PostgreSQL	Snowflake, BigQuery, Redshift
Storage engine	B-tree or LSM-tree	Columnar

What is a star schema and what are its components?
?

Star schema: Data warehouse schema with one central fact table surrounded by dimension tables
Fact table:
- Each row = one event (sale, click, transaction)
- Contains foreign keys to dimension tables + numeric metrics
- Very wide (many columns), very tall (billions of rows)
Dimension tables: Who, what, where, when of each event
- Product, Customer, Date, Location, etc.
- Smaller, denormalized for query convenience
Snowflake schema: Dimension tables further normalized into sub-dimensions
Queries: JOIN fact to dimensions, GROUP BY dimension attributes, aggregate metrics

What is ETL and why is it used in data warehousing?
?

ETL = Extract, Transform, Load
Extract: Read data from OLTP databases (or other source systems)
Transform: Clean, normalize, join, apply business rules
Load: Write into data warehouse in analytical-friendly format
Why needed:
- Running analytics on OLTP DB would degrade transaction performance
- Analytical schema (star schema) differs from transactional schema
- Data quality issues from multiple source systems need resolution
Modern: ELT (Extract, Load, Transform) — load raw first, transform in warehouse with SQL

Column-Oriented Storage

Why is column-oriented storage better for analytical queries?
?

Analytics typically: Read 3-5 columns out of 100+, aggregate over millions of rows
Row store: Must load entire row even if only need 2 columns → wastes I/O
Column store: Read only the columns needed → much less I/O

Additional advantages:

Each column has homogeneous data type → better compression (up to 10x)
Vectorized processing: operate on compressed columns without decoding
CPU cache efficiency: same-type data fits in cache lines

Trade-off: Writes are expensive (one row = write to multiple column files)

How does bitmap encoding enable fast analytical queries?
?

Bitmap encoding: For each distinct value in a column, create a bit array
- Bit = 1 if row has that value; 0 otherwise
- status column with values {active, inactive}:
  - active: 1 1 1 0 1 0 0
  - inactive: 0 0 0 1 0 1 1
Query acceleration: WHERE status = active AND region = west
- = bitwise AND of active bitmap with west bitmap
- Extremely fast: CPU can process 64 bits in one instruction
Run-length encoding further compresses sparse bitmaps
Most effective when column has low cardinality (few distinct values)

What is a materialized view vs a virtual view?
?

Virtual view: Named query; SQL statement stored; data computed at query time; always fresh
Materialized view: Pre-computed query result stored on disk; periodically refreshed
- Faster reads (no computation at query time)
- Trade-off: may be stale; storage overhead; maintenance cost on base data changes
Data cube (OLAP cube): Special case — materialized aggregations for every combination of dimensions
- Example: Sales aggregated by (date × product × region) — all 8 combinations pre-computed
- Allows instant “drill-down” queries
- Less flexible than raw data for ad-hoc queries

Modern Context (2026)

How has NVMe SSD changed the B-tree vs LSM-tree trade-off?
?

Traditional (HDD era): Random I/O ~200x slower than sequential → LSM-tree’s sequential writes huge win
NVMe SSD era:
- Random IOPS: ~1M ops/sec (vs HDD’s ~100)
- Random vs sequential gap: ~5x (vs HDD’s ~200x)
- Result: B-tree’s random write penalty is much smaller on NVMe
Implications:
- LSM-tree’s write advantage is less dominant
- B-tree becomes more competitive for write-heavy workloads on NVMe
- WiscKey optimization: separate keys (LSM) from values (log file) → reduce compaction overhead
Caveat: Write amplification still wears SSDs — matters for total cost

What is the Lakehouse architecture and how does it blur OLTP/OLAP boundaries?
?

Lakehouse = Data Lake + Data Warehouse capabilities
Formats: Delta Lake, Apache Iceberg, Apache Hudi (all on Parquet files + object storage)
Key features:
- ACID transactions on cloud object storage (S3)
- Schema enforcement + evolution
- Time travel (query historical versions)
- Update/delete support (unlike traditional data lakes)
Advantages over pure warehouse: Cheap storage, raw data retained, ML-ready
Advantages over pure data lake: ACID, performance, governance
2026: Iceberg + Trino / Spark / DuckDB = dominant open analytics stack

Interview Scenarios

A write-heavy application is experiencing slow insert performance. How do you diagnose and fix it?
?
Diagnosis:

Check if using B-tree storage — high write amplification may be bottleneck
Check LSM-tree compaction lag — if writes faster than compaction → levels pile up
Check disk I/O saturation — are writes bottlenecked on physical I/O?
Check write-ahead log — synchronous fsync on every write?

Solutions:

Switch to LSM-tree engine (RocksDB/Cassandra) if workload is append-heavy
Tune LSM compaction: increase L0 file count threshold, use tiered compaction
Batch writes (reduce number of fsync calls)
Use NVMe SSD to reduce I/O latency
For analytics writes: load via bulk ETL, not row-by-row inserts

Your analytics team says reports take hours to run. The data is in PostgreSQL. What do you recommend?
?
Problem: PostgreSQL (row store, OLTP) is not designed for large analytical scans

Solution: Extract data into a proper OLAP system

Options:

Cloud data warehouse: Snowflake/BigQuery/Redshift
- ETL PostgreSQL → warehouse (dbt + Fivetran/Airbyte)
- Pay-per-query, columnar, highly scalable
In-process OLAP: DuckDB
- Query Parquet files directly on cheap storage
- No server needed, excellent for moderate data sizes
HTAP: Migrate to TiDB (if real-time analytics needed)

Quick win: Add proper indexes + query optimization in PostgreSQL first — often 10-100x speedup before needing a warehouse

Quick Facts

What is the typical branching factor of a B-tree and what depth does that support?
?

Branching factor: ~500 children per page (typical, with 4KB pages and ~8-byte keys)
Height: 4 levels with branching factor 500
- Level 1 (root): 1 page
- Level 2: 500 pages
- Level 3: 250,000 pages
- Level 4 (leaves): 125 million pages × 4KB = ~500GB per leaf level
Real-world: 3-4 levels supports most databases in practice
Key insight: Only a few disk seeks per lookup even for huge datasets

What are the key metrics that separate LSM-tree variants (size-tiered vs leveled compaction)?
?
Size-tiered compaction (HBase, early Cassandra):

Similar-sized SSTables grouped together and merged
Better write throughput (less frequent compaction)
Worse read performance (more files to check), more space amplification

Leveled compaction (LevelDB, RocksDB, modern Cassandra):

Each level has bounded total size; L(n) is ~10x larger than L(n-1)
Better read performance and space efficiency
Higher write amplification (more frequent merges)

TWCS (Time Window Compaction Strategy, Cassandra):

For time-series data: compact within time windows
Old windows sealed and not recompacted → very efficient for time-series

What does 99th percentile disk read latency look like for B-trees vs LSM-trees on NVMe?
?

B-tree read (NVMe): ~100-500 microseconds (4-level tree = 4 random reads, ~25µs each on NVMe)
LSM-tree read (worst case, no Bloom filter): Check memtable + L0 + L1 + L2 + … → 5-10ms
LSM-tree read (with Bloom filters, key in L1): ~200-800 microseconds
Key insight: B-tree provides more predictable read latency; LSM-tree has higher variance
For p99 SLAs, B-tree is often safer choice for reads

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch03-flashcards