Chapter 3 Flashcards - Storage and Retrieval
Basic Concepts
What are the two families of storage engines described in DDIA Chapter 3?
?
- Log-structured storage engines: Append-only writes; examples: LSM-Trees (LevelDB, RocksDB, Cassandra), Bitcask
- Page-oriented storage engines: Update pages in-place; example: B-Trees (MySQL InnoDB, PostgreSQL)
Key difference: Log-structured optimizes for writes; page-oriented balances reads and writes but favors read consistency
What is an SSTable and what makes it different from a plain append-only log?
?
- SSTable = Sorted String Table
- Keys are sorted by key within each segment file (unlike plain log)
- Advantages over plain log:
- Merging segments is efficient (merge-sort algorithm, already sorted)
- Sparse in-memory index possible (one entry per few KB)
- Range queries efficient (seek to start key, scan forward)
- Used in: LevelDB, RocksDB, HBase, Cassandra, Bigtable
What is a Bloom filter and why is it critical for LSM-tree read performance?
?
- Definition: Probabilistic data structure that tells you if a key is DEFINITELY NOT in a set
- Properties:
- No false negatives: if key exists, always returns “possibly present”
- Has false positives: may say “possibly present” for absent keys
- Very space-efficient (a few bits per key)
- LSM-tree use: Before reading each SSTable, check Bloom filter
- “Definitely not”: skip entire SSTable (huge I/O savings)
- “Maybe present”: read SSTable (may be false positive)
- Without Bloom filters, every read must check every level
What is write amplification and which storage engines suffer from it?
?
- Definition: One logical write from the application results in multiple physical writes to disk over its lifetime
- B-tree write amplification:
- WAL write + modified page write = 2x minimum
- Page splits cause additional writes
- LSM-tree write amplification:
- Data written to memtable WAL → L0 SSTable → compacted to L1 → L2 → …
- Same data written many times as it moves through levels
- Can be 10-30x for leveled compaction
- Impact: Reduces SSD lifespan, increases I/O utilization, limits throughput
Log-Structured Storage
Describe the full write and read path in an LSM-tree.
?
Write path:
- Append to write-ahead log (WAL) on disk for crash recovery
- Insert into memtable (in-memory sorted structure, usually red-black tree)
- When memtable exceeds threshold (~a few MB), flush to disk as new SSTable (L0)
- Background compaction: merge L0 SSTables into L1, L1 into L2, etc.
Read path:
- Check memtable (most recent writes)
- For each SSTable level (newest first): check Bloom filter → if “maybe present”, binary search in SSTable
- Return the most recent version of the key found
What is compaction in LSM-trees and what are its trade-offs?
?
- Definition: Background process that merges multiple SSTable files, discarding old/deleted versions of keys
- Purpose:
- Keep number of SSTable files manageable
- Reclaim disk space from deleted/overwritten keys
- Maintain read performance (fewer files to check)
- Trade-offs:
- ✅ Keeps read performance bounded
- ❌ Compaction I/O competes with foreground reads/writes
- ❌ If writes faster than compaction → levels grow → slower reads
- ❌ Write amplification (data rewritten multiple times)
- Strategies: Size-tiered compaction (HBase), Leveled compaction (LevelDB, RocksDB)
What is the Bitcask storage engine and when is it appropriate?
?
- Design: Append-only log + in-memory hash map (key → byte offset in log)
- Write: Append key-value pair to log; update hash map
- Read: Hash map lookup (O(1)) → seek to offset → read value
- Compaction: Merge log segments, keep latest value per key
- Constraints:
- All keys must fit in RAM (hash map is in-memory only)
- No range queries (hash map has no ordering)
- Best for: High write volume with small key set (e.g., Riak’s default engine for most workloads)
B-Trees
How does a B-tree handle a write operation?
?
- Find the leaf page containing (or that should contain) the key — traverse tree from root
- Update the value in-place on the leaf page
- If page is full: split into two half-full pages, update parent page
- All modifications written to WAL (write-ahead log) before modifying pages
- Modified page written back to disk
Key properties:
- Always balanced: height O(log n)
- Branching factor ~500 → 4 levels handles ~256TB with 4KB pages
- WAL ensures crash recovery: if crash during page write, WAL used to redo
What is a clustered index vs a non-clustered index?
?
-
Clustered index: Row data stored directly in the index (one per table)
- InnoDB: rows stored in primary key order in B-tree
- Reading by PK = reading the data itself (fast)
- Secondary indexes store primary key values (double lookup)
-
Non-clustered index: Index stores pointers (heap file offsets) to actual rows
- PostgreSQL: heap-based; all indexes point to heap
- Updates can move row → update all index pointers
-
Covering index: Non-clustered index that includes extra columns to satisfy query without heap lookup
- Trades storage for read performance
OLTP vs OLAP
What distinguishes OLTP from OLAP workloads?
?
| OLTP | OLAP | |
|---|---|---|
| Reads | Small number of records by key | Aggregate over millions of records |
| Writes | Random, low latency | Bulk ETL or streaming |
| Users | End users via application | Analysts, BI tools |
| Query shape | Index lookup, point queries | Full scans, GROUP BY, aggregations |
| Data size | GB to TB | TB to PB |
| DB examples | MySQL, PostgreSQL | Snowflake, BigQuery, Redshift |
| Storage engine | B-tree or LSM-tree | Columnar |
What is a star schema and what are its components?
?
- Star schema: Data warehouse schema with one central fact table surrounded by dimension tables
- Fact table:
- Each row = one event (sale, click, transaction)
- Contains foreign keys to dimension tables + numeric metrics
- Very wide (many columns), very tall (billions of rows)
- Dimension tables: Who, what, where, when of each event
- Product, Customer, Date, Location, etc.
- Smaller, denormalized for query convenience
- Snowflake schema: Dimension tables further normalized into sub-dimensions
- Queries: JOIN fact to dimensions, GROUP BY dimension attributes, aggregate metrics
What is ETL and why is it used in data warehousing?
?
- ETL = Extract, Transform, Load
- Extract: Read data from OLTP databases (or other source systems)
- Transform: Clean, normalize, join, apply business rules
- Load: Write into data warehouse in analytical-friendly format
- Why needed:
- Running analytics on OLTP DB would degrade transaction performance
- Analytical schema (star schema) differs from transactional schema
- Data quality issues from multiple source systems need resolution
- Modern: ELT (Extract, Load, Transform) — load raw first, transform in warehouse with SQL
Column-Oriented Storage
Why is column-oriented storage better for analytical queries?
?
- Analytics typically: Read 3-5 columns out of 100+, aggregate over millions of rows
- Row store: Must load entire row even if only need 2 columns → wastes I/O
- Column store: Read only the columns needed → much less I/O
Additional advantages:
- Each column has homogeneous data type → better compression (up to 10x)
- Vectorized processing: operate on compressed columns without decoding
- CPU cache efficiency: same-type data fits in cache lines
Trade-off: Writes are expensive (one row = write to multiple column files)
How does bitmap encoding enable fast analytical queries?
?
- Bitmap encoding: For each distinct value in a column, create a bit array
- Bit = 1 if row has that value; 0 otherwise
statuscolumn with values {active, inactive}:- active:
1 1 1 0 1 0 0 - inactive:
0 0 0 1 0 1 1
- active:
- Query acceleration:
WHERE status = active AND region = west- = bitwise AND of active bitmap with west bitmap
- Extremely fast: CPU can process 64 bits in one instruction
- Run-length encoding further compresses sparse bitmaps
- Most effective when column has low cardinality (few distinct values)
What is a materialized view vs a virtual view?
?
-
Virtual view: Named query; SQL statement stored; data computed at query time; always fresh
-
Materialized view: Pre-computed query result stored on disk; periodically refreshed
- Faster reads (no computation at query time)
- Trade-off: may be stale; storage overhead; maintenance cost on base data changes
-
Data cube (OLAP cube): Special case — materialized aggregations for every combination of dimensions
- Example: Sales aggregated by (date × product × region) — all 8 combinations pre-computed
- Allows instant “drill-down” queries
- Less flexible than raw data for ad-hoc queries
Modern Context (2026)
How has NVMe SSD changed the B-tree vs LSM-tree trade-off?
?
- Traditional (HDD era): Random I/O ~200x slower than sequential → LSM-tree’s sequential writes huge win
- NVMe SSD era:
- Random IOPS: ~1M ops/sec (vs HDD’s ~100)
- Random vs sequential gap: ~5x (vs HDD’s ~200x)
- Result: B-tree’s random write penalty is much smaller on NVMe
- Implications:
- LSM-tree’s write advantage is less dominant
- B-tree becomes more competitive for write-heavy workloads on NVMe
- WiscKey optimization: separate keys (LSM) from values (log file) → reduce compaction overhead
- Caveat: Write amplification still wears SSDs — matters for total cost
What is the Lakehouse architecture and how does it blur OLTP/OLAP boundaries?
?
- Lakehouse = Data Lake + Data Warehouse capabilities
- Formats: Delta Lake, Apache Iceberg, Apache Hudi (all on Parquet files + object storage)
- Key features:
- ACID transactions on cloud object storage (S3)
- Schema enforcement + evolution
- Time travel (query historical versions)
- Update/delete support (unlike traditional data lakes)
- Advantages over pure warehouse: Cheap storage, raw data retained, ML-ready
- Advantages over pure data lake: ACID, performance, governance
- 2026: Iceberg + Trino / Spark / DuckDB = dominant open analytics stack
Interview Scenarios
A write-heavy application is experiencing slow insert performance. How do you diagnose and fix it?
?
Diagnosis:
- Check if using B-tree storage — high write amplification may be bottleneck
- Check LSM-tree compaction lag — if writes faster than compaction → levels pile up
- Check disk I/O saturation — are writes bottlenecked on physical I/O?
- Check write-ahead log — synchronous fsync on every write?
Solutions:
- Switch to LSM-tree engine (RocksDB/Cassandra) if workload is append-heavy
- Tune LSM compaction: increase L0 file count threshold, use tiered compaction
- Batch writes (reduce number of fsync calls)
- Use NVMe SSD to reduce I/O latency
- For analytics writes: load via bulk ETL, not row-by-row inserts
Your analytics team says reports take hours to run. The data is in PostgreSQL. What do you recommend?
?
Problem: PostgreSQL (row store, OLTP) is not designed for large analytical scans
Solution: Extract data into a proper OLAP system
Options:
- Cloud data warehouse: Snowflake/BigQuery/Redshift
- ETL PostgreSQL → warehouse (dbt + Fivetran/Airbyte)
- Pay-per-query, columnar, highly scalable
- In-process OLAP: DuckDB
- Query Parquet files directly on cheap storage
- No server needed, excellent for moderate data sizes
- HTAP: Migrate to TiDB (if real-time analytics needed)
Quick win: Add proper indexes + query optimization in PostgreSQL first — often 10-100x speedup before needing a warehouse
Quick Facts
What is the typical branching factor of a B-tree and what depth does that support?
?
- Branching factor: ~500 children per page (typical, with 4KB pages and ~8-byte keys)
- Height: 4 levels with branching factor 500
- Level 1 (root): 1 page
- Level 2: 500 pages
- Level 3: 250,000 pages
- Level 4 (leaves): 125 million pages × 4KB = ~500GB per leaf level
- Real-world: 3-4 levels supports most databases in practice
- Key insight: Only a few disk seeks per lookup even for huge datasets
What are the key metrics that separate LSM-tree variants (size-tiered vs leveled compaction)?
?
Size-tiered compaction (HBase, early Cassandra):
- Similar-sized SSTables grouped together and merged
- Better write throughput (less frequent compaction)
- Worse read performance (more files to check), more space amplification
Leveled compaction (LevelDB, RocksDB, modern Cassandra):
- Each level has bounded total size; L(n) is ~10x larger than L(n-1)
- Better read performance and space efficiency
- Higher write amplification (more frequent merges)
TWCS (Time Window Compaction Strategy, Cassandra):
- For time-series data: compact within time windows
- Old windows sealed and not recompacted → very efficient for time-series
What does 99th percentile disk read latency look like for B-trees vs LSM-trees on NVMe?
?
- B-tree read (NVMe): ~100-500 microseconds (4-level tree = 4 random reads, ~25µs each on NVMe)
- LSM-tree read (worst case, no Bloom filter): Check memtable + L0 + L1 + L2 + … → 5-10ms
- LSM-tree read (with Bloom filters, key in L1): ~200-800 microseconds
- Key insight: B-tree provides more predictable read latency; LSM-tree has higher variance
- For p99 SLAs, B-tree is often safer choice for reads
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13