Chapter 3 Flashcards - Storage and Retrieval

flashcards chapter-3 ddia


Basic Concepts

What are the two families of storage engines described in DDIA Chapter 3?
?

  1. Log-structured storage engines: Append-only writes; examples: LSM-Trees (LevelDB, RocksDB, Cassandra), Bitcask
  2. Page-oriented storage engines: Update pages in-place; example: B-Trees (MySQL InnoDB, PostgreSQL)

Key difference: Log-structured optimizes for writes; page-oriented balances reads and writes but favors read consistency

What is an SSTable and what makes it different from a plain append-only log?
?

  • SSTable = Sorted String Table
  • Keys are sorted by key within each segment file (unlike plain log)
  • Advantages over plain log:
    • Merging segments is efficient (merge-sort algorithm, already sorted)
    • Sparse in-memory index possible (one entry per few KB)
    • Range queries efficient (seek to start key, scan forward)
  • Used in: LevelDB, RocksDB, HBase, Cassandra, Bigtable

What is a Bloom filter and why is it critical for LSM-tree read performance?
?

  • Definition: Probabilistic data structure that tells you if a key is DEFINITELY NOT in a set
  • Properties:
    • No false negatives: if key exists, always returns “possibly present”
    • Has false positives: may say “possibly present” for absent keys
    • Very space-efficient (a few bits per key)
  • LSM-tree use: Before reading each SSTable, check Bloom filter
    • “Definitely not”: skip entire SSTable (huge I/O savings)
    • “Maybe present”: read SSTable (may be false positive)
  • Without Bloom filters, every read must check every level

What is write amplification and which storage engines suffer from it?
?

  • Definition: One logical write from the application results in multiple physical writes to disk over its lifetime
  • B-tree write amplification:
    • WAL write + modified page write = 2x minimum
    • Page splits cause additional writes
  • LSM-tree write amplification:
    • Data written to memtable WAL → L0 SSTable → compacted to L1 → L2 → …
    • Same data written many times as it moves through levels
    • Can be 10-30x for leveled compaction
  • Impact: Reduces SSD lifespan, increases I/O utilization, limits throughput

Log-Structured Storage

Describe the full write and read path in an LSM-tree.
?
Write path:

  1. Append to write-ahead log (WAL) on disk for crash recovery
  2. Insert into memtable (in-memory sorted structure, usually red-black tree)
  3. When memtable exceeds threshold (~a few MB), flush to disk as new SSTable (L0)
  4. Background compaction: merge L0 SSTables into L1, L1 into L2, etc.

Read path:

  1. Check memtable (most recent writes)
  2. For each SSTable level (newest first): check Bloom filter → if “maybe present”, binary search in SSTable
  3. Return the most recent version of the key found

What is compaction in LSM-trees and what are its trade-offs?
?

  • Definition: Background process that merges multiple SSTable files, discarding old/deleted versions of keys
  • Purpose:
    • Keep number of SSTable files manageable
    • Reclaim disk space from deleted/overwritten keys
    • Maintain read performance (fewer files to check)
  • Trade-offs:
    • ✅ Keeps read performance bounded
    • ❌ Compaction I/O competes with foreground reads/writes
    • ❌ If writes faster than compaction → levels grow → slower reads
    • ❌ Write amplification (data rewritten multiple times)
  • Strategies: Size-tiered compaction (HBase), Leveled compaction (LevelDB, RocksDB)

What is the Bitcask storage engine and when is it appropriate?
?

  • Design: Append-only log + in-memory hash map (key → byte offset in log)
  • Write: Append key-value pair to log; update hash map
  • Read: Hash map lookup (O(1)) → seek to offset → read value
  • Compaction: Merge log segments, keep latest value per key
  • Constraints:
    • All keys must fit in RAM (hash map is in-memory only)
    • No range queries (hash map has no ordering)
  • Best for: High write volume with small key set (e.g., Riak’s default engine for most workloads)

B-Trees

How does a B-tree handle a write operation?
?

  1. Find the leaf page containing (or that should contain) the key — traverse tree from root
  2. Update the value in-place on the leaf page
  3. If page is full: split into two half-full pages, update parent page
  4. All modifications written to WAL (write-ahead log) before modifying pages
  5. Modified page written back to disk

Key properties:

  • Always balanced: height O(log n)
  • Branching factor ~500 → 4 levels handles ~256TB with 4KB pages
  • WAL ensures crash recovery: if crash during page write, WAL used to redo

What is a clustered index vs a non-clustered index?
?

  • Clustered index: Row data stored directly in the index (one per table)

    • InnoDB: rows stored in primary key order in B-tree
    • Reading by PK = reading the data itself (fast)
    • Secondary indexes store primary key values (double lookup)
  • Non-clustered index: Index stores pointers (heap file offsets) to actual rows

    • PostgreSQL: heap-based; all indexes point to heap
    • Updates can move row → update all index pointers
  • Covering index: Non-clustered index that includes extra columns to satisfy query without heap lookup

    • Trades storage for read performance

OLTP vs OLAP

What distinguishes OLTP from OLAP workloads?
?

OLTPOLAP
ReadsSmall number of records by keyAggregate over millions of records
WritesRandom, low latencyBulk ETL or streaming
UsersEnd users via applicationAnalysts, BI tools
Query shapeIndex lookup, point queriesFull scans, GROUP BY, aggregations
Data sizeGB to TBTB to PB
DB examplesMySQL, PostgreSQLSnowflake, BigQuery, Redshift
Storage engineB-tree or LSM-treeColumnar

What is a star schema and what are its components?
?

  • Star schema: Data warehouse schema with one central fact table surrounded by dimension tables
  • Fact table:
    • Each row = one event (sale, click, transaction)
    • Contains foreign keys to dimension tables + numeric metrics
    • Very wide (many columns), very tall (billions of rows)
  • Dimension tables: Who, what, where, when of each event
    • Product, Customer, Date, Location, etc.
    • Smaller, denormalized for query convenience
  • Snowflake schema: Dimension tables further normalized into sub-dimensions
  • Queries: JOIN fact to dimensions, GROUP BY dimension attributes, aggregate metrics

What is ETL and why is it used in data warehousing?
?

  • ETL = Extract, Transform, Load
  • Extract: Read data from OLTP databases (or other source systems)
  • Transform: Clean, normalize, join, apply business rules
  • Load: Write into data warehouse in analytical-friendly format
  • Why needed:
    • Running analytics on OLTP DB would degrade transaction performance
    • Analytical schema (star schema) differs from transactional schema
    • Data quality issues from multiple source systems need resolution
  • Modern: ELT (Extract, Load, Transform) — load raw first, transform in warehouse with SQL

Column-Oriented Storage

Why is column-oriented storage better for analytical queries?
?

  • Analytics typically: Read 3-5 columns out of 100+, aggregate over millions of rows
  • Row store: Must load entire row even if only need 2 columns → wastes I/O
  • Column store: Read only the columns needed → much less I/O

Additional advantages:

  • Each column has homogeneous data type → better compression (up to 10x)
  • Vectorized processing: operate on compressed columns without decoding
  • CPU cache efficiency: same-type data fits in cache lines

Trade-off: Writes are expensive (one row = write to multiple column files)

How does bitmap encoding enable fast analytical queries?
?

  • Bitmap encoding: For each distinct value in a column, create a bit array
    • Bit = 1 if row has that value; 0 otherwise
    • status column with values {active, inactive}:
      • active: 1 1 1 0 1 0 0
      • inactive: 0 0 0 1 0 1 1
  • Query acceleration: WHERE status = active AND region = west
    • = bitwise AND of active bitmap with west bitmap
    • Extremely fast: CPU can process 64 bits in one instruction
  • Run-length encoding further compresses sparse bitmaps
  • Most effective when column has low cardinality (few distinct values)

What is a materialized view vs a virtual view?
?

  • Virtual view: Named query; SQL statement stored; data computed at query time; always fresh

  • Materialized view: Pre-computed query result stored on disk; periodically refreshed

    • Faster reads (no computation at query time)
    • Trade-off: may be stale; storage overhead; maintenance cost on base data changes
  • Data cube (OLAP cube): Special case — materialized aggregations for every combination of dimensions

    • Example: Sales aggregated by (date × product × region) — all 8 combinations pre-computed
    • Allows instant “drill-down” queries
    • Less flexible than raw data for ad-hoc queries

Modern Context (2026)

How has NVMe SSD changed the B-tree vs LSM-tree trade-off?
?

  • Traditional (HDD era): Random I/O ~200x slower than sequential → LSM-tree’s sequential writes huge win
  • NVMe SSD era:
    • Random IOPS: ~1M ops/sec (vs HDD’s ~100)
    • Random vs sequential gap: ~5x (vs HDD’s ~200x)
    • Result: B-tree’s random write penalty is much smaller on NVMe
  • Implications:
    • LSM-tree’s write advantage is less dominant
    • B-tree becomes more competitive for write-heavy workloads on NVMe
    • WiscKey optimization: separate keys (LSM) from values (log file) → reduce compaction overhead
  • Caveat: Write amplification still wears SSDs — matters for total cost

What is the Lakehouse architecture and how does it blur OLTP/OLAP boundaries?
?

  • Lakehouse = Data Lake + Data Warehouse capabilities
  • Formats: Delta Lake, Apache Iceberg, Apache Hudi (all on Parquet files + object storage)
  • Key features:
    • ACID transactions on cloud object storage (S3)
    • Schema enforcement + evolution
    • Time travel (query historical versions)
    • Update/delete support (unlike traditional data lakes)
  • Advantages over pure warehouse: Cheap storage, raw data retained, ML-ready
  • Advantages over pure data lake: ACID, performance, governance
  • 2026: Iceberg + Trino / Spark / DuckDB = dominant open analytics stack

Interview Scenarios

A write-heavy application is experiencing slow insert performance. How do you diagnose and fix it?
?
Diagnosis:

  1. Check if using B-tree storage — high write amplification may be bottleneck
  2. Check LSM-tree compaction lag — if writes faster than compaction → levels pile up
  3. Check disk I/O saturation — are writes bottlenecked on physical I/O?
  4. Check write-ahead log — synchronous fsync on every write?

Solutions:

  • Switch to LSM-tree engine (RocksDB/Cassandra) if workload is append-heavy
  • Tune LSM compaction: increase L0 file count threshold, use tiered compaction
  • Batch writes (reduce number of fsync calls)
  • Use NVMe SSD to reduce I/O latency
  • For analytics writes: load via bulk ETL, not row-by-row inserts

Your analytics team says reports take hours to run. The data is in PostgreSQL. What do you recommend?
?
Problem: PostgreSQL (row store, OLTP) is not designed for large analytical scans

Solution: Extract data into a proper OLAP system

Options:

  1. Cloud data warehouse: Snowflake/BigQuery/Redshift
    • ETL PostgreSQL → warehouse (dbt + Fivetran/Airbyte)
    • Pay-per-query, columnar, highly scalable
  2. In-process OLAP: DuckDB
    • Query Parquet files directly on cheap storage
    • No server needed, excellent for moderate data sizes
  3. HTAP: Migrate to TiDB (if real-time analytics needed)

Quick win: Add proper indexes + query optimization in PostgreSQL first — often 10-100x speedup before needing a warehouse

Quick Facts

What is the typical branching factor of a B-tree and what depth does that support?
?

  • Branching factor: ~500 children per page (typical, with 4KB pages and ~8-byte keys)
  • Height: 4 levels with branching factor 500
    • Level 1 (root): 1 page
    • Level 2: 500 pages
    • Level 3: 250,000 pages
    • Level 4 (leaves): 125 million pages × 4KB = ~500GB per leaf level
  • Real-world: 3-4 levels supports most databases in practice
  • Key insight: Only a few disk seeks per lookup even for huge datasets

What are the key metrics that separate LSM-tree variants (size-tiered vs leveled compaction)?
?
Size-tiered compaction (HBase, early Cassandra):

  • Similar-sized SSTables grouped together and merged
  • Better write throughput (less frequent compaction)
  • Worse read performance (more files to check), more space amplification

Leveled compaction (LevelDB, RocksDB, modern Cassandra):

  • Each level has bounded total size; L(n) is ~10x larger than L(n-1)
  • Better read performance and space efficiency
  • Higher write amplification (more frequent merges)

TWCS (Time Window Compaction Strategy, Cassandra):

  • For time-series data: compact within time windows
  • Old windows sealed and not recompacted → very efficient for time-series

What does 99th percentile disk read latency look like for B-trees vs LSM-trees on NVMe?
?

  • B-tree read (NVMe): ~100-500 microseconds (4-level tree = 4 random reads, ~25µs each on NVMe)
  • LSM-tree read (worst case, no Bloom filter): Check memtable + L0 + L1 + L2 + … → 5-10ms
  • LSM-tree read (with Bloom filters, key in L1): ~200-800 microseconds
  • Key insight: B-tree provides more predictable read latency; LSM-tree has higher variance
  • For p99 SLAs, B-tree is often safer choice for reads

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13