Chapter 3 Flashcards — Data Models and Query Languages

flashcards ddia-2e chapter3 data-models query-languages

Definitions and Mechanisms

What is the object-relational impedance mismatch?
?

Definition: The conceptual gap between how OOP languages represent data (objects with nested structures) and how relational DBs store it (flat tables with foreign keys)
Application objects often have nested arrays/collections; relational requires splitting across multiple tables
ORMs (Hibernate, ActiveRecord, SQLAlchemy) reduce boilerplate but do not eliminate the conceptual gap
Document model reduces mismatch for tree-shaped data; JSONB in PostgreSQL is a hybrid compromise
Matters most for: one-to-many relationships (e.g., user with many positions)

What is the difference between schema-on-write and schema-on-read?
?

Schema-on-write (relational): DB enforces schema at write time; ALTER TABLE required to change structure
- Like static type checking: errors caught at insert, data always consistent
Schema-on-read (document): Schema implicit, interpreted by the application at read time
- Like dynamic type checking: flexible, but bugs surface at read time
Neither is universally better — choose based on data uniformity and rate of schema change
Schema-on-read is advantageous when: data is heterogeneous, structure is externally controlled, or items have varying fields

What is a property graph and what are its two building blocks?
?
Property graph model (Neo4j, Amazon Neptune, TigerGraph):

Vertices (nodes) have:

Unique identifier
Set of outgoing and incoming edges
Key-value properties (e.g., {name: "Alice", age: 30})

Edges have:

Unique identifier
Tail vertex (start) and head vertex (end)
A label describing the relationship type (e.g., BORN_IN, KNOWS)
Key-value properties (e.g., {since: 2020})

Key property: Any vertex can have an edge to any other vertex — no schema restriction on connections

What is a triple store and how does it represent data?
?

Triple store: All information stored as three-part statements: (subject, predicate, object)
Subject: A vertex (entity being described)
Predicate: An edge label or property key
Object: Another vertex (relationship) OR a literal value (property)
Examples:
- (Jim, age, 33) → property (object is a value)
- (Jim, bornIn, Idaho) → relationship (object is a vertex)
- (Idaho, type, USState) → type assertion
Standard: RDF (Resource Description Framework) uses URIs as identifiers
Query language: SPARQL
Systems: AllegroGraph, Apache Jena, Datomic, Wikidata

What is Event Sourcing and how does it differ from traditional CRUD?
?
Event Sourcing: Store the full sequence of events that led to a state, rather than the current state itself.

CRUD: account.balance = 1000 (mutable state, history lost)

Event Sourcing:

AccountOpened  { initial_balance: 0    }
MoneyDeposited { amount: 1500          }
MoneyWithdrawn { amount: 500           }

Current state = 0 + 1500 - 500 = 1000 (derived by replaying events)

Key benefits:

Complete audit trail built in
Time-travel queries: what was the balance on date X?
Multiple projections from the same event stream
Debugging: reproduce exact state that caused a bug

Key costs:

Storage grows unboundedly (events are immutable, never deleted)
Eventual consistency between write and read models
Schema evolution for events is complex (must handle old event formats forever)

What is CQRS and how does it pair with Event Sourcing?
?
CQRS = Command Query Responsibility Segregation

Core idea: Separate the write model (commands that validate and emit events) from the read model (projections optimized for queries).

Write side: Receives commands → validates → emits events to the event store (append-only)
Read side: Consumes the event stream → maintains denormalized, query-optimized views (projections)

Why pair with Event Sourcing?

Event store is the single source of truth (write side)
Multiple projections can be derived from the same events (read side)
Read model can be rebuilt from scratch by replaying events

Trade-off: The read model is eventually consistent — a brief lag between writing an event and the projection updating. Synchronous reads against the event log are possible but slower.

What is a star schema and what are its components?
?
Star schema: The standard analytical (OLAP) schema pattern.

Components:

Fact table: Center of the star. One row per business event (sale, click, order). Very wide (many foreign keys + numeric measures), very tall (billions of rows). Examples: fact_sales, fact_pageviews
Dimension tables: Describe the who/what/when/where. Smaller tables with human-readable attributes. Examples: dim_customer, dim_product, dim_date, dim_store

Why it’s important:

BI tools (Tableau, Looker, Power BI) assume and generate queries for star schema
dbt transformations typically produce star/snowflake schemas
Star is preferred over snowflake (normalized dimensions) because column stores make joins cheap

Snowflake schema: Dimension tables are further normalized into sub-dimensions. More normalized, more joins required.

What is GraphQL and how does it differ from REST?
?
GraphQL: A client-defined query language for APIs (not databases). Created by Facebook, open-sourced 2015.

Problem it solves:

Over-fetching (REST): Endpoint returns fixed payload; client receives unnecessary fields
Under-fetching (REST): Getting user + posts requires multiple round trips
Tight coupling: REST API changes require versioning

How GraphQL solves it:

Client specifies exactly which fields it needs in a single request
Server resolves query against any backend (relational, document, microservices)
Schema (SDL) is required and introspectable

Key distinction: GraphQL is an API layer, not a storage layer. It sits between the client and whatever database you’re using.

Also supports: Mutations (writes) and Subscriptions (real-time events)

Trade-offs and Comparisons

When should you use a document model vs relational model?
?
Use Document model when:

Data is tree-shaped (one-to-many hierarchies with few cross-references)
Schema flexibility needed: different documents have different shapes
Data locality needed: reads require most fields of a document
External systems control data structure (can’t enforce schema)
Example: CMS content, user profiles, product catalogs

Use Relational model when:

Data has many-to-many relationships
Complex joins and aggregate queries are common
Referential integrity and consistency are critical
Reporting and analytics over normalized data
Example: ERP, banking, inventory, CRM

Warning signs you chose wrong:

Document DB: app code is full of multi-query joins
Relational DB: deeply nested objects requiring 6+ joins for a single read

When does the graph model pay off over relational?
?
Use Graph model when:

Traversal is the primary query: “find all nodes reachable within N hops”
Relationships are as important as the data itself
Data has many-to-many relationships with no clear center
Use cases: social networks, fraud detection, knowledge graphs, recommendation engines, road networks, access control

Do NOT use Graph model when:

Data is mostly tree-shaped or has few cross-entity relationships
Most queries are aggregations (COUNT, SUM, AVG) — relational handles this better
Simple CRUD operations dominate — graph DB overhead is unnecessary

The SQL graph problem: SQL can express graph queries with recursive CTEs, but they are verbose and poorly optimized. Cypher (Neo4j) is designed for this and is significantly more concise and performant.

What are the trade-offs of Event Sourcing?
?

Aspect	Benefit	Cost
Audit trail	Complete, immutable history	Storage grows without bound
Temporal queries	Replay to any point in time	Must replay many events (can be slow)
Multiple views	Same events → different projections	Each projection adds maintenance
Debugging	Reproduce exact state	More complex to set up initially
Consistency	Write is append-only (simple)	Read model is eventually consistent
Schema evolution	Events are immutable	Upcasting/versioning events is hard

Use Event Sourcing when: audit requirements are strict (finance, healthcare, legal), temporal queries are frequent, multiple heterogeneous consumers need different views.

Avoid when: simple CRUD app, small team unfamiliar with pattern, synchronous consistency is required.

GraphQL vs REST vs gRPC — key differences?
?

Feature	REST	GraphQL	gRPC
Over-fetching	Common	Eliminated	No
Multiple round trips	Yes (N+1 common)	Single request	No (streaming)
Schema	Optional	Required (SDL)	Required (.proto)
HTTP caching	Native (GET)	Difficult (POST)	No HTTP cache
Real-time	WebSocket add-on	Subscriptions native	Server streaming native
Browser support	Native	Native	Needs grpc-web
Best for	Public CRUD APIs	Rich UIs, mobile	Internal microservices

Rule of thumb: Use REST for simple public APIs, GraphQL for data-rich frontends and BFF patterns, gRPC for internal service-to-service communication.

What is the relationship between normalization and many-to-one vs many-to-many relationships?
?

Many-to-one (normalization): Many records point to one canonical value stored once (e.g., many users → one city). Foreign key ID never changes even if the human-readable value does.
- Pro: Single source of truth; updates in one place; no duplication
- Con: JOIN required at read time
Many-to-many (complex): Many records relate to many other records (users ↔ roles, authors ↔ books). Requires a join table in relational or embedding/referencing in document.
- Document model struggles: Must either embed (duplicates data) or reference (requires app-level JOIN)
- Relational model excels: JOIN TABLE + foreign keys handle this natively
- Graph model excels: Edges directly model the relationship

Numbers and Precision

What are the key characteristics of a fact table in a star schema?
?
Fact table characteristics:

One row per business event: one row per sale, order, click, page view
Very wide: 50-200+ columns (foreign keys to all dimensions + numeric measures)
Very tall: Billions to trillions of rows in large data warehouses
Immutable: Events happened; fact rows are rarely (never) updated
Foreign keys: Point to all dimension tables (customer_id, product_id, date_id, store_id)
Measures: Numeric values to aggregate (quantity, revenue, duration, bytes_transferred)

Column-oriented storage is ideal for fact tables because queries typically read 3-5 columns out of 100+; column storage reads only needed columns.

What is Datalog and what does it add over SQL for graph queries?
?
Datalog (1970s, foundation of Datomic):

Defines facts and rules recursively
Rules can reference themselves for recursive traversal
More compact than recursive SQL CTEs

Key capability: Recursive rules allow expressing “find all ancestors” as one rule, not a fixed-depth self-join:

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

SQL equivalent requires a recursive CTE and is much more verbose.

Systems using Datalog-style queries: Datomic (Clojure), Cascalog, LogicBlox

Practical significance: Datalog influenced the design of modern graph query standards like GQL (ISO Graph Query Language, 2024).

Application and Failure Modes

You’re designing an e-commerce system. Which data models would you use for which components?
?
Relational (PostgreSQL):

Orders, order line items, payments — need ACID, many-to-many (orders ↔ products)
Customer accounts, addresses, payment methods
Inventory counts — need strong consistency

Document (MongoDB or PostgreSQL JSONB):

Product catalog — heterogeneous schema (different product types have different attributes)
User sessions and cart state — tree-shaped, schema evolves frequently

Graph (Neo4j or Neptune):

Product recommendations (“customers who bought X also bought Y”)
Fraud detection (shared devices/addresses between accounts)

Event Sourcing (Kafka):

Order lifecycle events (OrderPlaced, PaymentProcessed, OrderShipped)
Audit trail for compliance
Feeds real-time dashboards and analytics pipelines

Key principle: Different components of the same system have different data shapes — use the right model per component.

What are the failure modes of using a document database for data with complex relationships?
?

Application-level join explosion: Fetching an order with its customer, products, and their categories requires 4+ sequential queries; no server-side optimizer helps
Denormalization duplication: Storing product name in every order means a product rename requires updating thousands of orders
Inconsistent data: Concurrent updates to denormalized copies cause stale or conflicting data
Loss of referential integrity: Document DBs don’t enforce foreign key constraints; orphaned references silently corrupt data
N+1 query problem: Loading a list of orders and then loading each order’s customer = 1 + N queries; explodes at scale

Historical context: This is exactly the problem that plagued the pre-relational hierarchical and network models before Codd’s relational model.

What happens when an Event Sourcing system needs to change the schema of an event?
?
The core problem: Events are immutable and must be stored forever. Old events cannot be modified. But new code may expect different fields.

Solutions:

Upcasting: When replaying, apply a function that transforms old event format to new format on the fly
- Example: OrderPlaced_v1 → OrderPlaced_v2 (add missing fields with defaults)
Multiple event versions: Keep old and new event types; handlers support both versions
Snapshot: Periodically store a full state snapshot so you don’t need to replay events from the beginning
Event versioning from day one: Include a schema_version field in every event; design upgrade paths proactively

Key lesson: Event schema evolution is one of the hardest operational challenges in event sourcing — it must be planned upfront, not retrofitted.

How does GraphQL’s N+1 query problem arise and how is it solved?
?
The N+1 problem in GraphQL:

query { users { name posts { title } } }

Naive implementation:

Query: SELECT * FROM users → returns 100 users
For each user: SELECT * FROM posts WHERE user_id = ? → 100 queries
= 1 + 100 = 101 queries

Solution: The DataLoader pattern (Facebook’s open-source solution):

Batch all IDs collected during a single request
Execute one query: SELECT * FROM posts WHERE user_id IN (1, 2, ..., 100)
Cache results within the request

Implementation: dataloader library (Node.js), strawberry-graphql DataLoader (Python), absinthe batching (Elixir)

Key insight: The N+1 problem exists in any API layer (REST also suffers from it); GraphQL makes it more visible and forces you to solve it explicitly.

Modern Context

How has event sourcing infrastructure evolved by 2026?
?
Mainstream platforms:

Apache Kafka: De facto standard event log for large-scale systems; log compaction enables infinite retention
EventStoreDB: Purpose-built event store with native projection support and subscriptions
Axon Framework (Java): Full CQRS + Event Sourcing framework with Axon Server
Temporal.io: Durable execution platform built on event sourcing principles (used at Stripe, Netflix)
Restate: New durable execution framework using event sourcing for workflow state

Patterns that have emerged:

Outbox pattern: Write events to a DB table and Kafka atomically to avoid dual-write failures
Event-carried state transfer: Events carry enough data for consumers to update projections without querying the source
Log compaction: Kafka’s built-in feature that retains only the latest value per key (tombstoning)

Trend: Event sourcing is no longer niche — it’s standard for financial systems, logistics, and any audit-heavy domain.

What is the role of Apache Arrow in the modern data stack?
?
Apache Arrow: A language-independent columnar in-memory format and IPC (inter-process communication) protocol.

Problems it solves:

Data exchange between pandas, Spark, DuckDB, Polars, and other tools required serialization/deserialization overhead
Each tool had its own internal format → copying data between tools was expensive

Arrow provides:

Zero-copy reads: Tools share the same memory buffer without copying
Columnar layout: Optimal for vectorized processing and SIMD instructions
Language bindings: Python, Java, C++, Rust, Go, R all speak Arrow natively

Impact: DuckDB reads Arrow format directly; Pandas 2.0 uses Arrow as its optional backend; Apache Spark 3.x uses Arrow for Python UDF execution. Arrow is becoming the universal exchange format for the analytical data stack.

Total Cards: 28
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch03-flashcards