Chapter 3 Flashcards — Data Models and Query Languages
flashcards ddia-2e chapter3 data-models query-languages
Definitions and Mechanisms
What is the object-relational impedance mismatch?
?
- Definition: The conceptual gap between how OOP languages represent data (objects with nested structures) and how relational DBs store it (flat tables with foreign keys)
- Application objects often have nested arrays/collections; relational requires splitting across multiple tables
- ORMs (Hibernate, ActiveRecord, SQLAlchemy) reduce boilerplate but do not eliminate the conceptual gap
- Document model reduces mismatch for tree-shaped data; JSONB in PostgreSQL is a hybrid compromise
- Matters most for: one-to-many relationships (e.g., user with many positions)
What is the difference between schema-on-write and schema-on-read?
?
- Schema-on-write (relational): DB enforces schema at write time;
ALTER TABLErequired to change structure- Like static type checking: errors caught at insert, data always consistent
- Schema-on-read (document): Schema implicit, interpreted by the application at read time
- Like dynamic type checking: flexible, but bugs surface at read time
- Neither is universally better — choose based on data uniformity and rate of schema change
- Schema-on-read is advantageous when: data is heterogeneous, structure is externally controlled, or items have varying fields
What is a property graph and what are its two building blocks?
?
Property graph model (Neo4j, Amazon Neptune, TigerGraph):
Vertices (nodes) have:
- Unique identifier
- Set of outgoing and incoming edges
- Key-value properties (e.g.,
{name: "Alice", age: 30})
Edges have:
- Unique identifier
- Tail vertex (start) and head vertex (end)
- A label describing the relationship type (e.g.,
BORN_IN,KNOWS) - Key-value properties (e.g.,
{since: 2020})
Key property: Any vertex can have an edge to any other vertex — no schema restriction on connections
What is a triple store and how does it represent data?
?
- Triple store: All information stored as three-part statements: (subject, predicate, object)
- Subject: A vertex (entity being described)
- Predicate: An edge label or property key
- Object: Another vertex (relationship) OR a literal value (property)
- Examples:
(Jim, age, 33)→ property (object is a value)(Jim, bornIn, Idaho)→ relationship (object is a vertex)(Idaho, type, USState)→ type assertion
- Standard: RDF (Resource Description Framework) uses URIs as identifiers
- Query language: SPARQL
- Systems: AllegroGraph, Apache Jena, Datomic, Wikidata
What is Event Sourcing and how does it differ from traditional CRUD?
?
Event Sourcing: Store the full sequence of events that led to a state, rather than the current state itself.
CRUD: account.balance = 1000 (mutable state, history lost)
Event Sourcing:
AccountOpened { initial_balance: 0 }
MoneyDeposited { amount: 1500 }
MoneyWithdrawn { amount: 500 }
Current state = 0 + 1500 - 500 = 1000 (derived by replaying events)
Key benefits:
- Complete audit trail built in
- Time-travel queries: what was the balance on date X?
- Multiple projections from the same event stream
- Debugging: reproduce exact state that caused a bug
Key costs:
- Storage grows unboundedly (events are immutable, never deleted)
- Eventual consistency between write and read models
- Schema evolution for events is complex (must handle old event formats forever)
What is CQRS and how does it pair with Event Sourcing?
?
CQRS = Command Query Responsibility Segregation
Core idea: Separate the write model (commands that validate and emit events) from the read model (projections optimized for queries).
Write side: Receives commands → validates → emits events to the event store (append-only)
Read side: Consumes the event stream → maintains denormalized, query-optimized views (projections)
Why pair with Event Sourcing?
- Event store is the single source of truth (write side)
- Multiple projections can be derived from the same events (read side)
- Read model can be rebuilt from scratch by replaying events
Trade-off: The read model is eventually consistent — a brief lag between writing an event and the projection updating. Synchronous reads against the event log are possible but slower.
What is a star schema and what are its components?
?
Star schema: The standard analytical (OLAP) schema pattern.
Components:
- Fact table: Center of the star. One row per business event (sale, click, order). Very wide (many foreign keys + numeric measures), very tall (billions of rows). Examples:
fact_sales,fact_pageviews - Dimension tables: Describe the who/what/when/where. Smaller tables with human-readable attributes. Examples:
dim_customer,dim_product,dim_date,dim_store
Why it’s important:
- BI tools (Tableau, Looker, Power BI) assume and generate queries for star schema
- dbt transformations typically produce star/snowflake schemas
- Star is preferred over snowflake (normalized dimensions) because column stores make joins cheap
Snowflake schema: Dimension tables are further normalized into sub-dimensions. More normalized, more joins required.
What is GraphQL and how does it differ from REST?
?
GraphQL: A client-defined query language for APIs (not databases). Created by Facebook, open-sourced 2015.
Problem it solves:
- Over-fetching (REST): Endpoint returns fixed payload; client receives unnecessary fields
- Under-fetching (REST): Getting user + posts requires multiple round trips
- Tight coupling: REST API changes require versioning
How GraphQL solves it:
- Client specifies exactly which fields it needs in a single request
- Server resolves query against any backend (relational, document, microservices)
- Schema (SDL) is required and introspectable
Key distinction: GraphQL is an API layer, not a storage layer. It sits between the client and whatever database you’re using.
Also supports: Mutations (writes) and Subscriptions (real-time events)
Trade-offs and Comparisons
When should you use a document model vs relational model?
?
Use Document model when:
- Data is tree-shaped (one-to-many hierarchies with few cross-references)
- Schema flexibility needed: different documents have different shapes
- Data locality needed: reads require most fields of a document
- External systems control data structure (can’t enforce schema)
- Example: CMS content, user profiles, product catalogs
Use Relational model when:
- Data has many-to-many relationships
- Complex joins and aggregate queries are common
- Referential integrity and consistency are critical
- Reporting and analytics over normalized data
- Example: ERP, banking, inventory, CRM
Warning signs you chose wrong:
- Document DB: app code is full of multi-query joins
- Relational DB: deeply nested objects requiring 6+ joins for a single read
When does the graph model pay off over relational?
?
Use Graph model when:
- Traversal is the primary query: “find all nodes reachable within N hops”
- Relationships are as important as the data itself
- Data has many-to-many relationships with no clear center
- Use cases: social networks, fraud detection, knowledge graphs, recommendation engines, road networks, access control
Do NOT use Graph model when:
- Data is mostly tree-shaped or has few cross-entity relationships
- Most queries are aggregations (COUNT, SUM, AVG) — relational handles this better
- Simple CRUD operations dominate — graph DB overhead is unnecessary
The SQL graph problem: SQL can express graph queries with recursive CTEs, but they are verbose and poorly optimized. Cypher (Neo4j) is designed for this and is significantly more concise and performant.
What are the trade-offs of Event Sourcing?
?
| Aspect | Benefit | Cost |
|---|---|---|
| Audit trail | Complete, immutable history | Storage grows without bound |
| Temporal queries | Replay to any point in time | Must replay many events (can be slow) |
| Multiple views | Same events → different projections | Each projection adds maintenance |
| Debugging | Reproduce exact state | More complex to set up initially |
| Consistency | Write is append-only (simple) | Read model is eventually consistent |
| Schema evolution | Events are immutable | Upcasting/versioning events is hard |
Use Event Sourcing when: audit requirements are strict (finance, healthcare, legal), temporal queries are frequent, multiple heterogeneous consumers need different views.
Avoid when: simple CRUD app, small team unfamiliar with pattern, synchronous consistency is required.
GraphQL vs REST vs gRPC — key differences?
?
| Feature | REST | GraphQL | gRPC |
|---|---|---|---|
| Over-fetching | Common | Eliminated | No |
| Multiple round trips | Yes (N+1 common) | Single request | No (streaming) |
| Schema | Optional | Required (SDL) | Required (.proto) |
| HTTP caching | Native (GET) | Difficult (POST) | No HTTP cache |
| Real-time | WebSocket add-on | Subscriptions native | Server streaming native |
| Browser support | Native | Native | Needs grpc-web |
| Best for | Public CRUD APIs | Rich UIs, mobile | Internal microservices |
Rule of thumb: Use REST for simple public APIs, GraphQL for data-rich frontends and BFF patterns, gRPC for internal service-to-service communication.
What is the relationship between normalization and many-to-one vs many-to-many relationships?
?
-
Many-to-one (normalization): Many records point to one canonical value stored once (e.g., many users → one city). Foreign key ID never changes even if the human-readable value does.
- Pro: Single source of truth; updates in one place; no duplication
- Con: JOIN required at read time
-
Many-to-many (complex): Many records relate to many other records (users ↔ roles, authors ↔ books). Requires a join table in relational or embedding/referencing in document.
- Document model struggles: Must either embed (duplicates data) or reference (requires app-level JOIN)
- Relational model excels: JOIN TABLE + foreign keys handle this natively
- Graph model excels: Edges directly model the relationship
Numbers and Precision
What are the key characteristics of a fact table in a star schema?
?
Fact table characteristics:
- One row per business event: one row per sale, order, click, page view
- Very wide: 50-200+ columns (foreign keys to all dimensions + numeric measures)
- Very tall: Billions to trillions of rows in large data warehouses
- Immutable: Events happened; fact rows are rarely (never) updated
- Foreign keys: Point to all dimension tables (customer_id, product_id, date_id, store_id)
- Measures: Numeric values to aggregate (quantity, revenue, duration, bytes_transferred)
Column-oriented storage is ideal for fact tables because queries typically read 3-5 columns out of 100+; column storage reads only needed columns.
What is Datalog and what does it add over SQL for graph queries?
?
Datalog (1970s, foundation of Datomic):
- Defines facts and rules recursively
- Rules can reference themselves for recursive traversal
- More compact than recursive SQL CTEs
Key capability: Recursive rules allow expressing “find all ancestors” as one rule, not a fixed-depth self-join:
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).SQL equivalent requires a recursive CTE and is much more verbose.
Systems using Datalog-style queries: Datomic (Clojure), Cascalog, LogicBlox
Practical significance: Datalog influenced the design of modern graph query standards like GQL (ISO Graph Query Language, 2024).
Application and Failure Modes
You’re designing an e-commerce system. Which data models would you use for which components?
?
Relational (PostgreSQL):
- Orders, order line items, payments — need ACID, many-to-many (orders ↔ products)
- Customer accounts, addresses, payment methods
- Inventory counts — need strong consistency
Document (MongoDB or PostgreSQL JSONB):
- Product catalog — heterogeneous schema (different product types have different attributes)
- User sessions and cart state — tree-shaped, schema evolves frequently
Graph (Neo4j or Neptune):
- Product recommendations (“customers who bought X also bought Y”)
- Fraud detection (shared devices/addresses between accounts)
Event Sourcing (Kafka):
- Order lifecycle events (OrderPlaced, PaymentProcessed, OrderShipped)
- Audit trail for compliance
- Feeds real-time dashboards and analytics pipelines
Key principle: Different components of the same system have different data shapes — use the right model per component.
What are the failure modes of using a document database for data with complex relationships?
?
- Application-level join explosion: Fetching an order with its customer, products, and their categories requires 4+ sequential queries; no server-side optimizer helps
- Denormalization duplication: Storing product name in every order means a product rename requires updating thousands of orders
- Inconsistent data: Concurrent updates to denormalized copies cause stale or conflicting data
- Loss of referential integrity: Document DBs don’t enforce foreign key constraints; orphaned references silently corrupt data
- N+1 query problem: Loading a list of orders and then loading each order’s customer = 1 + N queries; explodes at scale
Historical context: This is exactly the problem that plagued the pre-relational hierarchical and network models before Codd’s relational model.
What happens when an Event Sourcing system needs to change the schema of an event?
?
The core problem: Events are immutable and must be stored forever. Old events cannot be modified. But new code may expect different fields.
Solutions:
- Upcasting: When replaying, apply a function that transforms old event format to new format on the fly
- Example:
OrderPlaced_v1→OrderPlaced_v2(add missing fields with defaults)
- Example:
- Multiple event versions: Keep old and new event types; handlers support both versions
- Snapshot: Periodically store a full state snapshot so you don’t need to replay events from the beginning
- Event versioning from day one: Include a
schema_versionfield in every event; design upgrade paths proactively
Key lesson: Event schema evolution is one of the hardest operational challenges in event sourcing — it must be planned upfront, not retrofitted.
How does GraphQL’s N+1 query problem arise and how is it solved?
?
The N+1 problem in GraphQL:
query { users { name posts { title } } }Naive implementation:
- Query:
SELECT * FROM users→ returns 100 users - For each user:
SELECT * FROM posts WHERE user_id = ?→ 100 queries
= 1 + 100 = 101 queries
Solution: The DataLoader pattern (Facebook’s open-source solution):
- Batch all IDs collected during a single request
- Execute one query:
SELECT * FROM posts WHERE user_id IN (1, 2, ..., 100) - Cache results within the request
Implementation: dataloader library (Node.js), strawberry-graphql DataLoader (Python), absinthe batching (Elixir)
Key insight: The N+1 problem exists in any API layer (REST also suffers from it); GraphQL makes it more visible and forces you to solve it explicitly.
Modern Context
How has event sourcing infrastructure evolved by 2026?
?
Mainstream platforms:
- Apache Kafka: De facto standard event log for large-scale systems; log compaction enables infinite retention
- EventStoreDB: Purpose-built event store with native projection support and subscriptions
- Axon Framework (Java): Full CQRS + Event Sourcing framework with Axon Server
- Temporal.io: Durable execution platform built on event sourcing principles (used at Stripe, Netflix)
- Restate: New durable execution framework using event sourcing for workflow state
Patterns that have emerged:
- Outbox pattern: Write events to a DB table and Kafka atomically to avoid dual-write failures
- Event-carried state transfer: Events carry enough data for consumers to update projections without querying the source
- Log compaction: Kafka’s built-in feature that retains only the latest value per key (tombstoning)
Trend: Event sourcing is no longer niche — it’s standard for financial systems, logistics, and any audit-heavy domain.
What is the role of Apache Arrow in the modern data stack?
?
Apache Arrow: A language-independent columnar in-memory format and IPC (inter-process communication) protocol.
Problems it solves:
- Data exchange between pandas, Spark, DuckDB, Polars, and other tools required serialization/deserialization overhead
- Each tool had its own internal format → copying data between tools was expensive
Arrow provides:
- Zero-copy reads: Tools share the same memory buffer without copying
- Columnar layout: Optimal for vectorized processing and SIMD instructions
- Language bindings: Python, Java, C++, Rust, Go, R all speak Arrow natively
Impact: DuckDB reads Arrow format directly; Pandas 2.0 uses Arrow as its optional backend; Apache Spark 3.x uses Arrow for Python UDF execution. Arrow is becoming the universal exchange format for the analytical data stack.
Total Cards: 28
Review Time: ~25 minutes
Priority: HIGH
Last Updated: 2026-05-29