Chapter 3: Data Models and Query Languages

ddia-2e data-models query-languages relational document graph graphql event-sourcing cqrs

Status: Notes complete

Overview

Data models are the most fundamental architectural decision in any software system — they determine what is easy to express, what is hard, and what is impossible. This chapter surveys the full landscape of data models as of 2026: relational, document, graph, GraphQL, event sourcing (CQRS), and analytical schemas (star/snowflake). The central insight is that each model optimizes for a different shape of data and a different access pattern, and matching model to workload is more important than any other database choice. The 2nd edition significantly expands on the 1st by adding GraphQL, Event Sourcing/CQRS, DataFrames, and analytical star/snowflake schemas, reflecting how diverse modern data pipelines have become.

Key Concepts

The Object-Relational Mismatch

Relational databases (SQL, Edgar Codd 1970) organize data into relations (tables) of tuples (rows). All values in a row are atomic; relationships between tables are encoded as foreign keys joined at query time.

The impedance mismatch: Application code uses objects — classes with nested collections, inheritance, and polymorphism. Relational tables are flat. Converting between the two requires either:

ORM frameworks (Hibernate, ActiveRecord, SQLAlchemy) — reduce boilerplate but cannot eliminate the conceptual gap
Manual mapping — explicit, verbose, but gives full control
JSONB columns in PostgreSQL — store nested data inside a relational row, a hybrid approach

The mismatch matters most for one-to-many relationships: a user with many resume positions is a natural JSON array, but must be split across two tables with a foreign key in the relational model.

-- Relational: two tables required
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);
CREATE TABLE positions (id INT, user_id INT REFERENCES users(id), title TEXT, company TEXT);
 
-- Document: natural nesting
{ "id": 1, "name": "Alice", "positions": [
    { "title": "Engineer", "company": "Acme" },
    { "title": "Lead", "company": "Globex" }
  ]
}

Normalization, Denormalization, and Joins

Normalization eliminates duplication by storing each fact once and referencing it by ID. In a normalized schema, a city name is stored in a cities table; all records reference the city by ID. Benefits:

Updates in one place (city renames don’t require scanning every row)
IDs are machine-friendly and never change even if the human-readable value changes
No risk of inconsistent copies of the same fact

Denormalization stores redundant copies of data to avoid joins at read time. A denormalized row stores the city name directly. Benefits:

Faster reads (no join needed)
Simpler queries

Trade-off rule of thumb: Normalize first for correctness and maintainability. Denormalize only after you have measured a join bottleneck, and be prepared to handle cache-invalidation complexity.

Joins in document databases: MongoDB, CouchDB, and similar databases do not support server-side joins natively (or only support limited forms). When a document references another document, the application must issue multiple queries and join in application code — exactly the problem the relational model solved in 1970. Modern MongoDB supports $lookup for aggregation joins, but it is not as flexible or performant as SQL JOIN.

Many-to-One and Many-to-Many Relationships

Many-to-one: Many records refer to a single canonical value (e.g., many users live in the same city). Naturally expressed as a foreign key in relational; awkward in document model (requires app-level join).

Many-to-many: Many records relate to many other records (e.g., users and skills, authors and books). This is where relational databases shine and document databases struggle most. The document model forces a choice between:

Embedding (denormalize): duplicate the related data in every document — update complexity grows
Referencing (normalize): store only the ID — requires app-level join across multiple queries

Historical context: The network model (CODASYL, 1970s) and hierarchical model (IMS) both failed to handle many-to-many cleanly. The relational model solved this by introducing the query optimizer — programmers no longer needed to specify access paths manually. Document databases partially recreate the pre-relational problem for data with complex relationships.

Stars and Snowflakes: Schemas for Analytics

OLAP (Online Analytical Processing) databases use dimensional schemas designed for fast aggregate queries over large datasets. Two main patterns:

Star schema: A central fact table surrounded by dimension tables.

Fact table: Records individual events (one row per sale, click, order). Very wide (dozens of columns), very tall (billions of rows). Contains foreign keys to dimensions and numeric measures.
Dimension tables: Describe the who/what/when/where of an event (customer, product, date, store). Smaller, human-readable.

        dim_customer ←── fact_sales ──→ dim_product
                              │
                         dim_date
                              │
                         dim_store

Snowflake schema: Dimension tables are further normalized into sub-dimensions (e.g., dim_product references a separate dim_category table). More normalized but requires more joins; star schema is preferred in most OLAP tools because column-oriented storage makes joins relatively cheap.

Why star/snowflake matter: Most BI tools (Tableau, Looker, dbt) generate SQL that assumes a star schema. Understanding this structure helps in designing data warehouses and writing efficient analytical queries.

When to Use Which Model

Criterion	Relational	Document	Graph
Data shape	Tabular, uniform rows	Tree/nested, heterogeneous	Highly interconnected
Primary operation	Joins, aggregates	Load whole document	Traversal
Many-to-many	Native (JOIN)	Difficult (app join)	Native (traversal)
Schema flexibility	Low (schema-on-write)	High (schema-on-read)	Medium
Consistency guarantees	Strong (ACID)	Depends	Depends
Best fit	ERP, CRM, banking	CMS, catalog, user profiles	Social graphs, fraud detection, knowledge graphs
Examples	PostgreSQL, MySQL, Oracle	MongoDB, CouchDB, DynamoDB	Neo4j, Amazon Neptune, Dgraph

Decision heuristic:

If your data is mostly tree-shaped (one-to-many hierarchies with few cross-references) → Document
If your data has complex many-to-many relationships or you need powerful aggregation → Relational
If traversing relationships is the primary query (find all nodes within N hops) → Graph
If reading and writing events is primary, with state derived from history → Event Sourcing

Graph-Like Data Models

Property Graphs

The property graph model is the most widely deployed graph model (Neo4j, Amazon Neptune, TigerGraph). Two building blocks:

Vertices (nodes):

Unique identifier
Set of outgoing edges
Set of incoming edges
Collection of key-value properties

Edges:

Unique identifier
Tail vertex (start of edge)
Head vertex (end of edge)
Label describing the relationship type
Collection of key-value properties

Key design properties:

Any vertex can have an edge to any other vertex — no schema enforced on connections
Edges can carry rich metadata (properties) alongside the label
The same data can be traversed in both directions efficiently
Multiple different kinds of information can be stored in one graph (heterogeneous schema)

(Alice)-[:BORN_IN {year: 1985}]->(France)
(Alice)-[:LIVES_IN]->(London)
(Alice)-[:WORKS_AT {since: 2020}]->(Acme Corp)
(Bob)-[:KNOWS {since: 2015}]->(Alice)

The power of this model: the same query language (Cypher) handles social relationships, geographic data, organizational hierarchies, and knowledge graphs — all coexisting in one graph.

The Cypher Query Language

Cypher is Neo4j’s declarative, pattern-matching query language for property graphs. It uses ASCII-art notation to express graph patterns.

-- Find people born in France who live in the UK
MATCH (person:Person)-[:BORN_IN]->(country:Country {name: 'France'}),
      (person)-[:LIVES_IN]->(city:City)-[:LOCATED_IN]->(uk:Country {name: 'UK'})
RETURN person.name, city.name
 
-- Find all mutual friends of Alice and Bob
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(mutual)<-[:KNOWS]-(bob:Person {name: 'Bob'})
RETURN mutual.name
 
-- Create a node and relationship
CREATE (charlie:Person {name: 'Charlie', age: 30})
CREATE (charlie)-[:KNOWS {since: 2024}]->(alice)

Cypher vs SQL for graph queries: Consider finding all people who were born in the US and emigrated to Europe. In SQL this requires a self-join on a location table with recursive CTEs or multiple joins. In Cypher:

MATCH (p:Person)-[:BORN_IN]->(:Country {continent: 'North America'}),
      (p)-[:LIVES_IN]->(:Country {continent: 'Europe'})
RETURN p.name

The Cypher version is not only more concise — it is also more readable and maps directly to the conceptual model.

Graph Queries in SQL

Representing a graph in SQL is possible but awkward. You need two tables — one for vertices and one for edges — and then use recursive CTEs for traversal:

-- Find all people reachable from person_id = 1 within 3 hops
WITH RECURSIVE reachable(person_id, depth) AS (
  SELECT 1, 0
  UNION ALL
  SELECT e.head_vertex, r.depth + 1
  FROM edges e
  JOIN reachable r ON e.tail_vertex = r.person_id
  WHERE r.depth < 3
)
SELECT DISTINCT person_id FROM reachable;

SQL:1999 added recursive CTEs for exactly this purpose. However, recursive SQL queries are verbose, hard to optimize, and much less ergonomic than Cypher. This is why specialized graph databases exist.

Triple Stores and SPARQL

The triple-store model represents all information as three-part statements: (subject, predicate, object):

(Jim, age, 33) → property fact (object is a value)
(Jim, bornIn, Idaho) → relationship (object is a vertex/entity)
(Idaho, type, USState) → type assertion

RDF (Resource Description Framework) is the W3C standard format for triple stores. All identifiers are URIs (universally resolvable), enabling linked data across the web:

@prefix schema: <http://schema.org/> .
<http://example.com/people/jim> schema:name "Jim" .
<http://example.com/people/jim> schema:birthPlace <http://example.com/places/idaho> .

SPARQL is the query language for RDF triple stores:

PREFIX schema: <http://schema.org/>
SELECT ?person ?birthplace
WHERE {
  ?person schema:birthPlace ?birthplace .
  ?birthplace a schema:State .
}

Triple stores are used in knowledge graphs (Google’s Knowledge Graph, Wikidata), biomedical databases, and enterprise linked data projects.

Datalog: Recursive Relational Queries

Datalog is the oldest graph-capable query language (1970s), and the inspiration for modern graph query systems. It defines facts and rules recursively:

% Facts
born_in(alice, france).
within(france, europe).
within(europe, western_region).
 
% Rule: recursive ancestry
within_region(X, Y) :- within(X, Y).
within_region(X, Y) :- within(X, Z), within_region(Z, Y).
 
% Query: who was born in a place within Europe?
born_in_region(Person, Region) :-
  born_in(Person, Place),
  within_region(Place, Region).

Datalog is the foundation for Datomic (the database created by Rich Hickey at Cognitect, now acquired by Nubank) and influenced Cascalog (Clojure-based dataflow). Its recursive rules make complex graph traversals compact and composable.

GraphQL

GraphQL is a query language for APIs (not databases directly), created by Facebook in 2012 and open-sourced in 2015. It allows clients to request exactly the data they need — no more, no less.

The REST problem GraphQL solves:

Over-fetching: REST endpoints return fixed payloads; the client gets fields it doesn’t need
Under-fetching: Getting a user’s posts requires multiple round trips (/user/1, then /user/1/posts)
Tight coupling: API changes require versioning; clients break when fields are added/removed

GraphQL vs REST vs RPC:

Feature	REST	GraphQL	gRPC/RPC
Query flexibility	Fixed endpoints	Client-defined queries	Fixed methods
Over-fetching	Common	Eliminated	N/A (typed)
Introspection	No standard	Built-in schema	.proto file
Network efficiency	Multiple round trips	Single request	Streaming
Type system	Optional (OpenAPI)	Required (SDL)	Required (.proto)
Browser support	Native	Native (HTTP)	Limited (grpc-web)
Best for	Public APIs, CRUD	Data-rich UIs, mobile	Internal microservices

# GraphQL schema definition
type User {
  id: ID!
  name: String!
  posts: [Post!]!
}
 
type Post {
  id: ID!
  title: String!
  body: String!
  author: User!
}
 
# Client query (client decides exactly what it needs)
query {
  user(id: "42") {
    name
    posts {
      title
    }
  }
}

Important: GraphQL is a query language for APIs, not a database. A GraphQL server resolves queries against any backend — relational, document, graph, or microservices. It sits at the API layer, not the storage layer.

Subscriptions: GraphQL also supports real-time subscriptions, making it useful for live dashboards and collaborative apps.

Event Sourcing and CQRS

Event Sourcing is a data modeling pattern where instead of storing the current state of an entity, you store the full sequence of events that led to that state. The current state is derived by replaying events.

Banking example (the canonical illustration):

Traditional approach: Store account.balance = 1000

Event sourcing approach: Store the event log:

AccountOpened  { account_id: 1, initial_balance: 0,    timestamp: 2024-01-01 }
MoneyDeposited { account_id: 1, amount: 1500,           timestamp: 2024-01-10 }
MoneyWithdrawn { account_id: 1, amount: 500,            timestamp: 2024-01-15 }

Current balance = replay events: 0 + 1500 - 500 = 1000

Why event sourcing?

Complete audit log: You can answer “what was the balance on Jan 12?” (impossible with mutable state)
Temporal queries: Replay to any point in time
Debugging: Reproduce exact system state that caused a bug
Multiple projections: Same event stream produces different read models (balance view, transaction history view, compliance report)
Decoupling: Downstream systems subscribe to the event stream without coupling to the write model

CQRS (Command Query Responsibility Segregation) is the architectural pattern that naturally pairs with event sourcing. It separates:

Command side (writes): Validates commands, emits events to the event store
Query side (reads): Maintains optimized read models (projections) by consuming the event stream

User Request
     │
     ▼
┌────────────┐     event      ┌──────────────┐
│  Command   │ ─────────────▶ │  Event Store │
│  Handler   │                │  (append-    │
│  (validate)│                │   only log)  │
└────────────┘                └──────┬───────┘
                                     │ stream
                              ┌──────▼───────┐
                              │  Projector   │
                              │  (consumers) │
                              └──────┬───────┘
                                     │ writes
                              ┌──────▼───────┐
                              │  Read Model  │ ◀── Query Side
                              │  (optimized  │     (fast reads)
                              │   for reads) │
                              └──────────────┘

E-commerce example:

Commands: PlaceOrder, CancelOrder, ShipOrder
Events: OrderPlaced, OrderCancelled, OrderShipped
Read projections: “pending orders” view, “fulfillment dashboard”, “revenue report” — each optimized independently

Trade-offs of Event Sourcing / CQRS:

Aspect	Benefit	Cost
Audit trail	Complete, immutable history	Storage grows unboundedly
Temporal queries	Easy time-travel	Must replay many events
Read performance	Projections optimized per use case	Eventual consistency between write and read
Schema evolution	Events are immutable (versioning needed)	Upcasting/event migration is complex
Complexity	Clear separation of concerns	Significantly more moving parts than CRUD

When to use Event Sourcing:

Audit requirements are strict (finance, healthcare, legal)
Temporal queries are frequent (“what was the state at time T?”)
Multiple heterogeneous consumers need the same data in different shapes
Undo/redo is a first-class requirement

When NOT to use Event Sourcing:

Simple CRUD applications without audit requirements
Small teams unfamiliar with the pattern (high learning curve)
When consistency requirements are strict and eventual consistency is unacceptable

Kafka as an event store: Apache Kafka is the most widely deployed event log infrastructure. A Kafka topic with a long retention period functions as an event store; consumers maintain their own projections (materialized views). This is the foundation of modern data mesh and streaming data platforms.

DataFrames, Matrices, and Arrays

Beyond row-oriented and column-oriented relational models, data-intensive applications increasingly work with:

DataFrames: A tabular data structure with named columns and typed data, popularized by pandas (Python) and Apache Spark. DataFrames combine:

Tabular structure (like a relational table)
In-memory columnar layout (like column-oriented storage)
Rich transformation API (map, filter, join, groupBy, window functions)
Lazy evaluation in distributed systems (Spark, Dask)

DataFrames are the primary interface for data engineering, ML feature engineering, and exploratory data analysis. The DataFrame API in Spark is translated into a physical query plan — functionally equivalent to SQL but expressed in a programming language.

# Spark DataFrame — same as SQL but programmatic
df = spark.read.parquet("s3://bucket/events/")
result = (df
  .filter(df.event_type == "purchase")
  .groupBy("customer_id")
  .agg(sum("amount").alias("total_spend"))
  .orderBy("total_spend", ascending=False)
  .limit(100)
)

Matrices and arrays: Scientific computing and ML work with n-dimensional arrays (tensors):

NumPy arrays: Foundation of Python numerical computing
Tensors in PyTorch/TensorFlow: Multi-dimensional arrays with GPU acceleration
Zarr, HDF5, NetCDF: Array storage formats for large scientific datasets (climate data, genomics)

Apache Arrow: A columnar in-memory format and IPC protocol that allows zero-copy data sharing between systems (pandas, Spark, DuckDB, Polars). Arrow is becoming the universal exchange format for analytical data.

Comparison Tables

Full Data Model Comparison

Dimension	Relational	Document	Graph	Event Sourcing	Analytical (Star)
Core abstraction	Tables/rows	JSON documents	Vertices/edges	Immutable events	Fact + dimension tables
Relationships	Foreign keys + JOIN	Embedding or reference	Edges (first-class)	Derived from events	Foreign keys (denorm)
Schema	Strict, on-write	Flexible, on-read	Flexible	Immutable events	Strict
Primary query	SELECT/JOIN/GROUP BY	Find/project document	Traverse graph	Replay/project	Aggregate (SUM, COUNT)
Write pattern	UPDATE in place	Replace document	Add vertices/edges	Append-only	Bulk insert/ETL
Temporal queries	Difficult	Difficult	Difficult	Native (replay)	Slowly Changing Dims
Audit trail	Requires extra work	Requires extra work	Requires extra work	Built-in	Snapshot-based
Scale	Vertical → sharded	Horizontal	Vertical (mostly)	Horizontal (Kafka)	Massively parallel
Examples	PostgreSQL, MySQL	MongoDB, Firestore	Neo4j, Neptune	Kafka+Flink, Axon	Snowflake, BigQuery

GraphQL vs REST vs gRPC

Feature	REST	GraphQL	gRPC
Transport	HTTP/1.1, 2	HTTP/1.1, 2	HTTP/2
Data format	JSON (typically)	JSON	Protocol Buffers
Schema	Optional (OpenAPI)	Required (SDL)	Required (.proto)
Versioning	URL versioning	Schema evolution	.proto evolution
Over-fetching	Yes (common)	No (client selects)	No (typed methods)
Caching	HTTP cache native	Complex (POST)	No HTTP cache
Real-time	WebSocket (add-on)	Subscriptions native	Server streaming native
Browser	Native	Native	Limited (grpc-web)
Learning curve	Low	Medium	Medium-High
Best use	Public CRUD APIs	Rich UIs, mobile apps	Internal microservices

Important Points Summary

Data model = what is easy vs what is hard: The most consequential architectural decision; changing models is expensive.
Relational solved CODASYL’s problem: The query optimizer replaced manual access path navigation — a lesson that declarative beats imperative for data access.
Document model is not schema-less: Schema-on-read moves enforcement to the application layer; it’s a trade-off, not elimination.
Graph databases are justified by traversal: Only reach for a graph DB when the relationships themselves are the primary query object.
GraphQL is at the API layer, not the storage layer: It solves over-fetching and under-fetching but does not replace your database.
Event Sourcing provides temporal queries and audit trails: At the cost of eventual consistency and significant operational complexity.
CQRS separates write and read models: Write models optimize for correctness; read projections optimize for query performance.
Star schemas are the language of data warehouses: BI tools assume dimensional modeling; understanding it is essential for analytics engineering.
DataFrames are converging with SQL: Spark, Polars, DuckDB blur the line between programmatic and declarative data transformation.
Apache Arrow is the universal columnar format: Zero-copy interoperability between pandas, Spark, DuckDB, and Polars in the modern data stack.

Modern Context (2026)

Multi-model databases have become the default:

PostgreSQL handles relational, JSONB (document), PostGIS (geospatial), pgvector (vector), and graph queries (recursive CTEs) in a single system
MongoDB Atlas adds time series, full-text search, and vector search alongside document storage
“Polyglot persistence” (separate DB per model) is being replaced by multi-model databases for most use cases

GraphQL proliferation:

GraphQL is now the dominant API style for internal data APIs and BFF (Backend for Frontend) patterns
Hasura, PostGraphile, and Apollo Federation auto-generate GraphQL APIs from relational schemas
Federation allows stitching multiple GraphQL services into a unified graph

Event Sourcing in production:

Event sourcing has moved from niche to mainstream for financial systems and audit-heavy domains
Kafka is the de facto event store for large-scale systems (used at LinkedIn, Uber, Airbnb)
Axon Framework (Java) and EventStoreDB are purpose-built event sourcing platforms
Temporal.io and Restate provide durable execution built on event sourcing principles

The dbt revolution in analytics:

dbt (data build tool) has standardized SQL-based transformation of star/snowflake schemas
Analytical engineering as a discipline emerged from treating data transformations as code
dbt + Snowflake/BigQuery/Databricks is the dominant OLAP stack in 2026

Vector embeddings as a data model:

Semantic similarity search has become a first-class data access pattern (see Chapter 4)
pgvector, Pinecone, Weaviate, and Qdrant implement approximate nearest-neighbor search
LLM applications (RAG, semantic search) drive widespread adoption

Questions for Reflection

When is the document model a better fit than the relational model, and what are the warning signs that you’ve made the wrong choice?
How does the event sourcing pattern relate to the append-only nature of LSM-trees and Kafka? What shared principle do they embody?
Why is GraphQL described as a query language for APIs rather than databases? What would a “GraphQL for storage” look like?
In what scenarios does CQRS genuinely pay off versus just adding complexity? How would you explain this trade-off to a skeptical engineer?
What is the relationship between the star schema in data warehouses and denormalization? Why is denormalization desirable in OLAP but problematic in OLTP?
If you were designing a ride-sharing application (like Uber), which data models would you use for which parts of the system, and why?

ch04-storage-and-retrieval — How these data models are physically stored and indexed
ch05-encoding-and-evolution — How data encoded in events (event sourcing) must evolve over time
ch12-stream-processing — Kafka as the event backbone; stream-table duality
ch02-data-models-query-languages — 1st edition coverage for comparison
ch01-tradeoffs-data-systems — OLTP vs OLAP context for when to use which model

Last Updated: 2026-05-29

Study Notes by Niladri & AI

Explorer

ch03-data-models-query-languages

Chapter 3: Data Models and Query Languages

Overview

Key Concepts

The Object-Relational Mismatch

Normalization, Denormalization, and Joins

Many-to-One and Many-to-Many Relationships

Stars and Snowflakes: Schemas for Analytics

When to Use Which Model

Graph-Like Data Models

Property Graphs

The Cypher Query Language

Graph Queries in SQL

Triple Stores and SPARQL

Datalog: Recursive Relational Queries

GraphQL

Event Sourcing and CQRS

DataFrames, Matrices, and Arrays

Comparison Tables

Full Data Model Comparison

GraphQL vs REST vs gRPC

Important Points Summary

Modern Context (2026)

Questions for Reflection

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch03-data-models-query-languages

Chapter 3: Data Models and Query Languages

Overview

Key Concepts

The Object-Relational Mismatch

Normalization, Denormalization, and Joins

Many-to-One and Many-to-Many Relationships

Stars and Snowflakes: Schemas for Analytics

When to Use Which Model

Graph-Like Data Models

Property Graphs

The Cypher Query Language

Graph Queries in SQL

Triple Stores and SPARQL

Datalog: Recursive Relational Queries

GraphQL

Event Sourcing and CQRS

DataFrames, Matrices, and Arrays

Comparison Tables

Full Data Model Comparison

GraphQL vs REST vs gRPC

Important Points Summary

Modern Context (2026)

Questions for Reflection

Related Resources

Graph View

Table of Contents

Backlinks