Chapter 3: Data Models and Query Languages

ddia-2e data-models query-languages relational document graph graphql event-sourcing cqrs

Status: Notes complete


Overview

Data models are the most fundamental architectural decision in any software system — they determine what is easy to express, what is hard, and what is impossible. This chapter surveys the full landscape of data models as of 2026: relational, document, graph, GraphQL, event sourcing (CQRS), and analytical schemas (star/snowflake). The central insight is that each model optimizes for a different shape of data and a different access pattern, and matching model to workload is more important than any other database choice. The 2nd edition significantly expands on the 1st by adding GraphQL, Event Sourcing/CQRS, DataFrames, and analytical star/snowflake schemas, reflecting how diverse modern data pipelines have become.


Key Concepts

The Object-Relational Mismatch

Relational databases (SQL, Edgar Codd 1970) organize data into relations (tables) of tuples (rows). All values in a row are atomic; relationships between tables are encoded as foreign keys joined at query time.

The impedance mismatch: Application code uses objects — classes with nested collections, inheritance, and polymorphism. Relational tables are flat. Converting between the two requires either:

  • ORM frameworks (Hibernate, ActiveRecord, SQLAlchemy) — reduce boilerplate but cannot eliminate the conceptual gap
  • Manual mapping — explicit, verbose, but gives full control
  • JSONB columns in PostgreSQL — store nested data inside a relational row, a hybrid approach

The mismatch matters most for one-to-many relationships: a user with many resume positions is a natural JSON array, but must be split across two tables with a foreign key in the relational model.

-- Relational: two tables required
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);
CREATE TABLE positions (id INT, user_id INT REFERENCES users(id), title TEXT, company TEXT);
 
-- Document: natural nesting
{ "id": 1, "name": "Alice", "positions": [
    { "title": "Engineer", "company": "Acme" },
    { "title": "Lead", "company": "Globex" }
  ]
}

Normalization, Denormalization, and Joins

Normalization eliminates duplication by storing each fact once and referencing it by ID. In a normalized schema, a city name is stored in a cities table; all records reference the city by ID. Benefits:

  • Updates in one place (city renames don’t require scanning every row)
  • IDs are machine-friendly and never change even if the human-readable value changes
  • No risk of inconsistent copies of the same fact

Denormalization stores redundant copies of data to avoid joins at read time. A denormalized row stores the city name directly. Benefits:

  • Faster reads (no join needed)
  • Simpler queries

Trade-off rule of thumb: Normalize first for correctness and maintainability. Denormalize only after you have measured a join bottleneck, and be prepared to handle cache-invalidation complexity.

Joins in document databases: MongoDB, CouchDB, and similar databases do not support server-side joins natively (or only support limited forms). When a document references another document, the application must issue multiple queries and join in application code — exactly the problem the relational model solved in 1970. Modern MongoDB supports $lookup for aggregation joins, but it is not as flexible or performant as SQL JOIN.

Many-to-One and Many-to-Many Relationships

Many-to-one: Many records refer to a single canonical value (e.g., many users live in the same city). Naturally expressed as a foreign key in relational; awkward in document model (requires app-level join).

Many-to-many: Many records relate to many other records (e.g., users and skills, authors and books). This is where relational databases shine and document databases struggle most. The document model forces a choice between:

  1. Embedding (denormalize): duplicate the related data in every document — update complexity grows
  2. Referencing (normalize): store only the ID — requires app-level join across multiple queries

Historical context: The network model (CODASYL, 1970s) and hierarchical model (IMS) both failed to handle many-to-many cleanly. The relational model solved this by introducing the query optimizer — programmers no longer needed to specify access paths manually. Document databases partially recreate the pre-relational problem for data with complex relationships.

Stars and Snowflakes: Schemas for Analytics

OLAP (Online Analytical Processing) databases use dimensional schemas designed for fast aggregate queries over large datasets. Two main patterns:

Star schema: A central fact table surrounded by dimension tables.

  • Fact table: Records individual events (one row per sale, click, order). Very wide (dozens of columns), very tall (billions of rows). Contains foreign keys to dimensions and numeric measures.
  • Dimension tables: Describe the who/what/when/where of an event (customer, product, date, store). Smaller, human-readable.
        dim_customer ←── fact_sales ──→ dim_product
                              │
                         dim_date
                              │
                         dim_store

Snowflake schema: Dimension tables are further normalized into sub-dimensions (e.g., dim_product references a separate dim_category table). More normalized but requires more joins; star schema is preferred in most OLAP tools because column-oriented storage makes joins relatively cheap.

Why star/snowflake matter: Most BI tools (Tableau, Looker, dbt) generate SQL that assumes a star schema. Understanding this structure helps in designing data warehouses and writing efficient analytical queries.

When to Use Which Model

CriterionRelationalDocumentGraph
Data shapeTabular, uniform rowsTree/nested, heterogeneousHighly interconnected
Primary operationJoins, aggregatesLoad whole documentTraversal
Many-to-manyNative (JOIN)Difficult (app join)Native (traversal)
Schema flexibilityLow (schema-on-write)High (schema-on-read)Medium
Consistency guaranteesStrong (ACID)DependsDepends
Best fitERP, CRM, bankingCMS, catalog, user profilesSocial graphs, fraud detection, knowledge graphs
ExamplesPostgreSQL, MySQL, OracleMongoDB, CouchDB, DynamoDBNeo4j, Amazon Neptune, Dgraph

Decision heuristic:

  • If your data is mostly tree-shaped (one-to-many hierarchies with few cross-references) → Document
  • If your data has complex many-to-many relationships or you need powerful aggregation → Relational
  • If traversing relationships is the primary query (find all nodes within N hops) → Graph
  • If reading and writing events is primary, with state derived from history → Event Sourcing

Graph-Like Data Models

Property Graphs

The property graph model is the most widely deployed graph model (Neo4j, Amazon Neptune, TigerGraph). Two building blocks:

Vertices (nodes):

  • Unique identifier
  • Set of outgoing edges
  • Set of incoming edges
  • Collection of key-value properties

Edges:

  • Unique identifier
  • Tail vertex (start of edge)
  • Head vertex (end of edge)
  • Label describing the relationship type
  • Collection of key-value properties

Key design properties:

  1. Any vertex can have an edge to any other vertex — no schema enforced on connections
  2. Edges can carry rich metadata (properties) alongside the label
  3. The same data can be traversed in both directions efficiently
  4. Multiple different kinds of information can be stored in one graph (heterogeneous schema)
(Alice)-[:BORN_IN {year: 1985}]->(France)
(Alice)-[:LIVES_IN]->(London)
(Alice)-[:WORKS_AT {since: 2020}]->(Acme Corp)
(Bob)-[:KNOWS {since: 2015}]->(Alice)

The power of this model: the same query language (Cypher) handles social relationships, geographic data, organizational hierarchies, and knowledge graphs — all coexisting in one graph.

The Cypher Query Language

Cypher is Neo4j’s declarative, pattern-matching query language for property graphs. It uses ASCII-art notation to express graph patterns.

-- Find people born in France who live in the UK
MATCH (person:Person)-[:BORN_IN]->(country:Country {name: 'France'}),
      (person)-[:LIVES_IN]->(city:City)-[:LOCATED_IN]->(uk:Country {name: 'UK'})
RETURN person.name, city.name
 
-- Find all mutual friends of Alice and Bob
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(mutual)<-[:KNOWS]-(bob:Person {name: 'Bob'})
RETURN mutual.name
 
-- Create a node and relationship
CREATE (charlie:Person {name: 'Charlie', age: 30})
CREATE (charlie)-[:KNOWS {since: 2024}]->(alice)

Cypher vs SQL for graph queries: Consider finding all people who were born in the US and emigrated to Europe. In SQL this requires a self-join on a location table with recursive CTEs or multiple joins. In Cypher:

MATCH (p:Person)-[:BORN_IN]->(:Country {continent: 'North America'}),
      (p)-[:LIVES_IN]->(:Country {continent: 'Europe'})
RETURN p.name

The Cypher version is not only more concise — it is also more readable and maps directly to the conceptual model.

Graph Queries in SQL

Representing a graph in SQL is possible but awkward. You need two tables — one for vertices and one for edges — and then use recursive CTEs for traversal:

-- Find all people reachable from person_id = 1 within 3 hops
WITH RECURSIVE reachable(person_id, depth) AS (
  SELECT 1, 0
  UNION ALL
  SELECT e.head_vertex, r.depth + 1
  FROM edges e
  JOIN reachable r ON e.tail_vertex = r.person_id
  WHERE r.depth < 3
)
SELECT DISTINCT person_id FROM reachable;

SQL:1999 added recursive CTEs for exactly this purpose. However, recursive SQL queries are verbose, hard to optimize, and much less ergonomic than Cypher. This is why specialized graph databases exist.

Triple Stores and SPARQL

The triple-store model represents all information as three-part statements: (subject, predicate, object):

  • (Jim, age, 33) → property fact (object is a value)
  • (Jim, bornIn, Idaho) → relationship (object is a vertex/entity)
  • (Idaho, type, USState) → type assertion

RDF (Resource Description Framework) is the W3C standard format for triple stores. All identifiers are URIs (universally resolvable), enabling linked data across the web:

@prefix schema: <http://schema.org/> .
<http://example.com/people/jim> schema:name "Jim" .
<http://example.com/people/jim> schema:birthPlace <http://example.com/places/idaho> .

SPARQL is the query language for RDF triple stores:

PREFIX schema: <http://schema.org/>
SELECT ?person ?birthplace
WHERE {
  ?person schema:birthPlace ?birthplace .
  ?birthplace a schema:State .
}

Triple stores are used in knowledge graphs (Google’s Knowledge Graph, Wikidata), biomedical databases, and enterprise linked data projects.

Datalog: Recursive Relational Queries

Datalog is the oldest graph-capable query language (1970s), and the inspiration for modern graph query systems. It defines facts and rules recursively:

% Facts
born_in(alice, france).
within(france, europe).
within(europe, western_region).
 
% Rule: recursive ancestry
within_region(X, Y) :- within(X, Y).
within_region(X, Y) :- within(X, Z), within_region(Z, Y).
 
% Query: who was born in a place within Europe?
born_in_region(Person, Region) :-
  born_in(Person, Place),
  within_region(Place, Region).

Datalog is the foundation for Datomic (the database created by Rich Hickey at Cognitect, now acquired by Nubank) and influenced Cascalog (Clojure-based dataflow). Its recursive rules make complex graph traversals compact and composable.

GraphQL

GraphQL is a query language for APIs (not databases directly), created by Facebook in 2012 and open-sourced in 2015. It allows clients to request exactly the data they need — no more, no less.

The REST problem GraphQL solves:

  • Over-fetching: REST endpoints return fixed payloads; the client gets fields it doesn’t need
  • Under-fetching: Getting a user’s posts requires multiple round trips (/user/1, then /user/1/posts)
  • Tight coupling: API changes require versioning; clients break when fields are added/removed

GraphQL vs REST vs RPC:

FeatureRESTGraphQLgRPC/RPC
Query flexibilityFixed endpointsClient-defined queriesFixed methods
Over-fetchingCommonEliminatedN/A (typed)
IntrospectionNo standardBuilt-in schema.proto file
Network efficiencyMultiple round tripsSingle requestStreaming
Type systemOptional (OpenAPI)Required (SDL)Required (.proto)
Browser supportNativeNative (HTTP)Limited (grpc-web)
Best forPublic APIs, CRUDData-rich UIs, mobileInternal microservices
# GraphQL schema definition
type User {
  id: ID!
  name: String!
  posts: [Post!]!
}
 
type Post {
  id: ID!
  title: String!
  body: String!
  author: User!
}
 
# Client query (client decides exactly what it needs)
query {
  user(id: "42") {
    name
    posts {
      title
    }
  }
}

Important: GraphQL is a query language for APIs, not a database. A GraphQL server resolves queries against any backend — relational, document, graph, or microservices. It sits at the API layer, not the storage layer.

Subscriptions: GraphQL also supports real-time subscriptions, making it useful for live dashboards and collaborative apps.

Event Sourcing and CQRS

Event Sourcing is a data modeling pattern where instead of storing the current state of an entity, you store the full sequence of events that led to that state. The current state is derived by replaying events.

Banking example (the canonical illustration):

  • Traditional approach: Store account.balance = 1000
  • Event sourcing approach: Store the event log:
    AccountOpened  { account_id: 1, initial_balance: 0,    timestamp: 2024-01-01 }
    MoneyDeposited { account_id: 1, amount: 1500,           timestamp: 2024-01-10 }
    MoneyWithdrawn { account_id: 1, amount: 500,            timestamp: 2024-01-15 }
    
    Current balance = replay events: 0 + 1500 - 500 = 1000

Why event sourcing?

  • Complete audit log: You can answer “what was the balance on Jan 12?” (impossible with mutable state)
  • Temporal queries: Replay to any point in time
  • Debugging: Reproduce exact system state that caused a bug
  • Multiple projections: Same event stream produces different read models (balance view, transaction history view, compliance report)
  • Decoupling: Downstream systems subscribe to the event stream without coupling to the write model

CQRS (Command Query Responsibility Segregation) is the architectural pattern that naturally pairs with event sourcing. It separates:

  • Command side (writes): Validates commands, emits events to the event store
  • Query side (reads): Maintains optimized read models (projections) by consuming the event stream
User Request
     │
     ▼
┌────────────┐     event      ┌──────────────┐
│  Command   │ ─────────────▶ │  Event Store │
│  Handler   │                │  (append-    │
│  (validate)│                │   only log)  │
└────────────┘                └──────┬───────┘
                                     │ stream
                              ┌──────▼───────┐
                              │  Projector   │
                              │  (consumers) │
                              └──────┬───────┘
                                     │ writes
                              ┌──────▼───────┐
                              │  Read Model  │ ◀── Query Side
                              │  (optimized  │     (fast reads)
                              │   for reads) │
                              └──────────────┘

E-commerce example:

  • Commands: PlaceOrder, CancelOrder, ShipOrder
  • Events: OrderPlaced, OrderCancelled, OrderShipped
  • Read projections: “pending orders” view, “fulfillment dashboard”, “revenue report” — each optimized independently

Trade-offs of Event Sourcing / CQRS:

AspectBenefitCost
Audit trailComplete, immutable historyStorage grows unboundedly
Temporal queriesEasy time-travelMust replay many events
Read performanceProjections optimized per use caseEventual consistency between write and read
Schema evolutionEvents are immutable (versioning needed)Upcasting/event migration is complex
ComplexityClear separation of concernsSignificantly more moving parts than CRUD

When to use Event Sourcing:

  • Audit requirements are strict (finance, healthcare, legal)
  • Temporal queries are frequent (“what was the state at time T?”)
  • Multiple heterogeneous consumers need the same data in different shapes
  • Undo/redo is a first-class requirement

When NOT to use Event Sourcing:

  • Simple CRUD applications without audit requirements
  • Small teams unfamiliar with the pattern (high learning curve)
  • When consistency requirements are strict and eventual consistency is unacceptable

Kafka as an event store: Apache Kafka is the most widely deployed event log infrastructure. A Kafka topic with a long retention period functions as an event store; consumers maintain their own projections (materialized views). This is the foundation of modern data mesh and streaming data platforms.

DataFrames, Matrices, and Arrays

Beyond row-oriented and column-oriented relational models, data-intensive applications increasingly work with:

DataFrames: A tabular data structure with named columns and typed data, popularized by pandas (Python) and Apache Spark. DataFrames combine:

  • Tabular structure (like a relational table)
  • In-memory columnar layout (like column-oriented storage)
  • Rich transformation API (map, filter, join, groupBy, window functions)
  • Lazy evaluation in distributed systems (Spark, Dask)

DataFrames are the primary interface for data engineering, ML feature engineering, and exploratory data analysis. The DataFrame API in Spark is translated into a physical query plan — functionally equivalent to SQL but expressed in a programming language.

# Spark DataFrame — same as SQL but programmatic
df = spark.read.parquet("s3://bucket/events/")
result = (df
  .filter(df.event_type == "purchase")
  .groupBy("customer_id")
  .agg(sum("amount").alias("total_spend"))
  .orderBy("total_spend", ascending=False)
  .limit(100)
)

Matrices and arrays: Scientific computing and ML work with n-dimensional arrays (tensors):

  • NumPy arrays: Foundation of Python numerical computing
  • Tensors in PyTorch/TensorFlow: Multi-dimensional arrays with GPU acceleration
  • Zarr, HDF5, NetCDF: Array storage formats for large scientific datasets (climate data, genomics)

Apache Arrow: A columnar in-memory format and IPC protocol that allows zero-copy data sharing between systems (pandas, Spark, DuckDB, Polars). Arrow is becoming the universal exchange format for analytical data.


Comparison Tables

Full Data Model Comparison

DimensionRelationalDocumentGraphEvent SourcingAnalytical (Star)
Core abstractionTables/rowsJSON documentsVertices/edgesImmutable eventsFact + dimension tables
RelationshipsForeign keys + JOINEmbedding or referenceEdges (first-class)Derived from eventsForeign keys (denorm)
SchemaStrict, on-writeFlexible, on-readFlexibleImmutable eventsStrict
Primary querySELECT/JOIN/GROUP BYFind/project documentTraverse graphReplay/projectAggregate (SUM, COUNT)
Write patternUPDATE in placeReplace documentAdd vertices/edgesAppend-onlyBulk insert/ETL
Temporal queriesDifficultDifficultDifficultNative (replay)Slowly Changing Dims
Audit trailRequires extra workRequires extra workRequires extra workBuilt-inSnapshot-based
ScaleVertical → shardedHorizontalVertical (mostly)Horizontal (Kafka)Massively parallel
ExamplesPostgreSQL, MySQLMongoDB, FirestoreNeo4j, NeptuneKafka+Flink, AxonSnowflake, BigQuery

GraphQL vs REST vs gRPC

FeatureRESTGraphQLgRPC
TransportHTTP/1.1, 2HTTP/1.1, 2HTTP/2
Data formatJSON (typically)JSONProtocol Buffers
SchemaOptional (OpenAPI)Required (SDL)Required (.proto)
VersioningURL versioningSchema evolution.proto evolution
Over-fetchingYes (common)No (client selects)No (typed methods)
CachingHTTP cache nativeComplex (POST)No HTTP cache
Real-timeWebSocket (add-on)Subscriptions nativeServer streaming native
BrowserNativeNativeLimited (grpc-web)
Learning curveLowMediumMedium-High
Best usePublic CRUD APIsRich UIs, mobile appsInternal microservices

Important Points Summary

  • Data model = what is easy vs what is hard: The most consequential architectural decision; changing models is expensive.
  • Relational solved CODASYL’s problem: The query optimizer replaced manual access path navigation — a lesson that declarative beats imperative for data access.
  • Document model is not schema-less: Schema-on-read moves enforcement to the application layer; it’s a trade-off, not elimination.
  • Graph databases are justified by traversal: Only reach for a graph DB when the relationships themselves are the primary query object.
  • GraphQL is at the API layer, not the storage layer: It solves over-fetching and under-fetching but does not replace your database.
  • Event Sourcing provides temporal queries and audit trails: At the cost of eventual consistency and significant operational complexity.
  • CQRS separates write and read models: Write models optimize for correctness; read projections optimize for query performance.
  • Star schemas are the language of data warehouses: BI tools assume dimensional modeling; understanding it is essential for analytics engineering.
  • DataFrames are converging with SQL: Spark, Polars, DuckDB blur the line between programmatic and declarative data transformation.
  • Apache Arrow is the universal columnar format: Zero-copy interoperability between pandas, Spark, DuckDB, and Polars in the modern data stack.

Modern Context (2026)

Multi-model databases have become the default:

  • PostgreSQL handles relational, JSONB (document), PostGIS (geospatial), pgvector (vector), and graph queries (recursive CTEs) in a single system
  • MongoDB Atlas adds time series, full-text search, and vector search alongside document storage
  • “Polyglot persistence” (separate DB per model) is being replaced by multi-model databases for most use cases

GraphQL proliferation:

  • GraphQL is now the dominant API style for internal data APIs and BFF (Backend for Frontend) patterns
  • Hasura, PostGraphile, and Apollo Federation auto-generate GraphQL APIs from relational schemas
  • Federation allows stitching multiple GraphQL services into a unified graph

Event Sourcing in production:

  • Event sourcing has moved from niche to mainstream for financial systems and audit-heavy domains
  • Kafka is the de facto event store for large-scale systems (used at LinkedIn, Uber, Airbnb)
  • Axon Framework (Java) and EventStoreDB are purpose-built event sourcing platforms
  • Temporal.io and Restate provide durable execution built on event sourcing principles

The dbt revolution in analytics:

  • dbt (data build tool) has standardized SQL-based transformation of star/snowflake schemas
  • Analytical engineering as a discipline emerged from treating data transformations as code
  • dbt + Snowflake/BigQuery/Databricks is the dominant OLAP stack in 2026

Vector embeddings as a data model:

  • Semantic similarity search has become a first-class data access pattern (see Chapter 4)
  • pgvector, Pinecone, Weaviate, and Qdrant implement approximate nearest-neighbor search
  • LLM applications (RAG, semantic search) drive widespread adoption

Questions for Reflection

  1. When is the document model a better fit than the relational model, and what are the warning signs that you’ve made the wrong choice?
  2. How does the event sourcing pattern relate to the append-only nature of LSM-trees and Kafka? What shared principle do they embody?
  3. Why is GraphQL described as a query language for APIs rather than databases? What would a “GraphQL for storage” look like?
  4. In what scenarios does CQRS genuinely pay off versus just adding complexity? How would you explain this trade-off to a skeptical engineer?
  5. What is the relationship between the star schema in data warehouses and denormalization? Why is denormalization desirable in OLAP but problematic in OLTP?
  6. If you were designing a ride-sharing application (like Uber), which data models would you use for which parts of the system, and why?

Last Updated: 2026-05-29