Chapter 2: Data Models and Query Languages

Overview

Data models are the most fundamental aspect of software design—they shape how we think about the problems we solve. This chapter surveys the landscape of data models: relational, document, graph, and others, along with their corresponding query languages. The key insight is that each model makes certain things easy and others difficult, and the choice of data model has deep implications for what you can do efficiently.

Core Question: How should data be structured and queried? Relational tables? JSON documents? Graphs? The answer depends on the relationships in your data and your access patterns.

Key Concepts

Relational Model vs Document Model

Relational Model (SQL, Edgar Codd 1970):

Data organized into relations (tables), each a collection of tuples (rows)
Values in each row are atomic; rows in different tables related via foreign keys
Strengths: joins, many-to-one, many-to-many relationships
Query optimizer handles execution details automatically

Document Model (MongoDB, CouchDB, RethinkDB):

Data stored as self-contained documents (JSON/BSON/XML)
One-to-many relationships represented as nested structures
Schema-on-read: structure interpreted at read time (no enforced schema)
Strengths: locality (all related data in one document), no impedance mismatch

The Object-Relational Impedance Mismatch:

Application code uses objects (OOP), but relational DB uses tables
ORMs (Hibernate, ActiveRecord) reduce boilerplate but don’t fully solve it
Document model maps more naturally to in-memory data structures

Normalization vs Denormalization:

Normalized (relational): Eliminate duplication, use IDs, join on read
- IDs never need to change even if what they represent changes
- Removing duplication = many-to-one relationships (e.g., city stored as ID)
Denormalized (document): Store data redundantly for fast reads
- Works well when data doesn’t need cross-document joins

Document Model Limitations

When Document Model Struggles:

Many-to-many relationships (references between documents require application-level joins)
Joins not natively supported (application must do multiple queries)
Cannot refer to nested item within document without loading whole document

When Document Model Excels:

Tree-like (one-to-many) structure matches data well
Data locality: need to load whole document at once
Schema flexibility (schema-on-read)

Historical parallel: Network model (CODASYL, 1970s) and hierarchical model had similar problems with many-to-many. Relational model solved this by introducing the query optimizer.

Query Languages

Declarative vs Imperative:

Imperative: Tell the computer how to do something step-by-step (most programming languages)
Declarative: Specify what you want, let the system figure out how (SQL, CSS)
Declarative allows query optimizer to choose best execution strategy
Declarative is often more concise and easier to parallelize

SQL (Structured Query Language):

Declarative language for relational databases
Query optimizer automatically chooses indexes, join order, etc.
Can run queries in parallel without changes to query text
Example: SELECT * FROM animals WHERE family = 'Sharks'

MapReduce (MongoDB, Hadoop):

Programming model for large-scale data processing across many machines
Map: Extract key-value pairs from each document
Reduce: Aggregate values with the same key
Neither purely declarative nor purely imperative—somewhere in between
MongoDB added aggregation pipeline as more declarative alternative

Graph-Based Query Languages:

Cypher (Neo4j): Declarative, pattern-matching syntax
- MATCH (person)-[:BORN_IN]->(city) RETURN person.name
SPARQL: Query language for triple-stores (RDF)
Datalog: Older language, foundation for Datomic

Graph Data Model

When to Use Graphs:

Many-to-many relationships with complex interconnections
Social networks, web graphs, road networks, recommendation engines
When connections between data are as important as the data itself

Property Graph Model (Neo4j, Titan, InfiniteGraph):

Vertices: Unique ID, set of outgoing/incoming edges, key-value properties
Edges: Unique ID, tail vertex, head vertex, label, key-value properties
Any vertex can have edge to any other vertex (no schema restriction)
Can traverse graph efficiently in any direction

Triple-Store Model (AllegroGraph, Datomic, Apache Jena):

All information stored as three-part statements: (subject, predicate, object)
Example: (Jim, likes, bananas)
Subject is a vertex; predicate is edge label or property; object is another vertex or value
RDF (Resource Description Framework) standard format

Schema Flexibility

Schema-on-write (relational):

Schema explicitly defined, DB enforces it on write
Migration required to change schema (ALTER TABLE)
Better for static, well-understood data structures

Schema-on-read (document):

Schema implicit, interpreted when data is read
No enforcement—different documents can have different fields
Better when items don’t all have the same structure

When schema-on-read is advantageous:

Many different types of objects, impractical to put each in own table
Structure determined by external systems you can’t control
Data is heterogeneous

Data Locality

Document locality:

Document stored as continuous string (JSON, XML, BSON)
Entire document loaded when needed (wasteful if only accessing part)
Only beneficial if you need most of the document at once

Relational locality (Google Spanner, Oracle):

Can declare tables to be interleaved (nested) within parent table
Data locality without full document model

Column-family model (Cassandra, HBase):

Locality achieved by grouping related columns together

Important Points

Data model shapes what is easy vs hard: Choose the model that matches your data’s natural structure.
Relational model solved the network model’s problem: Query optimizer replaced manual access path navigation.
Document model is not always schema-less: Schema-on-read still has an implicit schema; it’s just not enforced.
Joins are not just for relational DBs: Document databases increasingly support joins (MongoDB $lookup).
Graph databases are for highly connected data: When traversing relationships is the primary operation.
Declarative languages are more future-proof: Query optimizer can take advantage of new hardware/algorithms.
Hybrid models are converging: PostgreSQL supports JSON, MongoDB supports joins—lines are blurring.

Examples & Case Studies

LinkedIn Profile (Document Model)
- Resume-like structure (one user, many positions/education/contacts)
- Represented as one-to-many tree, natural fit for JSON document
- References between users (connections) require joins → hybrid approach
Geographic Data (Graph Model)
- Road network: vertices = intersections, edges = roads
- Enables shortest-path queries (Dijkstra’s algorithm)
- Also used for: social graphs, knowledge graphs, fraud detection
Twitter (Relational + Document Hybrid)
- Users, tweets, follows stored in relational tables
- Tweet content with media/metadata fits document model
- Modern systems mix models freely
Facebook’s Social Graph
- 1 billion vertices (people), over 1 trillion edges (friendships, likes, etc.)
- Graph traversal: “Who are friends of my friends that I don’t know?”
- TAO (The Associations and Objects) system for graph queries at scale

Questions

When would you choose a document database over a relational database?
What are the trade-offs between schema-on-read and schema-on-write?
How do you handle many-to-many relationships in a document database?
When does the graph data model make sense over relational?
Why is SQL declarative and why does that matter for performance?
What is the impedance mismatch problem and how do you address it?
How has the relational model adapted to incorporate document and graph features?
What are the performance implications of data locality in document databases?

Modern Context (2026)

Multi-Model Databases:

PostgreSQL with JSONB: relational + document in one system
MongoDB 5+ with Atlas: document + time series + search
TigerGraph, Dgraph: graph + document hybrids
“Polyglot persistence” → increasingly replaced by multi-model DBs

NewSQL and HTAP:

TiDB, CockroachDB: relational model with horizontal scale
HTAP (Hybrid Transactional/Analytical Processing): serve OLTP and OLAP from single store
Google AlloyDB, Neon: cloud-native PostgreSQL variants

AI and Vector Databases:

New data model for 2026: vector embeddings for ML/AI workloads
pgvector (PostgreSQL extension), Pinecone, Weaviate, Qdrant
Similarity search replaces exact-match queries
LLM applications drive demand for vector + keyword hybrid search

GraphQL Layer:

Not a database, but a query language for APIs
Decouples client data needs from server data model
Widely adopted in 2026 for flexible client-driven queries

Graph ML:

Graph Neural Networks (GNNs) for fraud detection, recommendation
Property graph model increasingly central to AI/ML pipelines

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch02-data-models-query-languages