Chapter 2: Data Models and Query Languages
Overview
Data models are the most fundamental aspect of software design—they shape how we think about the problems we solve. This chapter surveys the landscape of data models: relational, document, graph, and others, along with their corresponding query languages. The key insight is that each model makes certain things easy and others difficult, and the choice of data model has deep implications for what you can do efficiently.
Core Question: How should data be structured and queried? Relational tables? JSON documents? Graphs? The answer depends on the relationships in your data and your access patterns.
Key Concepts
Relational Model vs Document Model
Relational Model (SQL, Edgar Codd 1970):
- Data organized into relations (tables), each a collection of tuples (rows)
- Values in each row are atomic; rows in different tables related via foreign keys
- Strengths: joins, many-to-one, many-to-many relationships
- Query optimizer handles execution details automatically
Document Model (MongoDB, CouchDB, RethinkDB):
- Data stored as self-contained documents (JSON/BSON/XML)
- One-to-many relationships represented as nested structures
- Schema-on-read: structure interpreted at read time (no enforced schema)
- Strengths: locality (all related data in one document), no impedance mismatch
The Object-Relational Impedance Mismatch:
- Application code uses objects (OOP), but relational DB uses tables
- ORMs (Hibernate, ActiveRecord) reduce boilerplate but don’t fully solve it
- Document model maps more naturally to in-memory data structures
Normalization vs Denormalization:
- Normalized (relational): Eliminate duplication, use IDs, join on read
- IDs never need to change even if what they represent changes
- Removing duplication = many-to-one relationships (e.g., city stored as ID)
- Denormalized (document): Store data redundantly for fast reads
- Works well when data doesn’t need cross-document joins
Document Model Limitations
When Document Model Struggles:
- Many-to-many relationships (references between documents require application-level joins)
- Joins not natively supported (application must do multiple queries)
- Cannot refer to nested item within document without loading whole document
When Document Model Excels:
- Tree-like (one-to-many) structure matches data well
- Data locality: need to load whole document at once
- Schema flexibility (schema-on-read)
Historical parallel: Network model (CODASYL, 1970s) and hierarchical model had similar problems with many-to-many. Relational model solved this by introducing the query optimizer.
Query Languages
Declarative vs Imperative:
- Imperative: Tell the computer how to do something step-by-step (most programming languages)
- Declarative: Specify what you want, let the system figure out how (SQL, CSS)
- Declarative allows query optimizer to choose best execution strategy
- Declarative is often more concise and easier to parallelize
SQL (Structured Query Language):
- Declarative language for relational databases
- Query optimizer automatically chooses indexes, join order, etc.
- Can run queries in parallel without changes to query text
- Example:
SELECT * FROM animals WHERE family = 'Sharks'
MapReduce (MongoDB, Hadoop):
- Programming model for large-scale data processing across many machines
- Map: Extract key-value pairs from each document
- Reduce: Aggregate values with the same key
- Neither purely declarative nor purely imperative—somewhere in between
- MongoDB added aggregation pipeline as more declarative alternative
Graph-Based Query Languages:
- Cypher (Neo4j): Declarative, pattern-matching syntax
MATCH (person)-[:BORN_IN]->(city) RETURN person.name
- SPARQL: Query language for triple-stores (RDF)
- Datalog: Older language, foundation for Datomic
Graph Data Model
When to Use Graphs:
- Many-to-many relationships with complex interconnections
- Social networks, web graphs, road networks, recommendation engines
- When connections between data are as important as the data itself
Property Graph Model (Neo4j, Titan, InfiniteGraph):
- Vertices: Unique ID, set of outgoing/incoming edges, key-value properties
- Edges: Unique ID, tail vertex, head vertex, label, key-value properties
- Any vertex can have edge to any other vertex (no schema restriction)
- Can traverse graph efficiently in any direction
Triple-Store Model (AllegroGraph, Datomic, Apache Jena):
- All information stored as three-part statements: (subject, predicate, object)
- Example:
(Jim, likes, bananas) - Subject is a vertex; predicate is edge label or property; object is another vertex or value
- RDF (Resource Description Framework) standard format
Schema Flexibility
Schema-on-write (relational):
- Schema explicitly defined, DB enforces it on write
- Migration required to change schema (
ALTER TABLE) - Better for static, well-understood data structures
Schema-on-read (document):
- Schema implicit, interpreted when data is read
- No enforcement—different documents can have different fields
- Better when items don’t all have the same structure
When schema-on-read is advantageous:
- Many different types of objects, impractical to put each in own table
- Structure determined by external systems you can’t control
- Data is heterogeneous
Data Locality
Document locality:
- Document stored as continuous string (JSON, XML, BSON)
- Entire document loaded when needed (wasteful if only accessing part)
- Only beneficial if you need most of the document at once
Relational locality (Google Spanner, Oracle):
- Can declare tables to be interleaved (nested) within parent table
- Data locality without full document model
Column-family model (Cassandra, HBase):
- Locality achieved by grouping related columns together
Important Points
- Data model shapes what is easy vs hard: Choose the model that matches your data’s natural structure.
- Relational model solved the network model’s problem: Query optimizer replaced manual access path navigation.
- Document model is not always schema-less: Schema-on-read still has an implicit schema; it’s just not enforced.
- Joins are not just for relational DBs: Document databases increasingly support joins (MongoDB $lookup).
- Graph databases are for highly connected data: When traversing relationships is the primary operation.
- Declarative languages are more future-proof: Query optimizer can take advantage of new hardware/algorithms.
- Hybrid models are converging: PostgreSQL supports JSON, MongoDB supports joins—lines are blurring.
Examples & Case Studies
-
LinkedIn Profile (Document Model)
- Resume-like structure (one user, many positions/education/contacts)
- Represented as one-to-many tree, natural fit for JSON document
- References between users (connections) require joins → hybrid approach
-
Geographic Data (Graph Model)
- Road network: vertices = intersections, edges = roads
- Enables shortest-path queries (Dijkstra’s algorithm)
- Also used for: social graphs, knowledge graphs, fraud detection
-
Twitter (Relational + Document Hybrid)
- Users, tweets, follows stored in relational tables
- Tweet content with media/metadata fits document model
- Modern systems mix models freely
-
Facebook’s Social Graph
- 1 billion vertices (people), over 1 trillion edges (friendships, likes, etc.)
- Graph traversal: “Who are friends of my friends that I don’t know?”
- TAO (The Associations and Objects) system for graph queries at scale
Questions
- When would you choose a document database over a relational database?
- What are the trade-offs between schema-on-read and schema-on-write?
- How do you handle many-to-many relationships in a document database?
- When does the graph data model make sense over relational?
- Why is SQL declarative and why does that matter for performance?
- What is the impedance mismatch problem and how do you address it?
- How has the relational model adapted to incorporate document and graph features?
- What are the performance implications of data locality in document databases?
Modern Context (2026)
Multi-Model Databases:
- PostgreSQL with JSONB: relational + document in one system
- MongoDB 5+ with Atlas: document + time series + search
- TigerGraph, Dgraph: graph + document hybrids
- “Polyglot persistence” → increasingly replaced by multi-model DBs
NewSQL and HTAP:
- TiDB, CockroachDB: relational model with horizontal scale
- HTAP (Hybrid Transactional/Analytical Processing): serve OLTP and OLAP from single store
- Google AlloyDB, Neon: cloud-native PostgreSQL variants
AI and Vector Databases:
- New data model for 2026: vector embeddings for ML/AI workloads
- pgvector (PostgreSQL extension), Pinecone, Weaviate, Qdrant
- Similarity search replaces exact-match queries
- LLM applications drive demand for vector + keyword hybrid search
GraphQL Layer:
- Not a database, but a query language for APIs
- Decouples client data needs from server data model
- Widely adopted in 2026 for flexible client-driven queries
Graph ML:
- Graph Neural Networks (GNNs) for fraud detection, recommendation
- Property graph model increasingly central to AI/ML pipelines
Status: Notes complete
Last Updated: 2026-04-13