Chapter 2 Flashcards - Data Models and Query Languages
Basic Concepts
What is the object-relational impedance mismatch?
?
- Definition: The disconnect between how OOP languages represent data (objects) and how relational databases store it (tables/rows)
- Application code uses objects with nested structures; DB uses flat tables
- ORMs (Hibernate, ActiveRecord) reduce boilerplate but don’t fully solve it
- Document model: Maps more naturally to in-memory objects, reducing mismatch
What are the three main data models covered in DDIA Chapter 2?
?
- Relational model: Data in tables, rows related by foreign keys, powerful joins
- Document model: Self-contained JSON/BSON documents, tree-shaped, schema-on-read
- Graph model: Vertices + edges, designed for highly connected many-to-many data
What is schema-on-read vs schema-on-write?
?
- Schema-on-write (relational): Schema enforced by DB at write time;
ALTER TABLEto change- Like static type checking: errors caught early
- Schema-on-read (document): Schema implicit, interpreted at read time; app handles variation
- Like dynamic type checking: flexible but bugs surface later
- Neither is universally better — depends on how uniform and stable your data is
What is data locality and when does it matter?
?
- Definition: Storing related data physically close together to improve read performance
- Document locality: Entire document loaded at once (good if you need all fields, wasteful otherwise)
- When it matters: Performance-sensitive reads that need most of a document’s fields
- Relational alternative: Interleaved tables (Google Spanner), column-family stores
What is the difference between declarative and imperative query languages?
?
- Imperative: Tells computer how to compute step-by-step (most programming languages)
- Specifies exact execution path
- Hard to optimize or parallelize
- Declarative: Specifies what result you want (SQL, CSS, Cypher)
- Query optimizer figures out execution
- Can take advantage of new indexes, parallelism, without query changes
- SQL’s declarative nature is a key reason for its longevity and performance
Relational Model
What problem did the relational model solve that the network and hierarchical models couldn’t?
?
- Hierarchical model (1960s IMS): Tree structure; many-to-many required duplication
- Network model (CODASYL): Linked lists of records; required programmers to track access paths manually
- Relational model (Codd, 1970): Query optimizer handles access paths automatically
- Key innovation: declarative query language (SQL) + query optimizer = programmers don’t need to think about physical data layout
Why is normalization preferred in relational databases?
?
- Removes duplication: Each fact stored once; updates in one place
- IDs never change: Even if human-readable value changes (city name), ID stays stable
- Consistency: No risk of inconsistent copies of same data
- Trade-off: Requires joins at read time (slower reads, simpler writes)
- Many-to-one: Normalized data uses IDs = many records point to one canonical value
Document Model
When should you choose a document model over relational?
?
- Data has a tree-like (one-to-many) structure (e.g., resume with positions/education)
- You need schema flexibility for heterogeneous data
- Data locality needed: load entire document at once for performance
- No significant many-to-many relationships requiring joins
- External systems control data shape (you can’t enforce schema anyway)
Avoid document model when: data has many cross-document references or complex joins
What is the key limitation of document databases for many-to-many relationships?
?
- Document DBs don’t support joins natively (or joins are limited/slow)
- Application must do multiple queries and join data in code
- Denormalization (storing data in multiple documents) causes duplication and update headaches
- Historical parallel: Same problem as pre-relational hierarchical/network models
- Modern solution: MongoDB added
$lookup(aggregation join), but still not as powerful as SQL JOINs
Graph Model
What two things does a property graph model consist of?
?
Vertices (nodes) have:
- Unique identifier
- Set of outgoing edges
- Set of incoming edges
- Collection of key-value properties
Edges have:
- Unique identifier
- Tail vertex (start)
- Head vertex (end)
- Label describing the relationship
- Collection of key-value properties
Any vertex can have an edge to any other vertex — no schema restriction on connections
What is a triple-store and how does it represent data?
?
- Definition: All information stored as three-part statements: (subject, predicate, object)
- Subject: Vertex (the entity being described)
- Predicate: Edge label or property key
- Object: Another vertex (a relationship) OR a value (a property)
- Examples:
(Jim, age, 33)→ property(Jim, bornIn, Idaho)→ relationship
- Standards: RDF (Resource Description Framework), SPARQL query language
- Used by: AllegroGraph, Apache Jena, Datomic
When should you choose a graph database?
?
- Data has many-to-many relationships with complex interconnections
- Traversal is a primary operation (“find all friends within 3 hops”)
- Relationships are as important as the data itself
- Use cases: Social networks, recommendation engines, fraud detection, knowledge graphs, road networks, access control graphs
- Not worth it when: data is mostly tree-shaped or has few relationships
Query Languages
How does Cypher (Neo4j’s query language) express a graph pattern?
?
- Cypher is a declarative pattern-matching language for graphs
- Uses ASCII art to express graph patterns
- Example: Find all people born in the US who emigrated to Europe:
MATCH (person)-[:BORN_IN]->(us:Location {name:'United States'}),
(person)-[:LIVES_IN]->(europe:Location {continent:'Europe'})
RETURN person.name(person)= vertex,-[:BORN_IN]->= directed edge with label- Much more concise than equivalent SQL with self-joins
What is MapReduce and how does it relate to declarative/imperative querying?
?
- MapReduce: Programming model for large-scale distributed data processing
- Map function: Called for each document; emits key-value pairs
- Reduce function: Aggregates values with the same key
- Position: Neither purely declarative nor imperative — hybrid
- Advantage: Can use full programming language for complex logic
- Disadvantage: Verbose, harder to optimize than pure declarative
- MongoDB added aggregation pipeline as more declarative alternative to MapReduce
Trade-offs and Design Decisions
What are the trade-offs between denormalization and normalization?
?
Normalization:
- ✅ Single source of truth, easy to update
- ✅ No duplicate data
- ❌ Joins required at read time
- Use when: data changes, consistency matters
Denormalization:
- ✅ Faster reads, no joins
- ❌ Duplicate data, complex updates
- ❌ Risk of inconsistency
- Use when: read-heavy, data rarely changes, performance critical
Rule of thumb: Normalize first, denormalize only when you have measured a performance problem
Why do many applications today use a combination of data models?
?
- Different data has different shapes: User profiles (document) + transaction history (relational) + social graph (graph)
- Polyglot persistence: Use the right tool for each type of data
- Modern convergence: PostgreSQL supports JSONB; MongoDB supports joins
- Practical approach: Start with one model, add others as needs emerge
- Cost: Operating multiple databases adds complexity; multi-model DBs (PostgreSQL) reduce this
Modern Context (2026)
How have vector databases introduced a fourth major data model?
?
- Vector/embedding model (emerged 2020s, mainstream 2024-2026)
- Data: High-dimensional floating-point vectors (embeddings from ML models)
- Query: Similarity search (find nearest neighbors by cosine distance / dot product)
- Use cases: Semantic search, recommendation, image similarity, RAG (retrieval-augmented generation)
- Products: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension), Chroma
- Key difference: No exact-match queries — “most similar” replaces “equals”
How are the boundaries between relational and document models blurring in 2026?
?
- PostgreSQL: JSONB type gives full document model features (indexing, querying) inside relational DB
- MongoDB: Added
$lookup(joins), multi-document ACID transactions, time series collections - CockroachDB: Relational SQL with horizontal scaling (previously a NoSQL advantage)
- Trend: Multi-model databases replacing “polyglot persistence” in many stacks
- Implication: The original NoSQL vs SQL debate is largely obsolete — choose based on data shape, not just scaling needs
Interview Scenarios
You’re designing a LinkedIn-style profile system. Which data model would you choose and why?
?
Choice: Document model (or relational with JSONB)
Reasoning:
- Profile is tree-shaped: one user → many positions, education entries, skills
- Self-contained: most reads load the full profile
- Schema varies: different users have different optional fields
- Document model: Natural fit for one-to-many tree, good data locality
However:
- Connections between users = many-to-many → needs relational or graph
- Job recommendations = graph traversal
- Real LinkedIn: Uses relational DB (Oracle/Espresso) for profiles + graph DB for social graph
Interview answer pattern: Identify data shape first, then match to model
How would you design a fraud detection system’s data layer?
?
Choice: Graph database + relational hybrid
Reasoning:
- Fraud detection requires traversing relationships: “Is this device linked to a known fraudster’s network?”
- Queries like: find all accounts reachable from account X within 3 hops
- Relationship traversal = graph is ideal
Graph component (Neo4j / Amazon Neptune):
- Vertices: users, devices, IP addresses, cards
- Edges: used_device, linked_card, shares_IP
Relational component:
- Transaction history (highly structured, aggregate queries)
- Account master data
Key pattern: Separate data by access pattern and relationship complexity
A startup wants to launch quickly with flexible data requirements. What data model?
?
Choice: Document model (MongoDB or PostgreSQL with JSONB)
Reasoning:
- Early-stage: schema changes frequently, requirements unclear
- Schema-on-read allows adding/removing fields without migrations
- Most startup data is user-centric with tree-like profiles
Recommendation:
- Start with PostgreSQL + JSONB: get document flexibility with ACID guarantees
- Can add relational structure as requirements solidify
- Avoids switching databases later
Caution:
- Schema-on-read moves validation to application code — enforce it explicitly
- Don’t let flexibility become inconsistent data
Quick Facts
What year was the relational model introduced and by whom?
?
- 1970 by Edgar F. Codd (IBM researcher)
- Paper: “A Relational Model of Data for Large Shared Data Banks”
- Introduced relations (tables), tuples (rows), declarative queries
- Dominated database landscape for 40+ years
What does NoSQL stand for and what does it actually mean?
?
- Originally: “No SQL” (anti-relational)
- Retroactively redefined: “Not Only SQL”
- Encompasses: Document, key-value, wide-column, graph databases
- Not a single technology: Just a loose category of non-relational databases
- Main drivers: Need for greater scalability, schema flexibility, specialized query operations
What is an RDF triple and give an example?
?
- RDF = Resource Description Framework
- Triple format: (subject, predicate, object)
- All data expressed as triples:
(person:Jim, schema:birthPlace, location:Idaho)→ relationship(person:Jim, schema:age, "33"^^xsd:integer)→ literal value
- Standard format for linked data / semantic web
- Query language: SPARQL
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13