Chapter 2 Flashcards - Data Models and Query Languages

flashcards chapter-2 ddia

Basic Concepts

What is the object-relational impedance mismatch?
?

Definition: The disconnect between how OOP languages represent data (objects) and how relational databases store it (tables/rows)
Application code uses objects with nested structures; DB uses flat tables
ORMs (Hibernate, ActiveRecord) reduce boilerplate but don’t fully solve it
Document model: Maps more naturally to in-memory objects, reducing mismatch

What are the three main data models covered in DDIA Chapter 2?
?

Relational model: Data in tables, rows related by foreign keys, powerful joins
Document model: Self-contained JSON/BSON documents, tree-shaped, schema-on-read
Graph model: Vertices + edges, designed for highly connected many-to-many data

What is schema-on-read vs schema-on-write?
?

Schema-on-write (relational): Schema enforced by DB at write time; ALTER TABLE to change
- Like static type checking: errors caught early
Schema-on-read (document): Schema implicit, interpreted at read time; app handles variation
- Like dynamic type checking: flexible but bugs surface later
Neither is universally better — depends on how uniform and stable your data is

What is data locality and when does it matter?
?

Definition: Storing related data physically close together to improve read performance
Document locality: Entire document loaded at once (good if you need all fields, wasteful otherwise)
When it matters: Performance-sensitive reads that need most of a document’s fields
Relational alternative: Interleaved tables (Google Spanner), column-family stores

What is the difference between declarative and imperative query languages?
?

Imperative: Tells computer how to compute step-by-step (most programming languages)
- Specifies exact execution path
- Hard to optimize or parallelize
Declarative: Specifies what result you want (SQL, CSS, Cypher)
- Query optimizer figures out execution
- Can take advantage of new indexes, parallelism, without query changes
SQL’s declarative nature is a key reason for its longevity and performance

Relational Model

What problem did the relational model solve that the network and hierarchical models couldn’t?
?

Hierarchical model (1960s IMS): Tree structure; many-to-many required duplication
Network model (CODASYL): Linked lists of records; required programmers to track access paths manually
Relational model (Codd, 1970): Query optimizer handles access paths automatically
Key innovation: declarative query language (SQL) + query optimizer = programmers don’t need to think about physical data layout

Why is normalization preferred in relational databases?
?

Removes duplication: Each fact stored once; updates in one place
IDs never change: Even if human-readable value changes (city name), ID stays stable
Consistency: No risk of inconsistent copies of same data
Trade-off: Requires joins at read time (slower reads, simpler writes)
Many-to-one: Normalized data uses IDs = many records point to one canonical value

Document Model

When should you choose a document model over relational?
?

Data has a tree-like (one-to-many) structure (e.g., resume with positions/education)
You need schema flexibility for heterogeneous data
Data locality needed: load entire document at once for performance
No significant many-to-many relationships requiring joins
External systems control data shape (you can’t enforce schema anyway)
Avoid document model when: data has many cross-document references or complex joins

What is the key limitation of document databases for many-to-many relationships?
?

Document DBs don’t support joins natively (or joins are limited/slow)
Application must do multiple queries and join data in code
Denormalization (storing data in multiple documents) causes duplication and update headaches
Historical parallel: Same problem as pre-relational hierarchical/network models
Modern solution: MongoDB added $lookup (aggregation join), but still not as powerful as SQL JOINs

Graph Model

What two things does a property graph model consist of?
?
Vertices (nodes) have:

Unique identifier
Set of outgoing edges
Set of incoming edges
Collection of key-value properties

Edges have:

Unique identifier
Tail vertex (start)
Head vertex (end)
Label describing the relationship
Collection of key-value properties

Any vertex can have an edge to any other vertex — no schema restriction on connections

What is a triple-store and how does it represent data?
?

Definition: All information stored as three-part statements: (subject, predicate, object)
Subject: Vertex (the entity being described)
Predicate: Edge label or property key
Object: Another vertex (a relationship) OR a value (a property)
Examples:
- (Jim, age, 33) → property
- (Jim, bornIn, Idaho) → relationship
Standards: RDF (Resource Description Framework), SPARQL query language
Used by: AllegroGraph, Apache Jena, Datomic

When should you choose a graph database?
?

Data has many-to-many relationships with complex interconnections
Traversal is a primary operation (“find all friends within 3 hops”)
Relationships are as important as the data itself
Use cases: Social networks, recommendation engines, fraud detection, knowledge graphs, road networks, access control graphs
Not worth it when: data is mostly tree-shaped or has few relationships

Query Languages

How does Cypher (Neo4j’s query language) express a graph pattern?
?

Cypher is a declarative pattern-matching language for graphs
Uses ASCII art to express graph patterns
Example: Find all people born in the US who emigrated to Europe:

MATCH (person)-[:BORN_IN]->(us:Location {name:'United States'}),
      (person)-[:LIVES_IN]->(europe:Location {continent:'Europe'})
RETURN person.name

(person) = vertex, -[:BORN_IN]-> = directed edge with label
Much more concise than equivalent SQL with self-joins

What is MapReduce and how does it relate to declarative/imperative querying?
?

MapReduce: Programming model for large-scale distributed data processing
Map function: Called for each document; emits key-value pairs
Reduce function: Aggregates values with the same key
Position: Neither purely declarative nor imperative — hybrid
Advantage: Can use full programming language for complex logic
Disadvantage: Verbose, harder to optimize than pure declarative
MongoDB added aggregation pipeline as more declarative alternative to MapReduce

Trade-offs and Design Decisions

What are the trade-offs between denormalization and normalization?
?
Normalization:

✅ Single source of truth, easy to update
✅ No duplicate data
❌ Joins required at read time
Use when: data changes, consistency matters

Denormalization:

✅ Faster reads, no joins
❌ Duplicate data, complex updates
❌ Risk of inconsistency
Use when: read-heavy, data rarely changes, performance critical

Rule of thumb: Normalize first, denormalize only when you have measured a performance problem

Why do many applications today use a combination of data models?
?

Different data has different shapes: User profiles (document) + transaction history (relational) + social graph (graph)
Polyglot persistence: Use the right tool for each type of data
Modern convergence: PostgreSQL supports JSONB; MongoDB supports joins
Practical approach: Start with one model, add others as needs emerge
Cost: Operating multiple databases adds complexity; multi-model DBs (PostgreSQL) reduce this

Modern Context (2026)

How have vector databases introduced a fourth major data model?
?

Vector/embedding model (emerged 2020s, mainstream 2024-2026)
Data: High-dimensional floating-point vectors (embeddings from ML models)
Query: Similarity search (find nearest neighbors by cosine distance / dot product)
Use cases: Semantic search, recommendation, image similarity, RAG (retrieval-augmented generation)
Products: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension), Chroma
Key difference: No exact-match queries — “most similar” replaces “equals”

How are the boundaries between relational and document models blurring in 2026?
?

PostgreSQL: JSONB type gives full document model features (indexing, querying) inside relational DB
MongoDB: Added $lookup (joins), multi-document ACID transactions, time series collections
CockroachDB: Relational SQL with horizontal scaling (previously a NoSQL advantage)
Trend: Multi-model databases replacing “polyglot persistence” in many stacks
Implication: The original NoSQL vs SQL debate is largely obsolete — choose based on data shape, not just scaling needs

Interview Scenarios

You’re designing a LinkedIn-style profile system. Which data model would you choose and why?
?
Choice: Document model (or relational with JSONB)

Reasoning:

Profile is tree-shaped: one user → many positions, education entries, skills
Self-contained: most reads load the full profile
Schema varies: different users have different optional fields
Document model: Natural fit for one-to-many tree, good data locality

However:

Connections between users = many-to-many → needs relational or graph
Job recommendations = graph traversal
Real LinkedIn: Uses relational DB (Oracle/Espresso) for profiles + graph DB for social graph

Interview answer pattern: Identify data shape first, then match to model

How would you design a fraud detection system’s data layer?
?
Choice: Graph database + relational hybrid

Reasoning:

Fraud detection requires traversing relationships: “Is this device linked to a known fraudster’s network?”
Queries like: find all accounts reachable from account X within 3 hops
Relationship traversal = graph is ideal

Graph component (Neo4j / Amazon Neptune):

Vertices: users, devices, IP addresses, cards
Edges: used_device, linked_card, shares_IP

Relational component:

Transaction history (highly structured, aggregate queries)
Account master data

Key pattern: Separate data by access pattern and relationship complexity

A startup wants to launch quickly with flexible data requirements. What data model?
?
Choice: Document model (MongoDB or PostgreSQL with JSONB)

Reasoning:

Early-stage: schema changes frequently, requirements unclear
Schema-on-read allows adding/removing fields without migrations
Most startup data is user-centric with tree-like profiles

Recommendation:

Start with PostgreSQL + JSONB: get document flexibility with ACID guarantees
Can add relational structure as requirements solidify
Avoids switching databases later

Caution:

Schema-on-read moves validation to application code — enforce it explicitly
Don’t let flexibility become inconsistent data

Quick Facts

What year was the relational model introduced and by whom?
?

1970 by Edgar F. Codd (IBM researcher)
Paper: “A Relational Model of Data for Large Shared Data Banks”
Introduced relations (tables), tuples (rows), declarative queries
Dominated database landscape for 40+ years

What does NoSQL stand for and what does it actually mean?
?

Originally: “No SQL” (anti-relational)
Retroactively redefined: “Not Only SQL”
Encompasses: Document, key-value, wide-column, graph databases
Not a single technology: Just a loose category of non-relational databases
Main drivers: Need for greater scalability, schema flexibility, specialized query operations

What is an RDF triple and give an example?
?

RDF = Resource Description Framework
Triple format: (subject, predicate, object)
All data expressed as triples:
- (person:Jim, schema:birthPlace, location:Idaho) → relationship
- (person:Jim, schema:age, "33"^^xsd:integer) → literal value
Standard format for linked data / semantic web
Query language: SPARQL

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

ch02-flashcards

Chapter 2 Flashcards - Data Models and Query Languages

Basic Concepts

Relational Model

Document Model

Graph Model

Query Languages

Trade-offs and Design Decisions

Modern Context (2026)

Interview Scenarios

Quick Facts

Graph View

Table of Contents