Chapter 2 Flashcards - Data Models and Query Languages

flashcards chapter-2 ddia


Basic Concepts

What is the object-relational impedance mismatch?
?

  • Definition: The disconnect between how OOP languages represent data (objects) and how relational databases store it (tables/rows)
  • Application code uses objects with nested structures; DB uses flat tables
  • ORMs (Hibernate, ActiveRecord) reduce boilerplate but don’t fully solve it
  • Document model: Maps more naturally to in-memory objects, reducing mismatch

What are the three main data models covered in DDIA Chapter 2?
?

  1. Relational model: Data in tables, rows related by foreign keys, powerful joins
  2. Document model: Self-contained JSON/BSON documents, tree-shaped, schema-on-read
  3. Graph model: Vertices + edges, designed for highly connected many-to-many data

What is schema-on-read vs schema-on-write?
?

  • Schema-on-write (relational): Schema enforced by DB at write time; ALTER TABLE to change
    • Like static type checking: errors caught early
  • Schema-on-read (document): Schema implicit, interpreted at read time; app handles variation
    • Like dynamic type checking: flexible but bugs surface later
  • Neither is universally better — depends on how uniform and stable your data is

What is data locality and when does it matter?
?

  • Definition: Storing related data physically close together to improve read performance
  • Document locality: Entire document loaded at once (good if you need all fields, wasteful otherwise)
  • When it matters: Performance-sensitive reads that need most of a document’s fields
  • Relational alternative: Interleaved tables (Google Spanner), column-family stores

What is the difference between declarative and imperative query languages?
?

  • Imperative: Tells computer how to compute step-by-step (most programming languages)
    • Specifies exact execution path
    • Hard to optimize or parallelize
  • Declarative: Specifies what result you want (SQL, CSS, Cypher)
    • Query optimizer figures out execution
    • Can take advantage of new indexes, parallelism, without query changes
  • SQL’s declarative nature is a key reason for its longevity and performance

Relational Model

What problem did the relational model solve that the network and hierarchical models couldn’t?
?

  • Hierarchical model (1960s IMS): Tree structure; many-to-many required duplication
  • Network model (CODASYL): Linked lists of records; required programmers to track access paths manually
  • Relational model (Codd, 1970): Query optimizer handles access paths automatically
  • Key innovation: declarative query language (SQL) + query optimizer = programmers don’t need to think about physical data layout

Why is normalization preferred in relational databases?
?

  • Removes duplication: Each fact stored once; updates in one place
  • IDs never change: Even if human-readable value changes (city name), ID stays stable
  • Consistency: No risk of inconsistent copies of same data
  • Trade-off: Requires joins at read time (slower reads, simpler writes)
  • Many-to-one: Normalized data uses IDs = many records point to one canonical value

Document Model

When should you choose a document model over relational?
?

  • Data has a tree-like (one-to-many) structure (e.g., resume with positions/education)
  • You need schema flexibility for heterogeneous data
  • Data locality needed: load entire document at once for performance
  • No significant many-to-many relationships requiring joins
  • External systems control data shape (you can’t enforce schema anyway)
    Avoid document model when: data has many cross-document references or complex joins

What is the key limitation of document databases for many-to-many relationships?
?

  • Document DBs don’t support joins natively (or joins are limited/slow)
  • Application must do multiple queries and join data in code
  • Denormalization (storing data in multiple documents) causes duplication and update headaches
  • Historical parallel: Same problem as pre-relational hierarchical/network models
  • Modern solution: MongoDB added $lookup (aggregation join), but still not as powerful as SQL JOINs

Graph Model

What two things does a property graph model consist of?
?
Vertices (nodes) have:

  • Unique identifier
  • Set of outgoing edges
  • Set of incoming edges
  • Collection of key-value properties

Edges have:

  • Unique identifier
  • Tail vertex (start)
  • Head vertex (end)
  • Label describing the relationship
  • Collection of key-value properties

Any vertex can have an edge to any other vertex — no schema restriction on connections

What is a triple-store and how does it represent data?
?

  • Definition: All information stored as three-part statements: (subject, predicate, object)
  • Subject: Vertex (the entity being described)
  • Predicate: Edge label or property key
  • Object: Another vertex (a relationship) OR a value (a property)
  • Examples:
    • (Jim, age, 33) → property
    • (Jim, bornIn, Idaho) → relationship
  • Standards: RDF (Resource Description Framework), SPARQL query language
  • Used by: AllegroGraph, Apache Jena, Datomic

When should you choose a graph database?
?

  • Data has many-to-many relationships with complex interconnections
  • Traversal is a primary operation (“find all friends within 3 hops”)
  • Relationships are as important as the data itself
  • Use cases: Social networks, recommendation engines, fraud detection, knowledge graphs, road networks, access control graphs
  • Not worth it when: data is mostly tree-shaped or has few relationships

Query Languages

How does Cypher (Neo4j’s query language) express a graph pattern?
?

  • Cypher is a declarative pattern-matching language for graphs
  • Uses ASCII art to express graph patterns
  • Example: Find all people born in the US who emigrated to Europe:
MATCH (person)-[:BORN_IN]->(us:Location {name:'United States'}),
      (person)-[:LIVES_IN]->(europe:Location {continent:'Europe'})
RETURN person.name
  • (person) = vertex, -[:BORN_IN]-> = directed edge with label
  • Much more concise than equivalent SQL with self-joins

What is MapReduce and how does it relate to declarative/imperative querying?
?

  • MapReduce: Programming model for large-scale distributed data processing
  • Map function: Called for each document; emits key-value pairs
  • Reduce function: Aggregates values with the same key
  • Position: Neither purely declarative nor imperative — hybrid
  • Advantage: Can use full programming language for complex logic
  • Disadvantage: Verbose, harder to optimize than pure declarative
  • MongoDB added aggregation pipeline as more declarative alternative to MapReduce

Trade-offs and Design Decisions

What are the trade-offs between denormalization and normalization?
?
Normalization:

  • ✅ Single source of truth, easy to update
  • ✅ No duplicate data
  • ❌ Joins required at read time
  • Use when: data changes, consistency matters

Denormalization:

  • ✅ Faster reads, no joins
  • ❌ Duplicate data, complex updates
  • ❌ Risk of inconsistency
  • Use when: read-heavy, data rarely changes, performance critical

Rule of thumb: Normalize first, denormalize only when you have measured a performance problem

Why do many applications today use a combination of data models?
?

  • Different data has different shapes: User profiles (document) + transaction history (relational) + social graph (graph)
  • Polyglot persistence: Use the right tool for each type of data
  • Modern convergence: PostgreSQL supports JSONB; MongoDB supports joins
  • Practical approach: Start with one model, add others as needs emerge
  • Cost: Operating multiple databases adds complexity; multi-model DBs (PostgreSQL) reduce this

Modern Context (2026)

How have vector databases introduced a fourth major data model?
?

  • Vector/embedding model (emerged 2020s, mainstream 2024-2026)
  • Data: High-dimensional floating-point vectors (embeddings from ML models)
  • Query: Similarity search (find nearest neighbors by cosine distance / dot product)
  • Use cases: Semantic search, recommendation, image similarity, RAG (retrieval-augmented generation)
  • Products: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension), Chroma
  • Key difference: No exact-match queries — “most similar” replaces “equals”

How are the boundaries between relational and document models blurring in 2026?
?

  • PostgreSQL: JSONB type gives full document model features (indexing, querying) inside relational DB
  • MongoDB: Added $lookup (joins), multi-document ACID transactions, time series collections
  • CockroachDB: Relational SQL with horizontal scaling (previously a NoSQL advantage)
  • Trend: Multi-model databases replacing “polyglot persistence” in many stacks
  • Implication: The original NoSQL vs SQL debate is largely obsolete — choose based on data shape, not just scaling needs

Interview Scenarios

You’re designing a LinkedIn-style profile system. Which data model would you choose and why?
?
Choice: Document model (or relational with JSONB)

Reasoning:

  • Profile is tree-shaped: one user → many positions, education entries, skills
  • Self-contained: most reads load the full profile
  • Schema varies: different users have different optional fields
  • Document model: Natural fit for one-to-many tree, good data locality

However:

  • Connections between users = many-to-many → needs relational or graph
  • Job recommendations = graph traversal
  • Real LinkedIn: Uses relational DB (Oracle/Espresso) for profiles + graph DB for social graph

Interview answer pattern: Identify data shape first, then match to model

How would you design a fraud detection system’s data layer?
?
Choice: Graph database + relational hybrid

Reasoning:

  • Fraud detection requires traversing relationships: “Is this device linked to a known fraudster’s network?”
  • Queries like: find all accounts reachable from account X within 3 hops
  • Relationship traversal = graph is ideal

Graph component (Neo4j / Amazon Neptune):

  • Vertices: users, devices, IP addresses, cards
  • Edges: used_device, linked_card, shares_IP

Relational component:

  • Transaction history (highly structured, aggregate queries)
  • Account master data

Key pattern: Separate data by access pattern and relationship complexity

A startup wants to launch quickly with flexible data requirements. What data model?
?
Choice: Document model (MongoDB or PostgreSQL with JSONB)

Reasoning:

  • Early-stage: schema changes frequently, requirements unclear
  • Schema-on-read allows adding/removing fields without migrations
  • Most startup data is user-centric with tree-like profiles

Recommendation:

  • Start with PostgreSQL + JSONB: get document flexibility with ACID guarantees
  • Can add relational structure as requirements solidify
  • Avoids switching databases later

Caution:

  • Schema-on-read moves validation to application code — enforce it explicitly
  • Don’t let flexibility become inconsistent data

Quick Facts

What year was the relational model introduced and by whom?
?

  • 1970 by Edgar F. Codd (IBM researcher)
  • Paper: “A Relational Model of Data for Large Shared Data Banks”
  • Introduced relations (tables), tuples (rows), declarative queries
  • Dominated database landscape for 40+ years

What does NoSQL stand for and what does it actually mean?
?

  • Originally: “No SQL” (anti-relational)
  • Retroactively redefined: “Not Only SQL”
  • Encompasses: Document, key-value, wide-column, graph databases
  • Not a single technology: Just a loose category of non-relational databases
  • Main drivers: Need for greater scalability, schema flexibility, specialized query operations

What is an RDF triple and give an example?
?

  • RDF = Resource Description Framework
  • Triple format: (subject, predicate, object)
  • All data expressed as triples:
    • (person:Jim, schema:birthPlace, location:Idaho) → relationship
    • (person:Jim, schema:age, "33"^^xsd:integer) → literal value
  • Standard format for linked data / semantic web
  • Query language: SPARQL

Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13