Chapter 6 Flashcards — Pulling Apart Operational Data
flashcards saht data-decomposition database-types polyglot-persistence
What are data disintegrators in the context of database decomposition?
?
Data disintegrators are architectural forces that push toward splitting a shared monolithic database into separate, service-owned databases. There are six: change control, connection management, scalability, fault tolerance, architectural quantum, and database type optimization.
What are data integrators in the context of database decomposition?
?
Data integrators are forces that argue for keeping data together in a shared database. There are two: data relationships (foreign keys and joins) and database transactions (ACID guarantees). They represent real capabilities that are lost or degraded when data is split.
Why is change control a data disintegrator?
?
When multiple services share a single database schema, a schema change made to serve one service can break other services. Change control as a disintegrator means: services that deploy and evolve independently should own their data independently. The signal is schema changes that require coordinating multiple teams or long schema-freeze windows before deployments.
Why is connection management a data disintegrator?
?
A database server has a finite connection pool. As microservices scale out (e.g., 100 instances × 10 connections each = 1,000 connections), the shared database’s connection limit is exhausted. Connection management is a disintegrator because services with high connection demand benefit from their own database servers with their own connection pools.
Why is fault tolerance a data disintegrator?
?
A shared database is a single point of failure: one database outage takes down every service that depends on it. Fault tolerance as a disintegrator means: services with high availability SLAs need database isolation so an outage in one service’s database doesn’t cascade to unrelated services.
Why is architectural quantum a data disintegrator?
?
An architectural quantum is an independently deployable unit with high functional cohesion. A shared database creates static coupling — services share a schema version dependency, preventing true independent deployability. To achieve quantum independence, each service must own its own data store with no shared schema.
Why are data relationships a data integrator?
?
Relational databases maintain referential integrity via foreign keys and support efficient JOINs between tables. When tables are split across separate databases, foreign keys between them cannot be enforced at the database level — they must be replaced with application-level checks that are weaker, slower, and bypassable. Joins must be replaced with two queries plus an in-memory merge.
Why are database transactions (ACID) a data integrator?
?
ACID transactions guarantee atomicity across multiple table operations within a single database. When a business operation spans tables in two different databases (after splitting), atomicity is lost — you cannot do a native BEGIN ... COMMIT across two databases. The alternatives — sagas, eventual consistency, two-phase commit — are all more complex and/or slower.
What does it mean to lose atomicity when data is split across services?
?
Atomicity means a group of operations either all succeed or all fail as a unit. Within a single database, BEGIN ... COMMIT provides this for free. Once data is split, a business operation touching two services’ databases cannot use a native database transaction. Instead, the system must use saga patterns with compensating transactions to simulate atomicity, which requires explicit error handling and rollback logic.
What are the 5 steps of the data decomposition process in order?
?
- Analyze the database and create data domains (group tables by bounded context).
- Assign tables to data domains (decide ownership of each table).
- Separate database connections to data domains (enforce domain access via credentials on the same server).
- Move schemas to separate database servers (physically migrate data).
- Switch over to independent database servers (decommission cross-DB replication, remove old tables).
Why is Step 3 (separate connections before moving data) so valuable in the decomposition process?
?
Step 3 enforces domain-level database credentials on a single physical database server, so any cross-domain table access in application code immediately causes an authentication/authorization error. This surfaces hidden cross-domain dependencies safely — while all data is still on one server (so joins are still trivially executable for diagnostics). It turns a code discovery problem into a deployment-safe refactoring, before the much riskier physical data migration.
What must happen to database foreign keys when data moves to separate servers (Step 4)?
?
Cross-domain foreign keys must be dropped from the database schema — they cannot be enforced across two separate database servers. Application code must replace them with application-level referential integrity checks (e.g., the ticket service calls the customer service API to validate that a customer exists before creating a ticket). This is weaker, not atomic, and adds network latency.
What is a data domain in the decomposition context?
?
A data domain is a logical cluster of tables that conceptually belong together under one bounded context. It is the database-level equivalent of a microservice domain. During Step 1, you group all tables in the monolithic database into data domains (e.g., Customer domain, Ticket domain, Billing domain) before any physical separation occurs.
What is polyglot persistence?
?
Polyglot persistence is the practice of using different database types for different services, choosing the best-fit technology for each service’s data model and access patterns, rather than using one database type (typically relational) for everything. The term “polyglot” reflects using multiple database “languages” (technologies) within a single system.
What are the two main costs of polyglot persistence?
?
- Operational expertise fragmentation: Each database technology requires specialized knowledge to operate, tune, monitor, and backup. Multiple technologies multiply operational burden.
- Infrastructure complexity: More database types means more systems to provision, patch, secure, and maintain — plus more failure modes to understand and handle.
What is the core data model of a relational database, and when is it the best choice?
?
Core model: Tables with rows and columns, enforced schema, foreign key relationships, SQL queries.
Best choice when: You need ACID transactions across multiple entities, complex queries (JOINs, aggregations), strong referential integrity, or your team needs mature tooling and broad SQL expertise. Canonical use cases: financial ledgers, order management, any domain where business logic depends heavily on relational constraints.
What is the core data model of a key-value database, and when is it the best choice?
?
Core model: A hash map at scale — each record is an opaque value addressed by a unique key. No schema, no relationships, no query flexibility beyond exact key lookup.
Best choice when: You need the fastest possible single-key reads and writes — O(1) lookups. Canonical use cases: session data, caching, shopping carts, rate limiting counters, feature flags. Examples: Redis, Memcached.
What is the core data model of a document database, and when is it the best choice?
?
Core model: Self-describing JSON/BSON documents stored in collections. Each document can have different fields (schema-flexible), with nested objects and arrays.
Best choice when: Records have variable or hierarchical structure, schema evolves per-record, or data maps naturally to programming language objects. Canonical use cases: product catalogs (variable attributes), content management, user profiles, event logs with variable payloads. Examples: MongoDB, CouchDB, Firestore.
What is the core data model of a column family database, and when is it the best choice?
?
Core model: Rows with variable, potentially thousands of columns grouped into column families. Different rows can have entirely different columns. Storage optimized for sequential writes (LSM-tree).
Best choice when: You have massive write throughput requirements, time-series or append-only data, or data that is naturally wide-row. Query patterns must be designed around partition keys. Canonical use cases: IoT sensor data, audit logs, activity feeds, metrics at scale. Examples: Apache Cassandra, HBase, Google Bigtable.
What is the key design principle that makes column family databases different from relational databases?
?
In relational databases, you design schema around the data (normalize it, then query any way you want). In column family databases, you design schema around your query patterns — the partition key and clustering key determine what queries are efficient. Wrong key choices lead to hot partitions and full scans. There is no flexible ad-hoc querying.
What is the core data model of a graph database, and when is it the best choice?
?
Core model: Data represented as nodes (entities) and edges (relationships), both with arbitrary properties. Queries traverse the graph by following edges.
Best choice when: Relationships between entities are the primary data access pattern — especially multi-hop traversals. Canonical use cases: social networks (friends of friends), fraud detection (rings of connected accounts), recommendation engines, knowledge graphs, network topology, access control hierarchies. Examples: Neo4j, Amazon Neptune.
Why do graph databases outperform relational databases for multi-hop relationship queries?
?
Relational databases infer relationships at query time by joining tables based on foreign key values — which requires scanning index structures for each hop. Graph databases store adjacency directly as edges, so following a relationship is a pointer dereference rather than an index lookup. For queries that traverse 3+ hops, this is orders of magnitude faster because there is no compounding join cost.
What is the core data model of a NewSQL database, and when is it the best choice?
?
Core model: A relational interface (SQL, JDBC, ACID transactions) built on top of a horizontally distributed architecture. Provides standard SQL semantics with distributed ACID transactions.
Best choice when: You need ACID semantics but have outgrown a single-node RDBMS — high-throughput OLTP at distributed scale. Also good for global applications needing data locality. Canonical use cases: financial services at scale, global e-commerce. Examples: CockroachDB, Google Spanner, TiDB, YugabyteDB.
What is the key differentiator of NewSQL vs. traditional NoSQL?
?
NewSQL provides distributed ACID transactions — true atomicity across multiple nodes — which NoSQL systems gave up to achieve horizontal scalability. With NewSQL you get both: SQL interface, ACID guarantees, AND horizontal scale. The cost is higher latency for distributed transactions (consensus protocols like Raft/Paxos add overhead) and operational complexity.
What is the core data model of a cloud native database, and when is it the best choice?
?
Core model: A fully managed database service provided by a cloud platform. The underlying data model varies (can be relational, document, etc.), but the defining characteristic is that infrastructure management (provisioning, patching, replication, backups, failover) is handled by the cloud provider.
Best choice when: Operational simplicity is the priority — small teams without dedicated DBA capacity, cloud-first organizations, or workloads with variable/elastic scale requirements. Examples: Amazon Aurora, Amazon DynamoDB, Google Firestore, PlanetScale.
What is the core data model of a time-series database, and when is it the best choice?
?
Core model: Optimized for storing time-indexed measurements — sequences of (timestamp, metric_name, value) tuples. Storage is optimized for sequential append and time-range reads, with built-in compression of time-ordered data.
Best choice when: You have high-volume streams of timestamped data requiring time-range queries and aggregations (average, rate, min/max over windows). Canonical use cases: application metrics, IoT sensor data, financial tick data, business KPI tracking. Examples: InfluxDB, TimescaleDB, Prometheus.
What built-in operational features make time-series databases uniquely suited to metrics data?
?
- Retention policies: Automatically delete data older than a configured age without manual cleanup.
- Downsampling: Automatically replace high-resolution raw data with lower-resolution aggregates over time (e.g., keep raw data for 7 days, hourly averages for 6 months).
- Delta compression: Sequential time-series readings have small deltas; TSDBs exploit this for extremely high compression ratios.
- Time-window functions: Built-in aggregations over time windows (moving average, rate of change) that would require complex SQL in a relational database.
When should you not use a graph database despite having relationship data?
?
When the data is not predominantly accessed via relationship traversal. If you primarily look up entities by ID and occasionally check a relationship, a relational database with foreign keys is simpler and easier to operate. Also avoid graph databases when horizontal scalability is critical — graph databases are hard to partition because edges cross partition boundaries. The specialization cost is only worth it when multi-hop traversal is a frequent, core query pattern.
In the Sysops Squad Saga, what database type was chosen for the Survey domain, and why?
?
Document database (MongoDB) was chosen for the Survey domain. The rationale: survey schemas are inherently variable — different surveys have different question types (multiple choice, text, rating scale, Likert), different numbers of questions, and different answer formats. A relational schema would require constant migrations or awkward JSON-blob columns. Document storage makes each survey a self-describing document with its own structure, eliminating schema change overhead.
In the Sysops Squad Saga, which domains stayed relational and why?
?
Ticket, Customer, and Billing stayed relational (PostgreSQL). The rationale:
- Billing: Financial data — ACID is non-negotiable; incorrect partial updates to billing records are unacceptable.
- Customer: Core reference data requiring strong consistency; many services need to trust the customer record.
- Ticket: Complex status tracking with ACID state transitions; rich queries needed.
For these domains, the disintegrator forces (change control, connection management) were addressed by moving to separate database servers — no need to also change the database type.
What is the challenge the Sysops Squad Saga reveals about the Reporting domain after decomposition?
?
The Reporting domain traditionally reads from all other domains’ tables — it needs a consolidated view across Ticket, Customer, Billing, Survey, and others. After database decomposition, no single database has all this data. The team must choose between: (a) read replicas per domain with application-layer aggregation, or (b) an event-driven architecture where each domain publishes data to a dedicated reporting store. This foreshadows the analytical data patterns in ch14-managing-analytical-data.
What is the decision sequence for choosing a database type for a service?
?
- Primary access pattern: Key lookup → key-value; hierarchical/variable records → document; relationship traversal → graph; time-ordered metrics → time-series; high-volume wide rows → column family; complex queries + ACID → relational/NewSQL.
- Need distributed ACID? Yes → NewSQL. No → any type.
- Horizontal scale needed? Yes → key-value, document, column family, NewSQL, or cloud native. No → relational is fine.
- Schema stability? Stable → relational. Variable per record → document.
- Operational capacity? Small team → cloud native (managed). Dedicated ops → any.
What is the fundamental tension captured by “data disintegrators vs. data integrators”?
?
The tension is between operational independence (what disintegrators deliver: independent scaling, deployment, fault isolation, right-fit technology) and data consistency guarantees (what integrators protect: referential integrity via FK constraints, and atomic multi-entity operations via ACID transactions). Splitting a database gains the former and loses the latter. The architecture must explicitly decide which matters more in each context, and if splitting, explicitly handle the lost guarantees in application code or via saga patterns.
Total Cards: 33
Priority: HIGH
Last Updated: 2026-05-30