Chapter 6: Pulling Apart Operational Data
saht data-decomposition database-types polyglot-persistence operational-data
Status: Notes complete
Overview
Chapter 6 is the data counterpart to Chapters 4-5 (service decomposition). Decomposing a monolithic application into services is only half the job — the monolithic database underneath must also be split. This chapter answers: how do you break apart a shared operational database without losing the consistency and relationship guarantees it provides?
The central tension is that databases provide two enormously valuable capabilities — referential integrity (foreign keys across tables) and ACID transactions — and both become hard to maintain once you split across multiple databases. The chapter frames this as a trade-off analysis: when are the benefits of splitting worth the cost of losing those capabilities?
After covering decomposition, the chapter pivots to database type selection — once you’ve decided to split, you can also choose the right type of database for each service rather than defaulting to a relational database everywhere (polyglot persistence).
The Sysops Squad Saga runs throughout, showing the team decomposing their Oracle monolith and selecting different database types for different services.
Data Decomposition Drivers
Every decision to split (or keep together) operational data is driven by forces the book calls data disintegrators and data integrators. These are the architectural forces — you identify which are active in your context, then weigh them.
Disintegrators (Forces to Split)
These forces push toward separating tables into distinct databases owned by distinct services.
1. Change Control
- Problem: When many services share one database, a schema change to serve one service’s needs can break all other services.
- Force: Services that change independently should own their data independently.
- Signal: High rate of schema migrations that require coordinating multiple teams; long “schema freeze” windows before deployments.
- Example in Sysops Squad: The ticket service and the billing service both touch the same Oracle schema. A billing team schema change breaks the ticket service build.
2. Connection Management
- Problem: A single database server has a finite connection pool. As the number of microservices scales, each service needs its own pool of persistent connections, quickly exhausting the database’s limit.
- Force: Services with high connection demand need their own database servers.
- Signal: Connection pool exhaustion errors; “too many connections” alerts; use of external connection poolers like PgBouncer becoming critical infrastructure.
- Detail: A monolith with 10 threads uses 10 connections. 100 microservice instances at 10 connections each = 1000 connections. Most databases struggle beyond a few hundred to a few thousand.
3. Scalability
- Problem: One database must serve all services’ workloads simultaneously. A reporting query that scans millions of rows blocks OLTP queries.
- Force: Services with different scalability profiles (reads vs. writes, large scans vs. point lookups) benefit from databases sized for their specific load.
- Signal: One service’s load degrades all others; need to scale the entire database tier to accommodate a single service’s peak traffic.
4. Fault Tolerance
- Problem: A single database is a single point of failure. If it goes down, every service that uses it goes down.
- Force: Services with high availability requirements need database isolation so that one database outage doesn’t cascade.
- Signal: High-SLA services being affected by unrelated low-SLA service database operations; a failing database taking down unrelated functionality.
5. Architectural Quantum
- Problem: An “architecture quantum” is an independently deployable unit with high functional cohesion. A shared database creates static coupling between services — they share the same deployment dependency.
- Force: To achieve true independent deployability, each quantum needs its own data store.
- Signal: Services that are logically independent but can’t be deployed independently because they share schema versions.
- See: ch02-discerning-coupling for the full definition of architectural quantum.
6. Database Type Optimization
- Problem: Relational databases are a general-purpose tool. Some data is better represented and queried using a specialized database type (graph, document, time-series, etc.).
- Force: When a service’s data model and access patterns align strongly with a non-relational database type, splitting allows you to choose the right tool.
- Signal: Complex workarounds in SQL to simulate graph traversals; JSON blobs stored in text columns to simulate document storage.
Integrators (Forces to Keep Together)
These forces push back against splitting — they represent real costs incurred when you separate data.
1. Data Relationships
- Problem: Relational databases maintain referential integrity via foreign keys. When Table A in Service-1’s database references Table B in Service-2’s database, you can no longer use a foreign key — the constraint exists across a network boundary.
- Cost of splitting: You must enforce referential integrity in application code. This is error-prone and slower; application-level integrity checks can be bypassed, they’re not atomic, and they add latency.
- Specific concerns:
- Foreign keys across services: Must be replaced with application-enforced consistency.
- Joins across services: A JOIN that combined tables from two services must be replaced with two separate queries and an in-memory join, or data duplication, or a denormalized view. Each approach has costs.
- Data duplication: Duplicating reference data (e.g., customer name in both the order service and billing service) causes consistency challenges — which copy is authoritative? How do you keep them in sync?
2. Database Transactions (ACID)
- Problem: ACID transactions guarantee atomicity across multiple table operations on a single database. When data moves to separate databases, you lose the ability to atomically commit a change across two services’ data.
- Cost of splitting: You must use distributed transaction patterns — sagas (see ch12-transactional-sagas), two-phase commit (2PC), or eventual consistency — all of which are more complex, slower, or both.
- Specific ACID properties lost:
- Atomicity: Either all changes commit or none do — impossible natively across two databases.
- Consistency: Database-level constraints (check constraints, triggers) can no longer span services.
- Isolation: Row-level locking and transaction isolation levels don’t cross service boundaries.
- Durability: Still available within each database; not affected by splitting per se.
- The practical impact: A business operation that was one
BEGIN ... COMMITblock in a monolith becomes an orchestrated saga with compensating transactions in a distributed system.
The 5-Step Data Decomposition Process
The book provides a concrete, ordered process for migrating from a monolithic database to separate service-owned databases. Each step can be done incrementally.
Step 1: Analyze the Database and Create Data Domains
What you do: Examine every table (and view, stored procedure, trigger) in the monolithic database. Group tables into data domains — logical clusters of related tables that conceptually belong together.
How to do it:
- Draw an entity-relationship diagram if one doesn’t exist.
- Identify which tables are tightly related through foreign keys and shared queries.
- Use bounded context thinking from Domain-Driven Design: which tables “belong” to the customer domain? The order domain? The billing domain? The ticket domain?
- Identify cross-domain relationships: tables that have foreign keys to tables in another domain. These are the hard cases — they represent data coupling.
Output: A map showing: Domain → Set of tables, plus a list of cross-domain foreign key dependencies.
Sysops Squad example: The team identifies domains: Ticket, Customer, Billing, Survey, Reporting. The Ticket table has a foreign key to the Customer table (cross-domain). The Billing table has a foreign key to both Ticket and Customer (cross-domain).
Key insight: Lots of cross-domain foreign keys signal that splitting will be painful. The more cross-domain relationships, the stronger the integrator forces.
Step 2: Assign Tables to Data Domains
What you do: Make explicit, documented decisions about which domain owns each table. For tables with cross-domain relationships, decide where the table “lives” (which domain has primary ownership).
How to do it:
- A table is owned by the domain whose service writes to it most frequently, or whose service’s business capability the table primarily supports.
- For shared reference tables (e.g., a
status_codeslookup table used by five services), choose between:- Assign to the most appropriate domain and have other services call that domain’s API for the data.
- Duplicate the table into each domain (with a synchronization strategy).
- Create a shared reference data service.
- Document the rationale for contested tables.
Output: A definitive table-to-domain assignment map.
Key tension: A table that is written by one domain but read by five others is difficult to assign. The domain that writes it has a strong claim; the domains that read it will suffer if they must cross a service boundary for every read.
Step 3: Separate Database Connections to Data Domains
What you do: Before physically moving any data, change the application code so that each service connects only to the tables in its own data domain. Enforce this at the connection level.
How to do it:
- Create separate database users/schemas for each domain within the same physical database server.
- Grant each domain’s database user only access to that domain’s tables.
- Update each service’s database connection configuration to use its domain-specific credentials.
- Run the application and fix any access violations — these violations surface hidden cross-domain dependencies in the code.
Why this step is separate: It lets you find and fix cross-domain access patterns in application code while still on a single database server, where joins and cross-schema queries are easy to replace incrementally. The physical separation in Steps 4-5 then becomes a deployment operation rather than a code-change operation.
Output: Application code where each service accesses only its own schema, enforced by database credentials.
Key difficulty: Stored procedures and views that join across domains are exposed here. They must be eliminated or rewritten as application-layer joins or API calls.
Step 4: Move Schemas to Separate Database Servers
What you do: Physically migrate each domain’s tables to its own database server (or separate managed database instance). The application already connects to separate schemas; now the schemas live on separate hosts.
How to do it:
- Provision new database servers for each domain.
- Use data migration tools (database replication, ETL pipelines, or dump/restore) to copy data from the monolith to the new servers.
- Run both the monolith and the new databases in parallel (dual-write or replication) for a period to validate data integrity.
- Update connection strings in the application to point to the new servers, one domain at a time.
Cross-domain relationships during this step: Foreign keys between domains must now be dropped from the database schema (they can no longer be enforced by the database). Application-level checks must replace them.
Output: Each domain’s data living on its own database server, with the application successfully connecting to the right server for each domain.
Step 5: Switch Over to Independent Database Servers
What you do: Complete the migration by decommissioning cross-database replication/synchronization, removing the monolithic database’s tables for the migrated domains, and declaring each service’s database fully independent.
How to do it:
- Validate that no service is still reading from or writing to the old monolithic database for the migrated domain’s tables.
- Remove the now-redundant tables from the monolith (or archive them).
- Stop any replication streams between the monolith and the new databases.
- Decommission unused database servers if the monolith’s database is now empty of all domains.
Output: Each service owns its own fully independent operational database. No cross-database foreign keys. No shared schema. True independent deployability.
Key remaining challenge: Any business operation that previously used a single ACID transaction across now-separated tables must be handled via sagas or eventual consistency. This is addressed in ch09-data-ownership-distributed-transactions and ch12-transactional-sagas.
Database Type Selection Guide
Once data is decomposed into service-owned domains, you can choose the right database type for each service. The book covers eight operational database types. Below is a detailed treatment of each.
1. Relational Databases
Core data model: Tables with rows and columns, schemas enforced by the database engine, relationships via foreign keys, queries via SQL.
Key strengths:
- ACID transactions: The gold standard for consistency; multi-row, multi-table atomicity within a single database.
- Complex queries: SQL JOINs, aggregations, window functions, and subqueries are mature and powerful.
- Referential integrity: Foreign key constraints enforced at the database level.
- Mature tooling: Decades of operational tooling, ORMs, monitoring, and DBA expertise.
- Schema enforcement: Prevents bad data at the storage layer.
Key weaknesses:
- Horizontal scaling is hard: Traditional RDBMS (PostgreSQL, MySQL, Oracle) scale vertically well but sharding horizontally is complex and often requires middleware.
- Schema rigidity: Schema migrations require downtime or careful zero-downtime migration scripts; changes are expensive in high-traffic systems.
- ORM impedance mismatch: Object graphs don’t map cleanly to tabular data; complex ORMs create their own problems.
- Expensive for simple lookups: A full relational engine is heavyweight overhead for simple key-value access patterns.
Best-fit scenarios:
- Complex business logic requiring ACID transactions spanning multiple entities (financial ledgers, order processing).
- Applications with complex, ad-hoc reporting queries.
- Domains where data relationships are rich and important to the business logic.
- Teams that have existing SQL expertise and RDBMS infrastructure.
Examples: PostgreSQL, MySQL, Oracle, SQL Server, MariaDB.
2. Key-Value Databases
Core data model: A hash map / dictionary at database scale. Each record is a key (unique identifier) mapped to a value (opaque blob, string, integer, or simple structure). No schema. No relationships between records.
Key strengths:
- Extreme read/write speed: O(1) lookup by key; no query parsing, no join execution, no index scanning beyond the primary key hash.
- Horizontal scalability: Keys partition naturally across nodes (consistent hashing). Adding nodes is linear in complexity.
- Simple operations: Get, put, delete. Nothing else needed for the use case.
- Low latency: Typically sub-millisecond for in-memory implementations.
Key weaknesses:
- No query flexibility: You can only look up by the exact key. There are no range scans, no filters on value content, no aggregations.
- No relationships: No foreign keys, no joins. The application must maintain relationships.
- Value opacity: The database has no knowledge of what’s inside the value — you can’t query on fields inside the value.
- Limited data modeling: All problems must be reduced to key lookups; complex data models are forced into this shape awkwardly.
Best-fit scenarios:
- Session data: User sessions keyed by session ID, read and written on every request.
- Caching: Frequently read, rarely written reference data (product catalog cache, config values).
- Shopping carts: Keyed by user ID, value is the cart state.
- Feature flags: Keyed by flag name.
- Rate limiting counters: Keyed by user + time window.
Examples: Redis, Memcached, DynamoDB (in its simplest usage), Riak.
3. Document Databases
Core data model: JSON or BSON documents stored as records. Each document is self-describing (schema-flexible), can have nested objects and arrays, and is addressed by a document ID. Collections (analogous to tables) group related documents.
Key strengths:
- Schema flexibility: Each document in a collection can have different fields. Adding a new field to some records doesn’t require a migration of all records.
- Nested / hierarchical data: Natural fit for data that has a hierarchical structure — a product with variants, a blog post with embedded comments, a customer with embedded addresses.
- Developer ergonomics: Documents map naturally to programming language objects; no ORM impedance mismatch.
- Horizontal scalability: Documents shard naturally (by document ID or a shard key).
- Rich query model: Query on any field within documents (with indexes), including nested fields and array elements.
Key weaknesses:
- Weak consistency options: Many document databases offer eventual consistency or require explicit configuration to get stronger guarantees.
- No joins: Documents in different collections must be joined in application code or via denormalization.
- Duplication by design: Embedding related data (e.g., embedding the author’s name in each article document) causes data duplication; updating the author’s name requires touching many documents.
- Schema anarchy risk: Schema flexibility is a double-edged sword; without discipline, collections become inconsistent.
Best-fit scenarios:
- Product catalogs: Products with highly variable attributes (a shirt has color and size; a camera has megapixels and lens mount — different fields).
- Content management: Blog posts, articles, pages — hierarchical content with embedded metadata.
- User profiles: Variable user attributes, preferences, embedded addresses.
- Event logs (when events have variable payloads).
- Any domain with schema-on-read requirements.
Examples: MongoDB, CouchDB, Firestore, Amazon DocumentDB.
4. Column Family Databases (Wide-Column Databases)
Core data model: Data is organized by rows, but each row can have a large and variable set of columns grouped into column families. Unlike relational tables (where every row has the same columns), rows in a column family database can have thousands of distinct columns, and different rows can have entirely different columns. Storage is optimized for reads and writes of entire column families at once.
Key strengths:
- Massive write throughput: Designed for high-volume append-heavy workloads. Writes are sequential (LSM-tree based), avoiding random I/O.
- High scalability: Designed to scale to petabytes across hundreds of nodes (Cassandra, HBase are used at planet-scale).
- Time-series and ordered data: Column families can store time-ordered columns efficiently; scanning a time range is fast.
- Fault tolerance: Multi-datacenter replication is a first-class feature.
- Sparse data: Only stores columns that have values for a row; no wasted storage for null columns.
Key weaknesses:
- Limited query model: Queries must be designed around the data model; ad-hoc queries are difficult. You design your schema around your query patterns (not the other way around, as in relational databases).
- No ACID transactions across rows (in most implementations): Transactional guarantees are typically at the single-row level.
- No joins: Application must join data or denormalize.
- Schema design complexity: Getting the partition key and clustering key right requires significant upfront design; wrong choices lead to hot partitions and poor performance.
- Eventual consistency: Most implementations default to eventual consistency (tunable in Cassandra).
Best-fit scenarios:
- Time-series data: Sensor readings, financial tick data, server metrics — where you write continuously and read by time range.
- Audit logs: High-volume append-only logs.
- Messaging/activity feeds: User activity timelines, chat history.
- IoT data ingestion: High write volume from many devices.
- Leaderboards/counters: Distributed counters with high write contention.
Examples: Apache Cassandra, HBase, Azure Cosmos DB (table API), Google Bigtable.
5. Graph Databases
Core data model: Data is represented as a graph of nodes (entities) and edges (relationships between entities). Both nodes and edges can have properties (key-value attributes). Queries traverse the graph by following edges, making relationship-heavy queries natural and fast.
Key strengths:
- Relationship traversal performance: Queries that follow many-hop relationships (find friends of friends who bought X in city Y) are orders of magnitude faster than equivalent recursive SQL JOINs, because the database stores adjacency directly rather than inferring it from foreign keys at query time.
- Flexible schema: Nodes and edges can have arbitrary properties without schema changes.
- Intuitive data modeling: Many real-world domains are naturally graph-shaped (social networks, org charts, supply chains, dependency graphs).
- Graph algorithms: Built-in support for shortest path, PageRank, community detection, centrality — algorithms that are painful in SQL.
Key weaknesses:
- Poor fit for non-graph data: If your data isn’t relationship-heavy, you pay the complexity cost with none of the benefit.
- Scalability challenges: Graph databases are harder to partition (shard) than key-value or document databases, because relationships span partitions.
- Limited ecosystem: Smaller community, fewer ORMs and tooling compared to relational or document databases.
- Learning curve: Graph query languages (Cypher, Gremlin, SPARQL) are unfamiliar to most developers.
Best-fit scenarios:
- Social networks: Friend connections, follower graphs, mutual connections.
- Recommendation engines: “People who bought X also bought Y” — collaborative filtering via graph traversal.
- Fraud detection: Finding rings of accounts connected through transactions.
- Knowledge graphs: Ontologies, semantic relationships.
- Network topology: IT infrastructure dependencies, call graphs.
- Access control: Role hierarchies, permission inheritance.
Examples: Neo4j, Amazon Neptune, ArangoDB (multi-model), TigerGraph.
6. NewSQL Databases
Core data model: A relational database interface (SQL, ACID transactions) built on top of a horizontally scalable distributed architecture. NewSQL databases provide the developer experience of a traditional RDBMS with the scalability of NoSQL systems.
Key strengths:
- SQL interface + horizontal scale: You write standard SQL but the database transparently shards data across nodes.
- Distributed ACID transactions: True ACID semantics across nodes — the key differentiator from NoSQL systems. No need to implement sagas for transactions.
- Familiar tooling: SQL, JDBC/ODBC drivers, familiar query planning concepts.
- High throughput OLTP: Designed for high-volume transactional workloads at distributed scale.
- Strong consistency: Often use consensus protocols (Raft, Paxos) for consistency.
Key weaknesses:
- Operational complexity: More complex to operate than a single-node RDBMS; requires deep expertise.
- Distributed transaction overhead: Distributed ACID comes with latency cost; not as fast as single-node transactions.
- Newer technology: Less mature than PostgreSQL/Oracle; smaller ecosystems and less operational history.
- Cost: Often commercial products or require significant infrastructure.
- Feature gaps: May lack some advanced SQL features (certain window functions, stored procedures, specific JOIN optimizations) present in mature RDBMS.
Best-fit scenarios:
- High-throughput OLTP workloads that have outgrown a single-node relational database.
- Global applications requiring data locality (data stored close to users geographically) with consistent reads/writes.
- Applications that need ACID semantics but also horizontal scalability — the classic NewSQL sweet spot.
- Financial services, e-commerce at scale.
Examples: CockroachDB, Google Spanner, TiDB, YugabyteDB, VoltDB.
7. Cloud Native Databases
Core data model: Managed database services offered by cloud providers, designed for cloud-first deployments. They abstract away infrastructure management entirely. The underlying data model varies (can be relational, document, or other), but the defining characteristic is deep integration with cloud services and fully managed operations.
Key strengths:
- Fully managed: No server provisioning, patching, backup configuration, or replication setup. The cloud provider handles it.
- Elastic scaling: Scale up/down automatically based on demand; pay for what you use.
- High availability built-in: Multi-AZ replication, automatic failover, point-in-time recovery are standard.
- Cloud service integration: Native integration with IAM, monitoring, event streams, and other cloud platform services.
- Reduced operational burden: Allows small teams to run production databases without dedicated DBA staff.
Key weaknesses:
- Vendor lock-in: Deep integration with a specific cloud provider makes migration expensive.
- Less control: Can’t tune low-level database parameters; can’t access the underlying OS.
- Potential cost at scale: Managed services can become expensive at high scale compared to self-managed alternatives.
- Feature gaps: Managed versions may lag behind the open-source version in features.
Best-fit scenarios:
- Cloud-first organizations: Teams building new applications on a cloud platform with no existing on-premises database infrastructure.
- Small-to-medium teams: Where the reduced operational burden is valuable.
- Variable workloads: Where elastic scaling saves money and prevents over-provisioning.
- Rapid development: Where speed of provisioning and ease of operations matters more than cost or control.
Examples: Amazon Aurora, Amazon DynamoDB, Google Cloud Spanner, Azure Cosmos DB, PlanetScale, Neon.
8. Time-Series Databases
Core data model: Optimized for storing and querying time-indexed data — sequences of measurements recorded at timestamps. The primary access pattern is: write streams of (timestamp, metric, value) tuples, then query by time range with aggregations (average, min, max, rate of change) over the time dimension.
Key strengths:
- Write performance for time-ordered data: Extremely high ingest rates; designed for millions of data points per second.
- Automatic data lifecycle management: Built-in retention policies (automatically delete data older than N days), downsampling (replace raw data with aggregated buckets over time to save storage).
- Time-optimized compression: Time-series data is highly compressible because sequential readings have small deltas; TSDBs exploit this.
- Time-aware query functions: Built-in functions for time-window aggregations, rate calculations, moving averages, and forecasting that would be cumbersome in SQL.
- Efficient time-range scans: Storage layout makes time-range queries fast.
Key weaknesses:
- Single access dimension: Optimized for time-based queries; non-time-based lookups are often poor or unsupported.
- No relationships: Not designed for relational data; no joins.
- Limited use case: Highly specialized; not a general-purpose database.
- Data immutability assumption: Designed for append-only workloads; updates to historical data are awkward.
Best-fit scenarios:
- Application metrics and monitoring: CPU usage, memory usage, request latency — classic DevOps observability data.
- IoT sensor data: Temperature, pressure, GPS coordinates recorded continuously.
- Financial market data: Stock prices, order book states, trade ticks.
- Business KPI tracking: Revenue per hour, signups per day — any business metric over time.
- Log analytics (when time is the primary query dimension).
Examples: InfluxDB, TimescaleDB (PostgreSQL extension), Prometheus, OpenTSDB, QuestDB, Amazon Timestream.
Polyglot Persistence Trade-offs
Polyglot persistence is the practice of using different database types for different services, choosing the best fit for each service’s data model and access patterns rather than using one database type everywhere.
Operational Comparison Matrix
| Dimension | Relational | Key-Value | Document | Column Family | Graph | NewSQL | Cloud Native | Time-Series |
|---|---|---|---|---|---|---|---|---|
| Data Model | Tables/rows | Key → value blob | JSON documents | Row + column families | Nodes + edges | Tables/rows | Varies | Timestamps + metrics |
| Schema | Rigid (enforced) | None | Flexible (per-doc) | Flexible (per-row) | Flexible (per-node) | Rigid (enforced) | Varies | Semi-structured |
| ACID Transactions | Yes (single DB) | No | Limited | No (row-level only) | Yes (Neo4j) | Yes (distributed) | Varies | No |
| Query Flexibility | Very high (SQL) | None (key only) | Medium (doc queries) | Low (partition key) | High (graph traversal) | High (SQL) | Varies | Medium (time-range) |
| Write Throughput | Medium | Very high | High | Very high | Medium | High | High | Very high |
| Read Performance | Medium-High | Very high (key) | High (by ID) | High (column scan) | High (traversal) | High | High | High (time range) |
| Horizontal Scale | Hard | Easy | Easy | Easy | Hard | Built-in | Built-in | Medium |
| Relationship Support | Strong (FK/join) | None | Embedded only | None | First-class | Strong | Varies | None |
| Operational Complexity | Low-Medium | Low | Low | High | Medium | High | Low | Low-Medium |
| Consistency Model | Strong | Eventual/Strong | Eventual (usually) | Tunable | Strong | Strong | Varies | Eventual |
| Best Scale | Millions of rows | Billions of keys | Millions of docs | Billions of rows | Millions of nodes | Billions of rows | Elastic | Billions of points |
Polyglot Persistence Costs
The book is explicit: polyglot persistence is not free. The trade-off table:
| Benefit | Cost |
|---|---|
| Each service uses the best database for its access pattern | Each database type requires specialized operational expertise |
| Independent scaling per service | More infrastructure to provision, monitor, backup, and patch |
| No cross-service schema coupling | No cross-service transactions or joins at the database layer |
| Technology evolution per service | Inconsistent developer experience across services |
| Fault isolation per database | More failure modes to handle and monitor |
The key question: Is the gain from using the optimal database type for a service large enough to justify the operational complexity of running and maintaining another database technology?
Trade-off Summary
Splitting Data: The Core Trade-off
| If you split… | You gain… | You lose… |
|---|---|---|
| Tables into separate databases | Independent deployability, scalability, fault isolation, right-fit DB type | Cross-service joins, ACID transactions spanning services, referential integrity |
| A shared reference table | Clear ownership, ability to evolve independently | Other services must call an API or duplicate data for reads |
| Early (Step 1-2 only) | Logical clarity | Nothing yet — still on one DB server |
| Late (Step 5) | Full operational independence | Must solve distributed transaction problem |
Data Disintegrators vs. Integrators: When Does Each Win?
| Force | Weight | When it dominates |
|---|---|---|
| Change control (D) | High | Many teams touching one schema; frequent deployments blocked by schema coordination |
| Connection limits (D) | Medium | Scaling to hundreds of service instances |
| Scalability (D) | High | Different services have very different read/write profiles |
| Fault tolerance (D) | High | Different availability SLAs across services |
| Architecture quantum (D) | High | Aiming for true independent deployability |
| DB type optimization (D) | Medium | One service’s access pattern is a very poor fit for relational |
| Data relationships (I) | Very high | Many cross-domain foreign keys; complex joins are core to business logic |
| ACID transactions (I) | Very high | Business requires atomic multi-entity updates; saga complexity is unacceptable |
Rule of thumb from the book: If you have more active disintegrators than integrators, and the integrators are manageable (few cross-domain relationships, transactions can be eventually consistent), then split. If the integrators are strong (deep relational coupling, strict transaction requirements), keep together — or split very carefully with a plan for handling the lost guarantees.
Decision Framework
How to Choose a Database Type
Work through these questions in order:
1. What is the primary access pattern?
- Lookup by a single unique key → Key-Value
- Look up and query within self-contained hierarchical records → Document
- Query relationships between entities across many hops → Graph
- Time-ordered metrics and aggregations → Time-Series
- High-volume append-only writes, wide rows → Column Family
- Complex queries, strong consistency, ACID across entities → Relational or NewSQL
- Cloud-first, managed operations priority → Cloud Native
2. Do you need ACID transactions across multiple entities?
- Yes, within a single node → Relational
- Yes, across distributed nodes → NewSQL
- No → any type fits
3. How will this service scale?
- Vertically (scale up one server) → Relational is fine
- Horizontally (many nodes) → Key-Value, Document, Column Family, NewSQL, or Cloud Native
4. How complex is the data schema, and how often does it change?
- Stable schema, well-understood → Relational
- Highly variable attributes, schema evolves per-record → Document
5. What is the team’s operational capacity?
- Small team, minimal DBA capacity → Cloud Native or managed services
- Dedicated ops team → any type
6. What are the consistency requirements?
- Strict consistency, no stale reads → Relational, NewSQL, or configured Cloud Native
- Eventual consistency acceptable → Key-Value, Document, Column Family
Quick Reference: Primary Use Case per Database Type
| Database Type | Primary Use Case Signal |
|---|---|
| Relational | ”I need ACID and complex queries” |
| Key-Value | ”I need the fastest possible single-key lookup” |
| Document | ”My records have variable structure and nested data” |
| Column Family | ”I have massive write volume and time-series or wide-row data” |
| Graph | ”Relationships between entities ARE the data” |
| NewSQL | ”I need SQL + ACID but my RDBMS can’t scale horizontally” |
| Cloud Native | ”I want managed infrastructure with elastic scaling” |
| Time-Series | ”I’m collecting timestamped metrics at high volume” |
Sysops Squad Saga
The Sysops Squad case study in Chapter 6 follows the team as they decompose their Oracle monolith and select polyglot databases.
The Starting Point
All data lives in a single Oracle database accessed by a monolithic Java application. Tables include: tickets, customers, billing, contracts, survey, knowledge-base, login, notification, reporting, and lookup/reference tables.
Domain Identification
The team identifies five primary data domains:
- Ticket domain:
ticket,ticket_history,ticket_notes,sysops_expert,sysops_expert_profile - Customer domain:
customer,customer_notification,customer_survey - Billing domain:
billing,billing_payment,contract - Survey domain:
survey,survey_question,survey_answer - Reporting domain:
reporting(read-heavy aggregation)
Key Cross-Domain Dependencies Identified
tickettable has a FK tocustomer→ after split, must be handled via API or application-level referencebillingtable has a FK to bothticketandcustomer→ two cross-domain dependencies- The
reportingdomain reads from nearly all other tables → biggest integration challenge
Database Type Selections
| Service/Domain | Selected Database | Rationale |
|---|---|---|
| Ticket | Relational (PostgreSQL) | Complex queries, status tracking, ACID for ticket state transitions |
| Customer | Relational (PostgreSQL) | Core reference data, strong consistency needed |
| Billing | Relational (PostgreSQL) | Financial data — ACID is non-negotiable |
| Survey | Document (MongoDB) | Survey schemas vary; questions have variable structure; schema flexibility needed |
| Reporting | Column Family (Cassandra) | High read volume, pre-aggregated wide rows, no complex transactions needed |
| Knowledge Base | Document (MongoDB) | Articles with variable structure, rich text, metadata |
| Notification | Key-Value (Redis) | Fast lookups of notification state by customer ID; high throughput; TTL useful |
Key Lessons from the Saga
-
Not all domains need a new database type: Ticket, Customer, and Billing stay relational — the disintegrators for these domains don’t include “database type optimization.” Splitting the database server provides the change control, connection management, and fault tolerance benefits without changing the database technology.
-
Survey is the clearest polyglot win: Surveys have genuinely variable schema (different question types, answer formats) — document storage is a natural fit, not a forced one.
-
Reporting is the hardest: Reporting traditionally wants to read from all domains. The solution: reporting either reads from dedicated read replicas or uses an event-driven architecture where each domain publishes data to a reporting store (this connects to the analytical data discussion in ch14-managing-analytical-data).
-
The cross-domain FK problem: The
ticket → customerFK cannot be enforced by the database after splitting. The team must enforce this in the application layer — specifically, the ticket service must call the customer service API to validate customer existence before creating a ticket, or tolerate orphaned tickets with application-level cleanup.
Key Takeaways
-
Data decomposition is required for true service independence. Services that share a database are coupled at the data layer regardless of how clean their service interfaces are. Independent deployability requires independent data ownership.
-
Data disintegrators and integrators are the analytical framework. Before deciding to split data, enumerate which forces are active. Strong integrators (rich relational data, strict ACID requirements) may outweigh disintegrators and argue for keeping data together.
-
The 5-step process provides an incremental path. You don’t have to do it all at once. Steps 1-3 can be done while keeping a single physical database, which reduces risk significantly.
-
Step 3 (separate connections before moving data) is the most important incremental step. By enforcing domain-level access controls on a single DB server, you discover hidden cross-domain dependencies safely, while the application is still on one database.
-
Losing cross-database foreign keys is a real cost. Application-layer referential integrity is weaker, slower, and easier to bypass than database-enforced constraints. This is the most frequently underestimated cost of data decomposition.
-
ACID transactions across services require a new approach. Once data is split, any business operation that previously used a single transaction must be re-designed using sagas, eventual consistency, or 2PC — each with significant tradeoffs.
-
Polyglot persistence is an option, not a requirement. When you split a shared database, you gain the option to choose a different database type per service. The choice should be driven by access pattern fit, not novelty. Using the wrong database “because it’s cool” adds complexity without benefit.
-
Column family databases (Cassandra, HBase) require schema-around-query-patterns design. This is opposite to relational thinking (schema-around-data). Getting partition keys and clustering keys wrong leads to hot spots and full table scans.
-
Graph databases win big on relationship traversal, but don’t scale like key-value or column family stores. Only use them when relationships are the core data access pattern.
-
Time-series databases are highly specialized and worth it for their niche. If you’re storing metrics, IoT data, or financial ticks, a TSDB provides built-in retention, compression, and time-window functions that would be complex to replicate in a relational database.
Related Resources
- ch02-discerning-coupling — Architectural quantum definition; static vs. dynamic coupling
- ch04-architectural-decomposition — Service decomposition (the service-level counterpart to this chapter’s data decomposition)
- ch05-component-based-decomposition-patterns — Tactical component patterns for decomposition
- ch09-data-ownership-distributed-transactions — What to do after data is split: who owns what, how to handle distributed writes
- ch10-distributed-data-access — How services access data they don’t own
- ch12-transactional-sagas — The primary pattern for replacing ACID transactions across split data
- ch14-managing-analytical-data — The analytical data equivalent: data mesh, data products
Last Updated: 2026-05-30