Chapter 14: Larger Testing

seg testing larger-testing integration-testing e2e-testing fidelity

Status: Notes complete


Overview

Chapter 14 addresses the testing layer that exists above unit tests — the broader category of larger tests that exercise multiple components, real infrastructure, or production-like environments. The chapter begins by making the case that larger tests are necessary: unit tests, despite their speed and precision, have fundamental gaps in what they can verify. It then provides a comprehensive taxonomy of larger test types used at Google, each with its own purpose, structure, and cost profile.

The central tension the chapter navigates is fidelity vs. cost. Larger tests are valuable because they test the system more like real users and environments do — they have higher fidelity. But they are also slower, harder to maintain, less hermetic, and more likely to be flaky. The chapter provides frameworks for deciding when the fidelity gain is worth the cost, and how to structure larger tests to maximize their value while controlling their downsides.

The chapter also covers the organizational dimension: who owns larger tests, how they integrate with the developer workflow, and why larger tests require deliberate investment in infrastructure and process — they do not emerge naturally from the same test-first practices that produce unit tests.


Core Concepts

Larger test: Any test that exercises more than a single unit in isolation — including integration tests, end-to-end tests, browser tests, performance tests, and manual tests. The category is defined by having higher fidelity (more like production) and higher cost (slower, harder to maintain) than unit tests.

Fidelity: The degree to which a test environment and test scenario match the real production environment and real user behavior. A unit test with stubs has low fidelity; a full end-to-end test running against a production-like environment has high fidelity.

System under test (SUT): The portion of the production system that a larger test exercises. The size and scope of the SUT are key design decisions: too narrow and the test misses integration bugs; too broad and the test becomes slow, flaky, and hard to diagnose.

Hermetic test: A test that sets up its own isolated environment — starting services, seeding test data, and tearing down cleanly — so that tests are independent and repeatable. Hermeticity is the primary defense against test flakiness in larger tests.

Flakiness: The property of a test that passes and fails nondeterministically without code changes. Larger tests are more prone to flakiness due to timing dependencies, shared state, network variability, and external service behavior. Flaky tests erode confidence in the entire test suite.

Test data: Data used to exercise the system under test. In larger tests, test data must be carefully managed — either created programmatically at test setup or loaded from a controlled fixture — to ensure tests are deterministic and isolated.


What Larger Tests Are and Why They Exist

The Gaps Unit Tests Cannot Fill

Unit tests are fast, precise, and well-understood. But they cannot verify several critical properties:

  1. Incorrect assumptions about dependencies: A unit test with stubs verifies that the code behaves correctly given the assumptions encoded in the stubs. If those assumptions are wrong — if the real dependency behaves differently — the unit test passes but the real system fails.

  2. Configuration and deployment correctness: Unit tests do not verify that services are correctly configured for deployment — that environment variables are set, that network routing is correct, that service accounts have necessary permissions.

  3. Emergent behavior from component interaction: Some bugs only manifest when components interact — timing issues, race conditions, unexpected state sharing, or mismatched API contracts between services.

  4. Integration with external dependencies: Third-party APIs, databases, and infrastructure components may behave differently than internal stubs assume.

  5. Real user workflows across multiple services: A user action that triggers a cascade of service calls can only be verified end-to-end by a test that exercises the full cascade.
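
The first gap above can be illustrated with a small sketch. The names here (StubPriceService, RealPriceService, order_total_dollars) are hypothetical and do not come from the book, and the "real" service is simulated in-process. The only point is to show how a stub can encode a wrong assumption, in this case about units.

```python
class StubPriceService:
    """Test double written by the caller's team; assumes prices are in dollars."""

    def get_price(self, sku: str) -> float:
        return 20.0  # assumption baked into the stub: dollars


class RealPriceService:
    """Stand-in for the real dependency, which (in this sketch) returns cents."""

    def get_price(self, sku: str) -> int:
        return 2000  # 20.00 dollars, expressed in cents


def order_total_dollars(price_service, sku: str, quantity: int) -> float:
    """Code under test: treats whatever the service returns as dollars."""
    return price_service.get_price(sku) * quantity


# Unit test with the stub: passes, because the code and the stub share the
# same (wrong) assumption about units.
assert order_total_dollars(StubPriceService(), "sku-1", 2) == 40.0

# The same call against the "real" service is off by a factor of 100.
# Only a test that uses the real implementation would surface this.
integration_total = order_total_dollars(RealPriceService(), "sku-1", 2)
print(integration_total)  # 4000, not the expected 40.0
```

The unit test is not wrong about the code it checks; it is wrong about the world the code runs in, which is the kind of defect a larger test is meant to catch.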

Fidelity: The Core Trade-off

                   E2E tests          Integration tests    Unit tests
Fidelity           High               Medium               Low
Speed              Slow               Moderate             Fast
Hermeticity        Low (flaky)        Somewhat hermetic    Hermetic
Maintenance        Hard               Moderate             Easy

The key insight: unit tests and larger tests are complements, not substitutes. Unit tests verify individual behaviors quickly and precisely. Larger tests verify that those behaviors compose correctly in a production-like environment. A test suite without larger tests gives little confidence that components integrate correctly; a test suite that relies entirely on larger tests is slow and unmaintainable.

Why Not Just Have Unit Tests?

The book addresses this directly: unit tests with mocked/stubbed dependencies verify a model of the system, not the system itself. As the gap between the model (stubs) and the reality (real implementations) grows, the test suite provides less and less useful signal about whether the real system works. Larger tests close this gap.

Additionally, some product requirements cannot be expressed as unit tests:

  • “The checkout flow completes successfully” requires real browser rendering
  • “The system handles 10,000 concurrent users without exceeding 500ms p99 latency” requires real load
  • “Deploying version X does not break existing clients” requires testing against real traffic

Larger Tests at Google

Scale and Investment

Google’s test infrastructure reflects the scale of its engineering organization:

  • Automated continuous integration runs thousands of tests per commit across millions of lines of code
  • Test infrastructure teams maintain specialized tooling for browser testing, load testing, and hermetic multi-service test environments
  • Test certification programs encourage engineers to maintain larger tests as first-class artifacts alongside code

At Google’s scale, the investment in larger test infrastructure is justified by the volume of defects that only larger tests can catch, and the cost of those defects reaching production (which serves billions of users).

The Time-Cost Relationship

Larger tests cost more in multiple dimensions:

  • Authoring: Harder to write and set up (requires environment configuration, test data management, multiple services)
  • Execution time: Slower by orders of magnitude (seconds to minutes vs. milliseconds for unit tests)
  • Maintenance: More surface area for breakage (any change in any component in the SUT can break the test)
  • Flakiness: More sources of nondeterminism

Google’s guidance: the investment is worthwhile for behaviors that unit tests cannot verify — integration correctness, configuration, emergent interaction, and production-realistic scenarios.


Structure of a Large Test

Every larger test has three structural components:

1. System Under Test (SUT)

The SUT is the set of services, components, and infrastructure that the test exercises. Key design decisions:

  • Scope: How many services does the SUT include? A narrower SUT (just two services) is faster and easier to maintain but may miss bugs at the boundaries of the excluded services. A broader SUT has higher fidelity but is slower and harder to diagnose.
  • Hermetic vs. shared: Is the SUT an isolated environment created for this test run, or is it a shared environment (staging, pre-production)? Hermetic is preferred for repeatability; shared environments may have higher fidelity.
  • Real vs. substituted components: Which components are real implementations and which are fakes or test doubles? Decisions here directly affect fidelity.

2. Test Data

Test data in larger tests must be:

  • Seeded deterministically: Tests should not depend on pre-existing state; they should create the data they need
  • Isolated: One test’s data should not affect another’s
  • Representative: Data should reflect realistic scenarios, not just trivial cases

Data management strategies:

  • Programmatic seeding: Test setup creates data via the system’s APIs (highest fidelity, tests the data creation path)
  • Fixtures: Pre-defined data sets loaded at environment setup time
  • Production-derived data: Anonymized or sampled production data used in test environments
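
A minimal sketch of deterministic, isolated seeding, assuming an in-memory SQLite database stands in for the system's real data-creation API. The table layout and the helper names (make_isolated_db, seed_user, count_users) are hypothetical and chosen only for illustration.

```python
import sqlite3
import uuid


def make_isolated_db() -> sqlite3.Connection:
    """Create a fresh in-memory database so no state leaks between tests."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    return conn


def seed_user(conn: sqlite3.Connection, name: str) -> str:
    """Programmatic seeding: the test creates exactly the data it needs."""
    user_id = f"test-{uuid.uuid4()}"  # unique id avoids collisions between tests
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
    conn.commit()
    return user_id


def count_users(conn: sqlite3.Connection) -> int:
    """Return the number of rows currently in the users table."""
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# Two "tests", each with its own database: neither sees the other's rows.
db_a = make_isolated_db()
db_b = make_isolated_db()
seed_user(db_a, "alice")
seed_user(db_a, "bob")
seed_user(db_b, "carol")
print(count_users(db_a), count_users(db_b))  # 2 1
```

In a real larger test the seeding call would more likely go through the service's own API, which also exercises the data-creation path.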

3. Verification

Larger test verification is more complex than unit test assertions:

  • Output verification: Check the response to an API call or the state of a database after operations
  • Behavioral verification: Check that workflows produce the correct outcomes end-to-end
  • Absence verification: Check that undesirable side effects did not occur (e.g., no duplicate charges)
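
Output and absence verification can sit side by side in the same test. The sketch below assumes a hypothetical in-memory ledger (FakePaymentLedger) and a checkout function that retries its payment call once; neither comes from the book.

```python
from dataclasses import dataclass, field


@dataclass
class FakePaymentLedger:
    """Hypothetical in-memory ledger standing in for a payments backend."""

    charges: list = field(default_factory=list)

    def charge(self, order_id: str, amount_cents: int, idempotency_key: str) -> None:
        # Ignore a retry that reuses an idempotency key already recorded.
        if any(c["key"] == idempotency_key for c in self.charges):
            return
        self.charges.append(
            {"order": order_id, "amount": amount_cents, "key": idempotency_key}
        )


def checkout(ledger: FakePaymentLedger, order_id: str, amount_cents: int) -> None:
    """Simulates a checkout whose payment call is retried once."""
    key = f"checkout-{order_id}"
    ledger.charge(order_id, amount_cents, key)
    ledger.charge(order_id, amount_cents, key)  # simulated retry after a timeout


ledger = FakePaymentLedger()
checkout(ledger, "order-42", 1999)
charges_for_order = [c for c in ledger.charges if c["order"] == "order-42"]

# Output verification: the expected charge exists with the right amount.
assert charges_for_order[0]["amount"] == 1999
# Absence verification: the retry did not create a second charge.
assert len(charges_for_order) == 1
```

The second assertion is the one that is easy to forget: it checks for something that should not have happened.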

Types of Larger Tests

Functional Testing of Interacting Binaries

Tests that launch multiple real services and exercise interactions between them. This is the most common form of integration testing at Google.

Purpose: Verify that service A and service B interact correctly — that A’s requests are correctly formed, that B processes them correctly, and that the integrated behavior matches the specification.

Structure: A hermetic test environment launches all services in the SUT, seeds test data, makes API calls as a client would, and verifies results.

When to use: For any integration boundary that is not covered by contract tests or unit tests with fakes. Critical integration paths (authentication, payments, data persistence) warrant dedicated integration tests.
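
The structure described above (launch, seed, call as a client, verify, tear down) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not Google's tooling: the inventory service, its JSON shape, and the names hermetic_inventory_service and can_fulfil are all hypothetical.

```python
import json
import threading
import urllib.request
from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, HTTPServer

# Opener that ignores any proxy settings, so requests go straight to localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def make_handler(inventory: dict):
    """Build a request handler class bound to this test's seeded inventory."""

    class InventoryHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            sku = self.path.strip("/")
            body = json.dumps({"sku": sku, "in_stock": inventory.get(sku, 0)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep test output quiet

    return InventoryHandler


@contextmanager
def hermetic_inventory_service(seed_data: dict):
    """Start "service B" on a free local port with its own data, then tear it down."""
    server = HTTPServer(("127.0.0.1", 0), make_handler(dict(seed_data)))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_port}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


def can_fulfil(base_url: str, sku: str, quantity: int) -> bool:
    """"Service A": a client that calls the inventory service over real HTTP."""
    with _opener.open(f"{base_url}/{sku}", timeout=5) as resp:
        return json.load(resp)["in_stock"] >= quantity


with hermetic_inventory_service({"widget": 3}) as url:
    results = (can_fulfil(url, "widget", 2), can_fulfil(url, "widget", 5))
print(results)  # (True, False)
```

Binding to port 0 lets the operating system pick a free port, so several such tests can run in parallel without colliding.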

Browser and Device Testing

Tests that exercise the full stack from the browser or device client through backend services.

Purpose: Verify that the UI renders correctly, that user interactions produce correct outcomes, and that the full client-server integration works.

Tools at Google: WebDriver, Selenium-based frameworks, device test farms for mobile.

When to use: For critical user workflows (checkout, login, signup, core product journeys) and for visual regression (detecting unintended UI changes).

Challenges: Browser tests are the most expensive to maintain because the UI changes frequently, timing is hard to get right, and cross-browser compatibility adds surface area.

Performance, Load, and Stress Testing

Tests that measure the system’s behavior under varying levels of traffic.

  • Performance tests: Measure latency, throughput, and resource consumption under a defined load level. Establish baselines and catch regressions.
  • Load tests: Ramp traffic to production-like or above-production levels to verify the system sustains performance under expected load.
  • Stress tests: Push beyond expected load to find breaking points, verify degradation is graceful, and understand capacity limits.

Purpose: Verify that performance SLOs (Service Level Objectives) are met and identify bottlenecks before they affect users.

When to run: Before major releases, after architectural changes to critical paths, and on a regular cadence as the system evolves.
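
Checking a latency SLO comes down to computing a percentile over collected samples and comparing it with a threshold. The sketch below uses the nearest-rank definition of a percentile, which is one of several in common use. The sample data is synthetic, and the 500 ms threshold simply mirrors the example requirement quoted earlier in these notes.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(samples_ms: list, pct: float, threshold_ms: float) -> bool:
    """True if the given percentile of the samples is within the threshold."""
    return percentile(samples_ms, pct) <= threshold_ms


# 100 synthetic samples: 98 fast requests and 2 slow ones.
latencies_ms = [120.0] * 98 + [450.0, 900.0]
p99 = percentile(latencies_ms, 99)
print(p99, meets_latency_slo(latencies_ms, 99, 500.0))  # 450.0 True
```

Note that the single 900 ms outlier does not affect p99 here; a p99 target deliberately tolerates the slowest 1% of requests.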

Deployment Configuration Testing

Tests that verify the deployment configuration itself — not the code, but the infrastructure, environment variables, IAM permissions, network rules, and service configurations that surround the code.

Purpose: Catch configuration errors before they reach production. A service that is correctly coded but incorrectly configured (wrong database endpoint, missing environment variable, insufficient permissions) will fail in production.

Examples:

  • Verify that a deployed service can reach its dependencies
  • Verify that environment variables are set correctly
  • Verify that a service account has the permissions it needs
  • Verify that health checks pass post-deployment

Why they matter: A significant proportion of production incidents are caused by configuration errors, not code bugs. Configuration testing is a high-leverage investment.
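
One of the simplest configuration checks is confirming that required settings are present and non-empty. The variable names below are hypothetical. In a real post-deployment check the mapping would typically be the environment of the deployed process (for example os.environ) rather than a literal dictionary.

```python
from typing import Mapping, Sequence


def missing_settings(env: Mapping[str, str], required: Sequence[str]) -> list:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


# Hypothetical settings a service might need in order to start correctly.
REQUIRED = ["DATABASE_URL", "SERVICE_ACCOUNT", "LISTEN_PORT"]

# Simulated environment: one value is blank and one is missing entirely.
deployed_env = {
    "DATABASE_URL": "postgres://db.internal/app",
    "SERVICE_ACCOUNT": "",
}
print(missing_settings(deployed_env, REQUIRED))  # ['SERVICE_ACCOUNT', 'LISTEN_PORT']
```

A check like this only proves the settings exist. Verifying that the database endpoint is reachable or that the service account has the right permissions needs a test that actually runs against the deployed environment.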

Exploratory Testing

Manual, unscripted testing where a skilled tester uses their judgment and creativity to find bugs that automated tests do not catch.

Purpose: Discover unexpected behaviors, edge cases, usability issues, and bugs in code paths that automated test suites do not exercise. It draws on human judgment in ways automated tests cannot.

When to use: Before major feature launches, when entering new product territory with limited automated coverage, and as a periodic health check on critical systems.

Limitations: Not repeatable, cannot scale with the rate of code change, and can miss bugs that depend on specific state the tester never reaches. Exploratory testing complements automated testing; it does not replace it.

A/B Diff Regression Testing

Tests that compare the output of two versions of a system (version A and version B) against the same inputs to detect unexpected behavioral changes.

Purpose: Catch regressions in behavior when neither version’s output is known to be definitively “correct” — particularly useful for complex outputs like rendered HTML, ML model outputs, or complex data transformations.

How it works:

  1. Run the same requests against both the old version (A) and the new version (B)
  2. Compare outputs — differences are flagged as potential regressions
  3. Engineers review flagged differences to classify them as intentional changes vs. regressions

When to use: For systems where the output is complex and hard to specify precisely in advance (e.g., search ranking, recommendation systems, rendering pipelines). Also useful when refactoring to verify that behavior is preserved.

Limitation: A/B diff catches changes, not bugs — if both versions are wrong in the same way, the diff finds nothing.
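
The three steps above can be sketched in a few lines. The two render functions are hypothetical stand-ins for an old and a new version of a system; in practice each would be a running service receiving the same replayed requests.

```python
def render_v1(name: str) -> str:
    """Old version (A)."""
    return f"<h1>Hello, {name}</h1>"


def render_v2(name: str) -> str:
    """New version (B): a hypothetical refactor that changes output for empty names."""
    return f"<h1>Hello, {name or 'guest'}</h1>"


def ab_diff(version_a, version_b, inputs):
    """Run both versions on the same inputs and collect the cases that differ."""
    diffs = []
    for item in inputs:
        out_a, out_b = version_a(item), version_b(item)
        if out_a != out_b:
            diffs.append({"input": item, "a": out_a, "b": out_b})
    return diffs


diffs = ab_diff(render_v1, render_v2, ["Ada", "", "Grace"])
for d in diffs:
    print(d)  # only the empty-name input is flagged
```

The diff does not say whether the change for empty names is a regression or an intended improvement; that classification is the human review in step 3.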

User Acceptance Testing (UAT)

Testing performed by real users or user representatives to verify that a system meets their needs and expectations before release.

Purpose: Validate that the product satisfies user requirements in real-world usage, catching usability issues and incorrect assumptions about user behavior that automated tests cannot detect.

Forms at Google:

  • Dogfooding: Google employees use products before they are released externally, providing feedback based on real usage
  • Trusted tester programs: A cohort of external users gets early access in exchange for structured feedback
  • Beta releases: Staged rollouts to subsets of users before full release

Probers and Canary Analysis

Probers are continuously running tests that exercise production systems using synthetic (test) traffic, verifying that the system is healthy from a user’s perspective.

  • Purpose: Detect when a service goes down or degrades before real users are affected
  • Frequency: Run continuously (every minute or more frequently)
  • What they test: Critical user journeys, API endpoints, key integrations
  • Alerting: Failures trigger immediate alerts to on-call engineers
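
A prober loop might look roughly like the sketch below. The function name, the consecutive-failure threshold, and the alert callback are assumptions made for illustration; a production prober would issue real synthetic requests and page an on-call rotation rather than append to a list.

```python
import time
from typing import Callable


def run_prober(
    probe: Callable[[], bool],
    alert: Callable[[str], None],
    iterations: int,
    interval_s: float = 60.0,
    failure_threshold: int = 2,
) -> int:
    """Run `probe` repeatedly and alert once each time consecutive failures
    reach `failure_threshold`. Returns the number of alerts fired."""
    consecutive_failures = 0
    alerts = 0
    for _ in range(iterations):
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a crashing probe counts as a failed check
        if healthy:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == failure_threshold:
                alert(f"probe failed {consecutive_failures} times in a row")
                alerts += 1
        time.sleep(interval_s)
    return alerts


# Scripted probe results stand in for real synthetic requests.
scripted = iter([True, False, False, False, True, False])
alert_log = []
alerts_fired = run_prober(
    lambda: next(scripted), alert_log.append, iterations=6, interval_s=0.0
)
print(alerts_fired, alert_log)  # 1 ['probe failed 2 times in a row']
```

Requiring two consecutive failures before alerting is one way to keep a single transient blip from paging someone; the right threshold depends on how quickly degradation must be noticed.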

Canary analysis is the practice of deploying a new version to a small subset of production traffic before rolling out fully, and comparing the behavior/metrics of the new version against the existing version.

  • Purpose: Catch bugs in production that were not caught in pre-production testing, before they affect all users
  • Metrics monitored: Error rates, latency, business metrics (e.g., conversion rate, crash rate)
  • Decision: If the canary shows degradation, the rollout is halted and the new version is rolled back
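
The canary decision can be sketched as a comparison of error rates between the existing version and the canary. The thresholds and the three verdict names below are illustrative assumptions, not values from the book, and real canary analysis usually looks at several metrics with statistical tests rather than a single fixed margin.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Metrics,
    canary: Metrics,
    max_abs_increase: float = 0.005,
    min_requests: int = 1000,
) -> str:
    """Return "wait", "rollback", or "proceed" for a canary deployment."""
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    if canary.error_rate - baseline.error_rate > max_abs_increase:
        return "rollback"  # canary is clearly worse than the existing version
    return "proceed"


baseline = Metrics(requests=100_000, errors=200)  # 0.2% errors
print(canary_verdict(baseline, Metrics(requests=2_000, errors=30)))  # rollback
print(canary_verdict(baseline, Metrics(requests=2_000, errors=5)))   # proceed
print(canary_verdict(baseline, Metrics(requests=500, errors=50)))    # wait
```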

Together, probers and canary analysis form the production testing layer: canary analysis is the last line of defense during rollout, and probers keep watching after release.

Disaster Recovery and Chaos Engineering

Disaster recovery (DR) testing verifies that the system can recover from failures — that backups work, that failover is correct, that data integrity is maintained.

Chaos engineering (popularized by Netflix’s “Chaos Monkey”) proactively injects failures into production or production-like systems to verify that the system degrades gracefully and recovers correctly.

Google’s approach:

  • Game days: planned exercises where teams simulate failures (datacenter outage, critical service failure) and verify response procedures
  • DiRT (Disaster Recovery Testing): Google’s internal program for testing recovery from catastrophic events
  • Controlled failure injection: introducing latency, dropping requests, killing processes in test environments

Purpose: Move from “we hope the system survives failures” to “we have verified it does.” Chaos engineering reveals assumptions about resilience that were never validated.
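
Controlled failure injection can be tried at small scale before it is attempted on real infrastructure. In this sketch a wrapper (FlakyDependency, a hypothetical name) makes a dependency fail a fixed number of times, and the test checks that the caller either recovers through retries or degrades to a fallback. Both the retry policy and the fallback value are assumptions for illustration.

```python
class FlakyDependency:
    """Wraps a callable and injects a failure for the first `failures` calls."""

    def __init__(self, func, failures: int):
        self.func = func
        self.remaining_failures = failures
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def fetch_with_retry(dependency, key: str, max_attempts: int = 3, fallback=None):
    """Code under test: retry on connection errors, then degrade to a fallback."""
    for _ in range(max_attempts):
        try:
            return dependency(key)
        except ConnectionError:
            continue
    return fallback


lookup = {"user-1": "Ada"}.get

recovers = FlakyDependency(lookup, failures=2)
print(fetch_with_retry(recovers, "user-1"))  # Ada (succeeds on the third attempt)

stays_down = FlakyDependency(lookup, failures=10)
print(fetch_with_retry(stays_down, "user-1", fallback="unknown"))  # unknown
```

Making the injected failures deterministic (a fixed count rather than a random rate) keeps the resilience test itself from becoming flaky.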

User Evaluation

For products where quality is subjective or hard to measure objectively (e.g., search relevance, ML model quality, recommendation quality), user evaluation uses human raters to assess quality.

Purpose: Measure product quality along dimensions that automated metrics cannot capture — relevance, naturalness, helpfulness, user satisfaction.

Forms:

  • Side-by-side evaluation: Raters compare output A vs. output B and indicate preference
  • Absolute rating: Raters rate output quality on a defined scale
  • User studies: Structured sessions with representative users performing real tasks

When to use: For any ML or AI-powered product, for search and ranking systems, and for products where user perception is the primary quality dimension.
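
For side-by-side evaluation, the raw result is a list of rater preferences that has to be summarized somehow. The sketch below shows one simple way to do that, using made-up votes. The "A"/"B"/"tie" labels and the choice to exclude ties from the win rate are assumptions, not a method described in the book.

```python
from collections import Counter


def side_by_side_summary(votes):
    """Summarize rater preferences between output A and output B."""
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]  # ties are excluded from the win rate
    return {
        "A": counts["A"],
        "B": counts["B"],
        "tie": counts["tie"],
        "b_win_rate": counts["B"] / decided if decided else None,
    }


# Made-up votes from eight raters.
votes = ["B", "A", "B", "tie", "B", "A", "B", "tie"]
summary = side_by_side_summary(votes)
print(summary["A"], summary["B"], summary["tie"])  # 2 4 2
```

With so few raters a win rate says little on its own; real evaluations use enough ratings to tell a genuine preference from noise.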


Large Tests and the Developer Workflow

Authoring Larger Tests

Larger tests are harder to write than unit tests because:

  • They require setting up and tearing down multi-component environments
  • They require managing test data across multiple services
  • They are more sensitive to timing and ordering issues
  • Test failures are harder to diagnose (which component failed? why?)

Best practices for authoring:

  • Define the SUT as narrowly as possible while still covering the integration being tested
  • Make test data setup explicit and programmatic (not relying on pre-existing state)
  • Write tests that assert on observable outputs, not implementation details
  • Make tests self-documenting — the test should explain the scenario it exercises and why

Running Larger Tests

Larger tests are typically run:

  • In continuous integration, but on a slower cadence than unit tests (hourly or daily rather than per-commit)
  • Before release as a gate
  • On demand by engineers investigating specific integration concerns

The cost of slow tests: If larger tests take too long, engineers stop running them locally and wait for CI results, extending the feedback loop. Investment in test speed (hermetic environments, parallelism, efficient test data setup) is worthwhile.

Owning Larger Tests

Ownership of larger tests is a critical organizational decision. Tests that are not owned by a specific team tend to:

  • Not be maintained when they fail
  • Be disabled rather than fixed when they become flaky
  • Fall out of sync with the features they test

Google’s model: Tests should be owned by the teams responsible for the features they test. Larger tests that span multiple services should have a designated owner — typically the team that owns the primary service or the integration.

Test flakiness as a signal: A flaky larger test is a signal that either the test is poorly designed (timing-dependent, shared state) or the system has reliability issues. Flaky tests should be fixed or removed — not left in a “known flaky” state — because they erode confidence in the entire test suite.
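
One common way to tell a flaky test from a genuinely failing one is to rerun it several times without changing the code. The sketch below is a minimal version of that idea; the function names are hypothetical, and the "flaky" check is made to fail on every third call through a shared counter so that the example is reproducible.

```python
import itertools


def classify_by_rerun(test_fn, runs: int = 10) -> str:
    """Run a test function repeatedly and classify it by its pass rate."""
    passes = 0
    for _ in range(runs):
        try:
            test_fn()
            passes += 1
        except AssertionError:
            pass
    if passes == runs:
        return "stable-pass"
    if passes == 0:
        return "stable-fail"
    return f"flaky ({passes}/{runs} passed)"


# Shared state between runs: the counter makes every third call fail.
_shared_counter = itertools.count(1)


def check_with_shared_state():
    assert next(_shared_counter) % 3 != 0


def check_always_passes():
    assert 1 + 1 == 2


stable_result = classify_by_rerun(check_always_passes)
flaky_result = classify_by_rerun(check_with_shared_state)
print(stable_result)  # stable-pass
print(flaky_result)   # flaky (7/10 passed)
```

Reruns only detect flakiness; they do not fix it. Here the root cause is the shared counter, which is the kind of shared state a hermetic setup removes.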


TL;DRs

  • Larger tests bridge the gap between unit tests and production — they provide higher fidelity by testing real integrations and production-like environments.
  • Unit tests cannot verify everything: configuration correctness, emergent behavior from component interaction, and the validity of stub assumptions all require larger tests.
  • The structure of a larger test consists of the system under test (SUT), test data, and verification.
  • The key trade-off in larger testing is fidelity vs. cost — higher fidelity comes with slower execution, more maintenance, and more flakiness.
  • A broad spectrum of larger test types exists: functional integration tests, browser/device tests, performance tests, deployment configuration tests, exploratory testing, A/B diff tests, UAT, probers, canary analysis, chaos engineering, and user evaluation.
  • Probers run continuously in production to detect service degradation before real users are affected; canary analysis catches production bugs during staged rollout.
  • Chaos engineering moves from hoping the system is resilient to verifying that it is through controlled failure injection.
  • Larger tests must be owned — unowned tests become flaky, go unrepaired, and lose their value.
  • Flaky tests erode confidence in the entire test suite and should be fixed or removed, not left in a known-flaky state.
  • At Google’s scale, larger tests are only feasible with investment in test infrastructure (hermetic environments, parallelism, test data management).

Key Takeaways

  1. Larger tests are necessary complements to unit tests — they verify properties that unit tests structurally cannot: configuration correctness, real integration behavior, and emergent interactions between components.
  2. Fidelity vs. cost is the central trade-off — higher-fidelity tests are more expensive to write, slower to run, and more prone to flakiness; the investment is justified only for behaviors that unit tests cannot verify.
  3. The SUT scope is a key design decision — a narrower SUT is faster and more maintainable; a broader SUT catches more integration bugs; the right scope depends on what integration boundary is being verified.
  4. Hermeticity is the primary defense against flakiness — larger tests that set up and tear down their own isolated environments are repeatable and independent; tests that rely on shared state are fragile.
  5. Probers and canary analysis form the production testing layer — probers detect degradation continuously in production; canary analysis catches bugs before they affect all users during rollout.
  6. Chaos engineering moves resilience from assumption to verified fact — deliberately injecting failures reveals which resilience assumptions were never validated; Google’s DiRT program institutionalizes this practice.
  7. Test data management is a first-class concern — larger tests must create the data they need programmatically, isolate it from other tests, and make it representative of realistic scenarios.
  8. Unowned tests become flaky and lose value — larger tests must have designated owners who maintain them; tests that span services need explicit ownership assignments.
  9. A/B diff testing is powerful for complex or subjective outputs — when the correct output is hard to specify precisely, comparing old and new versions against the same inputs catches regressions that assertion-based tests cannot.
  10. Larger tests require organizational investment, not just technical investment — infrastructure teams, test ownership policies, and developer workflow integration are as important as the tests themselves.

Last Updated: 2026-06-02