Chapter 14 Flashcards — Larger Testing
flashcards seg testing larger-testing integration e2e
What is a larger test?
?
Any test that exercises more than a single unit in isolation — including integration tests, end-to-end tests, browser/device tests, performance tests, and manual tests. Larger tests are defined by having higher fidelity (more closely matching production) and higher cost (slower execution, harder to maintain, more prone to flakiness) than unit tests.
What is fidelity in the context of testing?
?
Fidelity is the degree to which a test environment and test scenario match the real production environment and real user behavior. A unit test with stubs has low fidelity (it tests a model of the system, not the system itself). A full end-to-end test running against a production-like environment has high fidelity. The central trade-off in larger testing is fidelity vs. cost.
What are the five main gaps that unit tests cannot fill?
?
- Incorrect stub assumptions: If real dependencies behave differently than stubs assume, unit tests pass but real systems fail.
- Configuration and deployment correctness: Unit tests don’t verify environment variables, network routing, or service account permissions.
- Emergent behavior from component interaction: Race conditions, timing issues, and mismatched API contracts only manifest when components interact.
- Integration with external dependencies: Third-party APIs and infrastructure may behave differently than internal stubs assume.
- Real user workflows across multiple services: End-to-end user journeys require exercising the full service cascade.
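The first gap (incorrect stub assumptions) can be illustrated with a minimal Python sketch. Every name here is hypothetical: `StubInventory` encodes the assumption that stock counts are integers, while `RealInventory` stands in for a real dependency that returns them as strings.

```python
class StubInventory:
    """Test double: assumes the service returns stock counts as integers."""
    def stock(self, sku):
        return 3


class RealInventory:
    """Stand-in for the real dependency, which returns counts as strings."""
    def stock(self, sku):
        return "3"


def can_fulfil(inventory, sku, quantity):
    # Code under test: silently relies on stock() returning an int.
    return inventory.stock(sku) >= quantity


def unit_test_passes():
    # With the stub, the comparison works and the unit test is green.
    return can_fulfil(StubInventory(), "sku-1", 2) is True


def integration_check_fails():
    # With the real behavior, comparing str and int raises TypeError.
    try:
        can_fulfil(RealInventory(), "sku-1", 2)
    except TypeError:
        return True
    return False
```

The unit test stays green because it only checks the stub's model of the dependency; only a test that includes the real dependency exposes the mismatch.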
What are the three structural components of every larger test?
?
- System under test (SUT): The set of services, components, and infrastructure the test exercises — the scope and hermeticity of the SUT are key design decisions.
- Test data: The data used to exercise the SUT — must be deterministically seeded, isolated across tests, and representative of realistic scenarios.
- Verification: Assertions on observable outputs, behavioral outcomes, or the absence of undesirable side effects.
What is a hermetic test?
?
A test that sets up its own isolated environment — starting services, seeding test data, and tearing down cleanly — so that tests are independent and repeatable. Hermeticity is the primary defense against flakiness (nondeterministic test results) in larger tests. A hermetic test produces the same result every time it runs, regardless of what other tests ran before it or what state was left behind.
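A minimal sketch of the idea in Python, assuming an in-memory SQLite database as a stand-in for the system under test. The table, seed rows, and function names are hypothetical; a real larger test would start actual services instead.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def hermetic_db():
    """Create a private database, seed it deterministically, then tear it down."""
    conn = sqlite3.connect(":memory:")  # isolated: no other test can see it
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)]
        )
        conn.commit()
        yield conn
    finally:
        conn.close()  # teardown runs even if the test body raises


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def test_add_user():
    with hermetic_db() as conn:
        conn.execute("INSERT INTO users (name) VALUES (?)", ("linus",))
        return count_users(conn)


def test_list_users():
    with hermetic_db() as conn:
        return count_users(conn)
```

Because each test builds and discards its own environment, `test_list_users` sees two rows whether or not `test_add_user` ran first.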
What is test flakiness, and why does it matter?
?
Flakiness is the property of a test that passes and fails nondeterministically without code changes. Larger tests are more prone to flakiness due to timing dependencies, shared state between tests, network variability, and external service behavior. Flaky tests are harmful because they erode confidence in the entire test suite — engineers learn to ignore test failures, meaning real failures also get ignored.
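One way to make the definition concrete is to rerun a test many times with no code changes and see whether the outcome varies. The sketch below is illustrative only; `classify_stability` and the leaky example test are hypothetical names, and the shared dictionary simulates state left behind between runs.

```python
def classify_stability(test, runs=10):
    """Run the same test repeatedly, unchanged, and classify the outcomes."""
    passes = failures = 0
    for _ in range(runs):
        try:
            test()
            passes += 1
        except AssertionError:
            failures += 1
    if failures == 0:
        return "stable pass"
    if passes == 0:
        return "stable fail"
    return "flaky"


def make_state_leaking_test():
    """Build a test that passes only on odd-numbered runs because of leaked state."""
    shared = {"calls": 0}  # simulated state that survives between runs

    def test():
        shared["calls"] += 1
        assert shared["calls"] % 2 == 1, "stale state left by the previous run"

    return test
```

A test classified as "flaky" here fails without any code change, which is exactly the signal that teaches engineers to ignore failures.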
What is the purpose of functional testing of interacting binaries?
?
To verify that two or more real services interact correctly — that service A’s requests are correctly formed, that service B processes them correctly, and that the integrated behavior matches the specification. This is the most common form of integration testing at Google. The test launches real services in a hermetic environment, seeds test data, makes API calls as a client would, and verifies results.
What is the purpose of browser and device testing?
?
To verify that the UI renders correctly and that user interactions produce correct outcomes end-to-end through the full client-server stack. Browser tests exercise the complete system from client rendering through backend services. They are the most expensive to maintain — UI changes frequently, timing is non-trivial, and cross-browser compatibility adds surface area. Used for critical user workflows (login, checkout, core product journeys).
What is the difference between performance testing, load testing, and stress testing?
?
- Performance testing: Measures latency, throughput, and resource consumption under a defined load level — establishes baselines and catches regressions.
- Load testing: Ramps traffic to production-like levels, including expected peaks, to verify the system sustains performance under expected load.
- Stress testing: Pushes beyond expected load to find breaking points, verify graceful degradation, and understand capacity limits.
All three evaluate behavior against performance SLOs; they differ in the load level applied and in the goal.
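A performance check of the first kind can be sketched as: time an operation repeatedly, take a percentile, and compare it with a latency budget. The nearest-rank percentile, the function names, and the budget values are assumptions for illustration, not a prescribed method.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(operation, runs):
    """Time `operation` `runs` times and return the latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - start)
    return latencies


def meets_budget(latencies, pct, budget_seconds):
    """True if the chosen percentile is within the latency budget."""
    return percentile(latencies, pct) <= budget_seconds
```

Load and stress tests would reuse the same measurement while raising the request rate; only the load level and the pass criterion change.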
What is deployment configuration testing and why is it high-leverage?
?
Tests that verify the deployment configuration — environment variables, IAM permissions, network routing, service accounts — not the code itself. A significant proportion of production incidents are caused by configuration errors, not code bugs. Configuration tests catch these before production: verifying a service can reach its dependencies, that environment variables are set, that health checks pass post-deployment. High-leverage because configuration errors are common and costly.
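As a small illustration, a configuration test for environment variables might look like the Python sketch below. The variable names in `REQUIRED_VARS` are hypothetical; a real check would list whatever the service actually needs and would also probe dependencies and health endpoints.

```python
import os

# Hypothetical variable names, chosen only for this example.
REQUIRED_VARS = ("DATABASE_URL", "SERVICE_ACCOUNT", "UPSTREAM_HOST")


def missing_config(environ=None, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Run after deployment, a non-empty result fails the check before any traffic reaches the misconfigured service.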
What is exploratory testing?
?
Manual, unscripted testing where a skilled tester uses judgment and creativity to find bugs that automated tests do not catch. Used to discover unexpected behaviors, edge cases, usability issues, and bugs in code paths that automated test suites don’t exercise. Exploits human judgment in ways automation cannot. Limitations: not repeatable, doesn’t scale with code change rate. Complements automated testing; does not replace it.
What is A/B diff regression testing?
?
A test approach that compares the output of two versions (A and B) of a system against the same inputs to detect unexpected behavioral changes. Useful when the correct output is hard to specify precisely — complex outputs like rendered HTML, ML model outputs, or ranking results. If the outputs differ, engineers review the differences to classify them as intentional changes or regressions. Limitation: if both versions are wrong in the same way, the diff finds nothing.
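A minimal sketch of the comparison step in Python. The two `slug` functions are hypothetical versions A and B of the same feature; the harness only reports where they disagree and leaves the judgment to a reviewer.

```python
def ab_diff(version_a, version_b, inputs):
    """Run both versions on the same inputs and collect every disagreement."""
    diffs = []
    for item in inputs:
        out_a, out_b = version_a(item), version_b(item)
        if out_a != out_b:
            diffs.append((item, out_a, out_b))
    return diffs


def slug_v1(title):
    # Version A: replaces each space individually.
    return title.lower().replace(" ", "-")


def slug_v2(title):
    # Version B: collapses runs of whitespace before joining.
    return "-".join(title.lower().split())
```

For the inputs `"Hello World"` and `"Two  Spaces"` (two spaces), only the second produces a diff, which an engineer would then classify as an intentional change or a regression.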
What is UAT (User Acceptance Testing) and what forms does it take at Google?
?
Testing performed by real users or user representatives to verify a system meets their needs before release. Validates that the product satisfies user requirements in real-world usage — catching usability issues and incorrect assumptions about user behavior that automated tests cannot detect. Forms at Google: dogfooding (employees use products before external release), trusted tester programs (early access cohorts), and beta releases (staged rollouts to subsets of users).
What are probers in the context of production testing?
?
Probers are continuously running tests that exercise production systems using synthetic (test) traffic, verifying system health from a user’s perspective. They run every minute or more frequently against critical user journeys and API endpoints. When failures occur, probers trigger immediate alerts to on-call engineers. Purpose: detect service degradation before real users are affected. Part of the production testing layer alongside canary analysis.
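The core loop of a prober can be sketched as follows. This is a simplified illustration: `check` and `alert` are hypothetical callables supplied by the caller (for example, a synthetic request and a paging hook), and a real prober would run under a scheduler rather than a bounded loop.

```python
import time


def probe_once(check, alert):
    """Run one synthetic check; alert and return False if it raises."""
    try:
        check()
    except Exception as exc:
        alert(f"probe failed: {exc!r}")
        return False
    return True


def run_prober(check, alert, interval_seconds=60, cycles=3, sleep=time.sleep):
    """Run the probe `cycles` times, waiting `interval_seconds` between runs."""
    results = []
    for _ in range(cycles):
        results.append(probe_once(check, alert))
        sleep(interval_seconds)
    return results
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time.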
What is canary analysis?
?
The practice of deploying a new version to a small subset of production traffic before rolling out fully, and comparing behavior and metrics of the new version against the existing version. Metrics monitored: error rates, latency, business metrics (conversion rate, crash rate). If the canary shows degradation, the rollout is halted and rolled back. Purpose: catch production bugs that pre-production testing missed, before they affect all users.
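The comparison at the heart of canary analysis can be sketched with a single metric. This is a deliberately simplified illustration: the fixed `max_increase` threshold is an assumption, and production canary systems typically compare many metrics with statistical tests rather than one raw difference.

```python
def error_rate(errors, requests):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def canary_verdict(baseline, canary, max_increase=0.01):
    """Compare (errors, requests) pairs for the baseline and canary versions.

    Returns "rollback" if the canary's error rate exceeds the baseline's
    by more than `max_increase`, otherwise "proceed".
    """
    delta = error_rate(*canary) - error_rate(*baseline)
    return "rollback" if delta > max_increase else "proceed"
```

For example, a baseline with 10 errors in 10,000 requests and a canary with 50 errors in 1,000 requests yields "rollback".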
How do probers and canary analysis differ in purpose?
?
Probers monitor the currently deployed system continuously using synthetic traffic — they detect when the existing production system degrades or goes down. Canary analysis monitors a new version being rolled out against a subset of real traffic — it detects whether the new version is worse than the current version before full rollout. Probers are ongoing health monitoring; canary analysis is a release safety mechanism.
What is chaos engineering and what is its purpose?
?
The practice of deliberately injecting failures into production or production-like systems (killing processes, introducing latency, dropping requests) to verify that the system degrades gracefully and recovers correctly. Purpose: move from “we hope the system is resilient” to “we have verified it is.” Popularized by Netflix’s “Chaos Monkey.” At Google, this takes the form of DiRT (Disaster Recovery Testing) — planned exercises simulating catastrophic events.
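At small scale, the same idea can be sketched as a fault-injection wrapper around a dependency. Everything here is hypothetical: `inject_faults` simulates dropped requests, and `fetch_recommendations` is example code under test that is expected to degrade to an empty list rather than crash.

```python
import random


def inject_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise ConnectionError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(backend, user):
    """Code under test: fall back to an empty list when the backend fails."""
    try:
        return backend(user)
    except ConnectionError:
        return []


def healthy_backend(user):
    return [f"item-for-{user}"]
```

With `failure_rate=1.0` every call fails and the caller should still return `[]`, which turns "we hope it degrades gracefully" into a checked property.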
What is Google’s DiRT program?
?
DiRT (Disaster Recovery Testing) is Google’s internal program for testing recovery from catastrophic events — simulating datacenter outages, critical service failures, and other disaster scenarios to verify that response procedures work and systems recover correctly. DiRT is an institutionalized form of chaos engineering. It moves disaster recovery from a set of untested assumptions to a set of verified capabilities.
What is user evaluation and when is it used?
?
A testing approach that uses human raters to assess quality along dimensions that automated metrics cannot capture — relevance, naturalness, helpfulness, user satisfaction. Used for ML-powered products, search and ranking systems, and any product where quality is subjective. Forms: side-by-side evaluation (raters compare A vs. B), absolute rating (raters rate quality on a scale), and user studies (structured sessions with representative users performing real tasks).
What is the key trade-off when defining the scope of the system under test (SUT)?
?
- Narrower SUT: Faster to execute, easier to maintain, more hermetic, and failures are easier to diagnose, but may miss bugs at the boundaries of excluded services.
- Broader SUT: Higher fidelity and catches more integration bugs, but slower, harder to maintain, harder to diagnose on failure, and more surface area for flakiness.
The right scope depends on which integration boundary is being verified. Define the SUT as narrowly as possible while still covering the integration being tested.
What are the three test data management strategies in larger tests?
?
- Programmatic seeding: Test setup creates data via the system’s APIs — highest fidelity (tests the data creation path), most hermetic.
- Fixtures: Pre-defined data sets loaded at environment setup time — faster but less realistic.
- Sampled production data: Anonymized or sampled real production data used in test environments; highest realism, but requires careful anonymization and compliance.
Why does test ownership matter for larger tests?
?
Larger tests without a specific owning team tend to go unmaintained when they fail, get disabled rather than fixed when flaky, and fall out of sync with the features they test. Google’s model: tests are owned by the teams responsible for the features they test. Tests spanning multiple services need a designated owner, typically the team owning the primary service or integration. Ownership is what keeps larger tests reliable over time.
What should be done about flaky larger tests?
?
Flaky tests should be fixed or removed — not left in a “known flaky” state. A flaky test that engineers learn to ignore is worse than no test, because it trains engineers to dismiss test failures, meaning real failures also get ignored. Flakiness in a larger test is a signal of either poor test design (timing-dependent, shared state) or real system reliability issues that should be investigated and fixed.
What are the three dimensions of cost in larger tests?
?
- Authoring cost: Harder to write and set up — requires environment configuration, multi-service orchestration, test data management across services.
- Execution cost: Slower by orders of magnitude — seconds to minutes vs. milliseconds for unit tests.
- Maintenance cost: More surface area for breakage — any change in any component in the SUT can break the test; flakiness requires ongoing attention.
Why is “why not just have unit tests?” a question worth answering directly?
?
Because unit tests with mocked/stubbed dependencies verify a model of the system, not the system itself. As the gap between the model (stubs) and reality (real implementations) grows, the test suite gives less useful signal. Unit tests cannot verify: configuration, emergent interaction bugs, whether stub assumptions are correct, or real user workflows. Larger tests close this gap. They are not redundant with unit tests — they test different properties that unit tests structurally cannot verify.
Total Cards: 25
Review Time: ~18 minutes
Priority: HIGH
Last Updated: 2026-06-02