Chapter 13: Test Doubles
seg testing test-doubles fakes stubs mocking
Status: Notes complete
Overview
Chapter 13 addresses one of the most consequential and commonly misused tools in a software engineer’s testing arsenal: test doubles. A test double is any object or function that replaces a real production dependency in a test — used to isolate the code under test from slow, nondeterministic, or difficult-to-configure external systems. The chapter does not treat test doubles as a single monolithic concept; instead it draws sharp distinctions between three fundamentally different techniques — faking, stubbing, and interaction testing (mocking) — and gives specific, opinionated guidance on when each is appropriate and when each should be avoided.
The chapter is grounded in Google’s real experience at scale. Google’s engineering culture went through a period of excessive mocking, learned its costs, and developed the heuristics documented here. The central message is that test doubles must be used with discipline: overuse leads to tests that are brittle, hard to maintain, and, paradoxically, poor evidence that the behaviors they are supposed to verify actually work.
Core Concepts
Test double: Any object, function, or system that replaces a real production dependency in a test. The term encompasses fakes, stubs, mocks, spies, and dummies. Borrowed from the film industry term “stunt double.”
Seam: A place in code where behavior can be changed without editing the code itself — typically by substituting a dependency. Seams are the insertion points for test doubles. Well-designed code with dependency injection has natural seams; code with hardcoded dependencies does not.
Mocking framework: A library that provides APIs for dynamically creating test doubles at runtime — generating objects that implement an interface and recording or stubbing method calls. Examples: Mockito (Java), unittest.mock (Python), Jasmine (JavaScript).
Fidelity: The degree to which a test double behaves like the real implementation. High-fidelity fakes closely mirror the real system’s behavior; low-fidelity stubs return hardcoded values with no behavioral logic. The right fidelity depends on the test’s goals.
State testing: Verifying the observable state or output of a system under test after exercising it — the primary recommended form of verification.
Interaction testing: Verifying that a function was called with specific arguments, or that it was called a certain number of times. Appropriate only when the interaction itself is the behavior being tested.
Test Doubles at Google: History and Current Approach
Google’s relationship with test doubles evolved significantly. In the early years, mocking frameworks were widely adopted, and engineers used them liberally — mocking virtually every dependency. Over time, two problems emerged:
- Brittle tests: Tests that used heavy interaction testing broke whenever an implementation changed, even when the observable behavior did not. The tests were testing the implementation, not the behavior.
- Fake safety: Tests that stubbed out all dependencies showed only that the code under test worked against its stubs. They provided little assurance that the real system would work, because the stubs did not accurately model the real dependencies.
Google’s current approach reflects these lessons:
- Prefer real implementations where feasible (fast and deterministic)
- Use fakes for complex dependencies that are impractical in tests
- Use stubbing narrowly for returning specific values needed by the code under test
- Use interaction testing only when the interaction is the explicit contract being verified
Seams and Mocking Frameworks
Seams
A seam exists wherever a dependency can be substituted. In object-oriented code, seams are most naturally created through dependency injection — passing dependencies as constructor arguments or method parameters rather than instantiating them inside the class. Code without seams (using new to create dependencies internally, or calling global singletons directly) is difficult to test without modifying the production code.
Code without seams:
    class PaymentProcessor {
      process(order) {
        new PaymentGatewayClient().charge(order.amount); // hardcoded: no seam
      }
    }
Code with seams:
    class PaymentProcessor {
      constructor(paymentGateway) {
        this.gateway = paymentGateway; // injected: seam exists
      }
      process(order) {
        this.gateway.charge(order.amount);
      }
    }
The second version allows tests to substitute a fake, stub, or mock for paymentGateway without modifying production code.
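The same seam can be sketched in Python. This is a minimal illustration, not code from the chapter: `FakePaymentGateway` and the dictionary-shaped order are assumptions made for the example.

```python
class FakePaymentGateway:
    """Hypothetical test double: records charges instead of contacting a provider."""

    def __init__(self):
        self.charges = []

    def charge(self, amount):
        self.charges.append(amount)


class PaymentProcessor:
    def __init__(self, gateway):
        self.gateway = gateway  # injected dependency: this is the seam

    def process(self, order):
        self.gateway.charge(order["amount"])


# In a test, the double is passed in through the constructor.
gateway = FakePaymentGateway()
PaymentProcessor(gateway).process({"amount": 100})
```

Because the processor never constructs its own gateway, the test can swap in the double without touching production code.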
Mocking Frameworks: Example
A mocking framework typically allows:
    // Stub: return a specific value when called
    when(mockCreditCardService.charge(100)).thenReturn(SUCCESS);
    // Interaction test: verify a method was called
    verify(mockCreditCardService).charge(100);
The ease of this API is part of the danger: it makes it trivially easy to overuse stubs and interaction tests without thinking about what is actually being verified.
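For comparison, a rough equivalent using Python's standard `unittest.mock` might look like the sketch below. `SUCCESS` is a placeholder value assumed for the example. One difference worth noting: setting `return_value` stubs the method for any arguments, whereas the Mockito-style line above stubs only the call with `100`.

```python
from unittest.mock import Mock

SUCCESS = "SUCCESS"  # placeholder result value, assumed for this sketch

credit_card_service = Mock()

# Stub: return a specific value when called (for any arguments).
credit_card_service.charge.return_value = SUCCESS

result = credit_card_service.charge(100)

# Interaction test: raises AssertionError unless charge(100) was called exactly once.
credit_card_service.charge.assert_called_once_with(100)
```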
The Three Techniques
1. Faking
A fake is a lightweight implementation of a production dependency that behaves similarly to the real thing but is implemented specifically for testing. Fakes have real logic — they are not just returning hardcoded values. A fake database might store data in an in-memory map; a fake email service might store sent emails in a list that tests can inspect.
Example: A fake file system that stores files in memory and supports the same read/write operations as the real file system, but operates entirely in RAM without touching disk.
Properties of a good fake:
- Has real behavior (not just hardcoded returns)
- Is simpler than the production implementation
- Has high fidelity to the real system for the behaviors that matter in tests
- Is maintained alongside the production code it replaces
- Is tested independently — the fake itself should have tests
When to use a fake:
- When the real implementation is too slow to use in tests (e.g., a database)
- When the real implementation has nondeterministic behavior (e.g., a clock, a random number generator)
- When the real implementation has side effects that are unacceptable in tests (e.g., sending real emails, charging real credit cards)
- When the real implementation requires complex external infrastructure
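The in-memory file system mentioned above could be sketched as follows. The interface (`write`, `read`, `exists`) and the `save_report` function are hypothetical, chosen only to show that a fake carries real logic, including a realistic failure mode.

```python
class FakeFileSystem:
    """In-memory stand-in for a file system (hypothetical interface)."""

    def __init__(self):
        self._files = {}  # path -> contents

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        if path not in self._files:
            # Mirror a real failure mode rather than returning a default value.
            raise FileNotFoundError(path)
        return self._files[path]

    def exists(self, path):
        return path in self._files


def save_report(fs, path, lines):
    """Code under test: joins lines and writes them through the injected file system."""
    fs.write(path, "\n".join(lines))


fs = FakeFileSystem()
save_report(fs, "/reports/q1.txt", ["revenue: 10", "costs: 4"])
```

A test can then read the file back through the fake and assert on its contents, which is a state test rather than a check on which methods were called.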
2. Stubbing
Stubbing is the practice of configuring a function to return a pre-programmed value when called during a test. Unlike a fake, a stub has no real logic — it simply returns whatever the test tells it to return.
    // Stub: whenever getUser(123) is called, return this specific user object
    when(mockUserService.getUser(123)).thenReturn(new User("Alice"));
When stubbing is appropriate:
- When a test needs the code under test to receive a specific value from a dependency to exercise a particular code path
- When the dependency is simple and the interaction is narrow (one or two return values needed)
Dangers of overusing stubbing:
- Unclear test purpose: Tests with many stubs become hard to read — it is not clear which stubs are essential versus incidental
- Brittle to refactoring: If the internal implementation changes how it calls the dependency (e.g., calling getUser in a different order or with a different argument), the stub breaks the test even though the behavior is unchanged
- False confidence: Stubs do not verify that the real dependency would return that value; tests pass even if the real system would behave differently
- Implementation coupling: Stubbing specific method calls couples the test to implementation details, punishing refactoring
The key anti-pattern is over-stubbing: creating stubs for every dependency interaction, which results in tests that verify only that the code called the methods it was supposed to call, not that it produces correct behavior.
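As an illustration of the narrow, appropriate use described above, the sketch below uses one stub per test to force a particular branch. `greeting_for` and the `get_user` method are hypothetical names invented for the example.

```python
from unittest.mock import Mock


def greeting_for(user_service, user_id):
    """Hypothetical code under test with two code paths."""
    user = user_service.get_user(user_id)
    if user is None:
        return "Hello, guest"
    return f"Hello, {user['name']}"


# One stub per scenario, each chosen to exercise a specific branch.
known = Mock()
known.get_user.return_value = {"name": "Alice"}

missing = Mock()
missing.get_user.return_value = None
```

Each test needs a single stubbed value, and the assertion is on the returned greeting, so the stub stays incidental to what is being verified.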
3. Interaction Testing (Mocking)
Interaction testing (also called mocking in common parlance) verifies that a function was called with specific arguments, or a specific number of times. It is used when the interaction itself — rather than any observable output — is the behavior being tested.
    // Interaction test: verify the email service was called once with this address
    verify(mockEmailService, times(1)).sendEmail("alice@example.com", "Welcome!");
When interaction testing is appropriate:
- When verifying a side-effectful operation that produces no testable return value (e.g., “did we log this audit event?”, “did we send this notification?”)
- When the interaction is the explicit contract — for example, when testing that a rate limiter is consulted before every external API call
- When there is no other way to observe whether the behavior occurred
Why interaction testing should be the exception, not the rule:
- It tests the implementation of how a result is achieved, not the result itself
- If the implementation changes (e.g., replacing two API calls with one more efficient call that produces the same outcome), the interaction test breaks even though behavior is unchanged
- It leads to tests that are tightly coupled to the current implementation, making refactoring expensive
Prefer Real Implementations
The first question before reaching for a test double should be: can we use the real implementation?
If the real implementation is:
- Fast (runs in milliseconds)
- Deterministic (same inputs always produce same outputs)
- Free of unacceptable side effects in a test environment
…then it should be used. Test doubles introduce a gap between what is tested and what runs in production. Every test double is a hypothesis — “this behaves like the real thing” — and that hypothesis can be wrong.
Guidelines for preferring real implementations:
| Situation | Recommendation |
|---|---|
| Dependency is fast and deterministic | Use real implementation |
| Dependency hits a real database or network | Use a fake or in-memory substitute |
| Dependency sends real emails, charges real accounts | Use a fake |
| Dependency involves the system clock | Inject a fake clock |
| Dependency involves randomness | Inject a seeded random number generator |
| Dependency involves a complex external service | Hermetic test environment or fake |
Hermetic testing: Running tests against real implementations of services, but using isolated test instances that are pre-populated with known data and do not share state across tests. This is preferred over mocking at Google when feasible.
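The clock and randomness rows of the table can be sketched as below. `FakeClock`, `Session`, and `pick_winner` are hypothetical names; the point is only that time and randomness enter the code under test as injected dependencies.

```python
import random


class FakeClock:
    """Hypothetical clock whose time moves only when the test advances it."""

    def __init__(self, start=0.0):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


class Session:
    """Code under test: expires a fixed number of seconds after creation."""

    def __init__(self, clock, ttl_seconds):
        self._clock = clock
        self._expires_at = clock.now() + ttl_seconds

    def is_expired(self):
        return self._clock.now() >= self._expires_at


def pick_winner(rng, entrants):
    """Code under test: takes its random source as a parameter."""
    return rng.choice(entrants)


clock = FakeClock()
session = Session(clock, ttl_seconds=60)
fresh = session.is_expired()   # no time has passed yet
clock.advance(61)
stale = session.is_expired()   # 61 seconds later, past the 60-second TTL

# Two generators built from the same seed produce the same sequence.
first = pick_winner(random.Random(42), ["ann", "bob", "cy"])
second = pick_winner(random.Random(42), ["ann", "bob", "cy"])
```

The test never sleeps and never depends on which entrant a given seed happens to select, only on the fact that the choice is repeatable.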
Faking in Depth
Why Fakes Are Important
Fakes occupy a privileged position in Google’s testing philosophy because they offer the best of both worlds: they avoid the costs of the real implementation (speed, nondeterminism, side effects) while providing far higher fidelity than stubs. A well-designed fake behaves like the real system for all behaviors that tests care about.
When to Write a Fake
The owners of a real API should provide a fake alongside it. If you own the UserDatabase class, you should also provide FakeUserDatabase. This co-location ensures:
- The fake stays in sync with the real API as the API evolves
- The fake is maintained by people who understand the real system’s behavior
- Users of the API don’t have to write their own fakes (which may have different assumptions)
Anti-pattern — Everyone Writes Their Own Fake: If there is no canonical fake for an API, each team will write its own. These DIY fakes will diverge from the real API over time, from each other, and from any shared expectations about behavior. Tests pass, but confidence in results is low.
Fidelity of Fakes
A fake does not need to perfectly reproduce every behavior of the real implementation. It needs to accurately reproduce the behaviors that are exercised by tests. The key question: does this fake behave consistently with the real implementation for the scenarios the tests care about?
Fakes that are too permissive (accepting invalid inputs, never failing) give false confidence. Fakes that are too strict (failing on valid inputs that the real system accepts) cause unnecessary test failures. The right level is: match the real system’s semantics for the inputs and operations that appear in tests.
Fakes Should Be Tested
A fake that is wrong is worse than no fake — it provides false confidence. Therefore, fakes should have their own tests, called contract tests or conformance tests:
For each behavior B of the real implementation:
- Test that the real implementation satisfies B
- Run the same test suite against the fake
- If both pass, the fake is conformant for B
This approach catches cases where the fake diverges from the real implementation as the real implementation evolves.
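One way to sketch such a contract test is a single check function run against both implementations. Everything here is an assumption for illustration: the store interface (`put`, `get`), the three behaviors chosen, and the use of an in-memory SQLite database to stand in for the real implementation.

```python
import sqlite3


class SqliteUserStore:
    """Stands in for the real implementation in this sketch."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    def put(self, user_id, name):
        self._db.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)", (user_id, name)
        )

    def get(self, user_id):
        row = self._db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


class FakeUserStore:
    """Dictionary-backed fake with the same interface."""

    def __init__(self):
        self._users = {}

    def put(self, user_id, name):
        self._users[user_id] = name

    def get(self, user_id):
        return self._users.get(user_id)


def check_contract(store):
    """Behaviors every implementation must satisfy."""
    assert store.get(1) is None        # unknown id yields None
    store.put(1, "Alice")
    assert store.get(1) == "Alice"     # a stored value can be read back
    store.put(1, "Alicia")
    assert store.get(1) == "Alicia"    # a second put overwrites the first
    return True


# The same checks run against the real implementation and the fake.
results = [check_contract(make()) for make in (SqliteUserStore, FakeUserStore)]
```

If the real store later changes a behavior covered by `check_contract`, the fake fails the same check until it is updated to match.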
Stubbing in Depth
When Stubbing Is Appropriate
Stubbing is appropriate when:
- A test needs the code under test to receive a specific value from a dependency to exercise a particular branch
- The interaction is simple: one value is needed, the stub is obvious, and the test is clear without explanation
- There is no existing fake for the dependency
The Core Danger: Testing the Mock, Not the Behavior
The central failure mode of over-stubbing is that tests verify the code called the right methods with the right arguments — but do not verify that those method calls produce correct behavior. This is sometimes called testing the mock rather than the behavior.
Anti-pattern:
- Stub: when getUser(123) is called, return fakeUser
- Stub: when getPermissions(fakeUser) is called, return [READ, WRITE]
- Assert: updateDocument() returned SUCCESS
What the test actually verifies:
- The code called getUser and getPermissions in this sequence
- (It does NOT verify that the real user has WRITE permission on the real document)
If the code under test is refactored to check permissions differently (e.g., a single combined getUserWithPermissions call), the test breaks — even though the behavior is unchanged.
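The refactoring scenario above can be sketched as follows. All names are hypothetical. The mock is created with `spec=[...]` so that it exposes only the two stubbed methods; without that, a `Mock` would silently invent the new method and fail in a less obvious way.

```python
from unittest.mock import Mock


def update_document_v1(users, user_id):
    user = users.get_user(user_id)
    perms = users.get_permissions(user)
    return "SUCCESS" if "WRITE" in perms else "DENIED"


def update_document_v2(users, user_id):
    # Same observable behavior, achieved with one combined call.
    _user, perms = users.get_user_with_permissions(user_id)
    return "SUCCESS" if "WRITE" in perms else "DENIED"


class FakeUserService:
    """Hypothetical fake that supports both the old and the new API."""

    def __init__(self):
        self._perms = {}

    def add_user(self, user_id, perms):
        self._perms[user_id] = list(perms)

    def get_user(self, user_id):
        return user_id

    def get_permissions(self, user):
        return self._perms[user]

    def get_user_with_permissions(self, user_id):
        return user_id, self._perms[user_id]


# Stub-based setup: coupled to the exact calls made by v1.
stubbed = Mock(spec=["get_user", "get_permissions"])
stubbed.get_user.return_value = "fake_user"
stubbed.get_permissions.return_value = ["READ", "WRITE"]

# Fake-based setup: describes the scenario, not the call sequence.
fake = FakeUserService()
fake.add_user(123, ["READ", "WRITE"])
```

Against `stubbed`, the first version passes and the second raises `AttributeError`, although both behave identically. Against `fake`, both versions return the same result.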
Overuse of Stubbing Leads to Unclear Tests
Tests with many stubs become difficult to understand:
- Which stubs represent realistic scenarios vs. arbitrary choices?
- If a stub is removed, does the test still make sense?
- What is being tested — the code’s behavior or its call sequence?
Guideline: If a test requires more than two or three stubs to set up, consider whether a fake would better serve the purpose.
Interaction Testing in Depth
Prefer State Testing Over Interaction Testing
State testing verifies the result of an operation — what the system produced, what state it is in, what it returned. Interaction testing verifies how the result was achieved. State testing is almost always preferable because:
- It is robust to implementation changes (how the result is achieved can change without breaking the test)
- It tests what the user of the code cares about (the outcome, not the mechanism)
- It is more readable — the assertion is clearly tied to the expected outcome
Example:
State test:
    sendInvitation(user)
    assert that user.invitationStatus == INVITED    ← tests what happened
Interaction test:
    sendInvitation(user)
    verify(emailService).sendEmail(user.email, "You're invited!")    ← tests how it happened
The state test remains valid if the implementation changes from email to in-app notification (assuming the status is updated either way). The interaction test breaks if the implementation changes, even if the user still receives an invitation.
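A runnable version of this contrast might look like the sketch below, using `unittest.mock`. The `User` class and `send_invitation` function are assumptions that mirror the pseudocode.

```python
from unittest.mock import Mock

INVITED = "INVITED"


class User:
    def __init__(self, email):
        self.email = email
        self.invitation_status = None


def send_invitation(user, email_service):
    """Hypothetical code under test: notifies the user and records the status."""
    email_service.send_email(user.email, "You're invited!")
    user.invitation_status = INVITED


email_service = Mock()
user = User("alice@example.com")
send_invitation(user, email_service)

# State test: checks what happened to the user.
assert user.invitation_status == INVITED

# Interaction test: checks how it happened (raises AssertionError on mismatch).
email_service.send_email.assert_called_once_with(
    "alice@example.com", "You're invited!"
)
```

If `send_invitation` switched to an in-app notification while still setting the status, only the second check would start failing.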
When Interaction Testing Is Appropriate
- Side-effectful operations with no testable output: If the only observable outcome is a side effect on an external system (e.g., an audit log entry was written), interaction testing may be the only option.
- The interaction is the contract: If the requirement is specifically that a particular external system is called (e.g., “every payment must be logged to the audit system before processing”), verifying the call is appropriate.
- Performance-sensitive behavior: If the requirement is that a costly operation is called only once (e.g., a database query must not be made more than once per request due to caching), interaction testing can verify the call count.
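The caching case in the last bullet can be sketched as below. `CachingProfileLoader` and `query_profile` are hypothetical names; the call count is the requirement here, so verifying it is the point of the test.

```python
from unittest.mock import Mock


class CachingProfileLoader:
    """Hypothetical code under test: queries the database at most once per user."""

    def __init__(self, database):
        self._database = database
        self._cache = {}

    def load(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._database.query_profile(user_id)
        return self._cache[user_id]


database = Mock()
database.query_profile.return_value = {"name": "Alice"}

loader = CachingProfileLoader(database)
first = loader.load(7)
second = loader.load(7)  # should be served from the cache

# Interaction check: the costly query ran exactly once, with the expected id.
database.query_profile.assert_called_once_with(7)
```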
Best Practices for Interaction Testing
Avoid over-specification: Do not verify every argument to every method call unless those arguments are the point of the test. Over-specified interaction tests break when irrelevant details change.
    // Over-specified (fragile):
    verify(emailService).sendEmail(eq("alice@example.com"), eq("Welcome to the platform!"),
        eq(EmailPriority.NORMAL), eq(Collections.emptyList()));
    // Better (verifies what matters):
    verify(emailService).sendEmail(eq("alice@example.com"), contains("Welcome"));
Avoid verifying private methods: Interaction tests should only verify interactions with dependencies (external collaborators), not that internal methods of the class under test were called. Internal methods are implementation details.
Prefer verify over never: Asserting that something was NOT called (verify(x, never()).method()) is fragile and often tests a negative that is not meaningful. If the behavior is “do not call the payment system for free orders,” state testing (verify the order result) is usually better than interaction testing (verify the payment call was never made).
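The over-specification guidance above can be sketched in Python with `unittest.mock.ANY`, which compares equal to any argument. The `welcome` function and its four-argument `send_email` call are assumptions made for the example.

```python
from unittest.mock import ANY, Mock


def welcome(email_service, address):
    """Hypothetical code under test."""
    email_service.send_email(address, "Welcome to the platform!", "NORMAL", [])


email_service = Mock()
welcome(email_service, "alice@example.com")

# Over-specified: breaks if the wording, priority, or attachments change.
email_service.send_email.assert_called_once_with(
    "alice@example.com", "Welcome to the platform!", "NORMAL", []
)

# Looser: pins only the recipient and leaves the other arguments open.
email_service.send_email.assert_called_once_with("alice@example.com", ANY, ANY, ANY)

# If part of the body matters, inspect just that argument.
recipient, body = email_service.send_email.call_args.args[:2]
assert "Welcome" in body
```

The looser form keeps passing when unrelated details such as the priority change, while still failing if the email goes to the wrong address.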
Anti-patterns Summary
| Anti-pattern | Description | Better Alternative |
|---|---|---|
| Excessive mocking | Mocking every dependency, even when the real implementation could be used | Prefer real implementations for fast/deterministic dependencies |
| Testing the mock | Stubbing all dependencies so tests only verify call sequences, not behavior | Use fakes or real implementations; test observable state |
| Brittle interaction tests | Verifying exact arguments for calls that aren’t the point of the test | Verify only the interactions that are the explicit contract |
| No canonical fake | Every team writes its own fake, leading to inconsistency | API owners provide and maintain canonical fakes |
| Untested fakes | Fakes that diverge from real implementations without detection | Write contract tests that run against both real and fake |
| Stubbing internal calls | Using stubs to verify internal method calls within the class under test | Only stub/verify calls to external dependencies |
TL;DRs
- A real implementation should be preferred over a test double when it is fast, deterministic, and has no unacceptable side effects.
- Test doubles must be used when the real implementation is not suitable for use in tests — for example, when it’s too slow, nondeterministic, or has undesirable side effects in test environments.
- The three techniques for using test doubles are faking, stubbing, and interaction testing.
- A fake is a lightweight implementation of the real API that behaves similarly to the real thing but is designed for testing. It is the most valuable type of test double.
- Stubbing is the practice of having a function return a hardcoded value. It is appropriate in limited cases but overuse leads to brittle, unclear tests.
- Interaction testing (mocking) verifies that a function was called with specific arguments. Prefer state testing; use interaction testing only when the interaction itself is the contract.
- A test double is only useful if it is a realistic substitute for the production dependency — an unrealistic double provides false confidence.
- Fakes should be tested with contract tests that verify the fake conforms to the real implementation’s behavior.
- Over-specification of interactions leads to brittle tests that break when implementations change without breaking behavior.
- Excessive mocking can mask real bugs by preventing integration of the code under test with its real dependencies.
Key Takeaways
- Test doubles are a tool, not a default — before using a double, ask whether the real implementation is fast, deterministic, and side-effect-free enough to use directly; unnecessary doubles reduce test confidence.
- Fakes are the most valuable type of test double — they have real behavioral logic, offer high fidelity to the production system, and remain valid through implementation changes; API owners should provide and maintain canonical fakes.
- Fakes must be tested — a fake that diverges from the real implementation provides false confidence; contract tests that run against both the real and fake implementation are the defense.
- Seams are required for testability — code that instantiates dependencies directly cannot accept doubles; dependency injection is the primary mechanism for creating seams.
- Stubbing is appropriate only in narrow circumstances — when a test needs a specific return value to exercise a code path; over-stubbing leads to tests that verify call sequences rather than behavior.
- Over-stubbing produces tests that test the mock, not the code — when every dependency is stubbed, passing tests prove only that the code calls the right methods, not that it produces correct outcomes.
- Prefer state testing over interaction testing — state tests are robust to implementation changes; interaction tests are coupled to implementation details and break when refactoring occurs.
- Interaction testing is appropriate only when the interaction is the contract — side-effectful operations with no testable output, or explicit requirements about which external systems are called, are the primary valid use cases.
- Avoid over-specification in interaction tests — verify only the arguments that matter for the test; verifying irrelevant details creates fragility without adding confidence.
- A test double that behaves differently from the real implementation is worse than no test double: it creates false confidence, whereas a test against the real implementation would have exposed the mismatch.
Related Resources
- ch11-testing-overview — Establishes the testing philosophy (small/medium/large tests) within which test doubles are used
- ch12-unit-testing — Covers the principles of good unit tests; test doubles are a key tool in unit testing
- ch14-larger-testing — Larger tests that use real implementations and hermetic environments instead of test doubles
- ch25-compute-as-a-service — Infrastructure context for understanding why hermetic testing matters at Google’s scale
Last Updated: 2026-06-02