Chapter 1: What Is Software Engineering?
Tags: seg, software-engineering, sustainability, time-and-change, scale, trade-offs
Status: Notes complete
Overview
Chapter 1 establishes the foundational distinction between programming (the act of producing code) and software engineering (programming integrated over time). This distinction is not pedantic; it is load-bearing. Every subsequent chapter in the book rests on the premise that the challenges Google faces arise from the collision of scale (tens of thousands of engineers, billions of users, billions of lines of code) and time (systems that must evolve over decades). A program that runs correctly today but cannot be safely changed next year is not good engineering; it is a deferred liability.
The chapter introduces three organizing axes — time and change, scale and efficiency, and trade-offs and costs — as the lens through which engineering decisions should be evaluated. It then argues that sustainability is the ultimate goal: the ability to react to necessary change over a codebase’s expected lifetime without paying increasingly prohibitive costs. The chapter closes with the book’s own TL;DRs, which ground every subsequent chapter in these three axes.
Core Concepts
Software engineering: Programming integrated over time. The full discipline includes not just writing code but all the policies, practices, and tools that allow an organization to maintain and evolve code over years or decades. Distinguished from programming, which is the immediate act of producing code without regard for its future.
Sustainability: The ability to react to necessary changes — to dependencies, compilers, infrastructure, security requirements — over a codebase’s expected life span. A codebase is sustainable if its engineers can make changes when they must, without the changes becoming prohibitively expensive. Unsustainability is often invisible until it becomes catastrophic.
Hyrum’s Law: “With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended upon by somebody.” Named after Hyrum Wright, one of the book’s co-authors. It generalizes the observation that consumers will depend on undocumented, accidental, or private behaviors if they are observable — making every observable behavior effectively public.
Shifting Left: The practice of moving problem detection earlier in the development process: from production to staging, from staging to CI, from CI to local development. Every leftward shift reduces the cost of finding and fixing a problem, because defects compound: a bug found in production can be orders of magnitude more expensive than the same bug found in code review.
Scalable policy: A policy or practice that remains feasible (or becomes more efficient) as the number of engineers, code changes, or systems grows. Contrasted with non-scalable policies, whose total human effort grows linearly or faster with the number of changes, making them impossible to sustain at Google’s scale.
Software Engineering vs. Programming
The authors draw a sharp distinction that underpins the entire book:
| Dimension | Programming | Software Engineering |
|---|---|---|
| Time horizon | Now — make it work | Years/decades — keep it working |
| Team size | Usually individual | Large, evolving teams |
| Change pressure | Rare, controlled | Constant — dependencies shift, requirements change |
| Cost model | Cost of initial creation | Cost of creation + cost of ownership over lifetime |
| Risk profile | Does it compile? Does it pass tests? | What can break it over time? |
The distinction matters because programming skills (algorithms, data structures, design patterns) are necessary but not sufficient for software engineering. The additional challenges — upgrading dependencies, maintaining backward compatibility, managing technical debt across a codebase of billions of lines — require different skills, policies, and organizational structures.
The authors use a bridge-building analogy: a single-span footbridge and the Golden Gate Bridge both cross water, but the engineering disciplines required are not comparable. A footbridge can be built by a small team over a weekend; the Golden Gate required planning for decades of operation, along with materials science and failure analysis. Similarly, a weekend script and a production system that must run for 20 years require fundamentally different engineering practices.
Time and Change
Code Life Spans
The authors introduce a thought-provoking framing: software has an expected life span that should influence every engineering decision. They identify a rough spectrum:
- Short-lived code (hours to days): exploratory scripts, one-off data migrations. Sustainability is not a concern because the code will be deleted before it can accumulate dependencies.
- Medium-lived code (months to years): most product features. Some sustainability investment is warranted.
- Long-lived code (decades): core infrastructure, foundational libraries, programming language runtimes. Sustainability is the dominant concern. Every decision made today will need to survive an environment the current engineers cannot predict.
The key insight: the cost of not designing for change accumulates over time. Code that was written without regard for changeability does not stay the same cost; it becomes progressively more expensive to modify as dependencies on its undocumented behaviors accumulate (see Hyrum’s Law) and as the surrounding ecosystem evolves away from its assumptions.
Hyrum’s Law in Depth
Hyrum’s Law is one of the book’s most important concepts and one of its most counterintuitive:
“With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended upon by somebody.”
Practical implications:
- Changing documented behavior is hard because users relied on the documentation. Changing undocumented behavior is also hard because users relied on the observation, not the documentation.
- Removing a feature or changing an output format — even one explicitly labeled “implementation detail” — will break someone’s code if enough people use the API.
- This is not a theoretical concern: Google engineers have observed that even the order of output in operations with no specified ordering (like hash iteration) becomes relied upon by downstream consumers.
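The hash ordering example: Go’s map iteration order is explicitly randomized to prevent engineers from depending on it. Python enables hash randomization for strings by default (since 3.3); that change was motivated mainly by security, but it also makes the iteration order of sets of strings vary between runs, so code cannot quietly depend on it. Deliberate randomization of this kind is a defense against Hyrum’s Law. If you allow a stable observable ordering, users will depend on it, and changing it later will break them.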
Anti-pattern — Hyrum’s Law Trap: Providing an implementation detail as observable behavior, having users depend on it, then finding yourself unable to improve the implementation without breaking users. This is why Google’s approach to APIs is to minimize observable surface area and to actively test for violations of undocumented-but-observable behaviors.
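The trap can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: `list_backends_v1`, `list_backends_v2`, and the two callers are invented names, not anything from the book or from a real library. The contract in both versions is "returns the backend names, in no particular order"; only the observable behavior differs.

```python
import random
from typing import Optional


def list_backends_v1(registry: dict[str, str]) -> list[str]:
    """Hypothetical API. Contract: returns the backend names, order unspecified."""
    # Implementation detail: the names happen to come back sorted.
    return sorted(registry)


def list_backends_v2(registry: dict[str, str],
                     rng: Optional[random.Random] = None) -> list[str]:
    """Same contract, but the unspecified order is deliberately shuffled."""
    rng = rng or random.Random()
    names = list(registry)
    rng.shuffle(names)
    return names


def pick_primary(names: list[str]) -> str:
    """A caller that silently depends on ordering: it assumes names[0] is stable."""
    return names[0]


def pick_primary_robust(names: list[str]) -> str:
    """A caller that relies only on the contract: it imposes its own ordering."""
    return min(names)
```

Against `list_backends_v1`, `pick_primary` appears to work indefinitely, which is exactly how the dependency forms. Against `list_backends_v2`, the same caller returns different results from call to call, so the hidden assumption tends to surface in the caller's own tests rather than during a later implementation change.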
Why Change Is Inevitable
The authors list categories of change that every long-lived codebase must absorb:
- Dependency upgrades: Languages, compilers, OS libraries, third-party packages all evolve. Security vulnerabilities require patches. Deprecated APIs must be migrated.
- Infrastructure changes: Hardware architectures change; cloud provider APIs evolve; networking stacks are replaced.
- Scale changes: A system designed for 1,000 users behaves differently under 1,000,000 users. The code may need to change even if the requirements do not.
- Security and compliance: Regulations evolve; cryptographic algorithms are deprecated; security threats emerge.
- Team and knowledge changes: The engineers who originally wrote the code leave; new engineers must be able to understand and modify it.
The authors’ conclusion: if you are not investing in sustainability, you are accumulating a debt whose interest compounds over time. The system will eventually demand payment — in the form of a crisis, a security incident, or an inability to ship new features — and the cost will be far higher than if sustainability had been addressed continuously.
Scale and Efficiency
The Organization as Input to Engineering
Google’s scale changes not just technical constraints but organizational ones. With thousands of engineers:
- A policy that requires a senior engineer to approve every change is a bottleneck: the approval workload grows as O(changes), while the pool of senior approvers does not grow with it.
- A migration that touches a million lines of code cannot be done manually; it requires automated tooling.
- A manual security review process that works for 10 engineers fails at 10,000.
Scale insight: Policies and practices that require human time per change do not scale. At Google’s scale, every repeated manual process is a candidate for automation — not as an optimization but as a prerequisite for feasibility.
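As a rough illustration of what automated migration tooling can look like, the sketch below uses Python's standard `ast` module to rename calls to a deprecated function. It is a toy under stated assumptions, not a description of Google's tooling: the function names `old_fetch` and `new_fetch` and the `RenameCall` and `migrate` helpers are invented for the example.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite calls to a deprecated function name into calls to its replacement."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name
        self.rewrites = 0

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # visit arguments first so nested calls are rewritten too
        # Only plain-name calls are matched; method calls (obj.old_name()) are left alone.
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            replacement = ast.Name(id=self.new_name, ctx=ast.Load())
            node.func = ast.copy_location(replacement, node.func)
            self.rewrites += 1
        return node


def migrate(source: str, old_name: str, new_name: str) -> tuple[str, int]:
    """Return the rewritten source and the number of call sites changed."""
    tree = ast.parse(source)
    transformer = RenameCall(old_name, new_name)
    new_tree = ast.fix_missing_locations(transformer.visit(tree))
    return ast.unparse(new_tree), transformer.rewrites
```

Because the transform works on the syntax tree, a string literal that merely contains `old_fetch` is left untouched, which a naive text substitution would not guarantee. One known limitation of this sketch: `ast.unparse` (Python 3.9+) discards comments and original formatting, so tooling intended for real use would need a format-preserving approach.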
Scalable vs. Non-Scalable Policies
| Policy Type | Example | Scales? | Why |
|---|---|---|---|
| Scalable | Automated linting on every commit | Yes | Human cost is a roughly fixed setup and upkeep effort, independent of the number of changes |
| Scalable | Automated migration tooling (sed/AST transforms) | Yes | One tool author, millions of changes |
| Non-scalable | Manual code style review | No | Requires reviewer time per change |
| Non-scalable | “Email the security team for approval” | No | Reviewer bottleneck grows with changes |
| Non-scalable | “Don’t upgrade dependencies until necessary” | No | Debt accumulates; each deferred upgrade is harder than the last |
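The difference between the two kinds of policy can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the default figures (5 minutes of reviewer time per change, 80 hours to build a linter, 40 hours per year to maintain it) are assumptions chosen for the example, not numbers from the book.

```python
import math


def manual_review_hours(changes: int, minutes_per_change: float = 5.0) -> float:
    """Human time for a per-change manual check: grows linearly with changes."""
    return changes * minutes_per_change / 60.0


def automated_check_hours(changes: int, setup_hours: float = 80.0,
                          upkeep_hours_per_year: float = 40.0,
                          years: float = 1.0) -> float:
    """Human time for an automated check: setup plus upkeep.

    `changes` is accepted but deliberately unused, to make the point that
    the human cost does not depend on it.
    """
    return setup_hours + upkeep_hours_per_year * years


def break_even_changes(minutes_per_change: float = 5.0, setup_hours: float = 80.0,
                       upkeep_hours_per_year: float = 40.0,
                       years: float = 1.0) -> int:
    """Smallest number of changes at which automation costs no more than manual review."""
    fixed_hours = setup_hours + upkeep_hours_per_year * years
    return math.ceil(fixed_hours * 60.0 / minutes_per_change)
```

Under these assumed figures the break-even point is 1,440 changes in a year. Below that volume the manual policy is cheaper in human time; far above it, the manual policy is not merely slower but consumes more reviewer hours than an organization can supply.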
The Compiler Upgrade Example
The book uses Google’s approach to compiler upgrades as a case study in scalable policy. Upgrading a compiler (e.g., moving from GCC 7 to GCC 9) across a codebase of billions of lines can introduce new warnings-as-errors, new undefined behavior detections, and new ABI incompatibilities. Done infrequently, each upgrade becomes a massive, risky, expensive project. Done continuously (or on a regular cadence), each upgrade is small, manageable, and the tooling to support it is already in place.
Anti-pattern — Deferred Dependency Upgrade: Postponing dependency upgrades to avoid short-term disruption. Each deferral increases the distance between current and target versions, increasing the probability of breaking changes and increasing the cognitive load required to perform the upgrade. This is a classic case where the sustainable policy (upgrade continuously) feels more expensive in the short term but is dramatically cheaper over the long term.
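The false economy of deferral can be sketched with a toy cost model. The central assumption, which is mine and not the book's, is that the effort of one upgrade grows faster than linearly with the number of releases it skips (here, `gap ** 1.5`), reflecting the compounding of breaking changes. The base cost and exponent are placeholders; only the shape of the comparison matters.

```python
def upgrade_cost(gap: int, base_hours: float = 10.0, exponent: float = 1.5) -> float:
    """Assumed effort (hours) to upgrade across `gap` upstream releases in one step."""
    return base_hours * gap ** exponent


def total_cost(releases: int, cadence: int, base_hours: float = 10.0,
               exponent: float = 1.5) -> float:
    """Total effort to stay current over `releases` upstream releases,
    upgrading once every `cadence` releases."""
    full_steps, remainder = divmod(releases, cadence)
    cost = full_steps * upgrade_cost(cadence, base_hours, exponent)
    if remainder:
        # A final, smaller upgrade to catch up on the leftover releases.
        cost += upgrade_cost(remainder, base_hours, exponent)
    return cost
```

With these assumed parameters, tracking 12 upstream releases one at a time costs 120 hours in total, upgrading every 4 releases costs 240 hours, and doing a single 12-release jump costs more still. If the exponent were 1.0 (no compounding), cadence would make no difference, so the argument for continuous upgrades rests entirely on the claim that upgrade difficulty compounds with distance.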
Shifting Left
Shifting Left is the principle that defects should be found and fixed as early as possible in the development pipeline, because the cost of a defect grows with the distance between where it was introduced and where it is detected:
Cost of fixing a defect:
| Stage | Relative cost | Typical time to fix |
|---|---|---|
| Code review | ~1x | Minutes |
| CI (test failure) | ~5x | Hours, including investigation |
| Staging | ~20x | Days, including reproduction |
| Production | ~100x | Days to weeks, including incident response |
Shifting Left means investing in fast, comprehensive testing, strong static analysis, and code review practices that catch defects at the leftmost possible stage. At Google’s scale, this investment pays for itself many times over because the volume of potential defects is enormous.
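To see why the investment pays off, the relative multipliers quoted above can be combined with the share of defects caught at each stage. The sketch below is a minimal model: the multipliers come from these notes, while the two detection distributions in the usage note are invented for illustration.

```python
# Relative cost multipliers per detection stage, as quoted in these notes.
RELATIVE_COST = {"code_review": 1, "ci": 5, "staging": 20, "production": 100}


def expected_cost_per_defect(detection_share: dict[str, float]) -> float:
    """Average relative cost of a defect, given the fraction caught at each stage."""
    total = sum(detection_share.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"detection shares must sum to 1.0, got {total}")
    return sum(RELATIVE_COST[stage] * share
               for stage, share in detection_share.items())


# Hypothetical distributions before and after investing in earlier detection.
before = {"code_review": 0.20, "ci": 0.30, "staging": 0.20, "production": 0.30}
after = {"code_review": 0.50, "ci": 0.35, "staging": 0.10, "production": 0.05}
```

With these made-up distributions, the average cost per defect falls from about 35.7x to about 9.25x. Almost all of the saving comes from the production row: moving 25 percentage points of defects out of production removes 25 units of cost on its own.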
Trade-offs and Costs
The Decision Framework
The authors argue that software engineering is fundamentally about making decisions under uncertainty, and that the inputs to these decisions should be made explicit. The framework they propose:
- Identify all relevant inputs: what does this decision cost? what does it save? over what time horizon?
- Quantify where possible: convert abstract concepts (“this is risky”) to concrete estimates (“this has a 20% chance of causing a 2-hour outage, which costs ~$50k in engineering time”).
- Identify hidden costs: the cost of not deciding, the cost of the wrong decision, the maintenance cost of the chosen option.
- Consider reversibility: how hard is this decision to undo? irreversible decisions warrant more deliberation.
- Decide and document: make the decision explicitly, with its rationale, so it can be revisited.
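The "quantify where possible" step can be sketched as a simple expected-cost calculation, using the figures from the example in the list (a 20% chance of an outage costing about $50k). The `Risk` class and `worth_mitigating` helper are hypothetical names for this illustration, and the mitigation costs in the usage note are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """A risk expressed as a probability and a monetary impact."""
    probability: float   # chance the event occurs, between 0 and 1
    impact_usd: float    # cost if it does occur

    def expected_cost(self) -> float:
        return self.probability * self.impact_usd


def worth_mitigating(risk: Risk, mitigation_cost_usd: float,
                     residual_probability: float = 0.0) -> bool:
    """True if the mitigation costs less than the expected loss it removes."""
    reduction = (risk.probability - residual_probability) * risk.impact_usd
    return mitigation_cost_usd < reduction


# The example from the notes: 20% chance of a 2-hour outage costing ~$50k.
outage = Risk(probability=0.20, impact_usd=50_000)
```

Here `outage.expected_cost()` is about $10,000, so a mitigation that costs $4,000 and fully removes the risk is worthwhile, while one that costs $12,000 is not. This covers only the quantifiable part of the framework; reversibility and hidden costs still have to be weighed separately.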
Inputs to Decision-Making
The authors enumerate the types of costs that must be weighed:
| Cost Type | Examples |
|---|---|
| Financial costs | Infrastructure, licensing, tooling |
| Resource costs | Engineer-hours required to implement and maintain |
| Personnel costs | Onboarding complexity, cognitive load, specialist skill requirements |
| Transaction costs | Cost of coordination, approvals, migrations |
| Opportunity costs | What cannot be built because this was built instead |
| Societal costs | User impact, environmental impact (at scale) |
The authors emphasize that costs must be measured in context. A decision that costs 1 engineer-hour at a 10-person startup may cost 1,000 engineer-hours at Google because it needs to be applied across thousands of engineers or millions of lines of code.
The Distributed Builds Example
Google uses a distributed build system (Blaze/Bazel) that requires significant infrastructure investment — maintaining a build farm, ensuring network reliability, handling cache invalidation across distributed systems. A small company might reasonably use a local build system. For Google, the distributed build is not optional: build times on local hardware for a billion-line codebase would be measured in days, making continuous integration and rapid iteration impossible. The decision to invest in distributed builds is justified by the scale multiplier on the cost of the alternative.
This illustrates a general principle: decisions must be evaluated at the scale at which they will be applied. The cost/benefit calculation that makes a solution viable at 10 engineers may be completely inverted at 10,000.
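The inversion can be illustrated with a toy break-even model. All figures are assumptions made up for the sketch (hours an engineer spends waiting on builds per year under each system, and the yearly engineering effort to run a build farm); none come from the book.

```python
def local_build_cost(engineers: int, wait_hours_per_engineer: float = 200.0) -> float:
    """Yearly engineer-hours lost to build waits with local builds."""
    return engineers * wait_hours_per_engineer


def distributed_build_cost(engineers: int, infra_hours: float = 4000.0,
                           wait_hours_per_engineer: float = 20.0) -> float:
    """Yearly engineer-hours with a distributed build: fixed upkeep plus shorter waits."""
    return infra_hours + engineers * wait_hours_per_engineer


def cheaper_option(engineers: int) -> str:
    """Name of the option with the lower yearly cost at this team size."""
    if local_build_cost(engineers) <= distributed_build_cost(engineers):
        return "local"
    return "distributed"
```

Under these assumptions, `cheaper_option(10)` is `"local"` (2,000 hours against 4,200) and `cheaper_option(10_000)` is `"distributed"` (2,000,000 hours against 204,000). The fixed infrastructure cost dominates at small scale and becomes negligible at large scale, which is the inversion the paragraph describes. The model ignores the fact that local build times themselves grow with codebase size, which would push the break-even point even lower.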
Revisiting Decisions and Making Mistakes
A key insight in the chapter: the goal is not to make perfect decisions; it is to make good decisions that can be revisited. The authors explicitly advocate for:
- Accepting that mistakes will be made — at Google’s scale, with decisions affecting millions of users, perfection is not achievable. The question is not “how do we avoid mistakes?” but “how do we detect and correct them quickly?”
- Making decisions revisable — decisions that are hard to reverse warrant more deliberation. Decisions that are easy to reverse can be made more quickly and corrected if wrong.
- Not treating sunk cost as justification — the fact that a system was built a certain way is not a reason to keep it that way if the context has changed and a better option now exists.
Anti-pattern — Sunk Cost Architecture: Continuing to invest in a failing architectural decision because of the investment already made, rather than evaluating the current and future costs of alternatives. The correct question is always: “given where we are now, what is the best path forward?” — not “how do we justify the path we already took?”
TL;DRs
(Paraphrased from the book’s end-of-chapter TL;DR section)
- Software engineering is programming integrated over time.
- We suggest recognizing three distinct differences between programming and software engineering: time, scale, and the trade-offs at play.
- As an organization grows, it must start to focus on scaling the efficiency of its engineering efforts, rather than just the systems themselves.
- There are many factors that affect the quality of your software over time. Every decision you make today creates a debt that must eventually be paid.
- It is important to understand the difference between “it works” and “it is maintainable.”
- When considering how to make a decision or design a system, consider the trade-offs and the long-term implications — not just the immediate requirements.
- Software is sustainable when, for the expected life of the code, we are capable of responding to changes in dependencies, technology, or product requirements.
- It is also important to know what you do not know — in particular, what decisions you need to make and when.
Key Takeaways
- Software engineering is programming integrated over time — the discipline extends beyond making code work today to ensuring it can be safely changed over its expected lifetime, which may span decades.
- Hyrum’s Law is unavoidable at scale — any observable behavior of an API will be depended upon by someone; engineers must design APIs with minimal observable surface area and actively test against undocumented-but-observable behaviors.
- Sustainability is the goal, not a nice-to-have — a codebase is sustainable if its engineers can make necessary changes without paying prohibitively increasing costs; unsustainability compounds silently until it becomes catastrophic.
- Non-scalable policies are not just inefficient; they are infeasible at Google’s scale. Every policy whose human time grows with each additional change must be automated or eliminated.
- Shifting Left dramatically reduces defect cost — a bug found in code review costs ~1x; the same bug found in production costs ~100x; the investment in early detection pays for itself many times over at scale.
- Deferred dependency upgrades are a false economy — deferral increases upgrade distance, increases breaking change probability, and ultimately costs far more than continuous incremental upgrades.
- All decisions involve trade-offs with real costs — financial, resource, personnel, opportunity, and societal costs must all be enumerated and weighed; hidden costs are the most dangerous because they accumulate unobserved.
- Decisions should be made revisable — irreversibility warrants more deliberation; reversible decisions can be made quickly and corrected; the goal is to avoid irreversible mistakes, not all mistakes.
- The distributed builds case demonstrates scale inversion — a solution that is overkill at small scale (distributed build farms) becomes the only feasible option at large scale; trade-off calculations are always scale-dependent.
- “It works” and “it is maintainable” are different claims — a system that works today but cannot be safely changed tomorrow is a liability, not an asset; engineering is the discipline of building systems that satisfy both claims over time.
Related Resources
- ch02-how-to-work-well-on-teams — Applies the sustainability and scale principles to team dynamics, establishing the human foundations on which engineering practices rest
- ch03-knowledge-sharing — Extends the sustainability argument to institutional knowledge: code is not the only thing that must remain healthy over time
- ch15-deprecation — The end-of-life view of sustainability: how to retire systems and APIs that can no longer be maintained
- ch16-version-control — How version control policy (monorepo vs. multi-repo, branching strategy) affects sustainability and the ability to make cross-cutting changes
Last Updated: 2026-06-02