Chapter 25: Compute as a Service
Tags: seg, infrastructure, compute, containers, caas, borg, kubernetes, serverless
Status: Notes complete
Overview
Chapter 25 addresses the evolution of how software organizations manage the infrastructure their software runs on. The central argument is that as systems grow in scale and complexity, the naive approach — manually provisioning and configuring individual machines — collapses under its own weight. The solution Google arrived at, and that the industry has broadly converged on, is Compute as a Service (CaaS): a managed layer that abstracts away individual machines and presents engineers with a uniform environment in which to deploy and run workloads.
The chapter is structured as a journey from the problems of unmanaged compute through the mechanisms of containerization and managed scheduling, to the trade-offs engineers must make when choosing how much abstraction they want. It draws heavily on Google’s experience building and operating Borg — the internal cluster management system that predated and directly inspired Kubernetes — and uses that experience to illustrate both what CaaS enables and what it requires of the software that runs on it.
The chapter is ultimately about the intersection of organizational scale and technical architecture. CaaS is not just an operational concern; it changes what software should look like. Software written for a managed compute environment must be architected differently from software written for a dedicated server — it must expect to fail, must be stateless by preference, and must externalize state and connectivity. These constraints, applied consistently, produce systems that are more resilient, more scalable, and easier to operate at scale than those written against fixed infrastructure assumptions.
Core Concepts
Compute as a Service (CaaS): A managed infrastructure model in which a scheduling and orchestration system allocates compute resources (CPU, memory) to workloads on demand, abstracting away which physical or virtual machine a workload runs on. Engineers declare what they want to run; the CaaS layer decides where and how.
Container: A lightweight, portable unit of execution that bundles an application and its dependencies (libraries, runtime, configuration) into a single artifact that runs consistently across any host environment. Containers provide process-level isolation and are the fundamental unit of scheduling in modern CaaS systems.
Borg: Google’s internal cluster management and scheduling system, developed in the early 2000s. Borg runs workloads across Google’s fleet by allocating containers to machines based on resource availability. It is the direct predecessor of Kubernetes and the empirical foundation for most of the chapter’s claims.
Kubernetes: The open-source cluster orchestration system that emerged from Google’s experience with Borg, released in 2014. Kubernetes is the dominant CaaS substrate in the industry and implements many of the same concepts as Borg.
Serving job: A long-running process that handles requests — typically an HTTP or RPC server. Serving jobs are expected to run indefinitely, respond to incoming traffic, and minimize latency. Availability and reliability are primary concerns.
Batch job: A workload that runs to completion — it processes some input, produces some output, and terminates. Throughput and resource efficiency are primary concerns. Batch jobs are more tolerant of preemption and rescheduling than serving jobs.
Submitted configuration (job spec): A declarative description of what a workload needs — how much CPU and memory, how many replicas, which container image, what health-check endpoints. The CaaS system takes responsibility for making reality match the specification.
Multitenancy: Running workloads from multiple users or teams on shared physical hardware, managed by a scheduler that allocates resources fairly and prevents workloads from interfering with each other.
Taming the Compute Environment
The Problem: Manual Machine Management Does Not Scale
The naive approach to running software is to provision a machine (physical or virtual), configure it, install dependencies, and deploy the application directly to it. This works at small scale but degrades rapidly as the number of machines grows:
- Configuration drift: Machines diverge from each other over time as patches, manual changes, and one-off fixes accumulate. “It works on this machine but not that one” becomes a routine problem.
- Deployment toil: Deploying a new version across hundreds of machines requires manual coordination or fragile scripts. Rollbacks are difficult. Partial failures leave the fleet in inconsistent states.
- Poor resource utilization: Machines sized for peak load are idle most of the time. Without a scheduler to pack workloads, most of the purchased hardware is wasted.
- Operational overhead grows linearly with fleet size: Every additional machine is another thing to monitor, patch, and eventually replace. The ops team becomes a bottleneck.
The Solution: Automation and Abstraction
Google’s answer to these problems was to stop thinking about individual machines and start thinking about a fleet managed by an automated scheduler:
- Containerization: Package each application and its dependencies as a container. The container runs identically on any machine with a compatible runtime — eliminating configuration drift at the application level.
- Declarative scheduling: Instead of specifying “deploy version X to machine Y,” engineers specify “I need 100 replicas of this container with 2 CPUs and 4 GB RAM each.” The scheduler decides where they run.
- Multitenancy: Run many workloads on each machine, packing them to maximize utilization. The scheduler enforces resource limits to prevent workloads from starving each other.
- Automated recovery: When a machine fails or a container crashes, the scheduler automatically reschedules the workload on a healthy machine. Engineers do not need to intervene for routine failures.
Containerization and Multitenancy
Containers serve two distinct purposes in a CaaS environment:
Portability: A container built on a developer’s laptop runs the same way in production. The runtime contract is the container image: everything the application needs above the kernel ships inside the image, so the application does not depend on what happens to be installed on the host. This eliminates most “works on my machine” problems.
Isolation: Containers provide resource isolation using Linux kernel primitives (cgroups for resource limits, namespaces for process and network isolation). This allows many containers to share a physical machine without interfering with each other’s CPU, memory, or filesystem.
Together, these properties make multitenancy practical. A single physical machine can run dozens of containers belonging to different services, different teams, and even different priority levels — all managed safely by the scheduler.
Without CaaS:
Team A --> Machine A (partly idle)
Team B --> Machine B (partly idle)
Team C --> Machine C (partly idle)
Total utilization: ~20-30%
With CaaS + Multitenancy:
Team A, Team B, Team C --> Shared Fleet
Scheduler packs all workloads
Total utilization: ~60-80%
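The utilization gain from packing can be illustrated with a small first-fit-decreasing placement routine. This is only a sketch under simplifying assumptions (a single CPU dimension, identical machines, no priorities); it is not how Borg or Kubernetes actually schedule, and the names `Machine`, `pack`, and `utilization` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cpu_capacity: float
    cpu_used: float = 0.0
    tasks: list = field(default_factory=list)

    def fits(self, cpu: float) -> bool:
        return self.cpu_used + cpu <= self.cpu_capacity

    def place(self, name: str, cpu: float) -> None:
        self.tasks.append(name)
        self.cpu_used += cpu


def pack(tasks: dict, cpu_capacity: float) -> list:
    """Place tasks on machines using first-fit-decreasing.

    `tasks` maps a task name to its CPU request. Largest requests are placed
    first; each goes on the first machine with room, and a new machine is
    opened only when none fits.
    """
    machines = []
    for name, cpu in sorted(tasks.items(), key=lambda kv: kv[1], reverse=True):
        if cpu > cpu_capacity:
            raise ValueError(f"{name} requests more CPU than any machine has")
        target = next((m for m in machines if m.fits(cpu)), None)
        if target is None:
            target = Machine(cpu_capacity)
            machines.append(target)
        target.place(name, cpu)
    return machines


def utilization(machines: list) -> float:
    """Fraction of total CPU capacity that is actually requested."""
    total = sum(m.cpu_capacity for m in machines)
    return sum(m.cpu_used for m in machines) / total if total else 0.0
```

With three 2-CPU workloads and 8-CPU machines, one machine per team gives 6/24 = 25% utilization, while `pack` places all three on a single machine for 6/8 = 75%.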
Writing Software for Managed Compute
Moving to a managed compute environment is not just an operational change — it changes what software should look like. The scheduler can kill any instance of your service at any time (to reclaim resources, to reschedule after a machine failure, or to rebalance the fleet). Software must be designed to survive this.
Architecting for Failure
The foundational design principle for managed compute is: assume your process will die at any time and design accordingly.
This means:
- Avoid in-process state that cannot be reconstructed: If your process is killed mid-operation, any state held only in memory is lost. Design for this loss to be tolerable — either the state is reproducible from durable storage, or the operation is idempotent so it can safely be retried.
- Make shutdown clean and fast: The scheduler sends a SIGTERM before killing a process, giving it a brief window (configurable, typically 30-90 seconds) to finish in-flight requests and close connections cleanly. Software that does not handle SIGTERM loses requests. Design graceful shutdown from the start.
- Embrace statelessness at the instance level: Prefer designs where any instance can handle any request. Sticky sessions (where a specific client must always reach a specific server instance) create coupling between the client and a specific process lifetime — coupling that the scheduler cannot honor across rescheduling events.
- Design for horizontal scaling: If your service can only run as a single instance, it becomes a single point of failure. The scheduler cannot help. Design services so that running more replicas improves both capacity and availability.
Anti-pattern — Instance Affinity: Designing a service so that a specific user’s requests must reach a specific server instance (e.g., because their session state is stored in that instance’s memory). This couples correctness to process identity. When the process is rescheduled, that user’s session is lost. The solution is to externalize session state to a distributed cache or database.
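A minimal sketch of SIGTERM handling, assuming a single-threaded serving loop that pulls requests from an in-memory list. The class `GracefulShutdown` and the function `serve` are illustrative names, not part of any scheduler's API. The handler only records that shutdown was requested; the loop finishes the request it is working on and then stops taking new ones.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks whether the scheduler has asked this process to stop."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread; signal.signal requires it.
        signal.signal(signal.SIGTERM, self.handle)

    def handle(self, signum, frame) -> None:
        # Only record the request; the serving loop does the actual draining.
        self._stop.set()

    @property
    def requested(self) -> bool:
        return self._stop.is_set()


def serve(pending: list, shutdown: GracefulShutdown, handle_request) -> list:
    """Process requests until the queue is empty or shutdown is requested.

    Returns the requests that were never started, so the caller can let
    them be retried against another replica.
    """
    while pending and not shutdown.requested:
        handle_request(pending.pop(0))
    return pending
```

In a real process, `install()` would be called once at startup. A request that is in flight when the signal arrives still completes, because the flag is only checked between requests.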
Batch vs. Serving: The Fundamental Distinction
Understanding whether a workload is a serving job or a batch job is essential to designing it correctly for managed compute.
| Dimension | Serving | Batch |
|---|---|---|
| Duration | Runs indefinitely | Runs to completion |
| Traffic model | Responds to incoming requests | Processes a defined input set |
| Primary concern | Latency, availability | Throughput, efficiency |
| Preemption tolerance | Low — losing an instance affects live users | High — work can be checkpointed and retried |
| Scaling trigger | Incoming request volume | Amount of work to process |
| Resource pricing | Steady-state reservation | Often lower-priority / spot resources |
Batch job design implications:
- Jobs should be checkpointable: if preempted mid-execution, the job should restart from the last checkpoint rather than from the beginning. Without checkpointing, long-running batch jobs become impractical on preemptible resources.
- Jobs should be idempotent: running a job twice (due to retry after failure) should produce the same result as running it once. Non-idempotent batch jobs that produce side effects (e.g., charging a credit card, sending an email) are dangerous in retry scenarios.
- Fan-out patterns: Large batch workloads can often be decomposed into many small independent tasks that run in parallel, completing faster and recovering from partial failures more cheaply.
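The checkpointing and idempotency points above can be sketched as follows. This assumes a job that sums a list of numbers and stores its progress in a local JSON file; a real batch job would keep the checkpoint in durable external storage. The function names and the `preempt_after` parameter (which simulates the scheduler killing the job) are invented for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> dict:
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run_job(items: list, path: str, preempt_after=None) -> int:
    """Sum `items`, resuming from the checkpoint at `path` if one exists.

    `preempt_after` simulates the scheduler killing the job after that many
    items have been processed in this run.
    """
    state = load_checkpoint(path)
    processed = 0
    for i in range(state["next_index"], len(items)):
        if preempt_after is not None and processed >= preempt_after:
            raise InterruptedError("preempted")
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
        processed += 1
    return state["total"]
```

Rerunning `run_job` after a preemption continues from the saved index, and rerunning it after completion does no further work and returns the same total, which is the idempotency property the retry path relies on.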
Serving job design implications:
- Health checks are mandatory: The scheduler needs to know whether a serving instance is ready to receive traffic and whether it is alive. Serving software must implement health-check endpoints that the scheduler polls. An instance that does not respond to health checks is removed from the load-balancer pool.
- Graceful degradation: When a downstream dependency is unavailable, a serving job should degrade gracefully (returning partial results, cached data, or controlled errors) rather than failing completely. A service that hard-fails on every dependency outage is brittle at scale.
- Replica count management: Serving jobs need enough replicas to handle peak traffic with headroom for instance failures. Too few replicas mean any instance failure reduces capacity noticeably.
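The distinction between "alive" and "ready to receive traffic" can be sketched as a pure function that maps a probe path to an HTTP status. This is an illustrative sketch only: `/healthz` appears in these notes as a health-check path, while `/readyz`, `InstanceState`, and `probe` are assumed names, and wiring the function into an actual HTTP server is left out.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    started: bool = False          # process has finished initializing
    dependencies_ok: bool = False  # e.g. database connection established
    draining: bool = False         # shutdown requested; stop taking traffic


def probe(path: str, state: InstanceState) -> tuple:
    """Return (HTTP status, body) for a health-check request.

    /healthz answers "is the process alive?"; /readyz answers "should the
    load balancer send traffic here?". Both paths are illustrative names.
    """
    if path == "/healthz":
        return (200, "ok") if state.started else (503, "starting")
    if path == "/readyz":
        ready = state.started and state.dependencies_ok and not state.draining
        return (200, "ready") if ready else (503, "not ready")
    return (404, "unknown probe")
```

Keeping the two answers separate lets a draining instance report itself alive (so it is not killed early) while reporting itself not ready (so it leaves the load-balancer pool).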
Managing State in Managed Compute
State is the hardest problem in managed compute. The scheduler treats instances as fungible: it can kill any instance and start a replacement anywhere in the fleet, on the assumption that the two are interchangeable. That assumption holds only if instances do not hold state that is unique to them.
Guiding principle: Externalize all state that must survive beyond a single process lifetime.
Appropriate external state stores by category:
| State Type | Appropriate Store |
|---|---|
| Session state | Distributed cache (e.g., Redis, Memcached) |
| Durable application data | Distributed database (SQL or NoSQL) |
| Inter-job coordination | Distributed lock service or queue |
| Large binary artifacts (models, datasets) | Object storage (e.g., GCS, S3) |
| Configuration | Centralized config service or environment variables |
The acceptable forms of local state:
- Read-only caches: A local in-memory cache of data from an authoritative external store is safe — if the instance is killed, the cache is cold on restart but the authoritative data is intact.
- In-flight work: The state of a request currently being processed. This is acceptable because, given appropriate timeouts and retry logic, the load balancer or the client retries the request if the instance fails partway through. The retried operation must be idempotent for this to be safe.
- Ephemeral scratch space: Temporary files or memory structures used purely for the duration of processing a single request.
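The read-only cache case can be sketched as a read-through cache in front of an authoritative store. The store client here is a stand-in backed by a dict (`DictStore`); a real one would talk to a distributed cache or database. Both class names are invented for the example, and expiry of stale entries is deliberately left out.

```python
class ReadThroughCache:
    """Per-instance cache in front of an authoritative external store."""

    def __init__(self, store) -> None:
        self._store = store   # hypothetical client; only .get(key) is assumed
        self._local = {}      # lost when the instance dies, which is fine
        self.misses = 0

    def get(self, key):
        if key not in self._local:
            self.misses += 1
            self._local[key] = self._store.get(key)
        return self._local[key]


class DictStore:
    """Stand-in for the external store, backed by a plain dict."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def get(self, key):
        return self._data[key]
```

Killing the instance corresponds to discarding the `ReadThroughCache` object: a new instance starts with an empty cache, pays one miss per key, and returns the same data because the store was never touched.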
Connecting to a Service in Managed Compute
In a managed compute environment, you cannot hardcode the IP address of a service dependency. The instances serving that dependency are constantly being scheduled, rescheduled, and replaced. Their IP addresses change.
The solution is service discovery: a mechanism by which a service registers its current address with a central directory, and clients look up the current address at request time.
Common patterns:
- DNS-based discovery: The scheduler registers each service under a stable DNS name. Clients resolve the DNS name at connection time, getting a current set of healthy IP addresses.
- Load balancer address: A stable virtual IP (VIP) fronts a service. Clients connect to the VIP; the load balancer routes to current healthy instances. The VIP does not change even as instances cycle.
- Service mesh: A sidecar proxy on each instance handles service-to-service communication, including discovery, load balancing, retries, and TLS. The application code connects to localhost; the sidecar handles routing.
The implication: application code should never hardcode addresses for dependencies. All dependencies are resolved through a stable, scheduler-aware indirection layer.
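A minimal sketch of DNS-based discovery using the standard library. It assumes the scheduler publishes healthy instances as address records under a stable name; the function name `resolve_backends` and the injectable `resolver` parameter are choices made for this example.

```python
import socket


def resolve_backends(name: str, port: int, resolver=socket.getaddrinfo) -> list:
    """Return the sorted, de-duplicated IP addresses currently behind `name`.

    Call this when opening a connection rather than once at startup, so that
    rescheduled instances are picked up. `resolver` is injectable for tests.
    """
    results = resolver(name, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0]
    # is the IP address for both IPv4 and IPv6.
    return sorted({entry[4][0] for entry in results})
```

The application holds only the stable name. Resolving it again on each new connection is what lets the set of addresses change underneath the client without a configuration change.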
One-Off Code
Not all compute falls neatly into serving or batch patterns. One-off code — administrative scripts, data migrations, exploratory analysis, ad-hoc fixes — still needs to run somewhere.
In a managed compute environment, one-off code should:
- Run as an interactive job that can be scheduled into the same cluster, getting the same network access and credentials as production workloads without requiring special machine access
- Use the same container infrastructure as production, so its environment is consistent
- Not require SSH access to production machines, which creates audit and security concerns
Google’s approach is to allow engineers to submit one-off containers to the cluster with interactive terminals, treating even ad-hoc work as scheduled compute rather than as special-case machine access. This maintains a clean boundary between “what runs in production” (scheduled containers) and “what individual engineers can do” (also scheduled containers, but with appropriate scope and permissions).
CaaS Over Time and Scale
Containers as an Abstraction Layer
The key insight behind containers is that they shift the contract between the application and the infrastructure. Without containers:
Application depends on: specific OS version, specific library versions,
specific filesystem layout, specific user accounts, specific ports
With containers:
Application depends on: the container runtime (a thin, stable interface)
Infrastructure depends on: nothing about the application
This abstraction has profound organizational implications:
- Infrastructure can evolve independently: The cluster operator can upgrade the host OS, replace hardware, or change networking topology without coordinating with every application team.
- Applications can evolve independently: An application team can update their dependencies, change their runtime version, or modify their filesystem layout without coordinating with the cluster operator.
- Portability is real: A container built for development can be deployed to staging and production without changes. The environment is the container, not the host.
One Service to Rule Them All
A critical inflection point in the evolution of compute infrastructure is moving from per-team or per-product scheduling infrastructure to a single shared compute platform used by all workloads across the organization.
Benefits of a unified compute platform:
- Higher utilization: Diverse workloads (serving, batch, ML training) have different resource usage patterns. A batch job’s idle time can be filled by a serving job’s burst, and vice versa. A shared fleet captures this diversity premium; isolated fleets waste it.
- Simpler operations: One platform to monitor, patch, and upgrade is simpler than N platform variants with different configurations, tooling, and runbooks.
- Consistent tooling: Engineers moving between teams find familiar infrastructure. Training, debugging tools, and observability systems apply uniformly.
- Economy of scale: Centralized procurement and capacity planning is more efficient than distributed, per-team resource acquisition.
The cost: A unified platform imposes constraints. Teams with unusual requirements (very high-performance networking, specialized hardware, specific OS configurations) may find the shared platform cannot accommodate them. The trade-off between standardization and customization is fundamental to CaaS adoption decisions.
Submitted Configuration: Declarative vs. Imperative
Modern CaaS systems use declarative configuration — engineers describe the desired state, and the system reconciles reality to match.
Imperative approach (pre-CaaS):
Step 1: Start machine A
Step 2: Install dependency X
Step 3: Copy binary Y
Step 4: Start service Z
Step 5: If Z is not running, restart it
Imperative configuration describes how to achieve a state. If any step fails, the system may be left in a partially configured state. Reasoning about what state the system is in requires tracing every command that has been run.
Declarative approach (CaaS):
replicas: 10
container: my-service:v42
cpu: 2
memory: 4Gi
healthCheck: /healthz
Declarative configuration describes what state is desired. The CaaS system is responsible for taking whatever actions are needed to reach that state, and for continuously reconciling if reality drifts from the spec. The spec is the source of truth; the current state of the fleet is an implementation detail.
Benefits of declarative configuration:
- Self-healing: If an instance dies, the scheduler creates a new one to maintain the declared replica count. No human intervention required.
- Auditable: The configuration file is the complete record of what is deployed. Version-controlling it gives a full history of what was deployed when.
- Idempotent: Applying the same configuration twice produces the same result. This makes deployments safe to retry.
- Testable: Configuration can be validated (linted, schema-checked) before being applied to production.
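The self-healing and idempotency properties follow from a reconciliation loop, which can be sketched as a single function. This is a toy model, not Borg's or Kubernetes's controller logic: the fleet is a dict from instance ID to image, `Spec` and `reconcile` are invented names, and a version change replaces every instance at once rather than rolling gradually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    image: str
    replicas: int


def reconcile(spec: Spec, running: dict, new_id) -> list:
    """Mutate `running` (instance id -> image) to match `spec`.

    `new_id` is a callable returning a fresh instance id. Returns the
    actions taken, so an empty list means "already converged".
    """
    actions = []
    # Stop instances running the wrong image.
    for instance, image in list(running.items()):
        if image != spec.image:
            del running[instance]
            actions.append(("stop", instance))
    # Stop surplus instances.
    while len(running) > spec.replicas:
        instance = sorted(running)[-1]
        del running[instance]
        actions.append(("stop", instance))
    # Start missing instances.
    while len(running) < spec.replicas:
        instance = new_id()
        running[instance] = spec.image
        actions.append(("start", instance))
    return actions
```

Calling `reconcile` a second time with the same spec returns no actions (idempotent), and calling it after an instance disappears from `running` starts exactly one replacement (self-healing).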
Anti-pattern — Configuration Drift: Manually modifying the live state of a cluster (e.g., using kubectl exec to edit a running container) without updating the declarative configuration. The live state diverges from the spec; the next deployment overwrites the manual change; the manual change is lost and the reason for it is lost. All changes to the desired state should go through the configuration file.
Choosing a Compute Service
Centralization vs. Customization
The core trade-off in choosing a compute service is between:
Centralization (use the shared platform):
- Lower operational burden — the platform team handles infrastructure concerns
- Consistent tooling, monitoring, and debugging experience
- Higher utilization through resource sharing
- Cost: constraints on what workloads can look like
Customization (run your own infrastructure):
- Full control over hardware selection, OS configuration, and networking
- Can optimize for highly specific requirements (e.g., GPU clusters, latency-sensitive networking)
- Cost: full operational burden falls on the team; no economy of scale; knowledge is siloed
Google’s guidance is strongly in favor of centralization for the vast majority of workloads. The cases where customization is justified are narrow: workloads with requirements so unusual that the shared platform genuinely cannot serve them. For most workloads, the operational savings from centralization far outweigh the constraints it imposes.
Level of Abstraction: Serverless
Serverless represents the extreme end of the abstraction spectrum in compute. Rather than declaring “I need N replicas of this container with X CPU and Y RAM,” serverless asks only: “run this function when this event occurs.”
The serverless model:
- The platform manages all resource allocation — the engineer does not specify CPU, memory, or replica count
- Billing is per invocation or per unit of compute consumed, not per reserved capacity
- Scaling from zero to peak and back is the platform’s responsibility
- Cold starts (added latency on the first invocation after scale-to-zero) are the main performance penalty
Trade-offs of serverless:
| Benefit | Cost |
|---|---|
| No capacity planning required | Loss of control over resource configuration |
| Scale to zero (zero cost when idle) | Cold start latency |
| Minimal operational overhead | Platform lock-in (serverless APIs are not portable) |
| Very fast to deploy simple functions | Difficult to run complex, long-running, or stateful workloads |
Serverless is well-suited to: event-driven processing, infrequent or unpredictable workloads, simple functions that transform or route data, and teams with limited operational capacity. It is poorly suited to: latency-sensitive serving, workloads with stable high throughput, or anything requiring persistent local state.
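The difference between per-invocation billing and reserved capacity can be made concrete with a break-even calculation. All prices below are hypothetical placeholders chosen for easy arithmetic, not any provider's rates, and the model ignores per-invocation duration and memory, which real serverless pricing usually includes.

```python
def reserved_cost(replicas: int, price_per_replica_hour: float, hours: float) -> float:
    """Cost of keeping `replicas` reserved for `hours`, regardless of traffic."""
    return replicas * price_per_replica_hour * hours


def per_invocation_cost(invocations: int, price_per_invocation: float) -> float:
    """Cost when billing is purely per invocation (nothing is paid when idle)."""
    return invocations * price_per_invocation


def break_even_invocations(replicas: int, price_per_replica_hour: float,
                           hours: float, price_per_invocation: float) -> float:
    """Invocation count above which reserved capacity becomes cheaper."""
    return reserved_cost(replicas, price_per_replica_hour, hours) / price_per_invocation
```

For example, with 2 replicas at a hypothetical 0.50 per replica-hour over a 720-hour month, reserved capacity costs 720; at a hypothetical 0.001 per invocation, the break-even is 720,000 invocations. Below that volume the per-invocation model is cheaper, and above it the reserved model is, which matches the guidance that stable high-throughput workloads are a poor fit for serverless.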
The SEG authors treat serverless not as a replacement for container-based CaaS but as a point on the abstraction spectrum. The right choice depends on the workload’s characteristics and the team’s operational priorities.
Public vs. Private Compute
The final axis of decision is where the compute runs:
Public cloud (AWS, GCP, Azure):
- No capital expenditure; pay for what you use
- Global footprint for distributing workloads close to users
- Rich managed service ecosystem (databases, queues, ML platforms)
- Cost: vendor lock-in risk; networking egress costs; less control over hardware
Private compute (on-premises or private cloud):
- Full control over hardware, network topology, and physical security
- May be required for regulatory compliance (data residency, audit requirements)
- Cost: high capital expenditure; requires in-house expertise to operate; capacity planning risk
Hybrid: Many large organizations run both — sensitive or latency-critical workloads on private infrastructure, commodity or elastic workloads on public cloud.
For organizations starting fresh, the book’s guidance leans toward public cloud for most workloads: the operational complexity of running private infrastructure is underestimated, and public cloud CaaS products (GKE, EKS, AKS) bring most of the benefits of CaaS without the investment in building it.
Google’s Internal Compute: Borg
Google’s compute evolution is the empirical basis for Chapter 25. The system is Borg, and its history illustrates why the principles described in the chapter emerged.
What Borg Is
Borg is a cluster manager that runs hundreds of thousands of jobs across multiple clusters, each containing up to tens of thousands of machines. It was developed in the early 2000s as Google’s first fleet-level abstraction over individual machines and became the operational substrate for virtually all Google services.
Key Borg concepts:
- Cells: A Borg cluster (called a “cell”) is a set of machines managed as a single unit. Jobs are submitted to a cell; Borg decides which machines within the cell to run them on.
- Tasks: The individual containers running within a job. A job with replicas=100 creates 100 tasks.
- Allocs: Resource reservations that can be shared between multiple tasks — allowing related processes (e.g., a main service and a sidecar log collector) to be co-located on the same machine.
- Priority and preemption: Borg supports multiple priority levels. High-priority jobs (production serving) can preempt lower-priority jobs (batch, development) to reclaim resources during high load.
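Priority-based preemption on a single machine can be sketched as follows. This is a simplified illustration, not Borg's actual algorithm: it considers only CPU, uses an assumed convention that a higher number means higher priority, and the names `Task` and `admit` are invented.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int   # higher number = more important (an assumed convention)
    cpu: float


def admit(new: Task, running: list, capacity: float):
    """Try to fit `new` on a machine, evicting lower-priority tasks if needed.

    Returns the list of evicted tasks, or None if `new` cannot be admitted
    even after evicting every lower-priority task. `running` is modified only
    when admission succeeds.
    """
    free = capacity - sum(t.cpu for t in running)
    victims = []
    # Consider the least important tasks first.
    for task in sorted(running, key=lambda t: t.priority):
        if free >= new.cpu:
            break
        if task.priority >= new.priority:
            break  # never evict equal or higher priority work
        victims.append(task)
        free += task.cpu
    if free < new.cpu:
        return None
    for task in victims:
        running.remove(task)
    running.append(new)
    return victims
```

Evicted batch tasks are expected to be rescheduled elsewhere and to resume from a checkpoint, which is why preemption tolerance is treated as a design requirement for batch work.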
Borg’s Relation to Kubernetes
Kubernetes was created by Google engineers who had spent years building and operating Borg. Many Kubernetes concepts map directly to Borg concepts:
| Borg | Kubernetes |
|---|---|
| Cell | Cluster |
| Task | Container (running inside a Pod) |
| Job | Deployment / ReplicaSet |
| Alloc | Pod with multiple containers |
| Borglet (agent) | Kubelet |
| Borg Master | kube-apiserver + scheduler |
The primary difference: Kubernetes was designed to be open-source and usable by organizations without Google’s infrastructure, so it is more modular, more configurable, and makes fewer assumptions about the underlying hardware.
Lessons Learned from Borg
The chapter uses Borg’s operational history to ground its recommendations:
- Containerization was the right bet: The portability and isolation properties of containers were essential for Borg to run diverse workloads on shared hardware. Without containers, multitenancy at Borg’s scale would have required impractical per-machine configuration management.
- Declarative configuration scaled; imperative scripts did not: Early Borg jobs were described imperatively. As the fleet grew, this became unmaintainable. The shift to declarative job specifications (which became the proto-Kubernetes YAML model) was essential for operational sanity.
- Architecting for failure is non-negotiable at Google’s scale: At the scale of tens of thousands of machines, hardware failures are not exceptional events; they are the normal operating condition. Every machine in the fleet will fail; the question is when, not whether. Software that is not architected for failure will experience regular outages. The CaaS principles described in this chapter emerged from hard operational experience with this reality.
- Resource reclamation (overcommitment): Borg learned that many jobs request more resources than they actually use. Borg overcommits resources, scheduling jobs as if the cluster has more capacity than it does and banking on the fact that not all jobs will hit their reserved peaks simultaneously. This dramatically improves utilization but requires the scheduler to be able to preempt low-priority jobs when high-priority jobs need resources back.
TL;DRs
- Scale requires moving away from managing individual machines and toward a compute service that manages a fleet.
- CaaS tames the compute environment through automation, containerization, and multitenancy.
- Containers provide a stable abstraction layer between applications and the machines they run on, enabling portability and isolation.
- The key design challenge for software running in a managed compute environment is architecting for failure: any instance can be killed at any time.
- Serving and batch jobs have fundamentally different design requirements — latency/availability for serving, throughput/checkpointability for batch.
- State must be externalized to survive rescheduling; instance-local state is acceptable only for caches and in-flight work.
- Service discovery replaces hardcoded addresses in managed compute environments where instance addresses are not stable.
- Declarative configuration (describing desired state) is preferable to imperative scripting (describing how to achieve state) because it is self-healing, auditable, and idempotent.
- The choice between centralized and customized compute is a trade-off between operational simplicity and the ability to meet unusual requirements.
- Serverless is the highest abstraction level — useful for event-driven and sporadic workloads, costly for latency-sensitive or stable high-throughput workloads.
- Google’s Borg system, Kubernetes’s direct ancestor, validated these principles at scale across Google’s entire fleet.
Key Takeaways
- CaaS moves the unit of concern from machines to workloads — engineers declare what they need to run; the scheduler decides where it runs, replacing manual machine management with automated fleet management.
- Containers are the key abstraction — they decouple applications from host environments, enabling multitenancy, portability across environments, and independent evolution of infrastructure and application.
- Architecting for failure is a first-class design constraint — in a managed compute environment, process termination is routine; software that externalizes state, handles SIGTERM gracefully, and avoids instance affinity is resilient to rescheduling.
- Serving and batch jobs have opposite optimization targets — serving jobs optimize for latency and availability; batch jobs optimize for throughput and should be checkpointable and idempotent to survive preemption.
- State must be externalized — instance-local state that must survive process death should live in a distributed cache, database, or object store; local state is acceptable only for caches and in-flight work within a single request lifecycle.
- Service discovery is mandatory — hardcoded IP addresses are incompatible with managed compute; stable DNS names, virtual IPs, or service meshes provide the indirection layer that survives instance churn.
- Declarative configuration enables self-healing and auditability — specifying desired state (rather than imperative steps) allows the scheduler to continuously reconcile reality, version-control the deployment spec, and safely retry deployments.
- Unified compute platforms capture a utilization premium — diverse workloads running on a shared fleet allow the scheduler to fill idle capacity; isolated per-team fleets waste this opportunity.
- Serverless maximizes abstraction at the cost of control — it eliminates capacity planning and operational overhead but introduces cold-start latency, platform lock-in, and unsuitability for complex long-running workloads.
- Borg validated these principles at Google scale — Kubernetes carries forward Borg’s core design decisions, making the lessons of the chapter directly applicable to the dominant open-source CaaS system in the industry.
Related Resources
- ch13-test-doubles — References hermetic testing environments, which depend on the CaaS infrastructure described here
- ch21-dependency-management — Dependency management interacts with container image construction and artifact versioning in CaaS environments
- ch24-continuous-delivery — CD pipelines are the primary mechanism by which new container images are deployed to CaaS clusters
- Kubernetes documentation — Direct open-source descendant of Borg; the practical implementation of this chapter’s concepts
- “Large-scale cluster management at Google with Borg” (Verma et al., EuroSys 2015) — The Google research paper that formally described Borg’s design and the empirical basis for this chapter
Last Updated: 2026-06-02