Chapter 5: Encoding and Evolution

ddia-2e encoding serialization schema-evolution durable-execution event-driven

Status: Notes complete

Overview

Applications change continuously — fields are added, types change, features are removed. This chapter asks: how do you encode data in a way that lets the system evolve over time without breaking existing readers and writers? The core insight is that encoding format is not just a serialization detail; it is a contract between systems that must be explicitly managed. The 2nd edition significantly expands coverage beyond the 1st edition by adding Durable Execution (Temporal, AWS Step Functions) and a richer treatment of Event-Driven Architectures — both paradigms where encoding and evolution contracts are especially critical because messages outlive the code that wrote them.

The chapter covers three orthogonal concerns: what format to use for encoding (JSON, Protobuf, Avro), how to evolve schemas safely without downtime, and how encoded data flows through systems (databases, services, workflow engines, event streams).

Key Concepts

In-Memory vs On-Wire Representation

Programs keep data in memory as objects, structs, lists, hashmaps — all relying on pointers. When data must be written to disk or sent over a network, those pointers become meaningless. The data must be encoded (serialized/marshalled) into a self-contained byte sequence, and later decoded (deserialized/parsed/unmarshalled) back into memory.

This encoding step is where compatibility problems arise: if the encoder and decoder were built against different versions of the schema, the decode may fail or produce wrong results. The choice of encoding format determines how gracefully those mismatches are handled.

Language-Specific Formats

Java’s Serializable, Python’s pickle, Ruby’s Marshal — each language has a built-in serialization mechanism. These are convenient but dangerous for production data:

Language lock-in: A Java-serialized blob can only be decoded by Java. Cross-language interop is impossible.
Poor versioning: Adding a field to a Java class can silently break deserialization of older blobs.
Security vulnerability: Deserializing untrusted bytes can execute arbitrary code (well-known Java deserialization exploits).
Fragile to refactoring: Renaming a class or field path can permanently break existing stored data.

Rule: Never use language-specific formats for data that crosses process boundaries or will be stored persistently.
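
The security point can be made concrete with Python's pickle. The sketch below is deliberately harmless: the class name Exploit and the evaluated expression are invented for the demo, and a real attacker would substitute something far more damaging.

```python
import pickle


class Exploit:
    """Hypothetical payload class; any object may define __reduce__."""

    def __reduce__(self):
        # pickle stores "call this callable with these arguments" and runs it on load.
        # A harmless eval stands in for attacker-controlled code.
        return (eval, ("'attacker code ran: ' + str(6 * 7)",))


malicious_bytes = pickle.dumps(Exploit())

# The victim only calls loads(); the bytes decide which callable gets executed.
result = pickle.loads(malicious_bytes)
print(result)
```

Note that loads() does not even return an Exploit instance: it returns whatever the embedded call produced, which is why untrusted pickles cannot be made safe by inspecting the result afterwards.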

JSON, XML, and Binary Variants

JSON has become the dominant encoding for external APIs and configuration because it is human-readable and universally supported. Its limitations:

Numbers are ambiguous: no distinction between integer and float; integers > 2^53 lose precision in JavaScript.
No native binary type: binary data must be base64-encoded (adds ~33% overhead).
No enforced schema: any key name and value type is valid, making contracts implicit.
Verbose: field names are repeated in every record.
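
Two of these limitations are easy to check numerically. The snippet below is a small sketch: float() is used as a stand-in for a JavaScript parser (which stores every number as an IEEE 754 double), and the 300-byte payload is an arbitrary choice.

```python
import base64
import json

# 2**53 + 1 cannot be represented exactly as a double.
big = 2**53 + 1
as_double = float(big)
precision_lost = int(as_double) != big

# Python's own json module keeps integers exact, so the loss depends on the reader.
round_tripped = json.loads(json.dumps({"id": big}))["id"]

# Binary data must be wrapped as text; base64 turns every 3 bytes into 4 characters.
raw = bytes(300)
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw) - 1

print(precision_lost, round_tripped == big, len(encoded), round(overhead, 3))
```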

XML is even more verbose and has awkward distinctions between attributes and elements. It is largely confined to legacy enterprise systems (SOAP, RSS feeds).

CSV has no type system at all and is ambiguous around commas and newlines embedded in values. Useful for one-off bulk transfers but not for system-to-system communication.

MessagePack is a binary-compact variant of JSON: same data model, but values are encoded more compactly (about 20% smaller than JSON for a small record: 66 bytes versus 81 in the size comparison in these notes). It retains JSON’s weakness of no enforced schema.

Protocol Buffers

Protocol Buffers (Protobuf) was developed at Google and is used for virtually all of Google’s internal communication. It requires writing a schema in an IDL (.proto file):

message Person {
  required string user_name = 1;
  optional int64  favorite_number = 2;
  repeated string interests = 3;
}

How Protobuf encoding works:

Each field in the wire format is identified by its field tag (the number 1, 2, 3) and a type annotation, NOT by its name.
The field name is purely a developer convenience — it can be changed freely without breaking wire compatibility.
Field tags are the canonical identity of a field. They must never be reused for different fields.
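
To see why the name never reaches the wire, here is a hand-rolled sketch of the encoding for the Person message. It is an illustration, not the protobuf library: it handles only non-negative integers and strings, and the sample values ("Martin", 1337, the two interests) are assumptions chosen for the example.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # the two Protobuf wire types used here


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_tag: int, wire_type: int) -> bytes:
    # The key combines the field tag and the wire type; the name is not involved.
    return encode_varint((field_tag << 3) | wire_type)


def encode_int(field_tag: int, value: int) -> bytes:
    return encode_key(field_tag, VARINT) + encode_varint(value)


def encode_str(field_tag: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_key(field_tag, LENGTH_DELIMITED) + encode_varint(len(data)) + data


person = (
    encode_str(1, "Martin")           # user_name = 1
    + encode_int(2, 1337)             # favorite_number = 2
    + encode_str(3, "daydreaming")    # interests = 3 (repeated: one entry per value)
    + encode_str(3, "hacking")
)
print(len(person), person.hex())
```

Renaming user_name in the schema would leave these bytes unchanged, whereas changing the tag from 1 to something else would change the first byte of the message.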

Backward and forward compatibility:

Add a new optional field with a new tag number → backward and forward compatible (old code ignores unknown tags; new code uses default if field absent).
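Remove an optional field → backward and forward compatible (old code reading new data sees the field as absent and falls back to its default; new code reading old data skips the now-unknown tag). The removed tag number must never be reused.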
Never change a field’s tag number or type — old data encoded with the old tag becomes misinterpreted.
Never add required fields — old data was encoded without them; new code would reject valid old data.
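
The rules above follow from how a decoder walks the bytes. The sketch below is a simplified decoder, not the real protobuf runtime: it supports only varint and length-delimited fields, and the email field with tag 4 is a hypothetical addition made by "new" code.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode(buf: bytes, known_fields: dict) -> dict:
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                    # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known_fields:                 # unknown tags are skipped, not rejected
            record[known_fields[tag]] = value
    return record


V1 = {1: "user_name", 2: "favorite_number"}
V2 = {**V1, 4: "email"}

old_message = b"\x0a\x06Martin" + b"\x10\xb9\x0a"
new_message = old_message + b"\x22\x03a@b"      # tag 4, length 3, "a@b"

seen_by_old_code = decode(new_message, V1)       # forward compatibility
seen_by_new_code = decode(old_message, V2)       # backward compatibility
email = seen_by_new_code.get("email", b"")       # absent field falls back to a default
print(seen_by_old_code, seen_by_new_code, email)
```

The wire type tells the old decoder how many bytes to skip even though it has no idea what tag 4 means, which is what makes unknown fields ignorable.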

Proto3 vs proto2: Proto3 removed required fields and, in its original form, stopped tracking presence for singular scalar fields. This makes accidentally-breaking schema changes less likely, at the cost of losing the ability to distinguish “field absent” from “field present with zero value”. Later proto3 releases reintroduced the optional keyword for fields that need explicit presence.

Avro

Apache Avro was created for the Hadoop ecosystem and takes a fundamentally different approach to compatibility than Protobuf. There are no field tags in the wire format — Avro binary is just values in sequence, with no field identifiers at all. This makes Avro binaries even more compact than Protobuf, but means the schema must be known to decode the data.

The writer’s schema / reader’s schema mechanism:

The writer encodes data according to the writer’s schema (the schema in effect when the data was written).
The reader decodes using the reader’s schema (the schema the reader’s code was compiled against).
The Avro library compares the two schemas and resolves differences at read time, field by field, matched by name (not tag number).

Resolution rules:

Field in writer’s schema but not in reader’s: value is ignored.
Field in reader’s schema but not in writer’s: reader uses the field’s declared default value.
Both have the field: value is decoded and type-promoted if possible (e.g., int → long is safe).
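
The three resolution rules can be sketched in a few lines. This is a simplified model rather than the Avro library: values are held in a Python list in the writer's field order instead of Avro binary, type promotion is left out, and the nickname field is a made-up example of a field the reader no longer knows.

```python
def resolve(writer_fields, reader_fields, values):
    """Map values written in writer order onto the reader's schema, by field name."""
    written = dict(zip((f["name"] for f in writer_fields), values))
    record = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            record[name] = written[name]        # both schemas have the field
        elif "default" in field:
            record[name] = field["default"]     # reader-only field: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record                               # writer-only fields are dropped


writer_schema = [
    {"name": "userName"},
    {"name": "favoriteNumber", "default": None},
    {"name": "nickname"},                       # hypothetical field, removed later
]
reader_schema = [
    {"name": "userName"},
    {"name": "favoriteNumber", "default": None},
    {"name": "email", "default": ""},
]

resolved = resolve(writer_schema, reader_schema, ["Martin", 1337, "Marty"])
print(resolved)
```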

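Consequence: Renaming a field in Avro is not free the way it is in Protobuf. The reader’s schema can list aliases so that new readers still match the writer’s old field name, which makes a rename backward compatible but not forward compatible. A field can only be added or removed compatibly if it has a default value.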

How the reader knows the writer’s schema: Two mechanisms:

Schema registry: For Kafka streams, each message carries a 4-byte schema ID; the reader fetches the full schema from a central schema registry.
Embedded in file: For Hadoop files (Avro container format), the schema is written once at the start of the file, then all records follow.
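
For the registry case, a consumer first has to peel the schema ID off the message. The sketch below assumes the framing used by Confluent's serializers (one zero "magic" byte, then a big-endian 4-byte schema ID, then the Avro body); the registry is faked with a dictionary, and the ID 42 is arbitrary.

```python
import struct

HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def split_message(message: bytes):
    """Return (schema_id, avro_body) for a framed message."""
    magic, schema_id = HEADER.unpack_from(message)
    if magic != 0:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]


# Stand-in for a schema registry lookup (hypothetical contents).
fake_registry = {42: '{"type": "record", "name": "Person", "fields": []}'}

framed = HEADER.pack(0, 42) + b"avro-body-bytes"
schema_id, body = split_message(framed)
writer_schema_json = fake_registry[schema_id]
print(schema_id, body)
```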

Avro schema example (JSON form, as stored in an .avsc file):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",       "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "email",          "type": "string", "default": ""}
  ]
}

Encoding Efficiency Comparison

For a typical record with 3 fields (string, integer, repeated string):

Format	Approximate Size	Notes
JSON	~81 bytes	Field names in every record
MessagePack	~66 bytes	Binary JSON; same model
Thrift BinaryProtocol	~59 bytes	Field tags + type codes
Thrift CompactProtocol	~34 bytes	Varint-encoded tags
Protocol Buffers	~33 bytes	Field tags + varints
Avro	~32 bytes	No tags at all

At the scale of millions of events per second (Kafka, event logs), the 2.5x compaction ratio between JSON and Avro translates to meaningful cost savings in storage and network bandwidth.
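
The ratios can be recomputed from the table. In the sketch below the sample record is an assumption (the notes give only the field types), chosen so that its compact JSON form comes to the 81 bytes listed above; the other sizes are copied from the table rather than measured.

```python
import json

# Assumed sample record: one string, one integer, one list of strings.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
json_size = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))

# Sizes in bytes, taken from the comparison table.
sizes = {
    "JSON": 81,
    "MessagePack": 66,
    "Thrift BinaryProtocol": 59,
    "Thrift CompactProtocol": 34,
    "Protocol Buffers": 33,
    "Avro": 32,
}
relative = {name: round(size / sizes["JSON"], 2) for name, size in sizes.items()}
json_to_avro = round(sizes["JSON"] / sizes["Avro"], 2)

print(json_size, relative, json_to_avro)
```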

The Merits of Schemas

It might seem that schema-less formats (JSON) are simpler. But explicit schemas provide important guarantees:

Documentation: The schema is the authoritative description of the data shape.
Validation: Encoders can reject malformed data before it propagates.
Code generation: Typed client/server stubs can be generated in any language from .proto or .avsc files.
Compatibility enforcement: Schema registries can reject schema changes that would break consumers.
Smaller encoding: Field names don’t need to be repeated in each record.

The industry has largely moved toward schema-first development for internal service-to-service communication, using either Protobuf (for gRPC) or Avro (for Kafka).

Modes of Dataflow

Dataflow Through Databases

When an application writes to a database, it encodes the data; when it reads back, it decodes. In long-running production systems this is never a clean one-to-one version mapping:

An older version of the application may write data; a newer version reads it (backward compatibility needed).
During a rolling upgrade, both old and new versions run simultaneously and both access the same database (forward compatibility also needed).

The unknown-field preservation problem: Suppose App v2 adds a new email field. App v1 reads a record written by v2, processes it, and writes it back — but App v1’s ORM has no email field in its model. If the ORM drops unknown fields on write-back, the email value is permanently lost. The correct behavior is to preserve and re-emit all fields, including those not understood by the current code version. Many ORMs and serialization libraries silently drop unknown fields, requiring careful configuration or explicit handling.
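
One way to get the preserving behavior is to keep the fields a version does not understand in a side dictionary and write them back untouched. The sketch below assumes JSON-encoded rows and a hypothetical UserV1 model; real ORMs differ in whether and how they expose this.

```python
import json
from dataclasses import dataclass, field

KNOWN_FIELDS = {"user_id", "name"}


@dataclass
class UserV1:
    user_id: int
    name: str
    extras: dict = field(default_factory=dict)  # fields this version does not understand

    @classmethod
    def from_json(cls, text: str) -> "UserV1":
        data = json.loads(text)
        extras = {k: v for k, v in data.items() if k not in KNOWN_FIELDS}
        return cls(data["user_id"], data["name"], extras)

    def to_json(self) -> str:
        # Re-emit the unknown fields alongside the ones this version owns.
        return json.dumps({**self.extras, "user_id": self.user_id, "name": self.name})


# Row written by App v2, which already knows about email.
stored = json.dumps({"user_id": 7, "name": "Ada", "email": "ada@example.com"})

user = UserV1.from_json(stored)   # App v1 reads the row ...
user.name = "Ada L."              # ... changes a field it understands ...
written_back = json.loads(user.to_json())
print(written_back)
```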

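Schema migrations: On modern relational databases (MySQL 8, PostgreSQL 12+), many ALTER TABLE changes, such as adding a nullable column or a column with a constant default, are metadata-only and close to instantaneous; others, such as many column type changes, can still rewrite the table. Adding a NOT NULL column without a default remains dangerous during rolling upgrades, because old code that does not know about the column will fail on insert.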

Dataflow Through Services: REST and RPC

REST (REpresentational State Transfer):

Builds on HTTP semantics: GET (safe, idempotent), POST (non-idempotent create), PUT (idempotent replace), PATCH (partial update, not necessarily idempotent), DELETE (idempotent)
Resources identified by URLs; representation in JSON (usually) or XML
Stateless: each request carries all context needed
Native HTTP caching for GET responses
OpenAPI 3.1 for schema documentation and code generation
API versioning via URL path (/v2/users) or Accept header

SOAP:

XML-based protocol with WSDL for service description
Complex, verbose, now largely replaced by REST for new APIs
Still common in financial services and government systems

RPC (Remote Procedure Call):

Goal: make calling a remote service look like a local function call (transparent distribution)
The abstraction is leaky: network calls are different from local calls in fundamental ways:
- Unpredictable latency (milliseconds to seconds)
- Partial failure: request reached server but response was lost — did the operation execute?
- Idempotency: can the caller safely retry?
- Serialization/deserialization cost has no equivalent for local calls
Modern RPC frameworks (gRPC, Thrift, Finagle) acknowledge this leakiness and require explicit timeout/retry/idempotency handling.
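
The partial-failure and retry points can be illustrated with an in-process simulation. Everything here is hypothetical (PaymentServer, FlakyNetwork, and the idempotency-key scheme are invented for the sketch); the idea is only that the client reuses one key across retries so the server can detect a repeat.

```python
import uuid


class PaymentServer:
    def __init__(self):
        self.responses = {}   # idempotency key -> response already produced
        self.charges = 0      # how many times money actually moved

    def charge(self, key: str, amount: int) -> str:
        if key not in self.responses:
            self.charges += 1
            self.responses[key] = f"charged {amount}"
        return self.responses[key]


class FlakyNetwork:
    """Delivers every request, but loses the first response."""

    def __init__(self, server: PaymentServer, lost_responses: int = 1):
        self.server = server
        self.lost_responses = lost_responses

    def call(self, key: str, amount: int) -> str:
        response = self.server.charge(key, amount)   # the operation did execute
        if self.lost_responses > 0:
            self.lost_responses -= 1
            raise TimeoutError("response lost")
        return response


def charge_with_retry(network: FlakyNetwork, amount: int, attempts: int = 3) -> str:
    key = str(uuid.uuid4())   # one key shared by all attempts
    for _ in range(attempts):
        try:
            return network.call(key, amount)
        except TimeoutError:
            continue
    raise TimeoutError("gave up after retries")


server = PaymentServer()
response = charge_with_retry(FlakyNetwork(server), 100)
print(response, server.charges)
```

Without the key, the retry after the lost response would have charged the customer twice; a local function call has no equivalent of this failure mode.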

gRPC is the dominant modern RPC framework:

Protocol Buffers over HTTP/2
Bidirectional streaming as first-class feature
Code generation in 10+ languages from .proto files
Used by Kubernetes, etcd, Envoy, and most cloud-native infrastructure

REST vs gRPC comparison:

Dimension	REST	gRPC
Protocol	HTTP/1.1 or HTTP/2	HTTP/2 only
Format	JSON (usually)	Protocol Buffers
Schema	Optional (OpenAPI)	Required (`.proto`)
Code generation	Optional	Built-in
Streaming	Limited (SSE, long-poll)	Bidirectional streaming
Caching	Native HTTP caching	Manual
Debugging	Easy (curl, browser)	Needs specialized tools (grpcurl)
Best for	Public APIs, web clients	Internal microservices

Durable Execution and Workflows

This is a major new section in the 2nd edition, addressing a class of problems that RPC and message queues don’t solve well.

The problem: Some operations take minutes, hours, or days — placing an order, onboarding a user, processing a payment batch. These multi-step workflows must survive process failures, server restarts, and deployments. Naively, you might write a long-running process or chain together async jobs — but these approaches require bespoke state management and are fragile.

What durable execution offers: A programming model where the code appears to run as an ordinary function but the execution engine automatically persists state to a database after each step. If the process crashes mid-workflow, execution resumes from the last checkpointed state rather than restarting from the beginning.

Key properties of durable execution:

Fault tolerance: Workflow survives process crashes, restarts, and deployments.
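Effectively-once activity execution: Each activity (step) is executed at least once and may be retried after a failure, so activities should be idempotent. The framework records each completed activity’s result in the workflow history and does not re-run it on replay, which gives effectively-once semantics at the workflow level.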
Long duration: A workflow can run for days, months, or even years.
Visibility: Execution history is persisted; you can query workflow state, get audit trails, and debug failures.
Deterministic replay: To resume a workflow after a crash, the engine replays the event history. Workflow code must be deterministic — given the same history, it must produce the same decisions. This means: no random numbers, no system clock reads, no direct I/O (all non-deterministic actions go through the framework).
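
The replay idea can be sketched with a toy engine. This is not how Temporal or any real engine is implemented: the history is a plain list standing in for a durable store, the crash is simulated with an exception, and all names (Engine, order_workflow, charge_card, send_email) are invented for the example.

```python
class Crash(Exception):
    """Simulates the worker process dying mid-workflow."""


class Engine:
    def __init__(self, history: list):
        self.history = history    # stand-in for the persisted event history
        self.position = 0

    def run_activity(self, fn, *args):
        if self.position < len(self.history):
            result = self.history[self.position]   # replay: reuse the recorded result
        else:
            result = fn(*args)                     # first execution of this step
            self.history.append(result)            # checkpoint before moving on
        self.position += 1
        return result


calls = []


def charge_card(order_id):
    calls.append("charge")
    return f"receipt-{order_id}"


def send_email(receipt):
    calls.append("email")
    if calls.count("email") == 1:
        raise Crash("worker died before the email step completed")
    return "sent"


def order_workflow(engine, order_id):
    # Deterministic: the same history always leads to the same sequence of steps.
    receipt = engine.run_activity(charge_card, order_id)
    return engine.run_activity(send_email, receipt)


history = []
try:
    order_workflow(Engine(history), "123")
except Crash:
    pass

# A new worker replays the same history; charge_card is not executed again.
result = order_workflow(Engine(history), "123")
print(result, calls, history)
```

If order_workflow branched on a random number or the wall clock, the second run could take a different path and the recorded results would no longer line up with the steps, which is why such calls have to go through the engine.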

How Temporal works:

Developer writes workflow code (Go, Java, Python, TypeScript) using the Temporal SDK.
Temporal server persists a complete event history of the workflow to a durable store (usually Cassandra or PostgreSQL).
Worker processes poll for tasks, execute workflow/activity code, and report results back to the server.
On failure, a new worker picks up the workflow and replays the event history to reconstruct in-memory state.
Activities (external calls, database writes, API calls) are retried with configurable retry policies.

Temporal Architecture:

  Client ──→ Temporal Server (event log in DB)
                    │
           ┌────────┴──────────┐
           ↓                   ↓
     Workflow Worker      Activity Worker
     (replay history)     (executes steps)

AWS Step Functions is the managed cloud alternative:

Workflows defined as JSON state machines (Amazon States Language)
States: Task, Choice, Wait, Parallel, Map, Pass, Fail, Succeed
Execution history stored in AWS infrastructure; no server to manage
Two modes: Standard (durable, up to 1 year) and Express (high-throughput, up to 5 minutes)
Task states invoke Lambda functions, ECS tasks, or other AWS services
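
A small state machine in Amazon States Language might look like the following, built here as a Python dictionary and serialized to JSON. The state names, the Lambda ARN, and the $.approved input field are hypothetical; the helper only checks that every transition points at a state that exists.

```python
import json

state_machine = {
    "Comment": "Hypothetical order approval flow",
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            # Placeholder ARN, not a real function.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
            "Next": "WaitForApproval",
        },
        "WaitForApproval": {"Type": "Wait", "Seconds": 3600, "Next": "IsApproved"},
        "IsApproved": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.approved", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "Rejected",
        },
        "Done": {"Type": "Succeed"},
        "Rejected": {"Type": "Fail", "Error": "OrderRejected", "Cause": "Not approved"},
    },
}


def referenced_states(machine: dict) -> set:
    """Collect every state name that some transition points to."""
    targets = {machine["StartAt"]}
    for state in machine["States"].values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets


missing = referenced_states(state_machine) - set(state_machine["States"])
definition_json = json.dumps(state_machine, indent=2)
print(missing or "all transitions resolve")
```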

Azure Durable Functions extends Azure Functions with workflow orchestration:

Durable orchestrator functions that call activity functions
State is transparently persisted to Azure Storage
Similar replay-based execution model to Temporal

When to use durable execution vs alternatives:

Scenario	Best Approach
Multi-step workflow, minutes to hours	Durable execution (Temporal)
Short-lived async job (<30 seconds)	Message queue (SQS, RabbitMQ)
Event fan-out to many consumers	Event streaming (Kafka)
Human-in-the-loop approval	Durable execution (long timers)
Periodic batch job	Cron + task queue
Real-time stream processing	Stream processor (Flink, Kafka Streams)

Durable execution and encoding: Because workflow state is persisted across versions, encoding/evolution issues are acute. If the event history stored from v1 of the workflow must be replayed by v2’s code, the encoded activity inputs and outputs must be backward compatible. Temporal uses JSON by default (with Codec API for custom encoding); Protobuf is recommended for production workflows with strict schema evolution.

Event-Driven Architectures

The book distinguishes three fundamentally different patterns that all go by the name “event-driven”, but have very different semantics:

Pattern 1: Event Notification

The simplest pattern. A service emits a small notification that something happened. The event carries minimal data — just enough to trigger the consumer to act. The consumer typically calls back the originating service to get full details.

Order Service ──→ [order.created {orderId: "123"}] ──→ Notification Service
                                                         ↓
                                           GET /orders/123 (fetches full order)

Characteristics:

Very low coupling: consumers only know the event type and a reference
Consumer must make an additional request to get data (coupling on the API)
Good for: triggering side effects, fan-out notifications
Weakness: if the source is unavailable, consumers can’t complete their work

Pattern 2: Event-Carried State Transfer

The event carries the full state — the complete record of the entity at the time of the event. Consumers are self-sufficient; they don’t need to call back.

Order Service ──→ [order.created {orderId: "123", items: [...], total: 49.99, ...}] ──→ Analytics Service
                                                                                          ↓
                                                                            Stores locally, no callback needed

Characteristics:

Consumer autonomy: no dependency on source API availability
Higher event size and bandwidth
Consumer may maintain a local replica of the source data (read model)
Good for: building read models, analytics, search indexes, reporting
Weakness: data duplication; schema evolution must be managed carefully
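
The difference between Pattern 1 and Pattern 2 shows up when the source goes down. The sketch below simulates both consumers in one process; OrderService, the two handler functions, and the event shapes are assumptions made for the illustration.

```python
class OrderService:
    def __init__(self):
        self.orders = {"123": {"orderId": "123", "items": ["book"], "total": 49.99}}
        self.available = True

    def get_order(self, order_id: str) -> dict:
        """Stand-in for GET /orders/{id}."""
        if not self.available:
            raise ConnectionError("order service is down")
        return self.orders[order_id]


source = OrderService()
read_model = {}   # local replica kept by the state-transfer consumer


def handle_notification(event: dict) -> dict:
    # Pattern 1: the event is only a reference, so the consumer must call back.
    return source.get_order(event["orderId"])


def handle_state_transfer(event: dict) -> dict:
    # Pattern 2: the event carries the full state, so it is stored locally.
    read_model[event["orderId"]] = event["order"]
    return read_model[event["orderId"]]


notification = {"type": "order.created", "orderId": "123"}
full_event = {"type": "order.created", "orderId": "123",
              "order": dict(source.orders["123"])}

handle_state_transfer(full_event)
source.available = False          # the source service goes down

try:
    handle_notification(notification)
    notification_ok = True
except ConnectionError:
    notification_ok = False

local_copy = read_model["123"]
print(notification_ok, local_copy)
```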

Pattern 3: Event Sourcing

All state changes are stored as an immutable, append-only log of events. The current state of an entity is derived by replaying its event history. This is a persistence pattern, not just a communication pattern.

Events (immutable log):
  order.created   {orderId: "123", items: [...]}
  order.item.removed {orderId: "123", itemId: "456"}
  order.payment.received {orderId: "123", amount: 49.99}
  order.shipped   {orderId: "123", trackingId: "T789"}

Current State = fold(events, initial_state)

Characteristics:

Complete audit trail: can reconstruct state at any point in time
Time travel: query past state by replaying events up to a timestamp
Multiple projections: derive different read models (SQL table, search index, graph) from the same event log
Enables CQRS (Command Query Responsibility Segregation): separate write model (events) from read model (projections)
Schema evolution challenge: old events must remain replayable as the domain model evolves — use upcasters (functions that transform old event formats to new)
Weakness: eventual consistency of projections; snapshot management for performance
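
The fold and the upcaster idea fit in a short sketch. The event types follow the log shown above; the state shape, the apply function, and the old "products" field handled by the upcaster are assumptions introduced for the example.

```python
from functools import reduce


def upcast(event: dict) -> dict:
    """Translate an old event format to the current one (hypothetical rename)."""
    if event["type"] == "order.created" and "products" in event:
        event = {k: v for k, v in event.items() if k != "products"}
        event["items"] = list(event_products)
    return event


def apply(state, event):
    kind = event["type"]
    if kind == "order.created":
        items = event["items"] if "items" in event else event["products"]
        return {"orderId": event["orderId"], "items": list(items),
                "paid": 0.0, "status": "open"}
    if kind == "order.item.removed":
        return {**state, "items": [i for i in state["items"] if i != event["itemId"]]}
    if kind == "order.payment.received":
        return {**state, "paid": state["paid"] + event["amount"]}
    if kind == "order.shipped":
        return {**state, "status": "shipped", "trackingId": event["trackingId"]}
    return state   # unknown event types leave the state unchanged


def upcast_created(event: dict) -> dict:
    """Upcaster: old order.created events used 'products' instead of 'items'."""
    if event["type"] == "order.created" and "products" in event:
        upgraded = {k: v for k, v in event.items() if k != "products"}
        upgraded["items"] = list(event["products"])
        return upgraded
    return event


events = [
    {"type": "order.created", "orderId": "123", "products": ["456", "789"]},  # old format
    {"type": "order.item.removed", "orderId": "123", "itemId": "456"},
    {"type": "order.payment.received", "orderId": "123", "amount": 49.99},
    {"type": "order.shipped", "orderId": "123", "trackingId": "T789"},
]

upgraded_events = [upcast_created(e) for e in events]
current = reduce(apply, upgraded_events, None)        # Current State = fold(events)
after_two = reduce(apply, upgraded_events[:2], None)  # "time travel" to an earlier point
print(current, after_two)
```

Only upcast_created is used in the run above; running the upcaster before the fold means apply only ever needs to understand the newest event format.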

Event-Driven vs Request-Driven (REST/RPC):

Dimension	Event-Driven	Request-Driven (REST/RPC)
Temporal coupling	Decoupled (fire-and-forget)	Coupled (synchronous wait)
Failure model	Producer doesn’t know if consumer fails	Producer gets error synchronously
Ordering	Depends on broker guarantees	Implicit (sequential calls)
Discoverability	Harder (what events exist?)	Easy (API docs)
Debugging	Harder (async causality)	Easier (request trace)
Scale	Natural fan-out	Requires load balancer
Consistency	Eventual by default	Can be synchronous

Message brokers as infrastructure:

Apache Kafka: distributed log; messages retained on disk; consumers track offsets; excellent for high-throughput event streams and event sourcing.
RabbitMQ / ActiveMQ / AWS SQS: traditional message queues; messages deleted after consumption; competing consumers; better for task queues.
AWS EventBridge: managed event router with content-based routing and schema registry.
Google Cloud Pub/Sub: managed global pub/sub with at-least-once delivery.

CloudEvents (CNCF standard): a vendor-neutral envelope format for events:

{
  "specversion": "1.0",
  "type": "com.example.order.created",
  "source": "/orders",
  "id": "A234-1234-1234",
  "time": "2026-01-01T12:00:00Z",
  "datacontenttype": "application/json",
  "data": { ... }
}
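
An envelope like this is easy to build and sanity-check by hand. The helper names below are invented; the check covers only the four attributes that CloudEvents 1.0 marks as required (specversion, id, source, type) and is not a full validator.

```python
import json
import uuid
from datetime import datetime, timezone

REQUIRED = ("specversion", "id", "source", "type")


def make_event(event_type: str, source: str, data: dict) -> dict:
    return {
        "specversion": "1.0",
        "type": event_type,
        "source": source,
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


def missing_attributes(event: dict) -> list:
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED if not event.get(name)]


event = make_event("com.example.order.created", "/orders", {"orderId": "123"})
wire = json.dumps(event)
problems = missing_attributes({"type": "com.example.order.created"})
print(missing_attributes(event), problems)
```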

Comparison Tables

JSON vs Protocol Buffers vs Avro vs Thrift

Dimension	JSON	Protocol Buffers	Avro	Thrift
Human-readable	Yes	No	No	No
Schema required	No (optional)	Yes (`.proto`)	Yes (`.avsc`)	Yes (`.thrift`)
Binary encoding	No	Yes	Yes	Yes
Field identity in wire	Field name	Field tag number	None (by position/schema)	Field tag number
Backward compat	Manual discipline	Good (optional fields + tags)	Excellent (schema resolution)	Good (optional fields + tags)
Forward compat	Manual discipline	Good (ignore unknown tags)	Excellent	Good
Schema evolution mechanism	Conventions only	Field tags; optional fields	Writer/reader schema matching by name	Field tags; optional fields
Size (relative)	100% (baseline)	~41%	~40%	~42-73%
Schema registry needed	No	Optional (Buf)	Yes (for Kafka)	No
Code generation	Optional (openapi-gen)	Built-in (protoc)	Optional (avro-tools)	Built-in (thrift)
Streaming support	Manual	gRPC streaming	—	Thrift multiplexed transport
Primary use	REST APIs, config	gRPC, internal APIs	Kafka, Hadoop	Internal RPC (legacy)
Rename safety	N/A	Safe (name not in wire)	Backward compatible only, via reader-side aliases	Safe (name not in wire)

Backward vs Forward Compatibility

Operation	Backward Compatible?	Forward Compatible?	Notes
Add optional field with default	Yes	Yes	The safe default
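Add required field	No	Yes	New readers reject old data that lacks the field; never do this
Remove optional field	Yes	Yes	Old tag/name becomes unused
Remove required field	Yes	No	Old readers still expect the field
Rename field (Protobuf)	Yes	Yes	Name not in wire format
Rename field (Avro)	With alias	No	Reader-side aliases let new readers match the old name; old readers cannot find the new name
Change field type (widening)	Yes	Sometimes	int→long: new readers accept old values; old readers may truncate or reject values that do not fit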
Change field type (narrowing)	No	No	Data corruption
Change field tag (Protobuf)	No	No	Old data misinterpreted
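
The add and remove rows can be checked mechanically under an Avro-style rule: a reader can decode data if every field it expects was either written or has a default. The checker below is a sketch of that single rule (it ignores types, aliases, and tags), and the schemas are made-up examples.

```python
def can_read(reader_fields, writer_fields) -> bool:
    """True if every reader field was written or has a default to fall back on."""
    written = {f["name"] for f in writer_fields}
    return all(f["name"] in written or "default" in f for f in reader_fields)


def check(old_schema, new_schema) -> dict:
    return {
        "backward": can_read(new_schema, old_schema),  # new code reads old data
        "forward": can_read(old_schema, new_schema),   # old code reads new data
    }


base = [{"name": "userName"}, {"name": "favoriteNumber", "default": None}]

add_optional = check(base, base + [{"name": "email", "default": ""}])
add_required = check(base, base + [{"name": "email"}])
remove_optional = check(base, [{"name": "userName"}])
remove_required = check(base, [{"name": "favoriteNumber", "default": None}])

print(add_optional, add_required, remove_optional, remove_required)
```

A schema registry's compatibility modes apply the same kind of check at registration time, before any data is produced with the new schema.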

Important Points Summary

Encoding format is a contract: Changing it without a compatibility strategy breaks consumers silently or noisily.
Field tags in Protobuf/Thrift are permanent: The name can change, the tag cannot. Reusing a deleted tag for a new field corrupts old data.
Avro uses field names, not tags: Schema resolution matches fields by name, so a rename is backward compatible only through aliases in the reader’s schema and is not forward compatible. New fields require default values.
Both backward AND forward compatibility are needed simultaneously during rolling upgrades: new code runs alongside old code, and both versions access the same databases and message queues.
Durable execution solves the workflow reliability problem: It moves workflow state management from the application to a persistent execution engine, eliminating the “what if the process crashes halfway through” problem.
The three event-driven patterns are distinct: Event notification (small event, callback needed), event-carried state transfer (full state, consumer self-sufficient), event sourcing (immutable log, state derived by replay) — each has different trade-offs.
Schema registries enforce compatibility at publish time: Reject producers that try to register incompatible schemas before bad data ever reaches consumers.
Language-specific serialization formats are a security risk: Deserializing untrusted Java or Python objects can execute arbitrary code.
Unknown-field preservation: Applications that read and re-write records must preserve unknown fields, not silently drop them.
Durable execution requires deterministic workflow code: No random numbers, system clocks, or direct I/O in workflow functions — all non-deterministic actions must go through the framework’s activity mechanism.

Modern Context (2026)

Protobuf and gRPC dominance:

gRPC has become the standard internal RPC framework for cloud-native systems (Kubernetes, Istio, Envoy use it internally).
Buf CLI and Buf Schema Registry provide linting, breaking-change detection, and schema management for Protobuf teams — addressing the tooling gap that made Protobuf harder than Avro for schema governance.
ConnectRPC is an emerging alternative that makes gRPC-compatible services work over standard HTTP/1.1 and JSON, removing the requirement for HTTP/2 infrastructure.
Protobuf Editions (2023+) replaced the proto2/proto3 distinction with feature flags per field, giving more control over behavior.

Kafka ecosystem:

Confluent Schema Registry now supports Avro, JSON Schema, and Protobuf — enabling teams to choose their preferred encoding while getting schema governance.
AWS Glue Schema Registry provides a managed alternative in the AWS ecosystem.
Schema evolution is now a first-class engineering concern for Kafka-based systems, with CI pipelines that validate compatibility on every schema PR.

Temporal and durable execution growth:

Temporal has become the de facto open-source durable execution engine (2020–2026), with significant adoption at companies like Stripe, Netflix, Coinbase, and HashiCorp.
AWS Step Functions and Azure Durable Functions are the managed alternatives, preferred in teams that want to avoid operational overhead.
The durable execution paradigm is now considered a distinct category alongside queues, streams, and databases in the distributed systems toolbox.

Event-driven architecture maturity:

AsyncAPI 2.x/3.x provides the same OpenAPI-level contract definition for event-driven systems, enabling code generation, documentation, and schema validation for Kafka/WebSocket/AMQP APIs.
CloudEvents (CNCF standard) provides a common event envelope format, enabling interoperability across event sources and brokers.
CQRS/Event Sourcing patterns have matured significantly, with frameworks like Axon Framework (Java), EventStoreDB, and Marten (PostgreSQL-based event store) providing production-ready implementations.

Serialization innovations:

Apache Arrow IPC format: columnar in-memory format for analytics; enables zero-copy sharing between processes (e.g., Python and C++ on the same machine).
FlatBuffers / Cap’n Proto: zero-copy deserialization — the serialized form is directly usable in memory without a parsing step. Used in game engines and high-frequency trading.
Apache Parquet / ORC: columnar file formats for analytics workloads; excellent compression and predicate pushdown for OLAP queries.

Questions for Reflection

A team wants to migrate their Kafka topics from JSON to Avro to reduce message size. What sequence of steps should they follow to avoid breaking consumers? What compatibility mode should they set in the schema registry?
You have a Protocol Buffers schema in production. A product manager asks you to rename the user_name field to username. Is this safe? What if the same schema were Avro?
A workflow sends a payment, then sends a confirmation email. Without durable execution, what failure modes exist? How does Temporal’s approach address each?
A microservice using REST is being replaced by a gRPC service. What challenges arise for existing clients? How would you handle the migration?
Your event sourcing system uses Avro for events. Two years later, the OrderCreated event schema has changed significantly. Old events in the log must still be replayable. How would you handle this with schema evolution and upcasters?
Explain why durable execution workflow code must be deterministic. What happens if a developer accidentally puts time.now() directly in workflow code instead of using the framework’s timer API?

Related Resources

ch06-replication: Replication lag makes consistency guarantees relevant to encoding across distributed replicas
ch08-transactions: Transactions and encoding interact in database dataflow; schema migrations and transactions
ch12-stream-processing: Stream processing systems rely heavily on encoding formats (Kafka + Avro/Protobuf)
ch03-data-models-query-languages: Event sourcing as a data model; CQRS patterns
ch04-encoding-and-evolution: 1st edition coverage for comparison; Avro and Protobuf foundations

Last Updated: 2026-05-29

Study Notes by Niladri & AI
