Chapter 5: Encoding and Evolution

ddia-2e encoding serialization schema-evolution durable-execution event-driven

Status: Notes complete


Overview

Applications change continuously — fields are added, types change, features are removed. This chapter asks: how do you encode data in a way that lets the system evolve over time without breaking existing readers and writers? The core insight is that encoding format is not just a serialization detail; it is a contract between systems that must be explicitly managed. The 2nd edition significantly expands coverage beyond the 1st edition by adding Durable Execution (Temporal, AWS Step Functions) and a richer treatment of Event-Driven Architectures — both paradigms where encoding and evolution contracts are especially critical because messages outlive the code that wrote them.

The chapter covers three orthogonal concerns: what format to use for encoding (JSON, Protobuf, Avro), how to evolve schemas safely without downtime, and how encoded data flows through systems (databases, services, workflow engines, event streams).


Key Concepts

In-Memory vs On-Wire Representation

Programs keep data in memory as objects, structs, lists, hashmaps — all relying on pointers. When data must be written to disk or sent over a network, those pointers become meaningless. The data must be encoded (serialized/marshalled) into a self-contained byte sequence, and later decoded (deserialized/parsed/unmarshalled) back into memory.

This encoding step is where compatibility problems arise: if the encoder and decoder were built against different versions of the schema, the decode may fail or produce wrong results. The choice of encoding format determines how gracefully those mismatches are handled.

Language-Specific Formats

Java’s Serializable, Python’s pickle, Ruby’s Marshal — each language has a built-in serialization mechanism. These are convenient but dangerous for production data:

  • Language lock-in: A Java-serialized blob can only be decoded by Java. Cross-language interop is impossible.
  • Poor versioning: Adding a field to a Java class can silently break deserialization of older blobs.
  • Security vulnerability: Deserializing untrusted bytes can execute arbitrary code (well-known Java deserialization exploits).
  • Fragile to refactoring: Renaming a class or field path can permanently break existing stored data.

Rule: Never use language-specific formats for data that crosses process boundaries or will be stored persistently.

JSON, XML, and Binary Variants

JSON has become the dominant encoding for external APIs and configuration because it is human-readable and universally supported. Its limitations:

  • Numbers are ambiguous: no distinction between integer and float; integers > 2^53 lose precision in JavaScript.
  • No native binary type: binary data must be base64-encoded (adds ~33% overhead).
  • No enforced schema: any key name and value type is valid, making contracts implicit.
  • Verbose: field names are repeated in every record.

XML is even more verbose and has awkward distinctions between attributes and elements. It is largely confined to legacy enterprise systems (SOAP, RSS feeds).

CSV has no type system at all and is ambiguous around commas and newlines embedded in values. Useful for one-off bulk transfers but not for system-to-system communication.

MessagePack is a binary-compact variant of JSON: same data model, but values are encoded more compactly (~25% smaller than JSON). It retains JSON’s weakness of no enforced schema.

Protocol Buffers

Protocol Buffers (Protobuf) was developed at Google and is used for virtually all of Google’s internal communication. It requires writing a schema in an IDL (.proto file):

message Person {
  required string user_name = 1;
  optional int64  favorite_number = 2;
  repeated string interests = 3;
}

How Protobuf encoding works:

  • Each field in the wire format is identified by its field tag (the number 1, 2, 3) and a type annotation, NOT by its name.
  • The field name is purely a developer convenience — it can be changed freely without breaking wire compatibility.
  • Field tags are the canonical identity of a field. They must never be reused for different fields.

Backward and forward compatibility:

  • Add a new optional field with a new tag number → backward and forward compatible (old code ignores unknown tags; new code uses default if field absent).
  • Remove an optional field → backward compatible (old code that expects it gets nothing; new code ignores the old tag’s presence).
  • Never change a field’s tag number or type — old data encoded with the old tag becomes misinterpreted.
  • Never add required fields — old data was encoded without them; new code would reject valid old data.

Proto3 vs proto2: Proto3 removed the required/optional distinction and made all singular fields optional by default. This makes accidentally-breaking schema changes less likely, at the cost of losing the ability to distinguish “field absent” from “field present with zero value”.

Avro

Apache Avro was created for the Hadoop ecosystem and takes a fundamentally different approach to compatibility than Protobuf. There are no field tags in the wire format — Avro binary is just values in sequence, with no field identifiers at all. This makes Avro binaries even more compact than Protobuf, but means the schema must be known to decode the data.

The writer’s schema / reader’s schema mechanism:

  • The writer encodes data according to the writer’s schema (the schema in effect when the data was written).
  • The reader decodes using the reader’s schema (the schema the reader’s code was compiled against).
  • The Avro library compares the two schemas and resolves differences at read time, field by field, matched by name (not tag number).

Resolution rules:

  • Field in writer’s schema but not in reader’s: value is ignored.
  • Field in reader’s schema but not in writer’s: reader uses the field’s declared default value.
  • Both have the field: value is decoded and type-promoted if possible (e.g., int → long is safe).

Consequence: Renaming a field in Avro is a breaking change (unlike Protobuf where you can rename freely). Adding a field always requires a default value.

How the reader knows the writer’s schema: Two mechanisms:

  1. Schema registry: For Kafka streams, each message carries a 4-byte schema ID; the reader fetches the full schema from a central schema registry.
  2. Embedded in file: For Hadoop files (Avro container format), the schema is written once at the start of the file, then all records follow.

Avro IDL example:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",       "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "email",          "type": "string", "default": ""}
  ]
}

Encoding Efficiency Comparison

For a typical record with 3 fields (string, integer, repeated string):

FormatApproximate SizeNotes
JSON~81 bytesField names in every record
MessagePack~66 bytesBinary JSON; same model
Thrift BinaryProtocol~59 bytesField tags + type codes
Thrift CompactProtocol~34 bytesVarint-encoded tags
Protocol Buffers~33 bytesField tags + varints
Avro~32 bytesNo tags at all

At the scale of millions of events per second (Kafka, event logs), the 2.5x compaction ratio between JSON and Avro translates to meaningful cost savings in storage and network bandwidth.

The Merits of Schemas

It might seem that schema-less formats (JSON) are simpler. But explicit schemas provide important guarantees:

  • Documentation: The schema is the authoritative description of the data shape.
  • Validation: Encoders can reject malformed data before it propagates.
  • Code generation: Typed client/server stubs can be generated in any language from .proto or .avsc files.
  • Compatibility enforcement: Schema registries can reject schema changes that would break consumers.
  • Smaller encoding: Field names don’t need to be repeated in each record.

The industry has largely moved toward schema-first development for internal service-to-service communication, using either Protobuf (for gRPC) or Avro (for Kafka).


Modes of Dataflow

Dataflow Through Databases

When an application writes to a database, it encodes the data; when it reads back, it decodes. In long-running production systems this is never a clean one-to-one version mapping:

  • An older version of the application may write data; a newer version reads it (backward compatibility needed).
  • During a rolling upgrade, both old and new versions run simultaneously and both access the same database (forward compatibility also needed).

The unknown-field preservation problem: Suppose App v2 adds a new email field. App v1 reads a record written by v2, processes it, and writes it back — but App v1’s ORM has no email field in its model. If the ORM drops unknown fields on write-back, the email value is permanently lost. The correct behavior is to preserve and re-emit all fields, including those not understood by the current code version. Many ORMs and serialization libraries silently drop unknown fields, requiring careful configuration or explicit handling.

Schema migrations: ALTER TABLE in relational databases is often instantaneous on modern databases (MySQL 8, PostgreSQL 12+) for most changes. But adding columns with non-null constraints without defaults is always dangerous with rolling upgrades.

Dataflow Through Services: REST and RPC

REST (REpresentational State Transfer):

  • Builds on HTTP semantics: GET (safe, idempotent), POST (non-idempotent create), PUT/PATCH (update), DELETE
  • Resources identified by URLs; representation in JSON (usually) or XML
  • Stateless: each request carries all context needed
  • Native HTTP caching for GET responses
  • OpenAPI 3.1 for schema documentation and code generation
  • API versioning via URL path (/v2/users) or Accept header

SOAP:

  • XML-based protocol with WSDL for service description
  • Complex, verbose, now largely replaced by REST for new APIs
  • Still common in financial services and government systems

RPC (Remote Procedure Call):

  • Goal: make calling a remote service look like a local function call (transparent distribution)
  • The abstraction is leaky: network calls are different from local calls in fundamental ways:
    • Unpredictable latency (milliseconds to seconds)
    • Partial failure: request reached server but response was lost — did the operation execute?
    • Idempotency: can the caller safely retry?
    • Serialization/deserialization cost has no equivalent for local calls
  • Modern RPC frameworks (gRPC, Thrift, Finagle) acknowledge this leakiness and require explicit timeout/retry/idempotency handling.

gRPC is the dominant modern RPC framework:

  • Protocol Buffers over HTTP/2
  • Bidirectional streaming as first-class feature
  • Code generation in 10+ languages from .proto files
  • Used by Kubernetes, etcd, Envoy, and most cloud-native infrastructure

REST vs gRPC comparison:

DimensionRESTgRPC
ProtocolHTTP/1.1 or HTTP/2HTTP/2 only
FormatJSON (usually)Protocol Buffers
SchemaOptional (OpenAPI)Required (.proto)
Code generationOptionalBuilt-in
StreamingLimited (SSE, long-poll)Bidirectional streaming
CachingNative HTTP cachingManual
DebuggingEasy (curl, browser)Needs specialized tools (grpcurl)
Best forPublic APIs, web clientsInternal microservices

Durable Execution and Workflows

This is a major new section in the 2nd edition, addressing a class of problems that RPC and message queues don’t solve well.

The problem: Some operations take minutes, hours, or days — placing an order, onboarding a user, processing a payment batch. These multi-step workflows must survive process failures, server restarts, and deployments. Naively, you might write a long-running process or chain together async jobs — but these approaches require bespoke state management and are fragile.

What durable execution offers: A programming model where the code appears to run as an ordinary function but the execution engine automatically persists state to a database after each step. If the process crashes mid-workflow, execution resumes from the last checkpointed state rather than restarting from the beginning.

Key properties of durable execution:

  • Fault tolerance: Workflow survives process crashes, restarts, and deployments.
  • Exactly-once activity execution: Each activity (step) is guaranteed to execute at least once; the framework handles deduplication at the workflow level to achieve effectively-once semantics.
  • Long duration: A workflow can run for days, months, or even years.
  • Visibility: Execution history is persisted; you can query workflow state, get audit trails, and debug failures.
  • Deterministic replay: To resume a workflow after a crash, the engine replays the event history. Workflow code must be deterministic — given the same history, it must produce the same decisions. This means: no random numbers, no system clock reads, no direct I/O (all non-deterministic actions go through the framework).

How Temporal works:

  1. Developer writes workflow code (Go, Java, Python, TypeScript) using the Temporal SDK.
  2. Temporal server persists a complete event history of the workflow to a durable store (usually Cassandra or PostgreSQL).
  3. Worker processes poll for tasks, execute workflow/activity code, and report results back to the server.
  4. On failure, a new worker picks up the workflow and replays the event history to reconstruct in-memory state.
  5. Activities (external calls, database writes, API calls) are retried with configurable retry policies.
Temporal Architecture:

  Client ──→ Temporal Server (event log in DB)
                    │
           ┌────────┴──────────┐
           ↓                   ↓
     Workflow Worker      Activity Worker
     (replay history)     (executes steps)

AWS Step Functions is the managed cloud alternative:

  • Workflows defined as JSON state machines (Amazon States Language)
  • States: Task, Choice, Wait, Parallel, Map, Pass, Fail, Succeed
  • Execution history stored in AWS infrastructure; no server to manage
  • Two modes: Standard (durable, up to 1 year) and Express (high-throughput, up to 5 minutes)
  • Activities invoke Lambda functions, ECS tasks, or any AWS service

Azure Durable Functions extends Azure Functions with workflow orchestration:

  • Durable orchestrator functions that call activity functions
  • State is transparently persisted to Azure Storage
  • Similar replay-based execution model to Temporal

When to use durable execution vs alternatives:

ScenarioBest Approach
Multi-step workflow, minutes to hoursDurable execution (Temporal)
Short-lived async job (<30 seconds)Message queue (SQS, RabbitMQ)
Event fan-out to many consumersEvent streaming (Kafka)
Human-in-the-loop approvalDurable execution (long timers)
Periodic batch jobCron + task queue
Real-time stream processingStream processor (Flink, Kafka Streams)

Durable execution and encoding: Because workflow state is persisted across versions, encoding/evolution issues are acute. If the event history stored from v1 of the workflow must be replayed by v2’s code, the encoded activity inputs and outputs must be backward compatible. Temporal uses JSON by default (with Codec API for custom encoding); Protobuf is recommended for production workflows with strict schema evolution.

Event-Driven Architectures

The book distinguishes three fundamentally different patterns that all go by the name “event-driven”, but have very different semantics:

Pattern 1: Event Notification

The simplest pattern. A service emits a small notification that something happened. The event carries minimal data — just enough to trigger the consumer to act. The consumer typically calls back the originating service to get full details.

Order Service ──→ [order.created {orderId: "123"}] ──→ Notification Service
                                                         ↓
                                           GET /orders/123 (fetches full order)

Characteristics:

  • Very low coupling: consumers only know the event type and a reference
  • Consumer must make an additional request to get data (coupling on the API)
  • Good for: triggering side effects, fan-out notifications
  • Weakness: if the source is unavailable, consumers can’t complete their work

Pattern 2: Event-Carried State Transfer

The event carries the full state — the complete record of the entity at the time of the event. Consumers are self-sufficient; they don’t need to call back.

Order Service ──→ [order.created {orderId: "123", items: [...], total: 49.99, ...}] ──→ Analytics Service
                                                                                          ↓
                                                                            Stores locally, no callback needed

Characteristics:

  • Consumer autonomy: no dependency on source API availability
  • Higher event size and bandwidth
  • Consumer may maintain a local replica of the source data (read model)
  • Good for: building read models, analytics, search indexes, reporting
  • Weakness: data duplication; schema evolution must be managed carefully

Pattern 3: Event Sourcing

All state changes are stored as an immutable, append-only log of events. The current state of an entity is derived by replaying its event history. This is a persistence pattern, not just a communication pattern.

Events (immutable log):
  order.created   {orderId: "123", items: [...]}
  order.item.removed {orderId: "123", itemId: "456"}
  order.payment.received {orderId: "123", amount: 49.99}
  order.shipped   {orderId: "123", trackingId: "T789"}

Current State = fold(events, initial_state)

Characteristics:

  • Complete audit trail: can reconstruct state at any point in time
  • Time travel: query past state by replaying events up to a timestamp
  • Multiple projections: derive different read models (SQL table, search index, graph) from the same event log
  • Enables CQRS (Command Query Responsibility Segregation): separate write model (events) from read model (projections)
  • Schema evolution challenge: old events must remain replayable as the domain model evolves — use upcasters (functions that transform old event formats to new)
  • Weakness: eventual consistency of projections; snapshot management for performance

Event-Driven vs Request-Driven (REST/RPC):

DimensionEvent-DrivenRequest-Driven (REST/RPC)
Temporal couplingDecoupled (fire-and-forget)Coupled (synchronous wait)
Failure modelProducer doesn’t know if consumer failsProducer gets error synchronously
OrderingDepends on broker guaranteesImplicit (sequential calls)
DiscoverabilityHarder (what events exist?)Easy (API docs)
DebuggingHarder (async causality)Easier (request trace)
ScaleNatural fan-outRequires load balancer
ConsistencyEventual by defaultCan be synchronous

Message brokers as infrastructure:

  • Apache Kafka: distributed log; messages retained on disk; consumers track offsets; excellent for high-throughput event streams and event sourcing.
  • RabbitMQ / ActiveMQ / AWS SQS: traditional message queues; messages deleted after consumption; competing consumers; better for task queues.
  • AWS EventBridge: managed event router with content-based routing and schema registry.
  • Google Cloud Pub/Sub: managed global pub/sub with at-least-once delivery.

CloudEvents (CNCF standard): a vendor-neutral envelope format for events:

{
  "specversion": "1.0",
  "type": "com.example.order.created",
  "source": "/orders",
  "id": "A234-1234-1234",
  "time": "2026-01-01T12:00:00Z",
  "datacontenttype": "application/json",
  "data": { ... }
}

Comparison Tables

JSON vs Protocol Buffers vs Avro vs Thrift

DimensionJSONProtocol BuffersAvroThrift
Human-readableYesNoNoNo
Schema requiredNo (optional)Yes (.proto)Yes (.avsc)Yes (.thrift)
Binary encodingNoYesYesYes
Field identity in wireField nameField tag numberNone (by position/schema)Field tag number
Backward compatManual disciplineGood (optional fields + tags)Excellent (schema resolution)Good (optional fields + tags)
Forward compatManual disciplineGood (ignore unknown tags)ExcellentGood
Schema evolution mechanismConventions onlyField tags; optional fieldsWriter/reader schema matching by nameField tags; optional fields
Size (relative)100% (baseline)~40%~38%~42-70%
Schema registry neededNoOptional (Buf)Yes (for Kafka)No
Code generationOptional (openapi-gen)Built-in (protoc)Optional (avro-tools)Built-in (thrift)
Streaming supportManualgRPC streamingThrift multiplexed transport
Primary useREST APIs, configgRPC, internal APIsKafka, HadoopInternal RPC (legacy)
Rename safetyN/ASafe (name not in wire)Breaks compatibilitySafe (name not in wire)

Backward vs Forward Compatibility

OperationBackward Compatible?Forward Compatible?Notes
Add optional field with defaultYesYesThe safe default
Add required fieldNoNoNever do this
Remove optional fieldYesYesOld tag/name becomes unused
Remove required fieldNoNoOld code expects it
Rename field (Protobuf)YesYesName not in wire format
Rename field (Avro)NoNoName is identity in Avro
Change field type (widening)SometimesSometimesint→long safe; long→int not
Change field type (narrowing)NoNoData corruption
Change field tag (Protobuf)NoNoOld data misinterpreted

Important Points Summary

  • Encoding format is a contract: Changing it without a compatibility strategy breaks consumers silently or noisily.
  • Field tags in Protobuf/Thrift are permanent: The name can change, the tag cannot. Reusing a deleted tag for a new field corrupts old data.
  • Avro uses field names, not tags: Schema resolution matches fields by name — renaming breaks compatibility. New fields require default values.
  • Both backward AND forward compatibility are needed simultaneously during rolling upgrades: new code runs alongside old code, and both versions access the same databases and message queues.
  • Durable execution solves the workflow reliability problem: It moves workflow state management from the application to a persistent execution engine, eliminating the “what if the process crashes halfway through” problem.
  • The three event-driven patterns are distinct: Event notification (small event, callback needed), event-carried state transfer (full state, consumer self-sufficient), event sourcing (immutable log, state derived by replay) — each has different trade-offs.
  • Schema registries enforce compatibility at publish time: Reject producers that try to register incompatible schemas before bad data ever reaches consumers.
  • Language-specific serialization formats are a security risk: Deserializing untrusted Java or Python objects can execute arbitrary code.
  • Unknown-field preservation: Applications that read and re-write records must preserve unknown fields, not silently drop them.
  • Durable execution requires deterministic workflow code: No random numbers, system clocks, or direct I/O in workflow functions — all non-deterministic actions must go through the framework’s activity mechanism.

Modern Context (2026)

Protobuf and gRPC dominance:

  • gRPC has become the standard internal RPC framework for cloud-native systems (Kubernetes, Istio, Envoy use it internally).
  • Buf CLI and Buf Schema Registry provide linting, breaking-change detection, and schema management for Protobuf teams — addressing the tooling gap that made Protobuf harder than Avro for schema governance.
  • ConnectRPC is an emerging alternative that makes gRPC-compatible services work over standard HTTP/1.1 and JSON, removing the requirement for HTTP/2 infrastructure.
  • Protobuf Editions (2023+) replaced the proto2/proto3 distinction with feature flags per field, giving more control over behavior.

Kafka ecosystem:

  • Confluent Schema Registry now supports Avro, JSON Schema, and Protobuf — enabling teams to choose their preferred encoding while getting schema governance.
  • AWS Glue Schema Registry provides a managed alternative in the AWS ecosystem.
  • Schema evolution is now a first-class engineering concern for Kafka-based systems, with CI pipelines that validate compatibility on every schema PR.

Temporal and durable execution growth:

  • Temporal has become the de facto open-source durable execution engine (2020–2026), with significant adoption at companies like Stripe, Netflix, Coinbase, and HashiCorp.
  • AWS Step Functions and Azure Durable Functions are the managed alternatives, preferred in teams that want to avoid operational overhead.
  • The durable execution paradigm is now considered a distinct category alongside queues, streams, and databases in the distributed systems toolbox.

Event-driven architecture maturity:

  • AsyncAPI 2.x/3.x provides the same OpenAPI-level contract definition for event-driven systems, enabling code generation, documentation, and schema validation for Kafka/WebSocket/AMQP APIs.
  • CloudEvents (CNCF standard) provides a common event envelope format, enabling interoperability across event sources and brokers.
  • CQRS/Event Sourcing patterns have matured significantly, with frameworks like Axon Framework (Java), EventStoreDB, and Marten (PostgreSQL-based event store) providing production-ready implementations.

Serialization innovations:

  • Apache Arrow IPC format: columnar in-memory format for analytics; enables zero-copy sharing between processes (e.g., Python and C++ in the same machine).
  • FlatBuffers / Cap’n Proto: zero-copy deserialization — the serialized form is directly usable in memory without a parsing step. Used in game engines and high-frequency trading.
  • Apache Parquet / ORC: columnar file formats for analytics workloads; excellent compression and predicate pushdown for OLAP queries.

Questions for Reflection

  1. A team wants to migrate their Kafka topics from JSON to Avro to reduce message size. What sequence of steps should they follow to avoid breaking consumers? What compatibility mode should they set in the schema registry?
  2. You have a Protocol Buffers schema in production. A product manager asks you to rename the user_name field to username. Is this safe? What if the same schema were Avro?
  3. A workflow sends a payment, then sends a confirmation email. Without durable execution, what failure modes exist? How does Temporal’s approach address each?
  4. A microservice using REST is being replaced by a gRPC service. What challenges arise for existing clients? How would you handle the migration?
  5. Your event sourcing system uses Avro for events. Two years later, the OrderCreated event schema has changed significantly. Old events in the log must still be replayable. How would you handle this with schema evolution and upcasters?
  6. Explain why durable execution workflow code must be deterministic. What happens if a developer accidentally puts time.now() directly in workflow code instead of using the framework’s timer API?

Last Updated: 2026-05-29