Chapter 4: Encoding and Evolution

Overview

Applications change over time—schemas evolve, new fields are added, old ones removed. This chapter covers how data is encoded (serialized) for storage or transmission, and how systems maintain backward and forward compatibility through schema evolution. The key insight is that encoding formats make very different trade-offs between human-readability, schema enforcement, storage efficiency, and schema evolution support.

Core Problem: How do you change your data format (schema) without breaking systems that were built against the old format?

Key Concepts

Formats for Encoding Data

In-memory vs on-wire representation:

In memory: objects, structs, lists, hashmaps with pointers — efficient for CPU
On wire/disk: must be encoded into byte sequence (no pointers, self-contained)
Encoding (serialization/marshalling): in-memory → bytes
Decoding (deserialization/parsing/unmarshalling): bytes → in-memory
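
A minimal sketch of the encode/decode round trip, using Python's standard json module. The record contents are made up for illustration.

```python
import json

# In-memory representation: a dict that holds references to other objects.
record = {"user_name": "Martin", "interests": ["daydreaming", "hacking"]}

# Encoding: in-memory object -> self-contained byte sequence.
encoded = json.dumps(record).encode("utf-8")

# Decoding: byte sequence -> a new in-memory object with equal contents.
decoded = json.loads(encoded.decode("utf-8"))
```

The decoded object is equal to the original but is a separate object: pointers never cross the wire, only bytes.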

Language-specific formats (Java Serializable, Python pickle, Ruby Marshal):

Tied to programming language: can’t read in another language
Poor versioning support
Security issues (deserializing untrusted data can execute arbitrary code)
Never use for anything that needs to be persistent or interoperable
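
The security point can be illustrated with a deliberately harmless sketch. The class name Malicious and the expression "6 * 7" are stand-ins chosen for this example; a real attacker would substitute any callable and arguments.

```python
import pickle

class Malicious:
    # pickle asks __reduce__ how to rebuild the object. Returning
    # (callable, args) makes the decoder call callable(*args) on load.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless stand-in for an attacker's payload

payload = pickle.dumps(Malicious())

# Decoding does not rebuild a Malicious instance; it runs eval("6 * 7").
result = pickle.loads(payload)
```

Here result is the value of the evaluated expression, which shows that whoever controls the bytes controls what code the decoder runs.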

Textual Formats: JSON, XML, CSV

JSON:

Human-readable, widely supported
Can’t distinguish integers from floats (number is just “number”)
No binary string support (must base64-encode binary data)
Optional schema support (JSON Schema)

XML:

Verbose, human-readable
Can’t distinguish a number from a string of digits without an external schema
Attribute vs element ambiguity
Complex namespace handling

CSV:

Simple, widely supported
No schema, no type system
Commas/newlines in values need escaping rules
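
A short sketch of the escaping problem, using Python's standard csv module with made-up rows. It contrasts a proper CSV writer/reader with naive splitting on commas and newlines.

```python
import csv
import io

rows = [["id", "comment"], ["1", 'said "hi", then left\nnever returned']]

# The csv writer quotes fields that contain commas, quotes, or newlines.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
encoded = buffer.getvalue()

# A real CSV parser understands the quoting and recovers the original rows.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))

# Naive splitting treats the embedded newline and comma as separators.
naive = encoded.split("\n")[1].split(",")
```

Note that the id comes back as the string "1": with no type system, every value decodes as text.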

Common JSON/XML problems:

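Integers larger than 2^53 can’t be represented exactly in an IEEE 754 double, so they are parsed inaccurately by languages that use floating point for all numbers (such as JavaScript)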
No binary data support (must base64-encode)
Schema is optional and not enforced by format itself
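
A minimal sketch of the first two problems. Python integers are arbitrary precision, so the double-precision behaviour is simulated with float(); the field names id, id_str, and blob are illustrative.

```python
import base64
import json

# A double has a 53-bit significand, so 2**53 + 1 is not representable
# and rounds to 2**53 when converted.
big = 2**53 + 1
lossy = float(big) == float(2**53)  # two distinct integers collapse into one double

# A common workaround: also send the large integer as a decimal string.
doc = json.dumps({"id": big, "id_str": str(big)})

# Binary data must be wrapped in text, for example Base64 (roughly 33% larger).
raw = bytes(range(6))
wrapped = json.dumps({"blob": base64.b64encode(raw).decode("ascii")})
unwrapped = base64.b64decode(json.loads(wrapped)["blob"])
```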

Binary Encoding: Thrift, Protocol Buffers, Avro

Why binary?: Denser encoding (less space), faster to parse, enforced schema

Apache Thrift (Facebook):

IDL (Interface Definition Language) defines schema
Two binary formats: BinaryProtocol and CompactProtocol
Field tags (numbers) identify fields — not field names
Required/optional fields

Protocol Buffers (Google):

Very similar to Thrift
Each field has a field tag (number) and type annotation
Only field number + type in binary — not field name
Field tags are crucial for backward/forward compatibility
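
A simplified sketch of tag-based encoding in the style of the Protocol Buffers wire format, where each field starts with a key of (field number << 3) | wire type. Only varint and length-delimited fields are handled, and the record (user_name, favorite_number, interests) is a hypothetical example, not a real schema.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

VARINT, LENGTH_DELIMITED = 0, 2  # wire types

def key(tag: int, wire_type: int) -> bytes:
    # The field name never appears on the wire, only its tag and type.
    return varint((tag << 3) | wire_type)

def encode_int(tag: int, value: int) -> bytes:
    return key(tag, VARINT) + varint(value)

def encode_str(tag: int, value: str) -> bytes:
    data = value.encode("utf-8")
    return key(tag, LENGTH_DELIMITED) + varint(len(data)) + data

# Hypothetical record: user_name = tag 1, favorite_number = tag 2,
# interests = tag 3 (a repeated field is just the same tag appearing twice).
message = (
    encode_str(1, "Martin")
    + encode_int(2, 1337)
    + encode_str(3, "daydreaming")
    + encode_str(3, "hacking")
)
```

Renaming user_name in the schema would not change a single byte of message, which is why tags matter and names do not.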

Avro (Apache, for Hadoop):

Different approach: no field tags in binary encoding at all
Relies on schema resolution: the reader needs the exact writer’s schema to decode, and the reader’s schema need not be identical to it, only compatible
Schema transmitted separately (out of band), not with each record
Excellent for schema evolution via schema registry
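
A simplified sketch of Avro-style binary encoding for one hypothetical record schema: user_name (string), favorite_number (union of null and long), interests (array of strings). Values are simply concatenated in schema order, with no tags or names, so the bytes cannot be decoded without the writer's schema.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag value in variable-length form."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data  # length prefix, then UTF-8 bytes

def encode_record(user_name, favorite_number, interests) -> bytes:
    out = encode_string(user_name)
    # Union ["null", "long"]: branch index first, then the value of that branch.
    if favorite_number is None:
        out += zigzag_varint(0)
    else:
        out += zigzag_varint(1) + zigzag_varint(favorite_number)
    # Array: item count, the items, then a zero count as end marker.
    if interests:
        out += zigzag_varint(len(interests))
        for item in interests:
            out += encode_string(item)
    out += zigzag_varint(0)
    return out

record = encode_record("Martin", 1337, ["daydreaming", "hacking"])
```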

Encoding efficiency comparison (for a sample record):

JSON: ~81 bytes
MessagePack (binary JSON): ~66 bytes
Thrift BinaryProtocol: ~59 bytes
Thrift CompactProtocol: ~34 bytes
Protocol Buffers: ~33 bytes
Avro: ~32 bytes

Schema Evolution and Compatibility

Backward compatibility: Newer code can read data written by older code

Add new fields with default values (old data has no value for them)
Never remove required fields; only optional ones

Forward compatibility: Older code can read data written by newer code

Old code ignores fields it doesn’t know about (unknown field tags)
Don’t change the meaning of existing field tags/numbers

Thrift/Protocol Buffers evolution rules:

Can add new optional fields (assign new field tag) → forward & backward compatible
Can remove optional fields, never required ones, and must never reuse the removed tag number (old data may still carry it)
Cannot change field tags (would make existing data unreadable)
Changing a field’s data type is risky: sometimes possible, but values may lose precision or be truncated
Cannot add required fields (new code would reject old data that lacks them)
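
The rules above can be sketched with a toy decoder. Wire data is modelled as (tag, value) pairs rather than real bytes, and the schemas and the email field are hypothetical.

```python
# Schema: tag -> (field name, default value).
OLD_SCHEMA = {1: ("user_name", None)}
NEW_SCHEMA = {1: ("user_name", None), 2: ("email", "")}  # tag 2 added as optional

def decode(pairs, schema):
    # Start from defaults so fields absent from the data still have a value.
    record = {name: default for name, default in schema.values()}
    for tag, value in pairs:
        if tag in schema:
            record[schema[tag][0]] = value
        # Unknown tag: skip it instead of failing.
    return record

old_data = [(1, "Martin")]
new_data = [(1, "Martin"), (2, "martin@example.com")]

backward = decode(old_data, NEW_SCHEMA)  # new code reads old data
forward = decode(new_data, OLD_SCHEMA)   # old code reads new data
```

Backward compatibility comes from the default for the missing tag; forward compatibility comes from skipping the unknown tag.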

Avro evolution:

Writer’s schema and reader’s schema are matched by field name
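Writer and reader may use different schema versions; the reader obtains the writer’s schema (for example from a file header or a schema registry) and the Avro library resolves the two
Resolution rules: a field in the reader’s schema but not the writer’s gets its default value; a field in the writer’s schema but not the reader’s is ignored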
Excellent for database dumps (all records use same schema)
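
A toy sketch of name-based schema resolution. Real Avro schemas are JSON documents with types; here a writer schema is reduced to an ordered list of field names and a reader schema to a name-to-default mapping, both hypothetical.

```python
WRITER_FIELDS = ["user_name", "favorite_number"]          # order of values on the wire
READER_FIELDS = {"user_name": None, "email": "unknown"}   # field name -> default

def resolve(values, writer_fields, reader_fields):
    # Pair each decoded value with the name the writer gave it.
    written = dict(zip(writer_fields, values))
    resolved = {}
    for name, default in reader_fields.items():
        # Matching is by name: use the written value, else the reader's default.
        resolved[name] = written[name] if name in written else default
    # Writer fields the reader does not know (favorite_number) are dropped.
    return resolved

result = resolve(["Martin", 1337], WRITER_FIELDS, READER_FIELDS)
```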

Modes of Dataflow

1. Dataflow through databases:

Writing to DB = encoding; reading = decoding
Backward compatibility always needed (new code must read data written by old code)
Forward compatibility needed if older version of app still running
Preservation of unknown fields: When newer code writes, older code reads and re-writes, unknown fields should be preserved
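
A minimal sketch of the unknown-field problem, using JSON documents as the stored format. The field names and the two update functions are hypothetical.

```python
import json

KNOWN_FIELDS = {"id", "name"}  # the only fields the older code knows about

def update_name_lossy(raw: str, new_name: str) -> str:
    doc = json.loads(raw)
    # Decoding into the old model keeps only known fields...
    model = {k: doc[k] for k in KNOWN_FIELDS}
    model["name"] = new_name
    # ...so re-encoding silently drops anything newer code had written.
    return json.dumps(model)

def update_name_preserving(raw: str, new_name: str) -> str:
    doc = json.loads(raw)      # keep the whole document, known or not
    doc["name"] = new_name     # change only what this code understands
    return json.dumps(doc)

# Written by newer code that added photo_url.
stored = json.dumps({"id": 1, "name": "Ann", "photo_url": "https://example.com/a.png"})

lossy = json.loads(update_name_lossy(stored, "Anna"))
kept = json.loads(update_name_preserving(stored, "Anna"))
```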

2. Dataflow through services (REST and RPC):

REST:

Uses HTTP protocol features (methods, status codes, content-type)
Resources and representations; stateless
API versioning via URL (/v1/users) or HTTP header
OpenAPI (Swagger) for documentation/client generation

SOAP:

XML-based protocol with WSDL
Complex, now largely replaced by REST

RPC (Remote Procedure Call):

Goal: make remote call look like local function call
Flawed abstraction: network is unreliable (latency, failures, timeouts)
Modern RPCs: gRPC (Protocol Buffers over HTTP/2), Thrift, Finagle
RPC frameworks must handle: retries, idempotency, timeouts, serialization
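
Why retries need idempotency can be sketched with a simulated server that applies a request but loses the response. FlakyServer, deposit, and call_with_retries are hypothetical names for this example, and no real network is involved.

```python
class FlakyServer:
    """Applies each request but loses the first `drop` responses."""

    def __init__(self, drop: int):
        self.drop = drop
        self.balance = 0
        self.seen = {}  # idempotency key -> cached response

    def deposit(self, amount, key=None):
        if key is not None and key in self.seen:
            response = self.seen[key]      # duplicate request: do not apply again
        else:
            self.balance += amount
            response = self.balance
            if key is not None:
                self.seen[key] = response
        if self.drop > 0:
            self.drop -= 1
            # The client cannot tell whether the request was applied.
            raise TimeoutError("response lost")
        return response

def call_with_retries(fn, attempts=3):
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise

naive = FlakyServer(drop=1)
call_with_retries(lambda: naive.deposit(100))               # applied twice

safe = FlakyServer(drop=1)
call_with_retries(lambda: safe.deposit(100, key="req-1"))   # applied once
```

A local function call either runs or does not; a remote call has a third outcome (unknown), which is what the idempotency key compensates for.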

RPC vs REST:

RPC: more efficient, strongly typed, code generation; harder to debug
REST: human-readable, standard tooling, cache-friendly; more verbose

3. Dataflow through async message passing (message brokers):

Message broker (Kafka, RabbitMQ, ActiveMQ): producer → broker → consumer
Decouples producer from consumer (can be down independently)
Messages redelivered if consumer fails
Natural fan-out (one message, many consumers)
Actor model (Erlang, Akka): actors communicate by sending async messages; each actor processes one message at a time from its mailbox

Message queue vs message log:

Queue (RabbitMQ): message deleted once a consumer acknowledges it; broker pushes messages to consumers
Log (Kafka): messages retained, consumers track offset, pull model
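
The difference can be sketched with two toy in-memory classes. These are not the RabbitMQ or Kafka APIs; acknowledgements, partitions, and persistence are all left out.

```python
from collections import deque

class Queue:
    """Queue semantics: a message is gone once it has been consumed."""
    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        return self._messages.popleft()

class Log:
    """Log semantics: messages are retained; each consumer tracks its own offset."""
    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        self._messages.append(message)

    def poll(self, consumer):
        offset = self._offsets.get(consumer, 0)
        batch = self._messages[offset:]
        self._offsets[consumer] = len(self._messages)
        return batch

    def seek(self, consumer, offset):
        self._offsets[consumer] = offset  # rewind to re-read old messages

log = Log()
for m in ("a", "b"):
    log.publish(m)
first = log.poll("billing")    # reads both messages
second = log.poll("audit")     # an independent consumer also reads both
log.seek("billing", 0)
replayed = log.poll("billing") # replay is possible because nothing was deleted
```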

The Importance of Schema Registries

Problem: Avro and binary formats require knowing the schema to decode data

Schema registry (Confluent Schema Registry for Kafka):

Central store of schemas with version IDs
Producer registers schema, gets ID; embeds ID in each message
Consumer looks up schema by ID to decode message
Enforces compatibility rules (backward/forward/full)
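
A toy in-memory sketch of the register/embed/look-up flow. The framing (one zero "magic" byte followed by a 4-byte big-endian schema ID) is modelled on the Confluent wire format; the SchemaRegistry class is hypothetical, performs no compatibility checks, and uses JSON as a stand-in for an Avro-encoded payload.

```python
import json
import struct

class SchemaRegistry:
    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema: dict) -> int:
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:      # same schema always gets the same ID
            schema_id = len(self._ids) + 1
            self._ids[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[canonical]

    def lookup(self, schema_id: int) -> dict:
        return self._by_id[schema_id]

MAGIC = 0

def frame(schema_id: int, payload: bytes) -> bytes:
    return struct.pack(">bI", MAGIC, schema_id) + payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC:
        raise ValueError("unknown framing")
    return schema_id, message[5:]

registry = SchemaRegistry()
schema = {"name": "User", "fields": ["user_name"]}

# Producer: register the schema, then prefix every message with its ID.
message = frame(registry.register(schema), json.dumps(["Martin"]).encode())

# Consumer: read the ID, fetch the writer's schema, then decode the payload.
schema_id, payload = unframe(message)
writer_schema = registry.lookup(schema_id)
```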

Why it matters:

Without schema registry, binary-encoded data is unreadable if schema is lost
Schema registry enables schema evolution with safety checks

Important Points

Field tags are more important than field names in Thrift/Protobuf: Never change them.
Avro uses field names, not tags: Must match names in schema resolution.
Backward vs forward compatibility is a two-way street: Both directions need to work simultaneously during rolling upgrades.
REST APIs are not RPC: HTTP semantics (caching, idempotency) are features, not limitations.
Message brokers decouple producers from consumers: Key for resilience in async systems.
Schema evolution requires a plan: Ad-hoc changes break compatibility; use required/optional correctly.
Avoid language-specific serialization: Tied to language, poor security, no versioning.

Examples & Case Studies

Protocol Buffers at Google
- Used for virtually all internal communication
- gRPC (Protocol Buffers + HTTP/2) now widely adopted externally
- Field tags make backward- and forward-compatible schema changes possible, provided tags are never changed or reused
Avro in Apache Kafka
- Confluent Schema Registry stores Avro schemas
- Each Kafka message starts with a 5-byte prefix: one magic byte followed by a 4-byte schema ID
- Compatibility is checked when a new schema version is registered, typically at produce time
JSON APIs in Practice
- GitHub API: versioned via Accept header
- Twitter API: moved from XML to JSON (2012)
- Stripe: stable API with date-based versions selected via the Stripe-Version header (the URL keeps a fixed /v1 prefix)
gRPC adoption
- Used by Kubernetes, etcd, many cloud-native systems
- Protocol Buffers + HTTP/2 + code generation in 10+ languages
- Bidirectional streaming as first-class feature

Questions

Why shouldn’t you use Java Serializable or Python pickle for persistent data?
How does Protocol Buffers achieve backward and forward compatibility?
What is the difference between backward and forward compatibility?
When would you choose Avro over Protocol Buffers?
How do schema registries work and why are they needed?
What are the trade-offs between REST and RPC?
How do message brokers enable asynchronous dataflow?
What happens when you add a required field to a Protocol Buffers schema?

Modern Context (2026)

gRPC and Protobuf dominance:

gRPC has become one of the most widely used RPC frameworks for internal microservices
Buf Schema Registry as alternative to Confluent Schema Registry for Protobuf
ConnectRPC: gRPC-compatible protocol that works with standard HTTP/1.1

GraphQL maturity:

Sits between REST and RPC: strongly typed schema, client-driven queries
Schema evolution via @deprecated directive
Used for external APIs where clients have varying data needs

OpenAPI 3.x and AsyncAPI:

OpenAPI 3.1 (aligned with JSON Schema) now standard for REST documentation
AsyncAPI 2.x for event-driven/async APIs (Kafka, WebSocket)
Code generation from schemas in all major languages

Kafka Schema Registry evolution:

Confluent Schema Registry supports Avro, JSON Schema, Protobuf
Compatibility modes: BACKWARD, FORWARD, FULL, NONE
Schema evolution is now a first-class concern in event-driven architectures

Serialization in 2026:

FlatBuffers / Cap’n Proto: zero-copy deserialization (no parsing step)
Arrow IPC format: columnar in-memory format for analytics (shared memory, zero-copy)
MessagePack: compact JSON-compatible binary, popular for Redis payloads
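
The zero-copy idea can be sketched with Python's struct module and a memoryview. This is not the FlatBuffers or Cap'n Proto format; the fixed layout (a uint32 id followed by a float64 score, little-endian) is an assumption made for the example.

```python
import struct

# Hypothetical fixed layout per record: uint32 id at offset 0, float64 score at offset 4.
LAYOUT = struct.Struct("<Id")

# Build a buffer of three records, as if received from disk or the network.
buffer = bytearray(LAYOUT.size * 3)
for i in range(3):
    LAYOUT.pack_into(buffer, i * LAYOUT.size, i + 1, i * 1.5)

def score_at(buf, index: int) -> float:
    # Read one field at a computed offset; no other record or field is decoded.
    return struct.unpack_from("<d", buf, index * LAYOUT.size + 4)[0]

view = memoryview(buffer)  # a view over the same bytes, not a copy
third_score = score_at(view, 2)
```

Because fields sit at known offsets, access cost depends on the fields read, not on the size of the whole message.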

Status: Notes complete
Last Updated: 2026-04-13

Study Notes by Niladri & AI
