Chapter 4: Encoding and Evolution

Overview

Applications change over time—schemas evolve, new fields are added, old ones removed. This chapter covers how data is encoded (serialized) for storage or transmission, and how systems maintain backward and forward compatibility through schema evolution. The key insight is that encoding formats make very different trade-offs between human-readability, schema enforcement, storage efficiency, and schema evolution support.

Core Problem: How do you change your data format (schema) without breaking systems that were built against the old format?

Key Concepts

Formats for Encoding Data

In-memory vs on-wire representation:

  • In memory: objects, structs, lists, hashmaps with pointers — efficient for CPU
  • On wire/disk: must be encoded into byte sequence (no pointers, self-contained)
  • Encoding (serialization/marshalling): in-memory → bytes
  • Decoding (deserialization/parsing/unmarshalling): bytes → in-memory

Language-specific formats (Java Serializable, Python pickle, Ruby Marshal):

  • Tied to programming language: can’t read in another language
  • Poor versioning support
  • Security issues (deserializing untrusted data can execute arbitrary code)
  • Never use for anything that needs to be persistent or interoperable

Textual Formats: JSON, XML, CSV

JSON:

  • Human-readable, widely supported
  • Can’t distinguish integers from floats (number is just “number”)
  • No binary string support (must base64-encode binary data)
  • Optional schema support (JSON Schema)

XML:

  • Verbose, human-readable
  • Can’t distinguish numbers from strings
  • Attribute vs element ambiguity
  • Complex namespace handling

CSV:

  • Simple, widely supported
  • No schema, no type system
  • Commas/newlines in values need escaping rules

Common JSON/XML problems:

  • Ambiguity around numbers > 2^53 (JavaScript limitation)
  • No binary data support (must base64-encode)
  • Schema is optional and not enforced by format itself

Binary Encoding: Thrift, Protocol Buffers, Avro

Why binary?: Denser encoding (less space), faster to parse, enforced schema

Apache Thrift (Facebook):

  • IDL (Interface Definition Language) defines schema
  • Two binary formats: BinaryProtocol and CompactProtocol
  • Field tags (numbers) identify fields — not field names
  • Required/optional fields

Protocol Buffers (Google):

  • Very similar to Thrift
  • Each field has a field tag (number) and type annotation
  • Only field number + type in binary — not field name
  • Field tags are crucial for backward/forward compatibility

Avro (Apache, for Hadoop):

  • Different approach: no field tags in binary encoding at all
  • Relies on schema matching: writer’s schema must match reader’s schema
  • Schema transmitted separately (out of band), not with each record
  • Excellent for schema evolution via schema registry

Encoding efficiency comparison (for a sample record):

  • JSON: ~81 bytes
  • MessagePack (binary JSON): ~66 bytes
  • Thrift BinaryProtocol: ~59 bytes
  • Thrift CompactProtocol: ~34 bytes
  • Protocol Buffers: ~33 bytes
  • Avro: ~32 bytes

Schema Evolution and Compatibility

Backward compatibility: Newer code can read data written by older code

  • Add new fields with default values (old data has no value for them)
  • Never remove required fields; only optional ones

Forward compatibility: Older code can read data written by newer code

  • Old code ignores fields it doesn’t know about (unknown field tags)
  • Don’t change the meaning of existing field tags/numbers

Thrift/Protocol Buffers evolution rules:

  • Can add new optional fields (assign new field tag) → forward & backward compatible
  • Can remove optional fields (old code ignores unknown tags) → backward compatible
  • Cannot change field tags (would make existing data unreadable)
  • Cannot change data types (may truncate values)
  • Cannot add required fields (old code won’t send them)

Avro evolution:

  • Writer’s schema and reader’s schema are matched by field name
  • Different schema versions (writer ≠ reader) handled via schema registry
  • Avro resolves differences: new fields get default, missing fields ignored
  • Excellent for database dumps (all records use same schema)

Modes of Dataflow

1. Dataflow through databases:

  • Writing to DB = encoding; reading = decoding
  • Backward compatibility always needed (old code must read new data)
  • Forward compatibility needed if older version of app still running
  • Preservation of unknown fields: When newer code writes, older code reads and re-writes, unknown fields should be preserved

2. Dataflow through services (REST and RPC):

REST:

  • Uses HTTP protocol features (methods, status codes, content-type)
  • Resources and representations; stateless
  • API versioning via URL (/v1/users) or HTTP header
  • OpenAPI (Swagger) for documentation/client generation

SOAP:

  • XML-based protocol with WSDL
  • Complex, now largely replaced by REST

RPC (Remote Procedure Call):

  • Goal: make remote call look like local function call
  • Flawed abstraction: network is unreliable (latency, failures, timeouts)
  • Modern RPCs: gRPC (Protocol Buffers over HTTP/2), Thrift, Finagle
  • RPC frameworks must handle: retries, idempotency, timeouts, serialization

RPC vs REST:

  • RPC: more efficient, strongly typed, code generation; harder to debug
  • REST: human-readable, standard tooling, cache-friendly; more verbose

3. Dataflow through async message passing (message brokers):

  • Message broker (Kafka, RabbitMQ, ActiveMQ): producer → broker → consumer
  • Decouples producer from consumer (can be down independently)
  • Messages redelivered if consumer fails
  • Natural fan-out (one message, many consumers)
  • Actor model: Erlang, Akka — actors send async messages, single-threaded mailbox

Message queue vs message log:

  • Queue (RabbitMQ): message deleted after consumed, push model
  • Log (Kafka): messages retained, consumers track offset, pull model

The Importance of Schema Registries

Problem: Avro and binary formats require knowing the schema to decode data

Schema registry (Confluent Schema Registry for Kafka):

  • Central store of schemas with version IDs
  • Producer registers schema, gets ID; embeds ID in each message
  • Consumer looks up schema by ID to decode message
  • Enforces compatibility rules (backward/forward/full)

Why it matters:

  • Without schema registry, binary-encoded data is unreadable if schema is lost
  • Schema registry enables schema evolution with safety checks

Important Points

  • Field tags are more important than field names in Thrift/Protobuf: Never change them.
  • Avro uses field names, not tags: Must match names in schema resolution.
  • Backward vs forward compatibility is a two-way street: Both directions need to work simultaneously during rolling upgrades.
  • REST APIs are not RPC: HTTP semantics (caching, idempotency) are features, not limitations.
  • Message brokers decouple producers from consumers: Key for resilience in async systems.
  • Schema evolution requires a plan: Ad-hoc changes break compatibility; use required/optional correctly.
  • Avoid language-specific serialization: Tied to language, poor security, no versioning.

Examples & Case Studies

  1. Protocol Buffers at Google

    • Used for virtually all internal communication
    • gRPC (Protocol Buffers + HTTP/2) now widely adopted externally
    • Field tags enforce that schema changes are backward-compatible
  2. Avro in Apache Kafka

    • Confluent Schema Registry stores Avro schemas
    • Each Kafka message has 4-byte schema ID prefix
    • Schema evolution enforced at produce time
  3. JSON APIs in Practice

    • GitHub API: versioned via Accept header
    • Twitter API: moved from XML to JSON (2012)
    • Stripe: stable API with version dates in URL
  4. gRPC adoption

    • Used by Kubernetes, etcd, many cloud-native systems
    • Protocol Buffers + HTTP/2 + code generation in 10+ languages
    • Bidirectional streaming as first-class feature

Questions

  1. Why shouldn’t you use Java Serializable or Python pickle for persistent data?
  2. How does Protocol Buffers achieve backward and forward compatibility?
  3. What is the difference between backward and forward compatibility?
  4. When would you choose Avro over Protocol Buffers?
  5. How do schema registries work and why are they needed?
  6. What are the trade-offs between REST and RPC?
  7. How do message brokers enable asynchronous dataflow?
  8. What happens when you add a required field to a Protocol Buffers schema?

Modern Context (2026)

gRPC and Protobuf dominance:

  • gRPC has become the dominant internal microservice RPC framework
  • Buf Schema Registry as alternative to Confluent Schema Registry for Protobuf
  • ConnectRPC: gRPC-compatible protocol that works with standard HTTP/1.1

GraphQL maturity:

  • Sits between REST and RPC: strongly typed schema, client-driven queries
  • Schema evolution via @deprecated directive
  • Used for external APIs where clients have varying data needs

OpenAPI 3.x and AsyncAPI:

  • OpenAPI 3.1 (aligned with JSON Schema) now standard for REST documentation
  • AsyncAPI 2.x for event-driven/async APIs (Kafka, WebSocket)
  • Code generation from schemas in all major languages

Kafka Schema Registry evolution:

  • Confluent Schema Registry supports Avro, JSON Schema, Protobuf
  • Compatibility modes: BACKWARD, FORWARD, FULL, NONE
  • Schema evolution is now a first-class concern in event-driven architectures

Serialization in 2026:

  • FlatBuffers / Cap’n Proto: zero-copy deserialization (no parsing step)
  • Arrow IPC format: columnar in-memory format for analytics (shared memory, zero-copy)
  • MessagePack: compact JSON-compatible binary, popular for Redis payloads

Status: Notes complete
Last Updated: 2026-04-13