Chapter 4 Cheat Sheet - Encoding and Evolution

One-Line Summaries

Concept                 One-Liner
──────────────────────  ──────────────────────────────────────────────
Encoding                Converting in-memory objects to byte sequences for storage/transmission
Backward compatibility  Newer code can read data written by older code
Forward compatibility   Older code can read data written by newer code
Field tag               Numeric ID for a field in Protobuf/Thrift; must never change
Avro                    Schema-based binary format; no field tags; reader resolves against writer schema
Schema registry         Central store of versioned schemas; enables safe evolution with binary formats
Dataflow                How encoded data flows through systems: DB, services, or message brokers
Rolling upgrade         Deploy new code to some nodes while old code still runs on others

Encoding Format Comparison

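Format              Human-readable  Schema enforced   Binary  Evolution support   Use case
──────────────────  ──────────────  ────────────────  ──────  ──────────────────  ─────────────────
JSON                ✅              ❌ (optional)     ❌      Manual              REST APIs, config
XML                 ✅              ❌ (optional)     ❌      Manual              Legacy systems
CSV                 ✅              ❌                ❌      Poor                Bulk data export
Protobuf            ❌              ✅                ✅      Good (field tags)   gRPC, internal APIs
Thrift              ❌              ✅                ✅      Good (field tags)   Internal RPC
Avro                ❌              ✅                ✅      Excellent (names)   Kafka, Hadoop
Java serialization  ❌              ❌ (class-bound)  ✅      Very poor           Avoid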

Compatibility Decision Tree

Is this a schema change?
│
├─ Adding a new field?
│  ├─ Add as optional with default → ✅ Backward + Forward compatible
│  └─ Add as required → ❌ Breaks backward (old data missing field)
│
├─ Removing a field?
│  ├─ Was optional? → ✅ Backward compatible (old data has value, new code ignores)
│  └─ Was required? → ❌ Don't do it; make it optional first
│
├─ Renaming a field?
│  ├─ Protobuf/Thrift: ✅ OK (uses field tag, not name)
│  └─ Avro: ❌ Breaks schema resolution (uses field name) unless the reader's schema lists the old name as an alias (backward compatible only)
│
└─ Changing field type?
   └─ ❌ Usually breaks compatibility (may truncate/corrupt data)
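
The tree above can be sketched as a small checker. This is only an illustration under stated assumptions: the Field descriptor and check_change function are hypothetical names invented here, not part of any Protobuf, Thrift, or Avro library, and renames are not detected (a rename looks like a removal plus an addition).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical field descriptor, used only for this sketch."""
    type: str
    required: bool = False
    has_default: bool = False


def check_change(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """Return the compatibility problems found when moving from old to new."""
    problems = []
    for name, field in new.items():
        if name not in old:
            # Added field: old data lacks it, so new code needs a fallback.
            if field.required or not field.has_default:
                problems.append(f"added field '{name}' must be optional with a default")
        elif old[name].type != field.type:
            problems.append(f"field '{name}' changed type {old[name].type} -> {field.type}")
    for name, field in old.items():
        if name not in new and field.required:
            # Old readers still expect this field in data written by new code.
            problems.append(f"removed required field '{name}'; make it optional first")
    return problems


v1 = {"userName": Field("string", required=True), "age": Field("int")}
v2 = dict(v1, email=Field("string", has_default=True))
print(check_change(v1, v2))  # no problems: optional field with a default
```

An empty list means the change follows the safe branches of the tree; anything else names the branch that was violated.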

Protocol Buffers / Thrift Wire Format

JSON (81 bytes):                          Protobuf (~33 bytes):
{                                         Field 1, type string, length N
  "userName": "Martin",                   M a r t i n
  "favoriteNumber": 1337,                 Field 2, type varint
  "interests": ["daydreaming", "hacking"] 1337 (varint encoded)
}                                         Field 3, type string, length N
                                          d a y d r e a m i n g
                                          Field 3, type string, length N
                                          h a c k i n g

Key: Fields are identified by number, not name → renaming a field is safe
     A field's number must never change → adding a field under a new number is safe
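
To make the byte counts concrete, here is a hand-rolled sketch of the Protobuf wire encoding for the record above. It assumes the field numbers shown in the diagram (1, 2, 3) and uses no Protobuf library; the helper names (varint, key, string_field, varint_field) are invented for this example.

```python
import json


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def key(field_number: int, wire_type: int) -> bytes:
    """Field key: (field number << 3) | wire type, itself a varint."""
    return varint((field_number << 3) | wire_type)


def string_field(field_number: int, value: str) -> bytes:
    data = value.encode("utf-8")
    return key(field_number, 2) + varint(len(data)) + data  # wire type 2 = length-delimited


def varint_field(field_number: int, value: int) -> bytes:
    return key(field_number, 0) + varint(value)  # wire type 0 = varint


record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

encoded = (
    string_field(1, record["userName"])
    + varint_field(2, record["favoriteNumber"])
    + b"".join(string_field(3, s) for s in record["interests"])  # repeated field
)
compact_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(len(compact_json), len(encoded))  # 81 33
```

Note that no field name appears anywhere in the encoded bytes, which is why a rename does not affect data already written.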

Avro Schema Resolution

Writer's Schema (v1)          Reader's Schema (v2)
───────────────────           ────────────────────
{                             {
  "name": "userName",           "name": "userName",
  "type": "string"              "type": "string"
},                            },
{                             {
  "name": "age",    ────────→   "name": "age",
  "type": "int"                 "type": "int",
}                               "default": 0      ← default added in v2
                              },
                              {
                                "name": "email",  ← new field
                                "type": "string",
                                "default": ""     ← REQUIRED for evolution
                              }

Avro uses field NAMES to match fields — field order doesn't matter
New fields must have defaults — Avro uses default when field absent in data
Extra fields in writer not in reader → ignored
Missing fields in writer present in reader → use default value
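
The four rules above can be sketched in a few lines. This is a simplified model, not the Avro library: it assumes the bytes have already been decoded with the writer's schema into a dict, and the resolve function and the sample value for age are made up for illustration.

```python
def resolve(writer_record: dict, reader_fields: list[dict]) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema.

    Fields are matched by name, so their order in either schema is irrelevant.
    """
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            result[name] = writer_record[name]
        elif "default" in field:
            result[name] = field["default"]  # the writer never wrote this field
        else:
            raise ValueError(f"reader field '{name}' has no default and is absent from the data")
    # Writer fields that the reader's schema does not mention are simply dropped.
    return result


reader_v2 = [
    {"name": "userName", "type": "string"},
    {"name": "age", "type": "int", "default": 0},
    {"name": "email", "type": "string", "default": ""},
]
written_by_v1 = {"userName": "Martin", "age": 42}
print(resolve(written_by_v1, reader_v2))
# {'userName': 'Martin', 'age': 42, 'email': ''}
```

The ValueError branch is the case the "REQUIRED for evolution" note guards against: a reader field with no default cannot be filled in when the writer did not supply it.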

Three Modes of Dataflow

1. Through Databases
   ┌─────────┐  encode  ┌──────┐  decode  ┌─────────┐
   │ App v1  │ ──────→  │  DB  │ ──────→  │ App v2  │
   └─────────┘          └──────┘          └─────────┘
   Problem: Old and new versions may run simultaneously
   Key: Old code must preserve fields it does not recognize (written by newer code) when re-writing records

2. Through Services (REST/RPC)
   ┌────────┐  request   ┌────────┐
   │ Client │ ─────────→ │ Server │
   │ (v1)   │ ←───────── │ (v2)   │
   └────────┘  response  └────────┘
   Key: API versioning; client and server may deploy independently

3. Through Message Brokers
   ┌──────────┐  publish  ┌────────┐  consume  ┌──────────┐
   │ Producer │ ────────→ │ Kafka  │ ────────→ │ Consumer │
   │          │           │ (log)  │           │          │
   └──────────┘           └────────┘           └──────────┘
   Key: Messages may be consumed long after production; must remain decodable
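
The database case is the one where unknown fields get lost most easily. The sketch below contrasts a lossy read-modify-write with a preserving one; it uses JSON documents for brevity, and KNOWN_FIELDS, the function names, and the sample record are all assumptions made for this example.

```python
import json

KNOWN_FIELDS = {"userName", "age"}  # what this (older) code version understands


def update_age_lossy(raw: str, age: int) -> str:
    """Anti-pattern: decode into known fields only, then re-encode."""
    doc = json.loads(raw)
    model = {k: v for k, v in doc.items() if k in KNOWN_FIELDS}
    model["age"] = age
    return json.dumps(model)


def update_age_preserving(raw: str, age: int) -> str:
    """Keep the whole document and touch only the field being changed."""
    doc = json.loads(raw)
    doc["age"] = age
    return json.dumps(doc)


# A record written by newer code that already knows about "email".
stored_by_v2 = json.dumps({"userName": "Martin", "age": 41, "email": "m@example.com"})

print(update_age_lossy(stored_by_v2, 42))       # email is silently dropped
print(update_age_preserving(stored_by_v2, 42))  # email survives the rewrite
```

The lossy variant is what typically happens when a record is mapped onto a typed model object that has no slot for fields added later.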

REST vs RPC Comparison

                 REST                         RPC (gRPC)
───────────────  ───────────────────────────  ───────────────────────
Protocol         HTTP/1.1 or HTTP/2           HTTP/2 (gRPC)
Format           JSON (usually)               Protocol Buffers
Schema           Optional (OpenAPI)           Required (proto file)
Code generation  Optional                     Built-in
Streaming        Limited                      Bidirectional streaming
Discoverability  High (URLs, human-readable)  Lower
Caching          Native HTTP caching          Manual
Debugging        Easy (curl, browser)         Harder (binary)
Best for         Public APIs, web clients     Internal microservices

Key Trade-offs

Decision         Pro                                    Con                                                    When to Use
───────────────  ─────────────────────────────────────  ─────────────────────────────────────────────────────  ─────────────────────────
JSON             Human-readable, universal              Large size, weak types                                 Public APIs, config
Protobuf         Small, typed, code gen                 Not human-readable                                     High-performance internal
Avro             Most compact encoding, good evolution  Reader needs the writer's schema (e.g. via a registry)  Kafka, Hadoop
REST             Standard, debuggable                   More verbose                                           External APIs
gRPC             Fast, typed, streaming                 Complex setup                                          Internal services
Async messaging  Decoupled, resilient                   Eventually consistent                                  Background jobs, events

Red Flags

❌ Using Java Serializable / Python pickle for network or persistent data
❌ Changing Protobuf field tag numbers
❌ Adding required fields to a schema with existing data
❌ Not preserving unknown fields on re-serialization
❌ Schema evolution without a compatibility strategy

Green Flags

✅ Always make new fields optional with defaults
✅ Use field tags (Protobuf) or schema registry (Avro) for safe evolution
✅ Version your REST APIs explicitly (/v2/)
✅ Use schema registry for Kafka/Avro to enforce compatibility
✅ Test backward AND forward compatibility before deploying schema changes

Modern Additions (2026)

gRPC ecosystem:
├─ ConnectRPC: speaks gRPC and gRPC-Web, plus its own protocol that works over HTTP/1.1 with JSON
├─ Buf CLI: linting, breaking change detection for Protobuf
└─ Protobuf editions (2023+): replaces proto2/proto3 distinction

Async API contracts:
├─ AsyncAPI (2.x/3.x): OpenAPI equivalent for event-driven APIs
├─ CloudEvents: CNCF standard envelope for event metadata
└─ Schema evolution in event sourcing: append-only event log + upcasters
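
An upcaster chain can be sketched as follows. This is a minimal illustration, not a specific event-sourcing framework: the event shapes, version numbers, and function names are invented, and the stored log itself is never modified (events are converted on read).

```python
from typing import Callable

Event = dict


# Hypothetical upcasters: each lifts an event from version N to version N + 1.
def v1_to_v2(event: Event) -> Event:
    # Invented change: v2 split "name" into "firstName" and "lastName".
    first, _, last = event["name"].partition(" ")
    return {"version": 2, "firstName": first, "lastName": last}


def v2_to_v3(event: Event) -> Event:
    # Invented change: v3 added an optional "email" with a default.
    return {**event, "version": 3, "email": event.get("email", "")}


UPCASTERS: dict[int, Callable[[Event], Event]] = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def upcast(event: Event) -> Event:
    """Apply upcasters in sequence until the event is at CURRENT_VERSION."""
    while event["version"] < CURRENT_VERSION:
        event = UPCASTERS[event["version"]](event)
    return event


log = [
    {"version": 1, "name": "Ada Lovelace"},
    {"version": 3, "firstName": "Alan", "lastName": "Turing", "email": "alan@example.com"},
]
for stored in log:
    print(upcast(stored))
```

Application code then only ever handles the current event shape, while old events stay in the log exactly as they were written.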

Zero-copy serialization:
├─ Apache Arrow: columnar in-memory, shared between processes
├─ Cap'n Proto / FlatBuffers: no parsing step needed
└─ Used in ML/analytics pipelines for performance
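
The idea behind "no parsing step" can be shown with the standard library alone. The sketch below assumes a made-up fixed record layout (it is not the actual Cap'n Proto, FlatBuffers, or Arrow format) and reads a single field directly at its byte offset instead of decoding whole records.

```python
import struct

# Hypothetical fixed layout, little-endian, no padding:
# uint32 id | float64 score | uint16 flags
LAYOUT = struct.Struct("<IdH")
SCORE_OFFSET = 4  # bytes that precede the score field within a record


def read_score(buf: memoryview, index: int) -> float:
    """Read one field of one record straight from the buffer."""
    (score,) = struct.unpack_from("<d", buf, index * LAYOUT.size + SCORE_OFFSET)
    return score


# Build a buffer holding three records, as a writer would.
raw = bytearray(LAYOUT.size * 3)
for i, (rec_id, score, flags) in enumerate([(1, 0.5, 0), (2, 0.75, 1), (3, 0.25, 0)]):
    LAYOUT.pack_into(raw, i * LAYOUT.size, rec_id, score, flags)

view = memoryview(raw)      # a view over the same bytes, not a copy
print(read_score(view, 1))  # 0.75
```

Because every field sits at a known offset, a reader can jump to the one value it needs; the same buffer could equally be a memory-mapped file or shared memory.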

Interview Response Templates

When Asked About API Versioning

“I’d version the API explicitly, either in the URL (/v2/users) or via an Accept header. For internal services using Protobuf/gRPC, I’d rely on field tag evolution: new fields are optional, and the numbers of removed fields are reserved and never reused. For Kafka messages with Avro, I’d use Confluent Schema Registry with BACKWARD compatibility mode so new consumers can read old messages.”

When Asked About Schema Evolution

“The key rules are: always add new fields as optional with a default value, never remove required fields without first making them optional, and never change field tag numbers in Protobuf. Forward compatibility means old consumers can read data written by newer code, ignoring fields they don’t recognize; backward compatibility means new consumers can read old data that lacks the newer fields. During rolling upgrades, both directions must work simultaneously.”


Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13