Chapter 4 Cheat Sheet - Encoding and Evolution

One-Line Summaries

Concept	One-Liner
Encoding	Converting in-memory objects to byte sequences for storage/transmission
Backward compatibility	Newer code can read data written by older code
Forward compatibility	Older code can read data written by newer code
Field tag	Numeric ID for a field in Protobuf/Thrift — must never change
Avro	Schema-based binary format; no field tags; reader resolves against writer schema
Schema registry	Central store of versioned schemas; enables safe evolution with binary formats
Dataflow	How encoded data flows through systems: DB, services, or message brokers
Rolling upgrade	Deploy new code to some nodes while old code still runs on others

Encoding Format Comparison

Format	Human-readable	Schema enforced	Binary	Evolution support	Use case
JSON	✅	❌ (optional)	❌	Manual	REST APIs, config
XML	✅	❌ (optional)	❌	Manual	Legacy systems
CSV	✅	❌	❌	Poor	Bulk data export
Protobuf	❌	✅	✅	Good (field tags)	gRPC, internal APIs
Thrift	❌	✅	✅	Good (field tags)	Internal RPC
Avro	❌	✅	✅	Excellent (names)	Kafka, Hadoop
Java Serializable	❌	❌	✅	Very poor	Avoid

Compatibility Decision Tree

Is this a schema change?
│
├─ Adding a new field?
│  ├─ Add as optional with default → ✅ Backward + Forward compatible
│  └─ Add as required → ❌ Breaks backward (old data missing field)
│
├─ Removing a field?
│  ├─ Was optional? → ✅ Backward + Forward compatible (new code ignores old values; old code falls back to the default); never reuse its tag
│  └─ Was required? → ❌ Don't do it; make it optional first
│
├─ Renaming a field?
│  ├─ Protobuf/Thrift: ✅ OK (uses field tag, not name)
│  └─ Avro: ⚠️ Fields match by name; add an alias in the reader's schema (backward compatible, not forward compatible)
│
└─ Changing field type?
   └─ ⚠️ Sometimes possible (e.g., widening int32 to int64), but values may be truncated or lose precision

Protocol Buffers / Thrift Wire Format

JSON (81 bytes, whitespace removed):      Protobuf (33 bytes):
{                                         Field 1, type string, length N
  "userName": "Martin",                   M a r t i n
  "favoriteNumber": 1337,                 Field 2, type varint
  "interests": ["daydreaming", "hacking"] 1337 (varint encoded)
}                                         Field 3, type string, length N
                                          d a y d r e a m i n g
                                          Field 3, type string, length N
                                          h a c k i n g

Key: Fields are identified by number, not name → renaming a field is safe
     Old code skips field numbers it doesn't recognize → adding new fields is safe
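
The byte counts above can be reproduced by hand. What follows is a minimal sketch of the Protobuf wire format, not the official protobuf library: it assumes field numbers 1, 2, 3 as in the diagram, non-negative integers only, and helper names (encode_varint, encode_person, and so on) that are made up for illustration.

```python
import json


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(field_number: int, wire_type: int) -> bytes:
    """Each field starts with a key: (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    # wire type 2 = length-delimited: key, byte length, then the bytes
    return key(field_number, 2) + encode_varint(len(data)) + data


def encode_int(field_number: int, value: int) -> bytes:
    # wire type 0 = varint
    return key(field_number, 0) + encode_varint(value)


def encode_person(user_name: str, favorite_number: int, interests: list) -> bytes:
    parts = [encode_string(1, user_name), encode_int(2, favorite_number)]
    # a repeated field is written as the same field number once per element
    parts += [encode_string(3, item) for item in interests]
    return b"".join(parts)


record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_protobuf = encode_person(record["userName"], record["favoriteNumber"],
                            record["interests"])
print(len(as_json), len(as_protobuf))  # 81 33
```

No field name appears anywhere in the Protobuf bytes, which is why a rename is invisible on the wire while a tag change is not.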

Avro Schema Resolution

Writer's Schema (v1)          Reader's Schema (v2)
───────────────────           ────────────────────
{                             {
  "name": "userName",           "name": "userName",
  "type": "string"              "type": "string"
},                            },
{                             {
  "name": "age",    ────────→   "name": "age",
  "type": "int"                 "type": "int",
}                               "default": 0      ← added with default
                              },
                              {
                                "name": "email",  ← new field
                                "type": "string",
                                "default": ""     ← REQUIRED for evolution
                              }

Avro uses field NAMES to match fields — field order doesn't matter
New fields must have defaults — Avro uses default when field absent in data
Extra fields in writer not in reader → ignored
Missing fields in writer present in reader → use default value
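
The four rules above can be sketched in a few lines. This is a simplified illustration, not the Avro library: it works on records that are already decoded into dicts, ignores type promotion and aliases, and the function name resolve and the sample age value are assumptions made for the example.

```python
_MISSING = object()  # sentinel: distinguishes "no default" from a default of None


def resolve(writer_record: dict, reader_fields: list) -> dict:
    """Project a record decoded with the writer's schema onto the reader's fields."""
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            # matched by name, regardless of field order
            result[name] = writer_record[name]
        else:
            default = field.get("default", _MISSING)
            if default is _MISSING:
                raise ValueError(
                    f"reader field {name!r} has no default and is absent from the data"
                )
            result[name] = default
    return result  # writer-only fields are never copied, so they are ignored


reader_v2 = [
    {"name": "userName", "type": "string"},
    {"name": "age", "type": "int", "default": 0},
    {"name": "email", "type": "string", "default": ""},
]
written_by_v1 = {"userName": "Martin", "age": 42}  # sample data, v1 has no email

print(resolve(written_by_v1, reader_v2))
# {'userName': 'Martin', 'age': 42, 'email': ''}
```

In real Avro this resolution happens while decoding the binary data, which is why the reader needs access to the exact schema the writer used.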

Three Modes of Dataflow

1. Through Databases
   ┌─────────┐  encode  ┌──────┐  decode  ┌─────────┐
   │ App v1  │ ──────→  │  DB  │ ──────→  │ App v2  │
   └─────────┘          └──────┘          └─────────┘
   Problem: Old and new versions may run simultaneously
   Key: Old code must preserve unknown fields when rewriting records written by new code

2. Through Services (REST/RPC)
   ┌────────┐  request   ┌────────┐
   │ Client │ ─────────→ │ Server │
   │ (v1)   │ ←───────── │ (v2)   │
   └────────┘  response  └────────┘
   Key: API versioning; client and server may deploy independently

3. Through Message Brokers
   ┌──────────┐  publish  ┌────────┐  consume  ┌──────────┐
   │ Producer │ ────────→ │ Kafka  │ ────────→ │ Consumer │
   │          │           │ (log)  │           │          │
   └──────────┘           └────────┘           └──────────┘
   Key: Messages may be consumed long after production; must remain decodable
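
For the database mode, the unknown-field hazard shows up when code that predates a field reads a record, updates it, and writes it back. A rough sketch with JSON records follows; the email field and both function names are hypothetical, chosen only to show the difference between the two styles.

```python
import json


def rename_user_lossy(raw: str, new_name: str) -> str:
    """Old code that rebuilds the record from the fields it knows: drops the rest."""
    record = json.loads(raw)
    known = {"userName": new_name, "favoriteNumber": record["favoriteNumber"]}
    return json.dumps(known)


def rename_user_preserving(raw: str, new_name: str) -> str:
    """Old code that edits the decoded record in place and keeps unknown fields."""
    record = json.loads(raw)
    record["userName"] = new_name
    return json.dumps(record)


# A record written by newer code that already knows about "email".
written_by_v2 = json.dumps(
    {"userName": "Martin", "favoriteNumber": 1337, "email": "martin@example.com"}
)

print("email" in json.loads(rename_user_lossy(written_by_v2, "M. K.")))       # False
print("email" in json.loads(rename_user_preserving(written_by_v2, "M. K.")))  # True
```

The lossy version is what tends to happen when a record is mapped onto a fixed model class and then re-encoded from that object.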

REST vs RPC Comparison

Aspect	REST	RPC (gRPC)
Protocol	HTTP/1.1 or HTTP/2	HTTP/2 (gRPC)
Format	JSON (usually)	Protocol Buffers
Schema	Optional (OpenAPI)	Required (proto file)
Code generation	Optional	Built-in
Streaming	Limited	Bidirectional streaming
Discoverability	High (URLs, human-readable)	Lower
Caching	Native HTTP caching	Manual
Debugging	Easy (curl, browser)	Harder (binary)
Best for	Public APIs, web clients	Internal microservices

Key Trade-offs

Decision	Pro	Con	When to Use
JSON	Human-readable, universal	Large size, weak types	Public APIs, config
Protobuf	Small, typed, code gen	Not human-readable	High-performance internal
Avro	Most compact encoding, good evolution	Reader needs the writer's schema (e.g., via a schema registry)	Kafka, Hadoop
REST	Standard, debuggable	More verbose	External APIs
gRPC	Fast, typed, streaming	Complex setup	Internal services
Async messaging	Decoupled, resilient	Eventually consistent	Background jobs, events

Red Flags

❌ Using Java Serializable / Python pickle for network or persistent data
❌ Changing Protobuf field tag numbers
❌ Adding required fields to a schema with existing data
❌ Not preserving unknown fields on re-serialization
❌ Schema evolution without a compatibility strategy

Green Flags

✅ Always make new fields optional with defaults
✅ Use field tags (Protobuf) or schema registry (Avro) for safe evolution
✅ Version your REST APIs explicitly (/v2/)
✅ Use schema registry for Kafka/Avro to enforce compatibility
✅ Test backward AND forward compatibility before deploying schema changes
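
A compatibility test can start as small as the sketch below. It assumes Avro-style, name-based matching and only covers added and removed fields (no type changes or aliases); can_read and check_evolution are hypothetical helpers, not part of any schema registry API.

```python
def can_read(reader_fields: list, writer_fields: list) -> bool:
    """True if every reader field is either written by the writer or has a default."""
    written = {f["name"] for f in writer_fields}
    return all(f["name"] in written or "default" in f for f in reader_fields)


def check_evolution(old: list, new: list) -> dict:
    return {
        # backward: new code (reader) must cope with old data (writer)
        "backward": can_read(reader_fields=new, writer_fields=old),
        # forward: old code (reader) must cope with new data (writer)
        "forward": can_read(reader_fields=old, writer_fields=new),
    }


v1 = [{"name": "userName", "type": "string"}, {"name": "age", "type": "int"}]
v2_ok = v1 + [{"name": "email", "type": "string", "default": ""}]
v2_bad = v1 + [{"name": "email", "type": "string"}]  # new field without default

print(check_evolution(v1, v2_ok))   # {'backward': True, 'forward': True}
print(check_evolution(v1, v2_bad))  # {'backward': False, 'forward': True}
```

Running a check like this in CI against the previously deployed schema catches the "required field added" red flag before it reaches production data.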

Modern Additions (2026)

gRPC ecosystem:
├─ ConnectRPC: gRPC-compatible over HTTP/1.1 + JSON
├─ Buf CLI: linting, breaking change detection for Protobuf
└─ Protobuf editions (2023+): replaces proto2/proto3 distinction

Async API contracts:
├─ AsyncAPI (2.x/3.x): OpenAPI equivalent for event-driven APIs
├─ CloudEvents: CNCF standard envelope for event metadata
└─ Schema evolution in event sourcing: append-only event log + upcasters
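
The upcaster idea can be sketched as a chain of small functions, each lifting a stored event by one version at read time so the log itself is never rewritten. Everything here is illustrative: the event shape, the version field, and the two schema changes are assumptions, not taken from a specific framework.

```python
def upcast_v1_to_v2(event: dict) -> dict:
    """Hypothetical change: v2 added an 'email' field, defaulting to empty."""
    return {**event, "version": 2, "email": ""}


def upcast_v2_to_v3(event: dict) -> dict:
    """Hypothetical change: v3 renamed 'userName' to 'displayName'."""
    upgraded = {k: v for k, v in event.items() if k != "userName"}
    upgraded["displayName"] = event["userName"]
    upgraded["version"] = 3
    return upgraded


UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}
LATEST_VERSION = 3


def upcast(event: dict) -> dict:
    """Apply upcasters one version at a time until the event is current."""
    while event["version"] < LATEST_VERSION:
        event = UPCASTERS[event["version"]](event)
    return event


stored = {"version": 1, "userName": "Martin"}  # as written to the append-only log
print(upcast(stored))
# {'version': 3, 'email': '', 'displayName': 'Martin'}
```

Each upcaster returns a new dict, so the stored event stays untouched and application code only ever handles the latest shape.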

Zero-copy serialization:
├─ Apache Arrow: columnar in-memory, shared between processes
├─ Cap'n Proto / FlatBuffers: no parsing step needed
└─ Used in ML/analytics pipelines for performance

Interview Response Templates

When Asked About API Versioning

“I’d version the API explicitly, either in the URL (/v2/users) or via an Accept header. For internal services using Protobuf/gRPC, I’d rely on field tag evolution: new fields are optional, and field numbers are never changed or reused (a removed field’s number stays reserved). For Kafka messages with Avro, I’d use Confluent Schema Registry with BACKWARD compatibility mode so new consumers can read old messages.”

When Asked About Schema Evolution

“The key rules are: always add new fields as optional with a default value, never remove required fields without first making them optional, and never change field tag numbers in Protobuf. Forward compatibility means old consumers ignore fields they don’t recognize; backward compatibility means new consumers fall back to defaults for fields that old data lacks. During rolling upgrades, both directions must work simultaneously.”

Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13

Study Notes by Niladri & AI
