Chapter 4 Cheat Sheet - Encoding and Evolution
One-Line Summaries
| Concept | One-Liner |
|---|---|
| Encoding | Converting in-memory objects to byte sequences for storage/transmission |
| Backward compatibility | Newer code can read data written by older code |
| Forward compatibility | Older code can read data written by newer code |
| Field tag | Numeric ID for a field in Protobuf/Thrift — must never change |
| Avro | Schema-based binary format; no field tags; reader resolves against writer schema |
| Schema registry | Central store of versioned schemas; enables safe evolution with binary formats |
| Dataflow | How encoded data flows through systems: DB, services, or message brokers |
| Rolling upgrade | Deploy new code to some nodes while old code still runs on others |
Encoding Format Comparison
| Format | Human-readable | Schema enforced | Binary | Evolution support | Use case |
|---|---|---|---|---|---|
| JSON | ✅ | ❌ (optional) | ❌ | Manual | REST APIs, config |
| XML | ✅ | ❌ (optional) | ❌ | Manual | Legacy systems |
| CSV | ✅ | ❌ | ❌ | Poor | Bulk data export |
| Protobuf | ❌ | ✅ | ✅ | Good (field tags) | gRPC, internal APIs |
| Thrift | ❌ | ✅ | ✅ | Good (field tags) | Internal RPC |
| Avro | ❌ | ✅ | ✅ | Excellent (names) | Kafka, Hadoop |
| Java serialization | ❌ | ❌ | ✅ | Very poor | Avoid |
Compatibility Decision Tree
Is this a schema change?
│
├─ Adding a new field?
│ ├─ Add as optional with default → ✅ Backward + Forward compatible
│ └─ Add as required → ❌ Breaks backward (old data missing field)
│
├─ Removing a field?
│ ├─ Was optional? → ✅ Backward + forward compatible (new code ignores old values; old code treats the field as absent). Never reuse its tag number
│ └─ Was required? → ❌ Don't do it; make it optional first
│
├─ Renaming a field?
│ ├─ Protobuf/Thrift: ✅ OK (uses field tag, not name)
│ └─ Avro: ❌ Breaks name matching, unless the reader's schema lists the old name as an alias (backward compatible only)
│
└─ Changing field type?
└─ ❌ Usually breaks compatibility (may truncate/corrupt data)
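The tree above can be written down as a small lookup. This is a minimal sketch: the `Change` class and `is_safe` function are hypothetical names invented here, and the rules are only the ones this cheat sheet lists, not a full compatibility checker.

```python
from dataclasses import dataclass

@dataclass
class Change:
    kind: str                 # "add", "remove", "rename", or "retype"
    optional: bool = True     # is the field optional / does it have a default?
    matches_by: str = "tag"   # "tag" (Protobuf/Thrift) or "name" (Avro)

def is_safe(change: Change) -> bool:
    """Return True when the decision tree marks the change as safe."""
    if change.kind in ("add", "remove"):
        return change.optional                # required fields break old data or old readers
    if change.kind == "rename":
        return change.matches_by == "tag"     # names only matter when fields match by name
    return False                              # type changes (and anything else) count as unsafe

print(is_safe(Change("add")), is_safe(Change("rename", matches_by="name")))
```

A check like this is only a first filter; real tools compare full schemas rather than single changes.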
Protocol Buffers / Thrift Wire Format
JSON (81 bytes, whitespace removed):
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
Protobuf (33 bytes):
Field 1, type string, length 6: M a r t i n
Field 2, type varint: 1337 (varint encoded)
Field 3, type string, length 11: d a y d r e a m i n g
Field 3, type string, length 7: h a c k i n g
Key: Fields are identified by number, not name → renaming a field is safe
Old parsers skip field numbers they don't recognize → adding a field under a new number is safe
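To make the byte counts concrete, here is a hand-rolled sketch of the Protobuf wire encoding for the example record. It assumes field numbers 1, 2, and 3 for the three fields and handles only non-negative integers and strings. The helper names are invented for illustration; real code would use generated Protobuf classes.

```python
import json

def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        low7, n = n & 0x7F, n >> 7
        if n:
            out.append(low7 | 0x80)   # more bytes follow: set the continuation bit
        else:
            out.append(low7)
            return bytes(out)

def key(field_number: int, wire_type: int) -> bytes:
    """Field key: field number and wire type packed into one varint."""
    return varint((field_number << 3) | wire_type)

def string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return key(field_number, 2) + varint(len(data)) + data   # wire type 2 = length-delimited

def int_field(field_number: int, value: int) -> bytes:
    return key(field_number, 0) + varint(value)              # wire type 0 = varint

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

# A repeated field is simply the same field number written once per element.
encoded = (string_field(1, record["userName"])
           + int_field(2, record["favoriteNumber"])
           + b"".join(string_field(3, s) for s in record["interests"]))

compact_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(compact_json), len(encoded))   # prints: 81 33
```

No field name appears anywhere in `encoded`, which is why a rename leaves existing data readable.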
Avro Schema Resolution
Writer's Schema (v1)
────────────────────
{ "name": "userName", "type": "string" },
{ "name": "age", "type": "int" }
Reader's Schema (v2)
────────────────────
{ "name": "userName", "type": "string" },
{ "name": "age", "type": "int", "default": 0 }, ← default added
{ "name": "email", "type": "string", "default": "" } ← new field; default REQUIRED for evolution
Avro uses field NAMES to match fields — field order doesn't matter
New fields must have defaults — Avro uses default when field absent in data
Extra fields in writer not in reader → ignored
Missing fields in writer present in reader → use default value
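The four rules above can be sketched in a few lines. This is a simplification that works on already-decoded dicts and plain field lists. The `resolve` function is a hypothetical helper, not the Avro library's API, and the record values are illustrative.

```python
def resolve(record: dict, writer_fields: list, reader_fields: list) -> dict:
    """Project a record decoded with the writer's fields onto the reader's fields."""
    writer_names = {f["name"] for f in writer_fields}
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            result[name] = record[name]        # matched by name; order is irrelevant
        elif "default" in field:
            result[name] = field["default"]    # writer never wrote it: use the reader's default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                              # writer-only fields are dropped

writer_v1 = [{"name": "userName", "type": "string"},
             {"name": "age", "type": "int"}]
reader_v2 = [{"name": "userName", "type": "string"},
             {"name": "age", "type": "int", "default": 0},
             {"name": "email", "type": "string", "default": ""}]

# Backward direction: v2 reader, v1 data. "email" is filled from its default.
upgraded = resolve({"userName": "Martin", "age": 42}, writer_v1, reader_v2)

# Forward direction: v1 reader, v2 data. "email" is ignored.
downgraded = resolve({"userName": "Martin", "age": 42, "email": "m@example.com"},
                     reader_v2, writer_v1)
print(upgraded, downgraded)
```

Dropping the `"default"` from `email` would make the backward direction raise, which is the failure the "defaults required" rule guards against.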
Three Modes of Dataflow
1. Through Databases
┌─────────┐ encode ┌──────┐ decode ┌─────────┐
│ App v1 │ ──────→ │ DB │ ──────→ │ App v2 │
└─────────┘ └──────┘ └─────────┘
Problem: Old and new versions may run simultaneously
Key: Old code must preserve fields it doesn't recognize (written by newer code) when rewriting records
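To see why unknown fields matter here, consider an older application version doing a read-modify-write on a record that a newer version wrote. The sketch below uses JSON for brevity. The function names, the `KNOWN_FIELDS` set, and the sample record are all made up for illustration.

```python
import json

KNOWN_FIELDS = {"userName", "favoriteNumber"}   # all that the old (v1) code understands

def update_lossy(raw: bytes, new_name: str) -> bytes:
    """Old code that decodes into its own model and silently drops everything else."""
    doc = json.loads(raw)
    model = {k: v for k, v in doc.items() if k in KNOWN_FIELDS}
    model["userName"] = new_name
    return json.dumps(model).encode("utf-8")

def update_preserving(raw: bytes, new_name: str) -> bytes:
    """Old code that edits the fields it knows and writes the rest back untouched."""
    doc = json.loads(raw)
    doc["userName"] = new_name
    return json.dumps(doc).encode("utf-8")

# Stored by newer (v2) code, which added "email".
stored = json.dumps({"userName": "Martin", "favoriteNumber": 1337,
                     "email": "martin@example.com"}).encode("utf-8")

print("email" in json.loads(update_lossy(stored, "M. K.")))        # prints: False
print("email" in json.loads(update_preserving(stored, "M. K.")))   # prints: True
```

The lossy version is what typically happens when a record is decoded into a model object and re-encoded from that object alone.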
2. Through Services (REST/RPC)
┌────────┐ request ┌────────┐
│ Client │ ─────────→ │ Server │
│ (v1) │ ←───────── │ (v2) │
└────────┘ response └────────┘
Key: API versioning; client and server may deploy independently
3. Through Message Brokers
┌──────────┐ publish ┌────────┐ consume ┌──────────┐
│ Producer │ ────────→ │ Kafka │ ────────→ │ Consumer │
│ │ │ (log) │ │ │
└──────────┘ └────────┘ └──────────┘
Key: Messages may be consumed long after production; must remain decodable
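One common way to keep old messages decodable is to prefix each one with the ID of the schema that wrote it, so a consumer can look up the writer's schema much later. The sketch below assumes a framing of one zero byte followed by a 4-byte big-endian schema ID, modeled on the layout used by Confluent's serializers. The `REGISTRY` dict and its IDs are hypothetical stand-ins for a real schema registry.

```python
import struct

MAGIC = 0
HEADER = struct.Struct(">bI")   # 1-byte magic + 4-byte big-endian schema ID (5 bytes total)

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an encoded payload with the ID of the schema that produced it."""
    return HEADER.pack(MAGIC, schema_id) + payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC:
        raise ValueError("unknown framing")
    return schema_id, message[HEADER.size:]

REGISTRY = {7: "user-v1", 8: "user-v2"}   # hypothetical schema IDs → schema versions

message = frame(7, b"payload-bytes")      # produced while v1 was current
schema_id, payload = unframe(message)     # consumed later, possibly by v2 code
writer_schema = REGISTRY[schema_id]
print(schema_id, writer_schema, payload)
```

The consumer then resolves the payload from `writer_schema` to its own reader schema, however old the message is.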
REST vs RPC Comparison
| Aspect | REST | RPC (gRPC) |
|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (gRPC) |
| Format | JSON (usually) | Protocol Buffers |
| Schema | Optional (OpenAPI) | Required (proto file) |
| Code generation | Optional | Built-in |
| Streaming | Limited | Bidirectional streaming |
| Discoverability | High (URLs, human-readable) | Lower |
| Caching | Native HTTP caching | Manual |
| Debugging | Easy (curl, browser) | Harder (binary) |
| Best for | Public APIs, web clients | Internal microservices |
Key Trade-offs
| Decision | Pro | Con | When to Use |
|---|---|---|---|
| JSON | Human-readable, universal | Large size, weak types | Public APIs, config |
| Protobuf | Small, typed, code gen | Not human-readable | High-performance internal |
| Avro | Most compact encoding, good evolution | Reader needs the writer's schema (registry or file header) | Kafka, Hadoop |
| REST | Standard, debuggable | More verbose | External APIs |
| gRPC | Fast, typed, streaming | Complex setup | Internal services |
| Async messaging | Decoupled, resilient | Eventually consistent | Background jobs, events |
Red Flags
❌ Using Java Serializable / Python pickle for network or persistent data
❌ Changing Protobuf field tag numbers
❌ Adding required fields to a schema with existing data
❌ Not preserving unknown fields on re-serialization
❌ Schema evolution without a compatibility strategy
Green Flags
✅ Always make new fields optional with defaults
✅ Use field tags (Protobuf) or schema registry (Avro) for safe evolution
✅ Version your REST APIs explicitly (/v2/)
✅ Use schema registry for Kafka/Avro to enforce compatibility
✅ Test backward AND forward compatibility before deploying schema changes
Modern Additions (2026)
gRPC ecosystem:
├─ ConnectRPC: gRPC-compatible; its own protocol also works over HTTP/1.1 with JSON
├─ Buf CLI: linting, breaking change detection for Protobuf
└─ Protobuf editions (2023+): replace the proto2/proto3 distinction
Async API contracts:
├─ AsyncAPI (2.x/3.x): OpenAPI equivalent for event-driven APIs
├─ CloudEvents: CNCF standard envelope for event metadata
└─ Schema evolution in event sourcing: append-only event log + upcasters
Zero-copy serialization:
├─ Apache Arrow: columnar in-memory, shared between processes
├─ Cap'n Proto / FlatBuffers: no parsing step needed
└─ Used in ML/analytics pipelines for performance
Interview Response Templates
When Asked About API Versioning
“I’d version the API explicitly, either in the URL (/v2/users) or via an Accept header. For internal services using Protobuf/gRPC, I’d rely on field tag evolution: new fields are optional, and field numbers are never changed or reused. For Kafka messages with Avro, I’d use Confluent Schema Registry with BACKWARD compatibility mode so that consumers on the new schema can still read messages written with the old one.”
When Asked About Schema Evolution
“The key rules are: always add new fields as optional with a default value, never remove required fields without first making them optional, and never change field tag numbers in Protobuf. Forward compatibility means old code can read data written by new code, so it must ignore fields it doesn’t know; backward compatibility means new code can read data written by old code, so it needs defaults for fields that data lacks. During rolling upgrades, both directions must work simultaneously.”
Quick Revision Time: 5 minutes
Interview Prep: 15 minutes
Last Updated: 2026-04-13