Chapter 4 Flashcards - Encoding and Evolution

Basic Concepts

What is the difference between encoding and decoding?
?

Encoding (serialization / marshalling): Convert in-memory data structures (objects, structs, maps) into bytes for storage or transmission
Decoding (deserialization / parsing / unmarshalling): Convert bytes back into in-memory data structures
Why needed: In-memory pointers/references can’t be directly stored or sent over a network
Key concern: The encoding format determines how well the system handles schema changes over time
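
A minimal round-trip sketch, assuming JSON as the encoding and a made-up `user` record (neither is prescribed by the card):

```python
import json

# In-memory structure: a dict holding a string, an int, and a list.
user = {"userName": "Martin", "favoriteNumber": 1337,
        "interests": ["daydreaming", "hacking"]}

# Encoding (serialization): in-memory structure -> bytes.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): bytes -> a new in-memory structure.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__)   # bytes
print(decoded == user)          # True: same content
print(decoded is user)          # False: a distinct object rebuilt from bytes
```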

What is backward compatibility vs forward compatibility?
?

Backward compatibility: Newer code can read data written by older code
- Required when: you upgrade the app but old data still exists in DB
- Rule: New code must handle absence of fields added after old data was written
Forward compatibility: Older code can read data written by newer code
- Required when: rolling upgrades — old and new versions running simultaneously
- Rule: Old code must ignore unknown fields it doesn’t recognize

Both are needed simultaneously during rolling upgrades (new code deployed to some nodes while old code still runs on others)
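
Both rules can be sketched with two hypothetical reader versions over JSON; the field names `name` and `email` are assumptions for illustration:

```python
import json

def decode_v1(raw: bytes) -> dict:
    """Old reader: knows only `name`; anything else is ignored (forward compatibility)."""
    data = json.loads(raw)
    return {"name": data["name"]}

def decode_v2(raw: bytes) -> dict:
    """New reader: also knows `email`; a missing value falls back to None (backward compatibility)."""
    data = json.loads(raw)
    return {"name": data["name"], "email": data.get("email")}

old_record = json.dumps({"name": "Ada"}).encode()                              # written by v1
new_record = json.dumps({"name": "Ada", "email": "ada@example.com"}).encode()  # written by v2

print(decode_v2(old_record))   # new code reading old data
print(decode_v1(new_record))   # old code reading new data
```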

Why should you never use language-specific serialization formats for persistent or interoperable data?
?

Language lock-in: Java Serializable can only be read by Java; Python pickle by Python
Poor versioning: No systematic way to evolve schema
Security risks: Deserializing untrusted data can execute arbitrary code (known attack vector)
Fragile: Minor class changes break deserialization of old data
Alternative: Use JSON, Protobuf, Avro, or Thrift: all language-agnostic, and the schema-based ones (Protobuf, Avro, Thrift) have explicit evolution rules
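
The security point can be seen with Python's `pickle`. This is a deliberately harmless sketch: the class name `Innocent` is made up, and the callable is `int` where an attacker would name something like `os.system`.

```python
import pickle

class Innocent:
    """Looks like plain data, but tells pickle which callable to run on load."""
    def __reduce__(self):
        # pickle stores "call int('42')" in the byte stream; loads() executes it.
        return (int, ("42",))

payload = pickle.dumps(Innocent())
result = pickle.loads(payload)   # runs int("42") instead of rebuilding an Innocent

print(result)                        # 42
print(isinstance(result, Innocent))  # False
```

The bytes alone decide which function gets called, which is why unpickling untrusted input is unsafe.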

Encoding Formats

What are the key weaknesses of JSON as a data encoding format?
?

No integer vs float distinction: 42 and 42.0 both just “numbers”
Large integers: Integers above 2^53 lose precision in any parser that reads numbers as IEEE 754 doubles, as JavaScript does
No binary support: Binary data must be base64-encoded (33% size overhead)
No schema enforcement: Any consumer can receive unexpected field types
Verbose: Field names repeated in every record (no compression)
Despite weaknesses: JSON’s universality and human-readability make it dominant for public APIs
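
Three of these weaknesses can be checked in a few lines. A sketch only; Python floats stand in for JavaScript numbers, since both are IEEE 754 doubles:

```python
import base64
import json

# One number type: nothing in the JSON grammar separates integer 42 from float 42.0.
same_number = json.loads("42") == json.loads("42.0")

# Doubles have a 53-bit significand, so 2**53 + 1 rounds to 2**53.
big = 2**53 + 1
lost_precision = float(big) == float(2**53)
# Python's json module keeps arbitrary-precision ints, so the loss only appears
# in parsers that read every number as a double.
python_keeps_it = json.loads(str(big)) == big

# Base64 maps every 3 bytes to 4 ASCII characters: one third more data.
blob = bytes(300)
b64 = base64.b64encode(blob)

print(same_number, lost_precision, python_keeps_it)  # True True True
print(len(blob), len(b64))                           # 300 400
```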

How does Protocol Buffers achieve backward and forward compatibility?
?

Each field has a field tag (number) and type annotation in the schema
Wire format contains field tag + value — NOT field name
This enables:
- Add new field (new tag number, optional): Forward compatible (old code ignores unknown tags), backward compatible (new code uses default if missing)
- Remove field (optional fields only): new data no longer carries the tag, so old code sees the field as missing and uses its default; new code skips the tag when it meets it in old data
Rules:
- Never change a field’s tag number (old data would be misinterpreted)
- Never reuse a tag number for a different field
- New fields must be optional (or have a default) for backward compatibility
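
A hand-rolled sketch of the tag-plus-value idea, covering only two Protobuf wire types (varint and length-delimited) and non-negative integers. It is not the real Protobuf library, and the field names and the `email` tag are assumptions:

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types handled in this sketch

def encode_field(tag: int, value) -> bytes:
    """The key byte holds (tag << 3 | wire_type); the field name is never written."""
    if isinstance(value, int):
        return encode_varint(tag << 3 | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint(tag << 3 | LENGTH_DELIMITED) + encode_varint(len(data)) + data

def decode(buf: bytes, known_tags: dict[int, str]) -> dict:
    """Keep fields whose tag is known; parse and drop the rest."""
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length].decode("utf-8"), pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known_tags:
            record[known_tags[tag]] = value
    return record

# New writer: tag 3 (email) was added after the old reader shipped.
message = encode_field(1, "Martin") + encode_field(2, 1337) + encode_field(3, "m@example.com")
old_reader = decode(message, {1: "user_name", 2: "favorite_number"})

# Old writer, new reader: tags 2 and 3 are simply absent.
new_reader = decode(encode_field(1, "Martin"), {1: "user_name", 2: "favorite_number", 3: "email"})

print(message[:11].hex())  # 0a064d617274696e10b90a
print(old_reader)          # {'user_name': 'Martin', 'favorite_number': 1337}
print(new_reader)          # {'user_name': 'Martin'}
```

The wire type in each key is what lets the old reader step over tag 3 without knowing what it means; the new reader would fill the missing fields with defaults.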

How does Avro differ from Thrift and Protocol Buffers in its approach to schema evolution?
?

Thrift/Protobuf: Field tag numbers in binary; reader uses tag to identify field
Avro: No field tags; binary contains only values in schema-defined order
Avro evolution mechanism:
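- Writer's schema travels with the data: written once in the header of an Avro container file, or referenced by a schema ID/version in each individual message
- Reader obtains the writer's schema (from the file header, or from a schema registry using the ID)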
- Avro engine matches writer’s fields to reader’s fields by name
- Fields in writer but not in reader: ignored
- Fields in reader but not in writer: use default value
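Implication: Renaming a field in Avro needs an alias in the reader's schema (backward compatible, but not forward compatible); in Protobuf a rename is free because names never appear on the wire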
Best use: Avro excels in Kafka (one schema per topic, schema registry)
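
The by-name matching can be sketched as follows. The schemas are reduced to ordered `(field_name, default)` pairs, which is a simplification of real Avro schemas, and the field names are illustrative:

```python
NO_DEFAULT = object()  # sentinel: the field declares no default

# Writer and reader disagree on field order and on which fields exist.
writer_schema = [("userName", NO_DEFAULT), ("favoriteNumber", None), ("photoURL", None)]
reader_schema = [("userName", NO_DEFAULT), ("favoriteColor", "unknown"), ("favoriteNumber", None)]

def resolve(values: list, writer: list, reader: list) -> dict:
    """Map values laid out in writer order onto the reader's fields by name."""
    written = {name: value for (name, _), value in zip(writer, values)}
    record = {}
    for name, default in reader:
        if name in written:
            record[name] = written[name]      # present in both schemas
        elif default is not NO_DEFAULT:
            record[name] = default            # reader-only field: use its default
        else:
            raise ValueError(f"reader field {name!r} not written and has no default")
    return record                             # writer-only fields are never copied

# Values as they would come off the wire, in the writer's field order.
decoded_values = ["Martin", 1337, "https://example.com/m.png"]
record = resolve(decoded_values, writer_schema, reader_schema)
print(record)  # {'userName': 'Martin', 'favoriteColor': 'unknown', 'favoriteNumber': 1337}
```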

What encoding size can you expect from JSON vs Protobuf for typical records?
?

JSON: 81 bytes for DDIA's example record ({"userName":"Martin","favoriteNumber":1337,"interests":["daydreaming","hacking"]}, whitespace removed)
MessagePack (binary JSON): 66 bytes (types encoded compactly, but field names are still in every record)
Thrift CompactProtocol: 34 bytes (BinaryProtocol: 59 bytes)
Protocol Buffers: 33 bytes
Avro: 32 bytes (no field names OR tags in wire format)
Key insight: Schema-based binary formats are ~2.5x more compact than JSON; schemaless binary variants such as MessagePack save far less
For millions of events per second (Kafka), size difference = significant storage and bandwidth savings
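
Why Avro is so compact can be seen by encoding DDIA's example record by hand. A sketch assuming the schema `userName: string`, `favoriteNumber: union [null, long]`, `interests: array of string`; it does not use the Avro library:

```python
import json

def zigzag_varint(n: int) -> bytes:
    """Avro int/long: zigzag-map signed to unsigned, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def avro_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data   # length prefix, then UTF-8 bytes

def encode_person(user_name: str, favorite_number, interests: list[str]) -> bytes:
    out = avro_string(user_name)
    # Union [null, long]: branch index first, then the value if the branch has one.
    if favorite_number is None:
        out += zigzag_varint(0)
    else:
        out += zigzag_varint(1) + zigzag_varint(favorite_number)
    # Array: one block with an item count, then a zero count as terminator.
    if interests:
        out += zigzag_varint(len(interests))
        for item in interests:
            out += avro_string(item)
    out += zigzag_varint(0)
    return out

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}
avro_bytes = encode_person(record["userName"], record["favoriteNumber"], record["interests"])
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(len(json_bytes), len(avro_bytes))  # 81 32
```

Nothing in the 32 bytes identifies a field; values are only meaningful when read in the order the writer's schema declares.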

Schema Evolution Rules

What are the safe schema evolution operations for Protocol Buffers?
?
Safe (backward + forward compatible):

✅ Add new optional field (new tag number, with default)
✅ Remove optional field (retire its tag number and never reuse it)
✅ Rename a field (tag unchanged, name just for code readability)

Unsafe:

❌ Add required field (old data doesn’t have it → new code fails)
❌ Remove required field (new data lacks it → old code fails)
❌ Change a field’s tag number (old data has different meaning)
❌ Reuse a deleted tag number for new field (old data still has values for it)
❌ Change field type incompatibly (e.g., int32 to string)
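
These rules lend themselves to an automated check. A rough sketch with a hypothetical `Field` model keyed by tag number; it treats any type change as incompatible, which is stricter than Protobuf actually is:

```python
from typing import NamedTuple

class Field(NamedTuple):
    name: str
    type: str
    required: bool = False

def unsafe_changes(old: dict[int, Field], new: dict[int, Field],
                   reserved: frozenset = frozenset()) -> list[str]:
    """Compare two schema versions and list risky edits.

    `reserved` holds tag numbers that were deleted in earlier versions.
    """
    problems = []
    for tag, field in new.items():
        if tag in reserved:
            problems.append(f"tag {tag}: reuses a deleted tag number")
        elif tag not in old:
            if field.required:
                problems.append(f"tag {tag}: new field {field.name!r} is required")
        elif old[tag].type != field.type:
            problems.append(f"tag {tag}: type changed {old[tag].type} -> {field.type}")
        # Same tag and type with a new name is fine: names are not on the wire.
    for tag, field in old.items():
        if tag not in new and field.required:
            problems.append(f"tag {tag}: required field {field.name!r} was removed")
    return problems

v1 = {1: Field("user_name", "string", required=True), 2: Field("favorite_number", "int64")}

# Rename tag 1, drop optional tag 2, add optional tag 3: all safe.
v2_safe = {1: Field("name", "string", required=True), 3: Field("email", "string")}

# Change a type and add a required field: both flagged.
v2_bad = {1: Field("user_name", "string", required=True),
          2: Field("favorite_number", "string"),
          4: Field("tenant", "string", required=True)}

safe_report = unsafe_changes(v1, v2_safe)
bad_report = unsafe_changes(v1, v2_bad)
print(safe_report)  # []
for line in bad_report:
    print(line)
```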

What is a schema registry and why is it required for Avro with Kafka?
?

Problem: Avro binary contains no field names or tags — without the schema, the bytes are unreadable
Schema registry: Central service that stores versioned schemas with IDs
Flow:
1. Producer registers schema → gets schema ID (e.g., 42)
2. Producer prepends the schema ID to each Kafka message (Confluent wire format: 1 magic byte, then a 4-byte big-endian ID)
3. Consumer reads schema ID from message → fetches schema from registry
4. Consumer decodes message using writer’s schema + applies reader’s schema
Compatibility enforcement: Registry rejects schema versions that violate compatibility rules
Products: Confluent Schema Registry, AWS Glue Schema Registry, Buf Schema Registry
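
The flow can be sketched with an in-memory stand-in for the registry. `InMemoryRegistry` is hypothetical; the framing follows the Confluent convention of a zero magic byte followed by a 4-byte big-endian schema ID, and the payload is left as opaque bytes:

```python
import struct

class InMemoryRegistry:
    """Hypothetical stand-in for a schema registry service."""
    def __init__(self):
        self._schemas: list[str] = []

    def register(self, schema: str) -> int:
        if schema not in self._schemas:
            self._schemas.append(schema)
        return self._schemas.index(schema) + 1   # IDs start at 1

    def fetch(self, schema_id: int) -> str:
        return self._schemas[schema_id - 1]

MAGIC_BYTE = 0

def frame(schema_id: int, payload: bytes) -> bytes:
    # ">bI": big-endian, 1-byte magic marker, 4-byte unsigned schema ID.
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes) -> tuple[int, bytes]:
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown framing")
    return schema_id, message[5:]

registry = InMemoryRegistry()
writer_schema = '{"type":"record","name":"User","fields":[{"name":"userName","type":"string"}]}'

# Producer side: register once, then prefix every message.
schema_id = registry.register(writer_schema)
message = frame(schema_id, b"\x0cMartin")     # payload: Avro-encoded string "Martin"

# Consumer side: read the ID, look up the writer's schema, then decode the payload.
got_id, payload = unframe(message)
fetched = registry.fetch(got_id)

print(message[:5].hex())         # 0000000001
print(fetched == writer_schema)  # True
```

A real consumer would cache fetched schemas by ID so the registry is not on the hot path of every message.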

Modes of Dataflow

What are the three modes of dataflow described in DDIA Chapter 4?
?

Dataflow through databases
- Write = encode; read = decode
- Rolling upgrades: old and new code versions access same data simultaneously
- Key issue: preserve unknown fields when re-writing records
Dataflow through services (REST/RPC)
- Client sends request; server responds
- API versioning needed for independent deployments
- Older clients may talk to newer servers (or vice versa)
Dataflow through async message passing
- Producer sends to message broker (Kafka, RabbitMQ)
- Consumer reads at own pace; messages may be consumed hours/days later
- Key issue: messages must remain decodable long after production

What is the “preservation of unknown fields” problem in database dataflow?
?

Scenario: App v2 adds a new field email to the schema
App v1 (old code) reads and re-writes a record originally written by v2
Problem: App v1 doesn’t know about email; if it drops unknown fields on re-write, the email value is permanently lost
Solution: Applications should preserve and re-emit unknown fields even if they don’t understand them
In practice: ORMs and serialization libraries often silently drop unknown fields — needs explicit handling
This is why forward compatibility in databases is trickier than it sounds
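
One way to handle this explicitly is for the old model to carry fields it does not understand. A sketch with a hypothetical `UserV1` class over JSON:

```python
import json
from dataclasses import dataclass, field

@dataclass
class UserV1:
    """Old model: understands only `name`, but keeps whatever else it was given."""
    name: str
    unknown: dict = field(default_factory=dict)

    @classmethod
    def decode(cls, raw: str) -> "UserV1":
        data = json.loads(raw)
        name = data.pop("name")
        return cls(name=name, unknown=data)   # leftovers are kept, not dropped

    def encode(self) -> str:
        # Re-emit the unknown fields alongside the ones this version owns.
        return json.dumps({**self.unknown, "name": self.name})

stored = json.dumps({"name": "Ada", "email": "ada@example.com"})  # written by v2

user = UserV1.decode(stored)
user.name = "Ada Lovelace"            # v1 updates the field it understands
rewritten = json.loads(user.encode())

print(rewritten)  # {'email': 'ada@example.com', 'name': 'Ada Lovelace'}
```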

What is the difference between REST and RPC, and when should you use each?
?
REST:

Uses HTTP features: verbs (GET/POST/PUT/DELETE), status codes, caching headers
Resources identified by URLs; stateless
Human-readable (JSON), easy to debug, browser-accessible
Schema optional (OpenAPI)
Best for: Public APIs, external clients, browser-accessible services

RPC (gRPC):

Makes remote call look like local function call
Protocol Buffers over HTTP/2; bidirectional streaming
Code generation in multiple languages from proto files
More efficient, strongly typed
Best for: Internal microservice-to-microservice communication

Key insight: RPC’s “looks like a local call” abstraction is leaky — networks are unreliable; retries, timeouts, and idempotency must be handled explicitly
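
The leak shows up when a response is lost after the server has already done the work. A sketch of retrying with an idempotency key; `FlakyPaymentServer` is a made-up in-process stand-in for a remote service:

```python
import uuid

class FlakyPaymentServer:
    """Hypothetical server: applies the charge, then loses the first response."""
    def __init__(self):
        self.balance = 100
        self._seen: dict[str, int] = {}
        self._drop_next_response = True

    def charge(self, idempotency_key: str, amount: int) -> int:
        if idempotency_key not in self._seen:      # apply each key at most once
            self.balance -= amount
            self._seen[idempotency_key] = self.balance
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost in transit")
        return self._seen[idempotency_key]

def charge_with_retry(server, amount: int, attempts: int = 3) -> int:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    key = str(uuid.uuid4())        # the same key is reused on every retry
    for attempt in range(attempts):
        try:
            return server.charge(key, amount)
        except TimeoutError:
            if attempt == attempts - 1:
                raise              # out of retries: surface the failure
    raise AssertionError("unreachable")

server = FlakyPaymentServer()
result = charge_with_retry(server, 30)
print(result, server.balance)  # 70 70
```

Without the key, the retry would charge twice, because the client cannot tell a lost request from a lost response.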

Async Messaging

What are the advantages of message brokers over direct service calls?
?

Decoupling: Producer and consumer don’t need to be up simultaneously
Buffering: Broker absorbs traffic spikes; consumer processes at its own pace
Fan-out: One message, multiple independent consumers
Retry/redelivery: Broker retries if consumer fails
Ordering: Can preserve message order within a partition (Kafka)
Durability: Messages persisted until consumed (Kafka: retained for a configurable period, whether or not they have been consumed)

Trade-offs: Eventual consistency; harder to debug; more operational complexity

What is the difference between a message queue and a message log?
?
Message queue (RabbitMQ, SQS):

Delivery: broker pushes to consumers (RabbitMQ) or consumers poll the queue (SQS)
Message deleted after successful consumption
Each message consumed by one consumer (competing consumers)
Best for: task queues, work distribution

Message log (Kafka):

Pull model: consumers track their own offset
Messages retained on disk (configurable retention, e.g., 7 days)
Multiple consumer groups each read full log independently
Replay: consumers can re-read old messages
Best for: event streams, audit logs, stream processing, event sourcing
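
The two consumption models can be contrasted with toy in-memory classes. Acknowledgements, partitions, and retention are all left out, and the class names are made up:

```python
from collections import deque

class Queue:
    """Queue semantics: a message is gone once some consumer takes it."""
    def __init__(self):
        self._items = deque()

    def publish(self, msg):
        self._items.append(msg)

    def consume(self):
        return self._items.popleft() if self._items else None

class Log:
    """Log semantics: append-only; every consumer group tracks its own offset."""
    def __init__(self):
        self._entries = []
        self._offsets = {}

    def publish(self, msg):
        self._entries.append(msg)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._entries):
            return None
        self._offsets[group] = offset + 1
        return self._entries[offset]

    def seek(self, group, offset):
        self._offsets[group] = offset    # replay from an earlier position

queue, log = Queue(), Log()
for msg in ("a", "b"):
    queue.publish(msg)
    log.publish(msg)

# Competing consumers split the queue between them.
queue_reads = [queue.consume(), queue.consume(), queue.consume()]

# Each group sees the full log; a group can rewind and re-read.
billing = [log.poll("billing"), log.poll("billing")]
analytics = [log.poll("analytics"), log.poll("analytics")]
log.seek("billing", 0)
replayed = log.poll("billing")

print(queue_reads)                   # ['a', 'b', None]
print(billing, analytics, replayed)  # ['a', 'b'] ['a', 'b'] a
```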

Modern Context (2026)

How has gRPC changed internal microservice communication since DDIA was written?
?

gRPC (open-sourced in 2015, 1.0 released in 2016) has become a widely adopted internal RPC framework
Why: Protocol Buffers + HTTP/2 = compact, typed, multiplexed, streaming
Code generation: auto-generated clients/servers in 10+ languages from .proto files
Bidirectional streaming: client and server can both stream on one call (plain request-response REST has no equivalent)
2026 additions:
- ConnectRPC: gRPC-compatible protocol that works with standard HTTP proxies
- Buf CLI + Buf Schema Registry: Protobuf schema management, linting, breaking change detection
- grpc-gateway: Automatically exposes gRPC as REST JSON API
Caveat: Debugging binary Protobuf is harder; need specialized tools (grpcurl, Postman)

What is AsyncAPI and how does it relate to OpenAPI?
?

OpenAPI (Swagger): Standard specification for synchronous REST APIs (request-response)
AsyncAPI: Equivalent standard for asynchronous/event-driven APIs (Kafka, WebSocket, AMQP, etc.)
AsyncAPI 2.x/3.x defines:
- Channels (topics/queues), messages, schemas, bindings per protocol
- Same expressive power as OpenAPI but for async messaging
CloudEvents: CNCF standard for event metadata envelope (source, type, time, id)
- Not a schema format but a standard wrapper for events
Why matters: As event-driven architectures proliferate, async contracts need the same rigor as REST contracts

Interview Scenarios

You’re designing an event streaming system with Kafka. How do you handle schema evolution?
?
Strategy: Avro + Schema Registry with BACKWARD compatibility

Setup:

Use Avro encoding for all Kafka messages
Deploy Confluent Schema Registry (or AWS Glue)
Configure topics with BACKWARD compatibility (new consumers can read old messages)

Evolution rules:

Adding fields: always with default values ✅
Removing fields: only remove fields that have a default value, so readers still on the old schema can fill the gap ✅
Renaming: use schema aliases for Avro (or add new field + deprecate old)

Deployment order (for backward compatibility):

Deploy consumers first (with new schema that handles both old and new messages)
Deploy producers second (start sending new message format)

Why Avro > JSON here: Avro is roughly 2-3x smaller, the schema is enforced, and the registry manages evolution

A microservice needs to be deployed without coordinating with all its callers. How do you version its API?
?
Pattern: Backward-compatible evolution + explicit versioning for breaking changes

Non-breaking changes (deploy anytime):

Add new optional fields to responses
Add new optional query parameters
Add new endpoints

Breaking changes (require versioning):

Remove fields from response
Change field semantics/type
Remove endpoints

Versioning approaches:

URL versioning: /v1/users, /v2/users — most visible, easy to route
Header versioning: Accept: application/vnd.myapi.v2+json — cleaner URLs
Date-based versioning: a version header whose value is a release date, e.g. Accept-Version: 2026-01-01 (modeled on Stripe's dated Stripe-Version header)

Recommended: URL versioning for major versions + backward-compatible field additions without version bump

Quick Facts

What are the three levels of schema compatibility enforced by schema registries?
?

BACKWARD: New schema can read data written with previous schema
- Safe for consumers: deploy consumers first, then producers
FORWARD: Previous schema can read data written with new schema
- Safe for producers: deploy producers first, then consumers
FULL: Both BACKWARD and FORWARD
- Safest: can deploy in any order
- Most restrictive: only add/remove optional fields with defaults

NONE: No compatibility checking (dangerous for production)
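
The levels reduce to one question asked in two directions: can a reader on schema X decode data written with schema Y? A sketch with schemas simplified to `{field_name: has_default}`; a real registry also checks types and, in transitive modes, every earlier version:

```python
def can_read(reader: dict[str, bool], writer: dict[str, bool]) -> bool:
    """True if every reader field was either written or has a default."""
    return all(name in writer or has_default for name, has_default in reader.items())

def compatibility(old: dict[str, bool], new: dict[str, bool]) -> str:
    backward = can_read(reader=new, writer=old)   # new schema reads old data
    forward = can_read(reader=old, writer=new)    # old schema reads new data
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "NONE"

v1 = {"id": False, "name": False}   # neither field has a default

levels = {
    "add field with default": compatibility(v1, {"id": False, "name": False, "email": True}),
    "add field without default": compatibility(v1, {"id": False, "name": False, "email": False}),
    "remove field without default": compatibility(v1, {"id": False}),
    "swap one for another": compatibility(v1, {"id": False, "email": False}),
}
for change, level in levels.items():
    print(f"{change}: {level}")
```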

What is the size difference between a Protobuf encoding and the same data as JSON?
?

Typical ratio: Protobuf is 2-3x smaller than JSON
Why:
- No field names in binary (just numeric tags)
- Integers encoded as varints (e.g., 1337 takes 2 bytes vs 4 characters in JSON)
- No whitespace, quotes, brackets overhead
Real example (DDIA's three-field record for user "Martin"):
- JSON: 81 bytes
- Protobuf: 33 bytes (~2.5x smaller)
Impact at scale: For 1 billion events/day, 2.5x smaller = ~60% less storage and bandwidth (1 - 1/2.5)
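
The ratio and the saving follow directly from the two byte counts. A quick check, where the JSON size is measured from DDIA's example record and the Protobuf size is taken as the quoted 33 bytes rather than recomputed:

```python
import json

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}
json_size = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))
protobuf_size = 33                           # quoted figure, assumed here

ratio = json_size / protobuf_size            # how many times smaller
reduction = 1 - protobuf_size / json_size    # fraction of bytes saved

events_per_day = 1_000_000_000
json_gb = json_size * events_per_day / 1e9   # payload bytes only, no framing
protobuf_gb = protobuf_size * events_per_day / 1e9

print(json_size)                             # 81
print(f"{ratio:.2f}x, {reduction:.0%}")      # 2.45x, 59%
print(json_gb, protobuf_gb)                  # 81.0 33.0
```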

Total Cards: 20
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13

Study Notes by Niladri & AI
