Chapter 4 Flashcards - Encoding and Evolution
Basic Concepts
What is the difference between encoding and decoding?
?
- Encoding (serialization / marshalling): Convert in-memory data structures (objects, structs, maps) into bytes for storage or transmission
- Decoding (deserialization / parsing / unmarshalling): Convert bytes back into in-memory data structures
- Why needed: In-memory pointers/references can’t be directly stored or sent over a network
- Key concern: The encoding format determines how well the system handles schema changes over time
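A minimal sketch of the round trip, assuming Python's standard `json` module as the encoding; the record contents are made up for illustration:

```python
import json

# In-memory data structure (a dict holding a string and a list).
record = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

# Encoding: in-memory structure -> bytes suitable for a file or a socket.
encoded = json.dumps(record).encode("utf-8")

# Decoding: bytes -> a new in-memory structure with equal contents.
decoded = json.loads(encoded.decode("utf-8"))
```

The decoded value is a fresh object that compares equal to the original, which is the point: only the contents survive the trip, not the in-memory references.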
What is backward compatibility vs forward compatibility?
?
- Backward compatibility: Newer code can read data written by older code
- Required when: you upgrade the app but old data still exists in DB
- Rule: New code must handle absence of fields added after old data was written
- Forward compatibility: Older code can read data written by newer code
- Required when: rolling upgrades — old and new versions running simultaneously
- Rule: Old code must ignore unknown fields it doesn’t recognize
Both are needed simultaneously during rolling upgrades (new code deployed to some nodes while old code still runs on others)
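The two rules can be sketched with a pair of hypothetical readers over JSON; `read_v1`, `read_v2`, and the `email` field are illustrative names, not part of any real API:

```python
import json

def read_v1(raw: bytes) -> dict:
    """Old reader: knows only 'name' and ignores anything else (forward compatible)."""
    data = json.loads(raw)
    return {"name": data["name"]}

def read_v2(raw: bytes) -> dict:
    """New reader: also knows 'email' and tolerates its absence (backward compatible)."""
    data = json.loads(raw)
    return {"name": data["name"], "email": data.get("email")}

old_data = json.dumps({"name": "Ada"}).encode()
new_data = json.dumps({"name": "Ada", "email": "ada@example.com"}).encode()

new_code_reads_old = read_v2(old_data)   # email falls back to None
old_code_reads_new = read_v1(new_data)   # unknown email is ignored
```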
Why should you never use language-specific serialization formats for persistent or interoperable data?
?
- Language lock-in: Java Serializable can only be read by Java; Python pickle by Python
- Poor versioning: No systematic way to evolve schema
- Security risks: Deserializing untrusted data can execute arbitrary code (known attack vector)
- Fragile: Minor class changes break deserialization of old data
- Alternative: Use JSON, Protobuf, Avro, or Thrift — language-agnostic and well-versioned
Encoding Formats
What are the key weaknesses of JSON as a data encoding format?
?
- No integer vs float distinction: 42 and 42.0 are both just “numbers”
- Large integers: Numbers > 2^53 lose precision in JavaScript (JSON was designed for JS)
- No binary support: Binary data must be base64-encoded (33% size overhead)
- No schema enforcement: Any consumer can receive unexpected field types
- Verbose: Field names repeated in every record (no compression)
- Despite weaknesses: JSON’s universality and human-readability make it dominant for public APIs
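A small sketch of the numeric and binary weaknesses, using only the Python standard library. The conversion to `float` is an assumption that stands in for a parser that stores every number as an IEEE 754 double, as JavaScript does:

```python
import base64
import json

# Integers above 2^53 are not all exactly representable as doubles.
big = 2**53 + 1
as_double = float(big)                 # simulates a double-based JSON parser
lost_precision = int(as_double) != big

# Python's json module happens to keep 42 and 42.0 apart, but the JSON
# text carries no type tag, so other parsers may treat both as doubles.
int_and_float = json.loads("[42, 42.0]")

# Base64 turns every 3 bytes into 4 ASCII characters (about 33% larger).
payload = bytes(300)
b64 = base64.b64encode(payload)
overhead = len(b64) / len(payload)
```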
How does Protocol Buffers achieve backward and forward compatibility?
?
- Each field has a field tag (number) and type annotation in the schema
- Wire format contains field tag + value — NOT field name
- This enables:
- Add new field (new tag number, optional): Forward compatible (old code ignores unknown tags), backward compatible (new code uses default if missing)
- Remove field: Old tag is no longer present in new data — old code that recognizes it gets nothing, new code ignores the old field
- Rules:
- Never change a field’s tag number (old data would be misinterpreted)
- Never reuse a tag number for a different field
- New fields must be optional (or have a default) for backward compatibility
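The tag-based wire format can be sketched by hand. This is a simplified illustration that handles only varint and length-delimited string fields, not the real Protobuf library; the field names `user_name`, `favorite_number`, and `email` are assumptions:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

VARINT, LEN = 0, 2  # the two wire types this sketch supports

def encode_field(tag: int, value) -> bytes:
    """Emit key (tag + wire type) followed by the value; no field name."""
    if isinstance(value, int):
        return encode_varint((tag << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((tag << 3) | LEN) + encode_varint(len(data)) + data

def decode(buf: bytes, known: dict[int, str]) -> dict:
    """Decode fields; tags missing from `known` are skipped, not errors."""
    out, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LEN:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known:
            out[known[tag]] = value
    return out

# A "new" writer emits tags 1 and 2 plus a newly added tag 3.
message = (encode_field(1, "Martin") + encode_field(2, 1337)
           + encode_field(3, "m@example.com"))

# Old reader does not know tag 3 and skips it (forward compatibility).
old_reader = decode(message, {1: "user_name", 2: "favorite_number"})

# New reader sees old data without tag 3 and applies a default (backward compatibility).
new_reader = decode(encode_field(1, "Martin"), {1: "user_name", 3: "email"})
email = new_reader.get("email", "")
```

Because the reader maps tags to names through its own table, renaming a field changes nothing on the wire, while changing or reusing a tag number silently changes the meaning of existing bytes.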
How does Avro differ from Thrift and Protocol Buffers in its approach to schema evolution?
?
- Thrift/Protobuf: Field tag numbers in binary; reader uses tag to identify field
- Avro: No field tags; binary contains only values in schema-defined order
- Avro evolution mechanism:
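- Writer’s schema travels with the data: embedded once in the header of an Avro file, or referenced by a schema ID in each message (Kafka with a schema registry)
- Reader obtains the writer’s schema from the file header, or fetches it from the schema registry using the ID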
- Avro engine matches writer’s fields to reader’s fields by name
- Fields in writer but not in reader: ignored
- Fields in reader but not in writer: use default value
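- Implication: Renaming a field in Avro breaks compatibility unless the reader’s schema lists the old name as an alias (which is backward compatible only); in Protobuf a rename is harmless because only the tag is encoded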
- Best use: Avro excels in Kafka (one schema per topic, schema registry)
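The name-based matching can be sketched with plain dictionaries. This is not the Avro library: `resolve`, `NO_DEFAULT`, and the schema shapes are simplified assumptions for illustration:

```python
NO_DEFAULT = object()  # sentinel: the reader's field has no default value

def resolve(writer_fields: list[str], values: list, reader_schema: dict) -> dict:
    """Match the writer's values to the reader's fields by name.

    writer_fields: field names in the order the writer encoded them.
    reader_schema: {field_name: default}, with NO_DEFAULT for required fields.
    """
    written = dict(zip(writer_fields, values))
    record = {}
    for name, default in reader_schema.items():
        if name in written:
            record[name] = written[name]      # field present in both schemas
        elif default is not NO_DEFAULT:
            record[name] = default            # in reader only: use default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record                             # writer-only fields are dropped

writer_fields = ["userName", "favoriteNumber"]
values = ["Martin", 1337]

# Reader dropped favoriteNumber and added email with a default.
resolved = resolve(writer_fields, values, {"userName": NO_DEFAULT, "email": None})

# Reader renamed userName without an alias: the names no longer match.
try:
    resolve(writer_fields, values, {"user_name": NO_DEFAULT})
    rename_failed = False
except ValueError:
    rename_failed = True
```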
What encoding size can you expect from JSON vs Protobuf for typical records?
?
- JSON: ~60-100 bytes for a typical record (e.g., {"userName":"Martin","age":1337})
- MessagePack (binary JSON): ~60-70 bytes (removes whitespace, encodes types compactly)
- Thrift CompactProtocol: ~30-35 bytes
- Protocol Buffers: ~30-35 bytes
- Avro: ~25-35 bytes (no field names OR tags in wire format)
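- Key insight: Schema-based binary formats (Thrift, Protobuf, Avro) are ~2-3x more compact than JSON; MessagePack saves far less because it still carries field names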
- For millions of events per second (Kafka), size difference = significant storage and bandwidth savings
Schema Evolution Rules
What are the safe schema evolution operations for Protocol Buffers?
?
Safe (backward + forward compatible):
- ✅ Add new optional field (new tag number, with default)
- ✅ Remove optional field (old tag becomes unused)
- ✅ Rename a field (tag unchanged, name just for code readability)
Unsafe:
- ❌ Add required field (old data doesn’t have it → new code fails)
- ❌ Remove required field (new data lacks it → old code fails)
- ❌ Change a field’s tag number (old data has different meaning)
- ❌ Reuse a deleted tag number for new field (old data still has values for it)
- ❌ Change field type incompatibly (e.g., int32 to string)
What is a schema registry and why is it required for Avro with Kafka?
?
- Problem: Avro binary contains no field names or tags — without the schema, the bytes are unreadable
- Schema registry: Central service that stores versioned schemas with IDs
- Flow:
- Producer registers schema → gets schema ID (e.g., 42)
- Producer prepends 4-byte schema ID to each Kafka message
- Consumer reads schema ID from message → fetches schema from registry
- Consumer decodes message using writer’s schema + applies reader’s schema
- Compatibility enforcement: Registry rejects schema versions that violate compatibility rules
- Products: Confluent Schema Registry, AWS Glue Schema Registry, Buf Schema Registry
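The message framing in the flow above can be sketched as follows. The leading zero "magic byte" follows Confluent's wire format (one zero byte, then the 4-byte big-endian schema ID); the `registry` dict is a hypothetical in-memory stand-in for the real service, and the payload bytes are a placeholder rather than real Avro output:

```python
import struct

MAGIC_BYTE = 0  # Confluent's framing starts with a single zero byte

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the magic byte and the 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    return schema_id, message[5:]

# Hypothetical stand-in for the registry: schema ID -> writer's schema.
registry = {42: {"type": "record", "name": "User",
                 "fields": [{"name": "userName", "type": "string"}]}}

message = frame(42, b"\x0cMartin")          # placeholder payload bytes
schema_id, payload = unframe(message)
writer_schema = registry[schema_id]          # consumer-side lookup
```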
Modes of Dataflow
What are the three modes of dataflow described in DDIA Chapter 4?
?
- Dataflow through databases
- Write = encode; read = decode
- Rolling upgrades: old and new code versions access same data simultaneously
- Key issue: preserve unknown fields when re-writing records
- Dataflow through services (REST/RPC)
- Client sends request; server responds
- API versioning needed for independent deployments
- Older clients may talk to newer servers (or vice versa)
- Dataflow through async message passing
- Producer sends to message broker (Kafka, RabbitMQ)
- Consumer reads at own pace; messages may be consumed hours/days later
- Key issue: messages must remain decodable long after production
What is the “preservation of unknown fields” problem in database dataflow?
?
- Scenario: App v2 adds a new field email to the schema
- App v1 (old code) reads and re-writes a record originally written by v2
- Problem: App v1 doesn’t know about email; if it drops unknown fields on re-write, the email value is permanently lost
- Solution: Applications should preserve and re-emit unknown fields even if they don’t understand them
- In practice: ORMs and serialization libraries often silently drop unknown fields — needs explicit handling
- This is why forward compatibility in databases is trickier than it sounds
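A minimal sketch of the lossy and the preserving re-write, using JSON records. The function names and the `KNOWN_V1_FIELDS` set are hypothetical:

```python
import json

KNOWN_V1_FIELDS = {"id", "name"}  # everything the v1 model class declares

def update_name_lossy(raw: str, new_name: str) -> str:
    """v1 code that rebuilds the record from only the fields it models."""
    data = json.loads(raw)
    model = {k: data[k] for k in KNOWN_V1_FIELDS}
    model["name"] = new_name
    return json.dumps(model)

def update_name_preserving(raw: str, new_name: str) -> str:
    """v1 code that edits the decoded record in place, keeping unknown fields."""
    data = json.loads(raw)
    data["name"] = new_name
    return json.dumps(data)

written_by_v2 = json.dumps({"id": 1, "name": "Ada", "email": "ada@example.com"})

lossy = json.loads(update_name_lossy(written_by_v2, "Ada L."))
preserved = json.loads(update_name_preserving(written_by_v2, "Ada L."))
```

The lossy variant mirrors what happens when a record is decoded into a typed model object and re-encoded from that object alone.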
What is the difference between REST and RPC, and when should you use each?
?
REST:
- Uses HTTP features: verbs (GET/POST/PUT/DELETE), status codes, caching headers
- Resources identified by URLs; stateless
- Human-readable (JSON), easy to debug, browser-accessible
- Schema optional (OpenAPI)
- Best for: Public APIs, external clients, browser-accessible services
RPC (gRPC):
- Makes remote call look like local function call
- Protocol Buffers over HTTP/2; bidirectional streaming
- Code generation in multiple languages from proto files
- More efficient, strongly typed
- Best for: Internal microservice-to-microservice communication
Key insight: RPC’s “looks like a local call” abstraction is leaky — networks are unreliable; retries, timeouts, and idempotency must be handled explicitly
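A sketch of why retries need idempotency. `FlakyServer` is a hypothetical in-process stand-in for a remote service whose first reply is lost after the work has already been done:

```python
import uuid

class FlakyServer:
    """Applies the charge, but the reply to the first call is 'lost'."""

    def __init__(self) -> None:
        self.balance = 100
        self.seen: dict[str, int] = {}    # idempotency key -> stored result
        self.calls = 0

    def charge(self, key: str, amount: int) -> int:
        self.calls += 1
        if key not in self.seen:          # apply each key at most once
            self.balance -= amount
            self.seen[key] = self.balance
        if self.calls == 1:
            raise TimeoutError("reply lost in the network")
        return self.seen[key]

def charge_with_retry(server: FlakyServer, amount: int, attempts: int = 3) -> int:
    key = str(uuid.uuid4())               # the same key is reused on every retry
    for attempt in range(attempts):
        try:
            return server.charge(key, amount)
        except TimeoutError:
            if attempt == attempts - 1:
                raise                     # out of attempts: surface the failure
    raise RuntimeError("attempts must be at least 1")

server = FlakyServer()
new_balance = charge_with_retry(server, 30)
```

Without the idempotency key, the retry after the lost reply would charge the account twice; a local function call never has this failure mode.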
Async Messaging
What are the advantages of message brokers over direct service calls?
?
- Decoupling: Producer and consumer don’t need to be up simultaneously
- Buffering: Broker absorbs traffic spikes; consumer processes at its own pace
- Fan-out: One message, multiple independent consumers
- Retry/redelivery: Broker retries if consumer fails
- Ordering: Can preserve message order within a partition (Kafka)
- Durability: Messages persisted until consumed (Kafka: retained for a configurable period, whether or not they have been consumed)
Trade-offs: Eventual consistency; harder to debug; more operational complexity
What is the difference between a message queue and a message log?
?
Message queue (RabbitMQ, SQS):
- Typically a push model: RabbitMQ pushes to subscribed consumers, while SQS consumers poll for messages
- Message deleted after successful consumption
- Each message consumed by one consumer (competing consumers)
- Best for: task queues, work distribution
Message log (Kafka):
- Pull model: consumers track their own offset
- Messages retained on disk (configurable retention, e.g., 7 days)
- Multiple consumer groups each read full log independently
- Replay: consumers can re-read old messages
- Best for: event streams, audit logs, stream processing, event sourcing
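The contrast can be sketched with in-memory structures. These are toy stand-ins, not broker clients, and the group names are made up:

```python
from collections import deque

# Queue: a message is removed once a consumer takes it.
queue = deque(["job-1", "job-2", "job-3"])
worker_a = queue.popleft()
worker_b = queue.popleft()        # a competing consumer gets a different message

# Log: messages stay in place; each consumer group tracks its own offset.
log = ["evt-1", "evt-2", "evt-3"]
offsets = {"analytics": 0, "billing": 0}

def poll(group: str, max_messages: int = 10) -> list[str]:
    """Return the next batch for a group and advance only that group's offset."""
    start = offsets[group]
    batch = log[start:start + max_messages]
    offsets[group] = start + len(batch)
    return batch

analytics_batch = poll("analytics")
billing_batch = poll("billing", max_messages=2)

offsets["analytics"] = 0          # replay: rewind the offset and read again
replayed = poll("analytics")
```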
Modern Context (2026)
How has gRPC changed internal microservice communication since DDIA was written?
?
- gRPC (2016) has become the dominant internal RPC framework
- Why: Protocol Buffers + HTTP/2 = compact, typed, multiplexed, streaming
- Code generation: auto-generated clients/servers in 10+ languages from .proto files
- Bidirectional streaming: client and server can both stream (not possible in REST)
- 2026 additions:
- ConnectRPC: gRPC-compatible protocol that works with standard HTTP proxies
- Buf CLI + Buf Schema Registry: Protobuf schema management, linting, breaking change detection
- grpc-gateway: Automatically exposes gRPC as REST JSON API
- Caveat: Debugging binary Protobuf is harder; need specialized tools (grpcurl, Postman)
What is AsyncAPI and how does it relate to OpenAPI?
?
- OpenAPI (Swagger): Standard specification for synchronous REST APIs (request-response)
- AsyncAPI: Equivalent standard for asynchronous/event-driven APIs (Kafka, WebSocket, AMQP, etc.)
- AsyncAPI 2.x/3.x defines:
- Channels (topics/queues), messages, schemas, bindings per protocol
- Same expressive power as OpenAPI but for async messaging
- CloudEvents: CNCF standard for event metadata envelope (source, type, time, id)
- Not a schema format but a standard wrapper for events
- Why matters: As event-driven architectures proliferate, async contracts need the same rigor as REST contracts
Interview Scenarios
You’re designing an event streaming system with Kafka. How do you handle schema evolution?
?
Strategy: Avro + Schema Registry with BACKWARD compatibility
Setup:
- Use Avro encoding for all Kafka messages
- Deploy Confluent Schema Registry (or AWS Glue)
- Configure topics with BACKWARD compatibility (new consumers can read old messages)
Evolution rules:
- Adding fields: always with default values ✅
- Removing fields: only remove fields that have a default in the reader’s schema, so consumers still expecting the field can fill it in ✅
- Renaming: use schema aliases for Avro (or add new field + deprecate old)
Deployment order (for backward compatibility):
- Deploy consumers first (with new schema that handles both old and new messages)
- Deploy producers second (start sending new message format)
Why Avro > JSON here: Avro is roughly 2-3x smaller, the schema is enforced, and the registry manages evolution
A microservice needs to be deployed without coordinating with all its callers. How do you version its API?
?
Pattern: Backward-compatible evolution + explicit versioning for breaking changes
Non-breaking changes (deploy anytime):
- Add new optional fields to responses
- Add new optional query parameters
- Add new endpoints
Breaking changes (require versioning):
- Remove fields from response
- Change field semantics/type
- Remove endpoints
Versioning approaches:
- URL versioning: /v1/users, /v2/users (most visible, easy to route)
- Header versioning: Accept: application/vnd.myapi.v2+json (cleaner URLs)
- Date-based versioning: a header such as Accept-Version: 2026-01-01 (similar to Stripe’s approach, which uses a Stripe-Version header)
Recommended: URL versioning for major versions + backward-compatible field additions without version bump
Quick Facts
What are the three levels of schema compatibility enforced by schema registries?
?
- BACKWARD: New schema can read data written with previous schema
- Safe for consumers: deploy consumers first, then producers
- FORWARD: Previous schema can read data written with new schema
- Safe for producers: deploy producers first, then consumers
- FULL: Both BACKWARD and FORWARD
- Safest: can deploy in any order
- Most restrictive: only add/remove optional fields with defaults
- NONE: No compatibility checking (dangerous for production)
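A simplified sketch of how such a check could work. It looks only at field presence and defaults (type changes are ignored), and `can_read`, `compatibility`, and the schema shape are assumptions rather than any registry's actual API:

```python
NO_DEFAULT = object()  # sentinel: the field has no default value

def can_read(reader: dict, writer: dict) -> bool:
    """True if every reader field is either written or has a default."""
    return all(name in writer or default is not NO_DEFAULT
               for name, default in reader.items())

def compatibility(old: dict, new: dict) -> str:
    """Classify a schema change; schemas are {field_name: default}."""
    backward = can_read(reader=new, writer=old)   # new schema reads old data
    forward = can_read(reader=old, writer=new)    # old schema reads new data
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "INCOMPATIBLE"

v1 = {"name": NO_DEFAULT}

added_with_default = compatibility(v1, {"name": NO_DEFAULT, "email": None})
added_required = compatibility(v1, {"name": NO_DEFAULT, "email": NO_DEFAULT})
removed_required = compatibility(v1, {})
```

A registry configured for a given level would reject a new version whose classification falls short of it, for example an `added_required` change on a topic set to BACKWARD.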
What is the size difference between a Protobuf encoding and the same data as JSON?
?
- Typical ratio: Protobuf is 2-3x smaller than JSON
- Why:
- No field names in binary (just numeric tags)
- Integers encoded as varints (1-3 bytes for small numbers vs 1-10 chars in JSON)
- No whitespace, quotes, brackets overhead
- Real example (DDIA’s sample record with 3 fields: userName “Martin”, favoriteNumber 1337, and an interests list):
- JSON: ~81 bytes
- Protobuf: ~33 bytes (~2.5x smaller)
- Impact at scale: For 1 billion events/day, 2.5x smaller means each event needs 1/2.5 = 40% of the bytes, i.e. ~60% less storage and bandwidth
Total Cards: 35
Estimated Review Time: 20-30 minutes
Recommended Frequency: Daily for first week, then spaced repetition
Last Updated: 2026-04-13