Chapter 5 Flashcards — Encoding and Evolution

flashcards ddia-2e chapter5 encoding schema-evolution durable-execution

Definitions and Mechanisms

What is encoding (serialization) and why is it necessary?
?
Encoding (also called serialization or marshalling) converts in-memory data structures (objects, structs, hashmaps) into a byte sequence suitable for storage or transmission over a network.
Why needed: In-memory structures rely on pointers and memory addresses that are meaningless outside the process. A byte sequence must be self-contained.
Decoding (deserialization / parsing / unmarshalling) is the reverse: bytes back into in-memory structures.
Key concern: The encoding format determines how gracefully the system handles schema mismatches between encoder and decoder.
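A minimal sketch of the encode/decode round trip, using JSON from the standard library. The `User` class and its fields are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class User:
    name: str
    age: int
    interests: List[str]


def encode(user: User) -> bytes:
    # Flatten the in-memory object into a self-contained byte sequence.
    return json.dumps(asdict(user)).encode("utf-8")


def decode(data: bytes) -> User:
    # Rebuild an in-memory object from the bytes.
    return User(**json.loads(data.decode("utf-8")))


original = User(name="Martin", age=42, interests=["hacking", "daydreaming"])
wire = encode(original)
restored = decode(wire)
```

The restored object is equal to the original but is a different object in memory: nothing about pointers or addresses crosses the byte boundary.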

What is backward compatibility? Give a concrete example.
?
Backward compatibility: newer code can read data written by older code.
Why needed: When you upgrade the application but old data still exists in the database.
Example: App v2 adds an email field. Old records in the DB were written by App v1 and have no email. App v2 must handle this gracefully — typically by treating absent email as null or empty string.
Rule: New fields must have sensible default values so that new code can process old records that lack them.
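The rule above can be sketched as follows, assuming records are stored as JSON. `UserV2` and `decode_user_v2` are hypothetical names for the newer application's model and decoder.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserV2:
    name: str
    email: Optional[str] = None  # new in v2; the default covers old records


def decode_user_v2(raw: str) -> UserV2:
    doc = json.loads(raw)
    # .get() returns None when an old (v1) record has no email key.
    return UserV2(name=doc["name"], email=doc.get("email"))


old_record = '{"name": "Ada"}'  # written by v1
new_record = '{"name": "Ada", "email": "ada@example.com"}'  # written by v2
```

Decoding `old_record` yields a user whose `email` is `None`, so v2 code handles both generations of data with one code path.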

What is forward compatibility? Give a concrete example.
?
Forward compatibility: older code can read data written by newer code.
Why needed: During a rolling upgrade, old and new versions run simultaneously and may both access the same database or consume from the same Kafka topic.
Example: App v2 adds an email field and writes it to the DB. App v1 reads the record — it must not crash when it encounters the unknown email field. It should simply ignore fields it doesn’t recognize.
Key insight: Both backward AND forward compatibility must hold simultaneously during rolling upgrades.
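A sketch of the forward-compatible side, again assuming JSON records. `UserV1` and `decode_user_v1` are hypothetical; the point is that the older decoder filters out keys it does not know instead of failing on them.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class UserV1:
    name: str


def decode_user_v1(raw: str) -> UserV1:
    doc = json.loads(raw)
    known = {f.name for f in fields(UserV1)}
    # Keep only the fields this version understands; ignore the rest.
    return UserV1(**{k: v for k, v in doc.items() if k in known})


new_record = '{"name": "Ada", "email": "ada@example.com"}'  # written by v2
user = decode_user_v1(new_record)
```

Passing the raw dictionary straight to `UserV1(**doc)` would raise a `TypeError` on the unexpected `email` key, which is exactly the crash forward compatibility is meant to prevent.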

Why should you never use Java Serializable, Python pickle, or Ruby Marshal for persistent or inter-service data?
?
Three critical problems:

Language lock-in: Bytes are only decodable in the same language (and often the same library version). Cross-language interop is impossible.
Security vulnerability: Deserializing untrusted bytes in Java/Python can execute arbitrary code — a well-documented attack vector.
Poor versioning: Changing a class’s fields (adding, renaming, retyping) can break deserialization of older blobs, and none of these mechanisms offers systematic schema evolution support.
Rule: Use language-agnostic formats (JSON, Protobuf, Avro, Thrift) for any data that crosses a process boundary or is stored persistently.
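The security problem can be demonstrated harmlessly with Python's `pickle`. In this sketch the `Innocent` class is hypothetical, and `operator.add` stands in for whatever callable an attacker would really choose.

```python
import operator
import pickle


class Innocent:
    # pickle calls __reduce__ to learn how to rebuild the object. The callable
    # it returns is invoked by pickle.loads on the receiving side.
    def __reduce__(self):
        return (operator.add, (40, 2))  # harmless stand-in for a dangerous call


payload = pickle.dumps(Innocent())
result = pickle.loads(payload)  # runs operator.add(40, 2) while decoding
```

The decoder never gets an `Innocent` object back. It gets whatever the embedded callable returned, which shows that the bytes, not the receiving program, decide what code runs.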

How does Protocol Buffers achieve backward and forward compatibility through schema evolution?
?
Mechanism: Each field in Protobuf is identified in the binary wire format by its field tag (an integer number), not by its field name.

Adding an optional field with a new tag number: backward compatible (new code uses default if field absent) AND forward compatible (old code ignores unknown tag numbers).
Removing an optional field: the tag simply stops appearing in new data; old code gets nothing for that field; new code ignores the old tag.
Rules you must never break:
Never change a field’s tag number — old data encoded with the old tag becomes misinterpreted.
Never reuse a deleted tag number for a new field — old data still has values for that tag.
Never add required fields — old data encoded without them fails validation in new code.
New fields must be optional (or have a default) for backward compatibility.
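The tag mechanism above can be sketched with a hand-rolled encoder that follows the Protobuf key layout (field tag shifted left by three bits, combined with a wire type). This is an illustration that handles only non-negative integers and strings, not a replacement for the real library; the tag-to-name tables are hypothetical.

```python
from __future__ import annotations


def encode_varint(n: int) -> bytes:
    # Non-negative integers only: 7 bits per byte, high bit = "more follows".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(tag: int, value) -> bytes:
    if isinstance(value, int):  # wire type 0: varint
        return encode_varint(tag << 3 | 0) + encode_varint(value)
    data = value.encode("utf-8")  # wire type 2: length-delimited
    return encode_varint(tag << 3 | 2) + encode_varint(len(data)) + data


def decode(buf: bytes, known_tags: dict[int, str]) -> dict:
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length].decode("utf-8"), pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known_tags:  # unknown tags were consumed above and are dropped
            record[known_tags[tag]] = value
    return record


V1_TAGS = {1: "user_name", 2: "favorite_number"}
V2_TAGS = {**V1_TAGS, 3: "email"}  # v2 adds a field under a fresh tag

msg_v1 = encode_field(1, "Martin") + encode_field(2, 1337)
msg_v2 = msg_v1 + encode_field(3, "martin@example.com")
```

Old code decoding `msg_v2` with `V1_TAGS` skips tag 3 because the wire type tells it how many bytes to jump over. New code decoding `msg_v1` with `V2_TAGS` simply finds no email and falls back to its default.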

How does Avro’s approach to schema evolution differ from Protocol Buffers?
?
Fundamental difference: Avro has no field tags in the binary wire format. Binary Avro is just values in sequence — you cannot decode it without the schema.
Avro’s mechanism: At decode time, the writer’s schema (the schema active when the data was written) is compared to the reader’s schema (what the current code expects). Fields are matched by name.

Field in writer but not in reader: value is ignored.
Field in reader but not in writer: reader’s declared default value is used.
Consequence: Renaming a field in Avro breaks compatibility unless the reader’s schema lists the old name as an alias (unlike Protobuf, where names do not appear in the binary encoding). New Avro fields must always have a default value.
How reader gets writer’s schema: Via a schema registry (Kafka) or embedded in the file header (Avro container format).
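A much-simplified sketch of the resolution rules above. It works on already-decoded values rather than real Avro binary, and `resolve`, `NO_DEFAULT`, and the field names are all hypothetical.

```python
NO_DEFAULT = object()  # marker for "reader declares no default"


def resolve(values, writer_fields, reader_fields):
    """Match writer values to reader fields by name.

    values:        the decoded values, in the writer's field order
    writer_fields: field names from the writer's schema, in order
    reader_fields: mapping of reader field name -> default (or NO_DEFAULT)
    """
    # The values are only positions; the writer's schema gives them names.
    written = dict(zip(writer_fields, values))
    record = {}
    for name, default in reader_fields.items():
        if name in written:
            record[name] = written[name]
        elif default is not NO_DEFAULT:
            record[name] = default  # field missing from writer: use default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record  # writer-only fields never enter the record


writer_v1 = ["userName", "favoriteNumber"]
reader_v2 = {"userName": NO_DEFAULT, "email": None}
record = resolve(["Martin", 1337], writer_v1, reader_v2)
```

Here `favoriteNumber` is dropped because the reader does not declare it, and `email` takes the reader's default. A reader that asked for `name` instead of `userName` with no default would raise, which is the rename problem in miniature.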

What are the three levels of schema compatibility enforced by schema registries, and what does NONE mean?
?

BACKWARD: New schema can read data written by the previous schema version.
- Deployment order: deploy consumers first (they handle both old and new messages), then producers.
FORWARD: Previous schema can read data written by the new schema version.
- Deployment order: deploy producers first, then consumers.
FULL: Both BACKWARD and FORWARD simultaneously.
- Most restrictive: only add/remove optional fields with defaults.
- Safest: deploy in any order.
NONE: No compatibility checking (dangerous in production).
Products: Confluent Schema Registry, AWS Glue Schema Registry, Buf Schema Registry.
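The levels above can be sketched with a toy checker. It models a schema as a mapping from field name to "has a default", which is far cruder than what a real registry checks (types, aliases, and so on are ignored); `can_read` and `is_compatible` are hypothetical helpers.

```python
def can_read(reader: dict, writer: dict) -> bool:
    # A reader copes with a writer's data if every field it expects is either
    # present in the writer's schema or has a default to fall back on.
    return all(name in writer or has_default
               for name, has_default in reader.items())


def is_compatible(mode: str, new: dict, old: dict) -> bool:
    backward = can_read(reader=new, writer=old)  # new code reads old data
    forward = can_read(reader=old, writer=new)   # old code reads new data
    return {
        "BACKWARD": backward,
        "FORWARD": forward,
        "FULL": backward and forward,
        "NONE": True,
    }[mode]


old = {"name": False}
add_optional = {"name": False, "email": True}   # new field with a default
add_required = {"name": False, "email": False}  # new field without a default
```

Adding a field with a default passes every mode. Adding one without a default passes FORWARD only, because new readers have nothing to fill in when they meet old data.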

Durable Execution

What problem does durable execution solve, and what is the key mechanism?
?
Problem: Multi-step workflows (charge card → send email → update inventory) must survive process crashes, restarts, and deployments. Without a solution, a crash between step 1 and step 2 leaves the system in a partial state requiring complex manual recovery.
Mechanism: A durable execution engine (Temporal, AWS Step Functions) automatically persists the full execution history (event log) to a durable store after each step. If the process crashes, a worker replays the event history to reconstruct in-memory state and resume execution from the last completed step.
Key properties:

Each activity (step) executes at-least-once with framework-managed retries.
Workflows can run for hours, days, or years.
Full visibility: execution history is queryable; complete audit trail.

How does Temporal work at an architectural level?
?
Components:

Temporal Server: Persists the complete event history for every workflow to a database (usually Cassandra or PostgreSQL). Manages task queues for workers.
Workflow Workers: Poll for workflow tasks, execute workflow code by replaying event history, report decisions back to the server.
Activity Workers: Poll for activity tasks, execute actual work (API calls, DB writes), report results.
Execution flow:

Client starts workflow → Server creates event log with WorkflowStarted.
Worker picks up → replays history → executes next step → reports ActivityScheduled.
Activity worker executes activity → reports ActivityCompleted → server appends to log.
On crash: new worker replays entire event log to reconstruct state → continues from next step.
Key constraint: Workflow code must be deterministic — given the same event history, it must make the same decisions every replay.
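The replay idea can be sketched in a few lines. This is a toy, not Temporal's API: `Replayer`, the list used as the event log, and the two activity functions are all hypothetical.

```python
class Replayer:
    def __init__(self, history):
        self.history = history  # stands in for the durable event log
        self.cursor = 0

    def execute_activity(self, fn, *args):
        if self.cursor < len(self.history):
            # Replay: the step already completed, so reuse its recorded result.
            result = self.history[self.cursor]["result"]
        else:
            # First execution: do the real work, then record the outcome.
            result = fn(*args)
            self.history.append({"activity": fn.__name__, "result": result})
        self.cursor += 1
        return result


calls = []  # tracks which activities really ran


def charge_card(order_id):
    calls.append("charge")
    return f"charge-{order_id}"


def send_email(order_id, charge_id):
    calls.append("email")
    return "sent"


def order_workflow(ctx, order_id):
    charge_id = ctx.execute_activity(charge_card, order_id)
    return ctx.execute_activity(send_email, order_id, charge_id)


history = []
first_worker = Replayer(history)
first_worker.execute_activity(charge_card, "o1")  # worker "crashes" after this

second_worker = Replayer(history)  # new worker, same durable history
outcome = order_workflow(second_worker, "o1")
```

The second worker runs the workflow function from the top, but `charge_card` is answered from the log, so the card is charged once and execution continues with `send_email`.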

Why must durable execution workflow code be deterministic, and what does this mean in practice?
?
Why deterministic: On crash, a new worker replays the event history to reconstruct in-memory state. If the code is non-deterministic, the replay may diverge from the original execution — computing different decisions from the same events — making the resumed workflow inconsistent.
Practical rules:

No time.time() or datetime.now() directly: use workflow.now(), which returns the time recorded in the event log.
No random.random() — use workflow.random() which is seeded from the event log.
No direct HTTP/DB calls — wrap in execute_activity() so results are persisted in the log.
No global mutable state — each replay starts fresh; any state must come from the event history.
No external library calls with side effects — only pure computation is safe in workflow code.
Consequence of violation: Non-deterministic bugs cause workflows to behave differently on replay, leading to stuck or corrupted workflows — extremely hard to debug.
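A sketch of why recorded values keep replay stable. It illustrates the idea behind log-backed helpers such as workflow.random(), not how any SDK implements them; `side_effect` and the dictionary used as a log are hypothetical.

```python
import random


def workflow_bad(history: dict) -> str:
    # Non-deterministic: every replay may pick a different branch.
    return "fast-path" if random.random() < 0.5 else "slow-path"


def side_effect(history: dict, key: str, produce):
    # Record the value on first execution; read it back on every replay.
    if key not in history:
        history[key] = produce()
    return history[key]


def workflow_good(history: dict) -> str:
    roll = side_effect(history, "roll", random.random)
    return "fast-path" if roll < 0.5 else "slow-path"


history = {}
first_run = workflow_good(history)
replays = [workflow_good(history) for _ in range(100)]
```

Every replay of `workflow_good` takes the same branch as the first run because the random value comes from the log. `workflow_bad` offers no such guarantee.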

When should you use durable execution vs message queues vs event streaming?
?

| Scenario | Best Choice |
| --- | --- |
| Multi-step workflow, minutes to hours, crash safety needed | Durable execution (Temporal) |
| Short async job (<30 seconds), work distribution | Message queue (SQS, RabbitMQ) |
| High-throughput event fan-out to many consumers | Event streaming (Kafka) |
| Human-in-the-loop approvals with long waits | Durable execution (long timers) |
| Periodic batch job (nightly ETL) | Cron + task queue |
| Real-time stream processing | Flink / Kafka Streams |

Key insight: Durable execution shines when a workflow has multiple external calls, must survive failures between them, and the developer wants to write ordinary code (not a saga state machine).

Event-Driven Architecture Patterns

What are the three event-driven architecture patterns, and how do they differ?
?
1. Event Notification: A small event signals that something happened. Consumer calls back to source for full data.

Event: {orderId: "123"} — minimal, no payload
Pro: low coupling, tiny events
Con: consumer still depends on source’s API being available

2. Event-Carried State Transfer: Event carries the full entity state at the time of the event.

Event: {orderId: "123", items: [...], total: 49.99, customer: {...}}
Pro: consumer is autonomous — no callback needed
Con: large events, data duplication, schema evolution is harder

3. Event Sourcing: All state changes are stored as an immutable log of events. Current state is derived by replaying the log.

Log: [order.created] [item.removed] [payment.received] [order.shipped]
State = fold(events) — recomputed by replaying history
Pro: full audit trail, time travel, multiple projections from same log
Con: eventual consistency of projections, snapshot complexity, upcasters for schema evolution
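The `State = fold(events)` idea in pattern 3 can be sketched with `functools.reduce`. The event payloads and the `apply` function are hypothetical; only the event names come from the log shown above.

```python
from functools import reduce


def apply(state, event):
    # Pure function: (previous state, event) -> next state.
    kind, data = event
    if kind == "order.created":
        return {"items": list(data["items"]), "paid": False, "shipped": False}
    if kind == "item.removed":
        remaining = [i for i in state["items"] if i != data["item"]]
        return {**state, "items": remaining}
    if kind == "payment.received":
        return {**state, "paid": True}
    if kind == "order.shipped":
        return {**state, "shipped": True}
    return state  # unknown event types leave the state unchanged


log = [
    ("order.created", {"items": ["book", "pen"]}),
    ("item.removed", {"item": "pen"}),
    ("payment.received", {}),
    ("order.shipped", {}),
]

current = reduce(apply, log, None)       # replay the whole log
as_of_step_2 = reduce(apply, log[:2], None)  # "time travel" to an earlier point
```

Replaying a prefix of the log gives the state as it was at that moment, which is where the audit and time-travel benefits come from.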

What is event sourcing, and what are CQRS and upcasters in that context?
?
Event sourcing: A persistence pattern where all changes to application state are stored as an immutable, append-only log of events. The current state of an entity is derived by replaying its event log from the beginning (or a snapshot).
CQRS (Command Query Responsibility Segregation): Separate the write model (events in the log) from the read model (denormalized projections). Writes go to the event log; reads query a projection built by consuming the event log.
Projection: A derived read model built by consuming the event log — e.g., a SQL table, a search index, or a graph. Multiple projections can be built from the same event log.
Upcaster: A function that transforms an old event format into the current format during replay. Required when the event schema has changed since the event was written — allows old events to remain in the log indefinitely while the domain model evolves.
Example: OrderCreatedV1 → upcaster → OrderCreatedV2 — the upcaster adds the email field with a default value for old events.
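A minimal sketch of the upcaster in the example above, assuming events are dictionaries carrying a `version` key. That key and the function names are hypothetical conventions, not part of any particular framework.

```python
def upcast_order_created(event: dict) -> dict:
    # Events written before versioning existed are treated as version 1.
    if event.get("version", 1) == 1:
        # V1 -> V2: add the email field with a default value.
        return {**event, "version": 2, "email": None}
    return event  # already current


def replay(log: list) -> list:
    # Stored events stay untouched; upcasting happens on the way out.
    return [upcast_order_created(e) for e in log]


stored = [
    {"type": "OrderCreated", "orderId": "123"},  # old format
    {"type": "OrderCreated", "orderId": "124", "version": 2,
     "email": "ada@example.com"},
]
upcasted = replay(stored)
```

Because the upcaster returns a new dictionary, the stored log remains immutable and downstream code only ever sees the V2 shape.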

REST, RPC, and Service Communication

What are the fundamental differences between REST and gRPC, and when should you use each?
?

| Dimension | REST | gRPC |
| --- | --- | --- |
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 only |
| Format | JSON (usually) | Protocol Buffers |
| Schema | Optional (OpenAPI) | Required (.proto) |
| Streaming | Limited (SSE) | Bidirectional |
| Debugging | Easy (curl, browser) | Needs grpcurl/tooling |
| Caching | Native HTTP | Manual |
| Code gen | Optional | Built-in |

Use REST when: public API, external clients (browsers/mobile), human debugging needed, HTTP caching valuable.
Use gRPC when: internal microservice-to-microservice communication, need bidirectional streaming, high-performance with typed contracts, code generation in multiple languages from a single `.proto` file.

Why is the “looks like a local call” abstraction of RPC considered leaky?
?
Remote calls are fundamentally different from local calls in ways that cannot be hidden:

Unpredictable latency: A local call takes nanoseconds; a remote call takes milliseconds to seconds.
Partial failure: The request may have reached the server but the response was lost. Did the operation execute? The caller doesn’t know.
Idempotency required: If you retry after a timeout, you may execute the operation twice. Local function calls don’t have this problem.
Serialization cost: Arguments and return values must be encoded/decoded — no analogue in local calls.
Network unreliability: Packet loss, routing changes, TCP connection failures have no local analogue.
Implication: RPC code must explicitly handle timeouts, retries, idempotency, and partial failures — pretending these concerns don’t exist leads to hard-to-debug production failures.
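The partial-failure and idempotency points can be sketched with an in-process stand-in for a remote server. `FlakyServer`, `deposit`, and `call_with_retry` are hypothetical; the "lost response" is simulated by raising `TimeoutError` after the work has been done.

```python
import uuid


class FlakyServer:
    def __init__(self):
        self.balance = 0
        self.seen = {}  # idempotency key -> recorded result
        self.drop_next_response = True

    def deposit(self, key: str, amount: int) -> int:
        if key not in self.seen:  # apply each logical request at most once
            self.balance += amount
            self.seen[key] = self.balance
        if self.drop_next_response:
            self.drop_next_response = False
            # The operation ran, but the caller never hears about it.
            raise TimeoutError("response lost")
        return self.seen[key]


def call_with_retry(server: FlakyServer, amount: int, attempts: int = 3) -> int:
    key = str(uuid.uuid4())  # one key per logical request, reused on retries
    for _ in range(attempts):
        try:
            return server.deposit(key, amount)
        except TimeoutError:
            continue  # cannot know whether it executed, so retry safely
    raise TimeoutError("gave up after retries")


server = FlakyServer()
new_balance = call_with_retry(server, 100)
```

Without the key, the retry after the lost response would deposit twice. With it, the second attempt returns the recorded result and the balance ends at 100.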

Encoding Format Details

What are the key weaknesses of JSON as a data format for distributed systems?
?

No integer/float distinction: 42 and 42.0 are both just “number” — integers > 2^53 lose precision in JavaScript (JSON was designed for JS).
No binary support: Binary data must be base64-encoded, adding ~33% overhead and making the JSON opaque.
No enforced schema: Any key name and value type is syntactically valid — contract violations are runtime errors, not compile-time.
Verbose: Field names are repeated in every record. For millions of events/day, this is significant storage and bandwidth waste.
Schema evolution is disciplined convention, not enforced: Adding/removing fields doesn’t break JSON parsing but may break application logic silently.
Despite weaknesses: JSON’s universality, human-readability, and universal tooling support make it the correct choice for external/public APIs.
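Two of the weaknesses above can be checked directly. Python's own JSON parser keeps large integers exact, so the sketch converts to a 64-bit float to mimic what a JavaScript consumer would hold.

```python
import base64
import json

big = 2**53 + 1

# Python parses JSON integers exactly...
parsed_in_python = json.loads(str(big))

# ...but a parser that stores every number as a double cannot represent it.
as_double = float(big)
lost_precision = int(as_double) != big

# Binary data must be base64-encoded: every 3 raw bytes become 4 characters.
raw = bytes(300)
b64 = base64.b64encode(raw)
overhead = len(b64) / len(raw)
```

The same document therefore means different things to different consumers, which is why large identifiers are often sent as JSON strings instead of numbers.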

How does a schema registry work with Kafka and Avro?
?
Problem: Avro binary is unreadable without the schema — and the schema changes over time.
Flow:

Producer loads schema, registers it with the registry → receives a 4-byte schema ID.
Producer encodes each Kafka message as: [magic byte 0][4-byte schema ID][Avro binary payload] (the Confluent wire format).
Consumer reads message, checks the magic byte, extracts the 4-byte schema ID.
Consumer fetches the writer’s schema from the registry using the ID.
Consumer decodes the Avro payload using writer’s schema and its own reader’s schema.
Avro resolution fills in the reader’s defaults for fields the writer did not write, and skips writer fields the reader does not declare.
Compatibility enforcement: Registry rejects registrations that violate the configured compatibility mode (BACKWARD/FORWARD/FULL), so incompatible schemas never reach producers.
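The message framing in the flow above can be sketched with `struct`. This follows the Confluent wire format, which places a single magic byte (0) ahead of the big-endian 4-byte schema ID; `frame` and `unframe` are hypothetical helper names, and the payload here is a placeholder rather than real Avro output.

```python
import struct

MAGIC_BYTE = 0
HEADER = ">bI"  # big-endian: 1-byte magic, 4-byte unsigned schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(HEADER, MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes):
    magic, schema_id = struct.unpack(HEADER, message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a schema-registry framed message")
    # The caller would now fetch the writer's schema by ID and decode the rest.
    return schema_id, message[5:]


message = frame(42, b"\x02\x06")
schema_id, payload = unframe(message)
```

Five bytes of header per message is the entire cost of making every record traceable to the exact schema that wrote it.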

Comparison and Trade-offs

Compare Protobuf vs Avro: when would you choose each?
?

| Dimension | Protobuf | Avro |
| --- | --- | --- |
| Field identity on the wire | Tag number | None (fields resolved by name via the schemas) |
| Schema registry needed | Optional (Buf) | Required for Kafka use |
| Rename field | Safe (tag unchanged) | Breaking unless an alias is declared |
| Add field | Safe (new tag, optional) | Safe (requires default) |
| Code generation | Built-in (protoc) | Optional |
| Primary ecosystem | gRPC, internal APIs | Kafka, Hadoop |
| Encoded size | ~41% of JSON | ~40% of JSON |

Choose Protobuf: internal services using gRPC, need code generation in multiple languages, want to rename fields safely, comfortable managing `.proto` files.
Choose Avro: Kafka pipelines where schema registry is already deployed, Hadoop/Spark data lake workflows, want maximum compression, schema evolution governed by registry not by code.

What is the unknown-field preservation problem in database dataflow?
?
Scenario: App v2 adds a new email field to the User schema.

App v2 writes a User record to the database with email populated.
App v1 (older version, still running during rolling upgrade) reads the record, processes it, and writes it back.
App v1’s ORM has no email field in its model.
Problem: If App v1’s ORM silently drops unknown fields on write-back, the email value is permanently lost from the database — silent data loss.
Correct behavior: Applications must preserve and re-emit all fields they read, even fields they don’t understand, maintaining them opaquely in the written record.
In practice: Most ORMs require explicit configuration to preserve unknown fields. This is a common source of subtle data corruption bugs during rolling upgrades.
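A sketch of the correct behavior, assuming records are stored as JSON documents. `UserV1Model` is a hypothetical model class for the older application: it knows only `name` but carries everything else along untouched.

```python
import json


class UserV1Model:
    def __init__(self, name: str, extras: dict):
        self.name = name
        self._extras = extras  # fields this version does not understand

    @classmethod
    def load(cls, raw: str) -> "UserV1Model":
        doc = json.loads(raw)
        name = doc.pop("name")
        return cls(name, doc)  # whatever is left is kept opaquely

    def dump(self) -> str:
        # Re-emit the unknown fields alongside the ones we own.
        return json.dumps({**self._extras, "name": self.name})


stored = '{"name": "ada", "email": "ada@example.com"}'  # written by v2
user = UserV1Model.load(stored)
user.name = "Ada Lovelace"  # v1 edits a field it knows about
written_back = json.loads(user.dump())
```

The `email` value survives the read-modify-write cycle even though v1 never looks at it. A model that rebuilt the document from its known attributes alone would have dropped it.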

Modern Context (2026)

What is ConnectRPC, and how does it improve on vanilla gRPC?
?
Problem with vanilla gRPC: Requires HTTP/2, which many HTTP proxies, load balancers, and CDNs don’t fully support. Also requires gRPC-specific tooling to debug (cannot use curl for binary Protobuf).
ConnectRPC: A protocol that is fully compatible with gRPC and gRPC-Web, but also supports plain HTTP/1.1 and JSON. Key features:

A ConnectRPC service can be called via gRPC, gRPC-Web, OR standard HTTP/1.1 + JSON — same server code, any client.
JSON encoding makes it debuggable with curl and browser developer tools.
Works with standard reverse proxies and CDNs without gRPC-specific configuration.
Adoption: Growing quickly in organizations that want gRPC’s type safety and code generation without the HTTP/2-only constraint.

What is Temporal, and why has it become significant in distributed systems (2026)?
?
Temporal is an open-source durable execution engine that provides workflow orchestration with automatic state persistence and crash recovery.
Why it became significant:

Released in 2019 (spun out of Uber’s Cadence project), it reached mainstream adoption by 2022-2024.
Solves the “saga complexity” problem: instead of implementing saga patterns with compensating transactions and complex state machines, developers write ordinary sequential code.
Adopted by Stripe, Netflix, Coinbase, HashiCorp, DoorDash for critical business workflows.
SDKs are available for Go, Java, Python, TypeScript, and PHP, among others.
Temporal Cloud provides a fully managed version, removing operational burden.
Position in the ecosystem: Temporal fills the gap between message queues (short async tasks) and full workflow engines (BPM systems) — it’s workflow orchestration for engineers, not business analysts.

What is AsyncAPI and how does it relate to OpenAPI?
?

OpenAPI (Swagger): Standard specification for synchronous REST APIs (request-response). Describes HTTP endpoints, request/response schemas, status codes.
AsyncAPI: The equivalent standard for asynchronous and event-driven APIs — Kafka topics, WebSocket channels, AMQP queues, MQTT topics, etc.
AsyncAPI 2.x/3.x defines:
- Channels (topics/queues), message schemas, protocol bindings, server details.
- Aims to give async messaging the same descriptive rigor that OpenAPI gives REST.
CloudEvents: CNCF standard for a common event envelope format (specversion, type, source, id, time, data). Not a schema format — a standard metadata wrapper to enable interoperability across different event sources and brokers.
Why it matters: As event-driven architectures proliferate, async API contracts need the same rigor (documentation, code generation, schema validation) as REST contracts. AsyncAPI fills that gap.

Numbers and Precision

What are the approximate relative sizes of common encoding formats for the same data?
?
Reference: a typical record with 3 fields (string, integer, repeated string):

JSON: ~81 bytes (baseline, 100%)
MessagePack: ~66 bytes (~81%)
Thrift BinaryProtocol: ~59 bytes (~73%)
Thrift CompactProtocol: ~34 bytes (~42%)
Protocol Buffers: ~33 bytes (~41%)
Avro: ~32 bytes (~40%)
Rule of thumb: The schema-based compact formats (Thrift CompactProtocol, Protocol Buffers, Avro) are roughly 2-2.5x smaller than JSON; MessagePack and Thrift BinaryProtocol save much less.
At scale: For a system processing 1 billion events/day at 100 bytes/event average (about 100 GB/day, taking 100 bytes as the JSON size), switching to Protobuf at ~41% of the JSON size saves roughly 60 GB/day in storage and bandwidth.

What are the schema compatibility modes and which deployment order is safe for each?
?

| Mode | Meaning | Safe Deployment Order |
| --- | --- | --- |
| BACKWARD | New schema reads data from previous schema | Deploy consumers first, then producers |
| FORWARD | Previous schema reads data from new schema | Deploy producers first, then consumers |
| FULL | Both BACKWARD and FORWARD | Any order (safest) |
| BACKWARD_TRANSITIVE | New schema reads ALL previous versions | Deploy consumers first (strongest) |
| FORWARD_TRANSITIVE | ALL previous schemas read new data | Deploy producers first (strongest) |
| FULL_TRANSITIVE | Backward and forward compatible with ALL previous versions | Any order (most restrictive) |

Most common choice: BACKWARD — it ensures new consumers can always read old messages (critical when messages are retained in Kafka for days/weeks).

Application and Failure Scenarios

A payment workflow crashes after charging the card but before sending the confirmation email. How do Temporal and a naive queue-based approach differ in handling this?
?
Naive queue-based approach:

Charge card: one queue job.
Send email: another queue job, triggered by the first completing.
On crash between jobs: the first job marked “complete” but the trigger for the second job was never sent.
Result: card charged, no email. Requires manual intervention or complex idempotency checks.

Temporal approach:

Single workflow function: charge_card() → send_email().
After charge_card() completes, Temporal persists ActivityCompleted to the event log.
On crash: new worker replays event log → sees charge_card already completed → skips to send_email() → executes it.
Result: workflow resumes exactly where it left off, no manual intervention.
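Caveat: activities run at-least-once. Once ActivityCompleted is recorded, replay reuses the recorded result and never re-runs charge_card. But if the worker crashes after the charge succeeds and before the completion is recorded, the retry policy runs charge_card again, so the activity itself should be idempotent (for example, by sending an idempotency key to the payment provider).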

You’re asked to design schema evolution strategy for a Kafka event streaming system. Walk through your approach.
?
Step 1: Choose encoding format

Use Avro or Protobuf (not JSON): schema enforcement and roughly 2-2.5x smaller messages.

Step 2: Deploy schema registry

Confluent Schema Registry or AWS Glue Schema Registry.
Configure BACKWARD compatibility mode: new consumers handle both old and new messages.

Step 3: Establish evolution rules

New fields: always optional with default values.
Removed fields: mark deprecated first; make sure every reader’s schema declares a default for the field before the writer stops emitting it.
Field renames (Avro): use schema aliases; never rename directly.

Step 4: Deployment order (BACKWARD mode)

Deploy consumers with new schema (handles both old and new messages).
Validate consumers running correctly.
Deploy producers with new schema (start emitting new message format).

Step 5: CI enforcement

Add a schema compatibility check to the CI pipeline (for example `buf breaking` for Protobuf, or the schema registry’s compatibility-check API for Avro).
Reject PRs that introduce incompatible schema changes.

Total Cards: 26
Review Time: ~20-25 minutes
Priority: HIGH
Last Updated: 2026-05-29

Study Notes by Niladri & AI
