Chapter 12 Flashcards — Transactional Sagas
flashcards saht sagas distributed-transactions eventual-consistency orchestration choreography
What is a saga in distributed systems?
?
A saga is a sequence of local transactions — one per participating service — where each step either completes successfully or triggers compensating transactions to undo the effects of prior completed steps. Sagas replace two-phase commit (2PC) with a loosely coupled alternative that respects service autonomy.
What are the three binary axes of saga classification in SAHT?
?
- Communication style: Synchronous (s) vs. Asynchronous (a)
- Consistency model: Atomic (a) vs. Eventual (e)
- Coordination style: Orchestrated (o) vs. Choreographed (c)
These three binary dimensions produce 2³ = 8 distinct saga patterns.
How is the three-letter saga code constructed?
?
[communication][consistency][coordination]
- First letter: s (synchronous) or a (asynchronous)
- Second letter: a (atomic) or e (eventual)
- Third letter: o (orchestrated) or c (choreographed)
Example: sao = Synchronous + Atomic + Orchestrated = Epic Saga
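A minimal Python sketch of this encoding. The names `AXES`, `decode_saga_code`, and `ALL_CODES` are illustrative assumptions, not part of the source material.

```python
from itertools import product

# Letter meanings per position, taken from the three axes above.
AXES = (
    ("communication", {"s": "Synchronous", "a": "Asynchronous"}),
    ("consistency", {"a": "Atomic", "e": "Eventual"}),
    ("coordination", {"o": "Orchestrated", "c": "Choreographed"}),
)


def decode_saga_code(code: str) -> dict[str, str]:
    """Expand a three-letter saga code into its three attributes."""
    if len(code) != len(AXES):
        raise ValueError(f"expected 3 letters, got {code!r}")
    decoded = {}
    for letter, (axis, meanings) in zip(code, AXES):
        if letter not in meanings:
            raise ValueError(f"{letter!r} is not valid for {axis}")
        decoded[axis] = meanings[letter]
    return decoded


# Enumerate every combination of the three binary axes: 2 * 2 * 2 = 8 codes.
ALL_CODES = ["".join(letters) for letters in product(*(m for _, m in AXES))]
```

Because position determines meaning, the letter "a" can safely stand for Asynchronous in the first slot and Atomic in the second.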
What is synchronous communication in a saga context, and what are its key trade-offs?
?
Synchronous (s): The caller blocks and waits for a response before proceeding to the next step. Uses HTTP/REST or gRPC request-response.
- Advantages: Simpler error handling — the caller knows immediately whether a step succeeded or failed
- Disadvantages: Higher coupling; lower throughput; threads/connections held open during the call; the entire saga is as slow as its slowest step
What is asynchronous communication in a saga context, and what are its key trade-offs?
?
Asynchronous (a): The caller publishes a message and continues; results arrive later via events or callbacks. Uses message queues (Kafka, RabbitMQ, SQS) or event buses.
- Advantages: Temporal decoupling; higher throughput and scalability; callee can be temporarily unavailable
- Disadvantages: More complex error handling; failures discovered out-of-band; out-of-order events possible; idempotency and timeout handling required
What is atomic consistency in a saga, and what does it require?
?
Atomic (a): All saga steps are treated as a single logical unit — either all succeed or compensating transactions are triggered to undo all completed steps. Aims for ACID-like behavior across services.
Requires:
- A saga state machine tracking the status of every step
- Compensating transaction logic for every completed step that may need reversal
- Idempotency on both forward and compensating steps
- Timeout handling to detect unresponsive participants
What is eventual consistency in a saga, and when is it appropriate?
?
Eventual (e): The saga accepts that intermediate inconsistent states will temporarily exist. The system converges to a consistent state over time through retries and reconciliation — there is no all-or-nothing guarantee.
Appropriate when:
- Partial completion carries acceptable business risk (e.g., notification delays, analytics lag)
- Higher availability is more important than strict consistency
- Downstream systems are designed as “tolerant readers”
What is orchestrated coordination in a saga?
?
Orchestrated (o): A central orchestrator service knows the full workflow. It sends commands to participants, receives responses, tracks saga state, and decides what to do next — including triggering compensation on failure.
- Advantages: Clear visibility into saga state; easy to add/modify/compensate steps; workflow logic in one place
- Disadvantages: Orchestrator is a coupling point and potential bottleneck; single point of failure risk
What is choreographed coordination in a saga?
?
Choreographed (c): No central coordinator. Services react to events published by other services. Each service knows only its own role and what events to emit when done.
- Advantages: Low coupling between services; no central failure point; scales well; maximum temporal decoupling when paired with asynchronous messaging
- Disadvantages: Workflow logic is distributed and implicit; end-to-end state is nearly invisible; distributed tracing infrastructure is required; very hard to debug
Name all 8 saga patterns with their 3-letter codes and the three attributes each represents.
?
| Pattern | Code | Communication | Consistency | Coordination |
|---|---|---|---|---|
| Epic Saga | sao | Synchronous | Atomic | Orchestrated |
| Phone Tag Saga | sac | Synchronous | Atomic | Choreographed |
| Fairy Tale Saga | seo | Synchronous | Eventual | Orchestrated |
| Time Travel Saga | sec | Synchronous | Eventual | Choreographed |
| Fantasy Fiction Saga | aao | Asynchronous | Atomic | Orchestrated |
| Horror Story | aac | Asynchronous | Atomic | Choreographed |
| Parallel Saga | aeo | Asynchronous | Eventual | Orchestrated |
| Anthology Saga | aec | Asynchronous | Eventual | Choreographed |
Epic Saga (sao): How does it work, what are its key trade-offs, and when should it be used?
?
How it works: A central orchestrator calls each participant synchronously, one at a time (or in parallel groups), and waits for each response. On any failure, the orchestrator issues synchronous compensating calls to all prior completed steps.
Key trade-offs: Strongest consistency; simplest state management; but very high coupling, very low throughput, very low fault tolerance (one unavailable service blocks everything), and high latency (sum of all step latencies).
When to use: Financial transactions requiring all-or-nothing semantics; low-volume, high-criticality workflows; when the team needs maximum observability. Avoid in high-throughput systems or when services have variable latency profiles.
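A minimal sketch of the orchestrator loop described above, in Python. The `Step` type and `run_epic_saga` are hypothetical names. Undoing completed steps in reverse order is an assumption of this sketch, and plain function calls stand in for real HTTP or gRPC requests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # forward call; raises on failure
    compensate: Callable[[], None]   # business-level undo of the action


def run_epic_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()            # blocks until the participant replies
        except Exception:
            # Assumption: compensate newest-first so later effects are
            # undone before the steps they depended on.
            for done in reversed(completed):
                done.compensate()    # synchronous compensating call
            return False
        completed.append(step)
    return True
```

The failing step itself is not compensated, since it never completed. A production version would also need to handle a compensation call that fails.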
Phone Tag Saga (sac): How does it work and why is it problematic?
?
How it works: No orchestrator. Service A calls Service B synchronously; B calls C; C calls D. Each waits for the next’s response before replying to its own caller. On failure, compensation propagates back up the chain.
Why problematic:
- Tight coupling propagates through the entire chain — every service must know the next service’s interface
- State is distributed across the chain with no single place to observe it
- Each service must implement both forward logic AND compensation logic for downstream failures
- Debugging is extremely difficult
- Almost always better to add an orchestrator and use sao instead
Fairy Tale Saga (seo): How does it differ from Epic Saga, and when does it make sense?
?
How it differs from Epic Saga (sao): Same synchronous communication and orchestrated coordination, but no compensation on failure. The orchestrator accepts eventual consistency: failed steps are retried, flagged for later reconciliation, or simply tolerated.
When it makes sense:
- Order workflows where some steps (e.g., sending a confirmation email) can be safely retried later
- When the business accepts some lag in consistency but needs the visibility of orchestration
- Reduces implementation complexity significantly (no rollback machinery)
Trade-offs: Medium coupling (orchestrator), easy state management (centralized), medium-low scalability (synchronous), eventual consistency accepted.
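A rough Python sketch of an orchestrator that retries and then flags failures instead of rolling back. The name `run_fairy_tale_saga` and the `max_attempts` parameter are assumptions made for illustration. The reconciliation job that would consume the returned list is not shown.

```python
from typing import Callable


def run_fairy_tale_saga(
    steps: dict[str, Callable[[], None]],
    max_attempts: int = 3,
) -> list[str]:
    """Call each step synchronously; never roll back.

    Returns the names of steps that still failed after all retries so a
    separate reconciliation process can pick them up later.
    """
    needs_reconciliation: list[str] = []
    for name, action in steps.items():
        for _attempt in range(max_attempts):
            try:
                action()
                break                          # step succeeded
            except Exception:
                pass                           # fall through to the next attempt
        else:
            # Loop ended without a break: every attempt failed.
            needs_reconciliation.append(name)  # tolerate and move on
    return needs_reconciliation
```

Note that earlier successful steps are left in place when a later one fails, which is exactly the intermediate inconsistency this pattern accepts.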
Time Travel Saga (sec): What is it and when might it be justified?
?
What it is: Services call each other in a synchronous chain (like Phone Tag), but with no atomicity requirement — failures are handled by retries or reconciliation rather than compensating rollbacks.
When justified:
- Integrating legacy systems that don’t support async messaging (forces synchronous)
- Short 2-step chains where an orchestrator would be disproportionate overhead
- When eventual consistency is acceptable and synchronous legacy constraints exist
Key weakness: Inherits the chain-coupling problem of Phone Tag without compensating (pun intended) with strong consistency. The worst of both worlds in many scenarios.
Fantasy Fiction Saga (aao): What is the core challenge of combining async with atomicity?
?
Core challenge: The orchestrator publishes async commands to participants but must still guarantee all-or-nothing atomicity. Because responses arrive asynchronously and potentially out of order:
- The orchestrator must maintain a saga state machine that handles events arriving out of order
- It must detect timeouts when participants don’t respond (and decide: retry, fail, or escalate)
- Idempotency is critical — messages may be redelivered
- Compensating commands must also be sent asynchronously, requiring the same robustness
Why use it: When you need both high throughput (async) and strong consistency (atomic). High-throughput financial systems with variable-latency services. Requires mature saga state machine infrastructure.
Horror Story (aac): Why do the authors give it this name, and when (if ever) should it be used?
?
Why the name: aac (async + atomic + choreographed) is the worst combination: it requires each service to independently implement compensation logic based only on events it observes, with no orchestrator tracking overall state, over async messaging where events can arrive out of order or not at all.
Problems:
- Compensating transactions must be in every participant service
- No single source of truth for saga state
- Idempotency required for both steps and compensations
- Impossible to observe end-to-end workflow state
- Debugging production failures is extremely difficult
- Adding any workflow step requires changing multiple services
When to use: The authors strongly advise avoiding it. If you arrive at this pattern, seriously consider adding an orchestrator (switch to aao instead).
Parallel Saga (aeo): How does it achieve high scalability, and what is the scatter-gather pattern?
?
How it achieves scalability: The orchestrator publishes async commands to multiple participants simultaneously (fan-out). Services process their commands independently in parallel and publish result events. Since consistency is eventual, there’s no compensation and no need to wait for all steps before proceeding.
Scatter-gather: The orchestrator “scatters” commands to N services and then “gathers” their result events, aggregating them to determine the overall workflow outcome.
Best for: Workflows with independent parallel steps (notification fanouts, multi-channel processing, recommendation engines, analytics pipelines, batch processing). Delivers very high throughput and fault tolerance with manageable complexity.
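Scatter-gather can be sketched with `asyncio`. In this sketch, in-process tasks stand in for commands sent through a message broker, which is a simplifying assumption. The `notify` participant and the rule that the "fax" channel fails are hypothetical.

```python
import asyncio


async def notify(channel: str) -> tuple[str, bool]:
    """Stand-in participant: pretends to process a command."""
    await asyncio.sleep(0)               # yield control, as real I/O would
    return channel, channel != "fax"     # hypothetical: "fax" always fails


async def scatter_gather(channels: list[str]) -> dict[str, bool]:
    # Scatter: start one task per participant without waiting in between.
    tasks = [asyncio.create_task(notify(c)) for c in channels]
    # Gather: collect every result, then aggregate into one outcome map.
    results = await asyncio.gather(*tasks)
    return dict(results)


outcome = asyncio.run(scatter_gather(["email", "sms", "fax"]))
```

A failed channel appears in the outcome map as `False` rather than triggering a rollback, so the caller can schedule a retry or reconciliation for that channel alone.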
Anthology Saga (aec): How does it work and what observability infrastructure does it require?
?
How it works: Fully event-driven with no orchestrator. Services subscribe to events on a shared event bus. Each service processes incoming events and publishes result events. No service knows about any other service’s existence — the workflow is emergent from the event chain.
Observability infrastructure required (non-optional):
- Distributed tracing (Jaeger, Zipkin, OpenTelemetry) with trace IDs propagated through all events
- Correlation IDs on every event so end-to-end flows can be reconstructed
- Centralized log aggregation (ELK, Splunk) so scattered service logs can be queried together
Without this infrastructure, debugging production issues in an Anthology Saga is nearly impossible.
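Correlation ID propagation can be illustrated with a small Python sketch. The `Event` shape and the helper names are assumptions. Real systems usually carry these IDs in message headers and rely on a tracing library rather than hand-written code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    name: str
    correlation_id: str                   # same value for the whole workflow
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def start_workflow(name: str) -> Event:
    """The first event in a flow mints a fresh correlation ID."""
    return Event(name=name, correlation_id=str(uuid.uuid4()))


def follow_up(cause: Event, name: str) -> Event:
    """Events emitted in reaction to `cause` inherit its correlation ID."""
    return Event(name=name, correlation_id=cause.correlation_id)


def reconstruct(log: list[Event], correlation_id: str) -> list[str]:
    """Filter an aggregated log down to one end-to-end flow."""
    return [e.name for e in log if e.correlation_id == correlation_id]
```

Each event still gets its own `event_id`, so the shared correlation ID groups a flow without making individual events indistinguishable.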
What is a compensating transaction? How does it differ from a database rollback?
?
A compensating transaction is a business-level operation that semantically reverses the effect of a completed saga step. It is NOT a database rollback.
Key difference: The original transaction has already committed to the database. A compensating transaction creates a new transaction that undoes the business effect:
- Payment charged → issue a refund (new credit transaction)
- Inventory reserved → release the reservation (new update)
- Order created → cancel the order (mark as CANCELLED — do not delete, audit trail matters)
Compensating transactions must be designed by domain experts who understand what “undoing” a step means in business terms — not just technically.
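The refund example can be sketched as an append-only ledger in Python. The `LedgerEntry` type and function names are hypothetical. An in-memory list stands in for a real payments table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    payment_id: str
    kind: str          # "CHARGE" or "REFUND"
    amount_cents: int


ledger: list[LedgerEntry] = []


def charge(payment_id: str, amount_cents: int) -> None:
    ledger.append(LedgerEntry(payment_id, "CHARGE", amount_cents))


def refund(payment_id: str) -> None:
    """Compensate a charge by appending a new entry; nothing is deleted."""
    charged = sum(e.amount_cents for e in ledger
                  if e.payment_id == payment_id and e.kind == "CHARGE")
    ledger.append(LedgerEntry(payment_id, "REFUND", charged))


def balance(payment_id: str) -> int:
    sign = {"CHARGE": 1, "REFUND": -1}
    return sum(sign[e.kind] * e.amount_cents
               for e in ledger if e.payment_id == payment_id)
```

After a charge and a refund the balance is zero, yet both entries remain in the ledger. That retained history is what separates compensation from a rollback.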
What makes an operation non-compensatable in a saga, and how should saga designers handle it?
?
A non-compensatable operation has no meaningful business-level reversal:
- Sending an email (can’t unsend)
- Printing a shipping label (can’t unprint)
- Triggering an external webhook to a third party
How to handle:
- Pivot transaction pattern: Move non-compensatable steps to the END of the saga, after all compensatable steps. Only execute them once all compensatable steps have committed.
- Two-phase approach: “Reserve” the action first (e.g., draft the email), then “confirm” it (send) only after the compensatable steps succeed.
- Accept and document: Some operations cannot be compensated — design downstream reconciliation processes to handle this known risk.
What is the pivot transaction in a saga?
?
The pivot transaction is the last compensatable step in a saga — the boundary after which operations can no longer be undone.
[Step 1] [Step 2] [PIVOT — Step 3] [Step 4] [Step 5]
<-- compensatable --> <-- non-compensatable -->
Design principle: Saga designers should explicitly identify the pivot and ensure:
- All compensatable steps occur before the pivot
- All non-compensatable steps (email sends, external webhooks, etc.) occur after the pivot
- The pivot commits only once the saga has confirmed that all downstream non-compensatable steps can proceed
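The ordering rule can be checked mechanically. Below is a small Python sketch. The function name `find_pivot` and the `(name, is_compensatable)` pair format are assumptions made for illustration.

```python
def find_pivot(steps: list[tuple[str, bool]]) -> int:
    """Return the index of the pivot: the last compensatable step.

    `steps` is an ordered list of (name, is_compensatable) pairs.
    Returns -1 if no step is compensatable. Raises ValueError if a
    compensatable step follows a non-compensatable one, which would
    violate the ordering rule.
    """
    pivot = -1
    seen_non_compensatable = False
    for index, (name, compensatable) in enumerate(steps):
        if compensatable:
            if seen_non_compensatable:
                raise ValueError(
                    f"{name!r} is compensatable but runs after the pivot"
                )
            pivot = index
        else:
            seen_non_compensatable = True
    return pivot
```

A check like this could run in a unit test over the saga definition, so a reordering that places a non-compensatable step too early fails before deployment.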
Why is idempotency mandatory in saga implementations?
?
In distributed systems, failures cause retries. Retries cause the same message or command to be delivered multiple times. Without idempotency, this causes:
- Double-charges (payment step executed twice)
- Double-inventory deductions
- Double-compensation (refunding an already-refunded payment)
Techniques for idempotency:
- Idempotency keys: Each saga step message includes a unique ID; the receiver records processed IDs and skips re-processing duplicates
- Conditional updates: “Update only if current state = X” (optimistic locking / conditional writes)
- Event deduplication: The message broker or service layer tracks already-processed message IDs (e.g., a database deduplication table). Kafka consumer offsets track read position only and do not by themselves prevent redelivery after a failure.
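A minimal Python sketch of the idempotency-key technique. The `PaymentHandler` class and its in-memory dictionary are illustrative assumptions. In a real service, the processed-key record and the business write would need to be stored in the same database transaction.

```python
class PaymentHandler:
    """Processes each idempotency key at most once."""

    def __init__(self) -> None:
        self.processed: dict[str, str] = {}   # key -> stored result
        self.total_charged_cents = 0

    def handle_charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self.processed:
            # Duplicate delivery: return the earlier result, charge nothing.
            return self.processed[idempotency_key]
        self.total_charged_cents += amount_cents
        result = f"charged:{amount_cents}"
        self.processed[idempotency_key] = result
        return result
```

Returning the stored result, rather than an error, lets a retrying sender treat the duplicate exactly like the first successful delivery.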
What does a saga state machine track, and why is it needed?
?
A saga state machine is a persistent record of a saga’s progress, necessary because there is no global transaction manager across services.
Typically tracks:
- Saga ID (used as correlation ID on all messages)
- Current step / phase
- Status of each completed step: SUCCESS, FAILED, COMPENSATED
- Input data for each step (needed to construct compensating operations)
- Timestamps (for timeout detection)
- Overall saga status: IN_PROGRESS, COMPLETED, COMPENSATING, FAILED
Why needed: Without centralized state tracking, the orchestrator cannot know which steps need compensation if a failure occurs mid-saga, and cannot detect timeouts on async participants.
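The tracked fields could be modeled roughly as follows in Python. The class and method names are assumptions. Timestamps and durable storage are omitted to keep the sketch short.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    COMPENSATED = "COMPENSATED"


class SagaStatus(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    COMPENSATING = "COMPENSATING"
    FAILED = "FAILED"


@dataclass
class SagaState:
    saga_id: str                                   # doubles as correlation ID
    status: SagaStatus = SagaStatus.IN_PROGRESS
    steps: dict[str, StepStatus] = field(default_factory=dict)
    inputs: dict[str, dict] = field(default_factory=dict)

    def record(self, step: str, status: StepStatus, step_input: dict) -> None:
        self.steps[step] = status
        self.inputs[step] = step_input             # kept for compensation
        if status is StepStatus.FAILED:
            self.status = SagaStatus.COMPENSATING

    def steps_to_compensate(self) -> list[str]:
        """Steps that succeeded and still need undoing, newest first."""
        done = [s for s, st in self.steps.items() if st is StepStatus.SUCCESS]
        return list(reversed(done))
```

Marking a step `COMPENSATED` removes it from `steps_to_compensate()`, so a crashed orchestrator can resume compensation from the persisted record without undoing anything twice.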
How should timeouts be handled in asynchronous sagas?
?
In async sagas, a participant may never respond (network partition, crash, broker failure). The orchestrator must detect this via a timeout and decide how to proceed:
- Retry the step (idempotency makes this safe)
- Treat as failure and begin compensation of all prior completed steps
- Escalate to a human operator or dead-letter queue for manual resolution
Key insight: Timeout values are business decisions, not purely technical ones. “How long do we wait for payment confirmation before canceling the order?” is a product/business requirement with real customer-facing consequences — it is not merely an infrastructure parameter.
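One possible decision policy, sketched in Python. The ordering used here (retry first, then compensate, then escalate) and every parameter name are assumptions of this sketch. The actual policy and timeout values are business decisions.

```python
from enum import Enum


class TimeoutAction(Enum):
    WAIT = "wait"
    RETRY = "retry"
    COMPENSATE = "compensate"
    ESCALATE = "escalate"


def on_timeout_check(
    sent_at: float,
    now: float,
    timeout_seconds: float,
    attempts: int,
    max_attempts: int,
    can_compensate: bool,
) -> TimeoutAction:
    """Decide what the orchestrator does about a silent participant."""
    if now - sent_at < timeout_seconds:
        return TimeoutAction.WAIT           # still inside the deadline
    if attempts < max_attempts:
        return TimeoutAction.RETRY          # safe only if the step is idempotent
    if can_compensate:
        return TimeoutAction.COMPENSATE     # undo prior completed steps
    return TimeoutAction.ESCALATE           # hand over to an operator or DLQ
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the policy deterministic and easy to test.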
What is the first question in the saga pattern decision framework?
?
“What is the cost of partial completion?”
This determines the consistency axis (Atomic vs. Eventual):
- If partial completion is unacceptable (money movement, legal records, high-demand inventory): choose atomic patterns → sao, sac, aao, aac
- If partial completion is acceptable (notifications, analytics, secondary data sync): choose eventual patterns → seo, sec, aeo, aec
This is the most critical decision because atomic consistency carries the highest implementation cost and should not be chosen unless truly required.
What is the second question in the saga pattern decision framework?
?
“Do we need an immediate response, or can processing happen in the background?”
This determines the communication axis (Synchronous vs. Asynchronous):
- If a user is waiting for the result (e.g., blocking at a checkout page): choose synchronous → sao, sac, seo, sec
- If background processing is acceptable (async job submission, event-driven): choose asynchronous → aao, aac, aeo, aec
What is the third question in the saga pattern decision framework?
?
“Do we need visibility into the overall workflow? Does our team structure support a central coordinator?”
This determines the coordination axis (Orchestrated vs. Choreographed):
- If you need visibility into workflow state, have complex branching logic, or want simpler reasoning: choose orchestrated → sao, seo, aao, aeo
- If you need maximum decoupling between independently deployed service teams: choose choreographed → sac, sec, aac, aec
Note: Choreographed patterns require mature observability infrastructure (distributed tracing, correlation IDs) — this is a hidden prerequisite.
Give the full saga decision tree.
?
Is strict atomicity required?
├─ YES: Is synchronous communication required?
│   ├─ YES: Need orchestration?
│   │   ├─ YES → Epic Saga (sao) ← recommended
│   │   └─ NO → Phone Tag Saga (sac) ← usually avoid; add orchestrator
│   └─ NO: Need orchestration?
│       ├─ YES → Fantasy Fiction (aao) ← use if scale + atomicity both needed
│       └─ NO → Horror Story (aac) ← avoid; switch to aao
│
└─ NO: Is synchronous communication required?
    ├─ YES: Need orchestration?
    │   ├─ YES → Fairy Tale Saga (seo) ← good middle ground
    │   └─ NO → Time Travel Saga (sec) ← awkward; prefer seo or aeo
    └─ NO: Need orchestration?
        ├─ YES → Parallel Saga (aeo) ← recommended for high throughput
        └─ NO → Anthology Saga (aec) ← recommended for max decoupling
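Since each of the three questions fixes one letter of the code, the tree reduces to a lookup. A minimal Python sketch follows. The function name `choose_saga` is an assumption, and the pattern names are copied from the eight-pattern table.

```python
PATTERN_NAMES = {
    "sao": "Epic Saga",
    "sac": "Phone Tag Saga",
    "seo": "Fairy Tale Saga",
    "sec": "Time Travel Saga",
    "aao": "Fantasy Fiction Saga",
    "aac": "Horror Story",
    "aeo": "Parallel Saga",
    "aec": "Anthology Saga",
}


def choose_saga(atomic: bool, synchronous: bool, orchestrated: bool) -> str:
    """Map the three framework answers to a three-letter saga code."""
    return (
        ("s" if synchronous else "a")      # communication
        + ("a" if atomic else "e")         # consistency
        + ("o" if orchestrated else "c")   # coordination
    )
```

The lookup only names the pattern that matches the answers. The advice attached to each leaf, such as avoiding sac and aac, still has to be applied by the designer.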
Which two saga patterns are most recommended for general use, and why?
?
Parallel Saga (aeo) and Anthology Saga (aec) are most commonly recommended:
Parallel Saga (aeo): Best balance of scalability, fault tolerance, and manageable complexity when eventual consistency is acceptable. Orchestrator provides visibility; async provides throughput; no compensation logic needed.
Anthology Saga (aec): Best for large-scale event-driven systems where service teams truly need to be independent. Maximum decoupling, highest fault tolerance, great scalability — but requires mature distributed tracing infrastructure.
Epic Saga (sao) is the third recommended option, specifically for scenarios requiring strong consistency with moderate throughput.
Compare coupling and complexity across all 8 patterns. Which has the highest complexity?
?
Highest complexity: Horror Story (aac) — async, atomic, and choreographed simultaneously. Compensation logic is distributed across all services with no orchestrator and out-of-order async events.
Coupling ranking (highest to lowest):
- Epic Saga (sao) — VH
- Phone Tag Saga (sac) — VH
- Fairy Tale Saga (seo) — H
- Time Travel Saga (sec) — H
- Fantasy Fiction Saga (aao) — M
- Parallel Saga (aeo) — M
- Horror Story (aac) — L temporal / VH behavioral
- Anthology Saga (aec) — VL
Complexity ranking (highest to lowest): aac > aao ≈ sac > aec > seo ≈ sec > aeo > sao
What are the key differences between the Parallel Saga (aeo) and the Anthology Saga (aec)?
?
Both are async + eventual, making them the most scalable patterns. The difference is coordination:
| Dimension | Parallel Saga (aeo) | Anthology Saga (aec) |
|---|---|---|
| Coordinator | Central orchestrator | None — fully choreographed |
| Workflow visibility | High — state in one place | Low — must trace across events |
| Service coupling | Medium — services known to orchestrator | Very Low — services know only events |
| Ease of modification | Moderate — change orchestrator | Hard — changing events affects all listeners |
| Team structure fit | One team owns orchestrator | Many independent teams |
Use Parallel Saga when you want scalability with observability; use Anthology Saga when service team independence is the top priority.
What is the difference between temporal coupling and behavioral coupling in choreographed sagas?
?
Temporal coupling: Whether one service must be available at the same time as another. Async choreographed patterns have low temporal coupling — services can be independently offline as long as a durable message broker buffers events.
Behavioral coupling: Whether one service’s behavior is implicitly dependent on another’s. Choreographed atomic sagas (aac) have very high behavioral coupling — every service must know (implicitly) what compensating events to listen for and emit, tied to the behavior of every other service in the saga. Changing one service’s compensation logic can break other services.
The Horror Story (aac) is deceptive: it appears loosely coupled (no direct service-to-service calls) but is actually very tightly coupled in terms of behavioral contracts.
What is the outbox pattern and why is it important for async sagas?
?
The outbox pattern solves the “dual write” problem in async sagas: how do you atomically write to your database AND publish an event to a message broker?
Problem: If you write to the DB then crash before publishing the event, the saga is stuck. If you publish first then crash before writing, you have a phantom event.
Solution: Write the event to an outbox table in the same local database transaction as your business data. A separate process (the outbox processor) reads from the outbox table and publishes to the broker. This guarantees at-least-once delivery without distributed transactions.
This is an infrastructure-level prerequisite for reliable async saga step execution. See wiki-outbox-pattern.
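A compact sketch of the outbox pattern using Python's built-in `sqlite3` module. The table names, the `OrderCreated` event shape, and the `publish` callback (standing in for a broker client) are all assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, "
    "published INTEGER DEFAULT 0)"
)


def create_order(order_id: str) -> None:
    """Write business data and its event in ONE local transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "OrderCreated", "order_id": order_id}),),
        )


def relay_outbox(publish) -> int:
    """Outbox processor: publish pending rows, then mark them as sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))   # a crash after this re-publishes later
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)
```

Because a row is marked as published only after `publish` returns, a crash between the two steps leads to a duplicate publish on the next run. That is the at-least-once behavior described above, and it is why consumers need to be idempotent.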
How does the Sysops Squad case study apply the saga decision framework?
?
Scenario: Ticket assignment workflow — Create Ticket → Assign Technician → Notify Technician → Notify Customer → Update Billing.
Framework application:
- Atomicity required? Yes — ticket creation and billing must be consistent (financial)
- Synchronous required? No — user submits ticket; background processing is acceptable
- Orchestration preferred? Yes — complex branching logic for technician assignment; team wants visibility
Result: Fantasy Fiction Saga (aao) — async + atomic + orchestrated.
Pivot design: Non-compensatable steps (notifications) placed AFTER the pivot. Revised order:
Create Ticket → Assign Technician → Update Billing → [PIVOT] → Notify Technician → Notify Customer
Why is Chapter 12’s classification system described as the book’s “unique contribution” in Part II?
?
Prior to this taxonomy, practitioners described saga variants using informal terms (“orchestrated saga,” “event-driven saga,” “choreographed saga”) that conflated different axes and led to confusion. The SAHT three-axis framework:
- Separates orthogonal concerns: Communication style, consistency model, and coordination style are truly independent — any combination is theoretically possible
- Generates a complete, closed set: 2³ = 8 patterns; there are no others to discover
- Creates a decision framework: Because axes are independent, you can reason about each dimension separately and arrive at a pattern systematically
- Provides memorable names: Epic Saga, Horror Story, etc. make the patterns sticky and communicable across teams
This systematic approach replaces ad hoc pattern selection with a principled trade-off analysis — consistent with the book’s overall philosophy established in ch01-no-best-practices.
What are the three general strategies for dealing with failures in saga steps?
?
- Compensating transaction: Execute a business-level reversal of the completed step (used in atomic sagas). Requires all prior steps to have compensating logic designed upfront.
- Retry: Re-send the failed step’s command/request, relying on idempotency to prevent duplicate effects. Appropriate for transient failures (network blip, temporary service unavailability).
- Ignore / flag for reconciliation: Accept the failure, record it, and let a background reconciliation process resolve the inconsistency later (used in eventual consistency sagas). Appropriate when the business can tolerate temporary inconsistency.
Which to choose: Determined by the consistency axis. Atomic sagas must compensate. Eventual sagas can retry or reconcile.
Total Cards: 36
Priority: HIGH
Last Updated: 2026-05-30