Chapter 11: Managing Distributed Workflows
Tags: saht, distributed-workflows, orchestration, choreography, workflow-state, event-driven
Status: Notes complete
Overview
In a distributed system, business processes rarely live inside a single service. A “create order” workflow might touch an Order service, an Inventory service, a Payment service, a Notification service, and a Shipping service — each of which owns its own data and runs in its own process. The question Chapter 11 addresses is: how do you coordinate this multi-step workflow across services, and who is responsible for knowing where the workflow is at any given moment?
This is the distributed workflow problem, and it is more complex than it appears. In a monolith, the call stack is the workflow state: you know exactly what step you are on because you are inside a method call, and the runtime thread holds the state. In a distributed system, there is no shared thread, no shared call stack, and no shared memory. State must be managed explicitly — and every approach to managing it has different trade-offs.
The chapter presents two fundamental communication styles for distributed workflows:
- Orchestration — a central coordinator tells each service what to do and when
- Choreography — each service reacts to events from other services with no central coordinator
These are not merely implementation details. They are architectural decisions that determine how tightly services are coupled, how observable the workflow is, who owns workflow state, and how easy it is to add or change workflow steps without modifying other services.
The chapter also addresses the hardest part of choreography: how do you know the overall state of a workflow when no single service has the full picture?
The Sysops Squad Saga demonstrates how the team chooses between orchestration and choreography for the ticket lifecycle workflow.
Core Concepts
Workflow: A sequence of steps (involving one or more services) that together complete a business process. Examples: processing a customer order, onboarding a new employee, assigning and resolving a support ticket.
Mediator: In orchestration, the central component that coordinates workflow execution. It knows the sequence of steps, which service performs each step, and how to handle errors at each step.
Orchestration: A workflow communication style where a central orchestrator (mediator) explicitly directs each participant service: “Do step 1 now,” “Do step 2 now,” “Step 2 failed — compensate.” Participants respond only to explicit commands from the orchestrator.
Choreography: A workflow communication style where there is no central coordinator. Each service knows what to do when it receives a particular event. Services react to domain events published by other services. No single service has the full picture of the workflow — the workflow “emerges” from the interactions.
Workflow state: The current status of a particular workflow instance — which step has been completed, what the current step is, what the outcome was at each step, and whether the overall workflow succeeded or failed.
Front controller pattern: In choreography, a lightweight first service receives the initial request, persists workflow state, and publishes the first event — acting as the single entry point without becoming a full orchestrator.
Stamp coupling: Passing a data structure through a workflow that contains more fields than any single service needs, so that downstream services can add their own results to the structure. Used in choreography to thread workflow state through event messages.
Event sourcing: Reconstructing the current state of a workflow (or entity) by replaying all events that have been applied to it, rather than storing the current state directly.
Orchestration Communication Style
What It Is
A central orchestrator service (the mediator) explicitly controls the execution of a workflow. The orchestrator calls each participant service in sequence (or in parallel where the workflow allows), waits for responses, handles errors, applies retry logic, and decides how to proceed at each step.
Orchestration: "Ticket Processing Workflow"
Client Request
|
v
+------+-------+ "Validate ticket"
| Orchestrator | -------> Validation Service
| (Mediator) | <------- OK
| |
| | "Assign expert"
| | -------> Assignment Service
| | <------- Expert assigned
| |
| | "Notify expert"
| | -------> Notification Service
| | <------- Notification sent
| |
| | "Bill customer"
| | -------> Billing Service
| | <------- Invoice created
| |
| Workflow |
| state lives |
| here |
+------+-------+
|
v
Response to Client
How Orchestration Works in Practice
The orchestrator typically:
- Receives the initial trigger (API call, event, schedule)
- Persists the initial workflow state
- Calls Service 1, waits for response
- Updates workflow state with Step 1 result
- Calls Service 2 (possibly with data from Step 1’s response)
- Continues until all steps complete or a step fails
- On failure: executes compensation logic (calls rollback operations on previously successful steps)
- On success: reports workflow completion to the caller or publishes a completion event
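The orchestrator steps listed above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the `Step`, `WorkflowState`, and `Orchestrator` names, the participant functions, and the dictionary standing in for a durable store are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[Dict], Dict]                       # forward call to a participant
    compensate: Optional[Callable[[Dict], None]] = None  # optional rollback call


@dataclass
class WorkflowState:
    workflow_id: str
    status: str = "PENDING"
    completed_steps: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None
    data: Dict = field(default_factory=dict)


class Orchestrator:
    """Runs steps in order and keeps per-instance workflow state."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps
        self.store: Dict[str, WorkflowState] = {}  # stand-in for a durable store

    def run(self, workflow_id: str, request: Dict) -> WorkflowState:
        state = WorkflowState(workflow_id, status="RUNNING", data=dict(request))
        self.store[workflow_id] = state  # persist the initial workflow state
        done: List[Step] = []
        for step in self.steps:
            try:
                result = step.action(state.data)
            except Exception:
                state.status = "FAILED"
                state.failed_step = step.name
                # Compensate previously successful steps in reverse order.
                for prev in reversed(done):
                    if prev.compensate:
                        prev.compensate(state.data)
                return state
            state.data.update(result)
            state.completed_steps.append(step.name)
            done.append(step)
        state.status = "COMPLETED"
        return state


# Hypothetical participants: billing fails, so the assignment is rolled back.
released: List[str] = []


def validate(data: Dict) -> Dict:
    return {"validationResult": "ok"}


def assign(data: Dict) -> Dict:
    return {"expertId": "E-042"}


def unassign(data: Dict) -> None:
    released.append(data["expertId"])


def bill(data: Dict) -> Dict:
    raise RuntimeError("billing unavailable")


orchestrator = Orchestrator([
    Step("validate", validate),
    Step("assign", assign, compensate=unassign),
    Step("bill", bill),
])
state = orchestrator.run("4729", {"ticketId": "T-001"})
```

Because the state lives in `orchestrator.store`, the question "where is workflow 4729?" is a single lookup. A real orchestrator would also need retries, timeouts, and storage that survives restarts, which is what the workflow engines mentioned in these notes provide.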
Common implementation patterns:
- A dedicated orchestrator service (most common in microservices)
- A workflow engine or BPM tool (Camunda, Temporal, AWS Step Functions) that provides durability, retry, and state management
- A saga orchestrator specifically managing distributed transactions (see ch12-transactional-sagas)
Trade-offs of Orchestration
Advantages:
| Advantage | Explanation |
|---|---|
| Centralized workflow state | The orchestrator always knows the current state of every active workflow instance — easy to query, monitor, and debug |
| Error handling clarity | Compensation logic, retries, and failure handling are all in one place — the orchestrator — rather than distributed across services |
| Easy to add steps | Adding a new step means updating the orchestrator; participant services are unaware of the change |
| Observability | The orchestrator is a single point for workflow telemetry — dashboards, alerts, and SLAs can be monitored from one service |
| Explicit sequencing | The order of operations is explicit and readable in the orchestrator’s code — no need to trace event flows across multiple services |
Disadvantages:
| Disadvantage | Explanation |
|---|---|
| Single point of failure | If the orchestrator is unavailable, all workflows in progress stall — no other service can advance the workflow |
| Coupling to orchestrator | Participant services may need to implement specific APIs or response formats the orchestrator expects — behavioral coupling to the orchestrator |
| God service risk | Over time, business logic can creep into the orchestrator, turning it into a service with too much responsibility — “workflow logic” and “business logic” blur |
| Scalability bottleneck | All workflow traffic flows through the orchestrator — it must scale with the number of concurrent workflow instances |
| Orchestrator becomes a dependency | Every service that participates in any workflow must be available and compatible with the orchestrator’s protocol |
Orchestration Sequence Diagram
Client Orchestrator Validation Assignment Notification Billing
| | | | | |
|--request->| | | | |
| |--validate--->| | | |
| |<--ok---------| | | |
| |--assign-----------------> | | |
| |<--expert_id--------------| | |
| |--notify-------------------------------------->| |
| |<--sent-----------------------------------------| |
| |--bill----------------------------------------------------->|
| |<--invoice_id--------------------------------------------- |
|<--done----|
Choreography Communication Style
What It Is
In choreography, there is no central orchestrator. Instead, each service in the workflow knows what to do when it receives a particular event. Services publish events when they complete their work; other services subscribe to those events and react accordingly. The workflow “emerges” from the choreographed interactions between services.
Choreography: "Ticket Processing Workflow"
Client Request
|
v
+------+-------+
| Ticket | publishes: TicketCreated
| Service |
+---------------+
|
v
+-------+--------+
| Validation | consumes: TicketCreated
| Service | publishes: TicketValidated
+-------+--------+
|
v
+-------+--------+
| Assignment | consumes: TicketValidated
| Service | publishes: ExpertAssigned
+-------+--------+
|
v
+-------+--------+
| Notification | consumes: ExpertAssigned
| Service | publishes: ExpertNotified
+-------+--------+
|
v
+-------+--------+
| Billing | consumes: ExpertNotified
| Service | publishes: CustomerBilled
+-------+--------+
No single service knows the full workflow state!
How Choreography Works in Practice
- The initial trigger (API call) causes the first service to perform its action and publish a domain event to a message broker
- The broker delivers the event to all subscribed services
- Each subscribed service performs its action and publishes its own event
- The chain continues until the workflow completes (no more services react to the final event, or a terminal event is published)
- On failure: the failing service publishes a failure event; other services that subscribed to failure events may perform compensating actions
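Infrastructure required: A reliable message broker (Kafka, RabbitMQ, AWS SQS/SNS, Google Pub/Sub) that provides durable event delivery and ordering where needed. Replay capability depends on the broker: log-based brokers such as Kafka retain events so they can be re-read, while queue-based brokers typically remove a message once it has been consumed.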
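The event chain described above can be imitated with a toy in-memory broker. This is only a sketch under stated assumptions: `InMemoryBroker` and `react` are hypothetical names, delivery is synchronous and single-process, and each service is reduced to "consume one event, publish the next".

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[Dict], None]


class InMemoryBroker:
    """Toy stand-in for a real message broker; delivers events in FIFO order."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self.queue: Deque[Tuple[str, Dict]] = deque()
        self.log: List[str] = []  # event types in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.log.append(event_type)
        self.queue.append((event_type, payload))

    def drain(self) -> None:
        # Deliver queued events until no handler publishes anything new.
        while self.queue:
            event_type, payload = self.queue.popleft()
            for handler in list(self.subscribers[event_type]):
                handler(payload)


def react(broker: InMemoryBroker, consumes: str, publishes: str) -> None:
    """Register a service that reacts to one event by publishing the next."""

    def handler(payload: Dict) -> None:
        # A real service would do its own work here before publishing.
        broker.publish(publishes, payload)

    broker.subscribe(consumes, handler)


broker = InMemoryBroker()
react(broker, "TicketCreated", "TicketValidated")   # Validation Service
react(broker, "TicketValidated", "ExpertAssigned")  # Assignment Service
react(broker, "ExpertAssigned", "ExpertNotified")   # Notification Service
react(broker, "ExpertNotified", "CustomerBilled")   # Billing Service

# The Ticket Service starts the chain; nothing else directs the flow.
broker.publish("TicketCreated", {"workflowId": "4729", "ticketId": "T-001"})
broker.drain()
```

Note that the overall sequence appears nowhere in the code as a single list of steps. It exists only as the set of `react` registrations, which is the "implicit workflow" property of choreography.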
Trade-offs of Choreography
Advantages:
| Advantage | Explanation |
|---|---|
| Highly decoupled | No service knows about any other service — only about event types and its own behavior |
| Independent deployability | Any service can be replaced, updated, or scaled without the others knowing, as long as event contracts are preserved |
| No single point of failure | There is no orchestrator to fail; the workflow continues as long as the message broker and participant services are available |
| Easier to add consumers | A new service can be added to the workflow simply by subscribing to the relevant event — existing services do not change |
| Scales naturally | Each service scales independently based on its own load |
Disadvantages:
| Disadvantage | Explanation |
|---|---|
| Distributed workflow state | No single service has the full picture of where a workflow instance is at any given moment — the hard part |
| Difficult to debug | Tracing a specific workflow instance across multiple services and events requires distributed tracing tooling (Jaeger, Zipkin) |
| Error handling complexity | Compensation and rollback logic must be distributed across services — each service must know what to do if “its” event is a failure event |
| Implicit sequencing | The order of operations is implicit — you must trace event subscriptions across multiple services to understand the full workflow |
| Workflow changes require coordination | Changing the order of steps may require updating multiple services’ event subscriptions |
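To make the distributed error handling concrete, here is a small sketch of compensation driven by a failure event. The `BillingFailed` event name, the `reason` field, and the two handler functions are assumptions for illustration; the notes do not name specific failure events.

```python
from typing import Callable, Dict, List

Handler = Callable[[Dict], None]
handlers: Dict[str, List[Handler]] = {}


def subscribe(event_type: str, handler: Handler) -> None:
    handlers.setdefault(event_type, []).append(handler)


def publish(event_type: str, payload: Dict) -> None:
    # Toy synchronous delivery; a real broker would deliver asynchronously.
    for handler in handlers.get(event_type, []):
        handler(payload)


assignments: Dict[str, str] = {"T-001": "E-042"}  # ticketId -> expertId
notices: List[str] = []


def release_expert(event: Dict) -> None:
    # Assignment Service undoes its own step; pop() makes redelivery harmless.
    assignments.pop(event["ticketId"], None)


def notify_cancellation(event: Dict) -> None:
    # Notification Service compensates by telling the expert the job is off.
    notices.append(f"Ticket {event['ticketId']} cancelled: {event['reason']}")


subscribe("BillingFailed", release_expert)
subscribe("BillingFailed", notify_cancellation)

# Billing Service cannot complete its step, so it publishes a failure event.
publish("BillingFailed", {"workflowId": "4729", "ticketId": "T-001", "reason": "card declined"})
```

Each service owns its own rollback, so no single place lists every compensation for the workflow. That is the cost the table above describes.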
Choreography Sequence Diagram
Client Ticket Svc Broker Validation Assignment Notification Billing
| | | | | | |
|--req-->| | | | | |
| |--publish-->| | | | |
| | (TicketCreated) | | | |
|<--ack--| |--event->| | | |
| | | (TicketCreated) | | |
| | | |--publish--->| | |
| | | | (TicketValidated) | |
| | | | |--publish-->| |
| | | | | (ExpertAssigned) |
| | | | | |--publish->|
| | | | | | (ExpertNotified)
| | | | | | |
| | | | | | (CustomerBilled)
Workflow State Management in Choreography
This is the “hard part” the book’s title promises. In orchestration, workflow state is straightforward because the orchestrator holds it. In choreography, no single service has the full picture, so how do you answer “What is the status of workflow instance #4729?”
The chapter presents three main approaches to managing workflow state in choreography:
Approach 1: Front Controller Pattern
A lightweight “front controller” service receives the initial request and is responsible for:
- Persisting the initial workflow record (with a unique workflow ID)
- Publishing the first event to start the choreography chain
- Subscribing to the final completion/failure events to update the workflow record
The front controller is not a full orchestrator — it does not coordinate each step. It only captures the beginning and end of the workflow, plus the workflow ID that all events carry so they can be correlated.
+---------------------+
| Front Controller |
| - Creates workflow | -----publishes TicketCreated (with workflowId)----->
| - Stores status |
| - Subscribes to | <----receives CustomerBilled (with workflowId)------
| terminal events |
| - Updates status |
+---------------------+
Workflow DB:
workflowId | status | created_at | completed_at
4729 | completed | 2026-05-30 09:00:00 | 2026-05-30 09:00:05
Trade-off: The front controller has only partial visibility. It knows that the workflow started and whether it ultimately completed or failed, but not which intermediate step the workflow is currently at. It is a light-touch solution appropriate when coarse-grained status tracking is sufficient.
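A front controller along these lines might look like the following sketch. The `FrontController` class, the `WorkflowFailed` event name, and the injected `publish` callable are hypothetical. A real implementation would persist the record in a database rather than a dictionary.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Publish = Callable[[str, Dict], None]


class FrontController:
    """Tracks only the start and end of each workflow instance."""

    # Terminal event type -> resulting status ("WorkflowFailed" is an assumed name).
    TERMINAL = {"CustomerBilled": "completed", "WorkflowFailed": "failed"}

    def __init__(self, publish: Publish) -> None:
        self.publish = publish
        self.workflows: Dict[str, str] = {}  # workflowId -> status
        self._ids = itertools.count(4729)

    def start(self, ticket: Dict) -> str:
        workflow_id = str(next(self._ids))
        self.workflows[workflow_id] = "in_progress"  # persist before publishing
        self.publish("TicketCreated", {**ticket, "workflowId": workflow_id})
        return workflow_id

    def on_event(self, event_type: str, payload: Dict) -> None:
        # Only terminal events change the record; intermediate ones are ignored.
        status = self.TERMINAL.get(event_type)
        if status and payload.get("workflowId") in self.workflows:
            self.workflows[payload["workflowId"]] = status


published: List[Tuple[str, Dict]] = []
controller = FrontController(lambda event_type, payload: published.append((event_type, payload)))

wid = controller.start({"ticketId": "T-001"})
controller.on_event("ExpertAssigned", {"workflowId": wid})  # intermediate: no change
status_mid = controller.workflows[wid]
controller.on_event("CustomerBilled", {"workflowId": wid})  # terminal: completes
status_end = controller.workflows[wid]
```

The intermediate `ExpertAssigned` event leaves the status at `in_progress`, which shows the limitation directly: the record cannot say which step is currently running.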
Approach 2: Stamp Coupling for State
Each event in the choreography chain carries a data envelope (the “stamp”) that contains not just the payload for the current step but also accumulated state from previous steps. Each service adds its own contribution to the stamp before publishing the next event. The final event in the chain contains the complete workflow history.
TicketCreated event:
{
"workflowId": "4729",
"ticketId": "T-001",
"customerId": "C-999",
"description": "Laptop won't start"
}
TicketValidated event:
{
"workflowId": "4729",
"ticketId": "T-001",
"customerId": "C-999",
"description": "Laptop won't start",
"validationResult": "ok", <-- added by Validation Service
"priority": "HIGH" <-- added by Validation Service
}
ExpertAssigned event:
{
"workflowId": "4729",
"ticketId": "T-001",
"customerId": "C-999",
"description": "Laptop won't start",
"validationResult": "ok",
"priority": "HIGH",
"expertId": "E-042", <-- added by Assignment Service
"scheduledTime": "2026-05-31" <-- added by Assignment Service
}
Advantages:
- No shared database needed for state — state travels with the events
- Each service has full context it needs from previous steps
- Replay is straightforward — the full state is in the event stream
Disadvantages:
- Events grow larger as the workflow progresses — bandwidth and serialization cost
- Services receive data they do not need (they only process their own step but see the full stamp)
- Schema changes to early-stage fields ripple through all downstream events
- Workflow state is only visible to a service when it processes an event — no way to query “current state” without processing all events
When to use: When the workflow data is small, the number of steps is bounded, and the team values simplicity over query capability.
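As a rough illustration of how a stamp accumulates, each service below copies the incoming event and adds only its own fields. The function names and the fallback expert ID `E-007` are made up for the example; the field names follow the sample events above.

```python
from typing import Dict


def validation_service(stamp: Dict) -> Dict:
    # Copy the incoming stamp, then add only this service's contribution.
    return {**stamp, "validationResult": "ok", "priority": "HIGH"}


def assignment_service(stamp: Dict) -> Dict:
    # Reads "priority" from the stamp instead of calling back upstream.
    expert = "E-042" if stamp["priority"] == "HIGH" else "E-007"
    return {**stamp, "expertId": expert, "scheduledTime": "2026-05-31"}


ticket_created = {
    "workflowId": "4729",
    "ticketId": "T-001",
    "customerId": "C-999",
    "description": "Laptop won't start",
}
ticket_validated = validation_service(ticket_created)
expert_assigned = assignment_service(ticket_validated)
```

Building a new dictionary instead of mutating the input keeps earlier events intact. The trade-off is visible too: `expert_assigned` carries every upstream field whether or not the next consumer needs it.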
Approach 3: Event Sourcing for State
A dedicated workflow state service (or event store) subscribes to all workflow events across all services. It maintains a complete record of every event that occurred for every workflow instance. Workflow state at any point can be reconstructed by replaying the event log.
All workflow events ---> +--------------------+
| Workflow State |
| Service / |
| Event Store |
| |
| workflowId: 4729 |
| events: |
| - TicketCreated |
| - TicketValidated |
| - ExpertAssigned |
| - ExpertNotified |
| (CustomerBilled |
| not yet seen) |
+--------------------+
Query: "What is the state of workflow 4729?"
Answer: Replay events -> "Step 4 of 5 complete, waiting for billing"
Advantages:
- Complete workflow state is always queryable
- Full audit trail — every state transition is recorded
- Can reconstruct state at any historical point
- Natural fit for saga compensation — you can always see what succeeded and needs reversal
Disadvantages:
- Significant infrastructure investment — event store, event schema governance
- State reconstruction requires replaying events — can be slow for long-running workflows with many events
- Eventual consistency in state queries — the state service may not have processed all events yet
- Complex to implement correctly — event ordering, idempotency, and compaction are non-trivial
When to use: When auditability is required (financial, compliance, healthcare), when workflows are long-running, when compensating actions are complex, or when workflow observability is a high priority.
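The replay query shown earlier can be sketched as follows. This is a simplified model, assuming each event type occurs at most once per workflow and that the expected step order is known up front. `WorkflowEventStore` and `WORKFLOW_STEPS` are illustrative names, not part of any particular event-store product.

```python
from collections import defaultdict
from typing import Dict, List

# Expected order of events for the ticket workflow used in these notes.
WORKFLOW_STEPS = [
    "TicketCreated",
    "TicketValidated",
    "ExpertAssigned",
    "ExpertNotified",
    "CustomerBilled",
]


class WorkflowEventStore:
    """Append-only log per workflow; status is derived by replay, never stored."""

    def __init__(self) -> None:
        self._events: Dict[str, List[str]] = defaultdict(list)

    def append(self, workflow_id: str, event_type: str) -> None:
        # Idempotent append: a redelivered event is recorded only once.
        if event_type not in self._events[workflow_id]:
            self._events[workflow_id].append(event_type)

    def state(self, workflow_id: str) -> str:
        seen = set(self._events.get(workflow_id, []))
        # Replay: count how many steps, in expected order, have been observed.
        completed = 0
        for step in WORKFLOW_STEPS:
            if step not in seen:
                break
            completed += 1
        total = len(WORKFLOW_STEPS)
        if completed == total:
            return f"Step {total} of {total} complete, workflow finished"
        return f"Step {completed} of {total} complete, waiting for {WORKFLOW_STEPS[completed]}"


store = WorkflowEventStore()
for event in ["TicketCreated", "TicketValidated", "ExpertAssigned", "ExpertNotified"]:
    store.append("4729", event)
store.append("4729", "ExpertNotified")  # duplicate delivery is ignored

status = store.state("4729")
```

The duplicate `ExpertNotified` append hints at why idempotency is listed among the non-trivial concerns: brokers with at-least-once delivery can hand the state service the same event more than once.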
Head-to-Head: Orchestration vs. Choreography
Comprehensive Comparison Table
| Dimension | Orchestration | Choreography |
|---|---|---|
| Workflow state visibility | HIGH — orchestrator holds all state | LOW to MEDIUM — distributed or reconstructed |
| Coupling between services | MEDIUM — participants coupled to orchestrator’s API | LOW — participants only coupled to event schemas |
| Independent deployability | MEDIUM — orchestrator must be updated for changes | HIGH — services deploy independently |
| Error handling | CENTRALIZED — all in orchestrator | DISTRIBUTED — each service handles its own failures |
| Observability / debuggability | HIGH — single point of monitoring | LOW — requires distributed tracing tools |
| Scalability bottleneck | POSSIBLE — orchestrator may bottleneck | NONE — no central coordinator |
| Single point of failure | YES — orchestrator failure stops all workflows | NO central coordinator, but the broker is shared infrastructure that must stay available; partial failure is possible |
| Adding new workflow steps | SIMPLE — update orchestrator | MEDIUM — new service subscribes to existing events |
| Changing step order | SIMPLE — update orchestrator logic | COMPLEX — may require changing multiple services’ subscriptions |
| Testing | EASIER — mock participant services against orchestrator | HARDER — must trace event chains across services |
| Team autonomy | LOWER — participant teams must coordinate with orchestrator team | HIGHER — teams are more independent |
| Implicit vs. explicit workflow | EXPLICIT — readable in orchestrator code | IMPLICIT — workflow must be reconstructed from event subscriptions |
When to Choose Orchestration
- The workflow involves complex conditional branching (if X then Y else Z)
- Error compensation is complex and involves multiple rollback paths
- Business stakeholders need real-time workflow visibility and status reporting
- The team is small and works on both the orchestrator and participant services
- Strong consistency and deterministic sequencing are required
- The workflow is long-running and must survive process restarts (durable orchestration)
- You are implementing a saga with compensation steps (see ch12-transactional-sagas)
When to Choose Choreography
- Services are owned by different teams who need strong deployment independence
- The workflow is relatively simple and linear (few conditional branches)
- Eventual consistency is acceptable for workflow state visibility
- New participants will be added over time and must not require changes to existing services
- High scalability is required and the orchestrator would become a bottleneck
- You want to maximize the architectural benefit of the event-driven pattern already used elsewhere
Combining Orchestration and Choreography
Real-world systems rarely use one style exclusively. A common pattern is to use orchestration within a bounded context and choreography between bounded contexts:
Bounded Context: Ticket Management (internal — orchestration)
+----------------------------------------+
| Orchestrator |
| -> Validation Service (call) |
| -> Assignment Service (call) |
| -> Notification Service (call) |
+----------------------------------------+
|
publishes: TicketResolved (domain event)
|
+---v------------------------------------+
| Bounded Context: Billing (external |
| — choreography via event) |
| Billing Service subscribes to |
| TicketResolved and creates invoice |
+----------------------------------------+
This hybrid approach gives:
- The clarity and control of orchestration within a domain where one team controls all participants
- The decoupling and independence of choreography across domain and team boundaries
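One way to picture the hybrid is an orchestrated function that ends by publishing a single domain event. The sketch below assumes hypothetical participant functions and a toy synchronous `publish`. Only the `TicketResolved` event name comes from the diagram above.

```python
from typing import Callable, Dict, List

Handler = Callable[[Dict], None]
subscribers: Dict[str, List[Handler]] = {"TicketResolved": []}
invoices: List[str] = []
surveys: List[str] = []


def publish(event_type: str, payload: Dict) -> None:
    # Toy synchronous fan-out; a real broker would deliver asynchronously.
    for handler in subscribers.get(event_type, []):
        handler(payload)


# Participants inside the Ticket Management context (called directly).
def validate(ticket: Dict) -> Dict:
    return {"validationResult": "ok"}


def assign(ticket: Dict) -> Dict:
    return {"expertId": "E-042"}


def notify(ticket: Dict) -> Dict:
    return {"expertNotified": True}


def resolve_ticket(ticket_id: str) -> Dict:
    """Orchestrated part: direct calls in a fixed order within one context."""
    ticket: Dict = {"ticketId": ticket_id}
    for call in (validate, assign, notify):
        ticket.update(call(ticket))
    publish("TicketResolved", ticket)  # hand-off across the context boundary
    return ticket


# Other contexts react to the event; the orchestrator does not know about them.
subscribers["TicketResolved"].append(lambda e: invoices.append(e["ticketId"]))  # Billing
subscribers["TicketResolved"].append(lambda e: surveys.append(e["ticketId"]))   # Survey

ticket = resolve_ticket("T-001")
```

Adding another post-resolution consumer means appending one more subscriber. `resolve_ticket` itself does not change, which is the decoupling the hybrid is meant to preserve.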
Workflow State Management: Summary Comparison
| Approach | Visibility | Infrastructure Cost | Query Capability | Best For |
|---|---|---|---|---|
| Front Controller | START/END only | LOW | Coarse-grained status | Simple workflows where knowing done/failed is enough |
| Stamp Coupling | Per-event state | LOW (travels with events) | None (must process events) | Small payloads, bounded steps |
| Event Sourcing | COMPLETE history | HIGH | Full audit/replay | Compliance, complex sagas, long-running |
Sysops Squad Saga: Managing Ticket Workflows
The problem: The Sysops Squad ticket lifecycle involves multiple services:
- Customer submits ticket (Ticket service)
- Ticket is validated (Validation service)
- Expert is assigned (Assignment service)
- Expert is notified (Notification service)
- Expert resolves the ticket (Resolution service)
- Customer is billed (Billing service)
- Survey is sent (Survey service)
The question: Should this workflow be orchestrated (one central coordinator) or choreographed (services reacting to events)?
What the team considers:
The team maps out the forces acting on this decision:
Forces favoring orchestration:
- Business operations team needs to see ticket status at any moment — “Is ticket #4729 at the assignment step or the billing step?”
- There are complex error scenarios — if assignment fails (no available expert), there are retry rules and escalation paths that require conditional logic
- SLAs require knowing exact elapsed time at each step
Forces favoring choreography:
- The Billing, Survey, and Notification teams are separate and want deployment independence
- The system processes thousands of tickets per hour — an orchestrator could bottleneck
- The team wants to add a new “Quality Review” step later without touching existing services
Decision: The team adopts a hybrid approach:
- Orchestration for the core operational workflow (Ticket → Validation → Assignment → Resolution): high visibility, complex error handling needed, single team owns these services
- Choreography for post-resolution steps (Billing, Survey, Notification of completion): these teams are independent, the steps are simple reactions to the TicketResolved event, and adding new post-resolution steps in the future should not require changes to the core workflow
Workflow state management: For the choreography portion, the team uses the Front Controller Pattern: the Ticket service records that a TicketResolved event was published and subscribes to a terminal WorkflowCompleted event from the last downstream service. This gives coarse-grained status visibility without the overhead of full event sourcing.
Key Takeaways
- Distributed workflow management is a fundamental architectural problem: without a shared call stack, workflow state must be designed explicitly — it does not emerge automatically.
- Orchestration gives a central coordinator (the mediator) full control and full visibility of workflow state, but introduces a single point of failure and coupling between participants and the orchestrator.
- Choreography achieves maximum decoupling and independent deployability but distributes workflow state across services — the “hard part” is that no single service knows the overall status of an in-flight workflow instance.
- The question “Where is my order?” (or “What step is ticket #4729 at?”) is trivial in orchestration and genuinely hard in choreography — this is the central trade-off of the chapter.
- Three approaches to choreography state management: Front Controller (lightweight, coarse-grained), Stamp Coupling (state travels with events, no shared DB), and Event Sourcing (full audit trail, highest infrastructure cost).
- Stamp coupling is a deliberate design choice — passing more data than any single service needs so that downstream services have the context they need without calling back upstream.
- Choreography’s implicit workflow (workflow logic spread across event subscriptions) makes it harder to understand, test, and change than orchestration’s explicit workflow (logic concentrated in the orchestrator).
- Real-world architectures often combine both styles: orchestration within a bounded context (where one team controls all participants) and choreography between bounded contexts (where team independence is paramount).
- Distributed tracing tools (Jaeger, Zipkin, OpenTelemetry) are not optional in a choreography-based architecture — they are the primary mechanism for answering “what happened to workflow instance X?”
- The choice between orchestration and choreography should be driven primarily by: (a) how important real-time workflow state visibility is, (b) how complex error compensation is, and (c) whether participant services are owned by the same team or different teams.
Related Resources
- ch09-data-ownership-distributed-transactions — Data ownership principles that inform which service is authoritative for workflow state
- ch10-distributed-data-access — Data access patterns used within workflows (services often need to read each other’s data during workflow execution)
- ch12-transactional-sagas — The transactional implementation of orchestrated and choreographed workflows when distributed transaction correctness is required; sagas are a specific application of orchestration/choreography to distributed transactions
- ch02-coupling — The coupling dimensions (runtime, semantic, behavioral) that orchestration and choreography affect differently
- ch07-service-granularity — Service granularity decisions affect whether a workflow fits inside one service (and is thus not distributed) or must span multiple services
Last Updated: 2026-05-30