Chapter 11: Managing Distributed Workflows

saht distributed-workflows orchestration choreography workflow-state event-driven

Status: Notes complete


Overview

In a distributed system, business processes rarely live inside a single service. A “create order” workflow might touch an Order service, an Inventory service, a Payment service, a Notification service, and a Shipping service — each of which owns its own data and runs in its own process. The question Chapter 11 addresses is: how do you coordinate this multi-step workflow across services, and who is responsible for knowing where the workflow is at any given moment?

This is the distributed workflow problem, and it is more complex than it appears. In a monolith, the call stack is the workflow state: you know exactly what step you are on because you are inside a method call, and the runtime thread holds the state. In a distributed system, there is no shared thread, no shared call stack, and no shared memory. State must be managed explicitly — and every approach to managing it has different trade-offs.

The chapter presents two fundamental communication styles for distributed workflows:

  1. Orchestration — a central coordinator tells each service what to do and when
  2. Choreography — each service reacts to events from other services with no central coordinator

These are not merely implementation details. They are architectural decisions that determine how tightly services are coupled, how observable the workflow is, who owns workflow state, and how easy it is to add or change workflow steps without modifying other services.

The chapter also addresses the hardest part of choreography: how do you know the overall state of a workflow when no single service has the full picture?

The Sysops Squad Saga demonstrates how the team chooses between orchestration and choreography for the ticket lifecycle workflow.


Core Concepts

Workflow: A sequence of steps (involving one or more services) that together complete a business process. Examples: processing a customer order, onboarding a new employee, assigning and resolving a support ticket.

Mediator: In orchestration, the central component that coordinates workflow execution. It knows the sequence of steps, which service performs each step, and how to handle errors at each step.

Orchestration: A workflow communication style where a central orchestrator (mediator) explicitly directs each participant service: “Do step 1 now,” “Do step 2 now,” “Step 2 failed — compensate.” Participants respond only to explicit commands from the orchestrator.

Choreography: A workflow communication style where there is no central coordinator. Each service knows what to do when it receives a particular event. Services react to domain events published by other services. No single service has the full picture of the workflow — the workflow “emerges” from the interactions.

Workflow state: The current status of a particular workflow instance — which step has been completed, what the current step is, what the outcome was at each step, and whether the overall workflow succeeded or failed.

Front controller pattern: In choreography, a lightweight first service receives the initial request, persists workflow state, and publishes the first event — acting as the single entry point without becoming a full orchestrator.

Stamp coupling: Passing a data structure through a workflow that contains more fields than any single service needs, so that downstream services can add their own results to the structure. Used in choreography to thread workflow state through event messages.

Event sourcing: Reconstructing the current state of a workflow (or entity) by replaying all events that have been applied to it, rather than storing the current state directly.


Orchestration Communication Style

What It Is

A central orchestrator service (the mediator) explicitly controls the execution of a workflow. The orchestrator calls each participant service in sequence (or in parallel where the workflow allows), waits for responses, handles errors, applies retry logic, and decides how to proceed at each step.

Orchestration: "Ticket Processing Workflow"

  Client Request
       |
       v
+------+-------+         "Validate ticket"
| Orchestrator |  -------> Validation Service
|  (Mediator)  |  <------- OK
|              |
|              |         "Assign expert"
|              |  -------> Assignment Service
|              |  <------- Expert assigned
|              |
|              |         "Notify expert"
|              |  -------> Notification Service
|              |  <------- Notification sent
|              |
|              |         "Bill customer"
|              |  -------> Billing Service
|              |  <------- Invoice created
|              |
|  Workflow    |
|  state lives |
|  here        |
+------+-------+
       |
       v
  Response to Client

How Orchestration Works in Practice

The orchestrator typically:

  1. Receives the initial trigger (API call, event, schedule)
  2. Persists the initial workflow state
  3. Calls Service 1, waits for response
  4. Updates workflow state with Step 1 result
  5. Calls Service 2 (possibly with data from Step 1’s response)
  6. Continues until all steps complete or a step fails
  7. On failure: executes compensation logic (calls rollback operations on previously successful steps)
  8. On success: reports workflow completion to the caller or publishes a completion event
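
The steps above can be sketched as a small in-memory orchestrator. This is a minimal illustration under stated assumptions, not a production design: the names run_workflow and WorkflowState are hypothetical, state is kept in memory rather than persisted, and each participant is modeled as a plain Python callable paired with a compensating callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a participant call is a callable that receives a context dict.
Action = Callable[[Dict], None]

@dataclass
class WorkflowState:
    workflow_id: str
    status: str = "running"  # running | completed | compensated
    completed_steps: List[str] = field(default_factory=list)

def run_workflow(workflow_id: str,
                 steps: List[Tuple[str, Action, Action]],
                 context: Dict) -> WorkflowState:
    """Run (name, action, compensate) steps in order; roll back on failure."""
    state = WorkflowState(workflow_id)  # step 2: initial state (in memory only here)
    for name, action, _ in steps:
        try:
            action(context)                     # steps 3 and 5: call the participant
            state.completed_steps.append(name)  # step 4: record the result
        except Exception:
            # step 7: compensate the steps that succeeded, newest first
            done = set(state.completed_steps)
            for prev_name, _, compensate in reversed(steps):
                if prev_name in done:
                    compensate(context)
            state.status = "compensated"
            return state
    state.status = "completed"                  # step 8
    return state

# Illustrative run: validation and assignment succeed, billing fails.
log: List[str] = []

def record(label: str) -> Action:
    return lambda ctx: log.append(label)

def failing_bill(ctx: Dict) -> None:
    raise RuntimeError("billing unavailable")

steps = [
    ("validate", record("validate"), record("undo-validate")),
    ("assign", record("assign"), record("undo-assign")),
    ("bill", failing_bill, record("undo-bill")),
]
result = run_workflow("4729", steps, {})
```

In this run the billing step raises, so the two completed steps are compensated in reverse order. A workflow engine would additionally persist state between steps and typically retry before compensating.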

Common implementation patterns:

  • A dedicated orchestrator service (most common in microservices)
  • A workflow engine or BPM tool (Camunda, Temporal, AWS Step Functions) that provides durability, retry, and state management
  • A saga orchestrator specifically managing distributed transactions (see ch12-transactional-sagas)

Trade-offs of Orchestration

Advantages:

Advantage | Explanation
Centralized workflow state | The orchestrator always knows the current state of every active workflow instance, so it is easy to query, monitor, and debug
Error handling clarity | Compensation logic, retries, and failure handling all live in one place (the orchestrator) rather than being distributed across services
Easy to add steps | Adding a new step means updating the orchestrator; participant services are unaware of the change
Observability | The orchestrator is a single point for workflow telemetry: dashboards, alerts, and SLAs can be monitored from one service
Explicit sequencing | The order of operations is explicit and readable in the orchestrator’s code, with no need to trace event flows across multiple services

Disadvantages:

Disadvantage | Explanation
Single point of failure | If the orchestrator is unavailable, all workflows in progress stall because no other service can advance the workflow
Coupling to orchestrator | Participant services may need to implement specific APIs or response formats the orchestrator expects (behavioral coupling to the orchestrator)
God service risk | Over time, business logic can creep into the orchestrator, turning it into a service with too much responsibility as “workflow logic” and “business logic” blur
Scalability bottleneck | All workflow traffic flows through the orchestrator, so it must scale with the number of concurrent workflow instances
Orchestrator becomes a dependency | Every service that participates in any workflow must be available and compatible with the orchestrator’s protocol

Orchestration Sequence Diagram

Client   Orchestrator   Validation   Assignment   Notification   Billing
  |           |              |             |             |           |
  |--request->|              |             |             |           |
  |           |--validate--->|             |             |           |
  |           |<--ok---------|             |             |           |
  |           |--assign------------------->|             |           |
  |           |<--expert_id----------------|             |           |
  |           |--notify--------------------------------->|           |
  |           |<--sent-----------------------------------|           |
  |           |--bill----------------------------------------------->|
  |           |<--invoice_id-----------------------------------------|
  |<--done----|

Choreography Communication Style

What It Is

In choreography, there is no central orchestrator. Instead, each service in the workflow knows what to do when it receives a particular event. Services publish events when they complete their work; other services subscribe to those events and react accordingly. The workflow “emerges” from the choreographed interactions between services.

Choreography: "Ticket Processing Workflow"

  Client Request
       |
       v
+------+-------+
| Ticket        |  publishes: TicketCreated
| Service       |
+---------------+
                        |
                        v
                +-------+--------+
                | Validation     | consumes: TicketCreated
                | Service        | publishes: TicketValidated
                +-------+--------+
                        |
                        v
                +-------+--------+
                | Assignment     | consumes: TicketValidated
                | Service        | publishes: ExpertAssigned
                +-------+--------+
                        |
                        v
                +-------+--------+
                | Notification   | consumes: ExpertAssigned
                | Service        | publishes: ExpertNotified
                +-------+--------+
                        |
                        v
                +-------+--------+
                | Billing        | consumes: ExpertNotified
                | Service        | publishes: CustomerBilled
                +-------+--------+

  No single service knows the full workflow state!

How Choreography Works in Practice

  1. The initial trigger (API call) causes the first service to perform its action and publish a domain event to a message broker
  2. The broker delivers the event to all subscribed services
  3. Each subscribed service performs its action and publishes its own event
  4. The chain continues until the workflow completes (no more services react to the final event, or a terminal event is published)
  5. On failure: the failing service publishes a failure event; other services that subscribed to failure events may perform compensating actions

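Infrastructure required: A reliable message broker (Kafka, RabbitMQ, AWS SQS/SNS, Google Pub/Sub) that provides durable event delivery, ordering where the workflow needs it, and ideally replay. These guarantees vary by broker and configuration, so confirm that the chosen broker actually offers the ones the workflow depends on.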
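
The event chain described above can be imitated in a single process with a toy broker. The InMemoryBroker class and the react helper are hypothetical stand-ins for a real broker client; they show only the publish/subscribe shape and provide none of the delivery, ordering, or durability guarantees of a real broker.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

class InMemoryBroker:
    """Toy single-process broker (hypothetical; illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self._queue: Deque[Tuple[str, dict]] = deque()
        self.history: List[str] = []  # event types in delivery order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self._queue.append((event_type, payload))

    def run(self) -> None:
        # Deliver queued events until no service reacts any further.
        while self._queue:
            event_type, payload = self._queue.popleft()
            self.history.append(event_type)
            for handler in self._subscribers[event_type]:
                handler(payload)

broker = InMemoryBroker()

def react(consumes: str, publishes: str) -> None:
    """Register a service that consumes one event type and publishes another."""
    def handler(payload: dict) -> None:
        # A real service would perform its own work here before publishing.
        broker.publish(publishes, payload)
    broker.subscribe(consumes, handler)

# Each service knows only the event it consumes and the event it publishes.
react("TicketCreated", "TicketValidated")    # Validation Service
react("TicketValidated", "ExpertAssigned")   # Assignment Service
react("ExpertAssigned", "ExpertNotified")    # Notification Service
react("ExpertNotified", "CustomerBilled")    # Billing Service

broker.publish("TicketCreated", {"workflowId": "4729", "ticketId": "T-001"})
broker.run()
```

No function in this sketch lists the workflow's steps in order; the sequence exists only as the set of subscriptions, which is the implicit-workflow property discussed in this section.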

Trade-offs of Choreography

Advantages:

Advantage | Explanation
Highly decoupled | No service knows about any other service, only about event types and its own behavior
Independent deployability | Any service can be replaced, updated, or scaled without the others knowing, as long as event contracts are preserved
No single point of failure | There is no orchestrator to fail; the workflow continues as long as the message broker and participant services are available
Easier to add consumers | A new service can be added to the workflow simply by subscribing to the relevant event; existing services do not change
Scales naturally | Each service scales independently based on its own load

Disadvantages:

Disadvantage | Explanation
Distributed workflow state | No single service has the full picture of where a workflow instance is at any given moment; this is the hard part
Difficult to debug | Tracing a specific workflow instance across multiple services and events requires distributed tracing tooling (Jaeger, Zipkin)
Error handling complexity | Compensation and rollback logic must be distributed across services; each service must know what to do if “its” event is a failure event
Implicit sequencing | The order of operations is implicit; you must trace event subscriptions across multiple services to understand the full workflow
Workflow changes require coordination | Changing the order of steps may require updating multiple services’ event subscriptions

Choreography Sequence Diagram

Client  Ticket Svc   Broker   Validation  Assignment  Notification   Billing
  |        |            |         |             |            |           |
  |--req-->|            |         |             |            |           |
  |        |--publish-->|         |             |            |           |
  |        |  (TicketCreated)     |             |            |           |
  |<--ack--|            |--event->|             |            |           |
  |        |            |  (TicketCreated)      |            |           |
  |        |            |         |--publish--->|            |           |
  |        |            |         |  (TicketValidated)       |           |
  |        |            |         |             |--publish-->|           |
  |        |            |         |             |  (ExpertAssigned)      |
  |        |            |         |             |            |--publish->|
  |        |            |         |             |            | (ExpertNotified)
  |        |            |         |             |            |           |
  |        |            |         |             |            |  (CustomerBilled)

Workflow State Management in Choreography

This is the “hard part” the book’s title promises. In orchestration, workflow state is trivial because the orchestrator holds it. In choreography, no single service has the full picture, so how can you answer “What is the status of workflow instance #4729?”

The chapter presents three main approaches to managing workflow state in choreography:


Approach 1: Front Controller Pattern

A lightweight “front controller” service receives the initial request and is responsible for:

  1. Persisting the initial workflow record (with a unique workflow ID)
  2. Publishing the first event to start the choreography chain
  3. Subscribing to the final completion/failure events to update the workflow record

The front controller is not a full orchestrator — it does not coordinate each step. It only captures the beginning and end of the workflow, plus the workflow ID that all events carry so they can be correlated.

+---------------------+
| Front Controller    |
| - Creates workflow  |  -----publishes TicketCreated (with workflowId)----->
| - Stores status     |
| - Subscribes to     |  <----receives CustomerBilled (with workflowId)------
|   terminal events   |
| - Updates status    |
+---------------------+

Workflow DB:
 workflowId | status    | created_at          | completed_at
 4729       | completed | 2026-05-30 09:00:00 | 2026-05-30 09:00:05

Trade-off: The front controller has partial visibility. It knows the workflow started and whether it ultimately completed or failed, but not which intermediate step the workflow is currently at. It is a light-touch solution appropriate when coarse-grained status tracking is sufficient.
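
A front controller along these lines might look like the sketch below. The FrontController class, its method names, and the in-memory dict standing in for the Workflow DB are assumptions made for illustration; the publish callable represents whatever broker client the system uses, and any terminal event other than CustomerBilled is treated as a failure.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

class FrontController:
    """Records only the start and the end of a choreographed workflow."""

    def __init__(self, publish: Callable[[str, dict], None]) -> None:
        self._publish = publish
        self.workflows: Dict[str, dict] = {}  # stands in for the Workflow DB

    def start(self, workflow_id: str, ticket: dict) -> None:
        # 1. Persist the initial workflow record.
        self.workflows[workflow_id] = {
            "status": "started",
            "created_at": datetime.now(timezone.utc),
            "completed_at": None,
        }
        # 2. Publish the first event, carrying the workflow ID for correlation.
        self._publish("TicketCreated", {"workflowId": workflow_id, **ticket})

    def on_terminal_event(self, event_type: str, payload: dict) -> None:
        # 3. Update the record when a completion or failure event arrives.
        record = self.workflows[payload["workflowId"]]
        record["status"] = "completed" if event_type == "CustomerBilled" else "failed"
        record["completed_at"] = datetime.now(timezone.utc)

# Illustrative run with a list standing in for the broker.
published: List[Tuple[str, dict]] = []
controller = FrontController(lambda event_type, payload: published.append((event_type, payload)))
controller.start("4729", {"ticketId": "T-001"})
status_in_flight = controller.workflows["4729"]["status"]
controller.on_terminal_event("CustomerBilled", {"workflowId": "4729"})
```

Between start and the terminal event the record says only "started", which is the partial visibility described in the trade-off above.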


Approach 2: Stamp Coupling for State

Each event in the choreography chain carries a data envelope (the “stamp”) that contains not just the payload for the current step but also accumulated state from previous steps. Each service adds its own contribution to the stamp before publishing the next event. The final event in the chain contains the complete workflow history.

TicketCreated event:
{
  "workflowId": "4729",
  "ticketId": "T-001",
  "customerId": "C-999",
  "description": "Laptop won't start"
}

TicketValidated event:
{
  "workflowId": "4729",
  "ticketId": "T-001",
  "customerId": "C-999",
  "description": "Laptop won't start",
  "validationResult": "ok",         <-- added by Validation Service
  "priority": "HIGH"                <-- added by Validation Service
}

ExpertAssigned event:
{
  "workflowId": "4729",
  "ticketId": "T-001",
  "customerId": "C-999",
  "description": "Laptop won't start",
  "validationResult": "ok",
  "priority": "HIGH",
  "expertId": "E-042",              <-- added by Assignment Service
  "scheduledTime": "2026-05-31"     <-- added by Assignment Service
}
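
The way each service extends the stamp can be sketched with a small helper. The stamp function is hypothetical; in this sketch every earlier field is carried forward unchanged, although a real service might prune fields that no later step needs.

```python
def stamp(event: dict, **contribution) -> dict:
    """Return a new event with all earlier fields plus this service's additions."""
    return {**event, **contribution}

ticket_created = {
    "workflowId": "4729",
    "ticketId": "T-001",
    "customerId": "C-999",
    "description": "Laptop won't start",
}

# Validation Service adds its result before publishing TicketValidated.
ticket_validated = stamp(ticket_created, validationResult="ok", priority="HIGH")

# Assignment Service adds its result before publishing ExpertAssigned.
expert_assigned = stamp(ticket_validated, expertId="E-042", scheduledTime="2026-05-31")
```

Building a new dict at each step leaves the incoming event untouched, which keeps each published event an immutable record of what was known at that point.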

Advantages:

  • No shared database needed for state — state travels with the events
  • Each service has full context it needs from previous steps
  • Replay is straightforward — the full state is in the event stream

Disadvantages:

  • Events grow larger as the workflow progresses — bandwidth and serialization cost
  • Services receive data they do not need (they only process their own step but see the full stamp)
  • Schema changes to early-stage fields ripple through all downstream events
  • Workflow state is only visible to a service when it processes an event — no way to query “current state” without processing all events

When to use: When the workflow data is small, the number of steps is bounded, and the team values simplicity over query capability.


Approach 3: Event Sourcing for State

A dedicated workflow state service (or event store) subscribes to all workflow events across all services. It maintains a complete record of every event that occurred for every workflow instance. Workflow state at any point can be reconstructed by replaying the event log.

All workflow events ---> +--------------------+
                         | Workflow State     |
                         | Service /          |
                         | Event Store        |
                         |                    |
                         | workflowId: 4729   |
                         | events:            |
                         |  - TicketCreated   |
                         |  - TicketValidated |
                         |  - ExpertAssigned  |
                         |  - ExpertNotified  |
                         |  (CustomerBilled   |
                         |   not yet seen)    |
                         +--------------------+

Query: "What is the state of workflow 4729?"
Answer: Replay events -> "Step 4 of 5 complete, waiting for billing"
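
The replay behind that answer can be sketched as follows. The WORKFLOW_STEPS list, the workflow_status function, and the event shape (a dict with "workflowId" and "type" fields) are assumptions for illustration; a real event store would also persist the log and handle schema versions.

```python
from typing import Dict, List, Optional

# Assumed happy-path order of events for the ticket workflow.
WORKFLOW_STEPS = [
    "TicketCreated",
    "TicketValidated",
    "ExpertAssigned",
    "ExpertNotified",
    "CustomerBilled",
]

def workflow_status(event_log: List[Dict[str, str]], workflow_id: str) -> dict:
    """Derive the current state of one workflow instance from the event log."""
    seen = {e["type"] for e in event_log if e["workflowId"] == workflow_id}
    completed = 0
    for step in WORKFLOW_STEPS:
        if step not in seen:
            break
        completed += 1
    waiting_for: Optional[str] = (
        WORKFLOW_STEPS[completed] if completed < len(WORKFLOW_STEPS) else None
    )
    return {"completed": completed, "total": len(WORKFLOW_STEPS), "waiting_for": waiting_for}

event_log = [
    {"workflowId": "4729", "type": "TicketCreated"},
    {"workflowId": "5000", "type": "TicketCreated"},
    {"workflowId": "4729", "type": "TicketValidated"},
    {"workflowId": "4729", "type": "ExpertAssigned"},
    {"workflowId": "4729", "type": "ExpertNotified"},
]
status = workflow_status(event_log, "4729")
# status == {"completed": 4, "total": 5, "waiting_for": "CustomerBilled"}
```

Collecting the event types into a set means duplicate deliveries and out-of-order arrival within the log do not change the result, at the cost of ignoring when each event occurred.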

Advantages:

  • Complete workflow state is always queryable
  • Full audit trail — every state transition is recorded
  • Can reconstruct state at any historical point
  • Natural fit for saga compensation — you can always see what succeeded and needs reversal

Disadvantages:

  • Significant infrastructure investment — event store, event schema governance
  • State reconstruction requires replaying events — can be slow for long-running workflows with many events
  • Eventual consistency in state queries — the state service may not have processed all events yet
  • Complex to implement correctly — event ordering, idempotency, and compaction are non-trivial
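
As one small piece of that complexity, idempotent handling of redelivered events might be sketched like this. The IdempotentProjection class and the "eventId" field are hypothetical; the sketch assumes every event carries a unique identifier assigned by its publisher, and it keeps the seen IDs in memory only.

```python
from typing import Dict, List, Set

class IdempotentProjection:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self) -> None:
        self._seen_ids: Set[str] = set()
        self.events_by_workflow: Dict[str, List[str]] = {}

    def apply(self, event: dict) -> bool:
        """Record the event; return False if it was already applied."""
        if event["eventId"] in self._seen_ids:
            return False  # duplicate delivery: ignore
        self._seen_ids.add(event["eventId"])
        self.events_by_workflow.setdefault(event["workflowId"], []).append(event["type"])
        return True

projection = IdempotentProjection()
validated = {"eventId": "evt-2", "workflowId": "4729", "type": "TicketValidated"}
first = projection.apply(validated)
second = projection.apply(validated)  # the broker redelivers the same event
```

A durable implementation would store the seen IDs together with the projected state so that both survive a restart.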

When to use: When auditability is required (financial, compliance, healthcare), when workflows are long-running, when compensating actions are complex, or when workflow observability is a high priority.


Head-to-Head: Orchestration vs. Choreography

Comprehensive Comparison Table

Dimension | Orchestration | Choreography
Workflow state visibility | HIGH: orchestrator holds all state | LOW to MEDIUM: distributed or reconstructed
Coupling between services | MEDIUM: participants coupled to orchestrator’s API | LOW: participants only coupled to event schemas
Independent deployability | MEDIUM: orchestrator must be updated for changes | HIGH: services deploy independently
Error handling | CENTRALIZED: all in orchestrator | DISTRIBUTED: each service handles its own failures
Observability / debuggability | HIGH: single point of monitoring | LOW: requires distributed tracing tools
Scalability bottleneck | POSSIBLE: orchestrator may bottleneck | NONE: no central coordinator
Single point of failure | YES: orchestrator failure stops all workflows | NO: there is no orchestrator to fail, though the message broker must stay available; partial failure is possible
Adding new workflow steps | SIMPLE: update orchestrator | MEDIUM: new service subscribes to existing events
Changing step order | SIMPLE: update orchestrator logic | COMPLEX: may require changing multiple services’ subscriptions
Testing | EASIER: mock participant services against orchestrator | HARDER: must trace event chains across services
Team autonomy | LOWER: participant teams must coordinate with orchestrator team | HIGHER: teams are more independent
Implicit vs. explicit workflow | EXPLICIT: readable in orchestrator code | IMPLICIT: workflow must be reconstructed from event subscriptions

When to Choose Orchestration

  • The workflow involves complex conditional branching (if X then Y else Z)
  • Error compensation is complex and involves multiple rollback paths
  • Business stakeholders need real-time workflow visibility and status reporting
  • The team is small and works on both the orchestrator and participant services
  • Strong consistency and deterministic sequencing are required
  • The workflow is long-running and must survive process restarts (durable orchestration)
  • You are implementing a saga with compensation steps (see ch12-transactional-sagas)

When to Choose Choreography

  • Services are owned by different teams who need strong deployment independence
  • The workflow is relatively simple and linear (few conditional branches)
  • Eventual consistency is acceptable for workflow state visibility
  • New participants will be added over time and must not require changes to existing services
  • High scalability is required and the orchestrator would become a bottleneck
  • You want to maximize the architectural benefit of the event-driven pattern already used elsewhere

Combining Orchestration and Choreography

Real-world systems rarely use one style exclusively. A common pattern is to use orchestration within a bounded context and choreography between bounded contexts:

Bounded Context: Ticket Management (internal — orchestration)
+----------------------------------------+
| Orchestrator                           |
|  -> Validation Service (call)          |
|  -> Assignment Service (call)          |
|  -> Notification Service (call)        |
+----------------------------------------+
    |
    publishes: TicketResolved (domain event)
    |
+---v------------------------------------+
| Bounded Context: Billing (external    |
| — choreography via event)             |
|  Billing Service subscribes to        |
|  TicketResolved and creates invoice   |
+----------------------------------------+

This hybrid approach gives:

  • The clarity and control of orchestration within a domain where one team controls all participants
  • The decoupling and independence of choreography across domain and team boundaries
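
The boundary between the two styles can be sketched in a few lines. Everything here is hypothetical: subscribe and publish form an in-process stand-in for a broker, and the step functions stand in for calls to the Validation, Assignment, and Notification services.

```python
from typing import Callable, Dict, List

# Hypothetical in-process event bus standing in for a message broker.
_subscribers: Dict[str, List[Callable[[dict], None]]] = {}

def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    _subscribers.setdefault(event_type, []).append(handler)

def publish(event_type: str, payload: dict) -> None:
    for handler in _subscribers.get(event_type, []):
        handler(payload)

# Inside the Ticket Management context: direct, ordered calls (orchestration).
def validate(ticket: dict) -> None:
    ticket["validated"] = True

def assign(ticket: dict) -> None:
    ticket["expertId"] = "E-042"

def notify(ticket: dict) -> None:
    ticket["notified"] = True

def ticket_orchestrator(ticket: dict) -> None:
    for step in (validate, assign, notify):
        step(ticket)
    # Crossing the context boundary: announce the outcome as a domain event.
    publish("TicketResolved", {"ticketId": ticket["ticketId"]})

# Billing context: reacts to the event without being called by the orchestrator.
invoices: List[str] = []
subscribe("TicketResolved", lambda event: invoices.append(event["ticketId"]))

ticket = {"ticketId": "T-001"}
ticket_orchestrator(ticket)
```

Adding another downstream reaction would mean one more subscribe call in that context's code, with no change to ticket_orchestrator.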

Workflow State Management: Summary Comparison

Approach | Visibility | Infrastructure Cost | Query Capability | Best For
Front Controller | START/END only | LOW | Coarse-grained status | Simple workflows where knowing done/failed is enough
Stamp Coupling | Per-event state | LOW (travels with events) | None (must process events) | Small payloads, bounded steps
Event Sourcing | COMPLETE history | HIGH | Full audit/replay | Compliance, complex sagas, long-running workflows

Sysops Squad Saga: Managing Ticket Workflows

The problem: The Sysops Squad ticket lifecycle involves multiple services:

  1. Customer submits ticket (Ticket service)
  2. Ticket is validated (Validation service)
  3. Expert is assigned (Assignment service)
  4. Expert is notified (Notification service)
  5. Expert resolves the ticket (Resolution service)
  6. Customer is billed (Billing service)
  7. Survey is sent (Survey service)

The question: Should this workflow be orchestrated (one central coordinator) or choreographed (services reacting to events)?

What the team considers:

The team maps out the forces acting on this decision:

Forces favoring orchestration:

  • Business operations team needs to see ticket status at any moment — “Is ticket #4729 at the assignment step or the billing step?”
  • There are complex error scenarios — if assignment fails (no available expert), there are retry rules and escalation paths that require conditional logic
  • SLAs require knowing exact elapsed time at each step

Forces favoring choreography:

  • The Billing, Survey, and Notification teams are separate and want deployment independence
  • The system processes thousands of tickets per hour — an orchestrator could bottleneck
  • The team wants to add a new “Quality Review” step later without touching existing services

Decision: The team adopts a hybrid approach:

  • Orchestration for the core operational workflow (Ticket → Validation → Assignment → Resolution): high visibility, complex error handling needed, single team owns these services
  • Choreography for post-resolution steps (Billing, Survey, Notification of completion): these teams are independent, the steps are simple reactions to the TicketResolved event, and adding new post-resolution steps in the future should not require changes to the core workflow

Workflow state management: For the choreography portion, the team uses the Front Controller Pattern: the Ticket service records that a TicketResolved event was published and subscribes to a terminal WorkflowCompleted event from the last downstream service. This gives coarse-grained status visibility without the overhead of full event sourcing.


Key Takeaways

  1. Distributed workflow management is a fundamental architectural problem: without a shared call stack, workflow state must be designed explicitly — it does not emerge automatically.
  2. Orchestration gives a central coordinator (the mediator) full control and full visibility of workflow state, but introduces a single point of failure and coupling between participants and the orchestrator.
  3. Choreography achieves maximum decoupling and independent deployability but distributes workflow state across services — the “hard part” is that no single service knows the overall status of an in-flight workflow instance.
  4. The question “Where is my order?” (or “What step is ticket #4729 at?”) is trivial in orchestration and genuinely hard in choreography — this is the central trade-off of the chapter.
  5. Three approaches to choreography state management: Front Controller (lightweight, coarse-grained), Stamp Coupling (state travels with events, no shared DB), and Event Sourcing (full audit trail, highest infrastructure cost).
  6. Stamp coupling is a deliberate design choice — passing more data than any single service needs so that downstream services have the context they need without calling back upstream.
  7. Choreography’s implicit workflow (workflow logic spread across event subscriptions) makes it harder to understand, test, and change than orchestration’s explicit workflow (logic concentrated in the orchestrator).
  8. Real-world architectures often combine both styles: orchestration within a bounded context (where one team controls all participants) and choreography between bounded contexts (where team independence is paramount).
  9. Distributed tracing tools (Jaeger, Zipkin, OpenTelemetry) are not optional in a choreography-based architecture — they are the primary mechanism for answering “what happened to workflow instance X?”
  10. The choice between orchestration and choreography should be driven primarily by: (a) how important real-time workflow state visibility is, (b) how complex error compensation is, and (c) whether participant services are owned by the same team or different teams.


Related Chapters

  • ch09-data-ownership-distributed-transactions — Data ownership principles that inform which service is authoritative for workflow state
  • ch10-distributed-data-access — Data access patterns used within workflows (services often need to read each other’s data during workflow execution)
  • ch12-transactional-sagas — The transactional implementation of orchestrated and choreographed workflows when distributed transaction correctness is required; sagas are a specific application of orchestration/choreography to distributed transactions
  • ch02-coupling — The coupling dimensions (runtime, semantic, behavioral) that orchestration and choreography affect differently
  • ch07-service-granularity — Service granularity decisions affect whether a workflow fits inside one service (and is thus not distributed) or must span multiple services

Last Updated: 2026-05-30