Chapter 11: Managing Distributed Workflows

saht distributed-workflows orchestration choreography workflow-state event-driven

Status: Notes complete

Overview

In a distributed system, business processes rarely live inside a single service. A “create order” workflow might touch an Order service, an Inventory service, a Payment service, a Notification service, and a Shipping service — each of which owns its own data and runs in its own process. The question Chapter 11 addresses is: how do you coordinate this multi-step workflow across services, and who is responsible for knowing where the workflow is at any given moment?

This is the distributed workflow problem, and it is more complex than it appears. In a monolith, the call stack is the workflow state: you know exactly what step you are on because you are inside a method call, and the runtime thread holds the state. In a distributed system, there is no shared thread, no shared call stack, and no shared memory. State must be managed explicitly — and every approach to managing it has different trade-offs.

The chapter presents two fundamental communication styles for distributed workflows:

Orchestration — a central coordinator tells each service what to do and when
Choreography — each service reacts to events from other services with no central coordinator

These are not merely implementation details. They are architectural decisions that determine how tightly services are coupled, how observable the workflow is, who owns workflow state, and how easy it is to add or change workflow steps without modifying other services.

The chapter also addresses the hardest part of choreography: how do you know the overall state of a workflow when no single service has the full picture?

The Sysops Squad Saga demonstrates how the team chooses between orchestration and choreography for the ticket lifecycle workflow.

Core Concepts

Workflow: A sequence of steps (involving one or more services) that together complete a business process. Examples: processing a customer order, onboarding a new employee, assigning and resolving a support ticket.

Mediator: In orchestration, the central component that coordinates workflow execution. It knows the sequence of steps, which service performs each step, and how to handle errors at each step.

Orchestration: A workflow communication style where a central orchestrator (mediator) explicitly directs each participant service: “Do step 1 now,” “Do step 2 now,” “Step 2 failed — compensate.” Participants respond only to explicit commands from the orchestrator.

Choreography: A workflow communication style where there is no central coordinator. Each service knows what to do when it receives a particular event. Services react to domain events published by other services. No single service has the full picture of the workflow — the workflow “emerges” from the interactions.

Workflow state: The current status of a particular workflow instance — which step has been completed, what the current step is, what the outcome was at each step, and whether the overall workflow succeeded or failed.

Front controller pattern: In choreography, a lightweight first service receives the initial request, persists workflow state, and publishes the first event — acting as the single entry point without becoming a full orchestrator.

Stamp coupling: Passing a data structure that contains more fields than any single service needs through a workflow, so that downstream services can add their own results to it. Used in choreography to thread workflow state through event messages.

Event sourcing: Reconstructing the current state of a workflow (or entity) by replaying all events that have been applied to it, rather than storing the current state directly.
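
The "workflow state" definition above can be made concrete as a small record. The following is a minimal Python sketch, not something taken from the chapter: the class name, field names, and status values are all illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class WorkflowState:
    """Status of one workflow instance (hypothetical shape)."""
    workflow_id: str
    steps: list[str]                                   # ordered step names
    outcomes: dict[str, StepStatus] = field(default_factory=dict)

    def record(self, step: str, status: StepStatus) -> None:
        """Store the outcome of a finished step."""
        self.outcomes[step] = status

    @property
    def current_step(self) -> str | None:
        """First step with no recorded outcome, or None when none remain."""
        for step in self.steps:
            if step not in self.outcomes:
                return step
        return None

    @property
    def overall(self) -> StepStatus:
        """FAILED if any step failed, SUCCEEDED if all are done, else PENDING."""
        if StepStatus.FAILED in self.outcomes.values():
            return StepStatus.FAILED
        if self.current_step is None:
            return StepStatus.SUCCEEDED
        return StepStatus.PENDING
```

Whichever communication style is chosen, something has to own a record like this; the rest of the chapter is largely about where it lives.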

Orchestration Communication Style

What It Is

A central orchestrator service (the mediator) explicitly controls the execution of a workflow. The orchestrator calls each participant service in sequence (or in parallel where the workflow allows), waits for responses, handles errors, applies retry logic, and decides how to proceed at each step.

Orchestration: "Ticket Processing Workflow"

  Client Request
       |
       v
+------+-------+         "Validate ticket"
| Orchestrator |  -------> Validation Service
|  (Mediator)  |  <------- OK
|              |
|              |         "Assign expert"
|              |  -------> Assignment Service
|              |  <------- Expert assigned
|              |
|              |         "Notify expert"
|              |  -------> Notification Service
|              |  <------- Notification sent
|              |
|              |         "Bill customer"
|              |  -------> Billing Service
|              |  <------- Invoice created
|              |
|  Workflow    |
|  state lives |
|  here        |
+------+-------+
       |
       v
  Response to Client

How Orchestration Works in Practice

The orchestrator typically:

Receives the initial trigger (API call, event, schedule)
Persists the initial workflow state
Calls Service 1, waits for response
Updates workflow state with Step 1 result
Calls Service 2 (possibly with data from Step 1’s response)
Continues until all steps complete or a step fails
On failure: executes compensation logic (calls rollback operations on previously successful steps)
On success: reports workflow completion to the caller or publishes a completion event
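
The steps above can be sketched as a small in-process orchestrator. This is an illustration under stated assumptions rather than a production design: `Step`, `run_workflow`, and the state keys are hypothetical names, the state dict stands in for a persisted workflow record, and each action stands in for a remote call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]       # stands in for a call to a participant
    compensate: Callable[[dict], None]   # undoes that call if a later step fails


def run_workflow(steps: list, state: dict) -> dict:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    state["status"] = "running"          # a real orchestrator would persist this
    for step in steps:
        try:
            step.action(state)           # call the service and wait for it
        except Exception as exc:
            state.update(status="failed", failed_step=step.name, error=str(exc))
            for done in reversed(completed):
                done.compensate(state)   # roll back previously successful steps
            return state
        completed.append(step)
        state.setdefault("history", []).append(step.name)
    state["status"] = "completed"
    return state
```

Because the state dict is updated after every step, the orchestrator can answer "where is this workflow?" at any time, which is the property the trade-off tables below keep returning to.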

Common implementation patterns:

A dedicated orchestrator service (most common in microservices)
A workflow engine or BPM tool (Camunda, Temporal, AWS Step Functions) that provides durability, retry, and state management
A saga orchestrator specifically managing distributed transactions (see ch12-transactional-sagas)

Trade-offs of Orchestration

Advantages:

Advantage	Explanation
Centralized workflow state	The orchestrator always knows the current state of every active workflow instance — easy to query, monitor, and debug
Error handling clarity	Compensation logic, retries, and failure handling are all in one place — the orchestrator — rather than distributed across services
Easy to add steps	Adding a new step means updating the orchestrator; participant services are unaware of the change
Observability	The orchestrator is a single point for workflow telemetry — dashboards, alerts, and SLAs can be monitored from one service
Explicit sequencing	The order of operations is explicit and readable in the orchestrator’s code — no need to trace event flows across multiple services

Disadvantages:

Disadvantage	Explanation
Single point of failure	If the orchestrator is unavailable, all workflows in progress stall — no other service can advance the workflow
Coupling to orchestrator	Participant services may need to implement specific APIs or response formats the orchestrator expects — behavioral coupling to the orchestrator
God service risk	Over time, business logic can creep into the orchestrator, turning it into a service with too much responsibility — “workflow logic” and “business logic” blur
Scalability bottleneck	All workflow traffic flows through the orchestrator — it must scale with the number of concurrent workflow instances
Orchestrator becomes a dependency	Every service that participates in any workflow must be available and compatible with the orchestrator’s protocol

Orchestration Sequence Diagram

Client   Orchestrator   Validation   Assignment   Notification   Billing
  |           |              |             |             |           |
  |--request->|              |             |             |           |
  |           |--validate--->|             |             |           |
  |           |<--ok---------|             |             |           |
  |           |--assign------------------->|             |           |
  |           |<--expert_id----------------|             |           |
  |           |--notify--------------------------------->|           |
  |           |<--sent-----------------------------------|           |
  |           |--bill----------------------------------------------->|
  |           |<--invoice_id-----------------------------------------|
  |<--done----|

Choreography Communication Style

What It Is

In choreography, there is no central orchestrator. Instead, each service in the workflow knows what to do when it receives a particular event. Services publish events when they complete their work; other services subscribe to those events and react accordingly. The workflow “emerges” from the choreographed interactions between services.

Choreography: "Ticket Processing Workflow"

  Client Request
       |
       v
+------+-------+
| Ticket       |  publishes: TicketCreated
| Service      |
+--------------+
                        |
                        v
                +-------+--------+
                | Validation     | consumes: TicketCreated
                | Service        | publishes: TicketValidated
                +-------+--------+
                        |
                        v
                +-------+--------+
                | Assignment     | consumes: TicketValidated
                | Service        | publishes: ExpertAssigned
                +-------+--------+
                        |
                        v
                +-------+--------+
                | Notification   | consumes: ExpertAssigned
                | Service        | publishes: ExpertNotified
                +-------+--------+
                        |
                        v
                +-------+--------+
                | Billing        | consumes: ExpertNotified
                | Service        | publishes: CustomerBilled
                +-------+--------+

  No single service knows the full workflow state!

How Choreography Works in Practice

The initial trigger (API call) causes the first service to perform its action and publish a domain event to a message broker
The broker delivers the event to all subscribed services
Each subscribed service performs its action and publishes its own event
The chain continues until the workflow completes (no more services react to the final event, or a terminal event is published)
On failure: the failing service publishes a failure event; other services that subscribed to failure events may perform compensating actions
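
The chain above can be sketched with a toy in-memory broker. This is a sketch under assumptions, not how a real broker behaves: `InMemoryBroker` and `wire_ticket_workflow` are hypothetical names, and the `expertsAvailable` flag and the `AssignmentFailed` and `TicketCancelled` events are invented here to illustrate the failure path.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a real broker: queues events and delivers them in order."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._queue = deque()
        self.delivered = []              # event types, in delivery order

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, event):
        self._queue.append((event_type, event))

    def run(self):
        while self._queue:
            event_type, event = self._queue.popleft()
            self.delivered.append(event_type)
            for handler in self._handlers[event_type]:
                handler(event)


def wire_ticket_workflow(broker):
    """Each service knows only the event it consumes and the event it publishes."""

    def validation_service(event):
        broker.publish("TicketValidated", event)

    def assignment_service(event):
        if event.get("expertsAvailable", True):
            broker.publish("ExpertAssigned", {**event, "expertId": "E-042"})
        else:
            broker.publish("AssignmentFailed", event)    # failure event

    def notification_service(event):
        broker.publish("ExpertNotified", event)

    def billing_service(event):
        broker.publish("CustomerBilled", event)

    def ticket_service_on_failure(event):
        broker.publish("TicketCancelled", event)         # compensating action

    broker.subscribe("TicketCreated", validation_service)
    broker.subscribe("TicketValidated", assignment_service)
    broker.subscribe("ExpertAssigned", notification_service)
    broker.subscribe("ExpertNotified", billing_service)
    broker.subscribe("AssignmentFailed", ticket_service_on_failure)
```

No function in this sketch holds the sequence of steps. The order exists only in the subscription table at the bottom, which is what "implicit sequencing" means in the trade-offs that follow.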

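Infrastructure required: A reliable message broker (Kafka, RabbitMQ, AWS SQS/SNS, Google Pub/Sub) that provides durable event delivery and, where the workflow needs them, ordering and replay. These guarantees vary by broker and configuration, so ordering and replay in particular should be verified for the chosen broker rather than assumed.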

Trade-offs of Choreography

Advantages:

Advantage	Explanation
Highly decoupled	No service knows about any other service — only about event types and its own behavior
Independent deployability	Any service can be replaced, updated, or scaled without the others knowing, as long as event contracts are preserved
No single point of failure	There is no orchestrator to fail; the workflow continues as long as the message broker and participant services are available
Easier to add consumers	A new service can be added to the workflow simply by subscribing to the relevant event — existing services do not change
Scales naturally	Each service scales independently based on its own load

Disadvantages:

Disadvantage	Explanation
Distributed workflow state	No single service has the full picture of where a workflow instance is at any given moment — the hard part
Difficult to debug	Tracing a specific workflow instance across multiple services and events requires distributed tracing tooling (Jaeger, Zipkin)
Error handling complexity	Compensation and rollback logic must be distributed across services — each service must know what to do if “its” event is a failure event
Implicit sequencing	The order of operations is implicit — you must trace event subscriptions across multiple services to understand the full workflow
Workflow changes require coordination	Changing the order of steps may require updating multiple services’ event subscriptions

Choreography Sequence Diagram

Client  Ticket Svc   Broker   Validation  Assignment  Notification   Billing
  |        |            |         |             |            |           |
  |--req-->|            |         |             |            |           |
  |        |--publish-->|         |             |            |           |
  |        |  (TicketCreated)     |             |            |           |
  |<--ack--|            |--event->|             |            |           |
  |        |            |  (TicketCreated)      |            |           |
  |        |            |         |--publish--->|            |           |
  |        |            |         |  (TicketValidated)       |           |
  |        |            |         |             |--publish-->|           |
  |        |            |         |             |  (ExpertAssigned)      |
  |        |            |         |             |            |--publish->|
  |        |            |         |             |            | (ExpertNotified)
  |        |            |         |             |            |           |
  |        |            |         |             |            |  (CustomerBilled)

Workflow State Management in Choreography

This is the “hard part” that the book's title promises. In orchestration, workflow state is trivial: the orchestrator holds it. In choreography, no single service has the full picture. How can you answer “What is the status of workflow instance #4729?”

The chapter presents three main approaches to managing workflow state in choreography:

Approach 1: Front Controller Pattern

A lightweight “front controller” service receives the initial request and is responsible for:

Persisting the initial workflow record (with a unique workflow ID)
Publishing the first event to start the choreography chain
Subscribing to the final completion/failure events to update the workflow record

The front controller is not a full orchestrator — it does not coordinate each step. It only captures the beginning and end of the workflow, plus the workflow ID that all events carry so they can be correlated.

+---------------------+
| Front Controller    |
| - Creates workflow  |  -----publishes TicketCreated (with workflowId)----->
| - Stores status     |
| - Subscribes to     |  <----receives CustomerBilled (with workflowId)------
|   terminal events   |
| - Updates status    |
+---------------------+

Workflow DB:
 workflowId | status    | created_at          | completed_at
 4729       | completed | 2026-05-30 09:00:00 | 2026-05-30 09:00:05

Trade-off: The front controller has only partial visibility. It knows that the workflow started and whether it ultimately completed or failed, but not which intermediate step the workflow is currently at. It is a light-touch solution appropriate when coarse-grained status tracking is sufficient.
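
A front controller along these lines might look like the following minimal sketch. The class and method names are hypothetical, the `publish` callable stands in for a broker client, and `WorkflowFailed` is an assumed name for a terminal failure event.

```python
import uuid
from datetime import datetime, timezone


class FrontController:
    """Records the start and end of a workflow; does not coordinate the steps."""

    def __init__(self, publish):
        self._publish = publish          # callable(event_type, event)
        self.workflows = {}              # stands in for the workflow DB

    def start(self, ticket: dict) -> str:
        workflow_id = str(uuid.uuid4())
        self.workflows[workflow_id] = {
            "status": "started",
            "created_at": datetime.now(timezone.utc),
            "completed_at": None,
        }
        # Every event carries the workflow ID so later events can be correlated.
        self._publish("TicketCreated", {**ticket, "workflowId": workflow_id})
        return workflow_id

    def on_terminal_event(self, event_type: str, event: dict) -> None:
        """Handler for the final success or failure event of the chain."""
        record = self.workflows[event["workflowId"]]
        record["status"] = "completed" if event_type == "CustomerBilled" else "failed"
        record["completed_at"] = datetime.now(timezone.utc)
```

Between `start` and `on_terminal_event` the record stays at "started", which is exactly the partial visibility described above.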

Approach 2: Stamp Coupling for State

Each event in the choreography chain carries a data envelope (the “stamp”) that contains not just the payload for the current step but also accumulated state from previous steps. Each service adds its own contribution to the stamp before publishing the next event. The final event in the chain contains the complete workflow history.

TicketCreated event:
{
  "workflowId": "4729",
  "ticketId": "T-001",
  "customerId": "C-999",
  "description": "Laptop won't start"
}

TicketValidated event:
{
  "workflowId": "4729",
  "ticketId": "T-001",
  "customerId": "C-999",
  "description": "Laptop won't start",
  "validationResult": "ok",         <-- added by Validation Service
  "priority": "HIGH"                <-- added by Validation Service
}

ExpertAssigned event:
{
  "workflowId": "4729",
  "ticketId": "T-001",
  "customerId": "C-999",
  "description": "Laptop won't start",
  "validationResult": "ok",
  "priority": "HIGH",
  "expertId": "E-042",              <-- added by Assignment Service
  "scheduledTime": "2026-05-31"     <-- added by Assignment Service
}
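
The way each service extends the stamp can be sketched as a single helper. This is an illustrative assumption about how a service might build its outgoing event: `add_to_stamp` is a hypothetical name, and refusing to overwrite upstream fields is a design choice made for this sketch, not a rule from the chapter.

```python
def add_to_stamp(event: dict, **contribution) -> dict:
    """Return a new event carrying all upstream fields plus this service's own."""
    overlap = event.keys() & contribution.keys()
    if overlap:
        # A service adds its results; it should not rewrite earlier steps.
        raise ValueError(f"would overwrite upstream fields: {sorted(overlap)}")
    return {**event, **contribution}


ticket_created = {
    "workflowId": "4729",
    "ticketId": "T-001",
    "customerId": "C-999",
    "description": "Laptop won't start",
}
# Validation Service adds its result and publishes the next event.
ticket_validated = add_to_stamp(ticket_created, validationResult="ok", priority="HIGH")
# Assignment Service does the same with its own fields.
expert_assigned = add_to_stamp(
    ticket_validated, expertId="E-042", scheduledTime="2026-05-31"
)
```

Returning a new dict rather than mutating the incoming event keeps each published event an independent snapshot of the workflow up to that step.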

Advantages:

No shared database needed for state — state travels with the events
Each service has full context it needs from previous steps
Replay is straightforward — the full state is in the event stream

Disadvantages:

Events grow larger as the workflow progresses — bandwidth and serialization cost
Services receive data they do not need (they only process their own step but see the full stamp)
Schema changes to early-stage fields ripple through all downstream events
Workflow state is only visible to a service when it processes an event — no way to query “current state” without processing all events

When to use: When the workflow data is small, the number of steps is bounded, and the team values simplicity over query capability.

Approach 3: Event Sourcing for State

A dedicated workflow state service (or event store) subscribes to all workflow events across all services. It maintains a complete record of every event that occurred for every workflow instance. Workflow state at any point can be reconstructed by replaying the event log.

All workflow events ---> +--------------------+
                         | Workflow State     |
                         | Service /          |
                         | Event Store        |
                         |                    |
                         | workflowId: 4729   |
                         | events:            |
                         |  - TicketCreated   |
                         |  - TicketValidated |
                         |  - ExpertAssigned  |
                         |  - ExpertNotified  |
                         |  (CustomerBilled   |
                         |   not yet seen)    |
                         +--------------------+

Query: "What is the state of workflow 4729?"
Answer: Replay events -> "Step 4 of 5 complete, waiting for billing"
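
The replay in the query above can be sketched as a fold over the event log. This is a minimal sketch assuming each stored event is a dict with `type` and `workflowId` keys; `STEP_EVENTS`, `apply_event`, and `current_state` are hypothetical names.

```python
STEP_EVENTS = [
    "TicketCreated",
    "TicketValidated",
    "ExpertAssigned",
    "ExpertNotified",
    "CustomerBilled",
]


def apply_event(state: dict, event: dict) -> dict:
    """Pure transition: fold one event into the derived state."""
    if event["type"] in STEP_EVENTS and event["type"] not in state["completed"]:
        return {**state, "completed": state["completed"] + [event["type"]]}
    return state


def current_state(event_log: list, workflow_id: str) -> dict:
    """Rebuild the state of one workflow by replaying its events in log order."""
    state = {"workflowId": workflow_id, "completed": []}
    for event in event_log:
        if event["workflowId"] == workflow_id:
            state = apply_event(state, event)
    remaining = [s for s in STEP_EVENTS if s not in state["completed"]]
    state["waiting_for"] = remaining[0] if remaining else None
    return state
```

Nothing here stores "current step" anywhere. It is recomputed from the log on each query, which is why replay cost grows with the number of events.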

Advantages:

Complete workflow state is always queryable
Full audit trail — every state transition is recorded
Can reconstruct state at any historical point
Natural fit for saga compensation — you can always see what succeeded and needs reversal

Disadvantages:

Significant infrastructure investment — event store, event schema governance
State reconstruction requires replaying events — can be slow for long-running workflows with many events
Eventual consistency in state queries — the state service may not have processed all events yet
Complex to implement correctly — event ordering, idempotency, and compaction are non-trivial
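
Two of the difficulties named above, idempotency and ordering, can be illustrated with a small in-memory store. This is a sketch under assumptions: it presumes every event carries a unique `eventId` and a per-workflow `sequence` number, neither of which appears in the chapter's examples, and `WorkflowEventStore` is a hypothetical name.

```python
class WorkflowEventStore:
    """In-memory event store that tolerates duplicate and out-of-order delivery."""

    def __init__(self):
        self._seen_ids = set()
        self._log = {}                   # workflowId -> list of events

    def append(self, event: dict) -> bool:
        """Store the event once; return False for a redelivered duplicate."""
        if event["eventId"] in self._seen_ids:
            return False
        self._seen_ids.add(event["eventId"])
        self._log.setdefault(event["workflowId"], []).append(event)
        return True

    def history(self, workflow_id: str) -> list:
        """Events ordered by sequence number rather than by arrival order."""
        return sorted(self._log.get(workflow_id, []), key=lambda e: e["sequence"])
```

Compaction (snapshotting state so that old events need not be replayed) is left out of the sketch.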

When to use: When auditability is required (financial, compliance, healthcare), when workflows are long-running, when compensating actions are complex, or when workflow observability is a high priority.

Head-to-Head: Orchestration vs. Choreography

Comprehensive Comparison Table

Dimension	Orchestration	Choreography
Workflow state visibility	HIGH — orchestrator holds all state	LOW to MEDIUM — distributed or reconstructed
Coupling between services	MEDIUM — participants coupled to orchestrator’s API	LOW — participants only coupled to event schemas
Independent deployability	MEDIUM — orchestrator must be updated for changes	HIGH — services deploy independently
Error handling	CENTRALIZED — all in orchestrator	DISTRIBUTED — each service handles its own failures
Observability / debuggability	HIGH — single point of monitoring	LOW — requires distributed tracing tools
Scalability bottleneck	POSSIBLE — orchestrator may bottleneck	NONE — no central coordinator
Single point of failure	YES — orchestrator failure stops all workflows	NO — broker failure is contained; partial failure is possible
Adding new workflow steps	SIMPLE — update orchestrator	MEDIUM — new service subscribes to existing events
Changing step order	SIMPLE — update orchestrator logic	COMPLEX — may require changing multiple services’ subscriptions
Testing	EASIER — mock participant services against orchestrator	HARDER — must trace event chains across services
Team autonomy	LOWER — participant teams must coordinate with orchestrator team	HIGHER — teams are more independent
Implicit vs. explicit workflow	EXPLICIT — readable in orchestrator code	IMPLICIT — workflow must be reconstructed from event subscriptions

When to Choose Orchestration

The workflow involves complex conditional branching (if X then Y else Z)
Error compensation is complex and involves multiple rollback paths
Business stakeholders need real-time workflow visibility and status reporting
The team is small and works on both the orchestrator and participant services
Strong consistency and deterministic sequencing are required
The workflow is long-running and must survive process restarts (durable orchestration)
You are implementing a saga with compensation steps (see ch12-transactional-sagas)

When to Choose Choreography

Services are owned by different teams who need strong deployment independence
The workflow is relatively simple and linear (few conditional branches)
Eventual consistency is acceptable for workflow state visibility
New participants will be added over time and must not require changes to existing services
High scalability is required and the orchestrator would become a bottleneck
You want to maximize the architectural benefit of the event-driven pattern already used elsewhere

Combining Orchestration and Choreography

Real-world systems rarely use one style exclusively. A common pattern is to use orchestration within a bounded context and choreography between bounded contexts:

Bounded Context: Ticket Management (internal — orchestration)
+----------------------------------------+
| Orchestrator                           |
|  -> Validation Service (call)          |
|  -> Assignment Service (call)          |
|  -> Notification Service (call)        |
+----------------------------------------+
    |
    publishes: TicketResolved (domain event)
    |
+---v------------------------------------+
| Bounded Context: Billing (external,    |
| choreography via event)                |
|  Billing Service subscribes to         |
|  TicketResolved and creates invoice    |
+----------------------------------------+

This hybrid approach gives:

The clarity and control of orchestration within a domain where one team controls all participants
The decoupling and independence of choreography across domain and team boundaries
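
The boundary between the two styles can be sketched in a few lines. This is an illustration only: `EventBus` and `resolve_ticket` are hypothetical names, the bus delivers synchronously in process, and the steps are plain functions standing in for service calls.

```python
class EventBus:
    """Minimal synchronous publish/subscribe, standing in for a broker."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, event):
        for handler in self._handlers.get(event_type, []):
            handler(event)


def resolve_ticket(ticket: dict, steps: list, bus: EventBus) -> dict:
    """Orchestrated inside Ticket Management: direct calls in a fixed order."""
    for step in steps:
        ticket = step(ticket)
    # Choreography starts here: the orchestrator does not know who listens.
    bus.publish("TicketResolved", ticket)
    return ticket


# The Billing context only subscribes; the orchestrator never calls it.
invoices = []
bus = EventBus()
bus.subscribe("TicketResolved", lambda t: invoices.append({"ticketId": t["ticketId"]}))

resolved = resolve_ticket(
    {"ticketId": "T-001"},
    [lambda t: {**t, "valid": True}, lambda t: {**t, "expertId": "E-042"}],
    bus,
)
```

Adding another post-resolution consumer means one more `subscribe` call and no change to `resolve_ticket`.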

Workflow State Management: Summary Comparison

Approach	Visibility	Infrastructure Cost	Query Capability	Best For
Front Controller	START/END only	LOW	Coarse-grained status	Simple workflows where knowing done/failed is enough
Stamp Coupling	Per-event state	LOW (travels with events)	None (must process events)	Small payloads, bounded steps
Event Sourcing	COMPLETE history	HIGH	Full audit/replay	Compliance, complex sagas, long-running

Sysops Squad Saga: Managing Ticket Workflows

The problem: The Sysops Squad ticket lifecycle involves multiple services:

Customer submits ticket (Ticket service)
Ticket is validated (Validation service)
Expert is assigned (Assignment service)
Expert is notified (Notification service)
Expert resolves the ticket (Resolution service)
Customer is billed (Billing service)
Survey is sent (Survey service)

The question: Should this workflow be orchestrated (one central coordinator) or choreographed (services reacting to events)?

What the team considers:

The team maps out the forces acting on this decision:

Forces favoring orchestration:

Business operations team needs to see ticket status at any moment — “Is ticket #4729 at the assignment step or the billing step?”
There are complex error scenarios — if assignment fails (no available expert), there are retry rules and escalation paths that require conditional logic
SLAs require knowing exact elapsed time at each step

Forces favoring choreography:

The Billing, Survey, and Notification teams are separate and want deployment independence
The system processes thousands of tickets per hour — an orchestrator could bottleneck
The team wants to add a new “Quality Review” step later without touching existing services

Decision: The team adopts a hybrid approach:

Orchestration for the core operational workflow (Ticket → Validation → Assignment → Resolution): high visibility, complex error handling needed, single team owns these services
Choreography for post-resolution steps (Billing, Survey, Notification of completion): these teams are independent, the steps are simple reactions to the TicketResolved event, and adding new post-resolution steps in the future should not require changes to the core workflow

Workflow state management: For the choreography portion, the team uses the Front Controller Pattern: the Ticket service records that a TicketResolved event was published and subscribes to a terminal WorkflowCompleted event from the last downstream service. This gives coarse-grained status visibility without the overhead of full event sourcing.

Key Takeaways

Distributed workflow management is a fundamental architectural problem: without a shared call stack, workflow state must be designed explicitly — it does not emerge automatically.
Orchestration gives a central coordinator (the mediator) full control and full visibility of workflow state, but introduces a single point of failure and coupling between participants and the orchestrator.
Choreography achieves maximum decoupling and independent deployability but distributes workflow state across services — the “hard part” is that no single service knows the overall status of an in-flight workflow instance.
The question “Where is my order?” (or “What step is ticket #4729 at?”) is trivial in orchestration and genuinely hard in choreography — this is the central trade-off of the chapter.
Three approaches to choreography state management: Front Controller (lightweight, coarse-grained), Stamp Coupling (state travels with events, no shared DB), and Event Sourcing (full audit trail, highest infrastructure cost).
Stamp coupling is a deliberate design choice — passing more data than any single service needs so that downstream services have the context they need without calling back upstream.
Choreography’s implicit workflow (workflow logic spread across event subscriptions) makes it harder to understand, test, and change than orchestration’s explicit workflow (logic concentrated in the orchestrator).
Real-world architectures often combine both styles: orchestration within a bounded context (where one team controls all participants) and choreography between bounded contexts (where team independence is paramount).
Distributed tracing tools (Jaeger, Zipkin, OpenTelemetry) are not optional in a choreography-based architecture — they are the primary mechanism for answering “what happened to workflow instance X?”
The choice between orchestration and choreography should be driven primarily by: (a) how important real-time workflow state visibility is, (b) how complex error compensation is, and (c) whether participant services are owned by the same team or different teams.

Related Resources

ch09-data-ownership-distributed-transactions — Data ownership principles that inform which service is authoritative for workflow state
ch10-distributed-data-access — Data access patterns used within workflows (services often need to read each other’s data during workflow execution)
ch12-transactional-sagas — The transactional implementation of orchestrated and choreographed workflows when distributed transaction correctness is required; sagas are a specific application of orchestration/choreography to distributed transactions
ch02-coupling — The coupling dimensions (runtime, semantic, behavioral) that orchestration and choreography affect differently
ch07-service-granularity — Service granularity decisions affect whether a workflow fits inside one service (and is thus not distributed) or must span multiple services

Last Updated: 2026-05-30

Study Notes by Niladri & AI
