Module 06: Multi-Agent Systems
Multi-agent systems are the architecture you reach for when a single agent is no longer sufficient. This module covers when to decompose tasks, how to coordinate agents, the protocols for inter-agent communication, failure handling, and how the major frameworks compare.
1. Why Multi-Agent Systems
Single Agent Limitations
A single LLM agent hits practical walls in several dimensions:
Context window limits. Even with 200K token windows, complex tasks generate more intermediate state than can fit in one context. A research task that involves reading 20 documents, running code, checking APIs, and synthesizing results will overflow any practical context budget if handled in a single agent loop.
Specialization. A generalist agent tends to be mediocre at specialized work. A coding-specialized agent (configured with code-specific system prompts, appropriate tool sets, and a narrow scope) outperforms a generalist on code tasks. Multi-agent systems let you route each subtask to the agent best suited for it.
Speed. A single agent is serial: it does step 1, then step 2, then step 3. If steps 2 and 3 are independent, a multi-agent system can run them in parallel — reducing wall-clock time dramatically.
Reliability. If a single agent’s context becomes corrupted (a tool returned garbage, a loop went off the rails), the entire task fails. Multi-agent systems can retry at the subagent level without restarting the whole pipeline.
When to Decompose
Add agents when you have:
- Tasks larger than a context window — research tasks, large codebase changes, multi-document synthesis.
- Parallelizable subtasks — summarize 10 articles simultaneously, check 5 APIs concurrently.
- Subtasks requiring specialization — different system prompts, different tool sets, different models (use GPT-4 for reasoning, Haiku for classification).
- Long-running workflows — if a task takes 30+ minutes, checkpoint-ability via separate agents is valuable.
Coordination Overhead Is Real
Adding agents adds complexity. Every agent boundary introduces:
- Latency: one more API round-trip
- Cost: one more set of input tokens (the task description, context, tools)
- Failure surface: one more thing that can error out
- Debugging complexity: understanding failures across multiple agents is harder than within one
Do not decompose a task that a single agent can handle cleanly. The default should be: use one agent. Add a second agent only when there is a clear, concrete reason.
2. Orchestrator–Subagent Pattern
The orchestrator–subagent pattern is the most common multi-agent architecture. It has a clear division of responsibility and is easy to reason about.
Roles
Orchestrator:
- Receives the high-level task
- Breaks it into scoped subtasks
- Delegates each subtask to an appropriate subagent
- Collects and validates results
- Assembles the final output
- Owns the overall state of the task
Subagent:
- Receives a single, well-defined task from the orchestrator
- Has no knowledge of the larger task or other subagents
- Returns a structured result in the expected format
- May use its own tools and context, but does not call other agents (unless you have nested delegation)
Communication Protocol
The orchestrator-to-subagent message should include:
- Task description: what to do, stated precisely and completely
- Context: exactly the information the subagent needs (no more, no less)
- Output format: the exact structure of the expected result
- Constraints: time budget, length limits, tools allowed
The subagent-to-orchestrator result should include:
- Result: the actual output
- Status: success / partial / failed
- Confidence (optional but useful): how certain the subagent is
- Errors (if any): what went wrong
- Metadata: token usage, time taken, etc.

    # Orchestrator delegates a research subtask
    subagent_task = {
        "task": "Summarize the key arguments in the provided text",
        "context": "PAPER_TEXT_HERE",
        "output_format": {
            "summary": "string, 3-5 sentences",
            "key_claims": "list of 3-5 strings",
            "methodology": "string, 1-2 sentences or null"
        },
        "constraints": {
            "max_tokens": 400,
            "language": "English"
        }
    }

    # Subagent returns a structured result
    subagent_result = {
        "status": "success",
        "result": {
            "summary": "...",
            "key_claims": ["...", "...", "..."],
            "methodology": "..."
        },
        "metadata": {
            "tokens_used": 312,
            "model": "claude-haiku-4-5-20251001"
        }
    }

State Management
The orchestrator owns all shared state. Subagents are stateless workers. The orchestrator:
- Tracks which subtasks are complete, in-flight, or failed
- Stores intermediate results until assembly
- Decides what to do when a subagent fails
- Has the final say on task completion
This “orchestrator owns state, subagents are stateless” separation is critical. If subagents start sharing state with each other directly, you’ve created a distributed system with all its attendant consistency problems.
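The bookkeeping described above can be sketched as a small in-memory tracker. This is a minimal illustration, not a prescribed design: the names `SubtaskState` and `OrchestratorState` are hypothetical, and a production orchestrator would likely persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field
from enum import Enum


class SubtaskState(Enum):
    PENDING = "pending"
    IN_FLIGHT = "in_flight"
    COMPLETE = "complete"
    FAILED = "failed"


@dataclass
class OrchestratorState:
    """Hypothetical container for everything the orchestrator tracks."""
    states: dict[str, SubtaskState] = field(default_factory=dict)
    results: dict[str, dict] = field(default_factory=dict)

    def add(self, task_id: str) -> None:
        self.states[task_id] = SubtaskState.PENDING

    def start(self, task_id: str) -> None:
        self.states[task_id] = SubtaskState.IN_FLIGHT

    def complete(self, task_id: str, result: dict) -> None:
        # Intermediate results live here until assembly
        self.states[task_id] = SubtaskState.COMPLETE
        self.results[task_id] = result

    def fail(self, task_id: str) -> None:
        self.states[task_id] = SubtaskState.FAILED

    def is_done(self) -> bool:
        """True once no subtask is pending or in flight."""
        finished = (SubtaskState.COMPLETE, SubtaskState.FAILED)
        return all(s in finished for s in self.states.values())
```

Subagents never see this object; they only receive a task message and return a result, which the orchestrator records through `complete` or `fail`.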
Full Example Walkthrough
Task: Research question — “What are the current limitations of transformer-based language models?”
Step 1 — Orchestrator decomposes:
Subtask A: Summarize limitations from a given paper abstract [assigned to Research Agent 1]
Subtask B: Identify limitations mentioned in a given blog post [assigned to Research Agent 2]
Subtask C: List technical limitations from a given technical blog [assigned to Research Agent 3]
Step 2 — Subagents execute in parallel:
- Agent 1 returns: {claims: ["quadratic attention complexity", "fixed context length", ...]}
- Agent 2 returns: {claims: ["hallucination", "world knowledge cutoff", ...]}
- Agent 3 returns: {claims: ["inference cost", "energy consumption", ...]}
Step 3 — Orchestrator synthesizes:
- Collects all three result sets
- Deduplicates overlapping claims
- Structures the final answer with citations to each source
- Returns the assembled result
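The synthesis step can be sketched as a merge that deduplicates claims while remembering which agent reported each one. This is only one possible approach: the function name `merge_claims` is hypothetical, the sample data is made up for illustration, and matching on normalized text will miss paraphrased duplicates (a real system might use an LLM or embeddings for that).

```python
def merge_claims(results: dict[str, list[str]]) -> list[dict]:
    """Deduplicate claims across sources, keeping track of who made each one."""
    merged: dict[str, dict] = {}
    for source, claims in results.items():
        for claim in claims:
            # Normalize case and whitespace so trivial variants collapse
            key = " ".join(claim.lower().split())
            entry = merged.setdefault(key, {"claim": claim, "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())


# Illustrative inputs (not real agent output)
results = {
    "agent_1": ["quadratic attention complexity", "fixed context length"],
    "agent_2": ["hallucination", "Fixed context length"],
    "agent_3": ["inference cost"],
}
final = merge_claims(results)
```

Each entry in `final` carries a `sources` list, which gives the orchestrator what it needs to cite each claim in the assembled answer.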
3. DAG-Based Task Decomposition
What Is a Task DAG?
A Directed Acyclic Graph (DAG) models a task as a set of nodes (subtasks) connected by directed edges (dependencies). “Acyclic” means there are no circular dependencies — every path eventually terminates.
For task planning, the DAG tells you:
- Which tasks can run in parallel (no dependency between them)
- Which tasks must wait for others (explicit dependency edge)
- The overall execution order (topological sort)
Identifying Parallelizable vs Sequential Tasks
Run in parallel if:
- Task B does not use the output of Task A
- Tasks A and B access different resources with no lock contention
- Tasks A and B are logically independent parts of the same whole
Must run sequentially if:
- Task B consumes the output of Task A (data dependency)
- Task B is a quality check on Task A (functional dependency)
- Task A sets up resources that Task B will use (resource dependency)
Topological Sort = Execution Order
Given a task DAG, topological sort gives you a valid execution order where every node comes after all of its dependencies.

    from collections import deque

    def topological_sort(tasks: dict[str, list[str]]) -> list[str]:
        """
        tasks: {task_id: [list of task_ids this task depends on]}
        Returns tasks in execution order (dependencies first).
        """
        # Build in-degree counts
        in_degree = {t: 0 for t in tasks}
        dependents = {t: [] for t in tasks}
        for task, deps in tasks.items():
            for dep in deps:
                in_degree[task] += 1
                dependents[dep].append(task)

        # Start with tasks that have no dependencies
        queue = deque([t for t, deg in in_degree.items() if deg == 0])
        order = []
        while queue:
            task = queue.popleft()
            order.append(task)
            for dependent in dependents[task]:
                in_degree[dependent] -= 1
                if in_degree[dependent] == 0:
                    queue.append(dependent)

        if len(order) != len(tasks):
            raise ValueError("Cycle detected in task DAG")
        return order

ASCII DAG: Research Task Example
            ┌─────────────────────┐
            │    ORCHESTRATOR     │
            │   (receives task)   │
            └──────────┬──────────┘
                       │ decomposes
         ┌─────────────┼─────────────┐
         │             │             │
         ▼             ▼             ▼
    ┌─────────┐   ┌─────────┐   ┌─────────┐
    │  Fetch  │   │  Fetch  │   │  Fetch  │
    │ Source  │   │ Source  │   │ Source  │
    │    A    │   │    B    │   │    C    │
    └────┬────┘   └────┬────┘   └────┬────┘
         │             │             │
         ▼             ▼             ▼
    ┌─────────┐   ┌─────────┐   ┌─────────┐
    │Summarize│   │Summarize│   │Summarize│
    │    A    │   │    B    │   │    C    │
    └────┬────┘   └────┬────┘   └────┬────┘
         │             │             │
         └─────────────┼─────────────┘
                       │ fan-in
                       ▼
            ┌─────────────────────┐
            │     SYNTHESIZER     │
            │ (combines results)  │
            └─────────────────────┘
- Fetch A, B, C run in parallel (no dependencies between them)
- Each Summarize depends on its corresponding Fetch (sequential within lane)
- Summarize A, B, C can run in parallel (independent)
- Synthesizer depends on all three summarize steps (sequential after fan-in)
Topological sort gives one valid order: [FetchA, FetchB, FetchC, SumA, SumB, SumC, Synthesize]
But in practice you run [FetchA||FetchB||FetchC] → [SumA||SumB||SumC] → [Synthesize]
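Grouping the DAG into parallel stages can be sketched as a level-by-level variant of the topological sort. This is a minimal sketch, assuming every dependency also appears as a key in `tasks`; the function name `execution_levels` is hypothetical.

```python
def execution_levels(tasks: dict[str, list[str]]) -> list[list[str]]:
    """Group a task DAG into levels; tasks within a level can run in parallel.

    tasks: {task_id: [task_ids this task depends on]}
    """
    remaining = {t: set(deps) for t, deps in tasks.items()}
    levels = []
    while remaining:
        # A task is ready once all of its dependencies have been scheduled
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("Cycle detected in task DAG")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


dag = {
    "FetchA": [], "FetchB": [], "FetchC": [],
    "SumA": ["FetchA"], "SumB": ["FetchB"], "SumC": ["FetchC"],
    "Synthesize": ["SumA", "SumB", "SumC"],
}
levels = execution_levels(dag)
```

Each inner list is one fan-out batch. Note that strict levels are slightly conservative: SumA waits for the whole Fetch batch here, even though it only needs FetchA.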
Failure Handling in a DAG
When a node fails:
Option 1 — Fail fast: Cancel all downstream nodes, return error to orchestrator. Use when partial results are useless.
Option 2 — Continue with partial results: Mark the failed node as failed, skip all nodes that depend on it, continue executing independent nodes, inform the synthesizer that some inputs are missing. Use when partial results are better than nothing.
Option 3 — Retry: Re-queue the failed node (up to N times) before declaring failure. Use for transient failures (rate limits, timeouts).
Option 4 — Substitute: If a node fails, run a fallback (a simpler model, a different tool, a cached result). Use when you have a viable alternative.
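For Options 1 and 2, the orchestrator needs to know which nodes sit downstream of a failure. The sketch below walks the dependency edges in reverse to find them. The `tolerant` parameter is an assumption added for illustration: it names nodes (such as a synthesizer) that can run with missing inputs and therefore should not be skipped.

```python
from collections import deque


def nodes_to_skip(
    tasks: dict[str, list[str]],
    failed: str,
    tolerant: frozenset[str] = frozenset(),
) -> set[str]:
    """Return tasks that depend, directly or transitively, on `failed`.

    Nodes listed in `tolerant` accept missing inputs, so they are not
    skipped and the skip does not propagate through them.
    """
    dependents: dict[str, list[str]] = {t: [] for t in tasks}
    for task, deps in tasks.items():
        for dep in deps:
            dependents[dep].append(task)

    skipped: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child in tolerant or child in skipped:
                continue
            skipped.add(child)
            queue.append(child)
    return skipped


dag = {
    "FetchA": [], "FetchB": [], "FetchC": [],
    "SumA": ["FetchA"], "SumB": ["FetchB"], "SumC": ["FetchC"],
    "Synthesize": ["SumA", "SumB", "SumC"],
}
strict = nodes_to_skip(dag, "FetchB")
partial = nodes_to_skip(dag, "FetchB", tolerant=frozenset({"Synthesize"}))
```

With no tolerant nodes the failure of FetchB takes the synthesizer down with it (fail fast); marking the synthesizer as tolerant skips only SumB, which matches the partial-results option.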
4. Parallel Agent Execution (Fan-Out / Fan-In)
Fan-Out
Fan-out means distributing work: send the same prompt structure (with different inputs) to N agents simultaneously.

    # Fan-out: 3 agents receive the same task structure, different inputs
    tasks = [
        {"topic": "renewable energy storage", "agent_id": "agent_1"},
        {"topic": "grid infrastructure modernization", "agent_id": "agent_2"},
        {"topic": "policy frameworks for energy transition", "agent_id": "agent_3"},
    ]
    # All three are dispatched simultaneously

Fan-In
Fan-in means collecting and synthesizing N results into one.

    results = await asyncio.gather(
        run_agent(tasks[0]),
        run_agent(tasks[1]),
        run_agent(tasks[2]),
    )
    # All three results are now available simultaneously
    synthesized = synthesize(results)

Python Implementation with asyncio

    import asyncio
    import time

    import anthropic

    async def run_agent(client, topic: str, agent_id: str) -> dict:
        """Single async agent call."""
        start = time.monotonic()
        # messages.create on the sync client blocks, so run it in a
        # worker thread to keep the event loop free for other agents
        response = await asyncio.to_thread(
            client.messages.create,
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Write a 3-sentence summary of: {topic}"
            }]
        )
        elapsed = time.monotonic() - start
        return {
            "agent_id": agent_id,
            "topic": topic,
            "result": response.content[0].text,
            "elapsed_seconds": elapsed,
        }

    async def parallel_research(topics: list[str]) -> list[dict]:
        """Fan-out: run all agents in parallel. Fan-in: collect all results."""
        client = anthropic.Anthropic()
        tasks = [
            run_agent(client, topic, f"agent_{i}")
            for i, topic in enumerate(topics)
        ]
        # return_exceptions=True keeps one failure from cancelling the rest
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # Failed calls are dropped here; log them if you need to know what is missing
        return [r for r in results if not isinstance(r, Exception)]

Rate Limiting and Cost Considerations
Running N agents in parallel multiplies your API usage by N. Key considerations:
Anthropic rate limits: Requests per minute (RPM) and tokens per minute (TPM) are per-API-key limits. If you fan out to 10 agents simultaneously, all 10 requests consume from the same RPM bucket. Add a semaphore to cap concurrency.

    # Limit to 5 concurrent agent calls at once
    semaphore = asyncio.Semaphore(5)

    async def rate_limited_agent(client, topic, agent_id):
        async with semaphore:
            return await run_agent(client, topic, agent_id)

Cost estimation before fanning out: Each agent call costs money. For 10 agents with 1,000 input tokens and 200 output tokens each (10,000 input and 2,000 output tokens per run), using claude-haiku-4-5, you’re paying ~$1/day just for that one workflow. The exact figure depends on current per-token prices and on how many times a day the workflow runs. Estimate before building.
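The estimate can be made explicit with a few lines of arithmetic. This is a rough sketch: the function name is hypothetical, and the prices and run count in the example call are placeholders chosen for illustration, not quoted rates, so substitute the current published prices for your model.

```python
def estimate_daily_cost(
    n_agents: int,
    input_tokens_per_agent: int,
    output_tokens_per_agent: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    runs_per_day: int,
) -> float:
    """Estimated daily spend, in the same currency as the prices."""
    input_tokens = n_agents * input_tokens_per_agent
    output_tokens = n_agents * output_tokens_per_agent
    cost_per_run = (
        input_tokens / 1_000_000 * input_price_per_mtok
        + output_tokens / 1_000_000 * output_price_per_mtok
    )
    return cost_per_run * runs_per_day


# Placeholder prices (per million tokens) and a placeholder run count
daily = estimate_daily_cost(
    n_agents=10,
    input_tokens_per_agent=1_000,
    output_tokens_per_agent=200,
    input_price_per_mtok=1.0,
    output_price_per_mtok=5.0,
    runs_per_day=50,
)
```

Because cost is linear in both the fan-out width and the run frequency, doubling either one doubles the bill.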
Parallel vs sequential timing: With N=3 parallel agents taking 2s each: parallel = 2s total vs sequential = 6s total. The speedup is approximately min(N, available_concurrency), and a parallel batch takes as long as its slowest agent.
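The timing claim can be checked without calling any API by simulating the schedule. The sketch below assumes calls start in list order and that at most `concurrency` run at once (as with a semaphore); the function name `simulated_wall_clock` is hypothetical and network jitter is ignored.

```python
import heapq


def simulated_wall_clock(durations: list[float], concurrency: int) -> float:
    """Wall-clock time when at most `concurrency` calls run at once."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    running: list[float] = []  # min-heap of finish times for in-flight calls
    for d in durations:
        # If every slot is busy, the next call starts when the earliest one ends
        start = heapq.heappop(running) if len(running) >= concurrency else 0.0
        heapq.heappush(running, start + d)
    return max(running, default=0.0)


parallel = simulated_wall_clock([2.0, 2.0, 2.0], concurrency=3)
sequential = simulated_wall_clock([2.0, 2.0, 2.0], concurrency=1)
capped = simulated_wall_clock([2.0, 2.0, 2.0], concurrency=2)
uneven = simulated_wall_clock([1.0, 5.0, 1.0], concurrency=3)
```

The `capped` case shows why the speedup depends on available concurrency rather than on N alone, and the `uneven` case shows the slowest agent setting the batch time.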
5. Handoff Protocols
Structured Output as the Contract
The interface between agents should be a strongly-typed schema, not a freeform string. When one agent’s output is another agent’s input, both must agree on the format. Using structured output (via tool_use or response_format) enforces this.
Pydantic Models for Inter-Agent Communication

    from pydantic import BaseModel, Field
    from typing import Optional, Literal
    from datetime import datetime, timezone

    class AgentResult(BaseModel):
        """Canonical structure for any agent's output."""
        task_id: str
        agent_id: str
        status: Literal["success", "partial", "failed"]
        result: Optional[dict] = None
        confidence: Optional[float] = Field(None, ge=0.0, le=1.0)
        errors: list[str] = []
        metadata: dict = {}
        # datetime.utcnow is deprecated since Python 3.12; use an aware UTC timestamp
        completed_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    class ResearchResult(AgentResult):
        """Specialized result for research subagents."""
        class Config:
            extra = "allow"
        # result will contain:
        # {
        #     "summary": str,
        #     "key_claims": list[str],
        #     "sources_used": list[str],
        # }

Versioning Handoff Schemas
As your system evolves, the schema between agents will change. Without versioning, deploying a new orchestrator that emits v2 schemas to subagents still running v1 parsers causes silent failures.
Best practices:
- Include a schema_version field in every handoff message.
- Subagents should validate the schema version before processing.
- Support N-1 compatibility: new subagents should handle both current and previous schema versions.
- Use semantic versioning: minor bumps are backwards-compatible additions; major bumps are breaking changes.
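A version check at the subagent boundary might look like the sketch below. It assumes `schema_version` is a "major.minor" string and that the receiving agent is on major version 2; `SUPPORTED_MAJOR` and both function names are hypothetical.

```python
SUPPORTED_MAJOR = 2  # hypothetical: this subagent speaks v2 and still accepts v1


def parse_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' string into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_schema_version(message: dict) -> int:
    """Return the major version to parse with, or raise if unsupported."""
    version = message.get("schema_version")
    if version is None:
        raise ValueError("Handoff message is missing schema_version")
    major, _minor = parse_version(version)
    # N-1 compatibility: accept the current and the previous major version.
    # Minor bumps are additive, so any minor within a supported major is fine.
    if major not in (SUPPORTED_MAJOR, SUPPORTED_MAJOR - 1):
        raise ValueError(f"Unsupported schema_version: {version}")
    return major
```

Raising on an unknown version turns what would have been a silent parse failure into an explicit, permanent error that the orchestrator should not retry.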
What to Include in a Handoff
| Field | Required | Purpose |
|---|---|---|
| task_id | Yes | Correlate requests and responses for tracing |
| status | Yes | Let orchestrator know if it got what it needed |
| result | Conditional | The actual output (null if status=failed) |
| errors | On failure | Structured error info for orchestrator retry logic |
| confidence | Recommended | Let orchestrator weight or re-verify low-confidence results |
| metadata | Recommended | Tokens used, model version, latency (for monitoring) |
| schema_version | Recommended | Future-proof the interface |
6. Failure Handling in Multi-Agent Systems
Multi-agent systems have more failure modes than single-agent systems. Plan for failure explicitly.
Retry at the Agent Level
The simplest and most important failure handler: when an agent fails, try again.

    import asyncio

    async def run_with_retry(agent_fn, *args, max_retries: int = 3, backoff: float = 1.0):
        """Exponential backoff retry for a single agent call."""
        for attempt in range(max_retries):
            try:
                return await agent_fn(*args)
            # Catches every exception for brevity; in production, catch only
            # the error types you consider transient
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                wait = backoff * (2 ** attempt)
                print(f"[Retry] Attempt {attempt+1} failed: {e}. Retrying in {wait}s.")
                await asyncio.sleep(wait)

Only retry on transient errors (rate limits, timeouts, 5xx). Do not retry on permanent errors (invalid input, schema mismatch), because they will fail every time.
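One way to enforce the transient-versus-permanent distinction is to retry only on a dedicated exception type. The sketch below assumes the calling code maps SDK or HTTP errors onto two hypothetical marker classes, `TransientError` and `PermanentError`; neither comes from a real library.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout, 5xx)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that will not succeed on retry."""


async def retry_transient(agent_fn, *args, max_retries: int = 3, backoff: float = 1.0):
    """Retry with exponential backoff, but only for TransientError."""
    for attempt in range(max_retries):
        try:
            return await agent_fn(*args)
        except TransientError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(backoff * (2 ** attempt))
        # PermanentError (and anything else) is not caught, so it
        # propagates to the orchestrator on the first attempt
```

The orchestrator can then route a `PermanentError` straight to a fallback agent or to failure handling instead of spending retries on it.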
Fallback Agents
When a specialist fails, fall back to a generalist.

    async def run_with_fallback(primary_fn, fallback_fn, *args):
        """Try primary agent; if it fails, use fallback."""
        try:
            result = await primary_fn(*args)
            if result.status != "failed":
                return result
        except Exception:
            pass
        print("[Fallback] Primary agent failed. Using fallback.")
        return await fallback_fn(*args)

    # Example: specialist code review agent → fallback to generalist
    result = await run_with_fallback(
        specialist_code_reviewer,
        generalist_reviewer,
        code_to_review
    )

Circuit Breakers
A circuit breaker stops routing tasks to an agent that is consistently failing, protecting the overall system.

    import time

    class CircuitBreaker:
        """
        Three states:
        CLOSED: normal operation, requests pass through
        OPEN:   agent is failing, reject requests immediately
        HALF:   test mode, allow a request to see if agent recovered
        """
        def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.timeout = timeout
            self.state = "CLOSED"
            self.last_failure_time = 0.0

        def call_allowed(self) -> bool:
            if self.state == "CLOSED":
                return True
            if self.state == "OPEN":
                # After the cool-down period, let a test call through
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = "HALF"
                    return True  # Allow the test call
                return False
            return True  # HALF: allow the test call

        def on_success(self):
            self.failure_count = 0
            self.state = "CLOSED"

        def on_failure(self):
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                print("[CircuitBreaker] OPEN: too many failures")

Partial Results Assembly
When some subagents succeed and some fail, the orchestrator must decide how to assemble a useful output from incomplete inputs.

    def assemble_partial_results(
        results: list[AgentResult],
        required_tasks: set[str],
        optional_tasks: set[str]
    ) -> dict:
        """
        Assemble final output from a mix of successful and failed subagents.
        """
        succeeded = {r.task_id: r for r in results if r.status == "success"}
        failed = {r.task_id: r for r in results if r.status == "failed"}

        # All required tasks must succeed
        missing_required = required_tasks - set(succeeded.keys())
        if missing_required:
            return {
                "status": "failed",
                "reason": f"Required tasks failed: {missing_required}",
                "partial_data": succeeded
            }

        # Optional tasks: include what we have, note what's missing
        missing_optional = optional_tasks - set(succeeded.keys())
        return {
            "status": "partial" if missing_optional else "success",
            "results": {tid: r.result for tid, r in succeeded.items()},
            "missing_optional": list(missing_optional),
            "errors": {tid: r.errors for tid, r in failed.items()},
        }

7. Multi-Agent Frameworks Comparison
LangGraph
Model: Graph-based. Nodes are Python functions (or agents). Edges are transitions, which can be conditional. State flows through the graph as a typed dictionary.
Strengths:
- Explicit, auditable control flow — you can see exactly what transitions are possible
- First-class support for cycles (loops until convergence)
- Built-in persistence and checkpointing (resume interrupted workflows)
- Rich ecosystem (LangChain tools, integrations)
- Human-in-the-loop is a first-class concept
Weaknesses:
- Steeper learning curve than simpler frameworks
- The graph abstraction adds boilerplate for straightforward linear workflows
- Tight coupling to LangChain conventions
Best for: Complex workflows with conditional branching, loops, and human checkpoints. Long-running processes where persistence matters.

    from langgraph.graph import StateGraph

    def orchestrator(state): ...
    def researcher(state): ...
    def synthesizer(state): ...

    graph = StateGraph(dict)
    graph.add_node("orchestrator", orchestrator)
    graph.add_node("researcher", researcher)
    graph.add_node("synthesizer", synthesizer)
    graph.add_edge("orchestrator", "researcher")
    graph.add_edge("researcher", "synthesizer")

AutoGen
Model: Conversation-based. Agents are objects that can converse with each other. Multi-agent collaboration is framed as a group chat or pairwise conversation.
Strengths:
- Intuitive for conversation-native workflows
- Easy to set up basic multi-agent dialogues
- Good support for code execution in sandboxes
- Microsoft-backed, large community
Weaknesses:
- Less control over execution order than LangGraph
- The conversation model can be hard to predict for strict pipelines
- State management is less explicit
Best for: Conversational multi-agent systems, debate patterns (agent A argues, agent B critiques), code generation with auto-execution and error correction.
CrewAI
Model: Role-based. Define “agents” with roles and goals, group them into “crews”, assign “tasks”. Higher-level abstraction than LangGraph or AutoGen.
Strengths:
- Fastest time-to-prototype
- Intuitive role/crew metaphor
- Good for non-engineers to understand the architecture at a glance
Weaknesses:
- Less control over internals
- Harder to debug when the crew doesn’t behave as expected
- Limited customization vs LangGraph
Best for: Rapid prototyping, non-technical teams, simple sequential crew pipelines.

    from crewai import Agent, Task, Crew

    researcher = Agent(role="Researcher", goal="Find relevant information", ...)
    writer = Agent(role="Writer", goal="Write a clear summary", ...)
    research_task = Task(description="Research X", agent=researcher)
    write_task = Task(description="Write a report about X", agent=writer)
    crew = Crew(agents=[researcher, writer], tasks=[research_task, write_task])
    result = crew.kickoff()

Claude Code’s Agent Tool
Model: Built-in subagent delegation within Claude Code. The Agent tool spins up a subagent in an isolated worktree, runs it to completion, and returns the result.
Strengths:
- Zero setup — no framework to install
- Worktree isolation prevents file system contamination between agents
- Inherits all Claude Code tools (Read, Write, Bash, Grep, etc.)
- Natural for code-focused tasks
Weaknesses:
- Only works within Claude Code (not a standalone framework)
- Limited visibility into subagent’s internal steps from the orchestrator
- No built-in parallel fan-out
Best for: Software development tasks where you want to delegate a complete coding subtask to a subagent with full tool access.
Decision Guide
| Scenario | Recommended |
|---|---|
| Complex conditional workflows, needs checkpointing | LangGraph |
| Conversational agents, code generation with auto-execution | AutoGen |
| Quick prototype, simple crew pipelines | CrewAI |
| Code tasks within Claude Code | Claude Code Agent tool |
| Full control, no framework overhead | Raw SDK (as in this module’s examples) |
The raw SDK approach (what this module’s examples use) is always a valid choice for production systems where you need full control, easy debugging, and no framework abstractions in your way.
8. Interview Flashcards
Q1: What is the orchestrator-subagent pattern?
A: The orchestrator-subagent pattern separates task coordination from task execution.
The orchestrator receives the high-level goal, decomposes it into scoped subtasks, delegates each to a subagent, and assembles the final result. It owns all shared state.
The subagent receives one well-defined task, executes it with its own tools and context, and returns a structured result. It has no knowledge of the broader task or other agents.
Key benefits: separation of concerns, clear failure isolation (a subagent failure doesn’t corrupt the orchestrator’s state), and natural support for parallelism (the orchestrator can delegate multiple subtasks simultaneously).
In interviews: draw the pattern as a diagram — orchestrator at the top, arrows going down to N subagents, arrows coming back up with results, synthesis at the bottom.
Q2: When should you decompose a task into multiple agents?
A: Decompose when:
- The task generates more intermediate state than fits in one context window
- Subtasks are logically independent and can run in parallel (reducing latency)
- Different subtasks benefit from different specializations (different system prompts, models, or tool sets)
- The overall workflow is too long-running for a single agent loop
Do not decompose when:
- A single agent can handle it cleanly (coordination overhead outweighs benefits)
- Subtasks are tightly coupled (sharing state via the orchestrator adds complexity without parallelism gains)
- The task is time-sensitive and the latency of inter-agent communication is unacceptable
The heuristic: if you cannot write a clean, typed interface between what the orchestrator sends and what the subagent returns, the decomposition is probably wrong.
Q3: How do you handle failure when one agent in a pipeline fails?
A: The right strategy depends on the role of the failing agent:
- Retry with backoff: For transient failures (rate limits, timeouts). Use exponential backoff with a cap. Retry up to 3 times before escalating.
- Fallback agent: Replace the failing specialist with a generalist. Accept lower quality output rather than total failure.
- Partial results: If the failing agent’s output is optional (not on the critical path), continue with the remaining results. Annotate the final output to indicate what’s missing.
- Circuit breaker: If an agent fails repeatedly, stop routing to it. This protects the overall system from cascading failures and prevents wasted retries.
- Fail fast: If the failing agent produces a required output that the synthesizer cannot proceed without, fail the whole pipeline immediately. This is cleaner than a synthesizer that silently produces wrong output with missing inputs.
Always distinguish between required and optional tasks in your orchestrator’s assembly logic.
Q4: What is fan-out/fan-in in multi-agent context?
A: Fan-out and fan-in are the parallel execution pattern:
Fan-out: The orchestrator sends the same task type (with different inputs) to N agents simultaneously. Example: summarize 5 documents by running 5 summarization agents in parallel.
Fan-in: After all N agents complete, collect all N results and pass them to a synthesis step. asyncio.gather() in Python is the typical fan-in mechanism.
The pattern looks like:
    Orchestrator
        │
        ├── Agent A (topic 1) ─┐
        ├── Agent B (topic 2) ─┤  (all run in parallel)
        └── Agent C (topic 3) ─┘
                               │
                     Synthesizer (fan-in)
Key considerations:
- Concurrency control (semaphore) to avoid hitting API rate limits
- Handle exceptions per-agent so one failure doesn’t cancel all others (use return_exceptions=True with asyncio.gather())
- The synthesizer must handle partial results if some agents fail
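The last two considerations can be sketched together: gather with `return_exceptions=True`, then split the outcomes so the synthesizer sees both what succeeded and what failed. `gather_partitioned` and `fake_agent` are hypothetical names, and `fake_agent` is a stand-in for a real agent call.

```python
import asyncio


async def gather_partitioned(coros):
    """Run coroutines concurrently; return (successes, failures)."""
    # return_exceptions=True returns raised exceptions as values
    # instead of propagating the first one and abandoning the rest
    outcomes = await asyncio.gather(*coros, return_exceptions=True)
    successes = [o for o in outcomes if not isinstance(o, Exception)]
    failures = [o for o in outcomes if isinstance(o, Exception)]
    return successes, failures


async def fake_agent(topic: str) -> str:
    # Stand-in for a real agent call; fails on one input to show the partial case
    await asyncio.sleep(0)
    if topic == "bad":
        raise RuntimeError(f"agent failed on {topic!r}")
    return f"summary of {topic}"


successes, failures = asyncio.run(
    gather_partitioned([fake_agent(t) for t in ["a", "bad", "c"]])
)
```

Keeping the failures, rather than filtering them out, lets the synthesizer annotate the final output with what is missing.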
Q5: How do agents communicate with each other?
A: In practice, agents communicate through the orchestrator, not directly with each other (in a standard orchestrator-subagent architecture).
The communication protocol:
- Orchestrator → Subagent: A structured message containing the task description, required context, expected output format, and constraints. This is typically a messages array passed to an LLM call.
- Subagent → Orchestrator: A structured result object (validated with Pydantic) containing status, result, confidence, errors, and metadata.
The key principle is typed interfaces: the schema between orchestrator and subagent should be explicit and validated. Never pass freeform strings between agents if you can avoid it — parse failures are hard to debug in a multi-agent pipeline.
In advanced architectures (agent meshes, multi-party conversation with AutoGen), agents can communicate directly. But this increases complexity and makes failure analysis harder. Start with the hub-and-spoke (orchestrator-centric) model.
Q6: What is a DAG and how does it apply to agent task planning?
A: A DAG (Directed Acyclic Graph) is a graph where edges are one-directional and there are no cycles. In task planning, nodes represent subtasks and edges represent “must complete before” dependencies.
Applied to agents: given a complex task, model it as a DAG where:
- Each node is a subtask that one agent will execute
- Directed edge A → B means “A must complete before B can start”
- Nodes with no incoming edges can run immediately (they have no prerequisites)
- Nodes at the same level of the topological sort can run in parallel
Practical use:
- Draw the dependency graph for your task
- Run topological sort to get a valid execution sequence
- Within each “generation” of the topological sort (nodes that can all run after the same set of predecessors), execute in parallel via fan-out
- Fan-in after each generation before proceeding to the next
Failure handling in a DAG: when a node fails, mark it as failed and mark all nodes downstream of it as skipped. Nodes in independent branches continue running.
Q7: Compare LangGraph vs AutoGen vs CrewAI for a complex research workflow
A: Given a workflow: “search 5 sources, summarize each, critique each summary, synthesize into a report, have a human review before publishing”:
LangGraph is the best fit here:
- The conditional human-review step is a first-class concept in LangGraph (interrupt-and-resume)
- The DAG structure maps directly to LangGraph nodes and edges
- Checkpointing means you can resume if the process is interrupted mid-run
- State is explicitly typed throughout
AutoGen could work but:
- The human-review step requires a human proxy agent, which is non-trivial to configure correctly
- Managing 5 parallel summarization agents in AutoGen’s conversation model is awkward
- Better for the “summarize → critique → revise” loop part of the workflow than for the full DAG
CrewAI is least suited:
- Designed for sequential crew pipelines, not complex DAGs
- Human-in-the-loop is not a first-class concept
- Limited control over parallel execution
Verdict: LangGraph for complex workflows with conditionals, human-in-the-loop, and persistence needs. AutoGen for the debate/critique loop within a workflow. CrewAI for simple sequential pipelines only.
Q8: How do you debug a multi-agent system?
A: Debugging multi-agent systems requires more structure than debugging single-agent systems. Key approaches:
1. Structured logging with correlation IDs. Every agent call should log its task_id, agent_id, inputs (truncated), output (truncated), status, and duration. Use the task_id to trace a request across all agents involved.
2. Trace the DAG execution. Log each node as it starts, completes, or fails. Reconstruct the execution graph post-mortem.
3. Save all intermediate results. Don’t just keep the final output. Store each subagent’s raw output to file or a database. This lets you replay individual steps with modified inputs without re-running the whole pipeline.
4. Isolation testing. Test each agent independently before testing the full pipeline. If Agent B fails, first verify that Agent A’s output is what Agent B expects. Often the bug is in the handoff schema, not in either agent.
5. Prompt-level debugging. When an agent returns a wrong answer, add it to a test suite. Run the exact messages array that agent received and iterate on the prompt or output format.
6. Confidence thresholds. Build in confidence scoring. When confidence is low, log extensively and optionally trigger human review. This surfaces agents that are technically “succeeding” but producing low-quality output.
7. Replay infrastructure. The ability to replay a failed run from any checkpoint is invaluable. LangGraph has this built in. In a custom system, persist the state after each major step.
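Approach 1 can be sketched as a helper that emits one JSON line per agent call. The field names follow the list above; the function names, the truncation limit, and the choice of JSON lines as the log format are assumptions for illustration.

```python
import json


def truncate(text: str, limit: int = 200) -> str:
    """Keep log lines bounded; mark anything that was cut."""
    return text if len(text) <= limit else text[:limit] + "...[truncated]"


def agent_log_record(
    task_id: str,
    agent_id: str,
    inputs: str,
    output: str,
    status: str,
    started_at: float,
    finished_at: float,
) -> str:
    """Build one JSON log line for an agent call."""
    return json.dumps({
        "task_id": task_id,      # correlation ID shared across agents
        "agent_id": agent_id,
        "inputs": truncate(inputs),
        "output": truncate(output),
        "status": status,
        "duration_s": round(finished_at - started_at, 3),
    })


line = agent_log_record("t-1", "agent_2", "x" * 500, "done", "success", 10.0, 12.5)
```

Filtering the log on a single `task_id` then reconstructs the path of one request through every agent that touched it.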
What’s Next
- Work through examples/orchestrator_subagent.py to see the orchestrator-subagent pattern implemented end-to-end with the raw Anthropic SDK.
- Work through examples/parallel_agents.py to see fan-out/fan-in with asyncio timing comparisons.
- Complete the exercises in exercises/README.md, especially the engineering report system design.
- See references.md for LangGraph tutorials, the AutoGen paper, and production case studies.