System Design: Multi-Agent Research Pipeline
Interview Prompt: “Design an automated research assistant that takes a topic, researches it from multiple sources, and produces a comprehensive report.”
Step 1: Clarifying Questions
Output and quality:
- What does a “comprehensive report” look like? (Academic paper style? Executive brief? Technical deep-dive? Length?)
- What’s the quality bar — is this a first-draft that humans refine, or a final output?
- How important is source accuracy and citation quality vs. comprehensiveness?
Sources and scope:
- What sources should it research? (Web search, academic papers, internal documents, news, financial data?)
- Are there prohibited sources? (Behind paywalls, competitor blogs, unreliable sites?)
- How current must the information be? (Academic research can use 5-year-old papers; market reports need last-week’s data.)
Operational constraints:
- How long can the process take? (5 minutes? 30 minutes? Hours?)
- What’s the cost budget per report?
- How many reports will be generated per day?
User interaction:
- Is the user waiting interactively, or does this run async and deliver via email/Slack?
- Can the user provide feedback mid-process (“focus more on X, less on Y”)?
- Does the system learn from user feedback on past reports?
For this walkthrough, I’ll assume:
- Business/technology research reports, 2,000–5,000 words
- Output is a polished first draft — humans will review and may edit, but it should be publication-ready
- Sources: web search (Tavily API), arXiv for academic papers, internal document library (existing RAG system)
- No real-time data (no live market data, news within last 30 days is acceptable)
- Latency: async, deliver in < 15 minutes, user gets a “processing” status
- ~50 reports/day, cost budget $5–20 per report
- Human review step required before publishing
Step 2: Requirements
Functional Requirements
- Accept a topic (free-form text or structured brief) and produce a comprehensive research report
- Cite every factual claim with a verifiable source
- Cover the topic from multiple perspectives (technical, business, historical, current state)
- Fact-check key claims across multiple sources before including them
- Produce a structured report with executive summary, body sections, and references
- Support async execution with status tracking
Non-Functional Requirements
- End-to-end latency: < 15 minutes for 95% of reports
- Source coverage: minimum 10 distinct sources per report
- Cost: < $15 per report at target quality
- Factual accuracy: < 5% of cited facts should be incorrect or misattributed (verified on golden test set)
- Availability: 99% (internal tool, async — downtime is tolerable)
- Parallelism: ability to process 10 reports simultaneously
Step 3: Agent DAG Design
The system uses a Directed Acyclic Graph (DAG) of specialized agents. Each agent is an LLM call with specific tools, a specific role, and structured input/output.
USER INPUT
(topic + brief)
│
▼
┌─────────────────┐
│ Orchestrator │
│ (planning LLM) │
│ - Decomposes │
│ topic into │
│ sub-topics │
│ - Assigns work │
└────────┬────────┘
│
┌────────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Researcher 1 │ │ Researcher 2 │ │ Researcher 3 │
│ (Sub-topic A)│ │ (Sub-topic B)│ │ (Sub-topic C)│
│ - Web search │ │ - Web search │ │ - Academic │
│ - Source eval│ │ - Source eval│ │ papers │
│ - Notes │ │ - Notes │ │ - Notes │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└─────────────────┼──────────────────┘
│
▼
┌──────────────────┐
│ Fact-Checker │
│ - Cross-reference│
│ key claims │
│ - Flag conflicts │
│ - Confidence │
│ scores │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Writer │
│ - Synthesizes │
│ all research │
│ - Drafts report │
│ - Adds citations │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Editor │
│ - Quality check │
│ - Coherence │
│ - Citation audit │
│ - Final polish │
└────────┬─────────┘
│
▼
FINAL REPORT
→ Human Review Queue
Step 4: Orchestration Strategy
The Orchestrator Agent
The Orchestrator is the first LLM call. Its job is to decompose the research topic into concrete sub-topics and produce a work plan.
Input: Raw topic from user (e.g., “The current state of AI regulation in the European Union”)
Output: Structured research plan
{
"topic": "AI regulation in the European Union",
"angle": "Current state analysis for business decision-makers",
"sub_topics": [
{
"id": "ST1",
"focus": "EU AI Act: overview, timeline, enforcement status",
"required_sources": ["official EU documents", "legal analysis"],
"researcher_id": "R1"
},
{
"id": "ST2",
"focus": "Impact on AI companies operating in EU: compliance burden, timelines",
"required_sources": ["industry reports", "company filings"],
"researcher_id": "R2"
},
{
"id": "ST3",
"focus": "Comparison with US and UK regulatory approaches",
"required_sources": ["policy analysis", "regulatory documents"],
"researcher_id": "R3"
}
],
"key_claims_to_verify": [
"The EU AI Act passed in [year]",
"High-risk AI systems require [specific requirements]",
"Penalties are up to [amount]"
],
"target_word_count": 3000,
"output_structure": ["Executive Summary", "Regulatory Overview", "Business Impact", "Global Context", "Conclusions"]
}Why a separate planning step? The quality of the researcher agents depends entirely on the specificity of their briefs. A vague brief produces vague research. The orchestrator’s job is to transform a fuzzy user request into precise, bounded tasks for each researcher.
Orchestrator model choice: Claude 3.5 Sonnet. Planning quality matters here. This is one LLM call — cost is a small fraction of total.
Researcher Agents (×3, Parallel)
Each researcher receives its sub-topic brief and executes a research loop using web search and document retrieval tools.
Tools available to each researcher:
web_search(query: str, max_results: int) → List[SearchResult]— Tavily APIfetch_page(url: str) → PageContent— fetch and extract full text from a URLsearch_arxiv(query: str, max_results: int) → List[Paper]— academic paper searchsearch_internal_docs(query: str) → List[DocumentChunk]— RAG over internal library
Researcher execution pattern (ReAct loop):
Thought: I need to find information about the EU AI Act passage date and provisions.
Action: web_search("EU AI Act official passage date enforcement timeline 2024")
Observation: [results...]
Thought: The top result mentions April 2024. I should fetch the official EU source to verify.
Action: fetch_page("https://eur-lex.europa.eu/...")
Observation: [official text...]
Thought: Confirmed: The EU AI Act was adopted in March 2024 and entered into force in August 2024.
I have the basic chronology. Now I need the compliance requirements for high-risk AI.
Action: web_search("EU AI Act high-risk AI systems requirements Article 9 10")
...
Thought: I have enough information. Let me compile my research notes.
Final output: [structured research notes]
Researcher output structure:
{
"sub_topic": "ST1",
"findings": [
{
"claim": "The EU AI Act entered into force on 1 August 2024",
"evidence": "Direct quote from EUR-Lex official publication",
"source_url": "https://eur-lex.europa.eu/...",
"source_name": "EUR-Lex Official EU Legislation",
"confidence": "high",
"retrieved_at": "2025-04-14T10:30:00Z"
},
...
],
"key_quotes": [...],
"gaps": ["Could not find enforcement action examples — may need more research"],
"word_count_guidance": 900
}Researcher model choice: Claude 3.5 Haiku. These agents do lots of tool calls and produce structured notes — speed and cost matter more than deep reasoning. Budget ~8–12 tool calls per researcher × 3 researchers.
Fact-Checker Agent
The Fact-Checker receives the compiled findings from all three researchers and cross-references key claims.
What it does:
- Identifies claims that appear in multiple researcher reports — do they agree or conflict?
- For the “key claims to verify” list from the Orchestrator, checks that at least 2 independent sources confirm each claim.
- Flags claims with only a single source or where sources conflict.
- Assigns a confidence score (high/medium/low/unverified) to each key claim.
Tools available:
web_search(to find additional confirming or conflicting sources)fetch_page(to check specific source documents)
Output:
{
"verified_claims": [
{
"claim": "EU AI Act entered into force August 2024",
"status": "verified",
"confirming_sources": 3,
"confidence": "high"
}
],
"conflicting_claims": [
{
"claim": "Penalties up to 35 million EUR",
"status": "conflict",
"sources": [
{"says": "35 million EUR", "source": "TechCrunch article"},
{"says": "35 million EUR or 7% of global revenue", "source": "Official EU text"}
],
"resolution": "Use official EU text — news article is incomplete",
"recommended_text": "up to €35 million or 7% of annual worldwide turnover, whichever is higher"
}
],
"unverified_claims": [...],
"fact_check_summary": "23/25 key claims verified. 2 conflicts resolved. 1 claim removed (unverifiable)."
}Why a separate fact-checker and not have researchers fact-check each other? Cross-referencing is a different cognitive task from researching. Having a single agent read all research output and specifically look for conflicts catches errors that would be invisible to an agent focused on finding new information. It also allows parallel research (researchers don’t wait for each other) while still getting cross-source verification.
Fact-Checker model choice: Claude 3.5 Sonnet. Reasoning over potentially conflicting information requires more careful analysis.
Writer Agent
The Writer receives the verified research findings and produces the full draft report.
Input: All researcher findings + fact-checker output + orchestrator’s report structure
What it does:
- Synthesizes findings across sub-topics into a coherent narrative
- Follows the output structure defined by the Orchestrator
- Writes in the appropriate register (business-focused, accessible prose)
- Inserts inline citations for every factual claim using a citation format
- Flags sections where fact-checker noted low confidence claims (using a
[LOW CONFIDENCE]marker)
Tools available: None. The Writer has all the information it needs in its context window from the previous agents. Giving the Writer search tools risks it going off-brief and re-researching when it should be writing.
Prompt for the Writer:
You are an expert research writer producing a professional report.
You have been provided with:
1. A research brief specifying the topic, angle, and target audience
2. Research findings from three researchers, each covering a different sub-topic
3. A fact-checker's report identifying verified, conflicting, and unverified claims
Your task:
- Write a comprehensive report following the provided structure
- Target length: [word_count] words
- Cite every factual claim using the citation keys provided in the research findings
- Use only claims marked as "verified" or "medium confidence" — do not include "unverified" claims
- For "conflicting" claims, use the fact-checker's recommended resolution
- Write clearly for a business-executive audience — avoid jargon, explain technical terms
Writer output: Markdown-formatted report with inline citation markers like [Source 1] and a references section.
Writer model choice: Claude 3.5 Sonnet. Writing quality is critical — this is the output the user sees.
Editor Agent
The Editor is the final quality gate before human review. It reads the full draft and checks for:
- Coherence: Does the narrative flow logically? Are there abrupt transitions or contradictory claims?
- Citation completeness: Is every factual claim cited? Are citation keys matched to the references?
- Coverage: Does the report actually answer the original brief? Are there obvious gaps?
- Tone consistency: Is the writing register consistent throughout?
- Fact-check compliance: Were any
[LOW CONFIDENCE]markers missed? Were any unverified claims included?
Editor output: Either PASS with minor editorial changes applied, or FAIL with specific issues listed and a request for the Writer to revise specific sections.
Revision loop: If the Editor returns FAIL, the Writer revises the flagged sections (max 2 revision cycles). After 2 cycles, the report goes to human review regardless of status, with the Editor’s issues noted.
Editor model choice: Claude 3.5 Haiku (simple review task, needs to be fast to not add significant latency).
Step 5: Parallel Execution Design
The three researcher agents run in parallel — this is the primary latency optimization.
t=0:00 Orchestrator completes plan
t=0:00 Researcher 1, 2, 3 start simultaneously
t=0:04 Researcher 2 completes (fastest, sub-topic was narrower)
t=0:07 Researcher 1 completes
t=0:09 Researcher 3 completes (longest, academic paper search)
t=0:09 Fact-Checker starts (waits for all researchers)
t=0:12 Fact-Checker completes
t=0:12 Writer starts
t=0:20 Writer completes
t=0:20 Editor starts
t=0:23 Editor completes → PASS
t=0:23 Report queued for human review
Total elapsed time: ~23 minutes — well within the 15-minute target for most reports. (Note: some complex topics require more researcher iterations.)
Without parallelism, researchers would run sequentially: 9 + 7 + 4 = 20 minutes just for research, pushing total to 35+ minutes.
Implementation with async:
import asyncio
async def run_pipeline(brief: ResearchBrief) -> Report:
# Step 1: Plan (sequential - must complete before anything else)
plan = await orchestrator.plan(brief)
# Step 2: Research (parallel - all three run simultaneously)
research_results = await asyncio.gather(
researcher.research(plan.sub_topics[0]),
researcher.research(plan.sub_topics[1]),
researcher.research(plan.sub_topics[2]),
return_exceptions=True # Don't let one failure kill all three
)
# Handle partial failures
valid_results = [r for r in research_results if not isinstance(r, Exception)]
failed = [r for r in research_results if isinstance(r, Exception)]
if len(valid_results) < 2:
raise PipelineError("Insufficient research completed")
if failed:
log_warning(f"{len(failed)} researcher(s) failed, proceeding with partial results")
# Step 3: Fact-check (sequential - needs all research)
fact_check = await fact_checker.verify(valid_results, plan.key_claims)
# Step 4: Write (sequential - needs verified facts)
draft = await writer.write(plan, valid_results, fact_check)
# Step 5: Edit (sequential - needs full draft)
for revision_round in range(2):
edit_result = await editor.review(draft, plan)
if edit_result.status == "PASS":
break
draft = await writer.revise(draft, edit_result.issues)
return Report(content=draft, metadata=collect_metadata(plan, fact_check))Step 6: Output Validation Between Agents
Each agent’s output is typed and validated before being passed to the next agent. This prevents cascading failures where a malformed output from one agent causes a cryptic error in the next.
Validation approach:
from pydantic import BaseModel, validator
from typing import List, Literal
class ResearchFinding(BaseModel):
claim: str
evidence: str
source_url: str
confidence: Literal["high", "medium", "low"]
@validator("source_url")
def url_must_be_valid(cls, v):
# Basic URL validation
assert v.startswith("http"), "Source URL must be a valid URL"
return v
class ResearcherOutput(BaseModel):
sub_topic_id: str
findings: List[ResearchFinding]
gaps: List[str]
@validator("findings")
def must_have_findings(cls, v):
assert len(v) >= 3, "Researcher must produce at least 3 findings"
return vWhat happens on validation failure:
- Retry the agent once with the validation error appended to its context: “Your previous output had a validation error: {error}. Please correct and resubmit.”
- If second attempt also fails: skip this sub-topic, note it in the fact-checker context, proceed with remaining agents.
- Always: log the failure with full context for debugging.
Structured output enforcement: All agents are called with response_format specifying JSON output matching the expected schema. Use Claude’s tool_use feature with the output schema as a tool definition — the model must produce a valid tool call, which forces schema compliance.
Step 7: Failure Handling
Single Researcher Failure
Scenario: One of three researchers times out or produces invalid output.
Handling: Proceed with 2/3 researchers. The Writer and Editor are given explicit context: “Note: Sub-topic B research failed. This section may be incomplete.” The final report includes a “Coverage Limitations” note. This is acceptable — better to deliver 80% coverage than nothing.
Fact-Checker Produces Excessive Conflicts
Scenario: Fact-checker finds that 40% of claims are in conflict (maybe the topic is genuinely contested, or the researchers found low-quality sources).
Handling: Fact-checker includes a confidence_summary with an overall reliability score. If score < 70%, the Writer is instructed to write in a more hedged register (“According to multiple sources…” instead of “It is established that…”) and the report is flagged for extra human review scrutiny.
Writer Produces Insufficient Length
Scenario: Writer produces 800 words when 3,000 were requested.
Detection: Editor checks word count explicitly and returns FAIL with “Insufficient coverage — expand sections 2, 3, 4.”
Handling: Writer revises with explicit instruction to expand specific sections. After 2 revisions, if still short, deliver what exists with a flag.
Pipeline Timeout
Scenario: A researcher gets stuck in a loop fetching unreachable URLs and the 15-minute deadline approaches.
Handling:
- Each agent has a hard timeout (Researcher: 5 minutes, Fact-checker: 3 minutes, Writer: 4 minutes, Editor: 2 minutes)
- On timeout: return partial output +
status: "timeout"+ whatever was completed - Orchestrator collects partial outputs and makes a best-effort decision: if enough content exists, continue pipeline; if too little, fail with a user-facing error and refund the compute cost
Complete Pipeline Failure
Scenario: Anthropic API is unavailable.
Handling: Queue the report job with a “pending” status. Retry every 5 minutes for up to 2 hours. If not completed in 2 hours, notify the user and offer to retry. Never silently drop a report request.
Step 8: Latency vs. Quality Trade-off Analysis
This is a key interview topic — show you’ve thought deeply about the tension.
| Configuration | Latency | Quality | Cost | Use case |
|---|---|---|---|---|
| 1 researcher, no fact-check, Haiku writer | ~5 min | Low | ~$1 | Quick briefing, internal use |
| 3 researchers, no fact-check, Haiku writer | ~8 min | Medium | ~$3 | First draft, human-heavy review |
| 3 researchers + fact-check + Sonnet writer | ~15 min | High | ~$10 | Publication-ready draft (default) |
| 5 researchers + 2-pass fact-check + Sonnet writer + Sonnet editor | ~25 min | Very high | ~$20 | Critical research, deep accuracy requirements |
Adaptive quality: Consider routing based on report complexity. A brief on “our company’s Q3 revenue targets” (simple, internal facts, single source) can use the 5-minute path. A brief on “geopolitical risk analysis for Southeast Asian expansion” routes to the full pipeline.
Async + streaming update: Since users don’t wait interactively, the latency number matters less than perceived responsiveness. Send status updates: “Researcher agents started… Fact-checking complete… Report draft being written…” This manages expectations and makes the experience feel faster.
Cost vs. quality inflection point: The primary cost driver is the Writer (long context input + long output). Using Sonnet vs. Haiku for the Writer adds ~0.50 and catches embarrassing errors — always worth it.
Step 9: Scale Considerations
50 reports/day (Current target)
The async pipeline handles this trivially. Each report takes ~15 minutes. With 10 concurrent pipelines (parallelism limit), 50 reports/day requires about 75 minutes of pipeline throughput — easily achievable in a 24-hour window.
Infrastructure: A few cloud functions or a small Kubernetes cluster. No dedicated infrastructure needed.
500 reports/day (10x growth)
New requirements:
- Dedicated worker pool for researcher agents (rate limits on Tavily search API become a concern)
- Redis-based job queue (BullMQ or similar) for proper work distribution
- Caching: if two reports are requested on similar topics within 24 hours, reuse researcher findings
- Rate limit management: Tavily allows ~1000 searches/minute by default — 500 reports × 30 searches = 15,000 searches/day, well within limits
5,000 reports/day (100x growth, serious scale)
Major changes:
- LLM cost becomes significant: 5,000 reports × 50,000/day — need to optimize hard
- Consider distilling the planner and researcher roles to smaller/cheaper models after collecting high-quality examples
- Implement aggressive deduplication: many reports on similar topics can share retrieved sources
- Build a source cache: if we fetched a URL’s content in the last 24 hours, use cached version instead of re-fetching
- Shard the pipeline across multiple cloud regions for throughput
Step 10: What I’d Do Differently With 6 More Months
Month 1–2: Source quality scoring
Not all web sources are equal. Build a source reputation database — penalize results from known low-quality domains, reward results from primary sources (government sites, peer-reviewed journals, established publications). The researcher agents currently treat all search results equally.
Month 2–3: User feedback integration
After a human reviews and edits a report, diff the original vs. final to learn: what did the human add? what did they remove? what did they rewrite? Use these edits as a training signal to improve the Writer’s first draft.
Month 3–4: Multi-modal research
Many important sources are PDFs (academic papers, regulatory documents, financial reports). The current system does basic text extraction. Build proper PDF parsing with table recognition, figure captioning, and structured data extraction.
Month 4–5: Domain specialization
Generalist research is mediocre at domain-specific topics. Build specialized researcher variants: a legal researcher with access to Westlaw/LexisNexis and trained on legal reasoning, a financial researcher with access to SEC filings and earnings calls, a technical researcher with access to patent databases and GitHub. Route the orchestrator to domain-specific researchers based on topic classification.
Month 6: Longitudinal research
The current system treats each report as independent. Add “living document” support: given an existing report, research what has changed since it was written and produce an update. This is much more valuable for business intelligence use cases than one-time static reports.