Module 03 References — AI Agents
Curated references for going deeper on agent architecture, tool use, and production systems.
Organized from foundational theory to practical implementation.
Foundational Papers
ReAct: Synergizing Reasoning and Acting in Language Models
https://arxiv.org/abs/2210.03629
Yao et al., 2022. The paper that introduced the ReAct pattern: interleaving reasoning traces
with actions in a single context. Read this before any other agent paper. The core insight
(Thought → Action → Observation → Thought) underpins virtually every production agent system.
Pay attention to Section 3 (Method) and the HotpotQA/Fever task traces which make the
pattern concrete.
Tool Learning with Foundation Models
https://arxiv.org/abs/2305.10601
Qin et al., 2023. Comprehensive survey covering how foundation models learn to use tools:
taxonomy of tools (perception, action, reasoning), training strategies, evaluation benchmarks.
Useful for understanding the design space — what counts as a “tool” and how models are trained
to use them reliably. Read the taxonomy sections (2–3) and the failure analysis.
Anthropic Documentation
Tool Use Overview
https://docs.anthropic.com/en/docs/build-with-claude/tool-use/overview
The authoritative reference for Claude’s tool use API: tool definitions, message structure,
tool_use and tool_result content blocks, parallel tool calls, error handling, and
best practices. Read this fully before implementing any tool-using agent. Key sections:
how to structure tool definitions, the message flow for tool results, and handling errors.
Agents Guide
https://docs.anthropic.com/en/docs/build-with-claude/agents
Anthropic’s guidance on building agentic systems with Claude: orchestrator vs. subagent roles,
multi-agent architectures, safety considerations, and best practices. Covers the “minimal
footprint” principle and how to think about agent authorization. Essential reading for
production agent design.
Computer Use (Advanced Tool Use)
https://docs.anthropic.com/en/docs/build-with-claude/tool-use/computer-use
Claude’s ability to control a computer via screen capture and keyboard/mouse tools. Demonstrates
the most extreme form of tool use — an agent operating a full desktop environment. Useful for
understanding the design principles of long-horizon agents operating in complex environments,
even if you don’t need computer use itself.
Framework & Concept References
LangGraph: Agentic Concepts
https://langchain-ai.github.io/langgraph/concepts/agentic_concepts
LangGraph’s documentation on agent architectures: the ReAct loop, plan-and-execute, reflection,
multi-agent systems, and human-in-the-loop patterns. Even if you don’t use LangGraph, this
reference provides excellent vocabulary and taxonomies for agent design patterns. Read the
“Agent types” section to understand the design space.
Supplementary Reading
Toolformer: Language Models Can Teach Themselves to Use Tools
https://arxiv.org/abs/2302.04761
Schick et al., Meta, 2023. How to train models to decide which tools to use and when via
self-supervised learning. Background on how modern tool-use capabilities were developed.
HuggingGPT / Taskmatrix: Solving AI Tasks with ChatGPT
https://arxiv.org/abs/2303.04671
Shen et al., 2023. Agent system where the LLM acts as a controller for specialized AI models.
Demonstrates multi-model orchestration — the LLM selects and coordinates domain-specific models
as “tools”. Useful for understanding multi-agent architectures.
WebGPT: Browser-Assisted Question-Answering
https://arxiv.org/abs/2112.09332
Nakano et al., OpenAI, 2021. Pre-ReAct work on training models to use a web browser for QA.
Historical context for how web-browsing agents were first approached.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
Wei et al., Google Brain, 2022. The foundational CoT paper. Required reading to understand
what ReAct is built on top of and why the comparison matters.
Production and Reliability
Evaluating Language Model Agents on Realistic Autonomous Tasks
https://arxiv.org/abs/2312.11671
Kinniment et al., 2023. Systematic evaluation of agent reliability on real-world tasks.
Key finding: agents fail in characteristic ways that are predictable and addressable.
Directly useful for building robust evaluation harnesses.
Gorilla: Large Language Model Connected with Massive APIs
https://arxiv.org/abs/2305.15334
Patil et al., 2023. Training LLMs to call APIs reliably. Focus on the evaluation methodology:
how do you measure whether a model is calling tools correctly at scale?
Blog Posts and Practical Guides
Anthropic’s Model Spec — Agentic and Multi-Agent Safety
https://www.anthropic.com/claude/model-spec
The values and principles built into Claude that affect how it behaves as an agent —
minimal footprint, deference to humans, refusing harmful tool use. Understanding the
model’s built-in safety dispositions is necessary for reliable agent design.
Building Production-Ready Agents (Anthropic Cookbook)
https://github.com/anthropics/anthropic-cookbook
Practical code examples from Anthropic covering tool use, agents, multi-agent systems,
and evaluation. The agents directory has production-quality patterns worth studying
directly.
Key Concepts Index
| Concept | Primary Reference |
|---|---|
| ReAct pattern | arxiv.org/abs/2210.03629 |
| Tool use API (Claude) | docs.anthropic.com tool-use/overview |
| Parallel tool calls | docs.anthropic.com tool-use/overview |
| Plan-and-execute | langchain-ai.github.io/langgraph/concepts/agentic_concepts |
| Agent safety / footprint | docs.anthropic.com/en/docs/build-with-claude/agents |
| Computer use | docs.anthropic.com tool-use/computer-use |
| Tool learning survey | arxiv.org/abs/2305.10601 |
| Multi-agent orchestration | arxiv.org/abs/2303.04671 |
| CoT (ReAct prerequisite) | arxiv.org/abs/2201.11903 |