Module 03 References — AI Agents

Curated references for going deeper on agent architecture, tool use, and production systems.
Organized from foundational theory to practical implementation.


Foundational Papers

ReAct: Synergizing Reasoning and Acting in Language Models

https://arxiv.org/abs/2210.03629
Yao et al., 2022. The paper that introduced the ReAct pattern: interleaving reasoning traces
with actions in a single context. Read this before any other agent paper. The core insight
(Thought → Action → Observation → Thought) underpins virtually every production agent system.
Pay attention to Section 3 (Method) and the HotpotQA/Fever task traces which make the
pattern concrete.

Tool Learning with Foundation Models

https://arxiv.org/abs/2305.10601
Qin et al., 2023. Comprehensive survey covering how foundation models learn to use tools:
taxonomy of tools (perception, action, reasoning), training strategies, evaluation benchmarks.
Useful for understanding the design space — what counts as a “tool” and how models are trained
to use them reliably. Read the taxonomy sections (2–3) and the failure analysis.


Anthropic Documentation

Tool Use Overview

https://docs.anthropic.com/en/docs/build-with-claude/tool-use/overview
The authoritative reference for Claude’s tool use API: tool definitions, message structure,
tool_use and tool_result content blocks, parallel tool calls, error handling, and
best practices. Read this fully before implementing any tool-using agent. Key sections:
how to structure tool definitions, the message flow for tool results, and handling errors.

Agents Guide

https://docs.anthropic.com/en/docs/build-with-claude/agents
Anthropic’s guidance on building agentic systems with Claude: orchestrator vs. subagent roles,
multi-agent architectures, safety considerations, and best practices. Covers the “minimal
footprint” principle and how to think about agent authorization. Essential reading for
production agent design.

Computer Use (Advanced Tool Use)

https://docs.anthropic.com/en/docs/build-with-claude/tool-use/computer-use
Claude’s ability to control a computer via screen capture and keyboard/mouse tools. Demonstrates
the most extreme form of tool use — an agent operating a full desktop environment. Useful for
understanding the design principles of long-horizon agents operating in complex environments,
even if you don’t need computer use itself.


Framework & Concept References

LangGraph: Agentic Concepts

https://langchain-ai.github.io/langgraph/concepts/agentic_concepts
LangGraph’s documentation on agent architectures: the ReAct loop, plan-and-execute, reflection,
multi-agent systems, and human-in-the-loop patterns. Even if you don’t use LangGraph, this
reference provides excellent vocabulary and taxonomies for agent design patterns. Read the
“Agent types” section to understand the design space.


Supplementary Reading

Toolformer: Language Models Can Teach Themselves to Use Tools

https://arxiv.org/abs/2302.04761
Schick et al., Meta, 2023. How to train models to decide which tools to use and when via
self-supervised learning. Background on how modern tool-use capabilities were developed.

HuggingGPT / Taskmatrix: Solving AI Tasks with ChatGPT

https://arxiv.org/abs/2303.04671
Shen et al., 2023. Agent system where the LLM acts as a controller for specialized AI models.
Demonstrates multi-model orchestration — the LLM selects and coordinates domain-specific models
as “tools”. Useful for understanding multi-agent architectures.

WebGPT: Browser-Assisted Question-Answering

https://arxiv.org/abs/2112.09332
Nakano et al., OpenAI, 2021. Pre-ReAct work on training models to use a web browser for QA.
Historical context for how web-browsing agents were first approached.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

https://arxiv.org/abs/2201.11903
Wei et al., Google Brain, 2022. The foundational CoT paper. Required reading to understand
what ReAct is built on top of and why the comparison matters.


Production and Reliability

Evaluating Language Model Agents on Realistic Autonomous Tasks

https://arxiv.org/abs/2312.11671
Kinniment et al., 2023. Systematic evaluation of agent reliability on real-world tasks.
Key finding: agents fail in characteristic ways that are predictable and addressable.
Directly useful for building robust evaluation harnesses.

Gorilla: Large Language Model Connected with Massive APIs

https://arxiv.org/abs/2305.15334
Patil et al., 2023. Training LLMs to call APIs reliably. Focus on the evaluation methodology:
how do you measure whether a model is calling tools correctly at scale?


Blog Posts and Practical Guides

Anthropic’s Model Spec — Agentic and Multi-Agent Safety

https://www.anthropic.com/claude/model-spec
The values and principles built into Claude that affect how it behaves as an agent —
minimal footprint, deference to humans, refusing harmful tool use. Understanding the
model’s built-in safety dispositions is necessary for reliable agent design.

Building Production-Ready Agents (Anthropic Cookbook)

https://github.com/anthropics/anthropic-cookbook
Practical code examples from Anthropic covering tool use, agents, multi-agent systems,
and evaluation. The agents directory has production-quality patterns worth studying
directly.


Key Concepts Index

ConceptPrimary Reference
ReAct patternarxiv.org/abs/2210.03629
Tool use API (Claude)docs.anthropic.com tool-use/overview
Parallel tool callsdocs.anthropic.com tool-use/overview
Plan-and-executelangchain-ai.github.io/langgraph/concepts/agentic_concepts
Agent safety / footprintdocs.anthropic.com/en/docs/build-with-claude/agents
Computer usedocs.anthropic.com tool-use/computer-use
Tool learning surveyarxiv.org/abs/2305.10601
Multi-agent orchestrationarxiv.org/abs/2303.04671
CoT (ReAct prerequisite)arxiv.org/abs/2201.11903