Module 03 Exercises — AI Agents
These exercises build on the concepts in ../README.md and the examples in ../examples/.
Work through them in order — each one builds on the previous.
Exercise 1: Add a search_web Tool
Goal: Extend react_agent.py with a mocked web search tool, search_web, and test it with three different queries.
What to build:
- Add a search_web(query: str, num_results: int = 5) -> str function that returns realistic-looking mocked results.
- Create a dictionary of at least 10 canned responses covering different domains (tech, science, sports, current events).
- Register the function in the tool registry and add an Anthropic tool definition with a detailed description and input schema.
Test queries to run:
- "What is the current version of Python and when was it released?"
- "Who is the CEO of Anthropic and what is the company known for?"
- "What is the difference between RAG and fine-tuning for LLMs?"
What to observe:
- Does the agent correctly decide to use search_web vs search_wikipedia vs calculator?
- Does it combine multiple tool calls to answer a complex question?
- What happens when a query matches no canned result?
Stretch goal: Add a recent_news(topic: str) -> str tool and test with time-sensitive queries. Notice when the agent chooses news vs. Wikipedia.
Key things to get right:
- The tool description should clearly state when to use search_web vs search_wikipedia
- Return structured results: “Result 1: [title] — [snippet]\nResult 2: …”
- Handle the case where num_results exceeds the number of canned results available
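One possible shape for the mocked tool is sketched below. The canned entries, the whole-word keyword matching, and the fallback message are illustrative assumptions, and only two of the ten-plus entries are shown. Registration is omitted because the registry in react_agent.py is not shown here. The SEARCH_WEB_TOOL dictionary uses the name, description, and input_schema fields of an Anthropic tool definition.

```python
import re

# Hypothetical canned data: keyword -> list of (title, snippet) pairs.
# Only two entries are shown; the exercise asks for at least ten.
CANNED_RESULTS: dict[str, list[tuple[str, str]]] = {
    "rag": [
        ("RAG vs fine-tuning", "RAG retrieves documents at query time; fine-tuning updates model weights."),
        ("When to use RAG", "Retrieval suits knowledge that changes often."),
    ],
    "marathon": [
        ("Marathon basics", "A marathon is a road race of 42.195 km."),
    ],
}


def search_web(query: str, num_results: int = 5) -> str:
    """Return mocked results for the first canned keyword that appears in the query."""
    # Match whole words so that "rag" does not fire on "storage".
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    for keyword, results in CANNED_RESULTS.items():
        if keyword in words:
            # Slicing never raises, so asking for more results than exist is safe;
            # max() guards against negative values.
            selected = results[: max(num_results, 0)]
            # "\u2014" is the dash used in the result format described above.
            return "\n".join(
                f"Result {i}: {title} \u2014 {snippet}"
                for i, (title, snippet) in enumerate(selected, start=1)
            )
    # Assumed fallback wording for queries that match no canned result.
    return f"No results found for: {query}"


# Tool definition with the name / description / input_schema fields.
SEARCH_WEB_TOOL = {
    "name": "search_web",
    "description": (
        "Search the web for recent or general information. Use this for current "
        "events and product details; use search_wikipedia for encyclopedic background."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query."},
            "num_results": {
                "type": "integer",
                "description": "Maximum number of results to return.",
                "default": 5,
            },
        },
        "required": ["query"],
    },
}
```

Returning an explicit "no results" string, rather than an empty string, gives the agent something to reason about when a query matches nothing.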
Exercise 2: Step Counter and Graceful Stopping
Goal: Implement a step counter with a progress indicator, and make the agent produce a useful partial answer when the step budget runs out.
What to build:
Currently react_agent.py returns a generic “max steps reached” message when the budget is exhausted. That’s not useful to the user — they have no idea what the agent found before stopping.
Modify run_react_agent() to:
- Track a StepTracker with: current step, max steps, steps remaining, list of observations so far
- Print a progress indicator before each step: [Step 3/10 | 7 remaining]
- When the max step limit is hit, instead of returning a generic message, make one final LLM call with this prompt:
You ran out of steps before completing the task. Here is what you found so far:
{list_of_observations}
Based on what you found, provide the most useful partial answer you can.
Be explicit that this is incomplete. Suggest what the user should ask next
to get the rest of the answer.
- Return this graceful summary instead of the generic error message.
Test it with a deliberately complex query that will exhaust the budget at max_steps=3:
"Compare the GDP, population, and capital cities of France, Germany, Italy, Spain, and Portugal"
What to observe:
- Does the partial summary correctly reflect what was actually found?
- Does the “what to ask next” suggestion make sense?
- How does the quality of the partial answer compare to just stopping silently?
Key implementation note: Store observations as a list of (step, tool_name, result) tuples. The graceful stop prompt should format these clearly.
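The tracker described above could look something like the following minimal sketch. The method names (start_step, record, exhausted, format_observations) and the bullet layout of the formatted observations are assumptions for illustration; how it plugs into run_react_agent() depends on that function's loop, which is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class StepTracker:
    """Tracks the step budget and the observations gathered so far."""

    max_steps: int
    current_step: int = 0
    # Each observation is a (step, tool_name, result) tuple.
    observations: list[tuple[int, str, str]] = field(default_factory=list)

    @property
    def steps_remaining(self) -> int:
        return self.max_steps - self.current_step

    def start_step(self) -> str:
        """Advance the counter and return the progress line to print."""
        self.current_step += 1
        return f"[Step {self.current_step}/{self.max_steps} | {self.steps_remaining} remaining]"

    def record(self, tool_name: str, result: str) -> None:
        """Store a tool result against the current step."""
        self.observations.append((self.current_step, tool_name, result))

    def exhausted(self) -> bool:
        return self.current_step >= self.max_steps

    def format_observations(self) -> str:
        """Render observations for the {list_of_observations} slot of the stop prompt."""
        if not self.observations:
            return "(no observations recorded)"
        return "\n".join(
            f"- Step {step} [{tool}]: {result}"
            for step, tool, result in self.observations
        )
```

In the agent loop, print(tracker.start_step()) would run before each step, and tracker.exhausted() would decide when to make the final summarizing call.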
Exercise 3: Reflection Loop
Goal: Add a self-evaluation step that rates the agent’s own answer and retries if the score is below a threshold.
What to build:
After the agent produces its final answer, run a second LLM call as a “critic” that evaluates the answer and assigns a score.
The reflection prompt should be:
You are a quality evaluator for AI-generated answers.
Original task: {original_task}
Agent's answer: {agent_answer}
Evaluate this answer on the following dimensions:
1. Completeness (1-5): Does it fully answer the question?
2. Accuracy (1-5): Is the information factually correct and grounded?
3. Clarity (1-5): Is it clearly written and easy to understand?
Output ONLY a JSON object in this exact format:
{
"completeness": <1-5>,
"accuracy": <1-5>,
"clarity": <1-5>,
"overall": <average of the three>,
"issues": ["issue 1", "issue 2"],
"suggestions": ["improvement 1", "improvement 2"]
}
If overall < 3.5, run the agent again with the original task plus a hint:
{original_task}
Note: A previous attempt received this feedback:
Issues found: {issues}
Please address these specific issues in your answer: {suggestions}
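The parsing and retry-hint steps might be sketched as below. The helper names parse_evaluation and build_retry_task are hypothetical, and recomputing "overall" locally instead of trusting the model's arithmetic is a design assumption, not something the exercise requires.

```python
import json


def parse_evaluation(raw: str) -> dict:
    """Parse the critic's JSON reply and recompute the overall score locally."""
    # Models sometimes wrap JSON in extra text, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in evaluator output")
    evaluation = json.loads(raw[start : end + 1])
    scores = [evaluation[key] for key in ("completeness", "accuracy", "clarity")]
    # Recompute the average rather than rely on the model's own calculation.
    evaluation["overall"] = sum(scores) / len(scores)
    return evaluation


def build_retry_task(original_task: str, evaluation: dict) -> str:
    """Fill the retry hint template with the critic's feedback."""
    issues = "; ".join(evaluation.get("issues", []))
    suggestions = "; ".join(evaluation.get("suggestions", []))
    return (
        f"{original_task}\n"
        "Note: A previous attempt received this feedback:\n"
        f"Issues found: {issues}\n"
        f"Please address these specific issues in your answer: {suggestions}"
    )
```

A retry would then be triggered when parse_evaluation(...)["overall"] falls below the threshold, up to the retry limit.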
Implementation structure:
def run_agent_with_reflection(
    task: str,
    max_steps: int = 10,
    quality_threshold: float = 3.5,
    max_retries: int = 2
) -> tuple[str, dict]:
    """Returns (final_answer, evaluation_result)"""
    ...
Test with these queries:
- A simple question (should score highly on first attempt)
- An ambiguous question that could be interpreted multiple ways (may need retry)
- A question that requires multiple tools (tests completeness)
What to observe:
- How consistent is the evaluator’s scoring? Run the same answer through the evaluator 3 times.
- Does the hint actually improve the retry’s quality?
- Are there cases where the evaluator is wrong (scores a good answer poorly)?
Stretch goal: Parse the JSON evaluation and visualize it as a simple text table showing before/after scores when a retry happens.
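For the stretch goal, a fixed-width table is probably enough. The sketch below assumes both evaluations are already parsed into dictionaries with numeric completeness, accuracy, clarity, and overall keys; the function name and column layout are illustrative.

```python
def format_score_table(before: dict, after: dict) -> str:
    """Render before/after evaluation scores as a fixed-width text table."""
    rows = ["completeness", "accuracy", "clarity", "overall"]
    lines = [f"{'dimension':<14}{'before':>8}{'after':>8}{'change':>8}"]
    for key in rows:
        delta = after[key] - before[key]
        # The "+" flag prints an explicit sign so improvements and regressions stand out.
        lines.append(f"{key:<14}{before[key]:>8.2f}{after[key]:>8.2f}{delta:>+8.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    first = {"completeness": 2, "accuracy": 4, "clarity": 3, "overall": 3.0}
    retry = {"completeness": 4, "accuracy": 4, "clarity": 4, "overall": 4.0}
    print(format_score_table(first, retry))
```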
Exercise 4: Plan-and-Execute
Goal: Implement the two-phase plan-and-execute pattern and compare it to the standard ReAct agent.
What to build:
Phase 1 — Planner:
def generate_plan(task: str) -> list[str]:
    """
    Call the LLM to generate a numbered execution plan.
    Returns a list of step descriptions.
    """
Use this system prompt for the planner:
You are a planning assistant. Given a task, produce a clear numbered plan of steps.
Each step should be:
- Specific and actionable
- Small enough to accomplish in a single tool call or reasoning step
- Ordered logically (earlier steps provide context for later ones)
Output ONLY the numbered list, one step per line. No explanations, no preamble.
Example format:
1. Search for X
2. Calculate Y using the result from step 1
3. Look up Z
4. Synthesize findings into a final answer
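Because generate_plan() returns list[str], the planner's numbered output has to be split into steps. A minimal sketch follows; the helper name parse_plan is hypothetical, and accepting both "1." and "1)" numbering is an assumption made to tolerate small deviations from the requested format.

```python
import re

# Matches lines such as "1. Search for X" or "2) Calculate Y".
_STEP_PATTERN = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def parse_plan(raw_plan: str) -> list[str]:
    """Extract step descriptions from a numbered list, ignoring any other lines."""
    steps = []
    for line in raw_plan.splitlines():
        match = _STEP_PATTERN.match(line)
        if match:
            steps.append(match.group(1))
    return steps
```

Skipping non-matching lines means a stray preamble from the model does not end up as a bogus step. An empty result is a useful signal that the planner call should be retried.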
Phase 2 — Executor:
def execute_plan(plan: list[str], task: str) -> str:
    """
    Execute the plan step by step, accumulating context.
    For each step, call the LLM with the full plan visible,
    the step to execute, and the results of all previous steps.
    """
For each step, build a message like:
Original task: {task}
Full plan:
{numbered_plan}
Steps completed so far:
{previous_results}
Current step to execute: Step {n}: {step_description}
Execute this step now. Use tools if needed. Output only the result of this step.
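The per-step message above can be assembled with a small helper. This is a sketch under assumptions: the name build_step_message, the 0-based step_index, and the "(none yet)" placeholder for the first step are all illustrative choices.

```python
def build_step_message(task: str, plan: list[str], results: list[str], step_index: int) -> str:
    """Build the executor message for plan[step_index] (0-based)."""
    numbered_plan = "\n".join(f"{i}. {step}" for i, step in enumerate(plan, start=1))
    if results:
        previous = "\n".join(
            f"Step {i}: {result}" for i, result in enumerate(results, start=1)
        )
    else:
        # Nothing has run yet on the first step.
        previous = "(none yet)"
    n = step_index + 1
    return (
        f"Original task: {task}\n"
        f"Full plan:\n{numbered_plan}\n"
        f"Steps completed so far:\n{previous}\n"
        f"Current step to execute: Step {n}: {plan[step_index]}\n"
        "Execute this step now. Use tools if needed. Output only the result of this step."
    )
```

Inside execute_plan(), each step's result would be appended to results before the next message is built, so later steps see everything earlier steps produced.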
Test with:
- "Research the history of the Python programming language and write a 3-paragraph summary"
- "Find the area of a circle with radius equal to the square root of the year Python was first released"
Comparison exercise: Run both the standard run_react_agent() and your plan_and_execute() on the same task. Print both outputs side-by-side and note:
- Which used more steps?
- Which produced a more complete answer?
- Which was easier to debug?
What to observe:
- The plan generated in Phase 1 may not perfectly fit what Phase 2 discovers. How does the executor handle it when a step’s assumption is wrong?
- Does the plan help or hurt on simple tasks?
Exercise 5: Interview Simulation — Design an Internal Docs Agent
Goal: Practice the kind of open-ended system design question asked in AI engineering interviews.
The question:
“Design an agent that can answer questions about a company’s internal documentation (wiki pages, engineering RFCs, HR policies, etc.). The documentation corpus is ~50,000 documents and grows weekly. What tools does the agent need, what can go wrong, and how would you test it?”
Your task: Write a structured answer covering the sections below. This is a written exercise — produce a design document, not code.
Section A: Tool Design
List the tools the agent needs. For each tool, specify:
- Tool name (verb_noun format)
- What it does in one sentence
- Key input parameters (name, type, description)
- When the agent should use it vs. a different tool
- Failure modes specific to this tool
Minimum tools to cover:
- Document search / retrieval
- Document reading (full text)
- Answer synthesis check (optional but useful)
- Cross-reference / link following
Section B: What Can Go Wrong
Enumerate at least 8 failure modes specific to an internal docs agent. For each, describe:
- What the failure looks like to the user
- What causes it under the hood
- One concrete mitigation
Example format:
Failure: Agent returns outdated policy information
Cause: Retrieved document is from 2 years ago; newer policy exists
Mitigation: Index documents with last-modified date; include date in retrieval results;
instruct agent to note document age and recommend verification for time-sensitive info
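Although the deliverable here is a design document, the mitigation in this example is easy to picture in code. The sketch below is purely illustrative: the function name, the one-year staleness threshold, and the warning wording are assumptions, not part of the exercise.

```python
from datetime import date


def annotate_result(title: str, snippet: str, last_modified: date, today: date,
                    stale_after_days: int = 365) -> str:
    """Format one retrieval result with its age so the agent can flag stale documents."""
    age_days = (today - last_modified).days
    line = f"{title} (last modified {last_modified.isoformat()}, {age_days} days ago): {snippet}"
    if age_days > stale_after_days:
        # Assumed wording; the point is that the age reaches the model's context.
        line += " [WARNING: possibly outdated; recommend verification]"
    return line
```

Passing today explicitly, rather than calling date.today() inside the function, keeps the helper deterministic and easy to unit test.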
Topics to cover in your failure modes:
- Retrieval failures (wrong documents, low recall)
- Confidentiality (agent returns docs the user shouldn’t see)
- Hallucination (agent makes up content not in any document)
- Stale data (documents not updated in the index)
- Query understanding (user asks in jargon the index doesn’t understand)
- Multi-document synthesis errors
- Context overflow (many retrieved docs fill context window)
- Malicious content in docs (prompt injection via document content)
Section C: Testing Strategy
Describe how you would test this agent before deploying it to 500 engineers. Cover:
- Unit tests for each tool function in isolation
- Integration tests for the full agent loop on known queries
- Evals — how do you measure answer quality at scale?
- Adversarial tests — what malicious or edge-case inputs would you test?
- Regression tests — what do you track over time as the system evolves?
- Human evaluation — who evaluates and using what rubric?
Section D: Production Architecture
Sketch the production architecture (as a text diagram or bullet list covering each component):
- What retrieval system backs the search_documents tool?
- How are documents kept up to date in the index?
- How do you handle access control (user A should not see documents user B owns)?
- How do you observe and debug production agent runs?
- What does the deployment pipeline look like?
Grading yourself
After writing your answer, check it against these criteria:
- Did you name all tools using verb_noun convention with clear descriptions?
- Did you identify at least one failure mode that isn’t on the list above (novel thinking)?
- Is your testing strategy concrete enough that an engineer could implement it?
- Did you address the “50,000 documents, growing weekly” constraint in your architecture?
- Did you address access control — a common omission?
This question is representative of what you’ll encounter at companies like Anthropic, OpenAI, Google DeepMind, and AI-forward startups in a Staff or Senior Engineer interview for an AI/ML systems role.