Module 11 Exercises: AI Engineering Ecosystem
These exercises are designed to give you hands-on experience with the frameworks covered in the module — and, crucially, to help you develop opinions about them. The best outcome is not “I completed the exercise” but “I now know when I would and wouldn’t reach for this framework.”
Exercise 1: Rebuild the RAG Pipeline from Module 02 Using LangChain LCEL
Goal: Compare the complexity of a bare-SDK RAG pipeline versus an LCEL one.
Prerequisites: Complete Module 02 (or build a simple RAG pipeline using the bare Anthropic SDK and a vector database directly).
Task
- Take your existing Module 02 RAG pipeline (bare SDK + Chroma/pgvector).
- Rewrite it using LangChain LCEL:
  - Use ChatAnthropic as the LLM
  - Use Chroma or FAISS as the vector store
  - Build the chain: retriever | prompt | llm | StrOutputParser
  - Use RunnablePassthrough for the fan-in pattern
- Measure and compare:
| Metric | Bare SDK | LangChain LCEL |
|---|---|---|
| Lines of code | | |
| Time to first working version | | |
| Debugging difficulty (1-5) | | |
| Prompt visibility (can you see what’s sent?) | | |
| Time to add streaming | | |
- Write a short (3-5 sentence) opinion: For this specific pipeline, was LangChain worth it?
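For the lines-of-code row, a consistent counting rule matters more than the exact number. The following is a minimal sketch of one possible rule: count lines that are neither blank nor pure comments. The function names count_code_lines and count_file are hypothetical, and docstrings are counted as code under this rule.

```python
from pathlib import Path


def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure '#' comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def count_file(path: str) -> int:
    """Apply the same rule to a file on disk."""
    return count_code_lines(Path(path).read_text(encoding="utf-8"))


# Two code lines: the blank line and the pure comment are skipped,
# while a line with a trailing comment still counts.
sample = "import os\n\n# a comment\nprint(os.getcwd())  # trailing comment\n"
print(count_code_lines(sample))
```

Run count_file on both versions of your pipeline so the two numbers in the table come from the same rule.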
Starting Point
# Bare SDK RAG (from Module 02; your version may differ)
import anthropic
import chromadb

client = anthropic.Anthropic()
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("docs")

def rag_query(question: str) -> str:
    # Retrieve
    results = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(results["documents"][0])
    # Generate
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system="Answer using only the context provided.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text
Expected Insight
The LCEL version will likely be similar in line count for simple pipelines. The value emerges when you need to swap components (e.g., change the vector store) or add streaming. If the bare SDK version is already clean and working, the LCEL rewrite may not feel worth it.
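To see why swapping components is cheap in a pipe-composed chain, it helps to look at the composition mechanism in isolation. The sketch below is a toy stand-in that uses only the standard library; it is not LangChain's implementation. Step, parallel, fake_retriever, and fake_llm are hypothetical names. It mimics the usual shape of an LCEL RAG chain: a dict step fans the question out to a retriever and a passthrough, and the results fan back in at the prompt.

```python
from typing import Any, Callable


class Step:
    """Toy runnable: wraps a function and supports `a | b` composition."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # Run self first, then feed its output into `other`.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**branches: Step) -> Step:
    """Send the same input to every branch and collect the outputs in a dict."""
    return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})


passthrough = Step(lambda value: value)

# Hypothetical stand-ins for the real retriever and model.
fake_retriever = Step(lambda q: ["Chroma stores embeddings.", "FAISS is an index library."])
fake_llm = Step(lambda prompt_text: f"ANSWER based on: {prompt_text}")

prompt = Step(lambda d: "Context: " + " ".join(d["context"]) + " | Question: " + d["question"])
parser = Step(lambda text: text.strip())

chain = parallel(context=fake_retriever, question=passthrough) | prompt | fake_llm | parser

print(chain.invoke("What is FAISS?"))
```

Replacing fake_retriever with another Step changes one expression and leaves the rest of the chain untouched. That is the property to weigh against the extra abstraction when you write your opinion.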
Exercise 2: Add a Checkpointer to the LangGraph Agent
Goal: Implement persistent state in the LangGraph agent from examples/langgraph_agent.py, enabling pause/resume and multi-session support.
Part A: Sqlite Checkpointer
- Modify langgraph_agent.py to use SqliteSaver:
# In recent LangGraph releases, SqliteSaver ships in the separate
# langgraph-checkpoint-sqlite package.
from langchain_core.messages import HumanMessage
from langgraph.checkpoint.sqlite import SqliteSaver

# Keep compile() and invoke() inside the with block so the connection stays open.
with SqliteSaver.from_conn_string("agent_state.db") as checkpointer:
    app = graph_builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "session-001"}}
    result = app.invoke(
        {"messages": [HumanMessage(content="What is the weather in Paris?")]},
        config=config,
    )
- Run the agent twice with the same thread_id. Observe that the second run continues from where the first left off (the message history accumulates).
- Run the agent with a different thread_id. Observe that it starts fresh.
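The behavior to look for in the last two steps comes down to state being keyed by thread_id. The sketch below reproduces that idea with the standard-library sqlite3 module. It is a simplified illustration and not how SqliteSaver stores checkpoints; the table layout and the names ToyCheckpointer and run_turn are assumptions made for the example.

```python
import json
import sqlite3


class ToyCheckpointer:
    """Stores one message list per thread_id in a SQLite table."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS threads (thread_id TEXT PRIMARY KEY, messages TEXT)"
        )

    def load(self, thread_id: str) -> list[str]:
        row = self.conn.execute(
            "SELECT messages FROM threads WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def save(self, thread_id: str, messages: list[str]) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO threads (thread_id, messages) VALUES (?, ?)",
            (thread_id, json.dumps(messages)),
        )
        self.conn.commit()


def run_turn(cp: ToyCheckpointer, thread_id: str, user_message: str) -> list[str]:
    """Load prior history, append the new turn, and persist the result."""
    history = cp.load(thread_id)
    history.append(f"user: {user_message}")
    history.append(f"assistant: (reply to '{user_message}')")
    cp.save(thread_id, history)
    return history


cp = ToyCheckpointer()
run_turn(cp, "session-001", "What is the weather in Paris?")
second = run_turn(cp, "session-001", "And tomorrow?")
fresh = run_turn(cp, "session-002", "Hello")
print(len(second), len(fresh))  # 4 messages on the reused thread, 2 on the new one
```

Pass a file path instead of ":memory:" to see the history survive a process restart, which is the property the exercise asks you to observe.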
Part B: Human-in-the-Loop
- Compile the graph with interrupt_before=["execute_tools"]:
app = graph_builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute_tools"],
)
- Run the agent with a tool-calling message. The graph will pause before executing tools.
- Inspect the pending tool calls:
state = app.get_state(config)
pending_tools = state.values["messages"][-1].tool_calls
print("Pending tool calls:", pending_tools)
- Resume execution:
# Resume with None input (continue from where we left off)
result = app.invoke(None, config=config)
- Try modifying the tool call arguments before resuming:
from langchain_core.messages import AIMessage

# Get current state
state = app.get_state(config)
last_message = state.values["messages"][-1]
# Modify the tool call (e.g., change the city)
modified_tool_calls = [
    {**tc, "args": {"city": "Berlin"}}  # override city
    for tc in last_message.tool_calls
]
modified_message = AIMessage(
    content=last_message.content,
    tool_calls=modified_tool_calls,
    # Reuse the original id so that, if the messages key uses the add_messages
    # reducer, this message replaces the old one instead of being appended.
    id=last_message.id,
)
# Update state with modified message
app.update_state(config, {"messages": [modified_message]}, as_node="call_llm")
# Resume
result = app.invoke(None, config=config)
Part C: Reflection
Write answers to:
- Why does checkpointing matter for production agents (beyond just “it persists state”)?
- What is the difference between interrupt_before and interrupt_after? When would you use each?
- What would break if you used MemorySaver instead of SqliteSaver in a web server with multiple workers?
Exercise 3: Build a 2-Agent CrewAI Crew
Goal: Build a researcher + writer crew that produces a short report on a given topic. Develop an opinion about CrewAI’s ergonomics vs. LangGraph.
Setup
pip install crewai crewai-tools python-dotenv
Task
Build a crew with two agents:
- Research Analyst: searches for information on a topic (use SerperDevTool or a fake tool), produces a structured research summary.
- Technical Writer: takes the research summary and produces a polished 300-word report.
from crewai import Agent, Crew, Process, Task
from crewai_tools import SerperDevTool  # optional, requires SERPER_API_KEY
# Or define a fake search tool:
from crewai.tools import BaseTool

class FakeSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web for information on a topic."

    def _run(self, query: str) -> str:
        return f"[Fake search result for '{query}']: Found 5 articles discussing key aspects of {query}..."

# If model routing fails in your CrewAI version, try the provider-prefixed
# form of the model name, e.g. "anthropic/claude-haiku-4-5-20251001".
researcher = Agent(
    role="Research Analyst",
    goal="Find comprehensive, accurate information about {topic}",
    backstory=(
        "You are a senior research analyst with expertise in synthesizing "
        "information from multiple sources into clear, structured summaries."
    ),
    tools=[FakeSearchTool()],
    llm="claude-haiku-4-5-20251001",
    verbose=True,
)

writer = Agent(
    role="Technical Writer",
    goal="Transform research into engaging, clear reports",
    backstory=(
        "You are an experienced technical writer who specializes in making "
        "complex topics accessible. You produce well-structured, concise reports."
    ),
    llm="claude-haiku-4-5-20251001",
    verbose=True,
)

research_task = Task(
    description="Research the current state of {topic}. Identify key trends, challenges, and opportunities.",
    expected_output="A structured research summary with bullet points covering key findings.",
    agent=researcher,
)

writing_task = Task(
    description=(
        "Using the research summary, write a 300-word report on {topic}. "
        "The report should be suitable for a technical audience. "
        "Include an introduction, key findings, and a conclusion."
    ),
    expected_output="A 300-word professional report with clear structure.",
    agent=writer,
    context=[research_task],  # writer receives researcher's output
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True,
)

# The {topic} placeholders above are filled from the inputs dict.
result = crew.kickoff(inputs={"topic": "vector databases in production AI systems"})
print(result.raw)
Extension (optional)
- Switch to Process.hierarchical and add a manager LLM. Observe how the routing changes.
- Add a third “Editor” agent that reviews the report for accuracy and tone.
Reflection Questions
After completing the exercise, write answers to:
- How does the “role/goal/backstory” paradigm compare to LangGraph’s explicit node functions? Which gives you more confidence in production?
- The context=[research_task] parameter passes the previous task’s output to the next agent. What is the risk of this approach at scale?
- If you needed to add a conditional (“if the research quality is below a threshold, redo the research”), how would you implement this in CrewAI vs. LangGraph? Which is easier?
Exercise 4: Benchmark LangChain vs. Bare SDK
Goal: Measure the real-world differences between LangChain and the bare SDK for a simple, fixed task.
The Task
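Summarize 10 short paragraphs (1-3 sentences each) and return the results as a JSON array. Repeat the whole run several times to average out latency variance; the template below defaults to RUNS = 5, and you can raise it to 10 if API cost allows.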
Measurement Template
import time
import json
import statistics

# --- Version A: Bare Anthropic SDK ---
import anthropic

bare_client = anthropic.Anthropic()

def bare_sdk_summarize(texts: list[str]) -> list[str]:
    """Summarize a list of texts using the bare SDK."""
    results = []
    for text in texts:
        response = bare_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
            messages=[{"role": "user", "content": f"Summarize in one sentence: {text}"}],
        )
        results.append(response.content[0].text)
    return results

# --- Version B: LangChain LCEL ---
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

lc_llm = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=100)
lc_prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
lc_chain = lc_prompt | lc_llm | StrOutputParser()

def lc_summarize(texts: list[str]) -> list[str]:
    """Summarize a list of texts using LangChain batch."""
    # batch() runs inputs concurrently by default. Cap it at 1 so the comparison
    # with the sequential bare-SDK loop measures framework overhead, not parallelism.
    return lc_chain.batch([{"text": t} for t in texts], config={"max_concurrency": 1})

# --- Benchmark harness ---
SAMPLE_TEXTS = [
    "Large language models are neural networks trained on vast corpora of text data.",
    "Vector databases store high-dimensional embeddings for semantic similarity search.",
    "Retrieval-augmented generation combines vector search with LLM generation.",
    "LangChain provides abstractions for building composable LLM pipelines.",
    "Fine-tuning adjusts a pre-trained model's weights on a task-specific dataset.",
    "Prompt engineering shapes LLM behavior without changing model weights.",
    "Chain-of-thought prompting improves reasoning by asking models to think step-by-step.",
    "Tool calling allows LLMs to invoke external functions and APIs.",
    "Embeddings are dense vector representations that capture semantic meaning.",
    "Agents use LLMs to plan and execute multi-step tasks with tools.",
]
RUNS = 5  # reduce if API costs are a concern

def benchmark(fn, name, runs=RUNS):
    latencies = []
    for i in range(runs):
        start = time.perf_counter()
        result = fn(SAMPLE_TEXTS)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        # Outside the timed region: confirm the output serializes as a JSON array.
        json.dumps(result)
        print(f"  {name} run {i+1}: {elapsed:.2f}s")
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0,
    }

print("Benchmarking bare SDK...")
bare_stats = benchmark(bare_sdk_summarize, "bare_sdk")
print("\nBenchmarking LangChain LCEL...")
lc_stats = benchmark(lc_summarize, "langchain")
print("\nResults:")
for name, stats in [("Bare SDK", bare_stats), ("LangChain", lc_stats)]:
    print(f"  {name}: mean={stats['mean']:.2f}s, median={stats['median']:.2f}s, stdev={stats['stdev']:.2f}s")
What to Measure
| Metric | Bare SDK | LangChain LCEL |
|---|---|---|
| Lines of code | | |
| Mean latency (s) | | |
| Median latency (s) | | |
| Latency overhead vs bare SDK | N/A | |
| Number of dependencies | | |
| Cold start time (import time) | | |
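The “Latency overhead vs bare SDK” row is a derived value. Below is a minimal sketch of one way to compute it, assuming stats dicts shaped like the ones benchmark() returns. The function name latency_overhead is hypothetical, and the sample numbers are placeholders, not measurements.

```python
def latency_overhead(baseline: dict, candidate: dict, key: str = "median") -> dict:
    """Absolute and relative overhead of `candidate` versus `baseline`."""
    base = baseline[key]
    diff = candidate[key] - base
    return {
        "absolute_s": diff,
        "relative_pct": (diff / base) * 100 if base else float("nan"),
    }


# Placeholder numbers for illustration only; substitute your own results.
example_bare = {"mean": 4.10, "median": 4.00, "stdev": 0.20}
example_lc = {"mean": 4.35, "median": 4.20, "stdev": 0.25}

overhead = latency_overhead(example_bare, example_lc)
print(f"{overhead['absolute_s']:.2f}s ({overhead['relative_pct']:.1f}%)")
```

The median is used by default because a single slow API call skews the mean. If the difference is smaller than the run-to-run standard deviation, treat it as noise rather than overhead.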
To measure import time:
time python -c "import anthropic"
time python -c "from langchain_anthropic import ChatAnthropic"
Expected Findings
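- LangChain overhead: expect some added latency per batch from the abstraction layers (50-200 ms is a rough expectation, not a guarantee). For a high-throughput service, this matters. Keep in mind that batch() runs inputs concurrently unless max_concurrency is capped, so an uncapped batch can beat the sequential bare-SDK loop and hide the overhead.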
- Import time: LangChain is significantly slower to import than the bare SDK — relevant for Lambda functions and serverless environments.
- Code lines: comparable for simple cases. LangChain does not save significant code for tasks this simple.
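A single shell timing includes interpreter startup and varies from run to run. The sketch below is one way to make the import-time comparison repeatable: it launches a fresh interpreter several times and subtracts a bare-startup baseline. The helper name import_time_seconds is hypothetical, and standard-library modules are used as examples so the snippet runs without the SDKs installed.

```python
import subprocess
import sys
import time


def import_time_seconds(statement: str, repeats: int = 3) -> float:
    """Best-of-N wall time for running `statement` in a fresh interpreter."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", statement], check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Interpreter startup alone, used as the baseline to subtract.
baseline = import_time_seconds("pass")

# Swap these for "import anthropic" and
# "from langchain_anthropic import ChatAnthropic" in your own run.
for stmt in ["import json", "import decimal"]:
    net = import_time_seconds(stmt) - baseline
    print(f"{stmt!r}: about {net * 1000:.0f} ms above interpreter startup")
```

For a per-module breakdown of where the time goes, python -X importtime -c "..." prints the cumulative import cost of each module.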
Reflection
Write a 3-sentence conclusion: for this specific task (batch summarization), which approach would you recommend, and why?
Exercise 5: Interview Simulation — Framework Selection for a Law Firm
Goal: Practice the framework selection reasoning that interviewers test. Write a structured response as if you are in a technical interview.
Scenario
You are interviewing for a senior AI engineer role at a legal technology company. The interviewer says:
“Your team is building a document Q&A system for a law firm. The system needs to answer questions over a corpus of case files — PDFs, Word documents, and deposition transcripts. Documents can be up to 500 pages long. Lawyers need to be able to cite the exact passage that informed each answer. The system will be used by 50 lawyers who are not technical. Accuracy is critical — a wrong answer could be malpractice. We need this production-ready in 3 months.
Which framework would you choose and why? What are the risks?”
Structure Your Answer
Write a response with these sections:
1. Clarifying questions I would ask (2-3 questions before committing to a choice)
- What does “production-ready” mean — SLA, uptime, audit requirements?
- Is on-premises deployment required (some law firms prohibit cloud storage of client data)?
- What is the expected query volume?
2. Framework recommendation
State your choice and justify it. Consider:
- Why LlamaIndex over LangChain for this use case?
- Why the bare SDK might be appropriate for parts of the pipeline
- What vector database you would choose (pgvector for on-prem? Pinecone for cloud?)
- How you would handle the 500-page document challenge (chunking strategy)
- How you would implement citation / source attribution
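Citation support is easiest when every chunk carries its source location from ingestion onward. The sketch below assumes the document parser yields one text string per page. Passage and chunk_pages are hypothetical names, the file name is made up, and the fixed word window is a simple stand-in for a sentence-aware splitter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    document: str
    page: int  # 1-based page number, kept for citations
    text: str


def chunk_pages(document: str, pages: list[str], max_words: int = 120,
                overlap: int = 20) -> list[Passage]:
    """Split each page into overlapping word windows tagged with their page."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    passages = []
    step = max_words - overlap
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            passages.append(Passage(document, page_number, " ".join(window)))
            if start + max_words >= len(words):
                break  # the last window already reached the end of the page
    return passages


pages = ["alpha " * 150, "The witness stated the contract was signed in March."]
chunks = chunk_pages("deposition_smith.pdf", pages)
for c in chunks:
    print(c.document, c.page, len(c.text.split()))
```

Chunks in this sketch never cross a page boundary, which keeps page citations exact but can split an argument that runs over a page break. That trade-off is worth naming in your answer.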
3. Architecture sketch (text-based)
[PDF/DOCX/TXT ingestion]
  → Document parser (pdfplumber, python-docx)
  → LlamaIndex SentenceWindowNodeParser (preserves citation context)
  → Embedding (text-embedding-3-small or Voyage Law)
  → pgvector (on-prem) or Pinecone (cloud)
[Query time]
User question
  → Query rewriting (expand legal terms, abbreviations)
  → Hybrid retrieval (semantic + BM25 keyword)
  → Cohere Rerank (cross-encoder reranking)
  → LlamaIndex ResponseSynthesizer with source_nodes
  → Answer + cited passages (passage, document, page number)
  → Hallucination check (does answer contradict retrieved passages?)
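The hybrid retrieval step needs a way to merge two ranked lists whose scores are not directly comparable. One common choice is reciprocal rank fusion (RRF), which uses only ranks. The sketch below is an assumption about how the merge could be done, not something the architecture above prescribes. The passage IDs are made up, and k=60 is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


semantic_hits = ["p12", "p07", "p33"]  # hypothetical vector-search ranking
bm25_hits = ["p07", "p41", "p12"]      # hypothetical keyword ranking

fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
print([item_id for item_id, _ in fused])
```

Passages that appear in both lists rise to the top, and the fused list is what would be passed to the reranker.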
4. Risks and mitigations (3-5 specific risks)
Example format:
- Risk: LLM hallucination could produce legally incorrect answers.
  Mitigation: Every answer must cite a retrieved passage. Implement a faithfulness check: use a second LLM call to verify that the answer is supported by the cited passages. Flag low-confidence answers for human review.
- Risk: [Your risk here]
  Mitigation: [Your mitigation here]
5. What I would NOT use and why
Be explicit about rejected options:
- “I would not use the OpenAI Assistants API because…”
- “I would not use LangChain’s RetrievalQA because…”
- “I would not use CrewAI because…”
Evaluation Criteria
A strong answer:
- Asks clarifying questions before committing to a framework
- Justifies the choice in terms of the specific constraints (accuracy, citation, 500-page docs, 3-month timeline)
- Addresses risks with specific, technical mitigations — not just “we’ll test it”
- Is explicit about what NOT to use and why
- Shows awareness of the trade-off between time-to-delivery and long-term maintainability
A weak answer:
- Immediately picks a framework without asking questions
- Justifies the choice because “it’s popular” or “I know it well”
- Does not mention hallucination risk in a legal context
- Does not explain how citation/source attribution works technically
Submission Checklist
- Exercise 1: Completed comparison table + 3-5 sentence opinion
- Exercise 2: Working agent with SqliteSaver + written answers to reflection questions
- Exercise 3: Working CrewAI crew + written answers to reflection questions
- Exercise 4: Completed benchmark table + 3-sentence conclusion
- Exercise 5: Written interview response with all five sections
The goal is not perfect code. The goal is a clear opinion, backed by evidence from your own experiments.