Module 11 Exercises: AI Engineering Ecosystem

These exercises are designed to give you hands-on experience with the frameworks covered in the module — and, crucially, to help you develop opinions about them. The best outcome is not “I completed the exercise” but “I now know when I would and wouldn’t reach for this framework.”

Exercise 1: Rebuild the RAG Pipeline from Module 02 Using LangChain LCEL

Goal: Compare the complexity of a bare-SDK RAG pipeline versus an LCEL one.

Prerequisites: Complete Module 02 (or build a simple RAG pipeline using the bare Anthropic SDK and a vector database directly).

Task

Take your existing Module 02 RAG pipeline (bare SDK + Chroma/pgvector).
Rewrite it using LangChain LCEL:
- Use ChatAnthropic as the LLM
- Use Chroma or FAISS as the vector store
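- Build the chain: {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
- The leading dict is the fan-in pattern: it runs the retriever and passes the question through unchanged via RunnablePassthrough, so the prompt receives both values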
Measure and compare:

Metric	Bare SDK	LangChain LCEL
Lines of code
Time to first working version
Debugging difficulty (1-5)
Prompt visibility (can you see what’s sent?)
Time to add streaming

Write a short (3-5 sentence) opinion: For this specific pipeline, was LangChain worth it?
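For the lines-of-code row, counting by eye tends to be inconsistent between the two versions. A minimal sketch of a counter follows, assuming you only want to exclude blank lines and comment-only lines (docstrings still count as code here). The file names in the final comment are hypothetical.

```python
from pathlib import Path


def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor comment-only."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def count_file(path: str) -> int:
    """Count code lines in a file on disk."""
    return count_code_lines(Path(path).read_text(encoding="utf-8"))


# Hypothetical file names; substitute your own.
# print(count_file("rag_bare.py"), count_file("rag_lcel.py"))
```

Apply the same counting rule to both versions so the comparison is fair.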

Starting Point

# Bare SDK RAG (from Module 02 — your version may differ)
import anthropic
import chromadb
 
client = anthropic.Anthropic()
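chroma_client = chromadb.Client()  # in-memory client; contents are lost when the process exits
# Assumes documents were added earlier, e.g. collection.add(documents=[...], ids=[...])
collection = chroma_client.get_or_create_collection("docs")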
 
def rag_query(question: str) -> str:
    # Retrieve
    results = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(results["documents"][0])
    # Generate
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system="Answer using only the context provided.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text

Expected Insight

The LCEL version will likely be similar in line count for simple pipelines. The value emerges when you need to swap components (e.g., change the vector store) or add streaming. If the bare SDK version is already clean and working, the LCEL rewrite may not feel worth it.
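Before crediting the framework for swappable components, it may help to check how far a small interface gets you in the bare-SDK version. The sketch below is one possible shape, not part of Module 02: Retriever, KeywordRetriever, and the generate callable are hypothetical names, and the keyword ranking is a toy stand-in for a real vector store.

```python
import re
from typing import Callable, Protocol


class Retriever(Protocol):
    """Anything with a retrieve() method can be plugged into rag_query."""

    def retrieve(self, question: str, k: int = 3) -> list[str]: ...


def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


class KeywordRetriever:
    """Toy stand-in for a vector store: ranks documents by shared words."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs

    def retrieve(self, question: str, k: int = 3) -> list[str]:
        q_words = _words(question)
        ranked = sorted(
            self.docs,
            key=lambda doc: len(q_words & _words(doc)),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(context: list[str], question: str) -> str:
    joined = "\n\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


def rag_query(question: str, retriever: Retriever, generate: Callable[[str], str]) -> str:
    """Retrieve, build the prompt, and hand it to whatever generate() wraps."""
    return generate(build_prompt(retriever.retrieve(question), question))
```

If swapping Chroma for another store only means writing a second class with a retrieve method, note that in your opinion paragraph; if it means more than that, note what LCEL saved you.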

Exercise 2: Add a Checkpointer to the LangGraph Agent

Goal: Implement persistent state in the LangGraph agent from examples/langgraph_agent.py, enabling pause/resume and multi-session support.

Part A: Sqlite Checkpointer

Modify langgraph_agent.py to use SqliteSaver:

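# SqliteSaver ships in a separate package: pip install langgraph-checkpoint-sqlite
from langchain_core.messages import HumanMessage
from langgraph.checkpoint.sqlite import SqliteSaver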
 
with SqliteSaver.from_conn_string("agent_state.db") as checkpointer:
    app = graph_builder.compile(checkpointer=checkpointer)
 
    config = {"configurable": {"thread_id": "session-001"}}
    result = app.invoke(
        {"messages": [HumanMessage(content="What is the weather in Paris?")]},
        config=config,
    )

Run the agent twice with the same thread_id. Observe that the second run continues from where the first left off (the message history accumulates).
Run the agent with a different thread_id. Observe that it starts fresh.
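If the thread_id behavior feels opaque, a toy model can make it concrete. The sketch below is not LangGraph's implementation: ToyCheckpointer and run_turn are hypothetical names, and the "agent" is a stub that echoes the input. It only illustrates the idea of loading state by thread ID, appending to it, and saving it back.

```python
import json
import sqlite3


class ToyCheckpointer:
    """Stores one message list per thread_id in SQLite (illustration only)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS threads "
            "(thread_id TEXT PRIMARY KEY, messages TEXT)"
        )

    def load(self, thread_id: str) -> list[dict]:
        row = self.conn.execute(
            "SELECT messages FROM threads WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def save(self, thread_id: str, messages: list[dict]) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO threads (thread_id, messages) VALUES (?, ?)",
            (thread_id, json.dumps(messages)),
        )
        self.conn.commit()


def run_turn(cp: ToyCheckpointer, thread_id: str, user_text: str) -> list[dict]:
    """Load prior history, append the new turn, and persist the result."""
    history = cp.load(thread_id)
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": f"(stub reply to: {user_text})"})
    cp.save(thread_id, history)
    return history
```

Calling run_turn twice with the same thread_id returns a growing history, while a new thread_id starts from an empty list, which is the behavior you should see from the real agent.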

Part B: Human-in-the-Loop

Compile the graph with interrupt_before=["execute_tools"], inside the same with SqliteSaver block so the checkpointer connection is still open:

app = graph_builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute_tools"],
)

Run the agent with a tool-calling message. The graph will pause before executing tools.
Inspect the pending tool calls:

state = app.get_state(config)
pending_tools = state.values["messages"][-1].tool_calls
print("Pending tool calls:", pending_tools)

Resume execution:

# Resume with None input (continue from where we left off)
result = app.invoke(None, config=config)

Try modifying the tool call arguments before resuming:

from langchain_core.messages import AIMessage
 
# Get current state
state = app.get_state(config)
last_message = state.values["messages"][-1]
 
# Modify the tool call (e.g., change the city)
modified_tool_calls = [
    {**tc, "args": {"city": "Berlin"}}  # override city
    for tc in last_message.tool_calls
]
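modified_message = AIMessage(
    content=last_message.content,
    tool_calls=modified_tool_calls,
    # Reuse the id so that add_messages (if your state uses that reducer)
    # replaces the original message instead of appending a second one
    id=last_message.id,
)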
 
# Update state with modified message
app.update_state(config, {"messages": [modified_message]}, as_node="call_llm")
 
# Resume
result = app.invoke(None, config=config)
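One detail worth noticing in the override above: it replaces args wholesale, so any other arguments the model supplied are dropped. A minimal sketch of a merging variant follows. The override_args helper and the sample tool call are hypothetical; the dict keys (name, args, id) follow the shape of LangChain tool-call dicts.

```python
def override_args(tool_calls: list[dict], **overrides) -> list[dict]:
    """Return copies of the tool calls with selected arguments overridden."""
    return [
        {**tc, "args": {**tc["args"], **overrides}}
        for tc in tool_calls
    ]


# Hypothetical pending call; the values are made up for illustration.
pending = [
    {"name": "get_weather", "args": {"city": "Paris", "units": "metric"}, "id": "call_1"}
]
edited = override_args(pending, city="Berlin")
```

The tool call id is preserved, which matters because the tool result message is matched back to the call by that id.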

Part C: Reflection

Write answers to:

Why does checkpointing matter for production agents (beyond just “it persists state”)?
What is the difference between interrupt_before and interrupt_after? When would you use each?
What would break if you used MemorySaver instead of SqliteSaver in a web server with multiple workers?

Exercise 3: Build a 2-Agent CrewAI Crew

Goal: Build a researcher + writer crew that produces a short report on a given topic. Develop an opinion about CrewAI’s ergonomics vs. LangGraph.

Setup

pip install crewai crewai-tools python-dotenv

Task

Build a crew with two agents:

Research Analyst: searches for information on a topic (use SerperDevTool or a fake tool), produces a structured research summary.
Technical Writer: takes the research summary and produces a polished 300-word report.

from crewai import Agent, Crew, Process, Task
from crewai_tools import SerperDevTool  # optional, requires SERPER_API_KEY
 
# Or define a fake search tool:
from crewai.tools import BaseTool
 
class FakeSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web for information on a topic."
 
    def _run(self, query: str) -> str:
        return f"[Fake search result for '{query}']: Found 5 articles discussing key aspects of {query}..."
 
researcher = Agent(
    role="Research Analyst",
    goal="Find comprehensive, accurate information about {topic}",
    backstory=(
        "You are a senior research analyst with expertise in synthesizing "
        "information from multiple sources into clear, structured summaries."
    ),
    tools=[FakeSearchTool()],
    llm="anthropic/claude-haiku-4-5-20251001",  # provider-prefixed model string; needs ANTHROPIC_API_KEY
    verbose=True,
)
 
writer = Agent(
    role="Technical Writer",
    goal="Transform research into engaging, clear reports",
    backstory=(
        "You are an experienced technical writer who specializes in making "
        "complex topics accessible. You produce well-structured, concise reports."
    ),
    llm="anthropic/claude-haiku-4-5-20251001",  # provider-prefixed model string; needs ANTHROPIC_API_KEY
    verbose=True,
)
 
research_task = Task(
    description="Research the current state of {topic}. Identify key trends, challenges, and opportunities.",
    expected_output="A structured research summary with bullet points covering key findings.",
    agent=researcher,
)
 
writing_task = Task(
    description=(
        "Using the research summary, write a 300-word report on {topic}. "
        "The report should be suitable for a technical audience. "
        "Include an introduction, key findings, and a conclusion."
    ),
    expected_output="A 300-word professional report with clear structure.",
    agent=writer,
    context=[research_task],  # writer receives researcher's output
)
 
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
    verbose=True,
)
 
result = crew.kickoff(inputs={"topic": "vector databases in production AI systems"})
print(result.raw)

Extension (optional)

Switch to Process.hierarchical and add a manager LLM. Observe how the routing changes.
Add a third “Editor” agent that reviews the report for accuracy and tone.

Reflection Questions

After completing the exercise, write answers to:

How does the “role/goal/backstory” paradigm compare to LangGraph’s explicit node functions? Which gives you more confidence in production?
The context=[research_task] parameter passes the previous task’s output to the next agent. What is the risk of this approach at scale?
If you needed to add a conditional: “if the research quality is below a threshold, redo the research” — how would you implement this in CrewAI vs. LangGraph? Which is easier?

Exercise 4: Benchmark LangChain vs. Bare SDK

Goal: Measure the real-world differences between LangChain and the bare SDK for a simple, fixed task.

The Task

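Summarize 10 short paragraphs (1-3 sentences each) and return the results as a JSON array (for example, json.dumps on the returned list). Repeat the whole run several times to average out latency variance; the template below defaults to RUNS = 5.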

Measurement Template

import time
import json
import statistics
 
# --- Version A: Bare Anthropic SDK ---
import anthropic
 
bare_client = anthropic.Anthropic()
 
def bare_sdk_summarize(texts: list[str]) -> list[str]:
    """Summarize a list of texts using the bare SDK."""
    results = []
    for text in texts:
        response = bare_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
            messages=[{"role": "user", "content": f"Summarize in one sentence: {text}"}],
        )
        results.append(response.content[0].text)
    return results
 
# --- Version B: LangChain LCEL ---
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
 
lc_llm = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=100)
lc_prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
lc_chain = lc_prompt | lc_llm | StrOutputParser()
 
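def lc_summarize(texts: list[str]) -> list[str]:
    """Summarize a list of texts using LangChain batch."""
    # batch() runs inputs concurrently by default; cap it at 1 so the comparison
    # with the sequential bare-SDK loop measures framework overhead, not parallelism.
    return lc_chain.batch([{"text": t} for t in texts], config={"max_concurrency": 1})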
 
# --- Benchmark harness ---
SAMPLE_TEXTS = [
    "Large language models are neural networks trained on vast corpora of text data.",
    "Vector databases store high-dimensional embeddings for semantic similarity search.",
    "Retrieval-augmented generation combines vector search with LLM generation.",
    "LangChain provides abstractions for building composable LLM pipelines.",
    "Fine-tuning adjusts a pre-trained model's weights on a task-specific dataset.",
    "Prompt engineering shapes LLM behavior without changing model weights.",
    "Chain-of-thought prompting improves reasoning by asking models to think step-by-step.",
    "Tool calling allows LLMs to invoke external functions and APIs.",
    "Embeddings are dense vector representations that capture semantic meaning.",
    "Agents use LLMs to plan and execute multi-step tasks with tools.",
]
RUNS = 5  # reduce if API costs are a concern
 
def benchmark(fn, name, runs=RUNS):
    latencies = []
    for i in range(runs):
        start = time.perf_counter()
        result = fn(SAMPLE_TEXTS)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        print(f"  {name} run {i+1}: {elapsed:.2f}s")
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0,
    }
 
print("Benchmarking bare SDK...")
bare_stats = benchmark(bare_sdk_summarize, "bare_sdk")
 
print("\nBenchmarking LangChain LCEL...")
lc_stats = benchmark(lc_summarize, "langchain")
 
print("\nResults:")
for name, stats in [("Bare SDK", bare_stats), ("LangChain", lc_stats)]:
    print(f"  {name}: mean={stats['mean']:.2f}s, median={stats['median']:.2f}s, stdev={stats['stdev']:.2f}s")
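A latency overhead figure can be derived from the two stats dictionaries that benchmark() returns. The helper below is a small sketch; latency_overhead is a hypothetical name, and it defaults to the median because a single slow API call can skew the mean over so few runs.

```python
def latency_overhead(baseline: dict, candidate: dict, key: str = "median") -> dict:
    """Compare two stats dicts of the shape returned by benchmark()."""
    diff = candidate[key] - baseline[key]
    return {
        "absolute_s": diff,                           # seconds added per full run
        "relative_pct": 100.0 * diff / baseline[key],  # percent of the baseline
    }


# Usage with the harness variables:
# print(latency_overhead(bare_stats, lc_stats))
```

A negative value means the candidate was faster than the baseline in your runs.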

What to Measure

Metric	Bare SDK	LangChain LCEL
Lines of code
Mean latency (s)
Median latency (s)
Latency overhead vs bare SDK	N/A
Number of dependencies
Cold start time (import time)
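For the dependency row, one option is to walk the installed package metadata instead of counting by hand. The sketch below uses the standard library's importlib.metadata. The function names are hypothetical, and the result is the set of declared dependency names reachable from a distribution, which may include packages that environment markers exclude on your platform.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract a normalized distribution name from a requirement string."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
    if not match:
        return ""
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def transitive_dependencies(dist_name: str) -> set[str]:
    """Collect declared, non-extra dependency names reachable from dist_name."""
    seen: set[str] = set()
    stack = [dist_name]
    while stack:
        current = stack.pop()
        try:
            requirements = metadata.requires(current) or []
        except metadata.PackageNotFoundError:
            continue  # not installed here, so its own dependencies are unknown
        for req in requirements:
            if re.search(r"extra\s*==", req):
                continue  # skip optional extras
            name = requirement_name(req)
            if name and name not in seen:
                seen.add(name)
                stack.append(name)
    return seen


# Usage, in an environment where both packages are installed:
# print(len(transitive_dependencies("anthropic")))
# print(len(transitive_dependencies("langchain-anthropic")))
```

Run it in a fresh virtual environment for each approach if you want a second check via pip list.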

To measure import time:

time python -c "import anthropic"
time python -c "from langchain_anthropic import ChatAnthropic"
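The shell time command includes interpreter startup and is not available in every shell. A cross-platform sketch in Python follows, assuming a fresh subprocess per measurement is an acceptable proxy for a cold start. The function names are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def cold_import_seconds(statement: str, runs: int = 3) -> float:
    """Median wall-clock time to run `statement` in a fresh interpreter."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", statement], check=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def import_cost_seconds(statement: str, runs: int = 3) -> float:
    """Import time with interpreter startup (measured via `pass`) subtracted."""
    return cold_import_seconds(statement, runs) - cold_import_seconds("pass", runs)


# Usage:
# print(import_cost_seconds("import anthropic"))
# print(import_cost_seconds("from langchain_anthropic import ChatAnthropic"))
```

For a per-module breakdown of where the time goes, python -X importtime -c "import anthropic" prints the cost of each imported module.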

Expected Findings

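LangChain overhead: the abstraction layers add some latency per batch (this module's rough expectation is 50-200ms; check it against your own numbers). For a high-throughput service, this matters. Keep in mind that batch() runs requests concurrently by default, so an uncapped LangChain run can look faster than a sequential bare-SDK loop for reasons unrelated to framework overhead.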
Import time: LangChain is significantly slower to import than the bare SDK — relevant for Lambda functions and serverless environments.
Code lines: comparable for simple cases. LangChain does not save significant code for tasks this simple.

Reflection

Write a 3-sentence conclusion: for this specific task (batch summarization), which approach would you recommend, and why?

Exercise 5: Interview Simulation — Framework Selection for a Law Firm

Goal: Practice the framework selection reasoning that interviewers test. Write a structured response as if you are in a technical interview.

Scenario

You are interviewing for a senior AI engineer role at a legal technology company. The interviewer says:

“Your team is building a document Q&A system for a law firm. The system needs to answer questions over a corpus of case files — PDFs, Word documents, and deposition transcripts. Documents can be up to 500 pages long. Lawyers need to be able to cite the exact passage that informed each answer. The system will be used by 50 lawyers who are not technical. Accuracy is critical — a wrong answer could be malpractice. We need this production-ready in 3 months.

Which framework would you choose and why? What are the risks?”

Structure Your Answer

Write a response with these sections:

1. Clarifying questions I would ask (2-3 questions before committing to a choice)

What does “production-ready” mean — SLA, uptime, audit requirements?
Is on-premises deployment required (some law firms prohibit cloud storage of client data)?
What is the expected query volume?

2. Framework recommendation

State your choice and justify it. Consider:

Why LlamaIndex over LangChain for this use case?
Why the bare SDK might be appropriate for parts of the pipeline
What vector database you would choose (pgvector for on-prem? Pinecone for cloud?)
How you would handle the 500-page document challenge (chunking strategy)
How you would implement citation / source attribution
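The chunking and citation points are linked: a chunk can only be cited by page if the page number travels with it. The sketch below shows one simple way to keep that metadata, assuming your parser yields one string per page. Chunk and chunk_pages are hypothetical names, the word-window sizes are arbitrary defaults, and this is not LlamaIndex's parser. As a simplification, chunks here never cross a page boundary.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    document: str
    page: int  # 1-based page the chunk text came from
    text: str


def chunk_pages(
    document: str, pages: list[str], max_words: int = 200, overlap: int = 40
) -> list[Chunk]:
    """Split each page into overlapping word windows, keeping the page number."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            chunks.append(
                Chunk(document, page_number, " ".join(words[start:start + max_words]))
            )
            if start + max_words >= len(words):
                break  # this window already reached the end of the page
    return chunks
```

In your answer, say how you would handle passages that span a page break, since this sketch does not.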

3. Architecture sketch (text-based)

[PDF/DOCX/TXT ingestion]
    → Document parser (pdfplumber, python-docx)
    → LlamaIndex SentenceWindowNodeParser (preserves citation context)
    → Embedding (text-embedding-3-small or a legal-domain model such as voyage-law-2)
    → pgvector (on-prem) or Pinecone (cloud)

[Query time]
    User question
    → Query rewriting (expand legal terms, abbreviations)
    → Hybrid retrieval (semantic + BM25 keyword)
    → Cohere Rerank (cross-encoder reranking)
    → LlamaIndex ResponseSynthesizer with source_nodes
    → Answer + cited passages (passage, document, page number)
    → Hallucination check (does answer contradict retrieved passages?)
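The hybrid retrieval step leaves open how the semantic and BM25 result lists are merged. One common option is reciprocal rank fusion, sketched below as an assumption rather than something the scenario prescribes. It works on ranked lists of passage IDs, so it does not need the two scoring scales to be comparable; k = 60 is a commonly used constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical passage IDs from the two retrievers
semantic_hits = ["p3", "p1", "p2"]
bm25_hits = ["p1", "p4", "p3"]
fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
```

Passages that appear in both lists rise to the top, and the fused list is what you would pass to the reranker.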

4. Risks and mitigations (3-5 specific risks)

Example format:

Risk: LLM hallucination could produce legally incorrect answers.
Mitigation: Every answer must cite a retrieved passage. Implement a faithfulness check: use a second LLM call to verify that the answer is supported by the cited passages. Flag low-confidence answers for human review.
Risk: [Your risk here]
Mitigation: [Your mitigation here]
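A mitigation is more convincing in an interview when part of it is deterministic. As one example, a cheap check can run before any LLM-based faithfulness check: confirm that each quoted citation appears verbatim in the passage it points to. The sketch below assumes a hypothetical citation shape with passage_id and quote fields. It complements the second LLM call and does not replace it.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not defeat matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_citations(citations: list[dict], passages: dict[str, str]) -> list[dict]:
    """Return citations whose quote is empty or not found in the cited passage."""
    failures = []
    for citation in citations:
        quote = normalize(citation["quote"])
        source = normalize(passages.get(citation["passage_id"], ""))
        if not quote or quote not in source:
            failures.append(citation)
    return failures
```

Any answer with a non-empty failure list can be routed to human review without spending another model call.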

5. What I would NOT use and why

Be explicit about rejected options:

“I would not use the OpenAI Assistants API because…”
“I would not use LangChain’s RetrievalQA because…”
“I would not use CrewAI because…”

Evaluation Criteria

A strong answer:

Asks clarifying questions before committing to a framework
Justifies the choice in terms of the specific constraints (accuracy, citation, 500-page docs, 3-month timeline)
Addresses risks with specific, technical mitigations — not just “we’ll test it”
Is explicit about what NOT to use and why
Shows awareness of the trade-off between time-to-delivery and long-term maintainability

A weak answer:

Immediately picks a framework without asking questions
Justifies the choice because “it’s popular” or “I know it well”
Does not mention hallucination risk in a legal context
Does not explain how citation/source attribution works technically

Submission Checklist

Exercise 1: Completed comparison table + 3-5 sentence opinion
Exercise 2: Working agent with SqliteSaver + written answers to reflection questions
Exercise 3: Working CrewAI crew + written answers to reflection questions
Exercise 4: Completed benchmark table + 3-sentence conclusion
Exercise 5: Written interview response with all five sections

The goal is not perfect code. The goal is a clear opinion, backed by evidence from your own experiments.

Study Notes by Niladri & AI
