Module 09 Exercises: Production LLM Systems
These exercises turn the concepts from the README into hands-on practice.
Each one is scoped to be completable in 30–90 minutes.
Exercise 1: Instrument the RAG Pipeline with Structured Cost Logging
Objective: Add production-grade observability to the RAG pipeline from module 02.
Background: The module 02 RAG pipeline retrieves documents and calls Claude, but has no logging. In production, you need to know the cost, latency, and cache efficiency of every call.
Task:
- Open the RAG pipeline from 02-rag/ (or build a minimal stub if you skipped that module).
- Create an LLMCallLogger class that captures the following fields on every call:
  - request_id (UUID)
  - timestamp (ISO 8601)
  - model
  - feature (e.g., "rag_answer")
  - user_id (pass as parameter, default "anonymous")
  - system_prompt_hash (SHA-256 hex of the system prompt, first 8 chars)
  - input_tokens, output_tokens
  - cache_read_input_tokens, cache_creation_input_tokens
  - cost_usd (computed using the formula from the README)
  - latency_ms
  - finish_reason
  - num_retrieved_docs (RAG-specific)
- Write each log record as a single JSON line to llm_calls.jsonl.
- After 5 test queries, load llm_calls.jsonl and compute:
  - Total cost
  - Average latency
  - Cache hit rate (if you enabled caching)
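As a starting point, the logger and the summary step might look like the sketch below. Several things here are assumptions rather than part of the exercise: the prices in PRICES_PER_MTOK are placeholders, the cost formula simply bills each token class at its own rate (use the README's formula if it differs), the usage counts are passed in as a plain dict, and "cache hit rate" is taken to mean the share of prompt tokens served from cache.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

# Placeholder USD prices per million tokens (assumption, not real pricing).
PRICES_PER_MTOK = {"input": 1.00, "output": 5.00, "cache_read": 0.10, "cache_write": 1.25}

USAGE_FIELDS = ("input_tokens", "output_tokens",
                "cache_read_input_tokens", "cache_creation_input_tokens")


def compute_cost_usd(usage: dict) -> float:
    """Assumed formula: each token class is billed at its own per-MTok rate."""
    return (
        usage.get("input_tokens", 0) * PRICES_PER_MTOK["input"]
        + usage.get("output_tokens", 0) * PRICES_PER_MTOK["output"]
        + usage.get("cache_read_input_tokens", 0) * PRICES_PER_MTOK["cache_read"]
        + usage.get("cache_creation_input_tokens", 0) * PRICES_PER_MTOK["cache_write"]
    ) / 1_000_000


class LLMCallLogger:
    """Appends one JSON record per LLM call to a JSONL file."""

    def __init__(self, path: str = "llm_calls.jsonl"):
        self.path = path

    def log(self, *, model: str, feature: str, system_prompt: str, usage: dict,
            latency_ms: float, finish_reason: str, num_retrieved_docs: int,
            user_id: str = "anonymous") -> dict:
        digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        record = {
            "request_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "feature": feature,
            "user_id": user_id,
            "system_prompt_hash": digest[:8],
            **{field: usage.get(field, 0) for field in USAGE_FIELDS},
            "cost_usd": round(compute_cost_usd(usage), 6),
            "latency_ms": latency_ms,
            "finish_reason": finish_reason,
            "num_retrieved_docs": num_retrieved_docs,
        }
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record


def summarize(path: str = "llm_calls.jsonl") -> dict:
    """Load the JSONL log and compute total cost, mean latency, and cache hit rate."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    n = len(records)
    cached = sum(r["cache_read_input_tokens"] for r in records)
    prompt_total = sum(
        r["input_tokens"] + r["cache_read_input_tokens"] + r["cache_creation_input_tokens"]
        for r in records
    )
    return {
        "calls": n,
        "total_cost_usd": sum(r["cost_usd"] for r in records),
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / n if n else 0.0,
        "cache_hit_rate": cached / prompt_total if prompt_total else 0.0,
    }
```

In the pipeline itself, latency_ms would typically come from wrapping the API call with time.perf_counter(), and the usage dict from the token counts on the response.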
Stretch goal: Add prompt caching to the system prompt and verify that cache_read_input_tokens is non-zero on calls 2–5 (call 1 reports cache_creation_input_tokens instead).
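For the stretch goal, one way to mark the system prompt as cacheable is to send it as a list of text blocks with a cache_control entry, as sketched below. The helper names build_cached_system and cache_status are hypothetical, and the usage counts are again assumed to arrive as a plain dict.

```python
def build_cached_system(system_prompt: str) -> list[dict]:
    """Wrap the system prompt in a text block marked as a cache breakpoint."""
    return [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]


def cache_status(usage: dict) -> str:
    """Classify a call as a cache hit, a cache write, or neither."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"
    return "none"
```

The returned list would be passed as the system argument of client.messages.create. Very short system prompts may fall below the minimum cacheable length, in which case both cache counters stay at zero and cache_status reports "none".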
Expected output:
{"request_id": "a3f1...", "timestamp": "2024-01-15T10:23:45Z", "model": "claude-haiku-4-5-20251001", "feature": "rag_answer", "input_tokens": 1240, "output_tokens": 183, "cache_read_input_tokens": 800, "cache_creation_input_tokens": 0, "cost_usd": 0.000492, "latency_ms": 843, "finish_reason": "end_turn", "num_retrieved_docs": 3}
Exercise 2: Exponential Backoff Retry Decorator
Objective: Implement a production-grade retry wrapper for Anthropic API calls.
Task:
- Implement a @retry_with_backoff decorator with the following behaviour:
  - Retries on anthropic.RateLimitError (HTTP 429) and any 5xx APIStatusError
  - Uses exponential backoff: wait = min(max_delay, base_delay * 2^attempt)
  - Adds jitter: wait += random.uniform(0, wait * 0.1)
  - Accepts parameters: max_attempts=5, base_delay=1.0, max_delay=60.0
  - Raises the original exception after all attempts are exhausted
  - Logs each retry attempt with attempt number, wait time, and error type
- Write a test that simulates rate limiting:
  - Use unittest.mock.patch to make the first 3 API calls raise RateLimitError
  - Assert that the 4th call succeeds
  - Assert that the decorator waited at least 1 + 2 + 4 = 7 seconds total (or verify the sleep call count)
- Apply the decorator to a function that calls client.messages.create.
Example interface:
@retry_with_backoff(max_attempts=5, base_delay=1.0, max_delay=60.0)
def call_claude(prompt: str) -> str:
    ...
Key constraint: The jitter must be random (not deterministic). Explain in a comment why jitter prevents thundering herd.
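A minimal sketch of the decorator's control flow follows. To keep it runnable without the SDK installed, it retries on a hypothetical TransientError stand-in. With the real client, the is_retryable check would instead test for anthropic.RateLimitError, or for an anthropic.APIStatusError whose status_code is 500 or higher.

```python
import functools
import logging
import random
import time

logger = logging.getLogger(__name__)


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit or 5xx error from the API."""


def is_retryable(exc: Exception) -> bool:
    # Assumption: swap this for the SDK's exception types in real code.
    return isinstance(exc, TransientError)


def retry_with_backoff(max_attempts=5, base_delay=1.0, max_delay=60.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_attempt = attempt == max_attempts - 1
                    if not is_retryable(exc) or last_attempt:
                        raise  # re-raise the original exception unchanged
                    wait = min(max_delay, base_delay * 2 ** attempt)
                    # Jitter: clients that were rate-limited at the same moment
                    # would otherwise all retry at the same instants and collide
                    # again (thundering herd). A random offset spreads them out.
                    wait += random.uniform(0, wait * 0.1)
                    logger.warning(
                        "attempt %d failed with %s; retrying in %.2fs",
                        attempt + 1, type(exc).__name__, wait,
                    )
                    time.sleep(wait)
        return wrapper
    return decorator
```

Because the wrapper calls time.sleep through the time module, a test can patch "time.sleep" with unittest.mock.patch and inspect call_count and the recorded wait values without actually waiting.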
Exercise 3: Prompt Caching Cost Savings Calculator
Objective: Understand the economics of prompt caching through a concrete calculation and measurement.
Part A — Manual calculation (no API required):
A customer support system makes 1,000 calls/day with a 500-token system prompt.
Model: Claude Haiku. Prices: input $0.25/MTok, cache_write $0.30/MTok, cache_read $0.03/MTok.
Cache lifetime: 5 minutes. Average call rate: 1 call every 90 seconds.
Answer these questions with calculations:
- Without caching: what is the daily input token cost for the system prompt alone?
- With caching: how many cache writes occur per day? (Hint: cache expires after 5 minutes of inactivity. With 1 call/90s, the cache stays warm. How many cold starts per day?)
- With caching: what is the daily cost of cache writes + cache reads?
- What is the percentage saving from caching?
- What is the break-even point: at what call volume does caching start saving money vs losing money from the cache write overhead?
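To check your arithmetic, the calculations can be expressed as small helper functions, as in the sketch below. The function names are hypothetical and all prices are parameters in USD per million tokens. The sketch assumes the system prompt is billed once per call, either as ordinary input, as a cache write, or as a cache read.

```python
def uncached_cost(calls: int, prompt_tokens: int, input_price: float) -> float:
    """USD cost of sending the prompt as ordinary input on every call."""
    return calls * prompt_tokens * input_price / 1_000_000


def cached_cost(calls: int, cache_writes: int, prompt_tokens: int,
                write_price: float, read_price: float) -> float:
    """USD cost when `cache_writes` calls create the cache and the rest read it."""
    reads = calls - cache_writes
    return prompt_tokens * (cache_writes * write_price + reads * read_price) / 1_000_000


def saving_pct(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return 100.0 * (baseline - optimized) / baseline


# Illustrative numbers only (not the exercise's): 100 calls, one cold start,
# a 2,000-token prompt, and made-up prices of 1.00 / 1.25 / 0.10 per MTok.
baseline = uncached_cost(100, 2_000, input_price=1.00)
optimized = cached_cost(100, 1, 2_000, write_price=1.25, read_price=0.10)
```

Plugging in the exercise's own call volume, prompt size, prices, and your estimate of cold starts per day gives the figures for the first four questions.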
Part B — Measured verification (API required):
- Run prompt_caching_demo.py from the examples/ directory.
- Record the cache_creation_input_tokens on call 1 and cache_read_input_tokens on calls 2–3.
- Calculate the actual cost savings from the measured token counts.
- Compare with your theoretical calculation from Part A.
Deliverable: A markdown file exercise_03_solution.md with your calculations shown step by step, and a screenshot or copy-paste of the demo output from Part B.
Exercise 4: Simple Model Router
Objective: Build a model router that selects the cheapest adequate model for each query.
Task:
Implement a ModelRouter class with the following routing logic:
| Condition | Route to |
|---|---|
| Query < 50 words AND no code-related keywords | claude-haiku-4-5-20251001 |
| Query >= 50 words OR contains code keywords | claude-sonnet-4-5-20250929 |
Code-related keywords: code, debug, function, class, error, stack trace, implement, refactor, algorithm, complexity.
Interface:
import anthropic

class ModelRouter:
    def select_model(self, query: str) -> str:
        """Return the model name to use for this query."""
        ...
    def call(self, query: str, client: anthropic.Anthropic) -> tuple[str, str]:
        """Route query to the right model and return (response_text, model_used)."""
        ...
Test your router with at least 6 queries:
- “What is 2+2?” → should route to Haiku
- “Summarize the French Revolution in one paragraph.” → route based on word count
- “Debug this Python function: def foo(x): return x/0” → should route to Sonnet
- “Write a recursive algorithm for binary search with O(log n) complexity.” → Sonnet
- A long essay question (50+ words) → Sonnet
- A very short classification task → Haiku
Log which model was selected and the actual cost for each call. Calculate the total cost of your test set and compare it to the cost if all calls went to Sonnet.
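The selection logic on its own might be sketched as follows. The constructor arguments cheap_model, strong_model, and word_limit are hypothetical additions to the interface, and the tier labels in the usage line are placeholders for the model IDs in the routing table. The call method is omitted because it needs a live client.

```python
import re

CODE_KEYWORDS = [
    "code", "debug", "function", "class", "error", "stack trace",
    "implement", "refactor", "algorithm", "complexity",
]

# Whole-word, case-insensitive match, so "classification" does not
# trigger the "class" keyword.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CODE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


class ModelRouter:
    def __init__(self, cheap_model: str, strong_model: str, word_limit: int = 50):
        self.cheap_model = cheap_model
        self.strong_model = strong_model
        self.word_limit = word_limit

    def select_model(self, query: str) -> str:
        """Return the model name to use for this query."""
        is_long = len(query.split()) >= self.word_limit
        has_code_keyword = _KEYWORD_RE.search(query) is not None
        return self.strong_model if (is_long or has_code_keyword) else self.cheap_model


router = ModelRouter(cheap_model="haiku-tier", strong_model="sonnet-tier")
```

Whole-word matching is a design choice with a trade-off: it keeps "classification" on the cheap tier, but it also misses inflected forms such as "errors" or "implementing". Substring matching has the opposite failure mode.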
Stretch goal: Add a third tier — if the query contains “analyze in depth”, “comprehensive report”, or similar high-stakes phrases, route to Opus (but add a cost guard that refuses if estimated cost > $0.10).
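For the cost guard, one possible approach is a worst-case estimate made before the call, sketched below. The four-characters-per-token heuristic, the function names, and the prices in the usage example are all assumptions. The estimate treats max_output_tokens as fully used, so it errs on the side of refusing.

```python
class CostGuardError(RuntimeError):
    """Raised when a request's estimated cost exceeds the configured ceiling."""


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def guard_cost(query: str, max_output_tokens: int, input_price: float,
               output_price: float, limit_usd: float = 0.10) -> float:
    """Return the worst-case cost estimate in USD, or raise if it exceeds the limit.

    Prices are in USD per million tokens.
    """
    estimated = (
        estimate_tokens(query) * input_price
        + max_output_tokens * output_price
    ) / 1_000_000
    if estimated > limit_usd:
        raise CostGuardError(
            f"estimated cost ${estimated:.4f} exceeds limit ${limit_usd:.2f}"
        )
    return estimated
```

For example, with placeholder prices of 15.00 and 75.00 per MTok, a 400-character query capped at 1,000 output tokens is estimated at about $0.0765 and passes, while the same query capped at 2,000 output tokens is estimated at about $0.1515 and is refused.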
Exercise 5 (Interview Sim): Infrastructure Design for 10,000 Requests/Day
Objective: Practice the system design interview question most commonly asked when hiring for LLM engineering roles.
The Question:
“Design the infrastructure for an LLM-powered API that must handle 10,000 requests per day with a P99 latency under 2 seconds, at minimum cost. The system serves a B2C product where users ask questions about their financial portfolio. Responses should be accurate and contextually aware of previous messages in the session.”
Deliverable: Write a structured design document (exercise_05_design.md) covering:
1. Capacity planning (show your math):
- Requests per second (average and peak — assume 10:1 peak ratio)
- Expected token counts (estimate input and output tokens per request)
- Daily token volume
- Estimated daily cost at baseline (without optimizations)
2. Model and routing strategy:
- Which model(s) would you use?
- What routing logic would you apply?
- Expected cost after routing
3. Prompt caching strategy:
- What would you cache?
- Estimated cache hit rate
- Cost after caching
4. Infrastructure choices (justify each):
- Compute: serverless vs long-running (which and why, given the latency constraint)
- Region selection
- Load balancing
5. Latency budget (show that P99 < 2s is achievable):
- Break down where latency comes from (API overhead, prompt processing, generation)
- Which optimizations close the gap if you are over budget
6. Observability plan:
- What metrics would you track?
- What alerts would you set up?
- How would you detect a cost anomaly (e.g., a bug causing 100x token usage)?
7. Rate limit handling:
- How many TPM does 10,000 req/day correspond to at peak?
- What tier do you need?
- What happens if you hit the limit mid-day?
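The capacity and rate-limit arithmetic in sections 1 and 7 can be sanity-checked with a few lines of code. In the sketch below, the per-request token counts (1,500 input, 300 output) are placeholder assumptions. Substitute your own estimates.

```python
SECONDS_PER_DAY = 86_400


def capacity_plan(requests_per_day: int, peak_ratio: float,
                  input_tokens_per_request: int,
                  output_tokens_per_request: int) -> dict:
    """Derive request rates and token volumes from daily traffic."""
    avg_rps = requests_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_ratio
    tokens_per_request = input_tokens_per_request + output_tokens_per_request
    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "daily_tokens": requests_per_day * tokens_per_request,
        # Tokens per minute if the peak rate were sustained for a full minute.
        "peak_tpm": peak_rps * 60 * tokens_per_request,
    }


# Assumed token counts: 1,500 input and 300 output tokens per request.
plan = capacity_plan(10_000, peak_ratio=10,
                     input_tokens_per_request=1_500,
                     output_tokens_per_request=300)
```

Under these assumptions the average rate is about 0.12 requests per second, the peak about 1.16, the daily volume 18 million tokens, and the peak about 125,000 tokens per minute. If the provider counts input and output tokens against separate limits, split peak_tpm accordingly.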
Grading criteria (use this as a self-checklist):
- All numbers are calculated, not guessed
- The latency budget shows P99 < 2s is achievable
- Cost optimizations (caching, routing) are quantified with before/after estimates
- Infrastructure choice matches the latency requirement (not just the cheapest option)
- Failure modes are addressed (rate limits, cold starts, API downtime)
- Observability plan covers cost, latency, and error rate