Module 09 Exercises: Production LLM Systems

These exercises turn the concepts from the README into hands-on practice.
Each one is scoped to be completable in 30–90 minutes.


Exercise 1: Instrument the RAG Pipeline with Structured Cost Logging

Objective: Add production-grade observability to the RAG pipeline from module 02.

Background: The module 02 RAG pipeline retrieves documents and calls Claude, but has no logging. In production, you need to know the cost, latency, and cache efficiency of every call.

Task:

  1. Open the RAG pipeline from 02-rag/ (or build a minimal stub if you skipped that module).

  2. Create a LLMCallLogger class that captures the following fields on every call:

    • request_id (UUID)
    • timestamp (ISO 8601)
    • model
    • feature (e.g., "rag_answer")
    • user_id (pass as parameter, default "anonymous")
    • system_prompt_hash (SHA-256 hex of the system prompt, first 8 chars)
    • input_tokens, output_tokens
    • cache_read_input_tokens, cache_creation_input_tokens
    • cost_usd (computed using the formula from the README)
    • latency_ms
    • finish_reason
    • num_retrieved_docs (RAG-specific)
  3. Write each log record as a single JSON line to llm_calls.jsonl.

  4. After 5 test queries, load llm_calls.jsonl and compute:

    • Total cost
    • Average latency
    • Cache hit rate (if you enabled caching)

Stretch goal: Add prompt caching to the system prompt and verify that cache_read_input_tokens increases on calls 2–5.

Expected output:

{"request_id": "a3f1...", "timestamp": "2024-01-15T10:23:45Z", "model": "claude-haiku-4-5-20251001", "feature": "rag_answer", "input_tokens": 1240, "output_tokens": 183, "cache_read_input_tokens": 800, "cache_creation_input_tokens": 0, "cost_usd": 0.000492, "latency_ms": 843, "finish_reason": "end_turn", "num_retrieved_docs": 3}

Exercise 2: Exponential Backoff Retry Decorator

Objective: Implement a production-grade retry wrapper for Anthropic API calls.

Task:

  1. Implement a @retry_with_backoff decorator with the following behaviour:

    • Retries on anthropic.RateLimitError (HTTP 429) and any 5xx APIStatusError
    • Uses exponential backoff: wait = min(max_delay, base * 2^attempt)
    • Adds jitter: wait += random.uniform(0, wait * 0.1)
    • Accepts parameters: max_attempts=5, base_delay=1.0, max_delay=60.0
    • Raises the original exception after all attempts are exhausted
    • Logs each retry attempt with attempt number, wait time, and error type
  2. Write a test that simulates rate limiting:

    • Use unittest.mock.patch to make the first 3 API calls raise RateLimitError
    • Assert that the 4th call succeeds
    • Assert that the decorator waited at least 1 + 2 + 4 = 7 seconds total (or verify the sleep call count)
  3. Apply the decorator to a function that calls client.messages.create.

Example interface:

@retry_with_backoff(max_attempts=5, base_delay=1.0, max_delay=60.0)
def call_claude(prompt: str) -> str:
    ...

Key constraint: The jitter must be random (not deterministic). Explain in a comment why jitter prevents thundering herd.


Exercise 3: Prompt Caching Cost Savings Calculator

Objective: Understand the economics of prompt caching through a concrete calculation and measurement.

Part A — Manual calculation (no API required):

A customer support system makes 1,000 calls/day with a 500-token system prompt.
Model: Claude Haiku. Prices: input 0.30/MTok, cache_read $0.03/MTok.
Cache lifetime: 5 minutes. Average call rate: 1 call every 90 seconds.

Answer these questions with calculations:

  1. Without caching: what is the daily input token cost for the system prompt alone?
  2. With caching: how many cache writes occur per day? (Hint: cache expires after 5 minutes of inactivity. With 1 call/90s, the cache stays warm. How many cold starts per day?)
  3. With caching: what is the daily cost of cache writes + cache reads?
  4. What is the percentage saving from caching?
  5. What is the break-even point: at what call volume does caching start saving money vs losing money from the cache write overhead?

Part B — Measured verification (API required):

  1. Run prompt_caching_demo.py from the examples/ directory.
  2. Record the cache_creation_input_tokens on call 1 and cache_read_input_tokens on calls 2–3.
  3. Calculate the actual cost savings from the measured token counts.
  4. Compare with your theoretical calculation from Part A.

Deliverable: A markdown file exercise_03_solution.md with your calculations shown step by step, and a screenshot or copy-paste of the demo output from Part B.


Exercise 4: Simple Model Router

Objective: Build a model router that selects the cheapest adequate model for each query.

Task:

Implement a ModelRouter class with the following routing logic:

ConditionRoute to
Query < 50 words AND no code-related keywordsclaude-haiku-4-5-20251001
Query >= 50 words OR contains code keywordsclaude-sonnet-4-5-20251001

Code-related keywords: code, debug, function, class, error, stack trace, implement, refactor, algorithm, complexity.

Interface:

class ModelRouter:
    def select_model(self, query: str) -> str:
        """Return the model name to use for this query."""
        ...
 
    def call(self, query: str, client: anthropic.Anthropic) -> tuple[str, str]:
        """Route query to the right model and return (response_text, model_used)."""
        ...

Test your router with at least 6 queries:

  • “What is 2+2?” → should route to Haiku
  • “Summarize the French Revolution in one paragraph.” → route based on word count
  • “Debug this Python function: def foo(x): return x/0” → should route to Sonnet
  • “Write a recursive algorithm for binary search with O(log n) complexity.” → Sonnet
  • A long essay question (50+ words) → Sonnet
  • A very short classification task → Haiku

Log which model was selected and the actual cost for each call. Calculate the total cost of your test set and compare it to the cost if all calls went to Sonnet.

Stretch goal: Add a third tier — if the query contains “analyze in depth”, “comprehensive report”, or similar high-stakes phrases, route to Opus (but add a cost guard that refuses if estimated cost > $0.10).


Exercise 5 (Interview Sim): Infrastructure Design for 10,000 Requests/Day

Objective: Practice the system design interview question most commonly asked when hiring for LLM engineering roles.

The Question:

“Design the infrastructure for an LLM-powered API that must handle 10,000 requests per day with a P99 latency under 2 seconds, at minimum cost. The system serves a B2C product where users ask questions about their financial portfolio. Responses should be accurate and contextually aware of previous messages in the session.”

Deliverable: Write a structured design document (exercise_05_design.md) covering:

1. Capacity planning (show your math):

  • Requests per second (average and peak — assume 10:1 peak ratio)
  • Expected token counts (estimate input and output tokens per request)
  • Daily token volume
  • Estimated daily cost at baseline (without optimizations)

2. Model and routing strategy:

  • Which model(s) would you use?
  • What routing logic would you apply?
  • Expected cost after routing

3. Prompt caching strategy:

  • What would you cache?
  • Estimated cache hit rate
  • Cost after caching

4. Infrastructure choices (justify each):

  • Compute: serverless vs long-running (which and why, given the latency constraint)
  • Region selection
  • Load balancing

5. Latency budget (show that P99 < 2s is achievable):

  • Break down where latency comes from (API overhead, prompt processing, generation)
  • Which optimizations close the gap if you are over budget

6. Observability plan:

  • What metrics would you track?
  • What alerts would you set up?
  • How would you detect a cost anomaly (e.g., a bug causing 100x token usage)?

7. Rate limit handling:

  • How many TPM does 10,000 req/day correspond to at peak?
  • What tier do you need?
  • What happens if you hit the limit mid-day?

Grading criteria (use this as a self-checklist):

  • All numbers are calculated, not guessed
  • The latency budget shows P99 < 2s is achievable
  • Cost optimizations (caching, routing) are quantified with before/after estimates
  • Infrastructure choice matches the latency requirement (not just the cheapest option)
  • Failure modes are addressed (rate limits, cold starts, API downtime)
  • Observability plan covers cost, latency, and error rate