Module 09: Production LLM Systems
This module covers everything you need to deploy, operate, and optimize LLM-powered systems in production. It is interview-dense — every section maps directly to common system design and ML engineering questions.
Table of Contents
- Cost Optimization
- Prompt Caching (Anthropic)
- Streaming
- Batching (Anthropic Batch API)
- Latency Optimization
- Rate Limits and Retry Strategy
- Observability
- Deployment Patterns
- Model Routing
- Interview Flashcards
1. Cost Optimization
Token Cost Model
LLM APIs charge per token, and the cost is asymmetric:
| Token Type | Relative Cost | Notes |
|---|---|---|
| Input tokens | 1x | Everything you send: system, user, history |
| Output tokens | 3–5x | Everything the model generates |
| Cache read tokens | ~0.1x | Anthropic prompt caching (10x cheaper) |
| Cache write | ~1.25x | One-time cost to populate the cache |
This asymmetry has a direct design implication: controlling output length is more impactful than trimming input. A verbose system prompt costs far less than a verbose model response.
As a concrete example, using approximate Claude 3 Haiku list prices (newer Haiku models are priced higher, so confirm current rates before budgeting):
- Input: $0.25 per million tokens
- Output: $1.25 per million tokens
- Cache read: $0.03 per million tokens
- Cache write: $0.30 per million tokens
Estimating and Tracking Costs
The formula for cost per request:
cost = (input_tokens * price_per_input_MTok / 1_000_000)
+ (output_tokens * price_per_output_MTok / 1_000_000)
+ (cache_read_tokens * price_per_cache_read_MTok / 1_000_000)
+ (cache_write_tokens * price_per_cache_write_MTok / 1_000_000)
Every Anthropic API response includes a usage object:
{
"usage": {
"input_tokens": 350,
"output_tokens": 120,
"cache_read_input_tokens": 800,
"cache_creation_input_tokens": 0
}
}
Track cost per request by attaching user ID and feature tag at call time, then aggregating in your data warehouse:
def compute_cost(usage, model="claude-haiku-4-5"):
    # USD per million tokens (MTok). Illustrative figures matching the tables
    # in this module; confirm current list prices before relying on them.
    PRICES = {
        "claude-haiku-4-5": {
            "input": 0.25,
            "output": 1.25,
            "cache_read": 0.03,
            "cache_write": 0.30,
        },
        "claude-sonnet-4-5": {
            "input": 3.00,
            "output": 15.00,
            "cache_read": 0.30,
            "cache_write": 3.75,
        },
    }
    p = PRICES[model]
    # Cache fields can be missing or None, so fall back to 0.
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return (
        usage.input_tokens * p["input"] / 1_000_000
        + usage.output_tokens * p["output"] / 1_000_000
        + cache_read * p["cache_read"] / 1_000_000
        + cache_write * p["cache_write"] / 1_000_000
    )
Strategies to Reduce Cost
1. Prompt Caching (biggest win)
For any system where the system prompt or a large document stays the same across many requests, prompt caching cuts input token costs by ~10x. At the Haiku prices above, a 1,000-token system prompt sent 10,000 times/day costs about $2.50/day uncached versus about $0.30/day in cache reads (after the first write). See Section 2 for full details.
2. Model Routing
Not every query requires your most powerful model. Route cheap, simple tasks to small models:
| Task Type | Recommended Model | Rationale |
|---|---|---|
| Intent classification | Haiku | Deterministic, low token count |
| Summarization of short text | Haiku | Simple extraction task |
| Multi-step reasoning | Sonnet | Balance of cost and accuracy |
| Complex code generation / analysis | Sonnet / Opus | High stakes, needs power |
A routing classifier itself can be a small model call (~50 input tokens, ~5 output tokens), which costs nearly nothing but can gate expensive calls.
3. Output Length Control
Always set max_tokens to a reasonable bound. A model given no constraint may produce 2,000 tokens when 200 would suffice. Use structured output (JSON mode or tool-use schemas) to eliminate preamble (“Sure, here is the JSON you requested…”).
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=256, # hard cap — saves cost and latency
messages=[{"role": "user", "content": prompt}]
)
4. Batch API (50% discount)
For workloads that tolerate up to 24-hour latency — evaluation pipelines, document processing, nightly enrichment — use the Batch API for a 50% cost reduction. See Section 4.
5. Response Caching
For identical inputs (e.g., a FAQ bot), cache the full LLM response in Redis or a CDN. Key on a hash of (model + system_prompt + user_message). Cache TTL depends on content volatility — FAQ answers can be cached for hours, financial data for seconds.
import hashlib, json, redis
r = redis.Redis()
def cached_completion(system, user, model, client):
    # sort_keys keeps the hash stable regardless of dict ordering
    key = hashlib.sha256(
        json.dumps({"model": model, "system": system, "user": user}, sort_keys=True).encode()
    ).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no API call
    resp = client.messages.create(
        model=model, max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": user}]
    )
    text = resp.content[0].text
    r.setex(key, 3600, json.dumps(text))  # expire after 1 hour
    return text
Cost vs Quality Matrix
| Model | Input Cost (MTok) | Output Cost (MTok) | Speed | Quality |
|---|---|---|---|---|
| Claude Haiku | ~$0.25 | ~$1.25 | Very fast | Good for simple |
| Claude Sonnet | ~$3.00 | ~$15.00 | Moderate | Strong general |
| Claude Opus | ~$15.00 | ~$75.00 | Slowest | Best complex |
Haiku is ~60x cheaper per output token than Opus. For a high-volume system, choosing Haiku over Opus for tasks where Haiku is sufficient can cut costs by 98%.
2. Prompt Caching (Anthropic)
What It Is
Prompt caching allows you to mark a prefix of your prompt as cacheable. On the first request, Anthropic computes and stores the KV cache for that prefix. On subsequent requests that share the same prefix, the model skips recomputation and reads from cache.
Economics:
- Cache write: ~1.25x the normal input token price (one-time cost)
- Cache read: ~0.1x the normal input token price (~10x cheaper)
- Cache read latency: ~3x faster than re-processing the same tokens
For a 1,000-token system prompt used in 10,000 calls:
- Without caching: 10,000 × 1,000 tokens × $0.25/MTok = $2.50
- With caching: 1 write (1,000 tokens × $0.30/MTok ≈ $0.0003) + 9,999 reads (≈ 10M tokens × $0.03/MTok ≈ $0.30) ≈ $0.30 (about 88% savings)
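The arithmetic above can be checked with a small helper. This is a minimal sketch, not an official calculator: `prefix_cost` is a hypothetical name, the prices are the approximate Haiku figures from Section 1, and it assumes every call after the first hits the cache (the cache never expires in between).

```python
def prefix_cost(calls, prefix_tokens, input_price, cache_write_price, cache_read_price):
    """Return (uncached_usd, cached_usd) for sending one stable prefix `calls` times.

    Prices are USD per million tokens. Assumes one cache write followed by
    cache hits on every later call.
    """
    mtok = prefix_tokens / 1_000_000
    uncached = calls * mtok * input_price
    cached = mtok * cache_write_price + (calls - 1) * mtok * cache_read_price
    return uncached, cached


uncached, cached = prefix_cost(
    calls=10_000, prefix_tokens=1_000,
    input_price=0.25, cache_write_price=0.30, cache_read_price=0.03,
)
savings = 1 - cached / uncached
print(f"uncached=${uncached:.2f} cached=${cached:.2f} savings={savings:.0%}")
```

In real traffic, idle gaps longer than the TTL add extra cache writes, so treat the result as an upper bound on savings.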
How to Implement
Add cache_control: {"type": "ephemeral"} to the last content block of the prefix you want cached. The cache boundary is at the end of that block.
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """
You are an expert financial analyst with deep knowledge of...
[500+ tokens of stable instructions and context]
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # mark this for caching
}
],
messages=[
{"role": "user", "content": "What is the P/E ratio of AAPL?"}
]
)
usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Input tokens: {usage.input_tokens}")
On the first call: cache_creation_input_tokens > 0, cache_read_input_tokens = 0.
On subsequent calls within the cache TTL: cache_read_input_tokens > 0.
What Can Be Cached
Anything that does not vary per request:
- System prompt (most common)
- Tool definitions (large tool schemas can be 500+ tokens)
- Reference documents injected at the start of a conversation
- Few-shot examples that are the same for all users
Variable content (user messages, conversation history) goes AFTER the cached prefix. The cached prefix is read back rather than recomputed, so the model only has to compute KV states for the new tokens.
Cache Lifetime
The default cache type is ephemeral, with a TTL of approximately 5 minutes. The timer resets on each cache hit. In practice:
- Actively-used caches (many requests per minute) stay warm indefinitely
- Idle caches (no requests for 5+ minutes) expire and require a re-write on the next request
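All caches are ephemeral; there is no persistent tier. Anthropic's documentation also describes an optional 1-hour TTL, billed at a higher cache-write rate, so check the current docs if a 5-minute window is too short for your traffic.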
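To see how a sliding TTL behaves under different traffic patterns, the sketch below replays a list of request timestamps and labels each one as a cache write or a cache hit. `simulate_cache` is a hypothetical helper and only models the behavior described above (a roughly 5-minute TTL that restarts on every use); it does not call the API.

```python
def simulate_cache(request_times, ttl_seconds=300):
    """Label each request 'write' or 'hit' under a sliding TTL.

    request_times: ascending timestamps in seconds.
    """
    outcomes = []
    expires_at = None  # no cache entry yet
    for t in request_times:
        if expires_at is not None and t < expires_at:
            outcomes.append("hit")
        else:
            outcomes.append("write")
        expires_at = t + ttl_seconds  # both writes and hits restart the timer
    return outcomes


print(simulate_cache([0, 120, 400, 900, 1000]))
# ['write', 'hit', 'hit', 'write', 'hit']
```

The request at 900s misses because the previous use at 400s only kept the entry alive until 700s.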
Best Practices
- Put stable content first, variable content last. The cache key is the exact prefix. Any change to the prefix (even whitespace) invalidates the cache.
- Keep all stable context together at the front. If you have tool definitions, a system prompt, and a static knowledge base, place them ahead of any per-request content so that one cache breakpoint at the end of the stable section covers all of them.
- Minimum cacheable size. Anthropic enforces a minimum prefix length that depends on the model: 1,024 tokens for Sonnet and Opus and 2,048 for Haiku in the model generations these figures come from (check the current documentation for newer models). Prompts below the threshold are processed normally but not cached, so pad them only with context that is genuinely useful.
- Monitor cache hit rate. Calculate: cache_hit_rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens). Target > 90% for a well-optimized system.
- Cache tool definitions. If your agent uses 10 tools with detailed JSON schemas, that can be 2,000+ tokens. Cache the entire tool block.
# Caching tool definitions
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=1024,
tools=[
{
"name": "search_database",
"description": "Search the product database...",
"input_schema": { ... },
"cache_control": {"type": "ephemeral"} # cache this tool block
}
],
system="You are a helpful assistant.",
messages=[{"role": "user", "content": user_query}]
)
Calculating Cache Hit Rate
def cache_hit_rate(usage):
    # `or 0` guards against fields that are present but None
    reads = getattr(usage, "cache_read_input_tokens", 0) or 0
    writes = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total = reads + writes
    if total == 0:
        return 0.0
    return reads / total
# Average the per-response rates across a batch of responses
rates = [cache_hit_rate(r.usage) for r in responses]
print(f"Average cache hit rate: {sum(rates)/len(rates):.1%}")
3. Streaming
Why Stream
Without streaming, the user sees nothing until the model finishes generating — which can be 5–30 seconds for long responses. Streaming sends tokens to the client as they are generated, so:
- Perceived latency drops dramatically. Users see output after 200–500ms (TTFT) even if total generation takes 10 seconds.
- Progressive rendering is possible (markdown, code blocks appear incrementally).
- Early cancellation: if the user’s intent changes, they can cancel mid-generation and avoid paying for unused output tokens.
How to Implement with Anthropic SDK
import anthropic
import time
client = anthropic.Anthropic()
start = time.time()
first_token_time = None
with client.messages.stream(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in depth."}]
) as stream:
    for text in stream.text_stream:
        if first_token_time is None:
            # first text delta received: record time to first token
            first_token_time = time.time()
            print(f"\nTTFT: {first_token_time - start:.3f}s\n")
        print(text, end="", flush=True)
total = time.time() - start
print(f"\n\nTotal time: {total:.3f}s")
The .text_stream iterator yields decoded text deltas. For low-level event access:
with client.messages.stream(...) as stream:
    for event in stream:
        print(event.type, event)
Event Types
| Event Type | Payload | Notes |
|---|---|---|
| `message_start` | message object with usage (partial) | First event; contains model info |
| `content_block_start` | content_block with type/index | Starts a new content block |
| `content_block_delta` | delta.text or delta.partial_json | Incremental token(s) |
| `content_block_stop` | index | Block complete |
| `message_delta` | usage.output_tokens, stop_reason | Final usage stats |
| `message_stop` | (empty) | Stream complete |
Streaming Tool Calls
When a tool call is being generated, the delta type is input_json_delta with partial_json (a string fragment of the tool input JSON). You must accumulate these fragments and parse the complete JSON only after content_block_stop.
import json
tool_input_buffer = ""
with client.messages.stream(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    tools=[{"name": "get_weather", "description": "...", "input_schema": {...}}],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            delta = event.delta
            if delta.type == "input_json_delta":
                tool_input_buffer += delta.partial_json  # accumulate fragments
        elif event.type == "content_block_stop":
            if tool_input_buffer:
                # the block is complete, so the buffer now holds valid JSON
                tool_input = json.loads(tool_input_buffer)
                print(f"Tool called with: {tool_input}")
                tool_input_buffer = ""
SSE for Web Clients
Use Server-Sent Events to push streaming tokens to a browser. In a FastAPI app:
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic
app = FastAPI()
client = anthropic.Anthropic()
@app.get("/stream")
async def stream_response(query: str):
    def generate():
        with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[{"role": "user", "content": query}]
        ) as stream:
            for text in stream.text_stream:
                # JSON-encode each delta: raw newlines in the text would
                # otherwise break SSE framing
                yield f"data: {json.dumps(text)}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
On the client:
const evtSource = new EventSource("/stream?query=hello");
evtSource.onmessage = (e) => {
  if (e.data === "[DONE]") { evtSource.close(); return; }
  // each event carries one JSON-encoded text delta
  document.getElementById("output").textContent += JSON.parse(e.data);
};
When NOT to Stream
- Batch processing jobs: streaming adds overhead; full responses are needed before the next step anyway.
- Downstream systems that need the full response: parsers, validators, and structured output consumers should get the complete text.
- Short responses (< 100 tokens): the streaming overhead (connection setup, event parsing) may exceed the latency benefit.
- Serverless with cold starts: streaming requires a persistent connection that conflicts with short-lived Lambda invocations.
4. Batching (Anthropic Batch API)
What It Is
The Message Batches API allows you to submit a large set of requests in a single API call (current documentation lists up to 100,000 requests or 256 MB per batch, whichever is reached first). Anthropic processes them asynchronously and makes results available for download. You get a 50% discount on all token costs compared to the real-time API.
The trade-off: latency. Batches can take anywhere from a few minutes to 24 hours to complete, depending on load.
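Workloads larger than the per-batch limit have to be split across several batches. The helper below is a minimal sketch of that splitting step; `chunk_requests` is a hypothetical name, `max_per_batch` should be set to whatever limit applies to your account, and the demo requests use empty placeholder `params`.

```python
from typing import Iterator, List, Sequence


def chunk_requests(requests: Sequence[dict], max_per_batch: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `requests`, each at most `max_per_batch` long."""
    if max_per_batch <= 0:
        raise ValueError("max_per_batch must be positive")
    for start in range(0, len(requests), max_per_batch):
        yield list(requests[start:start + max_per_batch])


# Hypothetical workload: 25 requests split with a per-batch limit of 10.
demo = [{"custom_id": f"doc-{i}", "params": {}} for i in range(25)]
sizes = [len(chunk) for chunk in chunk_requests(demo, max_per_batch=10)]
print(sizes)  # [10, 10, 5]
```

Each chunk would then be submitted as its own batch, with the `custom_id` values used to stitch results back together.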
Use Cases
- Offline evaluation runs: running eval sets of 1,000+ test cases nightly
- Bulk document processing: summarizing, classifying, or extracting from large corpora
- Data enrichment pipelines: adding LLM-generated metadata to a database table
- Report generation: producing weekly summaries that do not need to be real-time
Batch API Flow
1. Create batch → 2. Poll status → 3. Download results
Step 1 — Create batch:
import anthropic
client = anthropic.Anthropic()
requests = [
{
"custom_id": f"doc-{i}",
"params": {
"model": "claude-haiku-4-5-20251001",
"max_tokens": 256,
"messages": [
{"role": "user", "content": f"Summarize: {docs[i]}"}
]
}
}
for i in range(len(docs))
]
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}, Status: {batch.processing_status}")
Step 2 - Poll status:
import time
while True:
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Status: {batch.processing_status} | "
          f"Succeeded: {batch.request_counts.succeeded} | "
          f"Errored: {batch.request_counts.errored}")
    if batch.processing_status == "ended":
        break
    time.sleep(60)  # poll every minute
Step 3 - Download results:
results = {}
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        results[result.custom_id] = result.result.message.content[0].text
    else:
        # only errored results carry an error payload; fall back to the result type
        detail = getattr(result.result, "error", result.result.type)
        print(f"Failed: {result.custom_id}: {detail}")
Trade-off Summary
| Dimension | Real-time API | Batch API |
|---|---|---|
| Latency | Milliseconds to seconds | Minutes to 24 hours |
| Cost | Full price | 50% discount |
| Max requests | Rate-limited | 100,000 per batch (or 256 MB) |
| Use case | User-facing features | Offline / async jobs |
| Retry on error | Immediate | Re-submit failed items |
5. Latency Optimization
Baseline Latency Metrics
- TTFT (Time to First Token): time from request send to first token received. This is what users perceive as “the wait.” Target: < 500ms for interactive apps.
- Total generation time: TTFT + (tokens_generated / tokens_per_second). For 500 tokens at 100 tok/s, that is 5 seconds additional.
- End-to-end latency: includes network round-trips, tool call execution, and any pre/post-processing.
Sources of Latency
| Source | Typical Contribution | Reducible? |
|---|---|---|
| API call overhead | 50–200ms | Yes — region selection |
| Prompt processing | 100–500ms | Yes — caching |
| Token generation | 1–30 seconds | Yes — model selection |
| Tool call (1 round) | +1–3s per round | Yes — parallelism |
| Network (each hop) | 10–100ms | Yes — edge / CDN |
Strategies
1. Model Selection
Haiku is 5–10x faster than Opus at token generation. For latency-critical paths (< 1s total), Haiku is the only viable choice. Use a latency budget:
Total budget: 2,000ms
- API overhead: 150ms
- Prompt cache: 100ms
- 200 output tokens @ 80 tok/s Haiku: 2,500ms ← too slow (2,750ms total)
- 100 output tokens (max_tokens=100) @ 80 tok/s Haiku: 1,250ms ← ok (1,500ms total)
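The budget above can be turned into a quick calculation that tells you how many output tokens fit. This is a rough sketch with hypothetical helper names; it assumes a constant generation speed and uses the illustrative overhead, prompt-processing, and tokens-per-second figures from the budget.

```python
import math


def max_output_tokens(budget_ms, overhead_ms, prompt_ms, tokens_per_second):
    """Largest output length that fits in the remaining latency budget."""
    remaining_ms = budget_ms - overhead_ms - prompt_ms
    if remaining_ms <= 0:
        return 0
    return math.floor(remaining_ms / 1000 * tokens_per_second)


def total_latency_ms(overhead_ms, prompt_ms, output_tokens, tokens_per_second):
    """End-to-end estimate: fixed costs plus generation time."""
    return overhead_ms + prompt_ms + output_tokens / tokens_per_second * 1000


print(max_output_tokens(2000, 150, 100, 80))   # 140
print(total_latency_ms(150, 100, 200, 80))     # 2750.0
print(total_latency_ms(150, 100, 100, 80))     # 1500.0
```

Setting `max_tokens` at or below the first number keeps the worst case inside the budget, at the risk of truncated responses.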
2. Prompt Caching
Caching the prompt prefix reduces TTFT by ~3x for cached content. A 2,000-token system prompt that takes 400ms to process takes 130ms when cached.
3. Streaming
Streaming does not reduce total time — it hides latency. The user sees the first token at TTFT, so the experience feels fast even if total generation takes 10 seconds. For interactive UIs, streaming is mandatory.
4. Parallel Tool Calls
In agentic systems, many tool calls are independent. Run them concurrently using asyncio.gather or concurrent.futures.ThreadPoolExecutor:
import asyncio
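async def run_tools_parallel(tool_calls, tool_functions):
    # tool_functions maps tool name -> async function; each call returns a coroutine
    tasks = [
        tool_functions[tc.name](**tc.input)
        for tc in tool_calls
    ]
    # gather runs them concurrently and returns results in the original order
    results = await asyncio.gather(*tasks)
    return results
Three independent tool calls run sequentially at 500ms each take 1,500ms. Run in parallel, they take ~500ms.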
5. Region Selection
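Network distance to the API endpoint adds round-trip time. When you access Claude through a cloud platform such as Amazon Bedrock or Google Cloud Vertex AI, pick a region close to your servers; region availability varies by platform and model, so confirm it in the provider's documentation. Co-locating servers and endpoint can reduce network RTT by 20–100ms.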
6. Connection Reuse
Use persistent HTTP connections (the SDK handles this by default). Avoid creating a new anthropic.Anthropic() client per request — create it once at application startup.
P50 vs P99 Latency
| Metric | Definition | Why It Matters |
|---|---|---|
| P50 | Median latency (50th percentile) | What most users experience |
| P95 | 95th percentile | Captures slow tail |
| P99 | 99th percentile (worst 1%) | What your SLA should guarantee |
LLM latency distributions are heavy-tailed. P99 can be 5–10x P50 due to:
- Occasional long outputs
- Token generation variance
- Network jitter
- Cold starts in serverless
Alert on P99, not mean. A mean latency of 1.5s with a P99 of 15s means 1% of users wait 15 seconds — bad for production.
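A small worked example shows why the mean hides the tail. The sketch below uses a nearest-rank percentile (one of several common definitions) on made-up latencies; `percentile` is a hypothetical helper, not part of any metrics library.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 98 fast requests and 2 slow outliers.
latencies = [1.0] * 98 + [15.0] * 2
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s p50={percentile(latencies, 50)}s p99={percentile(latencies, 99)}s")
# mean=1.28s p50=1.0s p99=15.0s
```

The mean barely moves, while P99 exposes the 15-second requests directly.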
Track P99 in your metrics pipeline:
# Prometheus histogram (Python)
from prometheus_client import Histogram
llm_latency = Histogram(
"llm_request_duration_seconds",
"LLM request latency",
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
with llm_latency.time():
    response = client.messages.create(...)
6. Rate Limits and Retry Strategy
Anthropic Rate Limits
Limits are enforced at three levels:
| Limit Type | Acronym | Scope |
|---|---|---|
| Requests per minute | RPM | Number of API requests per 60 seconds |
| Tokens per minute | TPM | Input + output tokens per 60 seconds |
| Tokens per day | TPD | Daily cap; resets at midnight UTC |
Limits vary by tier (Build → Scale → Enterprise). Check your tier limits in the Anthropic console. When you hit a limit, the API returns HTTP 429.
Response headers tell you the current state:
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 842
anthropic-ratelimit-requests-reset: 2024-01-01T00:01:00Z
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 12500
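These headers can drive proactive throttling instead of waiting for a 429. The function below is a minimal sketch that assumes you already have the response headers as a plain dict of strings (how you obtain them depends on your HTTP client or SDK); `seconds_until_reset` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def seconds_until_reset(headers, now=None):
    """Seconds to wait before the request window resets; 0.0 if requests remain."""
    remaining = int(headers.get("anthropic-ratelimit-requests-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_raw = headers.get("anthropic-ratelimit-requests-reset")
    if reset_raw is None:
        return 0.0
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalize it.
    reset_at = datetime.fromisoformat(reset_raw.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


exhausted = {
    "anthropic-ratelimit-requests-remaining": "0",
    "anthropic-ratelimit-requests-reset": "2024-01-01T00:01:00Z",
}
fake_now = datetime(2024, 1, 1, 0, 0, 30, tzinfo=timezone.utc)
print(seconds_until_reset(exhausted, now=fake_now))  # 30.0
```

The same pattern applies to the token headers if TPM rather than RPM is your bottleneck.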
Exponential Backoff with Jitter
The correct algorithm for retrying a 429 (or any transient 5xx):
wait = min(cap, base * 2^attempt) + random(0, jitter)
- base: initial wait (e.g., 1 second)
- cap: maximum wait (e.g., 60 seconds)
- jitter: random extra delay that prevents a thundering herd when many clients retry simultaneously
import time
import random
import functools
import anthropic
# Create the client once and reuse it so HTTP connections are pooled.
# The SDK also retries some errors itself (max_retries), so set that deliberately.
client = anthropic.Anthropic()
def with_retry(max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Decorator for exponential backoff with jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except anthropic.RateLimitError:
                    if attempt == max_attempts - 1:
                        raise
                    delay = min(max_delay, base_delay * (2 ** attempt))
                    jitter = random.uniform(0, delay * 0.1)
                    wait = delay + jitter
                    print(f"Rate limited. Attempt {attempt+1}/{max_attempts}. "
                          f"Waiting {wait:.1f}s...")
                    time.sleep(wait)
                except anthropic.APIStatusError as e:
                    # Retry only transient server errors; re-raise other 4xx errors
                    if e.status_code >= 500 and attempt < max_attempts - 1:
                        delay = min(max_delay, base_delay * (2 ** attempt))
                        time.sleep(delay + random.uniform(0, 1))
                    else:
                        raise
        return wrapper
    return decorator
@with_retry(max_attempts=5)
def call_claude(prompt):
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
Strategies When Hitting Limits
- Queue-based rate control: use a token bucket or sliding window queue to smooth request bursts. The simplest version is a concurrency cap with asyncio.Semaphore:
import asyncio
semaphore = asyncio.Semaphore(10)  # max 10 concurrent requests
async def throttled_call(prompt):
    async with semaphore:
        return await async_client.messages.create(...)
- Reduce parallelism: if RPM is the bottleneck, reduce concurrent workers. If TPM is the bottleneck, reduce tokens per request (trim context, lower max_tokens).
- Stagger requests: for batch jobs, add a small delay between requests to stay under TPM limits.
- Upgrade tier: for sustained high volume, tier upgrades give 10–100x higher limits.
- Use the Batch API: batch requests are queued and processed asynchronously under their own limits rather than the real-time RPM/TPM limits, which suits high-volume offline work.
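A semaphore caps concurrency but does not limit requests per minute. The token bucket mentioned in the first strategy does; the class below is a minimal, non-blocking sketch of one. `TokenBucket` and its parameters are hypothetical names, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allow `rate` units per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` units if available; return False instead of blocking."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())  # True True False
fake_now[0] = 0.5  # half a second later, one unit has been refilled
print(bucket.try_acquire())  # True
```

For an RPM limit, `rate` would be the limit divided by 60 and `cost` 1 per request; for TPM, `cost` would be the estimated tokens for the request.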
7. Observability
Good observability is what separates a prototype from a production system. You need to know: what happened, when, how much it cost, and how fast it was — for every single LLM call.
What to Log for Every LLM Call
Request fields:
| Field | Type | Example |
|---|---|---|
| `request_id` | string | UUID generated by your app |
| `timestamp` | ISO8601 | 2024-01-15T10:23:45.123Z |
| `model` | string | claude-haiku-4-5-20251001 |
| `temperature` | float | 0.7 |
| `max_tokens` | int | 512 |
| `system_prompt_hash` | string | SHA256 of the system prompt |
| `user_id` | string | For per-user cost tracking |
| `feature` | string | chat, summarize, classify |
| `input_preview` | string | First 200 chars of user message |
Response fields:
| Field | Type | Example |
|---|---|---|
| `input_tokens` | int | 350 |
| `output_tokens` | int | 128 |
| `cache_read_input_tokens` | int | 800 |
| `cache_creation_input_tokens` | int | 0 |
| `cost_usd` | float | 0.000214 |
| `latency_ms` | int | 1243 |
| `ttft_ms` | int | 312 |
| `finish_reason` | string | end_turn, max_tokens |
| `output_preview` | string | First 200 chars |
Agent-specific fields:
| Field | Type | Example |
|---|---|---|
| `tool_calls` | array | [{"name": "search", "input": {...}}] |
| `tool_results` | array | [{"tool_use_id": "...", "content": "..."}] |
| `steps_taken` | int | 4 |
| `total_agent_cost` | float | Sum of cost across all agent steps |
Structured Logging Format
Use JSON logs. They are trivially ingestible by Datadog, Splunk, CloudWatch, or BigQuery.
import json
import time
import hashlib
import logging
from datetime import datetime, timezone
logger = logging.getLogger("llm")
def log_llm_call(model, system, user, response, latency_ms, ttft_ms,
                 user_id=None, feature=None):
    usage = response.usage
    # compute_cost comes from Section 1; model must be a key in its price table
    cost = compute_cost(usage, model)
    record = {
        "event": "llm_call",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "system_prompt_hash": hashlib.sha256(system.encode()).hexdigest()[:8],
        "user_id": user_id,
        "feature": feature,
        "input_preview": user[:200],
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0) or 0,
        "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0) or 0,
        "cost_usd": round(cost, 8),
        "latency_ms": latency_ms,
        "ttft_ms": ttft_ms,
        "finish_reason": response.stop_reason,
    }
    logger.info(json.dumps(record))
    return record
Cost Tracking Per User / Per Feature
Aggregate from the structured logs:
-- Daily cost by user
SELECT user_id,
SUM(cost_usd) AS total_cost,
SUM(input_tokens + output_tokens) AS total_tokens,
COUNT(*) AS request_count
FROM llm_calls
WHERE DATE(timestamp) = CURRENT_DATE
GROUP BY user_id
ORDER BY total_cost DESC;
-- Cost by feature with cache efficiency
SELECT feature,
SUM(cost_usd) AS total_cost,
AVG(cache_read_input_tokens::float /
NULLIF(cache_read_input_tokens + cache_creation_input_tokens, 0)) AS cache_hit_rate
FROM llm_calls
GROUP BY feature;
Latency Dashboards
Key metrics to track in Grafana / Datadog:
- P50 / P95 / P99 TTFT — broken down by model and feature
- P50 / P99 total latency — same breakdown
- Token throughput — total tokens/min processed
- Error rate — 4xx and 5xx by type
- Cache hit rate — per feature, over time
Alerting
Set up alerts for anomalies that could indicate runaway cost, service degradation, or abuse:
| Alert | Condition | Severity |
|---|---|---|
| Cost spike | Hourly cost > 2x rolling 7-day average | High |
| Error rate spike | 5xx rate > 5% over 5 minutes | Critical |
| Latency spike | P99 latency > 10s over 5 minutes | High |
| Cache cold | Cache hit rate drops below 70% | Medium |
| Token quota warning | TPD > 80% consumed before noon | Medium |
| Unusual user cost | Single user cost > $10 in one hour | High |
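Two of these conditions are simple enough to express directly in code. The checks below are an illustrative sketch with hypothetical function names and thresholds taken from the table; a real deployment would usually express them as rules in the monitoring system rather than in application code.

```python
def cost_spike(hourly_costs, current_hour_cost, factor=2.0):
    """True when the current hour costs more than `factor` x the trailing average."""
    if not hourly_costs:
        return False  # no baseline yet, so do not alert
    baseline = sum(hourly_costs) / len(hourly_costs)
    return current_hour_cost > factor * baseline


def cache_cold(cache_read_tokens, cache_creation_tokens, threshold=0.70):
    """True when the token-level cache hit rate falls below `threshold`."""
    total = cache_read_tokens + cache_creation_tokens
    if total == 0:
        return False  # no cacheable traffic in the window
    return cache_read_tokens / total < threshold


history = [4.0] * 168  # hypothetical: $4/hour for the past 7 days
print(cost_spike(history, 9.5))   # True  (9.5 > 2 x 4.0)
print(cost_spike(history, 7.0))   # False
print(cache_cold(600, 400))       # True  (60% hit rate)
```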
8. Deployment Patterns
Serverless (AWS Lambda, Google Cloud Run)
Architecture: each API request spins up an isolated function instance. The Anthropic API call happens inside the function. No persistent state.
Pros:
- Zero cost at idle (pay only per invocation)
- Scales automatically to thousands of concurrent instances
- No infrastructure management
- Simple deployment (zip + deploy)
Cons:
- Cold starts: 200ms–2s to initialize a new container. Unacceptable for P99 latency on Sonnet/Opus (adds to an already-slow baseline).
- 15-minute max execution time (AWS Lambda). Long-running agent loops that take > 15 minutes will be killed.
- Limited connection reuse: a cold invocation opens a new TCP/TLS connection to the Anthropic API. Create the client outside the handler so warm invocations can reuse it, but do not rely on connection-level state surviving between invocations.
- Stateless: conversation history must be fetched from an external store (DynamoDB, Redis) on every invocation.
Best for: low-to-medium traffic (< 100 req/s), async workloads (webhooks, notifications), simple non-streaming endpoints.
Lambda configuration tips:
- Set memory to 512MB–1GB (CPU scales with memory)
- Use provisioned concurrency, or Lambda SnapStart on runtimes that support it, to reduce or eliminate cold starts on critical paths
- Deploy in the same AWS region as your Anthropic endpoint
Long-Running Server (ECS, Kubernetes)
Architecture: a persistent process (FastAPI, Flask) runs continuously. A load balancer distributes requests across multiple instances.
Pros:
- Persistent HTTP connections (lower latency, no cold starts)
- In-memory caching of conversation state, model clients, embeddings
- Full control over the runtime environment
- Supports long-running agent loops without timeout risk
Cons:
- Always-on cost: you pay for idle capacity
- More operational complexity (health checks, rolling deployments, autoscaling policies)
- Must handle graceful shutdown for streaming connections during deploys
Best for: high throughput (> 100 req/s), latency-sensitive applications, streaming, agentic workloads with long execution times.
Kubernetes tips:
- Use horizontal pod autoscaling on CPU or custom metrics (queue depth)
- Configure PodDisruptionBudgets for rolling updates without dropping connections
- Use `preStop` hooks to drain in-flight streaming requests before pod termination
Edge Deployment (Cloudflare Workers)
Architecture: JavaScript/WASM code runs at Cloudflare PoPs worldwide (300+ locations), as close to the user as possible.
Pros:
- Ultra-low network latency to users globally (< 20ms to nearest PoP)
- Scales automatically
Cons:
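- Very limited runtime (128MB memory; CPU time is capped per request, on the order of 10ms on the free plan and 30s by default on paid plans, so check current limits)
- Cannot run the Python SDK in the standard JavaScript runtime; use a JavaScript/TypeScript client or call the REST API directly
- Long-running streaming responses are not supported in every configuration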
Best for: request routing, token validation, lightweight proxy layer, edge caching of LLM responses.
Queue-Based Architecture
Architecture: incoming requests are written to a queue (AWS SQS, Google Pub/Sub, RabbitMQ). Workers poll the queue and call the LLM API.
User → API Gateway → SQS Queue → Worker Pool → Anthropic API → Result Store → Webhook/Poll
Pros:
- Handles traffic spikes without dropping requests (queue absorbs burst)
- Natural rate limiting (worker pool size = max concurrency)
- Retry logic is queue-native (dead-letter queues for failed messages)
- Decouples frontend from LLM processing
Cons:
- Inherently async — users cannot get synchronous streaming responses
- Adds latency (queue round-trip)
- More complex infrastructure
Best for: high-burst workloads, async workflows, tasks where the user does not wait synchronously (email generation, report building, batch enrichment).
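The queue-and-worker-pool shape can be sketched in-process with the standard library. This is only an illustration of the pattern: `run_workers` and `fake_llm_call` are hypothetical names, the in-memory `queue.Queue` stands in for SQS or Pub/Sub, and a dict stands in for the result store.

```python
import queue
import threading


def fake_llm_call(prompt):
    """Stand-in for the real API call; replace with your client code."""
    return f"summary of: {prompt}"


def run_workers(prompts, num_workers=3, llm_call=fake_llm_call):
    """Process prompts with a fixed-size worker pool; returns {job_id: result}."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:            # sentinel: no more work for this worker
                jobs.task_done()
                return
            job_id, prompt = item
            try:
                outcome = llm_call(prompt)
            except Exception as exc:    # record the failure instead of killing the worker
                outcome = exc
            with lock:
                results[job_id] = outcome
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job_id, prompt in enumerate(prompts):
        jobs.put((job_id, prompt))
    for _ in threads:
        jobs.put(None)                  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results


print(run_workers(["doc a", "doc b", "doc c", "doc d"], num_workers=2))
```

The pool size is the concurrency cap: with two workers, at most two LLM calls are in flight regardless of how many jobs are queued.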
Hybrid Pattern (Recommended for Most Systems)
                     User Request
                           ↓
          ┌─────────────────────────────────┐
          │   API Gateway / Load Balancer   │
          └────────────────┬────────────────┘
                           ↓
                     Is real-time?
             ┌─ Yes ───────┴─────── No ───┐
             ↓                            ↓
┌─────────────────────────┐  ┌─────────────────────────┐
│   Long-running Server   │  │   SQS Queue + Workers   │
│       (ECS / K8s)       │  │      Async / Batch      │
│    Streaming + Sync     │  └─────────────────────────┘
└─────────────────────────┘
9. Model Routing
Why Route
Models differ dramatically in cost, speed, and capability. A routing layer ensures you use the cheapest adequate model for each query — not your most expensive model for everything.
| Model | Cost (relative) | Speed | Capability |
|---|---|---|---|
| Haiku | 1x | ~5–10x | Simple tasks only |
| Sonnet | 12x | 2–3x | Most tasks |
| Opus | 60x | 1x | Complex reasoning |
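For a system making 10,000 calls/day, routing 70% to Haiku and 30% to Sonnet reduces daily cost by ~64% compared to using Sonnet for everything, assuming similar token counts per call (0.7 × 1 + 0.3 × 12 = 4.3 relative units per call versus 12).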
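The effect of a traffic split can be estimated from the relative costs in the table. The helper below is a rough sketch: `blended_relative_cost` is a hypothetical name, and it assumes every call uses a similar number of tokens, which real traffic rarely does.

```python
def blended_relative_cost(traffic_share, relative_cost):
    """Average per-call cost, in the same relative units as `relative_cost`."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * relative_cost[model] for model, share in traffic_share.items())


RELATIVE_COST = {"haiku": 1, "sonnet": 12, "opus": 60}  # relative costs from the table above

routed = blended_relative_cost({"haiku": 0.7, "sonnet": 0.3}, RELATIVE_COST)
all_sonnet = RELATIVE_COST["sonnet"]
print(f"savings vs all-Sonnet: {1 - routed / all_sonnet:.0%}")
```

Changing the shares shows how sensitive the savings are to the fraction of traffic that truly needs the larger model.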
Router Patterns
Pattern 1: Keyword/Rule-Based
Fast, zero cost, but brittle:
def route_by_rules(query: str) -> str:
    query_lower = query.lower()
    # keywords that usually signal a harder task
    if any(kw in query_lower for kw in ["code", "debug", "analyze", "compare"]):
        return "claude-sonnet-4-5"  # model alias; pin a dated snapshot ID in production
    if len(query.split()) < 30:
        return "claude-haiku-4-5-20251001"
    return "claude-sonnet-4-5"
Pattern 2: Classifier Model
Use a small model (Haiku) to classify query complexity before routing:
def route_by_classifier(query: str, client) -> str:
    classification = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        system="Classify the complexity of this query. Reply with only: SIMPLE or COMPLEX",
        messages=[{"role": "user", "content": query}]
    )
    # normalize case and tolerate trailing punctuation in the label
    label = classification.content[0].text.strip().upper()
    if label.startswith("COMPLEX"):
        return "claude-sonnet-4-5"  # model alias; pin a dated snapshot ID in production
    return "claude-haiku-4-5-20251001"
The classifier call costs roughly $0.0001. If it prevents even one unnecessary Sonnet call (~$0.015+), it pays for itself about 150 times over.
Pattern 3: Cost-Based Fallback
Try cheap model first. If confidence is low or output indicates failure, retry with powerful model:
def route_with_fallback(query: str, client) -> str:
    # Try cheap first
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system="If you are not confident in your answer, start your reply with: UNCERTAIN",
        messages=[{"role": "user", "content": query}]
    )
    text = response.content[0].text
    if text.startswith("UNCERTAIN"):
        # Escalate to powerful model
        response = client.messages.create(
            model="claude-sonnet-4-5",  # model alias; pin a dated snapshot ID in production
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
    return response.content[0].text
Pattern 4: Word-Count Based (Simple Heuristic)
def route_by_length(query: str) -> str:
    word_count = len(query.split())
    if word_count < 50:
        return "claude-haiku-4-5-20251001"
    elif word_count < 200:
        return "claude-sonnet-4-5"
    else:
        # Very long, complex inputs: substitute an Opus model ID here if Sonnet is not enough
        return "claude-sonnet-4-5"
When Is Routing Worth Implementing?
Routing adds complexity. Only add it when:
- Your system makes > 1,000 calls/day (below this, the savings do not justify the engineering effort)
- You have a clear bimodal distribution of query complexity
- Cost is a meaningful budget constraint
For most production systems handling real users, routing + prompt caching together deliver the largest cost reductions.
10. Interview Flashcards
Q1: How does Anthropic prompt caching work and when should you use it?
A: Prompt caching allows you to mark a prefix of your prompt with cache_control: {type: "ephemeral"}. On the first request, Anthropic stores the KV cache for that prefix. On subsequent requests with the same prefix, the model skips reprocessing and reads from cache — making those tokens ~10x cheaper and ~3x faster. Use it whenever you have a stable prefix (system prompt, tool definitions, reference documents) that is shared across many requests. It pays off when the cached prefix is > 1,024 tokens and the cache hit rate is > ~50%. A system prompt sent 10,000 times/day saves ~75% of input token cost with caching enabled.
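A quick way to sanity-check the savings figure is to compute the effective input cost with and without caching. This is a sketch that assumes the relative multipliers from the cost table in Section 1 (cache read ~0.1x, cache write ~1.25x); the function name, token counts, and hit rate are hypothetical.

```python
def cached_input_cost_ratio(
    prefix_tokens: int,
    dynamic_tokens: int,
    hit_rate: float,
    read_mult: float = 0.1,
    write_mult: float = 1.25,
) -> float:
    """Input cost with caching divided by input cost without caching.

    prefix_tokens: the stable, cacheable part of the prompt.
    dynamic_tokens: the per-request part that is never cached.
    hit_rate: fraction of requests that read the prefix from cache;
              the remaining requests pay the cache-write multiplier.
    """
    uncached = prefix_tokens + dynamic_tokens
    cached = (
        hit_rate * prefix_tokens * read_mult
        + (1 - hit_rate) * prefix_tokens * write_mult
        + dynamic_tokens
    )
    return cached / uncached


# Assumed workload: 2,000-token system prompt, 400 dynamic tokens, 99% hit rate.
ratio = cached_input_cost_ratio(2000, 400, 0.99)
print(f"Input cost is {ratio:.0%} of the uncached cost")
```

With these assumed numbers the ratio comes out at roughly a quarter, which lines up with the ~75% savings quoted above. The uncached dynamic tokens are what keep the savings below the 90% discount on the prefix itself.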
Q2: What is the difference between input and output token cost?
A: Input tokens (everything you send: system prompt, user message, conversation history) are billed at one rate. Output tokens (everything the model generates) are billed at 3–5x that rate. This asymmetry means that controlling output length (via max_tokens, structured output formats, and explicit instructions) has a larger cost impact than trimming input. For Claude Haiku, input is ~$0.25/MTok and output is ~$1.25/MTok. A response that generates 500 tokens instead of 100 costs 5x more in output, even if the input is identical.
Q3: How would you implement retry with exponential backoff?
A: The algorithm is: wait = min(cap, base * 2^attempt) + jitter, where jitter is a small random value (e.g., 0–10% of wait) to prevent thundering herd. In Python, wrap the API call in a decorator that catches RateLimitError (HTTP 429) and 5xx errors, sleeps for the calculated duration, and retries up to a max attempt count. Jitter is essential in distributed systems — without it, every client retries at the same time, spiking load on the API exactly when it is already overloaded.
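The formula above can be sketched in a few lines. This version is deliberately SDK-agnostic: `is_retryable` is a hypothetical predicate you supply (with the Anthropic SDK it would typically check for a rate-limit or 5xx error), and the defaults for `base`, `cap`, and `jitter_fraction` are assumptions, not recommended values.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter_fraction=0.1, rng=random.random):
    """wait = min(cap, base * 2^attempt) + jitter, with jitter in [0, 10% of wait)."""
    wait = min(cap, base * (2 ** attempt))
    return wait + wait * jitter_fraction * rng()


def call_with_retry(fn, is_retryable, max_attempts=5, sleep=time.sleep):
    """Call fn(); on a retryable error, sleep with backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors or once attempts are exhausted.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Passing `sleep` and `rng` as parameters keeps the logic testable without real waiting. In production you would also check whether your client library already retries for you before layering this on top.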
Q4: What is TTFT and why does it matter?
A: TTFT (Time to First Token) is the elapsed time from when a request is sent to when the first token of the response arrives. It is the user-perceived “wait time” for interactive applications. For streaming UIs, TTFT is what feels like latency — users start reading after TTFT regardless of total generation time. A system with 200ms TTFT and 8s total generation feels responsive; a system with 3s TTFT and 4s total generation feels slow. TTFT is reduced by prompt caching (avoids reprocessing the prefix), model selection (Haiku has lower TTFT than Opus), and geographic region selection (proximity to API servers).
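Measuring TTFT only requires timestamping the first chunk of a stream. The sketch below works on any iterable of text chunks, so it assumes you adapt your streaming client into one (for example, a generator that yields text deltas). `measure_stream` and its return keys are hypothetical names.

```python
import time
from typing import Iterable


def measure_stream(chunks: Iterable[str], clock=time.perf_counter) -> dict:
    """Consume a stream of text chunks and record TTFT and total time in seconds.

    Pass a lazy iterable so the request is sent after the clock starts;
    otherwise the measured TTFT will be too small.
    """
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start   # first chunk arrived
        parts.append(chunk)
    total = clock() - start
    return {"text": "".join(parts), "ttft_s": ttft, "total_s": total}
```

Logging both numbers per request lets you see the two failure modes separately: a high `ttft_s` points at prompt processing or network distance, while a high `total_s` with a low `ttft_s` points at long outputs.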
Q5: When would you use the Batch API?
A: Use the Batch API for workloads that: (a) do not need real-time results, (b) involve large volumes of requests, and (c) are cost-sensitive. Examples: running an evaluation suite of 5,000 test cases overnight, enriching a database table with LLM-generated metadata, generating weekly summaries for all users. The Batch API offers a 50% cost discount but may take up to 24 hours. The trade-off is purely latency vs cost. Never use it for user-facing interactive features.
Q6: How do you track cost per user in a production LLM system?
A: Log structured JSON for every LLM call including user_id, feature, input_tokens, output_tokens, cache_read_input_tokens, and cache_creation_input_tokens. Compute cost in the log record using your model’s price table. Ship logs to a data warehouse (BigQuery, Redshift, Snowflake). Query daily/hourly cost aggregated by user_id and feature. Set up alerts when a user’s hourly cost exceeds a threshold (e.g., $10/hour) to detect abuse or runaway loops. This also enables per-user billing for SaaS products.
Q7: What are the trade-offs between serverless vs long-running deployment for LLM APIs?
A: Serverless (Lambda/Cloud Run): zero idle cost, auto-scaling, and simple ops, but cold starts add latency, execution time limits (15 minutes on Lambda) are a problem for long agent loops, and the stateless design requires external state storage. Long-running servers (ECS/K8s): persistent connections, no cold starts, and support for streaming and long-running agents, at the price of always-on cost and operational complexity. For most production LLM APIs: use long-running servers for real-time interactive features (latency-critical), and serverless for async/batch workloads (cost-optimized, not latency-critical).
Q8: How do you handle rate limits in a high-throughput system?
A: Four layers of defense: (1) Exponential backoff with jitter for automatic retry on 429 responses. (2) Client-side rate limiting using a semaphore or token bucket so you never exceed TPM/RPM limits (proactive, not reactive). (3) Queue-based architecture where a worker pool with controlled concurrency consumes from a queue, naturally throttling throughput. (4) The Batch API for offline workloads, which keeps that traffic off your real-time rate limits. Monitor the anthropic-ratelimit-tokens-remaining response header to anticipate limits before hitting them.
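Layer (2) is the one candidates most often hand-wave, so here is a minimal token bucket sketch. It is single-threaded and in-process (a real deployment would add a lock or use a shared store), and the class and method names are hypothetical. Set `cost` to 1 to limit requests, or to an estimated token count to limit TPM.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity       # start full
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens will be available (0.0 if already there)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

A worker would call `try_acquire()` before each API request and sleep for `wait_time()` when it returns False, which throttles before the API ever has to send a 429.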
Q9: What would you log for every LLM call?
A: Minimum required fields — Request: request_id, timestamp, model, user_id, feature, system_prompt_hash, temperature, max_tokens. Response: input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, cost_usd, latency_ms, ttft_ms, finish_reason. For agents: tool_calls, tool_results, steps_taken, total_agent_cost. Log in JSON format, ship to a data warehouse for cost analysis and to a metrics pipeline (Prometheus/Datadog) for latency dashboards. Never log full user messages or outputs in systems with PII — log hashes or truncated previews instead.
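As an illustration of the field list above, the sketch below builds one JSON log line per call. The function name and signature are hypothetical; `usage` is assumed to be a dict with the same keys as the API usage object shown in Section 1, and the system prompt is stored only as a SHA-256 hash.

```python
import hashlib
import json
import time
import uuid


def build_llm_log_record(model, user_id, feature, system_prompt, usage,
                         latency_ms, ttft_ms, finish_reason, cost_usd):
    """Return one JSON line describing a single LLM call (no raw prompt text)."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "user_id": user_id,
        "feature": feature,
        # Hash instead of raw text: lets you group by prompt version without storing it.
        "system_prompt_hash": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cache_read_input_tokens": usage.get("cache_read_input_tokens", 0),
        "cache_creation_input_tokens": usage.get("cache_creation_input_tokens", 0),
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "ttft_ms": ttft_ms,
        "finish_reason": finish_reason,
    }
    return json.dumps(record)
```

Emitting one flat JSON object per call keeps the record easy to load into a warehouse table and easy to turn into metrics.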
Q10: How does model routing work and when is it worth implementing?
A: Model routing directs each query to the most cost-appropriate model. Simple queries go to fast/cheap models (Haiku), complex queries go to powerful/expensive models (Sonnet/Opus). Routing strategies include: rule-based (keyword matching, word count thresholds), classifier-based (a cheap Haiku call classifies complexity before routing), and fallback-based (try cheap first, escalate if confidence is low). Routing is worth implementing when: (a) volume > 1,000 calls/day, (b) query complexity has a clear bimodal distribution, and (c) cost is a meaningful constraint. The overhead of the routing logic (one cheap classifier call) is negligible compared to the savings when routing correctly avoids expensive model calls.