Module 09: Production LLM Systems

This module covers everything you need to deploy, operate, and optimize LLM-powered systems in production. It is interview-dense — every section maps directly to common system design and ML engineering questions.


Table of Contents

  1. Cost Optimization
  2. Prompt Caching (Anthropic)
  3. Streaming
  4. Batching (Anthropic Batch API)
  5. Latency Optimization
  6. Rate Limits and Retry Strategy
  7. Observability
  8. Deployment Patterns
  9. Model Routing
  10. Interview Flashcards

1. Cost Optimization

Token Cost Model

LLM APIs charge per token, and the cost is asymmetric:

| Token Type | Relative Cost | Notes |
|---|---|---|
| Input tokens | 1x | Everything you send: system, user, history |
| Output tokens | 3–5x | Everything the model generates |
| Cache read tokens | ~0.1x | Anthropic prompt caching (10x cheaper) |
| Cache write | ~1.25x | One-time cost to populate the cache |

This asymmetry has a direct design implication: controlling output length is more impactful than trimming input. A verbose system prompt costs far less than a verbose model response.

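As a concrete example using approximate Claude Haiku pricing (these figures match the older Claude 3 Haiku generation; newer models are priced differently, so check the current pricing page before budgeting):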

  • Input: $0.25 per million tokens
  • Output: $1.25 per million tokens
  • Cache read: $0.03 per million tokens
  • Cache write: $0.30 per million tokens

Estimating and Tracking Costs

The formula for cost per request:

cost = (input_tokens * price_per_input_MTok / 1_000_000)
     + (output_tokens * price_per_output_MTok / 1_000_000)
     + (cache_read_tokens * price_per_cache_read_MTok / 1_000_000)
     + (cache_write_tokens * price_per_cache_write_MTok / 1_000_000)

Every Anthropic API response includes a usage object:

{
  "usage": {
    "input_tokens": 350,
    "output_tokens": 120,
    "cache_read_input_tokens": 800,
    "cache_creation_input_tokens": 0
  }
}

Track cost per request by attaching user ID and feature tag at call time, then aggregating in your data warehouse:

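# Illustrative prices in USD per million tokens (MTok), keyed by model family.
# Replace these with the current published prices for the models you use.
PRICES = {
    "claude-haiku-4-5": {
        "input": 0.25,
        "output": 1.25,
        "cache_read": 0.03,
        "cache_write": 0.30,
    },
    "claude-sonnet-4-5": {
        "input": 3.00,
        "output": 15.00,
        "cache_read": 0.30,
        "cache_write": 3.75,
    },
}

def compute_cost(usage, model="claude-haiku-4-5"):
    # Match dated IDs such as "claude-haiku-4-5-20251001" by family prefix.
    family = next((name for name in PRICES if model.startswith(name)), None)
    if family is None:
        raise KeyError(f"No pricing configured for model {model!r}")
    p = PRICES[family]
    # Cache fields may be absent or None when caching is not in use.
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return (
        usage.input_tokens * p["input"] / 1_000_000
        + usage.output_tokens * p["output"] / 1_000_000
        + cache_read * p["cache_read"] / 1_000_000
        + cache_write * p["cache_write"] / 1_000_000
    )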

Strategies to Reduce Cost

1. Prompt Caching (biggest win)

For any system where the system prompt or a large document stays the same across many requests, prompt caching cuts input token costs by ~10x. A 1,000-token system prompt sent 10,000 times/day is 10M input tokens: about $2.50/day at the Haiku input price above, versus about $0.30/day in cache reads (after the first write). See Section 2 for full details.

2. Model Routing

Not every query requires your most powerful model. Route cheap, simple tasks to small models:

| Task Type | Recommended Model | Rationale |
|---|---|---|
| Intent classification | Haiku | Deterministic, low token count |
| Summarization of short text | Haiku | Simple extraction task |
| Multi-step reasoning | Sonnet | Balance of cost and accuracy |
| Complex code generation / analysis | Sonnet / Opus | High stakes, needs power |

A routing classifier itself can be a small model call (~50 input tokens, ~5 output tokens), which costs nearly nothing but can gate expensive calls.

3. Output Length Control

Always set max_tokens to a reasonable bound. A model given no constraint may produce 2,000 tokens when 200 would suffice. Use structured output (JSON mode or tool-use schemas) to eliminate preamble (“Sure, here is the JSON you requested…”).

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,  # hard cap — saves cost and latency
    messages=[{"role": "user", "content": prompt}]
)

4. Batch API (50% discount)

For workloads that tolerate up to 24-hour latency — evaluation pipelines, document processing, nightly enrichment — use the Batch API for a 50% cost reduction. See Section 4.

5. Response Caching

For identical inputs (e.g., a FAQ bot), cache the full LLM response in Redis or a CDN. Key on a hash of (model + system_prompt + user_message). Cache TTL depends on content volatility — FAQ answers can be cached for hours, financial data for seconds.

import hashlib, json, redis
 
r = redis.Redis()
 
def cached_completion(system, user, model, client):
    key = hashlib.sha256(
        json.dumps({"model": model, "system": system, "user": user}).encode()
    ).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    resp = client.messages.create(
        model=model, max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": user}]
    )
    r.setex(key, 3600, json.dumps(resp.content[0].text))
    return resp.content[0].text

Cost vs Quality Matrix

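| Model | Input Cost (per MTok) | Output Cost (per MTok) | Speed | Quality |
|---|---|---|---|---|
| Claude Haiku | ~$0.25 | ~$1.25 | Very fast | Good for simple tasks |
| Claude Sonnet | ~$3.00 | ~$15.00 | Moderate | Strong general-purpose |
| Claude Opus | ~$15.00 | ~$75.00 | Slowest | Best for complex tasks |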
Haiku is ~60x cheaper per output token than Opus. For a high-volume system, choosing Haiku over Opus for tasks where Haiku is sufficient can cut costs by 98%.


2. Prompt Caching (Anthropic)

What It Is

Prompt caching allows you to mark a prefix of your prompt as cacheable. On the first request, Anthropic computes and stores the KV cache for that prefix. On subsequent requests that share the same prefix, the model skips recomputation and reads from cache.

Economics:

  • Cache write: ~1.25x the normal input token price (one-time cost)
  • Cache read: ~0.1x the normal input token price (~10x cheaper)
  • Cache read latency: ~3x faster than re-processing the same tokens

For a 1,000-token system prompt used in 10,000 calls:

  • Without caching: 10,000 × 1,000 tokens × $0.25/MTok = $2.50
  • With caching: 1 write ($0.0003) + 9,999 reads ($0.30) ≈ $0.30 (about 88% savings, assuming the cache stays warm)
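
The arithmetic above can be checked with a short script. This is a sketch under the approximate Haiku prices quoted in Section 1 ($0.25 input, $0.30 cache write, $0.03 cache read per MTok). The function name `caching_comparison` is hypothetical, and it assumes the cache stays warm for every call after the first.

```python
def caching_comparison(prompt_tokens, calls,
                       input_price=0.25,        # USD per MTok (approximate, from Section 1)
                       cache_write_price=0.30,
                       cache_read_price=0.03):
    """Return (cost_without_cache, cost_with_cache, savings_fraction) in USD."""
    mtok = prompt_tokens / 1_000_000
    without = calls * mtok * input_price
    # One cache write on the first call, cache reads on every later call.
    with_cache = mtok * cache_write_price + (calls - 1) * mtok * cache_read_price
    return without, with_cache, 1 - with_cache / without

without, with_cache, savings = caching_comparison(prompt_tokens=1_000, calls=10_000)
print(f"without: ${without:.2f}  with: ${with_cache:.2f}  savings: {savings:.0%}")
```

In practice the savings are lower whenever idle gaps force extra cache writes, so treat the result as an upper bound.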

How to Implement

Add cache_control: {"type": "ephemeral"} to the last content block of the prefix you want cached. The cache boundary is at the end of that block.

import anthropic
 
client = anthropic.Anthropic()
 
SYSTEM_PROMPT = """
You are an expert financial analyst with deep knowledge of...
[500+ tokens of stable instructions and context]
"""
 
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # mark this for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What is the P/E ratio of AAPL?"}
    ]
)
 
usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:  {usage.cache_read_input_tokens}")
print(f"Input tokens:       {usage.input_tokens}")

On the first call: cache_creation_input_tokens > 0, cache_read_input_tokens = 0.
On subsequent calls within the cache TTL: cache_read_input_tokens > 0.

What Can Be Cached

Anything that does not vary per request:

  • System prompt (most common)
  • Tool definitions (large tool schemas can be 500+ tokens)
  • Reference documents injected at the start of a conversation
  • Few-shot examples that are the same for all users

Variable content — user messages, conversation history — goes AFTER the cached prefix. The model processes cached tokens instantly and only generates KV states for the new tokens.

Cache Lifetime

The default cache type is ephemeral, with a TTL of approximately 5 minutes. The timer resets on each cache hit. In practice:

  • Actively-used caches (many requests per minute) stay warm indefinitely
  • Idle caches (no requests for 5+ minutes) expire and require a re-write on the next request

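There is no persistent cache tier: every cache entry eventually expires. Anthropic also documents an optional 1-hour TTL at a higher cache-write price, but the default remains about 5 minutes.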
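
To make the sliding TTL concrete, here is a small simulation. It is a sketch, not Anthropic's implementation: `simulate_cache` is a hypothetical helper that assumes a fixed 300-second TTL, refreshed by every request that touches the cache.

```python
def simulate_cache(request_times, ttl_seconds=300):
    """Classify each request (timestamps in seconds, ascending) as a cache 'write' or 'read'."""
    events = []
    expires_at = None  # no cache entry yet
    for t in request_times:
        if expires_at is not None and t < expires_at:
            events.append("read")    # hit: entry is still warm
        else:
            events.append("write")   # miss: entry is absent or expired, so it is re-written
        expires_at = t + ttl_seconds  # both writes and hits restart the timer
    return events

# Requests at 0s, 60s, 120s stay warm; the 400s gap (120s -> 520s) forces a re-write.
events = simulate_cache([0, 60, 120, 520, 530])
print(events)  # ['write', 'read', 'read', 'write', 'read']
```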

Best Practices

  1. Put stable content first, variable content last. The cache key is the exact prefix. Any change to the prefix (even whitespace) invalidates the cache.

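  2. Cache the whole stable prefix with one breakpoint. The prefix is assembled in the order tools, then system, then messages, so if you have tool definitions + a system prompt + a static knowledge base, a single cache_control marker on the last stable block covers everything before it.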

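  3. Minimum cacheable size. Anthropic enforces a minimum prefix length before caching takes effect: 1,024 tokens for Sonnet and Opus and 2,048 for Haiku in the figures quoted here, though the threshold varies by model version, so check the current documentation. Prompts below the threshold are processed normally but not cached, so pad short system prompts with useful context if you want them cached.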

  4. Monitor cache hit rate. Calculate: cache_hit_rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens). Target > 90% for a well-optimized system.

  5. Cache tool definitions. If your agent uses 10 tools with detailed JSON schemas, that can be 2,000+ tokens. Cache the entire tool block.

# Caching tool definitions
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    tools=[
        {
            "name": "search_database",
            "description": "Search the product database...",
            "input_schema": { ... },
            "cache_control": {"type": "ephemeral"}  # cache this tool block
        }
    ],
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": user_query}]
)

Calculating Cache Hit Rate

def cache_hit_rate(usage):
    reads = getattr(usage, "cache_read_input_tokens", 0)
    writes = getattr(usage, "cache_creation_input_tokens", 0)
    total = reads + writes
    if total == 0:
        return 0.0
    return reads / total
 
# Use across a batch of responses to get aggregate rate
rates = [cache_hit_rate(r.usage) for r in responses]
print(f"Average cache hit rate: {sum(rates)/len(rates):.1%}")

3. Streaming

Why Stream

Without streaming, the user sees nothing until the model finishes generating — which can be 5–30 seconds for long responses. Streaming sends tokens to the client as they are generated, so:

  • Perceived latency drops dramatically. Users see output after 200–500ms (TTFT) even if total generation takes 10 seconds.
  • Progressive rendering is possible (markdown, code blocks appear incrementally).
  • Early cancellation: if the user’s intent changes, they can cancel mid-generation and avoid paying for unused output tokens.

How to Implement with Anthropic SDK

import anthropic
import time
 
client = anthropic.Anthropic()
 
start = time.time()
first_token_time = None
 
with client.messages.stream(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in depth."}]
) as stream:
    for text in stream.text_stream:
        if first_token_time is None:
            first_token_time = time.time()
            print(f"\nTTFT: {first_token_time - start:.3f}s\n")
        print(text, end="", flush=True)
 
total = time.time() - start
print(f"\n\nTotal time: {total:.3f}s")

The .text_stream iterator yields decoded text deltas. For low-level event access:

with client.messages.stream(...) as stream:
    for event in stream:
        print(event.type, event)

Event Types

| Event Type | Payload | Notes |
|---|---|---|
| message_start | message object with usage (partial) | First event; contains model info |
| content_block_start | content_block with type/index | Starts a new content block |
| content_block_delta | delta.text or delta.partial_json | Incremental token(s) |
| content_block_stop | index | Block complete |
| message_delta | usage.output_tokens, stop_reason | Final usage stats |
| message_stop | (empty) | Stream complete |

Streaming Tool Calls

When a tool call is being generated, the delta type is input_json_delta with partial_json (a string fragment of the tool input JSON). You must accumulate these fragments and parse the complete JSON only after content_block_stop.

import json
 
tool_input_buffer = ""
 
with client.messages.stream(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    tools=[{"name": "get_weather", "description": "...", "input_schema": {...}}],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            delta = event.delta
            if delta.type == "input_json_delta":
                tool_input_buffer += delta.partial_json
        elif event.type == "content_block_stop":
            if tool_input_buffer:
                tool_input = json.loads(tool_input_buffer)
                print(f"Tool called with: {tool_input}")
                tool_input_buffer = ""

SSE for Web Clients

Use Server-Sent Events to push streaming tokens to a browser. In a FastAPI app:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic
 
app = FastAPI()
client = anthropic.Anthropic()
 
@app.get("/stream")
async def stream_response(query: str):
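    def generate():
        with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[{"role": "user", "content": query}]
        ) as stream:
            for text in stream.text_stream:
                # A delta may contain newlines, which would end the SSE event early.
                # Emit one "data:" line per text line; the browser rejoins them with "\n".
                yield "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"
        yield "data: [DONE]\n\n"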
 
    return StreamingResponse(generate(), media_type="text/event-stream")

On the client:

const evtSource = new EventSource("/stream?query=hello");
evtSource.onmessage = (e) => {
  if (e.data === "[DONE]") { evtSource.close(); return; }
  document.getElementById("output").textContent += e.data;
};

When NOT to Stream

  • Batch processing jobs: streaming adds overhead; full responses are needed before the next step anyway.
  • Downstream systems that need the full response: parsers, validators, and structured output consumers should get the complete text.
  • Short responses (< 100 tokens): the streaming overhead (connection setup, event parsing) may exceed the latency benefit.
  • Serverless with cold starts: streaming requires a persistent connection that conflicts with short-lived Lambda invocations.

4. Batching (Anthropic Batch API)

What It Is

The Message Batches API allows you to submit a large number of requests in a single API call (up to 100,000 requests or 256 MB per batch according to current documentation). Anthropic processes them asynchronously and makes results available for download. You get a 50% discount on all token costs compared to the real-time API.

The trade-off: latency. Batches can take anywhere from a few minutes to 24 hours to complete, depending on load.

Use Cases

  • Offline evaluation runs: running eval sets of 1,000+ test cases nightly
  • Bulk document processing: summarizing, classifying, or extracting from large corpora
  • Data enrichment pipelines: adding LLM-generated metadata to a database table
  • Report generation: producing weekly summaries that do not need to be real-time

Batch API Flow

1. Create batch  →  2. Poll status  →  3. Download results

Step 1 — Create batch:

import anthropic
 
client = anthropic.Anthropic()
 
requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 256,
            "messages": [
                {"role": "user", "content": f"Summarize: {docs[i]}"}
            ]
        }
    }
    for i in range(len(docs))
]
 
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}, Status: {batch.processing_status}")

Step 2 — Poll status:

import time
 
while True:
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Status: {batch.processing_status} | "
          f"Succeeded: {batch.request_counts.succeeded} | "
          f"Errored: {batch.request_counts.errored}")
    if batch.processing_status == "ended":
        break
    time.sleep(60)  # poll every minute

Step 3 — Download results:

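results = {}
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        results[result.custom_id] = result.result.message.content[0].text
    elif result.result.type == "errored":
        # Only errored results carry an error payload.
        print(f"Failed: {result.custom_id}: {result.result.error}")
    else:
        # "canceled" or "expired": there is no message and no error object.
        print(f"Not processed: {result.custom_id} ({result.result.type})")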

Trade-off Summary

| Dimension | Real-time API | Batch API |
|---|---|---|
| Latency | Milliseconds to seconds | Minutes to 24 hours |
| Cost | Full price | 50% discount |
| Max requests | Rate-limited | 100,000 per batch (or 256 MB) |
| Use case | User-facing features | Offline / async jobs |
| Retry on error | Immediate | Re-submit failed items |
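
The "Re-submit failed items" row can be sketched as follows. The sketch works on plain (custom_id, result_type) pairs rather than SDK objects, so it runs without an API key. `build_retry_requests` and the sample data are hypothetical, and the result type names (succeeded, errored, canceled, expired) are assumed to match what the batch results report.

```python
def build_retry_requests(original_requests, outcomes):
    """Return the subset of original requests whose batch result did not succeed.

    original_requests: list of dicts with a "custom_id" key (as submitted).
    outcomes: iterable of (custom_id, result_type) pairs.
    """
    failed_ids = {cid for cid, rtype in outcomes if rtype != "succeeded"}
    return [req for req in original_requests if req["custom_id"] in failed_ids]

original = [{"custom_id": f"doc-{i}", "params": {"max_tokens": 256}} for i in range(4)]
outcomes = [("doc-0", "succeeded"), ("doc-1", "errored"),
            ("doc-2", "succeeded"), ("doc-3", "expired")]

retry = build_retry_requests(original, outcomes)
print([r["custom_id"] for r in retry])  # ['doc-1', 'doc-3']
```

The returned list can be submitted as a new batch. Requests that failed because they were invalid should be fixed first, since resubmitting them unchanged will fail again.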

5. Latency Optimization

Baseline Latency Metrics

  • TTFT (Time to First Token): time from request send to first token received. This is what users perceive as “the wait.” Target: < 500ms for interactive apps.
  • Total generation time: TTFT + (tokens_generated / tokens_per_second). For 500 tokens at 100 tok/s, that is 5 seconds additional.
  • End-to-end latency: includes network round-trips, tool call execution, and any pre/post-processing.

Sources of Latency

| Source | Typical Contribution | Reducible? |
|---|---|---|
| API call overhead | 50–200ms | Yes: region selection |
| Prompt processing | 100–500ms | Yes: caching |
| Token generation | 1–30 seconds | Yes: model selection |
| Tool call (1 round) | +1–3s per round | Yes: parallelism |
| Network (each hop) | 10–100ms | Yes: edge / CDN |

Strategies

1. Model Selection

Haiku is 5–10x faster than Opus at token generation. For latency-critical paths (< 1s total), Haiku is the only viable choice. Use a latency budget:

Total budget:  2,000ms
- API overhead:  150ms
- Prompt cache:  100ms
- 200 output tokens @ 80 tok/s Haiku: 2,500ms  ← too slow
- 100 output tokens @ 80 tok/s Haiku (max_tokens=100): 1,250ms  ← ok
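
The budget arithmetic can be turned into a helper that works out how many output tokens fit. This is a minimal sketch: the function names are hypothetical, and 80 tok/s is the illustrative throughput from the budget above, not a measured value.

```python
def max_output_tokens_for_budget(budget_ms, overhead_ms, prompt_ms, tokens_per_second):
    """Largest output length (in tokens) whose generation time fits the remaining budget."""
    remaining_ms = budget_ms - overhead_ms - prompt_ms
    if remaining_ms <= 0:
        return 0
    return int(remaining_ms / 1000 * tokens_per_second)

def generation_ms(output_tokens, tokens_per_second):
    """Time spent generating output_tokens at a steady decode rate."""
    return output_tokens / tokens_per_second * 1000

limit = max_output_tokens_for_budget(2000, overhead_ms=150, prompt_ms=100, tokens_per_second=80)
print(limit)                   # 140
print(generation_ms(200, 80))  # 2500.0
print(generation_ms(100, 80))  # 1250.0
```

Setting max_tokens at or below the computed limit keeps worst-case generation time inside the budget.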

2. Prompt Caching

Caching the prompt prefix reduces TTFT by ~3x for cached content. A 2,000-token system prompt that takes 400ms to process takes 130ms when cached.

3. Streaming

Streaming does not reduce total time — it hides latency. The user sees the first token at TTFT, so the experience feels fast even if total generation takes 10 seconds. For interactive UIs, streaming is mandatory.

4. Parallel Tool Calls

In agentic systems, many tool calls are independent. Run them concurrently using asyncio.gather or concurrent.futures.ThreadPoolExecutor:

import asyncio
 
async def run_tools_parallel(tool_calls, tool_functions):
    tasks = [
        tool_functions[tc.name](**tc.input)
        for tc in tool_calls
    ]
    results = await asyncio.gather(*tasks)
    return results

A 3-tool sequential chain at 500ms each = 1,500ms. In parallel = ~500ms.
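
The difference is easy to measure with stand-in tools. In this sketch `fake_tool` is a hypothetical placeholder that sleeps for 50ms instead of doing real I/O, and the tool names are made up for illustration.

```python
import asyncio
import time

async def fake_tool(name, delay_s=0.05):
    """Stand-in for an I/O-bound tool call (sleeps instead of doing real work)."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"

async def run_sequential(names):
    # Each call waits for the previous one to finish.
    return [await fake_tool(n) for n in names]

async def run_parallel(names):
    # All calls are started together; total time is roughly the slowest call.
    return await asyncio.gather(*(fake_tool(n) for n in names))

def timed(coro_fn, names):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(names))
    return results, time.perf_counter() - start

names = ["search", "weather", "calendar"]
seq_results, seq_s = timed(run_sequential, names)
par_results, par_s = timed(run_parallel, names)
print(f"sequential: {seq_s:.3f}s  parallel: {par_s:.3f}s")
```

The speedup only applies to independent calls. If one tool needs another tool's output, those two still have to run in sequence.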

5. Region Selection

Region selection applies mainly when you access Claude through a cloud platform such as Amazon Bedrock or Google Cloud Vertex AI, where you choose the region explicitly. Place your servers close to the endpoint you call; this can reduce network RTT by 20–100ms.

6. Connection Reuse

Use persistent HTTP connections (the SDK handles this by default). Avoid creating a new anthropic.Anthropic() client per request — create it once at application startup.

P50 vs P99 Latency

| Metric | Definition | Why It Matters |
|---|---|---|
| P50 | Median latency (50th percentile) | What most users experience |
| P95 | 95th percentile | Captures slow tail |
| P99 | 99th percentile (worst 1%) | What your SLA should guarantee |

LLM latency distributions are heavy-tailed. P99 can be 5–10x P50 due to:

  • Occasional long outputs
  • Token generation variance
  • Network jitter
  • Cold starts in serverless

Alert on P99, not mean. A mean latency of 1.5s with a P99 of 15s means 1% of users wait 15 seconds — bad for production.
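
A small worked example shows how a reasonable-looking mean can hide a bad tail. The latency values are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition (monitoring systems often interpolate instead).

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in seconds: 98 fast requests and 2 very slow ones.
latencies = [1.0] * 98 + [15.0] * 2

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}s  p50={p50:.1f}s  p99={p99:.1f}s")
# mean=1.28s  p50=1.0s  p99=15.0s
```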

Track P99 in your metrics pipeline:

# Prometheus histogram (Python)
from prometheus_client import Histogram
 
llm_latency = Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
 
with llm_latency.time():
    response = client.messages.create(...)

6. Rate Limits and Retry Strategy

Anthropic Rate Limits

Limits are enforced at three levels:

| Limit Type | Acronym | Scope |
|---|---|---|
| Requests per minute | RPM | Number of requests accepted per minute |
| Tokens per minute | TPM | Input + output tokens per 60 seconds |
| Tokens per day | TPD | Daily cap; resets at midnight UTC |

Limits vary by tier (Build → Scale → Enterprise). Check your tier limits in the Anthropic console. When you hit a limit, the API returns HTTP 429.

Response headers tell you the current state:

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 842
anthropic-ratelimit-requests-reset: 2024-01-01T00:01:00Z
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 12500

Exponential Backoff with Jitter

The correct algorithm for retrying a 429 (or any transient 5xx):

wait = min(cap, base * 2^attempt) + random(0, jitter)

  • base: initial wait (e.g., 1 second)
  • cap: maximum wait (e.g., 60 seconds)
  • jitter: random extra delay that prevents a thundering herd when many clients retry simultaneously

As a decorator:

import time
import random
import functools
import anthropic
 
def with_retry(max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Decorator for exponential backoff with jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except anthropic.RateLimitError as e:
                    if attempt == max_attempts - 1:
                        raise
                    delay = min(max_delay, base_delay * (2 ** attempt))
                    jitter = random.uniform(0, delay * 0.1)
                    wait = delay + jitter
                    print(f"Rate limited. Attempt {attempt+1}/{max_attempts}. "
                          f"Waiting {wait:.1f}s...")
                    time.sleep(wait)
                except anthropic.APIStatusError as e:
                    if e.status_code >= 500 and attempt < max_attempts - 1:
                        delay = min(max_delay, base_delay * (2 ** attempt))
                        time.sleep(delay + random.uniform(0, 1))
                    else:
                        raise
        return wrapper
    return decorator
 
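# Create the client once at import time so connections are reused across calls.
# Note: the SDK also retries some errors itself (see its max_retries option).
client = anthropic.Anthropic()

@with_retry(max_attempts=5)
def call_claude(prompt):
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )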

Strategies When Hitting Limits

  1. Queue-based rate control: use a token bucket or sliding-window queue to smooth request bursts. An asyncio.Semaphore is a simpler stand-in that caps concurrency rather than rate:

import asyncio
 
semaphore = asyncio.Semaphore(10)  # max 10 concurrent requests
 
async def throttled_call(prompt):
    async with semaphore:
        return await async_client.messages.create(...)

  2. Reduce parallelism: if RPM is the bottleneck, reduce concurrent workers. If TPM is the bottleneck, reduce tokens per request (trim context, lower max_tokens).

  3. Stagger requests: for batch jobs, add a small delay between requests to stay under TPM limits.

  4. Upgrade tier: for sustained high volume, tier upgrades give 10–100x higher limits.

  5. Use the Batch API: batch requests are subject to separate batch limits instead of the real-time RPM/TPM limits, so high-volume offline work does not compete with user-facing traffic.
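
The token bucket mentioned in the first strategy can be sketched in a few lines. This is an illustrative, single-threaded version, not a production limiter: the class and parameter names are hypothetical, and the injectable `clock` exists only so the demo runs without sleeping. The `cost` argument can be 1 per request (for RPM) or an estimated token count (for TPM).

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the bucket can be tested without sleeping
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, cost=1):
        """Take `cost` tokens if available; return False if the caller must wait."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Demo with a fake clock: 2 requests/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]   # third call exceeds the burst
fake_now[0] = 0.5                                  # half a second later: one token refilled
after_wait = bucket.try_acquire()
print(burst, after_wait)  # [True, True, False] True
```

Unlike a semaphore, the bucket bounds the request rate over time, which maps more directly onto per-minute limits.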


7. Observability

Good observability is what separates a prototype from a production system. You need to know: what happened, when, how much it cost, and how fast it was — for every single LLM call.

What to Log for Every LLM Call

Request fields:

| Field | Type | Example |
|---|---|---|
| request_id | string | UUID generated by your app |
| timestamp | ISO 8601 | 2024-01-15T10:23:45.123Z |
| model | string | claude-haiku-4-5-20251001 |
| temperature | float | 0.7 |
| max_tokens | int | 512 |
| system_prompt_hash | string | SHA256 of the system prompt |
| user_id | string | For per-user cost tracking |
| feature | string | chat, summarize, classify |
| input_preview | string | First 200 chars of user message |

Response fields:

| Field | Type | Example |
|---|---|---|
| input_tokens | int | 350 |
| output_tokens | int | 128 |
| cache_read_input_tokens | int | 800 |
| cache_creation_input_tokens | int | 0 |
| cost_usd | float | 0.000214 |
| latency_ms | int | 1243 |
| ttft_ms | int | 312 |
| finish_reason | string | end_turn, max_tokens |
| output_preview | string | First 200 chars |

Agent-specific fields:

| Field | Type | Example |
|---|---|---|
| tool_calls | array | [{"name": "search", "input": {...}}] |
| tool_results | array | [{"tool_use_id": "...", "content": "..."}] |
| steps_taken | int | 4 |
| total_agent_cost | float | Sum of cost across all agent steps |

Structured Logging Format

Use JSON logs. They are trivially ingestible by Datadog, Splunk, CloudWatch, or BigQuery.

import json
import time
import hashlib
import logging
 
logger = logging.getLogger("llm")
 
def log_llm_call(model, system, user, response, latency_ms, ttft_ms,
                 user_id=None, feature=None):
    usage = response.usage
    cost = compute_cost(usage, model)
 
    record = {
        "event": "llm_call",
        "model": model,
        "system_prompt_hash": hashlib.sha256(system.encode()).hexdigest()[:8],
        "user_id": user_id,
        "feature": feature,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0),
        "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0),
        "cost_usd": round(cost, 8),
        "latency_ms": latency_ms,
        "ttft_ms": ttft_ms,
        "finish_reason": response.stop_reason,
    }
    logger.info(json.dumps(record))
    return record

Cost Tracking Per User / Per Feature

Aggregate from the structured logs:

-- Daily cost by user
SELECT user_id,
       SUM(cost_usd) AS total_cost,
       SUM(input_tokens + output_tokens) AS total_tokens,
       COUNT(*) AS request_count
FROM llm_calls
WHERE DATE(timestamp) = CURRENT_DATE
GROUP BY user_id
ORDER BY total_cost DESC;
 
-- Cost by feature with cache efficiency
SELECT feature,
       SUM(cost_usd) AS total_cost,
       AVG(cache_read_input_tokens::float /
           NULLIF(cache_read_input_tokens + cache_creation_input_tokens, 0)) AS cache_hit_rate
FROM llm_calls
GROUP BY feature;

Latency Dashboards

Key metrics to track in Grafana / Datadog:

  1. P50 / P95 / P99 TTFT — broken down by model and feature
  2. P50 / P99 total latency — same breakdown
  3. Token throughput — total tokens/min processed
  4. Error rate — 4xx and 5xx by type
  5. Cache hit rate — per feature, over time

Alerting

Set up alerts for anomalies that could indicate runaway cost, service degradation, or abuse:

| Alert | Condition | Severity |
|---|---|---|
| Cost spike | Hourly cost > 2x rolling 7-day average | High |
| Error rate spike | 5xx rate > 5% over 5 minutes | Critical |
| Latency spike | P99 latency > 10s over 5 minutes | High |
| Cache cold | Cache hit rate drops below 70% | Medium |
| Token quota warning | TPD > 80% consumed before noon | Medium |
| Unusual user cost | Single user cost > $10 in one hour | High |
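
As an illustration, the cost-spike rule can be expressed as a small predicate. This is a sketch with a hypothetical function name and made-up data. A real deployment would normally define the rule in the monitoring system itself (Grafana, Datadog) rather than in application code.

```python
def is_cost_spike(current_hour_cost, trailing_hourly_costs, multiplier=2.0):
    """True if the current hour's cost exceeds `multiplier` times the trailing average."""
    if not trailing_hourly_costs:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = sum(trailing_hourly_costs) / len(trailing_hourly_costs)
    return current_hour_cost > multiplier * baseline

# Hypothetical history: 7 days x 24 hours at $1.00/hour.
history = [1.0] * (7 * 24)
print(is_cost_spike(1.8, history))  # False
print(is_cost_spike(2.5, history))  # True
```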

8. Deployment Patterns

Serverless (AWS Lambda, Google Cloud Run)

Architecture: each API request spins up an isolated function instance. The Anthropic API call happens inside the function. No persistent state.

Pros:

  • Zero cost at idle (pay only per invocation)
  • Auto-scales to thousands of concurrent instances automatically
  • No infrastructure management
  • Simple deployment (zip + deploy)

Cons:

  • Cold starts: 200ms–2s to initialize a new container. Unacceptable for P99 latency on Sonnet/Opus (adds to an already-slow baseline).
  • 15-minute max execution time (AWS Lambda). Long-running agent loops that take > 15 minutes will be killed.
  • No persistent connections: each invocation opens a new TCP connection to the Anthropic API. Mitigate with keep-alive but cannot maintain connection-level state.
  • Stateless: conversation history must be fetched from an external store (DynamoDB, Redis) on every invocation.

Best for: low-to-medium traffic (< 100 req/s), async workloads (webhooks, notifications), simple non-streaming endpoints.

Lambda configuration tips:

  • Set memory to 512MB–1GB (CPU scales with memory)
  • Use provisioned concurrency, or Lambda SnapStart where your runtime supports it, to reduce cold starts on critical paths
  • Deploy in the same AWS region as your Anthropic endpoint

Long-Running Server (ECS, Kubernetes)

Architecture: a persistent process (FastAPI, Flask) runs continuously. A load balancer distributes requests across multiple instances.

Pros:

  • Persistent HTTP connections (lower latency, no cold starts)
  • In-memory caching of conversation state, model clients, embeddings
  • Full control over the runtime environment
  • Supports long-running agent loops without timeout risk

Cons:

  • Always-on cost: you pay for idle capacity
  • More operational complexity (health checks, rolling deployments, autoscaling policies)
  • Must handle graceful shutdown for streaming connections during deploys

Best for: high throughput (> 100 req/s), latency-sensitive applications, streaming, agentic workloads with long execution times.

Kubernetes tips:

  • Use horizontal pod autoscaling on CPU or custom metrics (queue depth)
  • Configure PodDisruptionBudgets for rolling updates without dropping connections
  • Use preStop hooks to drain in-flight streaming requests before pod termination

Edge Deployment (Cloudflare Workers)

Architecture: JavaScript/WASM code runs at Cloudflare PoPs worldwide (300+ locations), as close to the user as possible.

Pros:

  • Ultra-low network latency to users globally (< 20ms to nearest PoP)
  • Scales automatically

Cons:

  • Very limited runtime (128MB memory, 50ms CPU time limit for free; 30s for paid)
  • Cannot run the Python SDK in the standard JavaScript/WASM runtime; use the TypeScript SDK or call the REST API directly
  • No support for long-running streaming in all cases

Best for: request routing, token validation, lightweight proxy layer, edge caching of LLM responses.

Queue-Based Architecture

Architecture: incoming requests are written to a queue (AWS SQS, Google Pub/Sub, RabbitMQ). Workers poll the queue and call the LLM API.

User → API Gateway → SQS Queue → Worker Pool → Anthropic API → Result Store → Webhook/Poll

Pros:

  • Handles traffic spikes without dropping requests (queue absorbs burst)
  • Natural rate limiting (worker pool size = max concurrency)
  • Retry logic is queue-native (dead-letter queues for failed messages)
  • Decouples frontend from LLM processing

Cons:

  • Inherently async — users cannot get synchronous streaming responses
  • Adds latency (queue round-trip)
  • More complex infrastructure

Best for: high-burst workloads, async workflows, tasks where the user does not wait synchronously (email generation, report building, batch enrichment).
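
The worker-pool idea can be sketched in-process with asyncio.Queue. This is only an illustration of the pattern: a real deployment would use a durable queue such as SQS, and `fake_llm_call` is a hypothetical stand-in for the Anthropic API call.

```python
import asyncio

async def fake_llm_call(prompt):
    """Stand-in for the real API call (just echoes after a short delay)."""
    await asyncio.sleep(0.01)
    return f"summary of {prompt}"

async def worker(queue, results):
    while True:
        job_id, prompt = await queue.get()
        try:
            results[job_id] = await fake_llm_call(prompt)
        finally:
            queue.task_done()

async def process_jobs(jobs, num_workers=3):
    queue = asyncio.Queue()
    results = {}
    for job in jobs:
        queue.put_nowait(job)
    # Pool size caps concurrency, which is the "natural rate limiting" noted above.
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(num_workers)]
    await queue.join()            # wait until every job has been processed
    for w in workers:
        w.cancel()                # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results

jobs = [(f"job-{i}", f"doc {i}") for i in range(5)]
results = asyncio.run(process_jobs(jobs))
print(sorted(results))
```

The sketch omits retries and dead-letter handling, which a managed queue provides.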

                    ┌─────────────────────────────────┐
User Request        │ API Gateway / Load Balancer      │
    ↓               └─────────────┬───────────────────┘
Is real-time?                     │
    ├── Yes ──────────────────────┤
    │                             ↓
    │               ┌─────────────────────────┐
    │               │  Long-running Server     │
    │               │  (ECS / K8s)            │
    │               │  Streaming + Sync        │
    │               └─────────────────────────┘
    │
    └── No ───────────────────────┤
                                  ↓
                    ┌─────────────────────────┐
                    │  SQS Queue + Workers    │
                    │  Async / Batch          │
                    └─────────────────────────┘

9. Model Routing

Why Route

Models differ dramatically in cost, speed, and capability. A routing layer ensures you use the cheapest adequate model for each query — not your most expensive model for everything.

| Model | Cost (relative) | Speed (relative to Opus) | Capability |
|---|---|---|---|
| Haiku | 1x | ~5–10x | Simple tasks only |
| Sonnet | 12x | 2–3x | Most tasks |
| Opus | 60x | 1x | Complex reasoning |

For a system making 10,000 calls/day, routing 70% to Haiku and 30% to Sonnet gives a blended cost of 0.7 × 1 + 0.3 × 12 = 4.3x versus 12x for all-Sonnet, a reduction of ~64% (assuming similar token counts per call).
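
The blended cost of a routing mix can be checked directly from the relative costs in the table. This is a sketch that assumes every call uses a similar number of tokens regardless of model; the function name is hypothetical.

```python
def blended_relative_cost(mix, relative_cost):
    """Average per-call cost for a traffic mix, in units of the cheapest model's cost."""
    return sum(share * relative_cost[model] for model, share in mix.items())

# Relative costs from the table above (Haiku = 1x).
RELATIVE_COST = {"haiku": 1, "sonnet": 12, "opus": 60}

routed = blended_relative_cost({"haiku": 0.7, "sonnet": 0.3}, RELATIVE_COST)
all_sonnet = blended_relative_cost({"sonnet": 1.0}, RELATIVE_COST)
savings = 1 - routed / all_sonnet
print(f"routed={routed:.1f}x  all-Sonnet={all_sonnet:.1f}x  savings={savings:.0%}")
```

Changing the mix or the baseline model shows how sensitive the savings are to the share of traffic that can be handled by the cheaper model.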

Router Patterns

Pattern 1: Keyword/Rule-Based

Fast, zero cost, but brittle:

def route_by_rules(query: str) -> str:
    query_lower = query.lower()
    # Keywords that hint at harder tasks go straight to the stronger model.
    if any(kw in query_lower for kw in ["code", "debug", "analyze", "compare"]):
        return "claude-sonnet-4-5-20250929"
    # Short queries without those keywords are treated as simple.
    if len(query.split()) < 30:
        return "claude-haiku-4-5-20251001"
    return "claude-sonnet-4-5-20250929"

Pattern 2: Classifier Model

Use a small model (Haiku) to classify query complexity before routing:

def route_by_classifier(query: str, client) -> str:
    classification = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        system="Classify the complexity of this query. Reply with only: SIMPLE or COMPLEX",
        messages=[{"role": "user", "content": query}]
    )
    # Normalize case and tolerate trailing punctuation such as "COMPLEX."
    label = classification.content[0].text.strip().upper()
    if label.startswith("COMPLEX"):
        return "claude-sonnet-4-5-20250929"
    return "claude-haiku-4-5-20251001"

The classifier call costs roughly $0.0001. Each time it steers a query away from an expensive-model call ($0.015+), it pays for itself about 150 times over.

Pattern 3: Cost-Based Fallback

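Try the cheap model first. If it signals low confidence or its output indicates failure, retry with a more powerful model. Escalated queries pay for both calls, so this pattern only saves money when most queries succeed on the first try: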

def route_with_fallback(query: str, client) -> str:
    # Try the cheap model first
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system="If you are not confident in your answer, start your reply with: UNCERTAIN",
        messages=[{"role": "user", "content": query}]
    )
    text = response.content[0].text
    # lstrip() tolerates leading whitespace before the marker
    if text.lstrip().startswith("UNCERTAIN"):
        # Escalate to the more powerful model (no UNCERTAIN instruction this time)
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
    return response.content[0].text

Pattern 4: Word-Count Based (Simple Heuristic)

def route_by_length(query: str) -> str:
    word_count = len(query.split())
    if word_count < 50:
        return "claude-haiku-4-5-20251001"
    elif word_count < 200:
        return "claude-sonnet-4-5-20250929"
    else:
        # Opus for very long, complex inputs (example Opus ID; substitute the one you deploy)
        return "claude-opus-4-1-20250805"

When Is Routing Worth Implementing?

Routing adds complexity. Only add it when:

  • Your system makes > 1,000 calls/day (below this, the savings do not justify the engineering effort)
  • You have a clear bimodal distribution of query complexity
  • Cost is a meaningful budget constraint

For most production systems handling real users, routing + prompt caching together deliver the largest cost reductions.


10. Interview Flashcards

Q1: How does Anthropic prompt caching work and when should you use it?

A: Prompt caching allows you to mark a prefix of your prompt with cache_control: {type: "ephemeral"}. On the first request, Anthropic stores the KV cache for that prefix. On subsequent requests with the same prefix, the model skips reprocessing and reads from cache — making those tokens ~10x cheaper and ~3x faster. Use it whenever you have a stable prefix (system prompt, tool definitions, reference documents) that is shared across many requests. It pays off when the cached prefix is > 1,024 tokens and the cache hit rate is > ~50%. A system prompt sent 10,000 times/day saves ~75% of input token cost with caching enabled.


Q2: What is the difference between input and output token cost?

A: Input tokens (everything you send: system prompt, user message, conversation history) are billed at one rate. Output tokens (everything the model generates) are billed at 3–5x that rate. This asymmetry means that controlling output length (via max_tokens, structured output formats, and explicit instructions) has a larger cost impact than trimming input. For Claude Haiku, input is ~$0.25/MTok and output is ~$1.25/MTok. A response that generates 500 tokens instead of 100 costs 5x more in output, even if the input is identical.


Q3: How would you implement retry with exponential backoff?

A: The algorithm is: wait = min(cap, base * 2^attempt) + jitter, where jitter is a small random value (e.g., 0–10% of wait) to prevent thundering herd. In Python, wrap the API call in a decorator that catches RateLimitError (HTTP 429) and 5xx errors, sleeps for the calculated duration, and retries up to a max attempt count. Jitter is essential in distributed systems — without it, every client retries at the same time, spiking load on the API exactly when it is already overloaded.
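
A minimal sketch of that formula, written as a plain wrapper function instead of a decorator to keep it short. `RetryableError` is a hypothetical stand-in for whatever 429/5xx exception types your client raises, and the `sleep` parameter exists only so the delay can be stubbed in tests.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical stand-in for a 429 or 5xx error raised by your API client."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """wait = min(cap, base * 2^attempt) + jitter, with jitter in [0, 10% of wait]."""
    wait = min(cap, base * (2 ** attempt))
    return wait + random.uniform(0, 0.1 * wait)


def call_with_retry(fn, max_attempts: int = 5, base: float = 1.0,
                    cap: float = 60.0, sleep=time.sleep):
    """Call fn(); on RetryableError, sleep with backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, base, cap))
```

In real code you would catch the SDK's own rate-limit and server-error exceptions in place of `RetryableError`.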


Q4: What is TTFT and why does it matter?

A: TTFT (Time to First Token) is the elapsed time from when a request is sent to when the first token of the response arrives. It is the user-perceived “wait time” for interactive applications. For streaming UIs, TTFT is what feels like latency — users start reading after TTFT regardless of total generation time. A system with 200ms TTFT and 8s total generation feels responsive; a system with 3s TTFT and 4s total generation feels slow. TTFT is reduced by prompt caching (avoids reprocessing the prefix), model selection (Haiku has lower TTFT than Opus), and geographic region selection (proximity to API servers).
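
One way to measure TTFT alongside total latency is to time a stream of text chunks. The sketch below is deliberately SDK-agnostic: `open_stream` is a hypothetical zero-argument callable that sends the request and returns an iterable of text chunks, and `fake_stream` simulates one.

```python
import time
from typing import Callable, Iterable


def measure_stream(open_stream: Callable[[], Iterable[str]]) -> dict:
    """Return the full text plus TTFT and total latency in seconds."""
    start = time.perf_counter()  # start the clock before the request is sent
    ttft = None
    parts = []
    for chunk in open_stream():
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk has arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"text": "".join(parts), "ttft_s": ttft, "total_s": total}


def fake_stream():
    """Simulated model stream: 50 ms to the first chunk, 50 ms to the next."""
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.05)
    yield " world"
```

Calling `measure_stream(fake_stream)` returns both numbers, which map directly onto the `ttft_ms` and `latency_ms` fields you would log.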


Q5: When would you use the Batch API?

A: Use the Batch API for workloads that: (a) do not need real-time results, (b) involve large volumes of requests, and (c) are cost-sensitive. Examples: running an evaluation suite of 5,000 test cases overnight, enriching a database table with LLM-generated metadata, generating weekly summaries for all users. The Batch API offers a 50% cost discount but may take up to 24 hours. The trade-off is purely latency vs cost. Never use it for user-facing interactive features.


Q6: How do you track cost per user in a production LLM system?

A: Log structured JSON for every LLM call including user_id, feature, input_tokens, output_tokens, cache_read_input_tokens, and cache_creation_input_tokens. Compute cost in the log record using your model’s price table. Ship logs to a data warehouse (BigQuery, Redshift, Snowflake). Query daily/hourly cost aggregated by user_id and feature. Set up alerts when a user’s hourly cost exceeds a threshold (e.g., $10/hour) to detect abuse or runaway loops. This also enables per-user billing for SaaS products.
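
The aggregation step would normally be a warehouse query, but the logic can be sketched in plain Python. The prices are the approximate Haiku figures quoted earlier in this module (USD per million tokens); the record shape and function names are assumptions for illustration.

```python
from collections import defaultdict

# Approximate Haiku prices from earlier in this module, USD per million tokens.
PRICES = {
    "input_tokens": 0.25,
    "output_tokens": 1.25,
    "cache_read_input_tokens": 0.03,
    "cache_creation_input_tokens": 0.30,
}


def record_cost(rec: dict) -> float:
    """Cost in USD of one logged call; missing token fields count as zero."""
    return sum(rec.get(field, 0) * price for field, price in PRICES.items()) / 1_000_000


def cost_by_user(records: list) -> dict:
    """Sum cost per user_id across log records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["user_id"]] += record_cost(rec)
    return dict(totals)


def users_over_threshold(totals: dict, limit_usd: float) -> list:
    """User IDs whose spend exceeds the alert threshold."""
    return sorted(user for user, cost in totals.items() if cost > limit_usd)
```

Feeding this one hour of records at a time gives the hourly per-user alert described above.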


Q7: What are the trade-offs between serverless vs long-running deployment for LLM APIs?

A: Serverless (Lambda/Cloud Run): zero idle cost, auto-scaling, and simple ops, but cold starts add latency, execution time limits (15 minutes on Lambda) are a problem for long agent loops, and the stateless design requires external state storage. Long-running servers (ECS/K8s): persistent connections, no cold starts, and support for streaming and long-running agents, at the price of always-on cost and more operational complexity. For most production LLM APIs: use long-running servers for real-time interactive features (latency-critical), and serverless for async/batch workloads (cost-optimized, not latency-critical).


Q8: How do you handle rate limits in a high-throughput system?

A: Four layers of defense: (1) Exponential backoff with jitter for automatic retry on 429 responses. (2) Client-side rate limiting using a semaphore or token bucket so you never exceed TPM/RPM limits; this is proactive, where backoff is reactive. (3) Queue-based architecture where a worker pool with controlled concurrency consumes from a queue, naturally throttling throughput. (4) The Batch API for offline workloads, which is governed by its own separate limits and keeps offline traffic from consuming your real-time quota. Monitor anthropic-ratelimit-tokens-remaining headers to anticipate limits before hitting them.
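
Layer (2) can be sketched as a small token bucket. This is a single-threaded illustration under stated assumptions: the class and parameter names are hypothetical, and a multi-threaded or multi-process system would need a lock or a shared store such as Redis.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` units per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` units if available; return False if the caller should wait."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For an RPM limit, use `cost=1` per request; for a TPM limit, pass the estimated token count of the request as `cost`.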


Q9: What would you log for every LLM call?

A: Minimum required fields — Request: request_id, timestamp, model, user_id, feature, system_prompt_hash, temperature, max_tokens. Response: input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, cost_usd, latency_ms, ttft_ms, finish_reason. For agents: tool_calls, tool_results, steps_taken, total_agent_cost. Log in JSON format, ship to a data warehouse for cost analysis and to a metrics pipeline (Prometheus/Datadog) for latency dashboards. Never log full user messages or outputs in systems with PII — log hashes or truncated previews instead.
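
A sketch of a log record builder covering those fields. The function name and argument layout are assumptions; the point it illustrates is that the system prompt is stored as a hash, never as raw text.

```python
import hashlib
import json
import time
import uuid


def build_log_record(*, model: str, user_id: str, feature: str, system_prompt: str,
                     temperature: float, max_tokens: int, usage: dict,
                     cost_usd: float, latency_ms: float, ttft_ms: float,
                     finish_reason: str) -> dict:
    """Assemble one JSON-serializable record per LLM call."""
    prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]
    return {
        # Request fields
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "user_id": user_id,
        "feature": feature,
        "system_prompt_hash": prompt_hash,  # hash only, so no prompt text or PII is logged
        "temperature": temperature,
        "max_tokens": max_tokens,
        # Response fields
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cache_read_input_tokens": usage.get("cache_read_input_tokens", 0),
        "cache_creation_input_tokens": usage.get("cache_creation_input_tokens", 0),
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "ttft_ms": ttft_ms,
        "finish_reason": finish_reason,
    }
```

Emitting `json.dumps(record)` on one line per call keeps the logs easy to ship to both the warehouse and the metrics pipeline.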


Q10: How does model routing work and when is it worth implementing?

A: Model routing directs each query to the most cost-appropriate model. Simple queries go to fast/cheap models (Haiku), complex queries go to powerful/expensive models (Sonnet/Opus). Routing strategies include: rule-based (keyword matching, word count thresholds), classifier-based (a cheap Haiku call classifies complexity before routing), and fallback-based (try cheap first, escalate if confidence is low). Routing is worth implementing when: (a) volume > 1,000 calls/day, (b) query complexity has a clear bimodal distribution, and (c) cost is a meaningful constraint. The overhead of the routing logic (one cheap classifier call) is negligible compared to the savings when routing correctly avoids expensive model calls.