Module 09 References

Authoritative sources for the concepts covered in this module.
All links are official Anthropic documentation unless otherwise noted.

Anthropic Documentation

Prompt Caching

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

Complete guide to prompt caching: how to mark content blocks with cache_control,
minimum token requirements per model tier, cache TTL behaviour, supported content
types (system prompt, tools, user messages), and worked examples. Read this before
optimizing any high-volume system.

Key sections:

How prompt caching works (KV cache reuse)
Cache limitations and supported models
cache_control: {type: "ephemeral"} syntax
Tracking cache_creation_input_tokens and cache_read_input_tokens in usage

Batch Processing (Batch API)

https://docs.anthropic.com/en/docs/build-with-claude/batch-processing

Full reference for the Message Batches API: request format, status polling,
result download, error handling, and rate limit interactions. Includes pricing
details confirming the 50% discount and the 24-hour processing window.

Key sections:

Creating a batch with client.messages.batches.create()
Polling processing_status and request_counts
Downloading results with client.messages.batches.results()
Handling individual request errors within a batch

Rate Limits

https://docs.anthropic.com/en/api/rate-limits

Lists current rate limits by tier (Build, Scale, Enterprise) for RPM, TPM, and
TPD across all models. Explains how limits are enforced, what HTTP headers expose
current limit state, and what happens when limits are exceeded (HTTP 429).

Key sections:

Per-model rate limits table
Response header format for remaining quota
Error codes and retry guidance
Requesting tier upgrades

Model Overview (Cost and Speed Comparison)

https://docs.anthropic.com/en/docs/about-claude/models

The canonical reference for all available Claude models. Lists model IDs, context
window sizes, and relative speed/cost positioning. Use this to verify exact model
IDs for your API calls — model IDs change with new releases.

Key sections:

Latest model ID for each tier (Haiku, Sonnet, Opus)
Context window sizes
Model capabilities table

Streaming

https://docs.anthropic.com/en/docs/build-with-claude/streaming

Guide to the streaming API: SSE event format, all event types, how to handle
streaming tool calls, error handling mid-stream, and platform-specific
implementation notes (including how to use stream=True with the Python SDK).

Key sections:

Event type reference (message_start, content_block_delta, message_stop)
Streaming tool use (input_json_delta events)
Handling errors in streaming responses
Server-Sent Events format reference

Pricing

https://www.anthropic.com/pricing

Current token prices for all models. Always verify against this page — prices
change with new model releases and tier adjustments. This is the source of truth
for any cost calculation.

Python SDK

https://github.com/anthropic-sdk/anthropic-python

Source code and API reference for the official Python SDK. Useful for:

Understanding usage object field names
Streaming API interface (messages.stream() vs messages.create(stream=True))
Error types (RateLimitError, APIStatusError, AuthenticationError)
Async client usage (AsyncAnthropic)

Additional Reading

Exponential Backoff and Jitter (AWS Architecture Blog)

https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

The definitive explanation of why jitter is essential in retry logic. Includes
simulation results showing how different jitter strategies affect system recovery
under load. Referenced by most distributed systems textbooks.

Serverless vs Container Latency Trade-offs

https://cloud.google.com/run/docs/about-instance-autoscaling

Google Cloud Run documentation on cold start behaviour, minimum instances, and
concurrency settings. The concepts apply directly to AWS Lambda. Useful background
for the deployment pattern decisions in Section 8 of the README.

Production ML Systems — Google SRE Workbook (Chapter 27)

https://sre.google/workbook/ml-reliability/

Broader perspective on reliability engineering for ML systems: SLOs, error budgets,
canary deployments, and rollback strategies. The LLM-specific patterns in this
module map well to the general ML system patterns described here.

Quick Reference Card

Topic	Key number / rule
Output vs input cost	Output tokens are 3–5x more expensive than input
Cache savings	Cache reads are ~10x cheaper than normal input
Cache min size	2,048 tokens (Haiku), 1,024 tokens (Sonnet/Opus)
Cache TTL	~5 minutes; resets on every cache hit
Batch API discount	50% off all token costs; up to 24h processing time
Haiku vs Opus speed	Haiku is ~5–10x faster token generation
Haiku vs Opus cost	Haiku is ~60x cheaper per output token
Backoff formula	`wait = min(cap, base * 2^attempt) + jitter`
TTFT target	< 500ms for interactive UIs
P99 matters	LLM latency is heavy-tailed; P99 can be 5–10x P50
What to log	model, tokens, cost, latency, TTFT, finish_reason, user_id

Study Notes by Niladri & AI

Explorer

references

Module 09 References

Anthropic Documentation

Prompt Caching

Batch Processing (Batch API)

Rate Limits

Model Overview (Cost and Speed Comparison)

Streaming

Pricing

Python SDK

Additional Reading

Exponential Backoff and Jitter (AWS Architecture Blog)

Serverless vs Container Latency Trade-offs

Production ML Systems — Google SRE Workbook (Chapter 27)

Quick Reference Card

Graph View

Table of Contents