Chapter 9: Inference Optimization
Understanding Inference
Inference = using a trained model to compute an output for a given input.
Inference server = hosts models, allocates hardware, serves requests.
Two types of inference APIs:
- Online API: optimize for latency; process requests immediately
- Batch API: optimize for cost (50% discount at Google/OpenAI); hours of turnaround; useful for synthetic data generation, periodic reporting, knowledge base updates
Streaming mode: return tokens as generated → reduces TTFT but can’t evaluate response before showing it.
Computational Bottlenecks
- Compute-bound: time-to-complete determined by computation (e.g., image generation)
- Memory bandwidth-bound (bandwidth-bound): time-to-complete determined by data transfer speed
LLM inference profile:
- Prefill (process input tokens in parallel): compute-bound
- Decode (generate one output token at a time): memory bandwidth-bound
→ Often decoupled in production onto separate machines.
Performance Metrics
Latency metrics:
- TTFT (Time to First Token): duration of prefill phase; critical for conversational apps
- TPOT (Time Per Output Token): time to generate each token after the first; ~120 ms per token (about 8 tokens/s) keeps pace with a fast human reader, which is enough for streaming
- Total latency = TTFT + TPOT × (number of output tokens)
- Time to Publish: time until the first token the user actually sees; differs from TTFT when the model emits hidden tokens first (e.g., chain-of-thought)
Report latency in percentiles (p50, p90, p95, p99), not averages — outliers skew averages.
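A minimal sketch of the latency formula and of percentile reporting. The sample values, the function names, and the nearest-rank percentile method are assumptions for illustration only.

```python
import math
from typing import List

def total_latency(ttft_s: float, tpot_s: float, n_output_tokens: int) -> float:
    """Total latency = TTFT + TPOT x number of output tokens (seconds)."""
    return ttft_s + tpot_s * n_output_tokens

def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical TTFT samples in seconds, including one slow outlier.
ttft_samples = [0.21, 0.19, 0.25, 0.22, 0.20, 0.23, 0.24, 0.18, 0.26, 3.00]

mean_ttft = sum(ttft_samples) / len(ttft_samples)
print(f"mean = {mean_ttft:.3f} s")                      # pulled up by the outlier
print(f"p50  = {percentile(ttft_samples, 50)} s")
print(f"p90  = {percentile(ttft_samples, 90)} s")
print(f"p99  = {percentile(ttft_samples, 99)} s")       # exposes the outlier
print(f"total latency = {total_latency(0.2, 0.12, 100):.1f} s")
```

With these samples the mean (about 0.5 s) is higher than the p90 (0.26 s), which is the distortion the percentile advice is meant to avoid.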
Throughput metrics:
- Throughput = output tokens/second (TPS) across all users
- RPS (requests/second) or RPM (requests/minute)
- Higher throughput = lower cost: cost per million tokens = cost/hour ÷ (tokens/second × 3,600) × 1M
- Latency/throughput trade-off: batching and similar techniques can double or triple throughput, but at the cost of higher TTFT and TPOT
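A small worked example of the throughput-to-cost relationship. The $2/hour price and the throughput figures are hypothetical.

```python
def cost_per_million_tokens(cost_per_hour_usd: float, tokens_per_second: float) -> float:
    """Cost of producing one million output tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour_usd / tokens_per_hour * 1_000_000

# Hypothetical instance price; doubling throughput halves the cost per token.
for tps in (100, 200):
    usd = cost_per_million_tokens(2.0, tps)
    print(f"{tps} tokens/s -> ${usd:.2f} per 1M output tokens")
```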
Goodput = requests/second that satisfy the SLO (Service Level Objective). Better than pure throughput for evaluating user experience.
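A sketch of how goodput differs from raw throughput, assuming an SLO expressed as upper bounds on TTFT and TPOT. The `RequestStats` type, the thresholds, and the sample requests are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestStats:
    ttft_s: float  # time to first token, seconds
    tpot_s: float  # time per output token, seconds

def goodput(requests: List[RequestStats], window_s: float,
            max_ttft_s: float, max_tpot_s: float) -> float:
    """Requests per second that meet both the TTFT and the TPOT targets."""
    ok = sum(1 for r in requests
             if r.ttft_s <= max_ttft_s and r.tpot_s <= max_tpot_s)
    return ok / window_s

# Four requests completed in a 2-second window (made-up measurements).
observed = [
    RequestStats(0.15, 0.08),  # meets the SLO
    RequestStats(0.25, 0.08),  # TTFT too high
    RequestStats(0.10, 0.12),  # TPOT too high
    RequestStats(0.18, 0.09),  # meets the SLO
]
throughput_rps = len(observed) / 2.0
goodput_rps = goodput(observed, window_s=2.0, max_ttft_s=0.2, max_tpot_s=0.1)
print(f"throughput = {throughput_rps} req/s, goodput = {goodput_rps} req/s")
```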
Utilization metrics:
- nvidia-smi GPU utilization: % of time GPU is active — misleading (can be 100% doing 1% of possible work)
- MFU (Model FLOP/s Utilization): actual throughput / theoretical max throughput; >50% is good for training
- MBU (Model Bandwidth Utilization): bandwidth used / theoretical bandwidth = (params × bytes/param × tokens/s) / theoretical bandwidth
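The two utilization ratios can be sketched as follows. The model size, throughput figures, and the 2 TB/s peak bandwidth are hypothetical inputs, not measurements.

```python
def mfu(observed_tokens_per_s: float, peak_tokens_per_s: float) -> float:
    """Model FLOP/s Utilization: observed throughput over theoretical peak throughput."""
    return observed_tokens_per_s / peak_tokens_per_s

def mbu(n_params: float, bytes_per_param: float, tokens_per_s: float,
        peak_bandwidth_bytes_per_s: float) -> float:
    """Model Bandwidth Utilization: (params x bytes/param x tokens/s) / peak bandwidth."""
    bandwidth_used = n_params * bytes_per_param * tokens_per_s
    return bandwidth_used / peak_bandwidth_bytes_per_s

# Hypothetical: a chip that could reach 100 tokens/s at peak but delivers 20.
print(f"MFU = {mfu(20, 100):.0%}")

# Hypothetical: 7B parameters in 16-bit, 100 tokens/s, 2 TB/s peak bandwidth.
print(f"MBU = {mbu(7e9, 2, 100, 2e12):.0%}")
```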
AI Accelerators
CPU vs. GPU:
- CPU: few powerful cores (up to 64); good for sequential tasks
- GPU: thousands of smaller cores; optimized for parallel tasks like matrix multiplication (90%+ of NN FLOPs)
AI accelerator types:
- NVIDIA GPUs (dominant)
- Google TPU (Tensor Processing Units)
- Intel Habana Gaudi
- Groq LPU (Language Processing Unit) — inference-specialized
- AWS Inferentia, Apple Neural Engine — edge/inference-focused
Key specs:
- FLOP/s: peak operations per second at a given precision (H100 SXM in BF16: ~990 TFLOP/s dense, ~1,980 TFLOP/s with sparsity)
- Memory size: HBM (GPU) = 24–80 GB; CPU DRAM = 16 GB–1 TB
- Memory bandwidth: HBM = 256 GB/s–1.5 TB/s; CPU = 25–50 GB/s; on-chip SRAM = >10 TB/s
- Power (TDP): H100 ~700W; A100 ~400W; annual energy use of one H100 running at peak ≈ 6,000–7,000 kWh (0.7 kW × 8,760 h ≈ 6,100 kWh)
Memory hierarchy (from slow/large to fast/small):
CPU DRAM → GPU HBM → GPU on-chip SRAM (L1/L2 cache)
Model Optimization
Model Compression
Quantization (Chapter 7): reduce precision → reduce memory + increase throughput.
Weight-only quantization is most common; 16-bit → 8-bit → 4-bit.
Distillation (Chapter 8): train smaller model to mimic larger model behavior.
Pruning: set least-important parameters to zero → sparse model.
- In some reported cases, reduces non-zero params by 90%+ with minimal accuracy loss
- Less common in practice: requires architecture understanding; hardware sparsity support varies
PyTorch optimization case study (Llama-7B):
- torch.compile → +67% throughput
- INT8 quantization → +95% more
- INT4 quantization → further +43%
- Speculative decoding → final boost
Overcoming Autoregressive Decoding Bottleneck
1 output token ≈ 100 input tokens in latency cost (Anyscale finding).
Speculative decoding:
- Draft model (fast, weak) generates K tokens
- Target model verifies all K tokens in parallel (verification is faster than generation)
- Accept longest valid prefix; target model generates one extra token
- Loop
Benefits: verification parallelizes what was sequential; decoding is bandwidth-bound so idle FLOPs can verify for free.
DeepMind (Chinchilla-70B with a 4B draft model): 1.8 ms/token for the draft vs. 14.1 ms/token for the target; overall latency cut by more than half (>2× speedup).
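The draft/verify/accept loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant under stated assumptions: `target_model` and `draft_model` are hypothetical functions (not real LLMs), acceptance is by exact match with the target's greedy token, and the "parallel" verification is written as a plain loop. Sampling-based systems use a probabilistic acceptance rule instead.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to its greedy next token

def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the slow, strong model."""
    return (ctx[-1] * 3 + len(ctx)) % 10

def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for the fast, weak model: wrong whenever len(ctx) is a multiple of 4."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 10

def greedy_decode(prompt: List[int], n_new: int, model: Model) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(prompt: List[int], n_new: int, k: int,
                       draft: Model, target: Model) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model scores all k + 1 positions. A real system does this
        #    in a single parallel forward pass; here it is a loop for clarity.
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        # 4. Keep the accepted tokens plus one token from the target
        #    (a correction at the first mismatch, or a bonus token if all matched).
        tokens += proposal[:n_accept] + [verified[n_accept]]
    return tokens[:goal]

spec = speculative_decode([1, 2, 3], 12, 4, draft_model, target_model)
base = greedy_decode([1, 2, 3], 12, target_model)
print(spec == base)  # the output matches plain greedy decoding with the target
```

Because every kept token is either confirmed or produced by the target model, this greedy variant returns the same sequence as decoding with the target alone; the gain comes from needing fewer sequential target steps.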
Inference with reference: copy tokens from input instead of generating them when output overlaps with input (RAG responses, code editing).
2× speedup for applicable use cases (Yang et al., 2023).
Parallel decoding (Medusa, Lookahead decoding): generate multiple future tokens simultaneously, then verify them before accepting. Medusa adds extra decoding heads and verifies candidates with tree attention. NVIDIA: up to 1.9× speedup on Llama 3.1 with Medusa.
Attention Mechanism Optimization
KV cache: store key-value vectors for all previous tokens to avoid recomputation.
KV cache grows linearly with sequence length; attention computation grows quadratically.
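KV cache size formula (in bytes): 2 × B × S × L × H × M
- B = batch size, S = sequence length, L = number of layers, H = model dimension, M = bytes per value; the factor 2 covers keys and values
- Llama 2 13B (L = 40, H = 5,120) at batch=32, seq=2048, 16-bit values: ~54 GB, about twice the ~26 GB of 16-bit weights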
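The Llama 2 13B figure can be checked directly from the formula. The function name is an assumption, and the sketch assumes standard multi-head attention with 16-bit cache values.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   model_dim: int, bytes_per_value: int) -> int:
    """KV cache size = 2 x B x S x L x H x M (the 2 is for keys and values)."""
    return 2 * batch_size * seq_len * n_layers * model_dim * bytes_per_value

# Llama 2 13B: 40 layers, model dimension 5,120; 2 bytes per cached value.
cache = kv_cache_bytes(batch_size=32, seq_len=2048, n_layers=40,
                       model_dim=5120, bytes_per_value=2)
weights = 13_000_000_000 * 2  # 13B parameters at 2 bytes each

print(f"KV cache: {cache / 1e9:.1f} GB")
print(f"Weights:  {weights / 1e9:.1f} GB")
```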
Redesigning attention (training/finetuning time):
- Local windowed attention: attend to fixed window of nearby tokens; reduces KV cache proportionally
- Multi-query attention: share KV pairs across all query heads → fewer KV pairs
- Grouped-query attention: share KV pairs within groups of query heads (generalization of multi-query)
- Cross-layer attention: share KV pairs across adjacent layers
- Character.AI: combination of these → 20× KV cache reduction
KV cache management (runtime):
- PagedAttention (vLLM): splits the KV cache into fixed-size blocks that need not be contiguous in memory, which reduces fragmentation; a major reason vLLM gained traction quickly
- KV cache quantization, adaptive compression, selective KV cache
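The block-based idea behind PagedAttention can be illustrated with a toy allocator. This is only a sketch of the bookkeeping, not vLLM's implementation: the class, its method names, and the block sizes are hypothetical, and no actual key/value tensors are stored.

```python
from typing import Dict, List

class PagedKVCache:
    """Toy block table: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, n_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(n_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> physical block ids
        self.lengths: Dict[str, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of the given sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(n_blocks=4, block_size=2)
for seq_id in ["a", "b", "a", "b", "a"]:  # two sequences decoding in turns
    cache.append_token(seq_id)

print(cache.block_tables)  # sequence "a" ends up with non-adjacent blocks
cache.free("b")
print(cache.free_blocks)   # the freed block is immediately reusable
```

Because blocks are allocated on demand and returned when a sequence finishes, no sequence has to reserve one large contiguous region up front.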
Kernels and compilers:
- FlashAttention: fuses multiple operators for attention computation → significantly faster
- Kernels are hardware-specific (FlashAttention-3 for H100)
- Kernel techniques: vectorization, parallelization, loop tiling, operator fusion
- Compilers: torch.compile, XLA, TensorRT; convert model ops to hardware-optimized kernels
Inference Service Optimization
Service-level techniques don’t change model behavior (unlike model-level).
Batching
Static batching: wait until batch is full; first request waits for last
Dynamic batching: process when batch is full OR time window expires; better latency control
Continuous (in-flight) batching: return completed responses immediately; add new requests in their place; best throughput/latency balance
LinkedIn: throughput can be doubled or tripled if you are willing to accept 20-30% worse TTFT/TPOT.
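The difference between static and dynamic batching can be sketched as a small simulation over request arrival times. The function name, the arrival times, and the parameters are hypothetical, and continuous batching is not modeled here.

```python
from typing import List

def dynamic_batches(arrival_times: List[float], max_batch_size: int,
                    max_wait_s: float) -> List[List[float]]:
    """Group sorted arrival times into batches. A batch is dispatched when it is
    full, or when max_wait_s has passed since its first request arrived."""
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrival_times:
        if current and t - current[0] > max_wait_s:
            batches.append(current)  # time window expired before this arrival
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full
            current = []
    if current:
        batches.append(current)      # flush leftovers at the end of the simulation
    return batches

arrivals = [0.00, 0.01, 0.02, 0.03, 0.20, 0.21, 0.50]  # seconds, made up

# Dynamic batching: at most 3 requests, or a 50 ms window.
print(dynamic_batches(arrivals, max_batch_size=3, max_wait_s=0.05))

# Static batching behaves like an infinite window: the request at 0.03 s
# waits for the one at 0.21 s before its batch can start.
print(dynamic_batches(arrivals, max_batch_size=3, max_wait_s=float("inf")))
```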
Decoupling Prefill and Decode
Prefill (compute-bound) and decode (bandwidth-bound) compete on same GPU → inefficient.
Solution: assign prefill instances and decode instances to different GPUs.
- “DistServe” (Zhong et al., 2024): significant improvement in processed requests with same latency
- Prefill:decode ratio depends on workload: long inputs → 2:1 to 4:1; short inputs → 1:2 to 1:1
- Communication overhead (KV transfer between instances) is not substantial with NVLink
Prompt Caching
Cache overlapping prompt segments (especially system prompts) to avoid reprocessing.
Prompt cache example: with a 1,000-token system prompt and 1M API calls/day, caching avoids reprocessing ~1B tokens/day.
Anthropic prompt caching results:
| Use case | TTFT w/o caching | TTFT with caching | Cost reduction |
|---|---|---|---|
| Book chat (100K cached) | 11.5 s | 2.4 s (–79%) | –90% |
| 10K many-shot | 1.6 s | 1.1 s (–31%) | –86% |
| Multi-turn (long sys prompt) | ~10 s | ~2.5 s (–75%) | –53% |
Google: 75% discount on cached input tokens; extra storage cost.
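A rough estimate of what a cached-token discount does to the daily input bill. The per-token price and the token counts are hypothetical, and the sketch ignores cache storage and cache-write charges.

```python
def daily_input_cost(calls_per_day: int, cached_prefix_tokens: int, other_tokens: int,
                     price_per_million_usd: float, cached_discount: float = 0.0) -> float:
    """Input-token cost per day; tokens in the cached prefix are billed at a discount."""
    prefix_price = price_per_million_usd * (1 - cached_discount)
    prefix_cost = calls_per_day * cached_prefix_tokens / 1e6 * prefix_price
    other_cost = calls_per_day * other_tokens / 1e6 * price_per_million_usd
    return prefix_cost + other_cost

# Hypothetical workload: 1M calls/day, 1,000-token system prompt, 200 other input
# tokens per call, $1 per 1M input tokens.
without_cache = daily_input_cost(1_000_000, 1000, 200, 1.0)
with_cache = daily_input_cost(1_000_000, 1000, 200, 1.0, cached_discount=0.75)

print(f"without caching: ${without_cache:,.0f}/day")
print(f"with caching:    ${with_cache:,.0f}/day")
print(f"saving:          {1 - with_cache / without_cache:.1%}")
```

The saving on the whole bill is smaller than the 75% discount because only the repeated prefix is discounted.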
Parallelism Strategies
Replica parallelism: create multiple identical model copies; straightforward; higher throughput at cost of more chips.
Tensor parallelism (intra-operator): split tensors across GPUs for same operation; enables large models; reduces latency but adds communication overhead.
Pipeline parallelism: split model layers across GPUs in sequence; enables very large models; adds latency per request → avoid for strict latency; common in training.
Context parallelism: split input sequence across GPUs.
Sequence parallelism: split the operators applied to the input across GPUs (e.g., attention on one device, feedforward on another).
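Tensor parallelism can be illustrated with a tiny column-wise split of one matrix multiplication. This is a pure-Python sketch with made-up matrices; the "devices" are just list entries, and a real system would add a communication step to gather the partial outputs.

```python
from typing import List

Matrix = List[List[int]]

def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply an (n x d) matrix by a (d x m) matrix, both stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w: Matrix, n_shards: int) -> List[Matrix]:
    """Split a weight matrix column-wise into n_shards equal-width shards."""
    width = len(w[0]) // n_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(n_shards)]

x = [[1, 2], [3, 4]]              # activations, 2 x 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weights, 2 x 4

# Each "device" holds one shard of w and computes its slice of the output.
partials = [matmul(x, shard) for shard in split_columns(w, 2)]

# Concatenating the slices row by row reproduces the unsharded result.
gathered = [sum((p[r] for p in partials), []) for r in range(len(x))]
print(gathered)
print(gathered == matmul(x, w))
```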
Most Impactful Techniques Summary
| Technique | Scope | Key benefit |
|---|---|---|
| Quantization | Model | Works across all models; easy to apply |
| Tensor parallelism | Service | Reduces latency AND enables large models |
| Replica parallelism | Service | Simple; improves throughput |
| KV cache / attention optimization | Model | Critical for transformer; long context |
| Continuous batching | Service | Best latency/throughput balance |
| Prompt caching | Service | 50–90% cost/latency reduction for repeated prefixes |
| Speculative decoding | Model | Up to ~2× speedup; no quality change |
Key Takeaways
- If using a model API, the provider handles most inference optimization; self-hosting requires implementing these yourself
- LLM inference has two phases: prefill (compute-bound) and decode (bandwidth-bound) → optimize separately
- KV cache is often larger than model weights for long contexts → a critical memory bottleneck
- Quantization is the single best bang-for-buck optimization; 16-bit → 8-bit → 4-bit
- Continuous batching + decoupled prefill/decode are the most impactful service-level changes
- Prompt caching can cut costs by 50-90% for apps with long, repeated system prompts
- Speculative decoding can deliver roughly 2× speedup on many workloads with no quality change