Chapter 9: Inference Optimization

Understanding Inference

Inference = using a trained model to compute an output for a given input.
Inference server = hosts models, allocates hardware, serves requests.

Two types of inference APIs:

  • Online API: optimize for latency; process requests immediately
  • Batch API: optimize for cost (50% discount at Google/OpenAI); hours of turnaround; useful for synthetic data generation, periodic reporting, knowledge base updates

Streaming mode: return tokens as generated → reduces TTFT but can’t evaluate response before showing it.

Computational Bottlenecks

  • Compute-bound: time-to-complete determined by computation (e.g., image generation)
  • Memory bandwidth-bound (bandwidth-bound): time-to-complete determined by data transfer speed

LLM inference profile:

  • Prefill (process input tokens in parallel): compute-bound
  • Decode (generate one output token at a time): memory bandwidth-bound

→ Often decoupled in production onto separate machines.

Performance Metrics

Latency metrics:

  • TTFT (Time to First Token): duration of prefill phase; critical for conversational apps
  • TPOT (Time Per Output Token): time to generate each output token after the first; ~120 ms/token (about 8 tokens/s) keeps pace with a fast human reader, which is enough for streaming
  • Total latency = TTFT + TPOT × (number of output tokens)
  • Time to Publish: first token visible to user (different from model’s first token in CoT)

Report latency in percentiles (p50, p90, p95, p99), not averages — outliers skew averages.
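
The latency formula and the case for percentiles can be sketched as follows. This is a minimal illustration: the sample latencies are made up, and the nearest-rank `percentile` helper is one of several common percentile definitions.

```python
import math

def total_latency(ttft_s: float, tpot_s: float, n_output_tokens: int) -> float:
    """Total latency = TTFT + TPOT x number of output tokens."""
    return ttft_s + tpot_s * n_output_tokens

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-request latencies in seconds, with one slow outlier.
latencies = [0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.4, 12.0]

mean = sum(latencies) / len(latencies)   # pulled up to ~2.22 s by the outlier
p50 = percentile(latencies, 50)          # 1.1 s
p90 = percentile(latencies, 90)          # 1.4 s
p99 = percentile(latencies, 99)          # 12.0 s, the outlier itself

print(f"mean={mean:.2f}s p50={p50}s p90={p90}s p99={p99}s")
```

The mean suggests a typical request takes over 2 s, while p50 and p90 show that most requests finish in under 1.5 s and only the tail is slow.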

Throughput metrics:

  • Throughput = output tokens/second (TPS) across all users
  • RPS (requests/second) or RPM (requests/minute)
  • Higher throughput = lower cost: cost per million tokens = (cost/hour) ÷ (tokens/s × 3,600) × 1,000,000
  • Latency/throughput trade-off: batching can double or triple throughput, but increases TTFT and TPOT
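
The link between throughput and cost is simple arithmetic. The sketch below assumes a hypothetical machine that costs $2/hour; the function name is illustrative.

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost of generating one million output tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical: $2/hour machine at 100 tokens/s, then at double the throughput.
base = cost_per_million_tokens(2.0, 100)      # about $5.56 per 1M tokens
doubled = cost_per_million_tokens(2.0, 200)   # about $2.78 per 1M tokens
print(f"${base:.2f} -> ${doubled:.2f} per 1M tokens")
```

Doubling throughput on the same hardware halves the cost per token.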

Goodput = requests/second that satisfy the SLO (Service Level Objective). Better than pure throughput for evaluating user experience.
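
A minimal sketch of goodput versus raw throughput, assuming an SLO defined on TTFT and TPOT. The thresholds and request measurements are hypothetical.

```python
def goodput(requests: list[tuple[float, float]], ttft_slo_s: float,
            tpot_slo_s: float, window_s: float) -> float:
    """Requests per second that meet both the TTFT and the TPOT objective."""
    ok = sum(1 for ttft, tpot in requests
             if ttft <= ttft_slo_s and tpot <= tpot_slo_s)
    return ok / window_s

# Hypothetical (TTFT, TPOT) measurements in seconds, all completed within 1 second.
completed = [(0.15, 0.08), (0.25, 0.09), (0.18, 0.12), (0.10, 0.05)]

throughput_rps = len(completed) / 1.0                                        # 4.0
goodput_rps = goodput(completed, ttft_slo_s=0.2, tpot_slo_s=0.1, window_s=1.0)  # 2.0
print(throughput_rps, goodput_rps)
```

Here the service completes 4 requests/s, but only 2 of them meet the SLO, so goodput is half of throughput.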

Utilization metrics:

  • nvidia-smi GPU utilization: % of time GPU is active — misleading (can be 100% doing 1% of possible work)
  • MFU (Model FLOP/s Utilization): actual throughput / theoretical max throughput; >50% is good for training
  • MBU (Model Bandwidth Utilization): bandwidth used / theoretical bandwidth = (params × bytes/param × tokens/s) / theoretical bandwidth
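
The two utilization ratios can be computed directly from their definitions. In this sketch the model size, token rate, and hardware peaks are hypothetical values chosen for round numbers, not measurements of any specific chip.

```python
def mbu(n_params: float, bytes_per_param: float, tokens_per_second: float,
        peak_bandwidth_bytes_per_s: float) -> float:
    """Model Bandwidth Utilization: bytes moved per second / theoretical bandwidth."""
    used = n_params * bytes_per_param * tokens_per_second
    return used / peak_bandwidth_bytes_per_s

def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOP/s Utilization: achieved FLOP/s / theoretical peak FLOP/s."""
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical: 7B-parameter model in 16-bit (2 bytes/param) decoding at
# 100 tokens/s on an accelerator with 2 TB/s of memory bandwidth.
print(mbu(7e9, 2, 100, 2e12))    # about 0.7
# Hypothetical: 400 TFLOP/s achieved on a chip with a 1,000 TFLOP/s peak.
print(mfu(400e12, 1000e12))      # 0.4
```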

AI Accelerators

CPU vs. GPU:

  • CPU: few powerful cores (up to 64); good for sequential tasks
  • GPU: thousands of smaller cores; optimized for parallel tasks like matrix multiplication (90%+ of NN FLOPs)

AI accelerator types:

  • NVIDIA GPUs (dominant)
  • Google TPU (Tensor Processing Units)
  • Intel Habana Gaudi
  • Groq LPU (Language Processing Unit) — inference-specialized
  • AWS Inferentia, Apple Neural Engine — edge/inference-focused

Key specs:

  • FLOP/s: peak operations per second at a given precision (H100 SXM: ~1,000 TFLOP/s dense BF16, ~2,000 TFLOP/s with sparsity)
  • Memory size: HBM (GPU) = 24–80 GB; CPU DRAM = 16 GB–1 TB
  • Memory bandwidth: HBM = 256 GB/s–1.5 TB/s; CPU = 25–50 GB/s; on-chip SRAM = >10 TB/s
  • Power (TDP): H100 ~700W; A100 ~400W; annual energy consumption of an H100 running at peak ≈ 6,100 kWh (0.7 kW × 8,760 h)

Memory hierarchy (from slow/large to fast/small):
CPU DRAM → GPU HBM → GPU on-chip SRAM (L1/L2 cache)


Model Optimization

Model Compression

Quantization (Chapter 7): reduce precision → reduce memory + increase throughput.
Weight-only quantization is most common; 16-bit → 8-bit → 4-bit.

Distillation (Chapter 8): train smaller model to mimic larger model behavior.

Pruning: set least-important parameters to zero → sparse model.

  • Can reduce non-zero params by 90%+ with minimal accuracy loss
  • Less common in practice: requires architecture understanding; hardware sparsity support varies
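
A minimal sketch of pruning on a flat list of weights. It assumes that "least important" means smallest absolute value (magnitude pruning), which is one common criterion but not the only one; the weights are made up.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

# Hypothetical weights; prune half of them.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.0, -0.9, 0.0, 0.2]
```

The result only saves memory or compute if the storage format and hardware can exploit the zeros, which is why sparsity support matters.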

PyTorch optimization case study (Llama-7B):

  1. torch.compile → +67% throughput
  2. INT8 quantization → +95% more
  3. INT4 quantization → further +43%
  4. Speculative decoding → final boost

Overcoming Autoregressive Decoding Bottleneck

1 output token ≈ 100 input tokens in latency cost (Anyscale finding).

Speculative decoding:

  1. Draft model (fast, weak) generates K tokens
  2. Target model verifies all K tokens in parallel (verification is faster than generation)
  3. Accept longest valid prefix; target model generates one extra token
  4. Loop
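
The loop above can be sketched with toy models. This is a simplified greedy variant under stated assumptions: each "model" is a hypothetical function that maps a token prefix to its single most likely next token, and a draft token is accepted only if it equals the target's choice. Real implementations compare probability distributions and verify all K positions in one batched forward pass.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # token prefix -> greedy next token

def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position (sequential here,
        #    one parallel pass in a real system).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # 3a. mismatch: keep the target's token, stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(tokens + accepted))  # 3b. all accepted: one extra token
        tokens.extend(accepted)  # 4. loop
    return tokens[:goal]

# Hypothetical toy models: the draft agrees with the target except on every third position.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11

def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 11

result = speculative_decode(toy_target, toy_draft, [1], max_new_tokens=12)
print(result == greedy_decode(toy_target, [1], 12))  # True
```

In this greedy variant the output is identical to decoding with the target alone; the draft only changes how many target passes are needed.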

Benefits: verification parallelizes work that would otherwise be sequential; because decoding is bandwidth-bound, otherwise idle FLOPs can be spent on verification at little extra cost.
Chinchilla-70B + 4B draft: >50% latency reduction.
DeepMind: 1.8 ms/token (draft) vs 14.1 ms/token (target) → >2× speedup.

Inference with reference: copy tokens from input instead of generating them when output overlaps with input (RAG responses, code editing).
2× speedup for applicable use cases (Yang et al., 2023).

Parallel decoding (Lookahead decoding, Medusa): generate multiple future tokens simultaneously, then verify and integrate them. Medusa adds extra decoding heads and verifies candidates with tree attention; NVIDIA reports up to 1.9× speedup with Medusa on Llama 3.1.

Attention Mechanism Optimization

KV cache: store key-value vectors for all previous tokens to avoid recomputation.
KV cache grows linearly with sequence length; attention computation grows quadratically.

KV cache size formula: 2 × B × S × L × H × M

  • B = batch size, S = sequence length, L = layers, H = model dimension, M = bytes per value
  • Llama 2 13B, batch=32, seq=2048: 54 GB (larger than the model itself)
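
The formula can be checked numerically. The notes give only the batch size and sequence length; the 40 layers, model dimension of 5,120, and 2 bytes per value (16-bit) used below are assumptions taken from the published Llama 2 13B configuration.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   model_dim: int, bytes_per_value: int) -> int:
    """KV cache size = 2 (keys and values) x B x S x L x H x M."""
    return 2 * batch_size * seq_len * n_layers * model_dim * bytes_per_value

# Assumed Llama 2 13B config: 40 layers, model dimension 5120, 16-bit values.
size = kv_cache_bytes(batch_size=32, seq_len=2048, n_layers=40,
                      model_dim=5120, bytes_per_value=2)
weights = 13e9 * 2  # 13B parameters at 2 bytes each

print(f"KV cache: {size / 1e9:.1f} GB, weights: {weights / 1e9:.0f} GB")
# KV cache: 53.7 GB, weights: 26 GB
```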

Redesigning attention (training/finetuning time):

  • Local windowed attention: attend to fixed window of nearby tokens; reduces KV cache proportionally
  • Multi-query attention: share KV pairs across all query heads → fewer KV pairs
  • Grouped-query attention: share KV pairs within groups of query heads (generalization of multi-query)
  • Cross-layer attention: share KV pairs across adjacent layers
  • Character.AI: combination of these → 20× KV cache reduction
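
The effect of sharing KV pairs across query heads can be estimated by counting KV heads instead of the full model dimension. The configuration below is hypothetical and only meant to show the ratios.

```python
def kv_cache_bytes_by_heads(batch_size: int, seq_len: int, n_layers: int,
                            n_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV cache size when only n_kv_heads key/value heads are stored per layer."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 query heads of dimension 128, 32 layers, 16-bit values.
config = dict(batch_size=8, seq_len=4096, n_layers=32, head_dim=128, bytes_per_value=2)

mha = kv_cache_bytes_by_heads(n_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes_by_heads(n_kv_heads=8, **config)   # 4 query heads share each KV head
mqa = kv_cache_bytes_by_heads(n_kv_heads=1, **config)   # all query heads share one KV head

print(mha // gqa, mha // mqa)  # 4 32
```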

KV cache management (runtime):

  • PagedAttention (vLLM): splits the KV cache into fixed-size blocks that need not be contiguous in memory, which reduces fragmentation; vLLM has been one of the fastest-growing inference frameworks
  • KV cache quantization, adaptive compression, selective KV cache
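
The block-based idea behind PagedAttention can be illustrated with a toy allocator. This is not vLLM's API: the class and method names are hypothetical, and the sketch tracks only which physical blocks belong to which sequence, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or the current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for seq in ["a", "b", "a", "a"]:   # two sequences growing in interleaved order
    cache.append_token(seq)
print(cache.block_tables)  # {'a': [0, 2], 'b': [1]}
```

Sequence "a" ends up in blocks 0 and 2, which are not adjacent, so memory is allocated on demand instead of being reserved up front for the maximum sequence length.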

Kernels and compilers:

  • FlashAttention: fuses multiple operators for attention computation → significantly faster
  • Kernels are hardware-specific (FlashAttention-3 for H100)
  • Kernel techniques: vectorization, parallelization, loop tiling, operator fusion
  • Compilers: torch.compile, XLA, TensorRT; convert model ops to hardware-optimized kernels

Inference Service Optimization

Service-level techniques don’t change model behavior (unlike model-level).

Batching

Static batching: wait until batch is full; first request waits for last
Dynamic batching: process when batch is full OR time window expires; better latency control
Continuous (in-flight) batching: return completed responses immediately; add new requests in their place; best throughput/latency balance
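
The difference between static and continuous batching can be seen in a small simulation. This is a simplified model with assumptions: time is counted in decode steps, each active request produces one token per step, prefill cost is ignored, and all requests are already queued. The output lengths are hypothetical.

```python
from collections import deque

def static_batching(lengths: list[int], batch_size: int) -> list[int]:
    """Completion step per request when each batch runs until its longest member finishes."""
    done, clock = [], 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        clock += max(batch)
        done.extend([clock] * len(batch))
    return done

def continuous_batching(lengths: list[int], batch_size: int) -> list[int]:
    """Completion step per request when finished requests are replaced immediately."""
    queue = deque(enumerate(lengths))
    active: dict[int, int] = {}  # request index -> tokens still to generate
    done = [0] * len(lengths)
    clock = 0
    while queue or active:
        while queue and len(active) < batch_size:  # fill free slots
            idx, n = queue.popleft()
            active[idx] = n
        clock += 1  # one decode step
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                done[idx] = clock
                del active[idx]
    return done

output_lengths = [2, 8, 3, 3]  # hypothetical tokens to generate per request
print(static_batching(output_lengths, batch_size=2))      # [8, 8, 11, 11]
print(continuous_batching(output_lengths, batch_size=2))  # [2, 8, 5, 8]
```

With continuous batching the short first request returns at step 2 instead of step 8, and all work finishes in 8 steps instead of 11.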

LinkedIn: reported doubling or tripling throughput when willing to sacrifice 20-30% of TTFT and TPOT.

Decoupling Prefill and Decode

Prefill (compute-bound) and decode (bandwidth-bound) compete on same GPU → inefficient.

Solution: assign prefill instances and decode instances to different GPUs.

  • “DistServe” (Zhong et al., 2024): significant improvement in processed requests with same latency
  • Prefill:decode ratio depends on workload: long inputs → 2:1 to 4:1; short inputs → 1:2 to 1:1
  • Communication overhead (KV transfer between instances) is not substantial with NVLink

Prompt Caching

Cache overlapping prompt segments (especially system prompts) to avoid reprocessing.

Prompt cache example: if system prompt is 1,000 tokens, 1M API calls/day → saves processing 1B tokens/day!
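
A toy sketch of the saving from caching a shared prefix. It only counts tokens that need processing; a real prompt cache stores the computed state for the prefix. The class name and token lists are hypothetical.

```python
class PrefixCache:
    """Toy prompt cache: remembers which prompt prefixes have already been processed."""

    def __init__(self) -> None:
        self.cached_prefixes: set[tuple[str, ...]] = set()
        self.tokens_processed = 0

    def process(self, prefix_tokens: list[str], suffix_tokens: list[str]) -> None:
        key = tuple(prefix_tokens)
        if key not in self.cached_prefixes:
            self.tokens_processed += len(prefix_tokens)  # cache miss: process the prefix once
            self.cached_prefixes.add(key)
        self.tokens_processed += len(suffix_tokens)      # the new suffix is always processed

system_prompt = ["sys"] * 1000   # hypothetical 1,000-token system prompt
user_message = ["user"] * 10     # hypothetical 10-token user message

cache = PrefixCache()
for _ in range(5):
    cache.process(system_prompt, user_message)

without_cache = 5 * (len(system_prompt) + len(user_message))
print(cache.tokens_processed, without_cache)  # 1050 5050

# Scaling the same arithmetic: 1,000 prefix tokens x 1M calls/day.
print(1_000 * 1_000_000)  # 1000000000
```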

Anthropic prompt caching results:

Use case                       TTFT w/o caching   TTFT with caching   Cost reduction
Book chat (100K cached)        11.5 s             2.4 s (–79%)        –90%
10K many-shot                  1.6 s              1.1 s (–31%)        –86%
Multi-turn (long sys prompt)   ~10 s              ~2.5 s (–75%)       –53%

Google: 75% discount on cached input tokens; extra storage cost.

Parallelism Strategies

Replica parallelism: create multiple identical model copies; straightforward; higher throughput at cost of more chips.

Tensor parallelism (intra-operator): split tensors across GPUs for same operation; enables large models; reduces latency but adds communication overhead.

Pipeline parallelism: split model layers across GPUs in sequence; enables very large models; adds latency per request → avoid for strict latency; common in training.

Context parallelism: split input sequence across GPUs.

Sequence parallelism: split the operators applied to the full input across GPUs (e.g., attention on one GPU, feedforward on another).
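
Tensor parallelism can be illustrated with a column-wise split of a single matrix multiplication. This sketch uses nested Python lists in place of tensors and plain function calls in place of devices; it assumes the number of columns divides evenly by the number of shards.

```python
Matrix = list[list[float]]

def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply x (n x k) by w (k x m)."""
    return [[sum(row[t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
            for row in x]

def split_columns(w: Matrix, n_shards: int) -> list[Matrix]:
    """Split w column-wise into n_shards equal shards, one per device."""
    width = len(w[0]) // n_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(n_shards)]

def tensor_parallel_matmul(x: Matrix, w: Matrix, n_shards: int) -> Matrix:
    # Each shard is multiplied independently (on its own device in a real system).
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # Gathering the partial outputs is the communication step.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, n_shards=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
print(tensor_parallel_matmul(x, w, n_shards=2) == matmul(x, w))  # True
```

Each device holds only half of the weight matrix, which is what lets a model that does not fit on one GPU be served; the gather step is the source of the communication overhead.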


Most Impactful Techniques Summary

Technique                           Scope     Key benefit
Quantization                        Model     Works across all models; easy to apply
Tensor parallelism                  Service   Reduces latency AND enables large models
Replica parallelism                 Service   Simple; improves throughput
KV cache / attention optimization   Model     Critical for transformer; long context
Continuous batching                 Service   Best latency/throughput balance
Prompt caching                      Service   50–90% cost/latency reduction for repeated prefixes
Speculative decoding                Model     2× speedup; no quality change

Key Takeaways

  • If using a model API, the provider handles most inference optimization; self-hosting requires implementing these yourself
  • LLM inference has two phases: prefill (compute-bound) and decode (bandwidth-bound) → optimize separately
  • KV cache is often larger than model weights for long contexts → a critical memory bottleneck
  • Quantization is the single best bang-for-buck optimization; 16-bit → 8-bit → 4-bit
  • Continuous batching + decoupled prefill/decode are the most impactful service-level changes
  • Prompt caching can cut costs by 50-90% for apps with long, repeated system prompts
  • Speculative decoding can give roughly 2× speedup for many workloads with no change in output quality