Chapter 9: Inference Optimization

Understanding Inference

Inference = using a trained model to compute an output for a given input.
Inference server = hosts models, allocates hardware, serves requests.

Two types of inference APIs:

Online API: optimize for latency; process requests immediately
Batch API: optimize for cost (50% discount at Google/OpenAI); hours of turnaround; useful for synthetic data generation, periodic reporting, knowledge base updates

Streaming mode: return tokens as generated → reduces TTFT but can’t evaluate response before showing it.

Computational Bottlenecks

Compute-bound: time-to-complete determined by computation (e.g., image generation)
Memory bandwidth-bound (bandwidth-bound): time-to-complete determined by data transfer speed

LLM inference profile:

Prefill (process input tokens in parallel): compute-bound
Decode (generate one output token at a time): memory bandwidth-bound

→ Often decoupled in production onto separate machines.

Performance Metrics

Latency metrics:

TTFT (Time to First Token): duration of prefill phase; critical for conversational apps
TPOT (Time Per Output Token): time to generate each token after the first; ~120 ms/token (roughly 6-8 tokens/s) is enough to keep pace with a fast human reader when streaming
Total latency = TTFT + TPOT × (number of output tokens)
Time to Publish: first token visible to user (different from model’s first token in CoT)

Report latency in percentiles (p50, p90, p95, p99), not averages — outliers skew averages.
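
A minimal sketch of the latency formula and percentile reporting above. The sample latencies are made up for illustration, and the nearest-rank percentile definition is one common convention, chosen here as an assumption.

```python
import math

def total_latency(ttft_s: float, tpot_s: float, n_output_tokens: int) -> float:
    """Total latency = TTFT + TPOT x (number of output tokens)."""
    return ttft_s + tpot_s * n_output_tokens

def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in seconds, with one slow outlier.
latencies = [0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 30.0]
mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p90 = percentile(latencies, 90)
p99 = percentile(latencies, 99)

print(f"mean={mean:.2f}s p50={p50}s p90={p90}s p99={p99}s")
print(f"500 ms TTFT, 120 ms TPOT, 100 tokens: {total_latency(0.5, 0.12, 100):.1f}s")
```

The single outlier pulls the mean well above p90, which is why the mean alone says little about what a typical user experiences.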

Throughput metrics:

Throughput = output tokens/second (TPS) across all users
RPS (requests/second) or RPM (requests/minute)
Higher throughput = lower cost: cost per million tokens = (cost/hour) / (tokens/hour) × 1,000,000
Latency/throughput trade-off: batching can double or triple throughput but increases latency (TTFT and TPOT)
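
A small sketch of how throughput translates into cost: cost per token is the hourly hardware cost divided by the tokens produced in that hour. The $2/hour price and the throughput values are hypothetical.

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_second: float) -> float:
    """Hourly cost divided by tokens generated per hour, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical numbers: a $2/hour accelerator at two throughput levels.
for tps in (100, 200):
    print(f"{tps} tokens/s -> ${cost_per_million_tokens(2.0, tps):.2f} per 1M tokens")
```

Doubling throughput on the same hardware halves the cost per token.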

Goodput = requests/second that satisfy the SLO (Service Level Objective). Better than pure throughput for evaluating user experience.
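
A sketch of goodput next to raw throughput. The SLO thresholds (TTFT ≤ 200 ms, TPOT ≤ 100 ms), the `RequestStats` record, and the sample measurements are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float  # time to first token, seconds
    tpot_s: float  # time per output token, seconds

def goodput(requests, window_s: float, max_ttft_s: float, max_tpot_s: float) -> float:
    """Requests per second that meet both SLO thresholds."""
    ok = sum(1 for r in requests if r.ttft_s <= max_ttft_s and r.tpot_s <= max_tpot_s)
    return ok / window_s

# Hypothetical measurements collected over a 2-second window.
observed = [
    RequestStats(0.15, 0.08),  # meets the SLO
    RequestStats(0.30, 0.08),  # TTFT too slow
    RequestStats(0.18, 0.12),  # TPOT too slow
    RequestStats(0.10, 0.05),  # meets the SLO
]
throughput_rps = len(observed) / 2.0
goodput_rps = goodput(observed, window_s=2.0, max_ttft_s=0.2, max_tpot_s=0.1)
print(f"throughput={throughput_rps} req/s, goodput={goodput_rps} req/s")
```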

Utilization metrics:

nvidia-smi GPU utilization: % of time GPU is active — misleading (can be 100% doing 1% of possible work)
MFU (Model FLOP/s Utilization): actual throughput / theoretical max throughput; >50% is good for training
MBU (Model Bandwidth Utilization): bandwidth used / theoretical bandwidth = (params × bytes/param × tokens/s) / theoretical bandwidth
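
The two utilization ratios above can be sketched as plain arithmetic. The 7B-parameter model, 100 tokens/s, and 2 TB/s peak bandwidth are hypothetical inputs, not measurements.

```python
def mbu(n_params: float, bytes_per_param: float, tokens_per_s: float,
        peak_bandwidth_bytes_per_s: float) -> float:
    """Model Bandwidth Utilization = (params x bytes/param x tokens/s) / theoretical bandwidth."""
    return n_params * bytes_per_param * tokens_per_s / peak_bandwidth_bytes_per_s

def mfu(observed_tokens_per_s: float, peak_tokens_per_s: float) -> float:
    """Model FLOP/s Utilization = observed throughput / theoretical max throughput."""
    return observed_tokens_per_s / peak_tokens_per_s

# Hypothetical: 7B parameters in 16-bit (2 bytes), decoding at 100 tokens/s
# on a chip with an assumed 2 TB/s peak memory bandwidth.
print(f"MBU = {mbu(7e9, 2, 100, 2e12):.0%}")
# Hypothetical: 20 tokens/s observed against a theoretical 100 tokens/s.
print(f"MFU = {mfu(20, 100):.0%}")
```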

AI Accelerators

CPU vs. GPU:

CPU: few powerful cores (up to 64); good for sequential tasks
GPU: thousands of smaller cores; optimized for parallel tasks like matrix multiplication (90%+ of NN FLOPs)

AI accelerator types:

NVIDIA GPUs (dominant)
Google TPU (Tensor Processing Units)
Intel Habana Gaudi
Groq LPU (Language Processing Unit) — inference-specialized
AWS Inferentia, Apple Neural Engine — edge/inference-focused

Key specs:

FLOP/s: floating-point operations per second, quoted per precision (H100 SXM: ~2,000 TFLOP/s in BF16 with sparsity)
Memory size: HBM (GPU) = 24–80 GB; CPU DRAM = 16 GB–1 TB
Memory bandwidth: HBM = 256 GB/s–1.5 TB/s; CPU = 25–50 GB/s; on-chip SRAM = >10 TB/s
Power (TDP): H100 ~700W; A100 ~400W; annual power consumption of H100 at peak ≈ 7,000 kWh

Memory hierarchy (from slow/large to fast/small):
CPU DRAM → GPU HBM → GPU on-chip SRAM (L1/L2 cache)

Model Optimization

Model Compression

Quantization (Chapter 7): reduce precision → reduce memory + increase throughput.
Weight-only quantization is most common; 16-bit → 8-bit → 4-bit.
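
A minimal sketch of weight-only quantization, assuming symmetric absmax scaling to 8-bit integers. This is one simple scheme picked for illustration; the notes do not say which scheme is meant, and production libraries use more elaborate variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each stored value needs 1 byte instead of 2, at the cost of a rounding error of at most half the scale per weight.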

Distillation (Chapter 8): train smaller model to mimic larger model behavior.

Pruning: set least-important parameters to zero → sparse model.

Can reduce non-zero params by 90%+ with minimal accuracy loss
Less common in practice: requires architecture understanding; hardware sparsity support varies
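
A sketch of pruning, assuming "least important" means smallest absolute value (magnitude pruning, a common heuristic; the notes do not name a criterion).

```python
def magnitude_prune(weights, sparsity: float):
    """Zero out the given fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.9, -0.01, 0.4, 0.02, -0.7, 0.05]  # hypothetical weights
sparse = magnitude_prune(weights, sparsity=0.5)
print(sparse)
```

The zeros only save memory or time if the storage format and the hardware can exploit sparsity, which is the caveat noted above.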

PyTorch optimization case study (Llama-7B):

torch.compile → +67% throughput
INT8 quantization → +95% more
INT4 quantization → further +43%
Speculative decoding → final boost

Overcoming Autoregressive Decoding Bottleneck

1 output token ≈ 100 input tokens in latency cost (Anyscale finding).

Speculative decoding:

Draft model (fast, weak) generates K tokens
Target model verifies all K tokens in parallel (verification is faster than generation)
Accept longest valid prefix; target model generates one extra token
Loop
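
The loop above can be sketched with toy models. This is a simplified greedy variant (a drafted token is accepted only if it equals the target's own greedy choice); real implementations typically use a sampling-based acceptance rule. The two toy models and all function names are hypothetical.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to the next token

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. Draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Target model scores all k+1 positions. A real system does this
        #    in a single batched forward pass; here it is a plain loop.
        target_next = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == target_next[n_accept]:
            n_accept += 1
        # 4. Keep the accepted tokens plus one token from the target itself.
        tokens += proposal[:n_accept] + [target_next[n_accept]]
    return tokens[:goal]

def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large model."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for the small model: agrees with the target most of the time."""
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 3 == 0 else guess

out = speculative_decode(target_model, draft_model, [1, 2, 3], n_new=10)
print(out)
```

In this greedy setting the output is identical to decoding with the target model alone; the draft only changes how many target passes are needed.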

Benefits: verification parallelizes what was sequential; decoding is bandwidth-bound, so otherwise idle FLOPs can do the verification at little extra cost.
DeepMind (Chinchilla-70B with a 4B draft model): 1.8 ms/token (draft) vs 14.1 ms/token (target) → latency cut by more than half (>2× speedup).

Inference with reference: copy tokens from input instead of generating them when output overlaps with input (RAG responses, code editing).
2× speedup for applicable use cases (Yang et al., 2023).

Parallel decoding: generate multiple future tokens simultaneously. Lookahead decoding uses Jacobi iteration; Medusa adds extra decoding heads and verifies candidates with tree attention. NVIDIA: Medusa gives up to 1.9× speedup on Llama 3.1.

Attention Mechanism Optimization

KV cache: store key-value vectors for all previous tokens to avoid recomputation.
KV cache grows linearly with sequence length; attention computation grows quadratically.

KV cache size formula: 2 × B × S × L × H × M

B = batch size, S = sequence length, L = layers, H = model dimension, M = bytes per value
Llama 2 13B, batch=32, seq=2048: 54 GB (larger than the model itself)
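
A worked check of the formula above. The notes do not list the inputs behind the 54 GB figure; this sketch assumes Llama 2 13B's 40 layers and model dimension of 5,120, with 2 bytes per value (16-bit).

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   model_dim: int, bytes_per_value: int) -> int:
    """KV cache size = 2 x B x S x L x H x M (the 2 covers keys and values)."""
    return 2 * batch * seq_len * n_layers * model_dim * bytes_per_value

# Assumed configuration: 40 layers, model dimension 5120, 16-bit values.
cache = kv_cache_bytes(batch=32, seq_len=2048, n_layers=40, model_dim=5120, bytes_per_value=2)
weights = 13e9 * 2  # 13B parameters at 2 bytes each

print(f"KV cache: {cache / 1e9:.1f} GB, weights: {weights / 1e9:.0f} GB")
```

Under these assumptions the cache is roughly twice the size of the 16-bit weights.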

Redesigning attention (training/finetuning time):

Local windowed attention: attend to fixed window of nearby tokens; reduces KV cache proportionally
Multi-query attention: share KV pairs across all query heads → fewer KV pairs
Grouped-query attention: share KV pairs within groups of query heads (generalization of multi-query)
Cross-layer attention: share KV pairs across adjacent layers
Character.AI: combination of these → 20× KV cache reduction

KV cache management (runtime):

PagedAttention (vLLM): splits the KV cache into fixed-size blocks that need not be contiguous in memory, reducing fragmentation; vLLM is among the fastest-growing inference frameworks
KV cache quantization, adaptive compression, selective KV cache
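
A toy sketch of the block-based idea behind PagedAttention: each sequence keeps a table of physical block IDs, and blocks are handed out on demand from a shared free list. The `PagedKVCache` class and its methods are hypothetical and are not vLLM's API.

```python
class PagedKVCache:
    """Toy block allocator: a sequence's KV slots live in fixed-size, non-contiguous blocks."""

    def __init__(self, n_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(n_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(n_blocks=4, block_size=2)
for seq in ["a", "b", "a", "a"]:  # interleaved sequences share one pool
    cache.append_token(seq)
print(cache.block_tables, cache.free_blocks)
```

Memory is claimed one block at a time, so a sequence never reserves space for tokens it has not generated yet.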

Kernels and compilers:

FlashAttention: fuses multiple operators for attention computation → significantly faster
Kernels are hardware-specific (FlashAttention-3 for H100)
Kernel techniques: vectorization, parallelization, loop tiling, operator fusion
Compilers: torch.compile, XLA, TensorRT; convert model ops to hardware-optimized kernels

Inference Service Optimization

Service-level techniques don’t change model behavior (unlike model-level).

Batching

Static batching: wait until the batch is full before processing; the first request waits for the last one to arrive
Dynamic batching: process when batch is full OR time window expires; better latency control
Continuous (in-flight) batching: return completed responses immediately; add new requests in their place; best throughput/latency balance
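
A toy simulation contrasting static and continuous batching. It assumes all requests arrive at time 0, each request needs a known number of decode steps (at least 1), and one step costs one time unit regardless of how full the batch is. Function names and request lengths are hypothetical.

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Completion time per request when a batch returns only after its longest member finishes."""
    done, clock = [], 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        clock += max(batch)
        done += [clock] * len(batch)
    return done

def continuous_batching(lengths, batch_size):
    """Completion time per request when finished requests free their slot immediately."""
    waiting = deque(enumerate(lengths))
    active = {}  # request index -> remaining decode steps
    done = [0] * len(lengths)
    clock = 0
    while waiting or active:
        # Fill any free slots with waiting requests.
        while waiting and len(active) < batch_size:
            idx, n = waiting.popleft()
            active[idx] = n
        clock += 1  # one decode step for every active request
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                done[idx] = clock
                del active[idx]
    return done

lengths = [2, 8, 3, 1]  # decode steps needed by each request
static_done = static_batching(lengths, batch_size=2)
continuous_done = continuous_batching(lengths, batch_size=2)
print("static:    ", static_done)
print("continuous:", continuous_done)
```

In this run the short requests no longer wait for the 8-step request, and all work finishes in 8 steps instead of 11.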

LinkedIn: throughput can double or triple if you are willing to sacrifice 20-30% on TTFT/TPOT.

Decoupling Prefill and Decode

Prefill (compute-bound) and decode (bandwidth-bound) compete on same GPU → inefficient.

Solution: assign prefill instances and decode instances to different GPUs.

“DistServe” (Zhong et al., 2024): significant improvement in processed requests with same latency
Prefill:decode ratio depends on workload: long inputs → 2:1 to 4:1; short inputs → 1:2 to 1:1
Communication overhead (KV transfer between instances) is not substantial with NVLink

Prompt Caching

Cache overlapping prompt segments (especially system prompts) to avoid reprocessing.

Prompt cache example: if system prompt is 1,000 tokens, 1M API calls/day → saves processing 1B tokens/day!
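
A toy sketch of prefix caching that counts how many tokens prefill actually has to process. A real server would store the KV state for the cached prefix; here a placeholder string stands in for it, and the `PrefixCache` class is hypothetical.

```python
class PrefixCache:
    """Counts prefill work when a shared prefix (e.g., a system prompt) is cached."""

    def __init__(self):
        self._cache = {}
        self.tokens_processed = 0

    def prefill(self, prefix, suffix) -> None:
        key = tuple(prefix)
        if key not in self._cache:
            # First time this prefix is seen: process it and keep its state.
            self.tokens_processed += len(prefix)
            self._cache[key] = f"kv-state-for-{len(prefix)}-tokens"  # placeholder
        # The request-specific suffix is always processed.
        self.tokens_processed += len(suffix)

system_prompt = list(range(1000))  # 1,000-token system prompt (dummy token ids)
user_message = [7] * 10            # 10-token user message

cache = PrefixCache()
n_calls = 3
for _ in range(n_calls):
    cache.prefill(system_prompt, user_message)

without_cache = n_calls * (len(system_prompt) + len(user_message))
print(f"with cache: {cache.tokens_processed} tokens, without: {without_cache} tokens")
print(f"at 1M calls/day: {1_000 * 1_000_000:,} prefix tokens at stake")
```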

Anthropic prompt caching results:

Use case	TTFT w/o caching	TTFT with caching	Cost reduction
Book chat (100K cached)	11.5 s	2.4 s (–79%)	–90%
10K many-shot	1.6 s	1.1 s (–31%)	–86%
Multi-turn (long sys prompt)	~10 s	~2.5 s (–75%)	–53%

Google: 75% discount on cached input tokens; extra storage cost.

Parallelism Strategies

Replica parallelism: create multiple identical model copies; straightforward; higher throughput at cost of more chips.

Tensor parallelism (intra-operator): split tensors across GPUs for same operation; enables large models; reduces latency but adds communication overhead.
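
A pure-Python sketch of tensor parallelism for a single matrix multiplication, assuming a column-wise split of the weight matrix. Each shard's product would run on its own GPU, and the final concatenation stands in for the communication step. Function names are hypothetical.

```python
def matmul(x, w):
    """Plain matrix product of x (n x d) and w (d x m), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, n_shards):
    """Split w column-wise into equal shards (assumes the column count divides evenly)."""
    step = len(w[0]) // n_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(n_shards)]

def tensor_parallel_matmul(x, w, n_shards):
    shards = split_columns(w, n_shards)
    # Each partial product is independent and could run on a separate device.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate each row's partial outputs (the communication overhead).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
result = tensor_parallel_matmul(x, w, n_shards=2)
print(result)
```

Each device holds only a slice of the weights, which is what lets a model larger than one device's memory be served.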

Pipeline parallelism: split model layers across GPUs in sequence; enables very large models; adds latency per request → avoid for strict latency; common in training.

Context parallelism: split input sequence across GPUs.

Sequence parallelism: split the operators applied to the input across GPUs (e.g., attention on one device, feedforward on another).

Most Impactful Techniques Summary

Technique	Scope	Key benefit
Quantization	Model	Works across all models; easy to apply
Tensor parallelism	Service	Reduces latency AND enables large models
Replica parallelism	Service	Simple; improves throughput
KV cache / attention optimization	Model	Critical for transformer; long context
Continuous batching	Service	Best latency/throughput balance
Prompt caching	Service	~50–90% cost reduction and ~30–80% lower TTFT for repeated prefixes (Anthropic examples)
Speculative decoding	Model	Roughly 2× speedup; no quality change

Key Takeaways

If using a model API, the provider handles most inference optimization; self-hosting requires implementing these yourself
LLM inference has two phases: prefill (compute-bound) and decode (bandwidth-bound) → optimize separately
KV cache is often larger than model weights for long contexts → a critical memory bottleneck
Quantization is the single best bang-for-buck optimization; 16-bit → 8-bit → 4-bit
Continuous batching + decoupled prefill/decode are the most impactful service-level changes
Prompt caching can cut costs by 50-90% for apps with long, repeated system prompts
Speculative decoding can deliver roughly 2× speedup for many workloads with no quality change

Study Notes by Niladri & AI
