AI Engineering: Building Applications with Foundation Models

Author: Chip Huyen | Publisher: O’Reilly, 2025 | Pages: 535

One-Sentence Summary

A comprehensive, framework-focused guide to building production AI applications on top of foundation models, covering the entire stack from model selection through evaluation, prompt engineering, RAG, finetuning, inference optimization, and architecture.


Core Thesis

Foundation models (LLMs + multimodal models) have created a new engineering discipline — AI engineering — that differs from traditional ML engineering in three key ways:

  1. Adaptation over development: use existing models instead of training from scratch
  2. Open-ended outputs: harder to evaluate than close-ended ML tasks
  3. Scale demands: bigger models require more GPU expertise and inference optimization

The progression for adapting models: Prompting → RAG → Finetuning (each step increases investment and potential gain).


Chapter Summaries

Ch 1: Introduction

AI engineering emerged from three factors: general-purpose AI capabilities, increased investment, and a low barrier to entry via model APIs. The AI stack has three layers: application development (prompt engineering, evaluation, AI interface), model development (modeling, dataset engineering, inference optimization), and infrastructure. AI engineering sits closer to product development than traditional ML engineering does.

Ch 2: Foundation Models

Three factors differentiate models: training data, architecture/size, and post-training. Self-supervision scales training by generating labels from the text itself. The transformer dominates via attention (prefill=parallel/compute-bound; decode=sequential/bandwidth-bound). Chinchilla scaling law: about 20 training tokens per parameter for compute-optimal training. Post-training (SFT + RLHF/DPO) costs only ~2% of compute but unlocks the model for users. Sampling (temperature, top-k, top-p) is underrated; test-time compute can let a small model match one with roughly 30× more parameters.
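The three sampling knobs named above can be sketched in a few lines. This is a minimal illustration rather than any library's implementation: the function name `sample_next_token` and its parameters are assumptions, and real inference stacks apply these filters on tensors rather than Python lists.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from raw logits using temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the max for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices renormalizes the weights of the surviving tokens.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

With `top_k=1` the function reduces to greedy decoding, which is a quick way to check that the filters behave as expected.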

Ch 3: Evaluation Methodology

Foundation models are hard to evaluate: intelligent outputs require intelligent evaluation; open-ended outputs have no ground truth. Language metrics (perplexity, cross entropy) are cheap proxies. Three evaluation approaches: (1) exact (functional correctness, lexical/semantic similarity), (2) AI as a judge (flexible, subjective, prone to biases), (3) comparative evaluation (Bradley-Terry algorithm; never saturates).
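To make comparative evaluation concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise outcomes with the classic iterative (minorization-maximization) update. The function name and the input format are assumptions for illustration; it also assumes every model wins at least one match and that no model is compared with itself.

```python
from collections import defaultdict

def bradley_terry(matches, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Strengths are normalized to sum to 1. Under the model,
    P(i beats j) = s_i / (s_i + s_j).
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # matches played per unordered pair
    models = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}  # the opponent in this pair
                    denom += n / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strengths = {m: s / total for m, s in updated.items()}
    return strengths

# Hypothetical data: model A beats model B in 7 of 10 comparisons.
scores = bradley_terry([("A", "B")] * 7 + [("B", "A")] * 3)
print(round(scores["A"], 3), round(scores["B"], 3))
```

With only two models, the fitted strengths reproduce the observed win rate; the model becomes useful when many models are compared and not every pair has met.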

Ch 4: Evaluate AI Systems

Evaluation-driven development: define criteria before building. Four criteria buckets: domain capability, generation capability (factual consistency, safety), instruction-following, cost/latency. Open source vs. API: 7-axis decision (data privacy, lineage, performance, functionality, cost, control, edge deployment). Public benchmarks are contaminated and insufficient; build private evaluation pipelines. Sample size rule: 3× smaller detectable difference → 10× more samples.
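The sample size rule follows from a standard statistical scaling, sketched below under the assumption that the noise in a measured score shrinks as 1/sqrt(n), so the samples needed grow as 1/d² for a detectable difference d. The helper name is hypothetical.

```python
def sample_multiplier(shrink_factor):
    """How many times more samples are needed when the detectable
    difference shrinks by `shrink_factor`, assuming n scales as 1/d^2."""
    return shrink_factor ** 2

print(sample_multiplier(3))   # 9, roughly the 10x in the rule of thumb
print(sample_multiplier(10))  # 100
```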

Ch 5: Prompt Engineering

Prompting is the cheapest adaptation technique. Best practices: clear instructions with personas + examples + output format, sufficient context, task decomposition, CoT. Version and organize prompts separately from code. Defensive engineering: direct hacking, automated attacks (PAIR), and indirect prompt injection (via RAG/email) all require model-level + prompt-level + system-level defenses.

Ch 6: RAG and Agents

RAG = retrieve relevant context per query before generation; remains valuable even with long contexts. Retrieval: term-based (BM25) is strong and fast; embedding-based (semantic) is better with finetuning but costs more; hybrid combines both. Agents = model + tool inventory + planning. Planning involves: plan generation → validation → execution → reflection. ReAct and Reflexion are key agent frameworks. Memory hierarchy: internal (weights), short-term (context), long-term (external storage).
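One common way to combine term-based and embedding-based results is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned a ranked list of document IDs; the IDs and the constant k=60 (a widely used default) are illustrative, not taken from the summary above.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]       # hypothetical term-based results
embedding_hits = ["d3", "d1", "d4"]  # hypothetical embedding-based results
print(reciprocal_rank_fusion([bm25_hits, embedding_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that both retrievers rank highly rise to the top, and no score calibration between the two retrievers is needed because only ranks are used.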

Ch 7: Finetuning

Finetuning is for form (behavior, style, format); RAG is for facts. Start with prompting → RAG → finetuning. Memory bottleneck: full finetuning of a 7B model with Adam needs about 56 GB in 16-bit precision (14 GB of weights plus 42 GB of gradients and optimizer states, before activations), beyond most consumer GPUs. PEFT (especially LoRA) reduces trainable parameters by orders of magnitude with minimal quality loss. LoRA works because pre-training lowers a model's intrinsic dimension, so low-rank updates are enough to adapt it. QLoRA enables finetuning a 65B model on a single 48 GB GPU. Model merging (linear combination, SLERP, task vectors, frankenmerging) enables multi-task models, on-device deployment, and federated learning.
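The 56 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 16-bit values (2 bytes each), one gradient plus two Adam states per trainable parameter, and ignores activations; the function name and the `trainable_fraction` parameter are hypothetical.

```python
def finetuning_memory_gb(params_billion, bytes_per_value=2, trainable_fraction=1.0):
    """Rough memory for weights + gradients + Adam states, ignoring activations."""
    weights = params_billion * bytes_per_value
    # One gradient plus two Adam moment estimates per trainable parameter.
    training_state = params_billion * trainable_fraction * 3 * bytes_per_value
    return weights + training_state

print(finetuning_memory_gb(7))  # 56.0 (14 GB weights + 42 GB training state)
# With only 1% of parameters trainable (a PEFT-style setup), the training state nearly vanishes.
print(round(finetuning_memory_gb(7, trainable_fraction=0.01), 2))
```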

Ch 8: Dataset Engineering

Three golden goals: quantity, quality, diversity. Quality > quantity: 10K careful examples > hundreds of thousands of noisy ones. Llama 3: performance gains came from data improvements, not architecture. Synthetic data solves quantity/coverage problems but has limitations: superficial imitation, model collapse (recursive AI training), and obscured data lineage. Model distillation trains a smaller student model from a larger teacher. Manual inspection is irreplaceable.

Ch 9: Inference Optimization

LLM inference: prefill (compute-bound) + decode (bandwidth-bound) → often decoupled in production. Key metrics: TTFT (prefill), TPOT (decode), goodput (requests/s meeting SLO), MFU/MBU (hardware efficiency). Most impactful techniques: quantization (weight-only; 16→8→4 bit), KV cache + PagedAttention, continuous batching, speculative decoding (2× speedup, no quality change), prompt caching (50–90% cost/latency reduction), tensor parallelism.
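The latency metrics combine in a simple way. The sketch below uses the common approximation that total latency is TTFT plus TPOT times the number of output tokens, and treats goodput as the rate of requests that meet both targets. The function names, thresholds, and sample numbers are assumptions for illustration.

```python
def total_latency_s(ttft_s, tpot_s, output_tokens):
    """Approximate end-to-end latency: time to first token plus decode time."""
    return ttft_s + tpot_s * output_tokens

def goodput(requests, window_s, max_ttft_s, max_tpot_s):
    """Requests per second that meet both the TTFT and TPOT targets.

    `requests` is a list of (ttft_s, tpot_s) pairs observed during `window_s` seconds.
    """
    ok = sum(1 for ttft, tpot in requests if ttft <= max_ttft_s and tpot <= max_tpot_s)
    return ok / window_s

# Hypothetical request: 0.5 s prefill, 50 ms per token, 100 output tokens.
print(total_latency_s(0.5, 0.05, 100))

# Three requests observed in one second; only the first meets both targets.
observed = [(0.1, 0.05), (0.3, 0.05), (0.1, 0.2)]
print(goodput(observed, window_s=1.0, max_ttft_s=0.2, max_tpot_s=0.1))
```

Throughput would count all three requests here, which is why goodput is the more honest number when an SLO is in place.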

Ch 10: Architecture and User Feedback

Incremental architecture: query→model (baseline) → context (RAG/tools) → guardrails → router+gateway → caching → agent patterns. Each step adds both capability and failure modes. Observability means instrumenting the system so that its internal state can be inferred from its external outputs. User feedback = proprietary data = competitive advantage. Conversational feedback: natural language corrections + edits are rich preference signals. Feedback biases (leniency, position, verbosity) and degenerate feedback loops are real risks.
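Two of the incremental layers, an input guardrail and an exact-match cache, can be sketched as a wrapper around any model call. Everything here is hypothetical: `make_pipeline`, the `model` callable, and the keyword-based guardrail stand in for whatever components a real system would use.

```python
def make_pipeline(model, cache=None, blocked_terms=()):
    """Wrap a model call with a simple input guardrail and an exact-match cache.

    `model` is any callable taking a query string and returning a response string.
    `blocked_terms` should be lowercase; matching is case-insensitive.
    """
    cache = {} if cache is None else cache

    def answer(query):
        # Input guardrail: refuse before spending any model compute.
        if any(term in query.lower() for term in blocked_terms):
            return "Sorry, I can't help with that."
        # Exact-match cache: reuse the response for a repeated query.
        if query in cache:
            return cache[query]
        response = model(query)
        cache[query] = response
        return response

    return answer

# Usage with a stand-in model function.
pipeline = make_pipeline(lambda q: "echo: " + q, blocked_terms=("password",))
print(pipeline("hello"))
```

Each layer is also a new place for things to go wrong (stale cache entries, over-blocking guardrails), which is the trade-off the incremental approach makes visible.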


Key Mental Models

The Three Adaptation Techniques:

Technique    Modifies Weights?   Effort   When to Use
Prompting    No                  Low      Always first
RAG          No                  Medium   Information failures
Finetuning   Yes                 High     Behavior failures

The Evaluation Pyramid:

  1. Exact (functional correctness, similarity) — deterministic, limited coverage
  2. AI as a judge — flexible, subjective, biased
  3. Human evaluation — gold standard, slow, expensive
    Use all three; no single method suffices.

The Memory Hierarchy for Agents:

  • Internal knowledge (weights) → for things needed in every task
  • Short-term memory (context) → for current session context
  • Long-term memory (external RAG) → for persistent personalization

Inference Optimization Priority:

  1. Quantization — universal, easy, massive impact
  2. Prompt caching — critical for repeated system prompts
  3. KV cache management — critical for long contexts
  4. Continuous batching — best latency/throughput balance
  5. Speculative decoding — 2× speedup, no quality change

Memorable Quotes & Stats

  • “The journey from 0 to 60 is easy; progressing from 60 to 100 becomes exceedingly challenging.” — UltraChat
  • “Finetuning is for form; RAG is for facts.”
  • “A benchmark stops being useful as soon as it becomes public.” (data contamination)
  • “Manual inspection of data has probably the highest value-to-prestige ratio of any activity in ML.” — Greg Brockman
  • GPT-4 + 13-tool agent outperforms GPT-4 alone by 11% on ScienceQA and 17% on TabMWP
  • 100M model + verifier ≈ 3B model without verifier
  • Anthropic prompt caching: 100K-token cached prompt → 79% latency reduction, 90% cost reduction
  • LinkedIn: 1 month to reach 80% of target; 4 more months to reach 95%
  • BloombergGPT ($2.6M, 50B params) was outperformed by zero-shot GPT-4 in the same month BloombergGPT was released

Book’s Framework: When to Apply Each Technique

Start: Define evaluation criteria + build evaluation pipeline
↓
Step 1: Prompting with best practices (zero-shot → few-shot)
↓ if information failures
Step 2: RAG (term-based first, then embedding-based)
↓ if behavior failures  
Step 3: Finetuning (LoRA first, then full finetuning)
↓ for all
Step 4: Inference optimization (quantization + caching + batching)
↓ when in production
Step 5: Architecture (context → guardrails → router/gateway → cache → agents)
↓ continuous
Step 6: User feedback loop (collect → analyze → improve)