Chapter 1: Introduction to Building AI Applications with Foundation Models

The Rise of AI Engineering

Language Models → LLMs → Foundation Models → AI Engineering

Language models encode statistical information about language: how likely a word is to appear in a given context. The basic unit is a token, which can be a character, part of a word, or a whole word.
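
As a rough illustration of "statistical information about language", the sketch below counts how often each word follows another in a tiny corpus and turns the counts into conditional probabilities. The corpus, the whitespace tokenization, and the function name next_word_probs are assumptions made up for this example; real language models learn far richer statistics over subword tokens.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for this sketch); real models train on vastly more text.
corpus = [
    "my favorite color is blue",
    "my favorite color is green",
    "my favorite food is pizza",
]

# Count how often each word follows a given previous word (a bigram model).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()  # whitespace tokens, a simplification
    for prev, nxt in zip(tokens, tokens[1:]):
        follow_counts[prev][nxt] += 1


def next_word_probs(prev_word):
    """Return P(next word | previous word) estimated from the counts."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


print(next_word_probs("favorite"))  # "color" is twice as likely as "food"
print(next_word_probs("is"))        # three equally likely continuations
```

A word that never appears in the corpus simply yields an empty dictionary here, which hints at why larger training sets matter.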

Two types of language models:

  • Masked LM (e.g., BERT): predicts missing tokens using both preceding and following context. Used for non-generative tasks (sentiment analysis, text classification) and for tasks that need whole-context understanding, such as code debugging.
  • Autoregressive LM (e.g., GPT): predicts the next token using only preceding tokens. Dominates text generation. In this book, “language model” means an autoregressive model unless stated otherwise.
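
The difference between the two types comes down to which tokens the model may look at when predicting a position. A minimal sketch, assuming word-level tokens and a "[MASK]" placeholder; the function names are hypothetical and no real model is involved:

```python
MASK = "[MASK]"  # placeholder token used in this sketch


def autoregressive_context(tokens, position):
    """Context for predicting tokens[position]: only the preceding tokens."""
    return tokens[:position]


def masked_context(tokens, position):
    """Context for predicting tokens[position]: everything except that token."""
    return tokens[:position] + [MASK] + tokens[position + 1:]


tokens = ["my", "favorite", "color", "is", "blue"]
target = 2  # index of the token to predict ("color")

print(autoregressive_context(tokens, target))  # ['my', 'favorite']
print(masked_context(tokens, target))          # ['my', 'favorite', '[MASK]', 'is', 'blue']
```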

Self-supervision is what enabled scale:

  • Supervision requires expensive labeled data
  • Self-supervision infers labels from the input itself: the sentence “I love street food.” yields 6 (context, next token) training samples automatically, counting the period and the begin/end-of-sequence markers
  • This allowed training on internet-scale text without labeling costs → LLMs
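
The way a single sentence turns into several training samples can be sketched as follows. This assumes word-level tokens, the period as its own token, and <BOS>/<EOS> markers around the sequence; the helper name next_token_samples is hypothetical.

```python
BOS, EOS = "<BOS>", "<EOS>"  # begin/end-of-sequence markers (assumed names)


def next_token_samples(tokens):
    """Turn one token sequence into (context, next_token) training pairs."""
    sequence = [BOS] + tokens + [EOS]
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# Word-level tokens with the period as its own token (a simplification).
tokens = ["I", "love", "street", "food", "."]
samples = next_token_samples(tokens)

for context, label in samples:
    print(context, "->", label)
print(len(samples))  # 6
```

Every label comes from the text itself, so no human annotation is needed.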

From LLMs to foundation models:

  • Language models are limited to text; humans perceive the world multimodally
  • Multimodal models (GPT-4V, Gemini, Claude 3) understand images, video, 3D, etc.
  • Foundation models = large language + large multimodal models
  • Foundation models mark a shift from task-specific models to general-purpose models
  • CLIP (OpenAI, 2021): used 400M (image, text) pairs scraped from the web — 400× larger than ImageNet, without manual labeling

From foundation models to AI engineering:

  • Three enabling factors:
    1. General-purpose capabilities — AI can now do tasks previously thought impossible
    2. Increased AI investments — Goldman Sachs estimated $200B globally by 2025
    3. Low entrance barrier — model-as-a-service APIs; anyone can build without ML expertise

Foundation Model Use Cases

Category                | Consumer                         | Enterprise
------------------------|----------------------------------|-----------------------------------------
Coding                  | Code generation, completion      | Code generation, migration, docs
Image/video             | Profile photos, editing          | Ad generation, marketing
Writing                 | Email, essays                    | Copywriting, SEO, reports
Education               | Tutoring, quiz generation        | Employee onboarding, training
Conversational bots     | Companions, therapists           | Customer support, product copilots
Information aggregation | Talk-to-your-docs, summarization | Market research, meeting summaries
Data organization       | Image search                     | Document processing, knowledge management
Workflow automation     | Travel planning, form filling    | Data entry, lead generation

Key insights:

  • Coding is the most popular use case (AI engineers are coders)
  • Companies prefer internal-facing apps first (lower compliance risk)
  • “Journey from 0→60 is easy; 60→100 is exceedingly challenging” (UltraChat, Ding et al. 2023)
  • Eloundou et al. (2023): occupations with 100% AI exposure include mathematicians, tax preparers, financial analysts, writers, web designers

Planning AI Applications

Use Case Evaluation

Three levels of risk/motivation:

  1. Existential threat: competitors with AI can make you obsolete (common in document processing and creative work)
  2. Opportunity — AI boosts profits and productivity
  3. Exploratory — don’t want to be left behind

AI’s role in the product:

  • Critical vs. complementary: Face ID (critical) vs. Gmail Smart Compose (complementary). More critical → higher quality bar.
  • Reactive vs. proactive: Chatbot (reactive) vs. traffic alerts (proactive). Proactive features have a higher quality bar since users didn’t ask for them.
  • Dynamic vs. static: Dynamic = continually updated per user; static = periodic updates for a group.

Human-in-the-loop (HITL): Microsoft’s Crawl-Walk-Run framework:

  1. Crawl: human involvement mandatory
  2. Walk: AI interacts with internal employees
  3. Run: increased automation with external users

AI product defensibility:

  • Low entry barrier is both a blessing and a curse
  • Three competitive advantages: technology, data, distribution
  • Data flywheel: first mover who gathers usage data can continually improve → data moat
  • A feature built as a thin layer on a foundation model may be subsumed when the model itself gains that capability → build on open-source models or cultivate unique data

Setting Expectations

Define a usefulness threshold before shipping:

  • Quality metrics (hallucination rate, accuracy)
  • Latency: time to first token (TTFT), time per output token (TPOT), total latency
  • Cost per inference request
  • Fairness, interpretability
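
One way to make a usefulness threshold concrete is to write it down as a checklist that measured values must pass before shipping. The sketch below is only an illustration: the threshold values, field names, and the helper failed_checks are hypothetical, and total latency is approximated as TTFT plus TPOT times the number of output tokens. Fairness and interpretability are left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Measured:
    accuracy: float            # fraction of correct responses, 0..1
    hallucination_rate: float  # fraction of responses with hallucinations, 0..1
    ttft_s: float              # time to first token, in seconds
    tpot_s: float              # time per output token, in seconds
    output_tokens: int         # typical response length, in tokens
    cost_per_request: float    # cost of one inference request, in USD


def total_latency_s(m: Measured) -> float:
    """Approximate total latency: TTFT plus TPOT for each output token."""
    return m.ttft_s + m.tpot_s * m.output_tokens


# Hypothetical thresholds; the right values depend on the product.
THRESHOLDS = {
    "min_accuracy": 0.90,
    "max_hallucination_rate": 0.05,
    "max_total_latency_s": 5.0,
    "max_cost_per_request": 0.01,
}


def failed_checks(m: Measured, t=THRESHOLDS):
    """Return the names of the checks that fail (an empty list means all pass)."""
    checks = {
        "accuracy": m.accuracy >= t["min_accuracy"],
        "hallucination_rate": m.hallucination_rate <= t["max_hallucination_rate"],
        "total_latency": total_latency_s(m) <= t["max_total_latency_s"],
        "cost_per_request": m.cost_per_request <= t["max_cost_per_request"],
    }
    return [name for name, ok in checks.items() if not ok]


measured = Measured(
    accuracy=0.93, hallucination_rate=0.08,
    ttft_s=0.5, tpot_s=0.02, output_tokens=200,
    cost_per_request=0.004,
)
print(total_latency_s(measured))
print(failed_checks(measured))  # ['hallucination_rate']
```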

Milestone Planning

  • “Initial success can be misleading” — easy to build a demo, hard to build a product
  • LinkedIn: 1 month to reach 80% of target experience; 4 more months to reach 95%
  • Evaluate off-the-shelf models first to understand the starting point

Maintenance

  • AI space moves fast — commit to “riding the bullet train”
  • Good changes (cheaper inference, longer context) still cause friction: prompts, data, workflows need adjustment
  • Regulatory changes can be fatal: GDPR, US export controls on compute, evolving IP law

The AI Engineering Stack

Three Layers

Layer                   | Responsibilities
------------------------|-----------------------------------------------------------------
Application development | Prompt engineering, evaluation, AI interface
Model development       | Modeling & training, dataset engineering, inference optimization
Infrastructure          | Model serving, compute management, monitoring

Of the three layers, infrastructure has grown the slowest, because its core needs (resource management, serving, monitoring) remain the same.

AI Engineering vs. ML Engineering

Three key differences:

  1. Use existing models instead of training from scratch → focus shifts to model adaptation not model development
  2. Bigger models → more pressure on inference optimization; need GPU expertise at scale
  3. Open-ended outputs → evaluation is much harder than close-ended tasks
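
The third difference can be seen with a small, made-up example: exact-match scoring works when answers come from a fixed label set, but it gives no credit to an open-ended answer that is worded differently from the reference. The data and the function name exact_match_accuracy are assumptions for this sketch.

```python
def normalize(text):
    """Lowercase and trim whitespace so trivial differences do not count."""
    return text.strip().lower()


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Close-ended task: each answer comes from a fixed label set.
labels_pred = ["positive", "negative", "positive"]
labels_ref = ["positive", "negative", "negative"]
print(exact_match_accuracy(labels_pred, labels_ref))  # 2 of 3 labels match

# Open-ended task: two acceptable summaries of the same (made-up) meeting.
summary_pred = ["The team agreed to ship the feature next week."]
summary_ref = ["Launch of the feature was scheduled for the following week."]
print(exact_match_accuracy(summary_pred, summary_ref))  # 0.0 despite similar meaning
```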

Two categories of model adaptation:

  • Prompt-based: no weight updates; cheaper, easier to start; may not suffice for complex tasks
  • Finetuning: updates weights; more complex, more data needed; unlocks things prompt engineering can’t
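
Prompt-based adaptation changes only the model's input. A minimal sketch of a few-shot prompt builder, assuming a sentiment task with made-up examples; the function name build_prompt and the "Input:/Output:" layout are illustrative choices, and the call that sends the prompt to a model is omitted because it depends on the provider:

```python
def build_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt(
    instruction="Classify the sentiment of each input as positive or negative.",
    examples=[
        ("I love street food", "positive"),
        ("The queue was endless", "negative"),
    ],
    query="The noodles were cold",
)
print(prompt)
```

Adapting the model to a different task here means editing the instruction and examples; no weights are touched.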

Model Development Layer Changes

Category               | Traditional ML                    | Foundation Models
-----------------------|-----------------------------------|---------------------------------------------------------------
Modeling & training    | ML knowledge required             | Nice-to-have, not required
Dataset engineering    | Feature engineering, tabular data | Deduplication, tokenization, context retrieval, quality control
Inference optimization | Important                         | More important (autoregressive = sequential, adds latency)
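
The note that autoregressive generation is sequential can be sketched with a toy decoding loop. The function toy_next_token is a hypothetical stand-in for a model forward pass that replays a canned sentence; the point is only that each output token needs its own call, made after the previous token is known.

```python
def toy_next_token(tokens):
    """Stand-in for a model forward pass (hypothetical); returns the next token."""
    canned = ["I", "love", "street", "food", ".", "<EOS>"]
    return canned[min(len(tokens), len(canned) - 1)]


def generate(next_token_fn, max_new_tokens=10):
    """Generate tokens one at a time; each step depends on all earlier tokens."""
    tokens, calls = [], 0
    while len(tokens) < max_new_tokens:
        token = next_token_fn(tokens)  # one sequential model call per step
        calls += 1
        if token == "<EOS>":
            break
        tokens.append(token)
    return tokens, calls


tokens, calls = generate(toy_next_token)
print(tokens)  # ['I', 'love', 'street', 'food', '.']
print(calls)   # 6
```

Because step N cannot start until step N-1 has produced its token, longer outputs translate directly into more latency.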

Training terminology clarified:

  • Pre-training: training from scratch; most resource-intensive (98% of compute for InstructGPT)
  • Finetuning: continuing from pre-trained weights; less data and compute
  • Post-training: finetuning done by model developers (vs. finetuning done by application developers)

Application Development Layer Changes

Category           | Traditional ML | Foundation Models
-------------------|----------------|------------------
AI interface       | Less important | Important
Prompt engineering | Not applicable | Important
Evaluation         | Important      | More important

AI Engineering vs. Full-Stack Engineering

  • Rising importance of AI interfaces pulls AI engineering closer to full-stack development
  • ML engineering has been Python-centric; AI engineering tooling now also supports JavaScript (LangChain.js, Transformers.js, Vercel AI SDK)
  • New workflow: build product first, then invest in data/models (inverted vs. traditional ML)
  • AI engineers are more product-involved than traditional ML engineers

Key Takeaways

  • Foundation models lowered the barrier to building AI apps while raising the ceiling on what’s possible
  • AI engineering is not just ML engineering with new tools — the emphasis shifts from model development to model adaptation and evaluation
  • Evaluation is now the hardest part of AI engineering (open-ended outputs, many adaptation techniques)
  • The data advantage is the most defensible moat for application-layer companies
  • Start simple; use off-the-shelf models; only invest in finetuning when prompting isn’t enough