Chapter 1: Introduction to Building AI Applications with Foundation Models

The Rise of AI Engineering

Language Models → LLMs → Foundation Models → AI Engineering

Language models encode statistical information about language: how likely a word is to appear in a given context. The basic unit is a token, which can be a character, part of a word, or a whole word.
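
To make the three token granularities concrete, here is a minimal sketch. The helper names and `TOY_VOCAB` are hypothetical, and the greedy longest-match split is only a stand-in for how real subword tokenizers behave, not any particular library's algorithm.

```python
def char_tokens(text: str) -> list[str]:
    """Character-level tokens: every character is its own token."""
    return list(text)


def word_tokens(text: str) -> list[str]:
    """Word-level tokens: split on whitespace."""
    return text.split()


def subword_tokens(word: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match split of one word into known pieces.

    Falls back to a single character when no vocabulary piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


# Hypothetical vocabulary, chosen only for this illustration.
TOY_VOCAB = {"cook", "ing", "street", "food"}

print(char_tokens("food"))
print(word_tokens("I love street food"))
print(subword_tokens("cooking", TOY_VOCAB))
```

Splitting "cooking" into "cook" and "ing" shows why part-word tokens are useful: the pieces carry meaning and can be reused across many words.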

Two types of language models:

Masked LM (e.g., BERT): predicts missing tokens using both preceding and following context. Used for non-generative tasks (sentiment analysis, text classification) and for tasks that need the overall context, such as code debugging.
Autoregressive LM (e.g., GPT): predicts the next token using only preceding tokens. Dominates text generation. In this book, “language model” means an autoregressive model unless stated otherwise.
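
The difference between the two types comes down to which tokens the model may look at when predicting a target position. The sketch below only builds the visible context; it does not run a model, and the function names and the `[MASK]` placeholder string are assumptions for illustration.

```python
MASK = "[MASK]"  # placeholder for the hidden token (illustrative)


def masked_context(tokens: list[str], position: int) -> list[str]:
    """Context for a masked LM: everything except the hidden token."""
    return tokens[:position] + [MASK] + tokens[position + 1:]


def autoregressive_context(tokens: list[str], position: int) -> list[str]:
    """Context for an autoregressive LM: only tokens before the target."""
    return tokens[:position]


tokens = ["My", "favorite", "color", "is", "blue"]
target = 2  # predict "color"

print(masked_context(tokens, target))
print(autoregressive_context(tokens, target))
```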

Self-supervision is what enabled scale:

Supervision requires expensive labeled data
Self-supervision infers labels from the input itself: with beginning- and end-of-sequence markers added, a sentence like “I love street food.” yields 6 training samples automatically
This allowed training on internet-scale text without labeling costs → LLMs
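
The sample count can be checked with a short sketch. It assumes word-level tokens, treats the final period as its own token, and uses `<BOS>` and `<EOS>` as sequence boundary markers; `next_token_samples` is a hypothetical helper.

```python
BOS, EOS = "<BOS>", "<EOS>"  # sequence boundary markers


def next_token_samples(words: list[str]) -> list[tuple[list[str], str]]:
    """Turn one sentence into (context, next token) training pairs.

    Every prefix of the sequence is an input, and the token that
    follows it is the label, so no manual labeling is needed.
    """
    tokens = [BOS] + words + [EOS]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


samples = next_token_samples(["I", "love", "street", "food", "."])
for context, label in samples:
    print(context, "->", label)
print(len(samples), "samples")
```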

From LLMs to foundation models:

Language models are limited to text; humans perceive the world multimodally
Multimodal models (GPT-4V, Gemini, Claude 3) understand images, video, 3D, etc.
Foundation models = large language + large multimodal models
These models mark a shift from task-specific models to general-purpose models
CLIP (OpenAI, 2021): used 400M (image, text) pairs scraped from the web — 400× larger than ImageNet, without manual labeling

From foundation models to AI engineering:

Three enabling factors:
1. General-purpose capabilities — AI can now do tasks previously thought impossible
2. Increased AI investments: Goldman Sachs Research estimated that AI investment could approach $200B globally by 2025
3. Low entrance barrier — model-as-a-service APIs; anyone can build without ML expertise

Foundation Model Use Cases

Category	Consumer	Enterprise
Coding	Code generation, completion	Code generation, migration, docs
Image/video	Profile photos, editing	Ad generation, marketing
Writing	Email, essays	Copywriting, SEO, reports
Education	Tutoring, quiz generation	Employee onboarding, training
Conversational bots	Companions, therapists	Customer support, product copilots
Information aggregation	Talk-to-your-docs, summarization	Market research, meeting summaries
Data organization	Image search	Document processing, knowledge management
Workflow automation	Travel planning, form filling	Data entry, lead generation

Key insights:

Coding is the most popular use case (AI engineers are coders)
Companies prefer internal-facing apps first (lower compliance risk)
“Journey from 0→60 is easy; 60→100 is exceedingly challenging” (UltraChat, Ding et al. 2023)
Eloundou et al. (2023): occupations with 100% AI exposure include mathematicians, tax preparers, financial analysts, writers, web designers

Planning AI Applications

Use Case Evaluation

Three levels of risk/motivation:

Existential threat — competitors with AI can make you obsolete (document processing, creative work)
Opportunity — AI boosts profits and productivity
Exploratory — don’t want to be left behind

AI’s role in the product:

Critical vs. complementary: Face ID (critical) vs. Gmail Smart Compose (complementary). More critical → higher quality bar.
Reactive vs. proactive: Chatbot (reactive) vs. traffic alerts (proactive). Proactive features have a higher quality bar since users didn’t ask for them.
Dynamic vs. static: Dynamic = continually updated per user; static = periodic updates for a group.

Human-in-the-loop (HITL), following Microsoft’s Crawl-Walk-Run framework:

Crawl: human involvement mandatory
Walk: AI can interact directly with internal employees
Run: increased automation, potentially including direct AI interactions with external users
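
One way to picture the three stages is as a review policy that relaxes over time. The sketch below is an interpretation, not Microsoft's specification: the `Stage` enum, the `"internal"` and `"external"` audience labels, and the choice to drop mandatory review entirely in the Run stage are all assumptions.

```python
from enum import Enum


class Stage(Enum):
    CRAWL = "crawl"
    WALK = "walk"
    RUN = "run"


def needs_human_review(stage: Stage, audience: str) -> bool:
    """Return True when a person must approve AI output before it is used.

    audience is "internal" (employees) or "external" (customers);
    both labels are hypothetical.
    """
    if audience not in ("internal", "external"):
        raise ValueError(f"unknown audience: {audience!r}")
    if stage is Stage.CRAWL:
        return True  # human involvement is mandatory
    if stage is Stage.WALK:
        return audience == "external"  # AI talks directly to employees only
    return False  # Run: assumed fully automated, including external users


for stage in Stage:
    print(stage.value, needs_human_review(stage, "external"))
```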

AI product defensibility:

Low entry barrier is both a blessing and a curse
Three competitive advantages: technology, data, distribution
Data flywheel: first mover who gathers usage data can continually improve → data moat
Foundation model capabilities may subsume features → build on open-source models or cultivate unique data

Setting Expectations

Define a usefulness threshold before shipping:

Quality metrics (hallucination rate, accuracy)
Latency: time to first token (TTFT), time per output token (TPOT), and total latency
Cost per inference request
Fairness, interpretability
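
A usefulness threshold can be written down as explicit limits and checked against measurements. The sketch below is one possible shape: the field names, the per-1,000-token pricing model, and all numbers are hypothetical. Total latency uses the common approximation TTFT + TPOT × output tokens.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_hallucination_rate: float
    max_total_latency_s: float
    max_cost_per_request_usd: float


def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate total latency: time to first token plus time per output token."""
    return ttft_s + tpot_s * output_tokens


def request_cost_usd(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request, assuming separate input and output token prices."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)


def failed_checks(hallucination_rate: float, latency_s: float,
                  cost_usd: float, limits: Thresholds) -> list[str]:
    """Return the names of the metrics that exceed their limits."""
    failures = []
    if hallucination_rate > limits.max_hallucination_rate:
        failures.append("hallucination_rate")
    if latency_s > limits.max_total_latency_s:
        failures.append("total_latency")
    if cost_usd > limits.max_cost_per_request_usd:
        failures.append("cost_per_request")
    return failures


# All numbers below are made up for illustration.
limits = Thresholds(max_hallucination_rate=0.02,
                    max_total_latency_s=4.0,
                    max_cost_per_request_usd=0.01)
latency = total_latency_s(ttft_s=0.2, tpot_s=0.05, output_tokens=100)
cost = request_cost_usd(1000, 100, price_in_per_1k=0.001, price_out_per_1k=0.002)
print(failed_checks(0.01, latency, cost, limits))
```

Returning the list of failing metrics, rather than a single pass/fail flag, shows which dimension is blocking a release.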

Milestone Planning

“Initial success can be misleading” — easy to build a demo, hard to build a product
LinkedIn: 1 month to reach 80% of target experience; 4 more months to reach 95%
Evaluate off-the-shelf models first to understand the starting point

Maintenance

AI space moves fast — commit to “riding the bullet train”
Good changes (cheaper inference, longer context) still cause friction: prompts, data, workflows need adjustment
Regulatory changes can be fatal: GDPR, US export controls on compute, evolving IP law

The AI Engineering Stack

Three Layers

Layer	Responsibilities
Application development	Prompt engineering, evaluation, AI interface
Model development	Modeling & training, dataset engineering, inference optimization
Infrastructure	Model serving, compute management, monitoring

The infrastructure layer has grown the slowest, since its core needs (resource management, serving, monitoring) remain the same.

AI Engineering vs. ML Engineering

Three key differences:

Use existing models instead of training from scratch → focus shifts from model development to model adaptation
Bigger models → more pressure on inference optimization; need GPU expertise at scale
Open-ended outputs → evaluation is much harder than close-ended tasks
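
The evaluation gap in the third point can be shown with a simple metric. Exact match works when the answer comes from a small fixed set, but it rejects a correct paraphrase of an open-ended answer. The example strings below are made up for illustration.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality, a typical close-ended metric."""
    return prediction.strip().lower() == reference.strip().lower()


# Close-ended task: the answer is one label from a fixed set.
print(exact_match("Positive", "positive"))

# Open-ended task: the prediction means the same thing as the reference
# but uses different words, so exact match marks it as wrong.
reference = "The meeting was moved to Friday."
prediction = "They rescheduled the meeting for Friday."
print(exact_match(prediction, reference))
```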

Two categories of model adaptation:

Prompt-based: no weight updates; cheaper, easier to start; may not suffice for complex tasks
Finetuning: updates weights; more complex, more data needed; unlocks things prompt engineering can’t
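
Prompt-based adaptation changes only the text sent to the model. As a minimal sketch, the hypothetical helper below assembles an instruction and a few worked examples into one prompt string. The `Input:`/`Output:` layout is an assumption, and no model is called.

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great noodles, will come back.", "positive"),
     ("Cold food and slow service.", "negative")],
    "The street food tour was wonderful.",
)
print(prompt)
```

Changing the model's behavior here means editing the instruction or the examples, which is why this route is cheaper to start with than finetuning.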

Model Development Layer Changes

Category	Traditional ML	Foundation Models
Modeling & training	ML knowledge required	Nice-to-have, not required
Dataset engineering	Feature engineering, tabular data	Deduplication, tokenization, context retrieval, quality control
Inference optimization	Important	More important (autoregressive = sequential, adds latency)
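
The latency cost of autoregressive generation comes from the fact that each step needs the output of the previous step. The toy loop below uses a hypothetical lookup table in place of a model. A real model conditions on all previous tokens rather than just the last one, but the sequential dependency is the same.

```python
# Hypothetical next-token table standing in for a trained model.
NEXT_TOKEN = {
    "<BOS>": "I",
    "I": "love",
    "love": "street",
    "street": "food",
    "food": "<EOS>",
}


def generate(max_tokens: int = 10) -> tuple[list[str], int]:
    """Generate tokens one at a time and count the sequential steps taken."""
    tokens = ["<BOS>"]
    steps = 0
    while len(tokens) - 1 < max_tokens:
        # Step t cannot start until step t-1 has produced its token.
        next_token = NEXT_TOKEN[tokens[-1]]
        steps += 1
        if next_token == "<EOS>":
            break
        tokens.append(next_token)
    return tokens[1:], steps


output, steps = generate()
print(output, "in", steps, "sequential steps")
```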

Training terminology clarified:

Pre-training: training from scratch; most resource-intensive (up to 98% of compute for InstructGPT)
Finetuning: continuing from pre-trained weights; less data and compute
Post-training: finetuning done by model developers (vs. finetuning done by application developers)

Application Development Layer Changes

Category	Traditional ML	Foundation Models
AI interface	Less important	Important
Prompt engineering	Not applicable	Important
Evaluation	Important	More important

AI Engineering vs. Full-Stack Engineering

Rising importance of AI interfaces pulls AI engineering closer to full-stack development
Python-centric → also JavaScript (LangChain.js, Transformers.js, Vercel AI SDK)
New workflow: build the product first, then invest in data and models (the reverse of the traditional ML workflow)
AI engineers are more product-involved than traditional ML engineers

Key Takeaways

Foundation models lowered the barrier to building AI apps while raising the ceiling on what’s possible
AI engineering is not just ML engineering with new tools — the emphasis shifts from model development to model adaptation and evaluation
Evaluation is now the hardest part of AI engineering (open-ended outputs, many adaptation techniques)
The data advantage is the most defensible moat for application-layer companies
Start simple; use off-the-shelf models; only invest in finetuning when prompting isn’t enough

Study Notes by Niladri & AI
