Chapter 1: Introduction to Building AI Applications with Foundation Models
The Rise of AI Engineering
Language Models → LLMs → Foundation Models → AI Engineering
Language models encode statistical information about language: how likely a word is to appear in a given context. The basic unit is a token, which can be a character, part of a word, or a whole word.
Two types of language models:
- Masked LM (e.g., BERT): predicts missing tokens using both preceding and following context. Used for non-generative tasks (sentiment analysis, text classification) and for tasks that need the whole context, such as code debugging.
- Autoregressive LM (e.g., GPT): predicts the next token using only preceding tokens. Dominates text generation. This book defaults to “language model” meaning autoregressive.
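The difference between the two model types comes down to which tokens are visible when a prediction is made. The sketch below illustrates that, assuming whitespace-split words as tokens and a `[MASK]` placeholder; the helper names `autoregressive_view` and `masked_view` are hypothetical and not taken from any library.

```python
MASK = "[MASK]"  # placeholder symbol, assumed for illustration


def autoregressive_view(tokens, position):
    """Context available when predicting tokens[position]: preceding tokens only."""
    return tokens[:position]


def masked_view(tokens, position):
    """Context available to a masked LM: the full sequence with one token hidden."""
    return tokens[:position] + [MASK] + tokens[position + 1:]


tokens = ["my", "favorite", "color", "is", "blue"]
target = 2  # index of the token to predict ("color")

print(autoregressive_view(tokens, target))  # ['my', 'favorite']
print(masked_view(tokens, target))          # ['my', 'favorite', '[MASK]', 'is', 'blue']
```

The masked view keeps the tokens after the gap, which is why masked models suit tasks that depend on the whole input rather than on left-to-right generation.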
Self-supervision is what enabled scale:
- Supervision requires expensive labeled data
- Self-supervision infers labels from the input itself: a sentence like “I love street food.” yields 6 training samples automatically, one per predicted token once the period and an end-of-sequence marker are counted
- This allowed training on internet-scale text without labeling costs → LLMs
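The way one sentence becomes several labeled samples can be sketched as follows. This is a minimal illustration, assuming word-level tokens plus `<BOS>` and `<EOS>` markers; the function name `next_token_samples` is hypothetical.

```python
BOS, EOS = "<BOS>", "<EOS>"  # sequence markers, assumed for illustration


def next_token_samples(tokens):
    """Turn one token sequence into (context, next_token) training pairs."""
    sequence = [BOS] + tokens + [EOS]
    # Every position after <BOS> serves as a label for the tokens before it.
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


tokens = ["I", "love", "street", "food", "."]
samples = next_token_samples(tokens)

for context, label in samples:
    print(context, "->", label)

print(len(samples))  # 6
```

No human wrote any of the labels: each one is simply the next token in the text, which is what lets this approach scale to very large corpora.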
From LLMs to foundation models:
- Language models are limited to text; humans perceive the world multimodally
- Multimodal models (e.g., GPT-4V, Gemini, Claude 3) understand images as well as text; some models also handle video, 3D assets, and other modalities
- Foundation models = large language + large multimodal models
- Mark a shift from task-specific models to general-purpose models
- CLIP (OpenAI, 2021): used 400M (image, text) pairs scraped from the web — 400× larger than ImageNet, without manual labeling
From foundation models to AI engineering:
- Three enabling factors:
  - General-purpose capabilities: AI can now do tasks previously thought impossible
  - Increased AI investments: Goldman Sachs estimated $200B globally by 2025
  - Low entrance barrier: model-as-a-service APIs; anyone can build without ML expertise
Foundation Model Use Cases
| Category | Consumer | Enterprise |
|---|---|---|
| Coding | Code generation, completion | Code generation, migration, docs |
| Image/video | Profile photos, editing | Ad generation, marketing |
| Writing | Email, essays | Copywriting, SEO, reports |
| Education | Tutoring, quiz generation | Employee onboarding, training |
| Conversational bots | Companions, therapists | Customer support, product copilots |
| Information aggregation | Talk-to-your-docs, summarization | Market research, meeting summaries |
| Data organization | Image search | Document processing, knowledge management |
| Workflow automation | Travel planning, form filling | Data entry, lead generation |
Key insights:
- Coding is the most popular use case (one plausible reason: AI engineers are themselves coders)
- Companies prefer internal-facing apps first (lower compliance risk)
- “Journey from 0→60 is easy; 60→100 is exceedingly challenging” (UltraChat, Ding et al. 2023)
- Eloundou et al. (2023): occupations with 100% AI exposure include mathematicians, tax preparers, financial analysts, writers, web designers
Planning AI Applications
Use Case Evaluation
Three levels of risk/motivation:
- Existential threat — competitors with AI can make you obsolete (document processing, creative work)
- Opportunity — AI boosts profits and productivity
- Exploratory — don’t want to be left behind
AI’s role in the product:
- Critical vs. complementary: Face ID (critical) vs. Gmail Smart Compose (complementary). More critical → higher quality bar.
- Reactive vs. proactive: Chatbot (reactive) vs. traffic alerts (proactive). Proactive features have a higher quality bar because users didn’t ask for them.
- Dynamic vs. static: Dynamic = continually updated per user; static = periodic updates for a group.
Human-in-the-loop (HITL): Microsoft’s Crawl-Walk-Run framework:
- Crawl: human involvement mandatory
- Walk: AI interacts with internal employees
- Run: increased automation with external users
AI product defensibility:
- Low entry barrier is both a blessing and a curse
- Three competitive advantages: technology, data, distribution
- Data flywheel: first mover who gathers usage data can continually improve → data moat
- Foundation model capabilities may subsume features → build on open-source models or cultivate unique data
Setting Expectations
Define a usefulness threshold before shipping:
- Quality metrics (hallucination rate, accuracy)
- Latency: time to first token (TTFT), time per output token (TPOT), total latency
- Cost per inference request
- Fairness, interpretability
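The latency and cost items above can be turned into a simple pre-ship check. The sketch below assumes the common approximation that total latency is TTFT plus TPOT times the number of output tokens, and that cost is billed per 1,000 input and output tokens. All prices, thresholds, and names (`RequestStats`, `meets_threshold`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # time per output token, in seconds
    input_tokens: int
    output_tokens: int


def total_latency_s(stats):
    # Approximation: wait for the first token, then one TPOT per output token.
    return stats.ttft_s + stats.tpot_s * stats.output_tokens


def request_cost_usd(stats, usd_per_1k_input, usd_per_1k_output):
    # Assumes separate per-1,000-token prices for input and output.
    return (stats.input_tokens * usd_per_1k_input
            + stats.output_tokens * usd_per_1k_output) / 1000


def meets_threshold(stats, max_latency_s, max_cost_usd,
                    usd_per_1k_input, usd_per_1k_output):
    """True when both the latency and the cost targets are satisfied."""
    cost = request_cost_usd(stats, usd_per_1k_input, usd_per_1k_output)
    return total_latency_s(stats) <= max_latency_s and cost <= max_cost_usd


# Hypothetical measurements and prices, for illustration only.
stats = RequestStats(ttft_s=0.5, tpot_s=0.05, input_tokens=1000, output_tokens=100)
print(f"{total_latency_s(stats):.2f} s")
print(f"{request_cost_usd(stats, 0.001, 0.002):.4f} USD")
print(meets_threshold(stats, max_latency_s=6.0, max_cost_usd=0.002,
                      usd_per_1k_input=0.001, usd_per_1k_output=0.002))
```

Quality, fairness, and interpretability targets would need their own metrics; this check covers only the two items that reduce to arithmetic.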
Milestone Planning
- “Initial success can be misleading” — easy to build a demo, hard to build a product
- LinkedIn: 1 month to reach 80% of target experience; 4 more months to reach 95%
- Evaluate off-the-shelf models first to understand the starting point
Maintenance
- AI space moves fast — commit to “riding the bullet train”
- Good changes (cheaper inference, longer context) still cause friction: prompts, data, workflows need adjustment
- Regulatory changes can be fatal: GDPR, US export controls on compute, evolving IP law
The AI Engineering Stack
Three Layers
| Layer | Responsibilities |
|---|---|
| Application development | Prompt engineering, evaluation, AI interface |
| Model development | Modeling & training, dataset engineering, inference optimization |
| Infrastructure | Model serving, compute management, monitoring |
The infrastructure layer grew the slowest of the three: its core needs (resource management, serving, monitoring) remain the same.
AI Engineering vs. ML Engineering
Three key differences:
- Use existing models instead of training from scratch → focus shifts from model development to model adaptation
- Bigger models → more pressure on inference optimization; need GPU expertise at scale
- Open-ended outputs → evaluation is much harder than close-ended tasks
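A small example may help show why open-ended outputs are harder to evaluate. The sketch below uses exact string match as the metric, a deliberately naive choice made here for illustration; the sentences are invented.

```python
def exact_match(prediction, reference):
    """Score 1.0 only when the normalized strings are identical."""
    return float(prediction.strip().lower() == reference.strip().lower())


# Close-ended: the answer comes from a fixed label set, so exact match is meaningful.
print(exact_match("Positive", "positive"))  # 1.0

# Open-ended: a correct paraphrase still scores 0.0 against the reference.
reference = "The meeting was moved to Friday."
prediction = "They rescheduled the meeting for Friday."
print(exact_match(prediction, reference))  # 0.0
```

With a fixed label set there is one right string to compare against; with free-form text there are many acceptable answers, so evaluation needs metrics or judges that tolerate variation.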
Two categories of model adaptation:
- Prompt-based: no weight updates; cheaper, easier to start; may not suffice for complex tasks
- Finetuning: updates weights; more complex, more data needed; unlocks things prompt engineering can’t
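Prompt-based adaptation changes only the text sent to the model. As a minimal sketch, the hypothetical helper below assembles a few-shot prompt from an instruction and labeled examples; the `Input:`/`Label:` layout is an assumption, and no model call is made.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and a new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Label: {label}", ""]
    # The prompt ends with an open label slot for the model to complete.
    lines += [f"Input: {query}", "Label:"]
    return "\n".join(lines)


examples = [
    ("The food was amazing.", "positive"),
    ("Service was slow and rude.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "I would come back again.",
)
print(prompt)
```

Adapting the model to a new task means editing `instruction` and `examples`, which is why this route is cheaper to start with than finetuning, where the weights themselves are updated.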
Model Development Layer Changes
| Category | Traditional ML | Foundation Models |
|---|---|---|
| Modeling & training | ML knowledge required | Nice-to-have, not required |
| Dataset engineering | Feature engineering, tabular data | Deduplication, tokenization, context retrieval, quality control |
| Inference optimization | Important | More important (autoregressive = sequential, adds latency) |
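The latency cost of autoregressive generation can be seen in a toy decoding loop. In the sketch below, `toy_next_token` is a hypothetical lookup table standing in for a real model's forward pass; the point is only that each step needs the output of the previous one.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a model forward pass: look up the next token."""
    table = {"<BOS>": "I", "I": "love", "love": "street",
             "street": "food", "food": "<EOS>"}
    return table[context[-1]]


def generate(max_new_tokens=10):
    tokens = ["<BOS>"]
    steps = 0
    while steps < max_new_tokens:
        # Each call depends on the tokens produced so far, so steps cannot overlap.
        next_token = toy_next_token(tokens)
        steps += 1
        if next_token == "<EOS>":
            break
        tokens.append(next_token)
    return tokens[1:], steps


output, steps = generate()
print(output, steps)  # ['I', 'love', 'street', 'food'] 5
```

Four output tokens took five sequential steps, counting the one that emits the end marker. With a large model, each step is a full forward pass, so output length translates directly into latency.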
Training terminology clarified:
- Pre-training: training from scratch; most resource-intensive (up to 98% of the overall compute and data resources for InstructGPT)
- Finetuning: continuing from pre-trained weights; less data and compute
- Post-training: finetuning done by model developers (vs. finetuning done by application developers)
Application Development Layer Changes
| Category | Traditional ML | Foundation Models |
|---|---|---|
| AI interface | Less important | Important |
| Prompt engineering | Not applicable | Important |
| Evaluation | Important | More important |
AI Engineering vs. Full-Stack Engineering
- Rising importance of AI interfaces pulls AI engineering closer to full-stack development
- Python-centric → also JavaScript (LangChain.js, Transformers.js, Vercel AI SDK)
- New workflow: build product first, then invest in data/models (inverted vs. traditional ML)
- AI engineers are more product-involved than traditional ML engineers
Key Takeaways
- Foundation models lowered the barrier to building AI apps while raising the ceiling on what’s possible
- AI engineering is not just ML engineering with new tools — the emphasis shifts from model development to model adaptation and evaluation
- Evaluation is now among the hardest parts of AI engineering (open-ended outputs, many adaptation techniques)
- Of the three competitive advantages (technology, data, distribution), data is often the most defensible moat for application-layer companies
- Start simple; use off-the-shelf models; only invest in finetuning when prompting isn’t enough