Chapter 2: Understanding Foundation Models
Training Data
Foundation models learn from their training data: a capability with no support in the data (e.g., a language that never appears in it) is one the model is unlikely to have.
Key data sources:
- Common Crawl: nonprofit crawls 2–3 billion web pages/month; used in GPT-3, Gemini. Contains misinformation, clickbait, racism, conspiracy theories.
- C4 (Colossal Clean Crawled Corpus): Google’s cleaner subset of Common Crawl.
- OpenAI heuristic: used only Reddit links with ≥3 upvotes for GPT-2.
Data quality > quantity: 7B tokens of high-quality coding data → 1.3B-parameter model that outperforms much larger models on coding benchmarks (Gunasekar et al., 2023).
Three golden goals for training data: Quantity, Quality, Diversity.
Multilingual Models
English dominates Common Crawl (45.9% of data). Under-represented languages suffer:
- GPT-4 on MMLU: much better in English than Telugu, Marathi, Punjabi
- GPT-4 fails all six math problems in Burmese and Amharic
- Burmese requires ~10× more tokens than English to express the same content → 10× slower and 10× more expensive
- Translation workaround loses information (e.g., Vietnamese social pronouns → I/you)
Domain-Specific Models
General-purpose models struggle on tasks never seen in training (protein structures, CT scans, etc.).
- AlphaFold: protein structure prediction (100K sequences/structures)
- BioNeMo (NVIDIA): biomolecular drug discovery
- Med-PaLM 2 (Google): medical queries
- Domain-specific models can be fine-tuned on top of general-purpose models instead of trained from scratch.
Modeling
Model Architecture: Transformer
Problem it solved: seq2seq (2014) with RNNs had two issues:
- Decoder used only the final hidden state — like answering questions from a book summary
- Sequential input processing is slow (100-token input = 100 sequential steps)
Transformer solution (Vaswani et al., 2017):
- Attention mechanism: computes how much attention to give each previous token when generating the current token
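- Query/Key/Value vectors (book analogy): Q = what the current token is looking for; K = a page number identifying a previous token; V = that page’s contents
- Softmax over the scaled dot products of Q with each K → attention weights; output = weighted sum of the V vectors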
- Multi-headed attention: multiple heads attend to different groups of tokens simultaneously
- Processes input tokens in parallel (unlike RNNs)
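A minimal sketch of single-head attention for one query, in plain Python. The function and variable names are illustrative, and the division by the square root of the vector dimension is the scaling used in Vaswani et al. (2017); real implementations operate on batched matrices instead of lists.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (attention weights, output vector) for one query."""
    d = len(query)
    # Scaled dot product between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Three previous tokens, each with a 2-dimensional key and value (toy numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]

weights, output = attend(query, keys, values)
print(weights, output)
```

The third key aligns best with the query, so the third token receives the largest attention weight.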
Inference steps:
- Prefill: process all input tokens in parallel → compute KV cache for all input tokens
- Decode: generate one output token at a time (still sequential)
Context length limitation: each token requires key and value vectors → longer context = more KV memory. This is why extending context length is hard.
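A rough sketch of the KV memory arithmetic. It assumes every block caches one key and one value vector of full hidden size per token, stored in FP16, with no grouped-query attention or cache compression; the function name and the example configuration (32 blocks, hidden dimension 4096, 4K context) are illustrative.

```python
def kv_cache_bytes(n_blocks, hidden_dim, context_len, bytes_per_value=2, batch_size=1):
    # Factor of 2: one key vector plus one value vector per token per block.
    return 2 * n_blocks * hidden_dim * context_len * bytes_per_value * batch_size

cache_4k = kv_cache_bytes(n_blocks=32, hidden_dim=4096, context_len=4096)
cache_8k = kv_cache_bytes(n_blocks=32, hidden_dim=4096, context_len=8192)

print(cache_4k / 2**30, "GiB at 4K context")
print(cache_8k / 2**30, "GiB at 8K context")
```

Under these assumptions the cache grows linearly with context length: doubling the context doubles the memory per sequence.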
Transformer block components:
- Attention module: 4 weight matrices (Q, K, V, output projection)
- MLP module: linear layers + nonlinear activation (ReLU, GELU)
- Embedding module (before blocks): converts tokens and positions to vectors
- Output layer (after blocks): maps output vectors to token probabilities
Model size determined by: hidden dimension, number of blocks, feedforward dimension, vocabulary size.
| Model | Transformer blocks | Hidden dim | Vocab size | Context length |
|---|---|---|---|---|
| Llama 2-7B | 32 | 4096 | 32K | 4K |
| Llama 3-70B | 80 | 8192 | 128K | 128K |
| Llama 3-405B | 126 | 16384 | 128K | 128K |
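To see how these dimensions translate into a parameter count, here is a back-of-the-envelope estimator. It ignores biases and normalization weights. The feedforward dimension (11008), the three-matrix gated MLP, and the untied output layer used in the example are assumptions that are not listed in the table above.

```python
def estimate_params(hidden_dim, n_blocks, ff_dim, vocab_size,
                    mlp_matrices=2, tied_embeddings=False):
    # Attention module: Q, K, V, and output projection matrices.
    attention = 4 * hidden_dim * hidden_dim
    # MLP module: each matrix maps between hidden_dim and ff_dim.
    mlp = mlp_matrices * hidden_dim * ff_dim
    # Embedding module, plus the output layer unless weights are shared.
    embedding = vocab_size * hidden_dim
    output_layer = 0 if tied_embeddings else vocab_size * hidden_dim
    return n_blocks * (attention + mlp) + embedding + output_layer

# Llama 2-7B-like configuration (ff_dim and mlp_matrices are assumed values).
approx_7b = estimate_params(hidden_dim=4096, n_blocks=32, ff_dim=11008,
                            vocab_size=32000, mlp_matrices=3)
print(f"{approx_7b / 1e9:.2f}B parameters")
```

With these assumed values the estimate lands close to 7B, which suggests the blocks, not the embeddings, account for most of the parameters.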
Alternative Architectures
- RWKV (Peng et al., 2023): RNN-based but parallelizable for training; no context length limit in theory, though good performance on long contexts isn’t guaranteed in practice
- SSMs (state space models): better at long-range memory
- Mamba (Gu & Dao, 2023): linear-time scaling with sequence length vs. quadratic for transformers; Mamba-3B outperforms same-size transformers
- Jamba (Lieber et al., 2024): hybrid Transformer+Mamba; 52B total / 12B active params; fits in 80GB GPU; supports 256K context
Model Size
Three signals of a model’s scale:
- Number of parameters → learning capacity
- Number of training tokens → how much the model has learned
- Number of FLOPs → training cost
Memory math: 7B params × 2 bytes (FP16) = 14GB minimum for inference.
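The same arithmetic as a small helper, extended to other precisions. The function name and the byte-width table are illustrative; the result covers weights only and excludes the KV cache and activations.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(n_params, dtype="fp16"):
    # Memory to hold the weights alone, in decimal gigabytes.
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8"):
    print(dtype, weight_memory_gb(7_000_000_000, dtype), "GB")
```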
Sparse models / Mixture-of-Experts (MoE):
- MoE: model divided into expert groups; only a subset active per token
- Mixtral 8×7B: 46.7B total params, but only 12.9B active per token → cost/speed of a 12.9B model
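A toy sketch of top-k expert routing, to show why only a subset of parameters is used per token. The experts here are stand-in scalar functions, and the gating scheme (softmax over the selected experts' router scores) is one common choice assumed for illustration, not a description of any specific model's code.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    top_ids = sorted(range(len(router_logits)),
                     key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top_ids)
    exps = {i: math.exp(router_logits[i] - m) for i in top_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    # Only the selected experts are evaluated; the rest are skipped entirely.
    gates = route_top_k(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())

# Eight hypothetical experts; expert i multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: x * scale for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

gates = route_top_k(router_logits, k=2)
out = moe_forward(1.0, experts, router_logits, k=2)
print(gates, out)
```

Only two of the eight experts run for this token, which is why compute tracks the active parameter count instead of the total.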
LLM training dataset sizes:
- Llama 1: 1.4T tokens
- Llama 2: 2T tokens
- Llama 3: 15T tokens
- RedPajama-v2: 30T tokens (= ~450M books)
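Training compute (FLOPs): GPT-3-175B needed ~3.14×10²³ FLOPs, which works out to an estimated training cost of roughly $4.1M (the exact figure depends on hardware price and utilization).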
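A sketch of how a FLOP count can be turned into a dollar estimate. Every hardware number below is a hypothetical assumption chosen for illustration (an accelerator with a peak of 6×10¹³ FLOP/s, 70% utilization, $2 per chip-hour); none of them come from these notes.

```python
def training_cost_usd(total_flops, peak_flops_per_sec, utilization, price_per_chip_hour):
    # Chip-seconds needed at the achieved (not peak) throughput.
    chip_seconds = total_flops / (peak_flops_per_sec * utilization)
    chip_hours = chip_seconds / 3600
    return chip_hours * price_per_chip_hour

cost = training_cost_usd(
    total_flops=3.14e23,        # GPT-3-175B training compute
    peak_flops_per_sec=6.0e13,  # assumed peak throughput per chip
    utilization=0.7,            # assumed fraction of peak actually achieved
    price_per_chip_hour=2.0,    # assumed rental price in USD
)
print(f"${cost / 1e6:.2f}M")
```

Under these assumptions the total lands a little above $4M. The number of chips changes the wall-clock time but not the chip-hours, so it does not appear in the formula.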
Chinchilla Scaling Law (DeepMind, 2022)
Given a fixed compute budget, the optimal ratio is:
- Training tokens ≈ 20× model parameters
- Model size and training tokens should scale equally (double model size → double tokens)
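The 20× rule as a quick calculation. The FLOPs estimate uses the common rule of thumb of about 6 FLOPs per parameter per training token, which is an assumption added here and not stated in these notes.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Compute-optimal token count under the ~20 tokens per parameter rule.
    return tokens_per_param * n_params

def approx_training_flops(n_params, n_tokens, flops_per_param_token=6):
    # Rule-of-thumb estimate (assumed): ~6 FLOPs per parameter per token.
    return flops_per_param_token * n_params * n_tokens

params_7b = 7_000_000_000
tokens_7b = chinchilla_optimal_tokens(params_7b)
tokens_14b = chinchilla_optimal_tokens(2 * params_7b)

print(f"{tokens_7b / 1e9:.0f}B tokens for a 7B model")
print(f"{approx_training_flops(params_7b, tokens_7b):.2e} FLOPs")
```

Doubling the model size doubles the token count, so under this rule of thumb the compute budget roughly quadruples.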
But compute-optimal isn’t the only consideration: Meta chose smaller Llama models than the scaling law suggests, for wider adoption (easier to deploy, cheaper inference).
Inverse scaling: Larger models sometimes perform worse on tasks requiring memorization or strong priors.
Scaling bottlenecks:
- Data: training datasets are growing faster than new internet data is being created; ~45% of C4 is now restricted; AI-generated data is contaminating future training data
- Electricity: data centers already 1–2% of global electricity; projected 4–20% by 2030
Post-Training
Pre-trained models optimize for text completion, not conversation; may produce harmful or inappropriate content.
Post-training steps:
- Supervised Finetuning (SFT): finetune on (prompt, response) demonstration data to teach conversational behavior
- Preference Finetuning: further finetune to align with human preferences using RL
Pre-training takes ~98% of compute; post-training takes ~2%. Post-training unlocks capabilities the pre-trained model already has.
Mnemonic: Pre-training = learning to read; post-training = learning how to use that knowledge.
Shoggoth analogy: 1) Pre-training = rogue monster (internet data) → 2) SFT = socially acceptable → 3) Preference finetuning = customer-appropriate smiley face.
Supervised Finetuning (SFT)
Demonstration data format: (prompt, response) pairs — also called behavior cloning.
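Good labelers are essential. InstructGPT labelers: ~90% have at least a college degree, >1/3 have master’s degrees. Demonstration data is expensive: at ~$10 per (prompt, response) pair, the ~13K pairs used for InstructGPT would cost ~$130K.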
Alternatives: volunteer-generated (LAION — skewed demographics), heuristic-filtered internet data (DeepMind Gopher), AI-generated synthetic data.
Preference Finetuning (RLHF)
Goal: Teach model which responses are preferred, not just correct.
Two-step process:
- Step 1, train a reward model (RM): given (prompt, response), output a score
- Uses comparison data in the format (prompt, winning_response, losing_response)
- Pairwise comparison is more consistent than absolute scoring, though labelers still disagree (inter-labeler agreement ~73%)
- Each comparison costs ~$3.50 to label, versus ~$25 to write a response from scratch
- Loss function: maximize difference in RM scores for winning vs. losing responses
- Step 2, finetune with PPO: optimize the SFT model to generate responses that maximize RM scores
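One common way to write the reward model objective is loss = -log σ(s_w - s_l), where s_w and s_l are the RM scores for the winning and losing responses and σ is the sigmoid. Minimizing it pushes the two scores apart in the right direction. The sketch below evaluates that loss on made-up scores; the function name and numbers are illustrative.

```python
import math

def pairwise_reward_loss(score_winning, score_losing):
    """-log(sigmoid(score_winning - score_losing)), written as a stable softplus."""
    diff = score_winning - score_losing
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# (winning score, losing score) pairs from a hypothetical reward model.
comparisons = [(2.0, 0.5), (0.1, 0.3), (1.5, 1.5)]
mean_loss = sum(pairwise_reward_loss(w, l) for w, l in comparisons) / len(comparisons)

print(pairwise_reward_loss(3.0, 0.0))  # small: ranking is correct and confident
print(pairwise_reward_loss(0.0, 3.0))  # large: ranking is wrong
print(mean_loss)
```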
Alternatives to RLHF:
- DPO (Direct Preference Optimization): simpler; Meta switched from RLHF (Llama 2) to DPO (Llama 3)
- RLAIF (RL from AI Feedback): uses AI instead of humans as judges
Best of N strategy: generate N outputs, pick the one scored highest by RM — used by Stitch Fix and Grab; skips the RL step entirely.
Sampling
Fundamentals
A language model generates each token by:
- Computing logit vector (one logit per vocabulary token)
- Applying softmax to convert logits to probabilities
- Sampling from the probability distribution
Greedy sampling = always pick the highest-probability token → boring, repetitive outputs.
Logprobs = log probabilities; preferred over raw probs to avoid underflow issues with large vocabularies.
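A small sketch of the steps above (logits → softmax → sample), plus an illustration of why logprobs are safer than raw probabilities when scoring a whole sequence. The helper names and the toy numbers are illustrative.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng):
    # Draw one token id according to the softmax distribution.
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_token(logits):
    # Always pick the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [1.0, 3.0, 2.0]
rng = random.Random(0)
sampled = [sample_token(logits, rng) for _ in range(10)]
greedy = greedy_token(logits)

# Multiplying many small probabilities underflows to 0.0 in floating point;
# summing their logs does not.
token_probs = [0.01] * 500
sequence_prob = math.prod(token_probs)
sequence_logprob = sum(math.log(p) for p in token_probs)
print(sampled, greedy, sequence_prob, sequence_logprob)
```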
Sampling Strategies
Temperature:
- Logits divided by temperature T before softmax
- Low T (< 1): concentrates probability mass on top token → more deterministic, predictable
- High T (> 1): redistributes probability toward lower-probability tokens → more creative, less coherent
- T = 0: mathematically undefined (division by zero); in practice, implementations treat it as always picking the argmax (greedy)
- Recommended: T = 0.7 for creative tasks
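The effect of temperature on a two-token distribution can be sketched as follows. Treating T = 0 as a one-hot argmax is a convention assumed here to avoid dividing by zero; the function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probs_with_temperature(logits, temperature):
    if temperature == 0:
        # Convention: all probability mass goes to the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    # Divide logits by T, then apply softmax.
    return softmax([x / temperature for x in logits])

logits = [1.0, 2.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 2) for p in probs_with_temperature(logits, t)])
```

With logits [1, 2], the top token's probability is about 0.73 at T = 1, rises to about 0.88 at T = 0.5, and falls to about 0.62 at T = 2.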
Top-k:
- Only sample from the top-k logits (common values: 50–500)
- Reduces computation; smaller k = more predictable
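A minimal top-k sketch: keep the k largest logits and run softmax over those only, which is where the computational saving comes from. The function name and toy logits are illustrative.

```python
import math

def top_k_probs(logits, k):
    """Return {token_id: probability} over the k highest-logit tokens."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top_ids)
    # Softmax is computed over k values instead of the whole vocabulary.
    exps = {i: math.exp(logits[i] - m) for i in top_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = top_k_probs(logits, k=2)
print(probs)
```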
Top-p (nucleus sampling):
- Sum probabilities of tokens in descending order until sum ≥ p (typical: 0.9–0.95)
- Dynamic k: adapts to each context (simple prompts → few tokens; open-ended → many tokens)
- Works well in practice, even though its theoretical benefits are less obvious than top-k’s (it doesn’t necessarily reduce the softmax computation)
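A sketch of the nucleus selection rule, showing how the number of kept tokens adapts to the shape of the distribution. The function name and the two toy distributions are illustrative.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = [0.95, 0.03, 0.01, 0.01]        # e.g., a yes/no style question
open_ended = [0.35, 0.30, 0.20, 0.10, 0.05]  # e.g., a creative prompt

nucleus_confident = top_p_filter(confident, p=0.9)
nucleus_open = top_p_filter(open_ended, p=0.9)
print(len(nucleus_confident), len(nucleus_open))
```

With the same p, the peaked distribution keeps a single token while the flatter one keeps four.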
Stopping conditions: fixed token count (but cuts off mid-sentence), stop tokens (e.g., EOS). Early stopping can cause malformatted structured outputs.
Test Time Compute
Generate multiple outputs to increase chance of a good response.
Best of N: sample N responses, pick the best one (by logprob, reward model score, or heuristic).
Beam search: generate fixed number of most promising candidates at each step.
Selection methods:
- Highest average logprob (OpenAI API default)
- Reward model scoring
- Most common output (majority voting for exact answers — used by Google for Gemini MMLU evaluation with 32 samples)
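Two of the selection methods above, sketched with made-up candidates. The example also shows why the average logprob is used instead of the total: summing logprobs penalizes longer outputs. Names and numbers are illustrative.

```python
from collections import Counter

def total_logprob(token_logprobs):
    return sum(token_logprobs)

def average_logprob(token_logprobs):
    return sum(token_logprobs) / len(token_logprobs)

def pick_best(candidates, score):
    # candidates: list of (text, per-token logprobs) pairs.
    return max(candidates, key=lambda c: score(c[1]))[0]

def majority_vote(answers):
    # Most common exact answer among the sampled outputs.
    return Counter(answers).most_common(1)[0][0]

candidates = [
    ("Paris.", [-0.4, -0.4]),
    ("The capital is Paris.", [-0.3, -0.3, -0.3, -0.3]),
]
by_total = pick_best(candidates, total_logprob)      # favors the shorter output
by_average = pick_best(candidates, average_logprob)  # length-normalized
voted = majority_vote(["42", "41", "42", "42", "40"])
print(by_total, "|", by_average, "|", voted)
```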
Key findings:
- OpenAI (2021): adding a verifier to score outputs gave about the same performance gain as a 30× increase in model size; a 100M-parameter model with a verifier ≈ a 3B-parameter model without one
- DeepMind: scaling test time compute can be more efficient than scaling model parameters
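- More samples don’t help indefinitely: OpenAI saw gains only up to ~400 samples, after which performance dropped; a Stanford study found log-linear improvement in problems solved as samples grew from 1 to 10K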
Structured Outputs
Needed when: (1) task requires structured output (SQL, JSON, regex); (2) output feeds a downstream application.
Approaches (simple → intensive):
- Prompting: instruct the model to follow the format; unreliable (may produce invalid output 5–30% of the time)
- Post-processing: fix common mistakes with scripts; LinkedIn defensive YAML parser: 90% → 99.99% valid
- Test time compute: retry until valid output
- Constrained sampling: filter logit vector to only include tokens valid at each grammar step; requires format-specific grammar; adds latency
- Finetuning: most reliable and general; optionally add classifier head for guaranteed classification output
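Two of these approaches are easy to sketch: retrying until the output parses, and masking logits so only grammar-valid tokens can be sampled. The model call is replaced by a hypothetical stub (`fake_generate`), and the set of allowed token ids is supplied by hand; a real constrained sampler would derive it from a grammar at each step.

```python
import json

def generate_valid_json(generate, prompt, max_attempts=3):
    """Test time compute approach: retry until the output parses as JSON."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue
    raise ValueError(f"no valid JSON after {max_attempts} attempts")

def constrain_logits(logits, allowed_token_ids):
    """Constrained sampling: mask out tokens that are invalid at this step."""
    return [x if i in allowed_token_ids else float("-inf")
            for i, x in enumerate(logits)]

# Hypothetical model stub: the first output is truncated, the second is valid.
_outputs = iter(['{"name": "Ada", ', '{"name": "Ada", "age": 36}'])

def fake_generate(prompt):
    return next(_outputs)

result = generate_valid_json(fake_generate, "Return a JSON object with name and age.")
masked = constrain_logits([2.0, 0.5, 1.0], allowed_token_ids={0, 2})
print(result, masked)
```

Masked tokens get a logit of negative infinity, so softmax assigns them zero probability.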
The Probabilistic Nature of AI
AI is probabilistic — same input can yield different outputs.
Inconsistency:
- Same input, different outputs
- Slightly different input, drastically different outputs
- Mitigations: cache responses, fix temperature/top-p/top-k/seed, but hardware differences can still cause variation
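A minimal sketch of the caching mitigation: key the cache on the prompt plus the sampling settings, so a repeated request returns the stored answer instead of a fresh sample. The class, the stub model, and the parameter names are hypothetical.

```python
import hashlib
import json

class ResponseCache:
    def __init__(self):
        self._store = {}

    def _key(self, prompt, params):
        # Stable key: the same prompt and settings always hash to the same value.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, params, generate):
        key = self._key(prompt, params)
        if key not in self._store:
            self._store[key] = generate(prompt, params)
        return self._store[key]

# Hypothetical model stub that returns a different answer on every call.
calls = []

def fake_generate(prompt, params):
    calls.append(prompt)
    return f"answer #{len(calls)}"

cache = ResponseCache()
params = {"temperature": 0.0, "top_p": 1.0, "seed": 42}
first = cache.get_or_generate("What is 2 + 2?", params, fake_generate)
second = cache.get_or_generate("What is 2 + 2?", params, fake_generate)
print(first, second, len(calls))
```

The second request never reaches the model, so the user sees the same answer both times.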
Hallucination:
Two hypotheses:
- Self-delusion (Ortega et al., DeepMind, 2021): model can’t differentiate user-provided tokens from self-generated tokens → treats generated “facts” as real → snowballing hallucinations (Zhang et al., 2023)
- Mitigation: RL-based differentiation between observations and actions; factual/counterfactual training signals
- Mismatched internal knowledge (Leo Gao, OpenAI): SFT labelers teach model to respond using knowledge the model doesn’t have → model learns to hallucinate
- Mitigation: retrieval-based verification; better reward functions that punish fabrication
Hallucination detection discussed in Chapter 4.
Note: RLHF showed mixed results — Schulman says it helps, but InstructGPT paper shows RLHF worsens hallucination (though improves other aspects overall).
Prompt mitigation: “Answer truthfully; say ‘I don’t know’ if unsure.” Concise responses also help.
Key Takeaways
- Training data is the most fundamental factor in a model’s capabilities and biases; Common Crawl is the backbone despite its quality issues
- The transformer’s dominance comes from parallel input processing and flexible attention, but long context is inherently costly (KV cache memory grows linearly with context length; attention compute grows quadratically)
- Model scale has three dimensions: parameters, training tokens, FLOPs; Chinchilla law → 20 tokens per parameter for compute-optimal training
- Post-training (SFT + RLHF/DPO) is essential to make pre-trained models useful; it costs only ~2% of total compute
- Sampling strategies (temperature, top-k, top-p) are underrated levers to tune model behavior; test time compute can rival scaling parameters (in one OpenAI experiment, a verifier matched the gain of a 30× larger model)
- Hallucination and inconsistency are fundamental consequences of probabilistic sampling — not bugs to be patched away but properties to be managed