Chapter 2: Understanding Foundation Models

Training Data

Foundation models learn from their training data: a capability or language that is missing from the data is one the model is unlikely to handle.

Key data sources:

  • Common Crawl: nonprofit crawls 2–3 billion web pages/month; used in GPT-3, Gemini. Contains misinformation, clickbait, racism, conspiracy theories.
  • C4 (Colossal Clean Crawled Corpus): Google’s cleaner subset of Common Crawl.
  • OpenAI heuristic: used only Reddit links with ≥3 upvotes for GPT-2.

Data quality > quantity: 7B tokens of high-quality coding data → 1.3B-parameter model that outperforms much larger models on coding benchmarks (Gunasekar et al., 2023).

Three golden goals for training data: Quantity, Quality, Diversity.

Multilingual Models

English dominates Common Crawl (45.9% of data). Under-represented languages suffer:

  • GPT-4 on MMLU: much better in English than Telugu, Marathi, Punjabi
  • GPT-4 fails all six math problems in Burmese and Amharic
  • Burmese requires ~10× more tokens than English to express the same content → 10× slower and 10× more expensive
  • Translation workaround loses information (e.g., Vietnamese social pronouns → I/you)

Domain-Specific Models

General-purpose models struggle on tasks never seen in training (protein structures, CT scans, etc.).

  • AlphaFold: protein structure prediction (100K sequences/structures)
  • BioNeMo (NVIDIA): biomolecular drug discovery
  • Med-PaLM2 (Google): medical queries
  • Domain-specific models can be fine-tuned on top of general-purpose models instead of trained from scratch.

Modeling

Model Architecture: Transformer

Problem it solved: seq2seq (2014) with RNNs had two issues:

  1. Decoder used only the final hidden state — like answering questions from a book summary
  2. Sequential input processing is slow (100-token input = 100 sequential steps)

Transformer solution (Vaswani et al., 2017):

  • Attention mechanism: computes how much attention to give each previous token when generating the current token
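  • Key/Value/Query vectors (book analogy): Q = what the current token is looking for; K = page numbers used to index earlier tokens; V = page contents
  • Softmax over the scaled dot products of Q with each K → attention weights; weighted sum of V = output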
  • Multi-headed attention: multiple heads attend to different groups of tokens simultaneously
  • Processes input tokens in parallel (unlike RNNs)
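The attention computation described above can be sketched for a single query and a single head. This is a minimal illustration in plain Python, assuming the scaled dot-product form from Vaswani et al. (2017); the toy vectors are made-up numbers, and real models use learned projection matrices and batched tensor operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-head attention for one query vector.

    query: list of floats (length d)
    keys, values: one vector per previous token
    Returns (attention_weights, output_vector).
    """
    d = len(query)
    # Dot product of the query with every key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted sum of the value vectors
    dim_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return weights, output

# Toy example: 3 previous tokens with 2-dimensional vectors (made-up numbers)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 0.0], keys, values)
```

The first and third keys have the same dot product with the query, so they receive equal weight; the second key is orthogonal to the query and receives less.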

Inference steps:

  1. Prefill: process all input tokens in parallel → compute KV cache for all input tokens
  2. Decode: generate one output token at a time (still sequential)

Context length limitation: each token requires key and value vectors → longer context = more KV memory. This is why extending context length is hard.
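To make the KV memory cost concrete, here is a back-of-the-envelope sketch. It assumes standard multi-head attention (one key and one value vector of size `hidden_dim` per token per block, no grouped-query attention), FP16 values, and a batch size of 1; the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, hidden_dim, context_len, bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both a key and a value
    vector for every token in every layer.
    """
    return 2 * num_layers * hidden_dim * context_len * bytes_per_value * batch_size

# Llama 2-7B-like shape: 32 blocks, hidden dim 4096, 4K (4096-token) context, FP16
size = kv_cache_bytes(num_layers=32, hidden_dim=4096, context_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB
```

Under these assumptions the cache grows linearly with context length: doubling the context doubles the memory.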

Transformer block components:

  • Attention module: 4 weight matrices (Q, K, V, output projection)
  • MLP module: linear layers + nonlinear activation (ReLU, GELU)
  • Embedding module (before blocks): converts tokens and positions to vectors
  • Output layer (after blocks): maps output vectors to token probabilities

Model size determined by: hidden dimension, number of blocks, feedforward dimension, vocabulary size.

Model          Blocks   Dim     Vocab   Context
Llama 2-7B     32       4096    32K     4K
Llama 3-70B    80       8192    128K    128K
Llama 3-405B   126      16384   128K    128K
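A rough parameter count can be derived from these dimensions. The sketch below is an estimate, not an official formula: it ignores normalization layers and biases, and it assumes a gated MLP with three weight matrices and an output layer that is not tied to the embedding. The feedforward dimension (11008) and exact vocabulary size (32,000) are taken from the published Llama 2-7B configuration, not from the table above.

```python
def estimate_params(num_blocks, hidden_dim, ffn_dim, vocab_size, mlp_matrices=3):
    """Rough transformer parameter count, ignoring norms and biases."""
    attention = 4 * hidden_dim * hidden_dim      # Q, K, V, output projection
    mlp = mlp_matrices * hidden_dim * ffn_dim    # assumption: gated MLP with 3 matrices
    embedding = vocab_size * hidden_dim          # token embedding table
    output_layer = vocab_size * hidden_dim       # assumption: untied output projection
    return num_blocks * (attention + mlp) + embedding + output_layer

total = estimate_params(num_blocks=32, hidden_dim=4096, ffn_dim=11008, vocab_size=32000)
print(f"{total / 1e9:.2f}B parameters")  # 6.74B parameters
```

The estimate lands close to the "7B" in the model name, with most parameters sitting in the MLP and attention matrices of the 32 blocks.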

Alternative Architectures

  • RWKV (Peng et al., 2023): RNN-based but parallelizable; no context length limit in theory, though good performance on long contexts is not guaranteed in practice
  • SSMs (state space models): better at long-range memory
    • Mamba (Gu & Dao, 2023): linear-time scaling with sequence length vs. quadratic for transformers; Mamba-3B outperforms same-size transformers
    • Jamba (Lieber et al., 2024): hybrid Transformer+Mamba; 52B total / 12B active params; fits in 80GB GPU; supports 256K context

Model Size

Three signals of a model’s scale:

  1. Number of parameters → learning capacity
  2. Number of training tokens → how much learned
  3. Number of FLOPs → training cost

Memory math: 7B params × 2 bytes (FP16) = 14 GB just to hold the weights; the KV cache and activations need additional memory on top of this minimum.
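The same arithmetic as a small helper, useful for comparing precisions. The byte sizes per data type are standard; the helper name is hypothetical, and 1 GB is taken as 10⁹ bytes.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="fp16"):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

weight_memory_gb(7e9, "fp16")  # 14.0
weight_memory_gb(7e9, "int8")  # 7.0
```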

Sparse models / Mixture-of-Experts (MoE):

  • MoE: model divided into expert groups; only a subset active per token
  • Mixtral 8×7B: 46.7B total params, but only 12.9B active per token → cost/speed of a 12.9B model

LLM training dataset sizes:

  • Llama 1: 1.4T tokens
  • Llama 2: 2T tokens
  • Llama 3: 15T tokens
  • RedPajama-v2: 30T tokens (= ~450M books)

Training compute (FLOPs): GPT-3-175B needed 3.14×10²³ FLOPs; at roughly $2/hour per H100 and 70% utilization, that works out to a training cost of about $4.1M.
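The FLOP count can be turned into a rough time and cost estimate. In the sketch below, only the 3.14×10²³ figure comes from the text; the GPU count, per-GPU throughput, hourly price, and utilization are illustrative assumptions chosen to show the arithmetic.

```python
def training_days(total_flops, num_gpus, flops_per_gpu_per_day):
    """Days of training if every GPU ran at peak throughput."""
    return total_flops / (num_gpus * flops_per_gpu_per_day)

def training_cost_usd(days_at_peak, num_gpus, usd_per_gpu_hour, utilization):
    """Rental cost when only a fraction of peak throughput is achieved."""
    return usd_per_gpu_hour * num_gpus * 24 * days_at_peak / utilization

GPT3_FLOPS = 3.14e23             # from the text
NUM_GPUS = 256                   # assumption
FLOPS_PER_GPU_PER_DAY = 5.2e18   # assumption: roughly 60 TFLOP/s sustained for 24 hours
USD_PER_GPU_HOUR = 2.0           # assumption
UTILIZATION = 0.7                # assumption

days = training_days(GPT3_FLOPS, NUM_GPUS, FLOPS_PER_GPU_PER_DAY)
cost = training_cost_usd(days, NUM_GPUS, USD_PER_GPU_HOUR, UTILIZATION)
print(f"{days:.0f} days at peak, about ${cost / 1e6:.1f}M")
```

With these assumptions the estimate comes to roughly 236 days at peak throughput and a cost in the neighborhood of the 4.1M figure quoted above.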

Chinchilla Scaling Law (DeepMind, 2022)

Given a fixed compute budget, the optimal ratio is:

  • Training tokens ≈ 20× model parameters
  • Model size and training tokens should scale equally (double model size → double tokens)
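The 20× rule of thumb can be expressed in a few lines. The sketch below also solves for a compute-optimal model size given a FLOP budget, assuming the common approximation that training cost is about 6 FLOPs per parameter per token; that factor of 6 is an assumption brought in for illustration and is not stated in this section.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb from the text

def optimal_tokens(num_params):
    """Compute-optimal number of training tokens for a given model size."""
    return TOKENS_PER_PARAM * num_params

def compute_optimal_params(flops_budget, flops_per_param_token=6):
    """Solve C = 6 * N * D with D = 20 * N for the model size N.

    The factor 6 is a widely used approximation (assumption, not from the text).
    """
    return math.sqrt(flops_budget / (flops_per_param_token * TOKENS_PER_PARAM))

optimal_tokens(70e9)                      # 1.4e12 tokens for a 70B model
n_small = compute_optimal_params(1e23)
n_large = compute_optimal_params(4e23)    # 4x the compute gives 2x the parameters
```

Because tokens scale linearly with parameters, compute grows with the square of model size under this rule, so quadrupling the budget doubles both the model and the dataset.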

But scaling law ≠ only metric: Meta chose smaller Llama models than compute-optimal for wider adoption (easier to deploy, cheaper inference).

Inverse scaling: Larger models sometimes perform worse on tasks requiring memorization or strong priors.

Scaling bottlenecks:

  1. Data: rate of dataset size growth > rate of new internet data being created; ~45% of C4 is now restricted; AI-generated data contaminating future training data
  2. Electricity: data centers already 1–2% of global electricity; projected 4–20% by 2030

Post-Training

Pre-trained models optimize for text completion, not conversation; may produce harmful or inappropriate content.

Post-training steps:

  1. Supervised Finetuning (SFT): finetune on (prompt, response) demonstration data to teach conversational behavior
  2. Preference Finetuning: further finetune to align with human preferences using RL

Pre-training takes ~98% of compute; post-training takes ~2%. Post-training unlocks capabilities the pre-trained model already has.

Mnemonic: Pre-training = learning to read; post-training = learning how to use that knowledge.

Shoggoth analogy: 1) Pre-training = rogue monster (internet data) → 2) SFT = socially acceptable → 3) Preference finetuning = customer-appropriate smiley face.

Supervised Finetuning (SFT)

Demonstration data format: (prompt, response) pairs — also called behavior cloning.

Good labelers are essential. InstructGPT labelers: ~90% have ≥ college degree, >1/3 have master’s degrees. One pair costs ~$10, so the ~13K pairs used for InstructGPT would cost ~$130K.

Alternatives: volunteer-generated (LAION — skewed demographics), heuristic-filtered internet data (DeepMind Gopher), AI-generated synthetic data.

Preference Finetuning (RLHF)

Goal: Teach model which responses are preferred, not just correct.

Two-step process:

  1. Train reward model (RM): given (prompt, response), output a score
    • Uses comparison data format: (prompt, winning_response, losing_response)
    • Pairwise comparison is more consistent than absolute scoring (inter-labeler agreement ~73%)
    • Each comparison costs ~$3.50, versus ~$25 to write a response from scratch
    • Loss function: maximize difference in RM scores for winning vs. losing responses
  2. Finetune with PPO: optimize the SFT model to generate responses that maximize RM scores
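The reward model objective in step 1 can be sketched as a pairwise loss. This assumes the commonly used formulation −log σ(r_w − r_l), the form reported for InstructGPT, where r_w and r_l are the reward model's scores for the winning and losing responses; the function name is hypothetical.

```python
import math

def reward_model_loss(score_winning, score_losing):
    """Pairwise loss: -log(sigmoid(score_winning - score_losing)).

    The loss shrinks as the winning response is scored further above
    the losing one, so minimizing it maximizes the score difference.
    """
    diff = score_winning - score_losing
    # Two algebraically equal branches, chosen to avoid overflow in exp()
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

reward_model_loss(2.0, 0.0)   # small loss: ranking is correct
reward_model_loss(0.0, 0.0)   # log(2): model cannot tell the two apart
reward_model_loss(0.0, 2.0)   # large loss: ranking is wrong
```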

Alternatives to RLHF:

  • DPO (Direct Preference Optimization): simpler; Meta switched from RLHF (Llama 2) to DPO (Llama 3)
  • RLAIF (RL from AI Feedback): uses AI instead of humans as judges

Best of N strategy: generate N outputs, pick the one scored highest by RM — used by Stitch Fix and Grab; skips the RL step entirely.


Sampling

Fundamentals

A language model generates each token by:

  1. Computing logit vector (one logit per vocabulary token)
  2. Applying softmax to convert logits to probabilities
  3. Sampling from the probability distribution
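The three steps above can be sketched with a toy vocabulary. The logits and the three-token vocabulary are made up for illustration; a real model would produce one logit per entry in a vocabulary of tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, vocab, rng):
    """Steps 2 and 3: softmax, then sample from the distribution."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

def greedy_token(logits, vocab):
    """Greedy alternative: always return the highest-logit token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

vocab = ["red", "green", "blue"]   # hypothetical 3-token vocabulary
logits = [2.0, 1.0, 0.1]           # step 1: made-up logit vector
probs = softmax(logits)
rng = random.Random(0)             # seeded so the run is repeatable
token = sample_token(logits, vocab, rng)
```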

Greedy sampling = always pick the highest-probability token → boring, repetitive outputs.

Logprobs = log probabilities; preferred over raw probs to avoid underflow issues with large vocabularies.
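A small demonstration of the underflow problem, using made-up numbers. It multiplies many small probabilities, one way tiny values arise in practice, and compares the result with summing their logs.

```python
import math

token_probs = [0.01] * 200   # made-up: 200 probabilities of 0.01 each

# Multiplying raw probabilities: the true value 1e-400 is below the
# smallest positive float64, so the product collapses to 0.0
product = 1.0
for p in token_probs:
    product *= p

# Summing log probabilities stays comfortably within range
logprob_sum = sum(math.log(p) for p in token_probs)   # about -921.03
```

Once the product has collapsed to zero, every comparison between candidate sequences is lost, while the log-space sum still ranks them correctly.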

Sampling Strategies

Temperature:

  • Logits divided by temperature T before softmax
  • Low T (< 1): concentrates probability mass on top token → more deterministic, predictable
  • High T (> 1): redistributes probability toward lower-probability tokens → more creative, less coherent
  • T = 0: dividing by zero is undefined, so implementations treat it as always picking the argmax (greedy)
  • Recommended: T = 0.7 for creative tasks
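The effect of temperature can be seen on a two-token example. This is a minimal sketch with made-up logits; rejecting T ≤ 0 is a design choice of this example, since real APIs usually special-case T = 0 as greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle T = 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0]
p_base = softmax_with_temperature(logits, 1.0)   # about [0.27, 0.73]
p_cold = softmax_with_temperature(logits, 0.5)   # about [0.12, 0.88]
p_hot = softmax_with_temperature(logits, 2.0)    # about [0.38, 0.62]
```

Lowering T pushes more probability onto the top token; raising T flattens the distribution toward uniform.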

Top-k:

  • Only sample from the top-k logits (common values: 50–500)
  • Reduces computation; smaller k = more predictable
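Top-k sampling can be sketched as follows. The logits are made up, and the function returns a token index rather than a token string; note that softmax is computed over only the k surviving logits, which is where the computational saving comes from.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token index, restricted to the k largest logits."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax numerators over just those k logits; rng.choices normalizes them
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 3.0, 2.0, -1.0]                        # made-up logits for 4 tokens
index = top_k_sample(logits, k=2, rng=random.Random(0))
```

With k = 1 this reduces to greedy decoding, since only the argmax survives the filter.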

Top-p (nucleus sampling):

  • Sum probabilities of tokens in descending order until sum ≥ p (typical: 0.9–0.95)
  • Dynamic k: adapts to each context (simple prompts → few tokens; open-ended → many tokens)
  • Tends to work well in practice, though unlike top-k it does not necessarily reduce the softmax computation load
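A minimal sketch of the nucleus filter, using made-up probability distributions. It keeps the smallest set of top tokens whose cumulative probability reaches p and renormalizes over that set; the function name is hypothetical.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the nucleus of size p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Open-ended context: probability is spread out, so more tokens survive
spread = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)   # keeps tokens 0, 1, 2
# Confident context: one token dominates, so only it survives
peaked = top_p_filter([0.97, 0.02, 0.01], p=0.9)       # keeps token 0
```

The two calls use the same p but keep different numbers of tokens, which is the "dynamic k" behavior described above.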

Stopping conditions: fixed token count (but cuts off mid-sentence), stop tokens (e.g., EOS). Early stopping can cause malformatted structured outputs.

Test Time Compute

Generate multiple outputs to increase chance of a good response.

Best of N: sample N responses, pick the best one (by logprob, reward model score, or heuristic).

Beam search: generate fixed number of most promising candidates at each step.

Selection methods:

  • Highest average logprob per token (the criterion behind the OpenAI API's best_of option)
  • Reward model scoring
  • Most common output (majority voting for exact answers — used by Google for Gemini MMLU evaluation with 32 samples)
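Two of these selection methods can be sketched in a few lines. The candidate texts and logprobs are made up; in practice they would come from the model's sampled outputs.

```python
from collections import Counter

def pick_by_average_logprob(candidates):
    """candidates: list of (text, token_logprobs). Returns the best text."""
    return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]

def pick_by_total_logprob(candidates):
    """Same, but using the summed logprob (biased toward short outputs)."""
    return max(candidates, key=lambda c: sum(c[1]))[0]

def majority_vote(answers):
    """Return the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

candidates = [
    ("short", [-0.5, -0.5]),                # total -1.0, average -0.5
    ("longer", [-0.3, -0.3, -0.3, -0.3]),   # total -1.2, average -0.3
]
by_avg = pick_by_average_logprob(candidates)     # "longer"
by_total = pick_by_total_logprob(candidates)     # "short"
voted = majority_vote(["42", "41", "42", "40"])  # "42"
```

Averaging instead of summing matters because every extra token adds a negative term, so a raw total would favor short outputs regardless of quality.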

Key findings:

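  • OpenAI (2021): adding a verifier gave roughly the same performance boost as a 30× increase in model size; a 100M-parameter model with a verifier performed on par with a 3B-parameter model without one
  • DeepMind: scaling test time compute can be more efficient than scaling model parameters
  • Sample count: OpenAI saw gains level off and then decline beyond ~400 samples, while Stanford researchers observed log-linear improvement up to 10K samples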

Structured Outputs

Needed when: (1) task requires structured output (SQL, JSON, regex); (2) output feeds a downstream application.

Approaches (simple → intensive):

  1. Prompting: instruct model to follow format; unreliable (may produce invalid output 5–30% of time)
  2. Post-processing: fix common mistakes with scripts; LinkedIn defensive YAML parser: 90% → 99.99% valid
  3. Test time compute: retry until valid output
  4. Constrained sampling: filter logit vector to only include tokens valid at each grammar step; requires format-specific grammar; adds latency
  5. Finetuning: most reliable and general; optionally add classifier head for guaranteed classification output
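Approaches 3 and 4 can be sketched briefly. The `generate` callable is a hypothetical stand-in for a model call, and the set of allowed token IDs would in practice come from a grammar for the target format; both are assumptions made for illustration.

```python
import json
import math

def generate_until_valid(generate, max_attempts=3):
    """Approach 3: call generate() until its output parses as JSON."""
    for _ in range(max_attempts):
        text = generate()
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # invalid output, try again
    raise ValueError("no valid JSON after max_attempts")

def mask_logits(logits, allowed_token_ids):
    """Approach 4: give disallowed tokens a logit of -inf (probability 0)."""
    return [x if i in allowed_token_ids else -math.inf
            for i, x in enumerate(logits)]

# Hypothetical model that first emits truncated JSON, then valid JSON
outputs = iter(['{"name": "Ada"', '{"name": "Ada"}'])
result = generate_until_valid(lambda: next(outputs))

# Only tokens 0 and 2 are valid at this (made-up) grammar step
masked = mask_logits([1.0, 2.0, 3.0], allowed_token_ids={0, 2})
```

Retrying costs extra model calls only when output is invalid, while masking adds a small cost to every decoding step but never samples an invalid token.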

The Probabilistic Nature of AI

AI is probabilistic — same input can yield different outputs.

Inconsistency:

  • Same input, different outputs
  • Slightly different input, drastically different outputs
  • Mitigations: cache responses, fix temperature/top-p/top-k/seed, but hardware differences can still cause variation
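The caching mitigation can be sketched as follows. The `fake_model` function is a hypothetical stand-in for a real model call, and the parameter names in `params` are illustrative rather than tied to any particular API.

```python
import hashlib
import json

_cache = {}

def cache_key(prompt, params):
    """Stable key built from the prompt and all sampling parameters."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_generate(prompt, params, generate):
    """Return the stored response if this exact request was seen before."""
    key = cache_key(prompt, params)
    if key not in _cache:
        _cache[key] = generate(prompt, params)
    return _cache[key]

calls = []

def fake_model(prompt, params):
    """Hypothetical model whose output changes on every call."""
    calls.append(prompt)
    return f"response #{len(calls)}"

params = {"temperature": 0.0, "top_p": 1.0, "seed": 42}   # fixed sampling settings
first = cached_generate("What is RLHF?", params, fake_model)
second = cached_generate("What is RLHF?", params, fake_model)
```

The second request is served from the cache, so the user sees the same answer even though the underlying model would have produced a different one.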

Hallucination:

Two hypotheses:

  1. Self-delusion (Ortega et al., DeepMind, 2021): model can’t differentiate user-provided tokens from self-generated tokens → treats generated “facts” as real → snowballing hallucinations (Zhang et al., 2023)
    • Mitigation: RL-based differentiation between observations and actions; factual/counterfactual training signals
  2. Mismatched internal knowledge (Leo Gao, OpenAI): SFT labelers teach model to respond using knowledge the model doesn’t have → model learns to hallucinate
    • Mitigation: retrieval-based verification; better reward functions that punish fabrication

Hallucination detection discussed in Chapter 4.

Note: RLHF showed mixed results — Schulman says it helps, but InstructGPT paper shows RLHF worsens hallucination (though improves other aspects overall).

Prompt mitigation: “Answer truthfully; say ‘I don’t know’ if unsure.” Concise responses also help.


Key Takeaways

  • Training data is the most fundamental factor in a model’s capabilities and biases; Common Crawl is the backbone despite its quality issues
  • The transformer’s dominance comes from parallel input processing and flexible attention, but long context is inherently costly (KV cache memory grows linearly with context length, and attention compute grows quadratically)
  • Model scale has three dimensions: parameters, training tokens, FLOPs; Chinchilla law → 20 tokens per parameter for compute-optimal training
  • Post-training (SFT + RLHF/DPO) is essential to make pre-trained models useful; it costs only ~2% of total compute
  • Sampling strategies (temperature, top-k, top-p) are underrated levers to tune model behavior; in OpenAI's verifier experiments, test time compute matched the gain from a ~30× larger model
  • Hallucination and inconsistency are fundamental consequences of probabilistic sampling — not bugs to be patched away but properties to be managed