Chapter 2: Understanding Foundation Models

Training Data

Foundation models learn from their training data — if it’s not in the data, the model can’t do it.

Key data sources:

Common Crawl: nonprofit crawls 2–3 billion web pages/month; used in GPT-3, Gemini. Contains misinformation, clickbait, racism, conspiracy theories.
C4 (Colossal Clean Crawled Corpus): Google’s cleaner subset of Common Crawl.
OpenAI heuristic: used only Reddit links with ≥3 upvotes for GPT-2.

Data quality > quantity: 7B tokens of high-quality coding data → 1.3B-parameter model that outperforms much larger models on coding benchmarks (Gunasekar et al., 2023).

Three golden goals for training data: Quantity, Quality, Diversity.

Multilingual Models

English dominates Common Crawl (45.9% of data). Under-represented languages suffer:

GPT-4 on MMLU: much better in English than Telugu, Marathi, Punjabi
GPT-4 fails all six math problems in Burmese and Amharic
Burmese requires ~10× more tokens than English to express the same content → 10× slower and 10× more expensive
Translation workaround loses information (e.g., Vietnamese social pronouns → I/you)

Domain-Specific Models

General-purpose models struggle on tasks never seen in training (protein structures, CT scans, etc.).

AlphaFold: protein structure prediction (trained on the sequences and structures of roughly 100K known proteins)
BioNeMo (NVIDIA): biomolecular drug discovery
Med-PaLM2 (Google): medical queries
Domain-specific models can be fine-tuned on top of general-purpose models instead of trained from scratch.

Modeling

Model Architecture: Transformer

Problem it solved: seq2seq (2014) with RNNs had two issues:

Decoder used only the final hidden state — like answering questions from a book summary
Sequential input processing is slow (100-token input = 100 sequential steps)

Transformer solution (Vaswani et al., 2017):

Attention mechanism: computes how much attention to give each previous token when generating the current token
Key/Value/Query vectors (book-lookup analogy): Q = what the current token is looking for; K = page numbers; V = page contents
Dot product(Q, K), scaled and passed through softmax → attention weights; output = attention-weighted sum of V
Multi-headed attention: multiple heads attend to different groups of tokens simultaneously
Processes input tokens in parallel (unlike RNNs)
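The query/key/value computation above can be sketched in plain Python. This is a minimal single-head illustration, not a production implementation: the function name `attention` and the toy vectors are made up for this example, and the division by sqrt(d) follows the scaled dot-product form from Vaswani et al. (2017).

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    # Score each key against the query: dot(Q, K) / sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # attention weights, sum to 1
    # Output is the attention-weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return weights, output

# Toy example: three previous tokens, 2-dimensional vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 1.0], keys, values)
print(weights, output)
```

The third key lines up best with the query, so it receives the largest weight. Multi-headed attention would run several such computations with different learned projections and concatenate the results.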

Inference steps:

Prefill: process all input tokens in parallel → compute KV cache for all input tokens
Decode: generate one output token at a time (still sequential)

Context length limitation: each token requires key and value vectors → longer context = more KV memory. This is why extending context length is hard.
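A rough way to see the memory pressure is to count the cached key and value vectors. The sketch below assumes standard multi-head attention where each token stores one key and one value vector of the full hidden dimension per block, with FP16 values; models that use grouped-query or multi-query attention store less. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_blocks, hidden_dim, context_len, bytes_per_value=2, batch_size=1):
    # 2 = one key vector plus one value vector per token per block.
    return 2 * num_blocks * hidden_dim * context_len * bytes_per_value * batch_size

# Llama 2-7B style configuration: 32 blocks, hidden dim 4096, 4K (4096-token) context.
llama2_7b = kv_cache_bytes(num_blocks=32, hidden_dim=4096, context_len=4096)
print(f"{llama2_7b / 2**30:.1f} GiB")  # 2.0 GiB
```

Under these assumptions the cache grows linearly with context length and batch size, so doubling the context doubles the KV memory for every concurrent request.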

Transformer block components:

Attention module: 4 weight matrices (Q, K, V, output projection)
MLP module: linear layers + nonlinear activation (ReLU, GELU)
Embedding module (before blocks): converts tokens and positions to vectors
Output layer (after blocks): maps output vectors to token probabilities

Model size determined by: hidden dimension, number of blocks, feedforward dimension, vocabulary size.

Model	Transformer blocks	Hidden dim	Vocab size (tokens)	Context length (tokens)
Llama 2-7B	32	4096	32K	4K
Llama 3-70B	80	8192	128K	128K
Llama 3-405B	126	16384	128K	128K
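To see how these dimensions add up to a parameter count, here is a back-of-the-envelope estimate for the Llama 2-7B row. Several inputs are assumptions that do not appear in the table: a feedforward dimension of 11008, a vocabulary of exactly 32,000 tokens, a gated MLP with three weight matrices, and separate embedding and output matrices. Normalization weights are ignored, and the formula does not account for grouped-query attention, so it should not be applied unchanged to the other rows.

```python
def estimate_params(num_blocks, hidden_dim, ff_dim, vocab_size, mlp_matrices=3):
    attention = 4 * hidden_dim * hidden_dim      # Q, K, V, output projection
    mlp = mlp_matrices * hidden_dim * ff_dim     # feedforward weight matrices
    per_block = attention + mlp
    embedding = vocab_size * hidden_dim          # token embedding table
    output_layer = vocab_size * hidden_dim       # maps vectors to token logits
    return num_blocks * per_block + embedding + output_layer

total = estimate_params(num_blocks=32, hidden_dim=4096, ff_dim=11008, vocab_size=32000)
print(f"{total / 1e9:.2f}B parameters")  # 6.74B parameters
```

Most of the parameters sit in the repeated blocks, which is why the number of blocks and the hidden and feedforward dimensions dominate model size.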

Alternative Architectures

RWKV (Peng et al., 2023): RNN-based but parallelizable; in theory has no context length limit, though good performance on long contexts is not guaranteed in practice
SSMs (state space models): better at long-range memory
- Mamba (Gu & Dao, 2023): linear-time scaling with sequence length vs. quadratic for transformers; Mamba-3B outperforms same-size transformers
- Jamba (Lieber et al., 2024): hybrid Transformer+Mamba; 52B total / 12B active params; fits in 80GB GPU; supports 256K context

Model Size

Three signals of a model’s scale:

Number of parameters → learning capacity
Number of training tokens → how much learned
Number of FLOPs → training cost

Memory math: 7B params × 2 bytes (FP16) = 14GB minimum for inference.
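The same arithmetic for a few numeric formats, as a small sketch. It counts the weights only; activations and the KV cache come on top, which is why the figure is a minimum. The helper name and the dtype table are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="fp16"):
    # Memory for the weights only, in GB (1 GB = 1e9 bytes).
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(7e9, dtype), "GB")
```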

Sparse models / Mixture-of-Experts (MoE):

MoE: model divided into expert groups; only a subset active per token
Mixtral 8×7B: 46.7B total params, but only 12.9B active per token → cost/speed of a 12.9B model
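A toy sketch of how a sparse layer activates only some experts per token. It assumes top-2 routing over eight experts with a softmax over the selected router scores; the notes above do not spell out the routing rule, so treat that as an assumption. The "experts" here are stand-in functions, not real feedforward networks.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    # Only the selected experts run; the rest are skipped for this token.
    gates = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Eight toy "experts": expert i just multiplies its input by i.
experts = [lambda x, i=i: i * x for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
gates = route_top_k(router_logits)
print(gates)
print(moe_layer(1.0, experts, router_logits))
```

Because only the chosen experts execute, compute per token tracks the active parameter count, while memory must still hold all experts.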

LLM training dataset sizes:

Llama 1: 1.4T tokens
Llama 2: 2T tokens
Llama 3: 15T tokens
RedPajama-v2: 30T tokens (= ~450M books)

Training compute (FLOPs): GPT-3-175B needed 3.14×10²³ FLOPs; at $2 per H100-hour with 256 H100s at 70% utilization, that comes to roughly $4.1M.
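The cost estimate can be reproduced with a few lines of arithmetic. The sketch takes 3.14×10²³ FLOPs, 256 H100s, 70% utilization, and $2 per H100-hour as inputs. The peak throughput of 5.2×10¹⁸ FLOPs per H100 per day is an assumption not stated in these notes, and the function name is hypothetical.

```python
def training_cost(total_flops, num_gpus, flops_per_gpu_per_day, utilization, price_per_gpu_hour):
    # Effective throughput of the whole cluster, in FLOPs per day.
    cluster_flops_per_day = num_gpus * flops_per_gpu_per_day * utilization
    days = total_flops / cluster_flops_per_day
    cost = days * 24 * num_gpus * price_per_gpu_hour
    return days, cost

days, cost = training_cost(
    total_flops=3.14e23,
    num_gpus=256,
    flops_per_gpu_per_day=5.2e18,  # assumed H100 peak, not stated in these notes
    utilization=0.7,
    price_per_gpu_hour=2.0,
)
print(f"{days:.0f} days, ${cost / 1e6:.1f}M")
```

With these inputs the estimate lands at roughly 337 days and about $4.1M. Lower utilization stretches the schedule and raises the bill proportionally.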

Chinchilla Scaling Law (DeepMind, 2022)

Given a fixed compute budget, the optimal ratio is:

Training tokens ≈ 20× model parameters
Model size and training tokens should scale equally (double model size → double tokens)
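A small sketch of what the 20:1 rule implies for a given budget. It relies on the common approximation that training compute is about 6 × parameters × tokens, which is an assumption added here and not part of the notes above; the function name is hypothetical.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20, flops_per_param_token=6):
    # C ≈ 6 * N * D and D ≈ 20 * N  =>  C ≈ 120 * N^2
    params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

params, tokens = chinchilla_optimal(3.14e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.2f}T tokens")
```

Under this approximation, a budget of 3.14×10²³ FLOPs points to a model of roughly 51B parameters trained on about 1T tokens. Quadrupling the budget doubles both numbers, which is the "scale equally" rule in code form.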

But compute-optimal is not the only consideration: Meta chose Llama models smaller than the compute-optimal size for wider adoption (easier to deploy, cheaper inference).

Inverse scaling: Larger models sometimes perform worse on tasks requiring memorization or strong priors.

Scaling bottlenecks:

Data: rate of dataset size growth > rate of new internet data being created; ~45% of C4 is now restricted; AI-generated data contaminating future training data
Electricity: data centers already 1–2% of global electricity; projected 4–20% by 2030

Post-Training

Pre-trained models optimize for text completion, not conversation; may produce harmful or inappropriate content.

Post-training steps:

Supervised Finetuning (SFT): finetune on (prompt, response) demonstration data to teach conversational behavior
Preference Finetuning: further finetune to align with human preferences using RL

Pre-training takes ~98% of compute; post-training takes ~2%. Post-training unlocks capabilities the pre-trained model already has.

Mnemonic: Pre-training = learning to read; post-training = learning how to use that knowledge.

Shoggoth analogy: 1) Pre-training = rogue monster (internet data) → 2) SFT = socially acceptable → 3) Preference finetuning = customer-appropriate smiley face.

Supervised Finetuning (SFT)

Demonstration data format: (prompt, response) pairs — also called behavior cloning.

Good labelers are essential. InstructGPT labelers: ~90% have at least a college degree, and more than 1/3 have a master’s degree. One pair costs ~$10 and takes up to 30 min; the ~13K pairs used for InstructGPT come to ~$130K.

Alternatives: volunteer-generated (LAION — skewed demographics), heuristic-filtered internet data (DeepMind Gopher), AI-generated synthetic data.

Preference Finetuning (RLHF)

Goal: Teach model which responses are preferred, not just correct.

Two-step process:

Train reward model (RM): given (prompt, response), output a score
- Uses comparison data format: (prompt, winning_response, losing_response)
- Pairwise comparison is more consistent than absolute scoring (inter-labeler agreement ~73%)
- Each comparison costs ~$3.50; each demonstration response costs $25
- Loss function: maximize difference in RM scores for winning vs. losing responses
Finetune with PPO: optimize the SFT model to generate responses that maximize RM scores
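The reward model objective from step 1 can be written as a pairwise loss. The sketch below uses the negative log-sigmoid of the score difference, the form used in the InstructGPT paper; minimizing it pushes the winning score above the losing one. The function name is illustrative, and the scores would come from the reward model in practice.

```python
import math

def reward_model_loss(score_winning, score_losing):
    """Pairwise loss: -log(sigmoid(score_winning - score_losing))."""
    diff = score_winning - score_losing
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

print(reward_model_loss(2.0, 0.5))  # small loss: ranking is already correct
print(reward_model_loss(0.5, 2.0))  # larger loss: ranking is wrong
```

The loss depends only on the difference between the two scores, which is why comparison data is enough and no absolute score labels are needed.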

Alternatives to RLHF:

DPO (Direct Preference Optimization): simpler; Meta switched from RLHF (Llama 2) to DPO (Llama 3)
RLAIF (RL from AI Feedback): uses AI instead of humans as judges

Best of N strategy: generate N outputs, pick the one scored highest by RM — used by Stitch Fix and Grab; skips the RL step entirely.

Sampling

Fundamentals

A language model generates each token by:

Computing logit vector (one logit per vocabulary token)
Applying softmax to convert logits to probabilities
Sampling from the probability distribution
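The three steps above fit in a few lines of Python. This is a toy sketch with a made-up four-token vocabulary and hand-picked logits; a real model would produce one logit per token in a vocabulary of tens of thousands.

```python
import math
import random

def softmax(logits):
    # Subtract the max so the exponentials cannot overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng=random):
    probs = softmax(logits)
    # random.choices draws one index in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

vocab = ["red", "green", "blue", "the"]
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)  # seeded so the run is repeatable
print(softmax(logits))
print(vocab[sample_token(logits, rng)])
```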

Greedy sampling = always pick the highest-probability token → boring, repetitive outputs.

Logprobs = log probabilities; preferred over raw probs to avoid underflow issues with large vocabularies.
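A quick demonstration of the underflow problem. To make it visible, this sketch multiplies probabilities across a sequence of tokens, which is a slightly different setting from a single large-vocabulary distribution, but the numerical issue is the same: very small probabilities fall below what a float can represent, while their logs stay well-behaved.

```python
import math

token_probs = [0.01] * 200  # 200 tokens, each with probability 0.01

product = 1.0
for p in token_probs:
    product *= p            # true value is 1e-400, below the float64 range

logprob_sum = sum(math.log(p) for p in token_probs)

print(product)      # 0.0 (underflow)
print(logprob_sum)  # about -921.03
```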

Sampling Strategies

Temperature:

Logits divided by temperature T before softmax
Low T (< 1): concentrates probability mass on top token → more deterministic, predictable
High T (> 1): redistributes probability toward lower-probability tokens → more creative, less coherent
T = 0: mathematically undefined (division by zero); in practice implemented as always picking the argmax (greedy)
Recommended: T = 0.7 for creative tasks
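A minimal sketch of temperature scaling on a two-token example. The function name is illustrative; the only logic is dividing the logits by T before the softmax, as described above.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])
```

For logits [1, 2], the probabilities are about [0.12, 0.88] at T = 0.5, [0.27, 0.73] at T = 1, and [0.38, 0.62] at T = 2: lower temperatures sharpen the distribution and higher ones flatten it.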

Top-k:

Only sample from the top-k logits (common values: 50–500)
Reduces computation; smaller k = more predictable
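Top-k sampling as a short sketch: keep the k largest logits, apply softmax over just those, and sample. The function name and the toy logits are made up for the example.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # softmax numerators
    # random.choices normalizes the weights, so no explicit division is needed.
    return rng.choices(top, weights=weights, k=1)[0]

logits = [3.0, 0.2, 2.5, -1.0, 1.0]
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(set(samples))  # only indices 0 and 2 can appear
```

The softmax runs over k values instead of the whole vocabulary, which is where the computational saving comes from.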

Top-p (nucleus sampling):

Sum probabilities of tokens in descending order until sum ≥ p (typical: 0.9–0.95)
Dynamic k: adapts to each context (simple prompts → few tokens; open-ended → many tokens)
Works well in practice, even though its theoretical justification is less clear
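A sketch of nucleus sampling that shows the "dynamic k" behavior. The two probability lists are invented to contrast a confident prediction with an open-ended one; the function names are illustrative.

```python
import random

def top_p_filter(probs, p):
    """Return the smallest set of token indices whose probabilities sum to >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def top_p_sample(probs, p, rng=random):
    kept = top_p_filter(probs, p)
    # Sample among the kept tokens in proportion to their original probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

confident = [0.92, 0.05, 0.02, 0.01]               # e.g. a yes/no question
open_ended = [0.30, 0.25, 0.20, 0.12, 0.08, 0.05]  # e.g. a creative prompt
print(top_p_filter(confident, 0.9))   # [0]
print(top_p_filter(open_ended, 0.9))  # [0, 1, 2, 3, 4]
```

With the same p = 0.9, the confident distribution keeps one token and the open-ended one keeps five.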

Stopping conditions: fixed token count (but cuts off mid-sentence), stop tokens (e.g., EOS). Early stopping can cause malformatted structured outputs.

Test Time Compute

Generate multiple outputs to increase chance of a good response.

Best of N: sample N responses, pick the best one (by logprob, reward model score, or heuristic).

Beam search: generate fixed number of most promising candidates at each step.

Selection methods:

Highest average logprob (OpenAI API default)
Reward model scoring
Most common output (majority voting for exact answers — used by Google for Gemini MMLU evaluation with 32 samples)
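Two of these selection methods in a short sketch: picking the candidate with the highest average logprob, and majority voting over exact answers. The candidate texts and logprobs are invented for illustration; in a real system they would come from repeated sampling of the model.

```python
from collections import Counter

def average_logprob(token_logprobs):
    # Averaging (rather than summing) avoids favoring short outputs.
    return sum(token_logprobs) / len(token_logprobs)

def best_of_n(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

# Hypothetical sampled outputs: (text, per-token logprobs).
candidates = [
    ("Paris", [-0.1, -0.2]),
    ("It is Paris", [-0.3, -0.9, -0.2]),
    ("Lyon", [-1.5, -2.0]),
]
best = best_of_n(candidates, score=lambda c: average_logprob(c[1]))
print(best[0])                                        # Paris
print(majority_vote(["42", "41", "42", "42", "7"]))   # ('42', 0.6)
```

Swapping the `score` function for a reward model call gives the reward-model variant without changing the rest of the code.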

Key findings:

OpenAI (2021): adding a verifier gave roughly the same performance gain as a 30× increase in model size; a 100M-parameter model with a verifier ≈ a 3B-parameter model without one
DeepMind: scaling test time compute can be more efficient than scaling model parameters
Results on how far sampling scales are mixed: OpenAI saw gains level off beyond ~400 samples, while a Stanford study reported log-linear improvement up to 10K samples

Structured Outputs

Needed when: (1) task requires structured output (SQL, JSON, regex); (2) output feeds a downstream application.

Approaches (simple → intensive):

Prompting: instruct model to follow format; unreliable (may produce invalid output 5–30% of time)
Post-processing: fix common mistakes with scripts; LinkedIn defensive YAML parser: 90% → 99.99% valid
Test time compute: retry until valid output
Constrained sampling: filter logit vector to only include tokens valid at each grammar step; requires format-specific grammar; adds latency
Finetuning: most reliable and general; optionally add classifier head for guaranteed classification output
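Constrained sampling, reduced to its core step: mask out every token the grammar disallows, then sample from what remains. The tiny vocabulary and the one-rule "grammar" are placeholders; a real implementation would derive the allowed set from a format-specific grammar at every step.

```python
import math
import random

def constrained_sample(logits, vocab, allowed, rng=random):
    # Mask every token the grammar does not allow at this step.
    masked = [x if tok in allowed else -math.inf for tok, x in zip(vocab, logits)]
    if all(x == -math.inf for x in masked):
        raise ValueError("no valid token at this step")
    m = max(masked)
    weights = [math.exp(x - m) for x in masked]  # exp(-inf) == 0.0
    return rng.choices(vocab, weights=weights, k=1)[0]

vocab = ["{", "}", "name", ":", "hello", "42"]
logits = [0.5, 0.1, 1.0, 0.2, 3.0, 2.0]
# Toy grammar rule: a JSON object must start with "{".
print(constrained_sample(logits, vocab, allowed={"{"}))
```

Even though "hello" has the highest logit, it can never be chosen here because its weight is zero after masking. Computing the allowed set at each step is the source of the added latency.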

The Probabilistic Nature of AI

AI is probabilistic — same input can yield different outputs.

Inconsistency:

Same input, different outputs
Slightly different input, drastically different outputs
Mitigations: cache responses, fix temperature/top-p/top-k/seed, but hardware differences can still cause variation

Hallucination:

Two hypotheses:

Self-delusion (Ortega et al., DeepMind, 2021): model can’t differentiate user-provided tokens from self-generated tokens → treats generated “facts” as real → snowballing hallucinations (Zhang et al., 2023)
- Mitigation: RL-based differentiation between observations and actions; factual/counterfactual training signals
Mismatched internal knowledge (Leo Gao, OpenAI): SFT labelers teach model to respond using knowledge the model doesn’t have → model learns to hallucinate
- Mitigation: retrieval-based verification; better reward functions that punish fabrication

Hallucination detection discussed in Chapter 4.

Note: RLHF showed mixed results — Schulman says it helps, but InstructGPT paper shows RLHF worsens hallucination (though improves other aspects overall).

Prompt mitigation: “Answer truthfully; say ‘I don’t know’ if unsure.” Concise responses also help.

Key Takeaways

Training data is the most fundamental factor in a model’s capabilities and biases; Common Crawl is the backbone despite its quality issues
The transformer’s dominance comes from parallel input processing and flexible attention, but context length is inherently costly (attention compute grows quadratically with sequence length, and the KV cache grows linearly)
Model scale has three dimensions: parameters, training tokens, FLOPs; Chinchilla law → 20 tokens per parameter for compute-optimal training
Post-training (SFT + RLHF/DPO) is essential to make pre-trained models useful; it costs only ~2% of total compute
Sampling strategies (temperature, top-k, top-p) are underrated levers to tune model behavior; in OpenAI’s 2021 experiments, adding a verifier at test time matched the gain from a roughly 30× larger model
Hallucination and inconsistency are fundamental consequences of probabilistic sampling — not bugs to be patched away but properties to be managed

Study Notes by Niladri & AI
