00 — Foundations of Large Language Models
Who this is for: Software engineers building intuition for LLM internals — for interview readiness, debugging weird model behavior, and making smarter architectural decisions.
Time investment: 4–6 hours of reading + hands-on exercises.
Table of Contents
- Transformer Architecture
- Tokenization
- LLM Inference and Sampling
- Context Windows
- Fine-tuning vs Prompting vs RAG — Decision Tree
- Key Papers
- Interview Flashcards
1. Transformer Architecture
The transformer, introduced in “Attention Is All You Need” (Vaswani et al., 2017), replaced recurrent networks as the dominant architecture for sequence modeling. Every major LLM today — GPT-4, Claude, Gemini, LLaMA — is built on this foundation. Understanding the transformer is not optional for AI engineers.
1.1 The Core Idea: Attention Over Sequences
Earlier architectures like RNNs and LSTMs processed tokens sequentially: token 1 → token 2 → … → token N. This created two problems:
- Vanishing gradients: Information from early tokens degraded over long sequences.
- No parallelism: You could not compute token 5’s representation until token 4 was done.
Transformers solve both problems by allowing every token to attend to every other token simultaneously. The “attention” mechanism computes a weighted sum of all token representations, where the weights reflect how relevant each token is to the current one.
1.2 Self-Attention: Intuition and Math
Intuition first: Imagine reading the sentence “The cat sat on the mat because it was tired.” When you process “it”, you instinctively know it refers to “cat” — not “mat”. Self-attention formalizes this: for each token, compute a relevance score against every other token, and use those scores to build a context-aware representation.
The three matrices: Q, K, V
For each token embedding x_i, we project it into three vectors:
Q_i = x_i · W_Q (Query — "what am I looking for?")
K_i = x_i · W_K (Key — "what do I contain?")
V_i = x_i · W_V (Value — "what do I contribute?")
Where W_Q, W_K, W_V are learned weight matrices of shape [d_model, d_k].
Computing attention scores:
Attention(Q, K, V) = softmax( Q · K^T / sqrt(d_k) ) · V
Step by step:
Q · K^T— dot product of queries and keys. Result: an[n, n]matrix where entry[i, j]measures how much tokenishould attend to tokenj./ sqrt(d_k)— scale by the square root of key dimension. Without this, dot products grow large for high-dimensional vectors, pushing softmax into saturation (near-zero gradients).softmax(...)— normalize scores to probabilities. Each row sums to 1.· V— weighted sum of value vectors. Each output token is a mixture of all value vectors, weighted by attention scores.
Why O(n²)?
The Q · K^T step computes a dot product between every pair of tokens. For a sequence of length n, that is n × n dot products — O(n²) in both time and memory. This is the fundamental scaling bottleneck for long contexts. A 100k-token context requires 10 billion attention score computations per layer.
Masking in decoders
Decoder-only models (like GPT) cannot attend to future tokens during training — that would be cheating. A causal mask sets all [i, j] entries where j > i to -infinity before the softmax, effectively zeroing them out. This forces each token to only use past context.
Mask example for n=4 (1=attend, -inf=blocked):
t1 t2 t3 t4
t1 [ 1 -inf -inf -inf ]
t2 [ 1 1 -inf -inf ]
t3 [ 1 1 1 -inf ]
t4 [ 1 1 1 1 ]
1.3 Multi-Head Attention — Why Multiple Heads?
A single attention head computes one set of Q/K/V projections. Multi-head attention runs h attention heads in parallel, each with its own learned projections:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(Q · W_Q_i, K · W_K_i, V · W_V_i)
Why does this matter?
Each head can specialize in a different type of relationship:
- Head 1 might learn syntactic dependencies (“subject → verb agreement”)
- Head 2 might learn coreference (“it” → “cat”)
- Head 3 might learn positional proximity (“adjacent words”)
- Head 4 might learn semantic similarity
Empirically, probing studies confirm this. Heads in BERT have been shown to specialize in specific linguistic phenomena. With a single head, the model must compress all relational information into one set of weights.
Dimension math: If d_model = 512 and h = 8 heads, each head operates on d_k = 512/8 = 64 dimensions. The total computation is similar to one head at full dimension, but the representational power is richer.
1.4 Positional Encoding — Learned vs Sinusoidal
Attention is permutation-invariant by default. The attention scores depend only on Q and K values, not on token positions. Feeding “cat sat mat” vs “mat sat cat” would produce identical attention patterns — which is wrong.
Positional encodings inject position information into the token embeddings before the attention layers.
Sinusoidal encodings (original transformer):
PE(pos, 2i) = sin( pos / 10000^(2i/d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) )
- Each position gets a unique vector of sines and cosines at different frequencies.
- The model can extrapolate to longer sequences than seen during training (in theory).
- No learned parameters — works out of the box.
Learned positional embeddings (GPT-2, BERT):
- A lookup table: position 0 → vector_0, position 1 → vector_1, …
- Learned jointly with the model weights.
- More flexible — the model can learn whatever positional signal is most useful.
- Cannot generalize beyond the maximum training length.
Rotary Positional Embedding (RoPE) — used in LLaMA, GPT-NeoX:
Modern models use RoPE, which encodes positional information by rotating the Q and K vectors in the complex plane. Instead of adding a fixed vector to embeddings, it multiplies Q and K by rotation matrices that depend on position. This means the dot product Q·K naturally encodes relative position, not absolute position. RoPE generalizes better to longer sequences and is the current standard for most open-weight models.
1.5 Feed-Forward Layers, Layer Norm, Residual Connections
Each transformer block contains more than just attention. Here is the full composition:
Feed-Forward Network (FFN):
After attention, each token representation passes through a two-layer MLP applied independently:
FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2
- W_1 expands from
d_modelto4 × d_model(e.g., 512 → 2048) - W_2 projects back down
- ReLU or GELU activation in between
The FFN is where most of the model’s “knowledge” is stored. The attention mechanism routes information; the FFN transforms it. Research shows that factual associations are often stored in specific FFN layers (the “knowledge neurons” hypothesis).
Layer Normalization:
Before attention and FFN operations, inputs are normalized:
LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β
Where μ and σ² are computed per-token across the feature dimension, and γ, β are learned scale/shift parameters. This stabilizes training by preventing activation values from exploding or vanishing.
Note: The original paper used Post-LN (normalize after sublayer). Most modern LLMs use Pre-LN (normalize before sublayer), which is more stable to train.
Residual Connections:
Every sublayer (attention, FFN) is wrapped in a residual connection:
output = sublayer(LayerNorm(x)) + x ← Pre-LN style
Residual connections allow gradients to flow directly from output to input, solving the vanishing gradient problem for deep networks. They also allow the model to “skip” a layer by learning near-zero weights — giving it the flexibility to use only the depth it needs.
1.6 ASCII Diagram of a Transformer Block
Input Tokens
|
┌─────────▼──────────┐
│ Token Embeddings │
│ + Positional Enc. │
└─────────┬──────────┘
│
┌─────────▼──────────────────────────────────┐
│ TRANSFORMER BLOCK (×N layers) │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Layer Norm │ │
│ └──────────┬───────────────────────┘ │
│ │ │
│ ┌──────────▼───────────────────────┐ │
│ │ Multi-Head Attention │ │
│ │ │ │
│ │ Q = x·W_Q K = x·W_K V = x·W_V│ │
│ │ │ │
│ │ score = softmax(QK^T/√d_k) · V │ │
│ │ │ │
│ │ [head_1 | head_2 | ... | head_h] │ │
│ └──────────┬───────────────────────┘ │
│ │ │
│ ┌──────────▼──┐ (residual add) │
│ │ x + attn │◄────────────────────────┤
│ └──────────┬──┘ │
│ │ │
│ ┌──────────▼───────────────────────┐ │
│ │ Layer Norm │ │
│ └──────────┬───────────────────────┘ │
│ │ │
│ ┌──────────▼───────────────────────┐ │
│ │ Feed-Forward Network (FFN) │ │
│ │ Linear → GELU → Linear │ │
│ │ d_model → 4·d_model → d_model │ │
│ └──────────┬───────────────────────┘ │
│ │ │
│ ┌──────────▼──┐ (residual add) │
│ │ x + ffn(x) │◄────────────────────────┤
│ └──────────┬──┘ │
│ │ │
└──────────────┼─────────────────────────────┘
│ (repeated N times)
┌─────────▼──────────┐
│ Final Layer Norm │
└─────────┬──────────┘
│
┌─────────▼──────────┐
│ Linear (unembedding)│
│ + Softmax │
└─────────┬──────────┘
│
Next token logits
1.7 Encoder-Only, Decoder-Only, Encoder-Decoder
The transformer architecture comes in three main variants. Choosing the right one matters when selecting base models or understanding why a model behaves a certain way.
Encoder-Only (e.g., BERT, RoBERTa, DeBERTa)
Input: [CLS] The cat sat [SEP]
↓
All tokens attend to all other tokens (bidirectional)
↓
Output: Contextual embedding for each token
- Full bidirectional attention — every token sees every other token.
- Produces rich contextual representations, not text.
- Best for: classification, NER, sentence embeddings, semantic search.
- NOT used for text generation (no autoregressive decoding).
Decoder-Only (e.g., GPT-4, Claude, LLaMA, Mistral)
Input: The cat sat on
↓
Each token attends only to previous tokens (causal mask)
↓
Output: Probability distribution over next token
- Causal (left-to-right) attention only.
- Trained to predict the next token given all previous tokens.
- Best for: chat, code generation, summarization, anything generative.
- The dominant architecture for modern LLMs.
Encoder-Decoder (e.g., T5, BART, mT5)
Input sequence → [Encoder: bidirectional attention]
↓
Encoder hidden states
↓
Output so far → [Decoder: causal attention + cross-attention to encoder]
↓
Next output token
- Encoder reads the full input with bidirectional attention.
- Decoder generates output token by token.
- Cross-attention: decoder tokens attend to encoder hidden states.
- Best for: translation, summarization with explicit input/output separation.
- T5 reformulates all NLP tasks as text-to-text.
Summary table:
| Architecture | Attention Type | Use Case | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Embeddings, classification | BERT, RoBERTa |
| Decoder-only | Causal (unidirect) | Text generation, chat | GPT-4, Claude, LLaMA |
| Encoder-decoder | Both + cross-attn | Translation, seq2seq | T5, BART |
2. Tokenization
2.1 What Is a Token?
A token is the atomic unit of text that a model processes. It is not a character and not a word — it is a subword unit produced by a tokenizer trained on a large text corpus.
Examples using GPT-4’s tokenizer (cl100k_base):
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"unbelievable" → ["un", "bel", "iev", "able"] = 4 tokens
"ChatGPT" → ["Chat", "G", "PT"] = 3 tokens
"12345" → ["123", "45"] = 2 tokens (often)
"こんにちは" → ["こん", "にち", "は"] = 3 tokens (Japanese)
"🚀" → [128640] = 1 token (emoji)
Why tokens matter:
- Cost: Every major LLM API bills per token (input + output). A 100-word email ≈ 130 tokens.
- Context limits: Models have a maximum context window measured in tokens. Exceeding it truncates your input.
- Latency: More tokens = more inference steps = slower output.
- Model behavior: The model never sees raw text. It sees integer IDs. Tokenization artifacts can cause surprising failures.
Rule of thumb (English prose):
- 1 token ≈ 0.75 words, or ~4 characters
- 100 tokens ≈ 75 words ≈ half a paragraph
- 1,000 tokens ≈ 750 words ≈ 1.5 pages
These ratios break down for code (denser), non-Latin scripts (sparser per character), and numbers.
2.2 BPE — Byte Pair Encoding, Step by Step
BPE is the most widely used tokenization algorithm. It starts with individual bytes/characters and iteratively merges the most frequent adjacent pairs into new tokens.
Algorithm:
1. Initialize vocabulary = all unique characters in corpus
e.g., {h, e, l, o, w, r, d, ' '}
2. Tokenize corpus at character level:
"hello world" → ['h','e','l','l','o',' ','w','o','r','l','d']
3. Count all adjacent pairs:
(h,e): 1, (e,l): 1, (l,l): 1, (l,o): 2, (o,' '): 1,
(' ',w): 1, (w,o): 1, (o,r): 1, (r,l): 1, (l,d): 1
4. Merge most frequent pair: (l,o) → "lo"
Vocabulary: {h, e, l, lo, o, w, r, d, ' '}
Corpus: ['h','e','l','lo',' ','w','o','r','l','d']
5. Repeat steps 3-4 until vocabulary reaches target size (e.g., 50,000)
Full worked example (small corpus):
Corpus: "low lower lowest"
Initial character vocab: {l, o, w, e, r, s, t, ' '}
Iteration 1 — pairs and counts:
(l,o):3 (o,w):3 (w,' '):1 (w,e):2 (e,r):2 (r,' '):1
(r,e):1 (e,s):1 (s,t):1
Most frequent: (l,o) AND (o,w) both at 3. Merge (l,o) → "lo"
Iteration 2:
Corpus: "lo w lo w e r lo w e s t"
Most frequent: (lo,w) at 3. Merge → "low"
Iteration 3:
Corpus: "low low e r low e s t"
Most frequent: (low,e) at 2. Merge → "lowe"
Result: "lowest" → ["lowe", "s", "t"]
"lower" → ["lowe", "r"]
"low" → ["low"]
BPE naturally creates subword tokens that capture morphological structure. Common word stems become single tokens; rare words are split into known subwords, avoiding the out-of-vocabulary problem.
tiktoken is OpenAI’s fast BPE implementation used by GPT-3.5, GPT-4, and related models. Claude uses a different tokenizer, but concepts are identical.
2.3 Counting Tokens with tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
text = "Hello, world! How are you today?"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}") # 9
print(f"Token IDs: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")Pre-flight cost estimation pattern:
def estimate_cost(messages: list[dict], model="gpt-4o") -> dict:
enc = tiktoken.encoding_for_model(model)
# Count tokens per message (includes overhead for role/formatting)
total = 0
for msg in messages:
total += 4 # per-message overhead
total += len(enc.encode(msg["role"]))
total += len(enc.encode(msg["content"]))
total += 2 # reply priming
input_cost = total * 0.000005 # $5 per million tokens (GPT-4o)
return {"tokens": total, "estimated_cost_usd": input_cost}2.4 Tokenization Gotchas
Numbers tokenize unpredictably:
"1" → 1 token
"100" → 1 token
"1000" → 1 token
"10000" → 2 tokens ("100", "00")
"12345678" → 3-4 tokens
This is why LLMs struggle with arithmetic — they never see the full number as a unit; they see fragments of it. “1234 + 5678” might tokenize as [“12”, “34”, ” +”, ” 56”, “78”], making carry-based addition invisible.
Code is token-dense:
# Python code — shorter than it looks in tokens
def fibonacci(n):
if n <= 1: return n
return fibonacci(n-1) + fibonacci(n-2)Operators, underscores, indentation, parentheses — all consume tokens. A 100-line Python file is often 400–600 tokens, not 100.
Non-English text uses more tokens per word:
| Language | ”Hello, how are you?” | Tokens |
|---|---|---|
| English | ”Hello, how are you?“ | 6 |
| Spanish | ”Hola, ¿cómo estás?“ | 8 |
| Chinese | ”你好,你怎么样?“ | 12–15 |
| Arabic | ”مرحبا، كيف حالك؟“ | 20+ |
Non-Latin scripts are often tokenized at the byte level, meaning a single Chinese character can become 3–4 tokens. This has real cost implications for multilingual applications.
Whitespace and capitalization matter:
" hello" ≠ "hello" (leading space = different token in many encoders)
"Hello" ≠ "hello" (different token)
This affects prompting: “Answer:” and ” Answer:” may produce different completions.
The tokenization boundary problem:
"11 + 29 = " → the model must predict "40"
Tokens: ["11", " +", " 29", " =", " "]
The model never sees “11” and “29” as numbers — it sees token IDs that happen to decode to those strings. Getting the right answer requires the model to have learned arithmetic patterns that span token boundaries, which is why LLMs use tools/code for reliable math.
3. LLM Inference and Sampling
After the transformer produces logits (raw scores) over the vocabulary, a sampling strategy converts those logits into the next token. This seemingly small step has massive impact on output quality, creativity, and reproducibility.
3.1 Temperature
Temperature T controls the “sharpness” of the probability distribution:
P(token_i) = softmax(logits / T)
= exp(logit_i / T) / Σ exp(logit_j / T)
What T=0 does (greedy decoding):
As T → 0, the distribution collapses to a one-hot vector. The token with the highest logit gets probability 1.0. This is deterministic: same input → same output, always. Use for: structured output (JSON), code generation where correctness matters, classification.
What T=1 does (standard sampling):
No scaling. The softmax distribution reflects the model’s “natural” confidence. Mid-range tokens remain possible. Use for: conversational responses, general QA.
What T>1 does (flat/creative):
High temperature flattens the distribution — less probable tokens become more likely. This increases diversity but also hallucination. T=2.0 can produce incoherent text.
Example: top 3 token probabilities at different temperatures
Logits: [token_A=3.0, token_B=2.0, token_C=1.0]
T=0.5: [0.88, 0.11, 0.01] ← very peaked on A
T=1.0: [0.67, 0.24, 0.09] ← moderate confidence
T=1.5: [0.54, 0.30, 0.16] ← more uniform
T=2.0: [0.47, 0.33, 0.21] ← nearly flat
Practical temperature guide:
| Use Case | Temperature |
|---|---|
| JSON / structured output | 0–0.2 |
| Code generation | 0–0.3 |
| Factual QA | 0.3–0.5 |
| General chat | 0.7–1.0 |
| Creative writing | 1.0–1.3 |
| Brainstorming | 1.2–1.5 |
| Experimental/random | >1.5 |
3.2 Top-p (Nucleus Sampling) vs Top-k
Top-k: Only the k highest-probability tokens are candidates. All others are set to zero probability and renormalized.
Example: k=3, probabilities=[0.4, 0.3, 0.2, 0.07, 0.03]
After top-k: candidates=[0.4, 0.3, 0.2], renormalized=[0.44, 0.33, 0.22]
Problem: k is a fixed number. When the model is very confident (one token has 0.99 probability), top-k=50 still forces 50 candidates, including terrible ones. When the model is uncertain, top-k=50 might cut off many reasonable options.
Top-p (nucleus sampling): Select the smallest set of tokens whose cumulative probability exceeds p. Use those as candidates.
Example: p=0.9, probabilities=[0.5, 0.3, 0.15, 0.04, 0.01]
Cumulative: [0.5, 0.8, 0.95, ...]
Nucleus at p=0.9: {0.5, 0.3, 0.15} (cumulative 0.95 >= 0.90)
Top-p adapts dynamically: when the model is confident, the nucleus is small (1–2 tokens). When uncertain, it expands. This is generally preferred over top-k for text generation.
How they interact:
Most APIs apply both. The typical process:
1. Apply temperature scaling to logits
2. Filter to top-k tokens (if k set)
3. Filter to nucleus (top-p tokens)
4. Renormalize remaining probabilities
5. Sample one token
Recommendation: set top-p=0.9 (or 0.95) and leave top-k unset, or set top-k=0 to disable it. Tuning both simultaneously is usually counterproductive.
3.3 Repetition Penalty and Frequency Penalty
LLMs have a tendency to repeat themselves — especially at low temperature. Penalties address this.
Repetition penalty (used in HuggingFace Transformers):
logit_adjusted = logit / penalty if token appeared previously
= logit otherwise
A penalty > 1.0 reduces the probability of already-used tokens. Penalty=1.3 is common.
Frequency penalty (OpenAI API):
logit_adjusted = logit - frequency_penalty × count(token in context)
The more times a token has appeared, the more its logit is reduced. Range: [−2, 2]. Positive values discourage repetition.
Presence penalty (OpenAI API):
logit_adjusted = logit - presence_penalty if token appeared at all
= logit otherwise
Binary — penalizes any previously seen token once, regardless of count. Encourages topic diversity.
When to use:
- Long-form generation: use frequency_penalty=0.2–0.5
- Avoiding repetitive phrasing: use presence_penalty=0.1–0.3
- Creative writing: higher values, but too high = incoherent
3.4 Decoding Strategies Compared
Greedy decoding:
Always pick the highest-probability token. Fast. Deterministic. Often produces stilted, repetitive text. Works well for structured tasks.
Prompt: "The sky is"
Greedy: "The sky is blue." (most probable path)
Beam search:
Maintain b candidate sequences (beams). At each step, expand each beam with the top-b tokens, then keep the b highest-scoring sequences.
b=2 example:
Step 1: beams = ["The sky is blue", "The sky is clear"]
Step 2: expand each:
"The sky is blue and" (score: -2.1)
"The sky is blue." (score: -1.8)
"The sky is clear ." (score: -2.3)
"The sky is clear on"(score: -2.0)
Keep top 2: "The sky is blue." and "The sky is clear on"
Beam search finds higher-probability sequences than greedy but is slower and tends to produce generic, “safe” text. It dominated NLP before sampling became preferred. Still used for translation where faithfulness matters.
Sampling (temperature + top-p):
Non-deterministic. Introduces controlled randomness. Generally produces more natural, varied text. The current standard for chat/generation.
Comparison:
| Strategy | Deterministic? | Quality | Speed | Use Case |
|---|---|---|---|---|
| Greedy | Yes | Safe, repetitive | Fastest | Structured output |
| Beam search | Yes (b=N) | Accurate, generic | Slow | Translation, summariz. |
| Temperature sampling | No | Natural, varied | Fast | Chat, creative writing |
3.5 Why Determinism Is Approximate at Temperature=0
Setting temperature=0 should produce deterministic output — same input, same output. In practice, this is only approximately true:
-
Floating-point non-determinism: GPU operations like matrix multiplications do not always produce bit-identical results across runs. The order of floating-point additions in a parallel reduction can vary, causing tiny differences that compound through softmax.
-
Batch size effects: Running inference on a batch of 1 vs. batch of 4 produces different numerical results due to different memory access patterns.
-
Framework versions and hardware: cuDNN updates, different GPU models, TensorRT optimizations — all can introduce variation.
-
Top-k tie-breaking: If two tokens have identical post-temperature scores (rare but possible), tie-breaking behavior may be non-deterministic.
Practical implication: For truly reproducible results, you need to fix random seeds, run on the same hardware, and use the same software versions. Even then, near-identical rather than bit-identical is the realistic expectation. For production systems requiring reproducibility, log the actual output, not the intent to reproduce it.
4. Context Windows
4.1 What Fills a Context Window
A context window is the total number of tokens the model can “see” at once during a single inference call. Every token costs memory and compute.
Here is what occupies context in a typical LLM application:
┌───────────────────────────────────────────────────┐
│ CONTEXT WINDOW │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ System Prompt │ │
│ │ "You are a helpful assistant. Today is..." │ │
│ │ ~200–2,000 tokens │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Tool Definitions (function calling) │ │
│ │ ~100–500 tokens per tool │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Conversation History │ │
│ │ Turn 1: user + assistant (~200 tokens) │ │
│ │ Turn 2: user + assistant (~300 tokens) │ │
│ │ Turn N: ... │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Retrieved Context (RAG documents) │ │
│ │ ~500–10,000 tokens │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Current User Message │ │
│ │ ~50–500 tokens │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Output (generated) │ │
│ │ ~100–4,000 tokens (max_tokens setting) │ │
│ └─────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────┘
Context windows by model generation (approximate):
| Model | Context Window |
|---|---|
| GPT-3 (2020) | 4,096 tokens |
| GPT-4 (2023) | 8k / 32k tokens |
| Claude 3 Opus | 200,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| GPT-4o | 128,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |
Budget management in practice:
MAX_CONTEXT = 128_000 # GPT-4o
RESERVED_OUTPUT = 4_000 # max tokens for response
SYSTEM_PROMPT_BUDGET = 2_000
available_for_history = MAX_CONTEXT - RESERVED_OUTPUT - SYSTEM_PROMPT_BUDGET
# = 122,000 tokens
# If history exceeds budget, truncate from oldest messages4.2 KV Cache — What It Is and Why It Matters
The problem: During autoregressive decoding, the model generates one token at a time. To generate token 100, it must run attention over all 99 previous tokens. Without caching, each new token requires recomputing K and V for every previous token — O(n) full forward passes, O(n²) total work.
The solution: Cache the K and V matrices computed during the first pass. For subsequent tokens, only compute Q for the new token and retrieve K/V from cache.
Without KV cache:
Token 1: compute K,V for [t1] → attention over 1 token
Token 2: compute K,V for [t1,t2] → attention over 2 tokens
Token 3: compute K,V for [t1,t2,t3] → attention over 3 tokens
Token N: compute K,V for [t1..tN] → attention over N tokens
Total: O(N²) operations
With KV cache:
Token 1: compute K,V for [t1], store in cache
Token 2: compute K,V for [t2], append to cache, attend over 2
Token N: compute K,V for [tN], append to cache, attend over N
Total: O(N) operations (incremental)
Memory cost of KV cache:
For a model with:
LlayersHattention headsd_kkey/value dimension per headNtokens in context- Using 16-bit floats (2 bytes)
KV cache size = 2 × L × H × d_k × N × 2 bytes
↑key+val ↑ per token ↑ float16
For LLaMA-70B with N=4096 context: ~8 GB of KV cache. This is why long contexts require much more VRAM than short ones.
Prompt caching (Anthropic, OpenAI):
Cloud providers offer prompt caching: if the same prefix is sent repeatedly, the provider stores the KV cache server-side. Subsequent requests with that prefix pay only for new tokens. This is a major cost and latency optimization for applications with large, repeated system prompts or document context.
Without caching: cost = (system_prompt + history + new_msg) × price_per_token
With caching: cost = cache_read_cost × cached_tokens + full_cost × new_tokens
(cache reads are typically 90% cheaper)
4.3 Long-Context Trade-offs
Quadratic attention cost:
Standard attention is O(n²) in memory and compute. Doubling context length quadruples the attention computation. Going from 4k to 128k tokens is a 1,024× increase in attention cost (though optimizations like FlashAttention reduce this in practice with better memory access patterns, the fundamental O(n²) scaling remains).
FlashAttention (2022) uses tiling to compute attention without materializing the full n×n matrix in HBM (high bandwidth memory), reducing memory from O(n²) to O(n). This enables longer contexts but does not change the O(n²) FLOP count.
The lost-in-the-middle problem:
Liu et al. (2023) showed that LLMs are significantly worse at using information placed in the middle of a long context, compared to information at the start or end.
Performance by document position (schematic):
High ↑
|█
|█ █
|█ █ █ █
|█ █ █ █ █ █
|█ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █
Low └─────────────────────────────────→
Start Middle End
of context of context
Practical implications:
- Put the most critical information at the start or end of the context.
- For RAG: return fewer, more relevant documents rather than many mediocre ones.
- For long documents: consider chunking and summarizing rather than fitting everything in context.
- For conversation history: recent messages (end of context) matter more than old ones (middle).
Context != comprehension:
A model with a 200k token context window can technically fit 150 pages of text. Whether it comprehends and correctly uses all of it is a separate question. Needle-in-a-haystack benchmarks show that retrieval accuracy degrades significantly for documents placed deep in very long contexts.
5. Fine-tuning vs Prompting vs RAG — Decision Tree
5.1 Overview
Three main techniques to adapt an LLM to your use case. Choosing wrong wastes months. Here is the breakdown:
Prompting (Prompt Engineering)
- Modify the system prompt, few-shot examples, or instruction format.
- Zero training required. Iterate in minutes.
- Model weights unchanged.
RAG (Retrieval-Augmented Generation)
- At query time, retrieve relevant documents from an external store and inject them into the context.
- Knowledge lives outside the model — easy to update.
- Requires retrieval infrastructure (vector DB, embedding model).
Fine-tuning
- Update model weights on your dataset.
- Bakes knowledge or behavior directly into the model.
- Slow and expensive. Harder to update.
5.2 When to Use Each
Use prompting when:
- You are still exploring and do not know what you need.
- The base model already knows the domain (e.g., general coding, writing).
- You need fast iteration cycles (hours, not weeks).
- You can fit all necessary context in the prompt.
- No labeled training data exists.
Use RAG when:
- Knowledge base is large (thousands of documents).
- Knowledge changes frequently (news, product docs, legal updates).
- You need citations/verifiability — retrieved chunks are auditable.
- Domain knowledge is factual and retrievable (not behavioral).
- You want to avoid hallucination by grounding answers in documents.
Use fine-tuning when:
- You need a specific output style or format the model does not naturally produce.
- You have hundreds to thousands of high-quality labeled examples.
- You need to reduce latency/cost by compressing prompt engineering into weights.
- Domain-specific terminology or jargon is critical (medical, legal, scientific).
- You want behavior adaptation that cannot be achieved via prompting (tone, persona, specialized reasoning).
5.3 Decision Tree (ASCII Art)
START: I need to adapt an LLM to my use case
│
▼
┌──────────────────────┐
│ Do I have a clear │
│ task definition? │
└──────────┬───────────┘
│
No │ Yes
▼ │ ▼
Start with │ ┌──────────────────────────┐
prompting │ │ Does the base model fail │
+ iterate │ │ on this task with good │
│ │ prompts? │
│ └──────────┬───────────────┘
│ │
│ No │ Yes
│ ▼ │ ▼
│ ┌──────┐ │ ┌──────────────────────┐
│ │ DONE │ │ │ Is the gap due to │
│ │Prompt│ │ │ missing knowledge? │
│ │works │ │ └──────────┬───────────┘
│ └──────┘ │ │
│ │ Yes │ No
│ │ ▼ │ ▼
│ │ ┌──────────────┐ ┌──────────────────┐
│ │ │Knowledge is │ │Model lacks the │
│ │ │large/dynamic?│ │right behavior │
│ │ └─────┬────────┘ │(style, format, │
│ │ │ │ reasoning) │
│ │ Yes │ No └────────┬─────────┘
│ │ ▼ │ ▼ │
│ │ RAG │ Fine-tune ▼
│ │ │ (bake it in) Fine-tune
│ │ │
│ ┌──────┘
│ │ Do I have labeled data?
│ ├─── No → Prompt engineer or generate synthetic data
│ └─── Yes → Consider fine-tune on examples
│
└──────────────────────────────────────────────────────►
(iterative cycle)
5.4 Combined Architectures
In production, these are not mutually exclusive:
- RAG + Prompting: Most RAG systems. Retrieve docs, inject into prompt, generate.
- Fine-tune + RAG: Fine-tune for behavior/style, use RAG for knowledge. E.g., fine-tune a customer service tone, use RAG for product documentation.
- Fine-tune + Prompting: Fine-tune reduces prompt length needed (compressed knowledge), then use a short prompt.
Cost and complexity spectrum:
Low Cost/Complexity ──────────────────────────── High Cost/Complexity
│ │ │
Prompting RAG Fine-tuning
│ │ │
Minutes to Days to weeks Weeks to months
iterate to set up to train + eval
│ │ │
No GPU needed Embedding + A100/H100 hours
vector DB + labeled data
6. Key Papers
These four papers are the foundational reading for any LLM engineer. You do not need to read them in full, but you should know what each introduced and why it mattered.
Attention Is All You Need (Vaswani et al., 2017)
Introduced the Transformer architecture, replacing RNNs and CNNs for sequence modeling. Proposed the self-attention mechanism, multi-head attention, positional encodings, and the encoder-decoder structure. Achieved state-of-the-art on machine translation with dramatically better parallelism than recurrent approaches. Every major LLM today descends from this architecture.
What to remember: The core insight — attention lets models relate any two positions in a sequence in O(1) operations, regardless of distance.
GPT-3: Language Models are Few-Shot Learners (Brown et al., 2020)
Showed that scaling language model pretraining to 175 billion parameters produced emergent few-shot learning capabilities without any fine-tuning. By simply providing a few examples in the prompt (in-context learning), GPT-3 matched fine-tuned smaller models on many NLP tasks. This established the “foundation model” paradigm — train once at scale, adapt via prompting.
What to remember: The discovery that in-context learning is a scaling phenomenon. Also introduced the terms “zero-shot”, “one-shot”, and “few-shot” learning in the context of LLMs.
InstructGPT: Training language models to follow instructions with human feedback (Ouyang et al., 2022)
Demonstrated that RLHF (Reinforcement Learning from Human Feedback) dramatically improves GPT-3’s ability to follow instructions safely and helpfully. Fine-tuned with supervised learning on human demonstrations, then used a reward model trained on human preferences to optimize via PPO. The InstructGPT model (1.3B parameters) was preferred over raw GPT-3 (175B) by human raters. This is the alignment technique behind ChatGPT, Claude, and every RLHF-trained assistant.
What to remember: RLHF = supervised fine-tuning on demonstrations → reward model training → RL optimization. Small aligned models > large unaligned models on instruction following.
Constitutional AI: Harmlessness from AI Feedback (Behrens et al., 2022, Anthropic)
Proposed training AI systems to be harmless using AI-generated feedback rather than purely human feedback. A set of principles (a “constitution”) guides a model to critique and revise its own outputs. A preference model trained on AI comparisons is then used for RLHF. Reduced the need for human labelers on harmful content. The technique behind Claude’s training.
What to remember: AI self-critique using constitutional principles → more scalable alignment than pure human labeling.
7. Interview Flashcards
These are the exact questions that come up in LLM engineering interviews. Each answer should take 60–90 seconds to deliver out loud.
Q1: What is self-attention and why is it O(n²)?
Self-attention computes, for each token, a weighted sum of all other tokens’ value vectors. The weights come from a scaled dot product of the query vector (the current token) with the key vectors (all other tokens). Because you compute a similarity score between every pair of tokens — n queries times n keys — you get an n×n matrix, making both time and memory complexity O(n²) in sequence length. This is why long contexts are expensive: going from 4k to 128k tokens increases attention cost by a factor of 1,024.
Q2: What is the difference between encoder and decoder transformers?
Encoder transformers (e.g., BERT) use bidirectional attention — every token can see every other token. They produce rich contextual embeddings for each token but do not generate text. Decoder transformers (e.g., GPT, Claude) use causal (masked) attention — each token can only attend to past tokens. They predict the next token autoregressively and are the architecture behind all modern chat LLMs. Encoder-decoder models (T5, BART) combine both: an encoder reads the full input, a decoder generates the output attending to encoder states via cross-attention.
Q3: What happens when you set temperature=0?
Temperature=0 approaches greedy decoding — at each step, the model always picks the highest-probability token. The output is deterministic in principle (same input → same output). However, in practice, floating-point non-determinism in GPU operations means you get approximately deterministic results, not bit-identical ones across different hardware or batch sizes. Temperature=0 is appropriate for tasks requiring consistency: structured JSON output, code generation, classification. Avoid it for creative tasks where variety matters.
Q4: Why does tokenization matter for cost?
LLM APIs charge per token — both input and output. Tokenization determines how many tokens your text consumes. Non-Latin scripts (Chinese, Arabic) can use 3–5× more tokens per word than English. Numbers and code tokenize differently than prose. A poorly designed prompt that uses verbose phrasing, unnecessary whitespace, or long system prompts wastes tokens (and money). Counting tokens before sending a request — using tiktoken or the model’s native counter — lets you budget accurately and catch expensive mistakes before they hit the API.
Q5: What is the KV cache?
The KV cache is an optimization for autoregressive decoding. Without it, generating each new token would require recomputing the key (K) and value (V) vectors for every previous token — O(n²) total work for an n-token generation. The KV cache stores K and V matrices from previous forward passes. For each new token, only the new K and V need to be computed and appended; existing K/V values are reused. This reduces generation to O(n) total operations. The trade-off is memory: the cache grows linearly with context length and can consume gigabytes for long contexts.
Q6: Fine-tune vs RAG — give me a scenario for each.
Fine-tune scenario: A company wants their LLM to always respond in a specific brand voice — terse, technical, with proprietary terminology their documentation uses. This behavioral and stylistic adaptation is hard to achieve reliably through prompting alone. Fine-tuning on hundreds of example company responses bakes the style into the model weights.
RAG scenario: A legal tech company needs an LLM to answer questions about statutes that are updated monthly. Fine-tuning on static data would become stale. Instead, they index all legal documents in a vector database and retrieve the most relevant chunks at query time, injecting them into the prompt. The knowledge stays current without retraining.
Q7: What is the “lost in the middle” problem?
Research (Liu et al., 2023) found that LLMs perform significantly worse at retrieving information placed in the middle of a long context compared to the beginning or end. When asked a question whose answer is in a document somewhere in a 20-document context, accuracy peaks when the answer is in the first or last document and degrades for middle positions. This is attributed to attention patterns that favor recency (end) and primacy (start). Practical implication: in RAG systems, put the most important context at the top or bottom of the injected text, not buried in the middle.
Q8: How does BPE tokenization work?
Byte Pair Encoding starts with a character-level vocabulary (or byte-level). It then iteratively counts all adjacent character pairs in the training corpus and merges the most frequent pair into a new token. This is repeated until the vocabulary reaches a target size (e.g., 50,000 tokens). The result is a subword vocabulary where common words become single tokens and rare words are split into known subword pieces. For example, “playing” might be a single token, while “tokenization” might be [“token”, “ization”]. BPE avoids out-of-vocabulary words (since any word can be decomposed into bytes) while keeping common sequences compact.
End of 00-foundations README. Proceed to examples/ for runnable code demonstrations.