Chapter 3: Evaluation Methodology
Why Evaluation is Hard for Foundation Models
- Increasing intelligence → harder to evaluate: Evaluating PhD-level math solutions requires PhD-level expertise; verifying complex summaries requires reading the source
- Open-ended outputs: No exhaustive list of correct responses; can’t compare against ground truths like close-ended tasks
- Black-box models: Training data and architecture are often hidden; only outputs are observable
- Rapidly saturating benchmarks: each benchmark is soon replaced by a harder successor: GLUE (2018) → SuperGLUE (2019); NaturalInstructions (2021) → Super-NaturalInstructions (2022); MMLU (2020) → MMLU-Pro (2024)
- Expanded scope: General-purpose models must be evaluated across all possible tasks, including ones they could do that we haven’t discovered yet
Investment gap: Evaluation tools represent a small fraction of AI engineering repos vs. modeling and orchestration tools.
Language Modeling Metrics
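All four metrics (cross entropy, perplexity, BPC, BPB) are interconvertible: given one of them plus the tokenization statistics (tokens, characters, and bytes in the evaluated text), the others can be computed.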
Entropy: measures how much information a token carries on average. Higher entropy = more information per token = harder to predict.
Cross entropy H(P,Q): how difficult it is for a model (distribution Q) to predict data from distribution P.
- H(P, Q) = H(P) + KL(P || Q)
- Perfect model: KL = 0, so H(P,Q) = H(P)
- Training objective: minimize cross entropy
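The decomposition above can be checked numerically. The following is a minimal sketch in plain Python; the two distributions are made-up illustrative values, not data from the chapter, and everything is measured in bits.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "true" distribution P and model distribution Q over 3 tokens.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

h_p = entropy(p)            # 1.5 bits for this P
kl_pq = kl_divergence(p, q)
h_pq = cross_entropy(p, q)

# H(P, Q) = H(P) + KL(P || Q), up to floating-point error.
identity_holds = math.isclose(h_pq, h_p + kl_pq)

# A perfect model (Q = P) has KL = 0, so cross entropy equals entropy.
perfect_model_gap = cross_entropy(p, p) - h_p
```

Because KL is never negative, no model can push cross entropy below H(P); minimizing cross entropy is the same as minimizing KL(P || Q).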
BPC (bits-per-character): cross entropy normalized by characters instead of tokens; comparable across different tokenization schemes.
BPB (bits-per-byte): further normalized to bytes; most standardized unit.
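As a rough illustration of how the normalizations relate, the sketch below converts a per-token cross entropy into BPC and BPB. The sample text, the token count, and the cross-entropy value are all assumptions chosen to make the arithmetic easy to follow.

```python
import math

def nats_to_bits(h_nats):
    """Convert cross entropy from nats to bits."""
    return h_nats / math.log(2)

def bits_per_char(bits_per_token, n_tokens, n_chars):
    """Total bits spent on the text, divided by its character count."""
    return bits_per_token * n_tokens / n_chars

def bits_per_byte(bits_per_token, n_tokens, n_bytes):
    """Total bits spent on the text, divided by its UTF-8 byte count."""
    return bits_per_token * n_tokens / n_bytes

text = "café au lait"
n_chars = len(text)                   # 12 characters
n_bytes = len(text.encode("utf-8"))   # 13 bytes: "é" takes two bytes in UTF-8
n_tokens = 4                          # assumed tokenizer output, not a real tokenizer
bits_per_token = 6.0                  # assumed cross entropy in bits per token

bpc = bits_per_char(bits_per_token, n_tokens, n_chars)   # 24 / 12
bpb = bits_per_byte(bits_per_token, n_tokens, n_bytes)   # 24 / 13
```

A tokenizer that splits the same text into more tokens would usually report a lower per-token cross entropy, which is why per-character or per-byte numbers are the ones to compare across models.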
Perplexity (PPL): the exponential of cross entropy: 2^H(P,Q) when H is measured in bits, e^H(P,Q) when H is measured in nats (PyTorch and TensorFlow use nats)
- Intuitively: the average number of options the model has when predicting the next token
- PPL = 4 means the model is choosing among ~4 equally likely tokens on average
- Lower = better (less uncertainty)
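The "number of options" intuition can be reproduced in a few lines. This is a minimal sketch that takes the probabilities a model assigned to the tokens that actually occurred; the probability lists are made-up inputs, not output from a real model.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability assigned to each observed token.

    Computes exp(average negative log-likelihood), i.e. e^H with H in nats.
    """
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always spreads its belief over 4 equally likely tokens
# gives each observed token probability 0.25, so PPL is about 4.
uniform_over_4 = perplexity([0.25] * 8)

# A more confident (and correct) model has lower perplexity.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

Computing in bits instead (log base 2, then 2 to that power) yields the same perplexity; only the intermediate cross-entropy value changes.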
Perplexity rules of thumb:
- More structured data (HTML) → lower expected perplexity
- Larger vocabulary → higher perplexity
- Longer context → lower perplexity (more info to predict from)
- Post-training typically increases perplexity (model optimizes for tasks, not token prediction)
- Quantization can change perplexity in unexpected ways
Perplexity use cases:
- Proxy for model capability (before downstream evaluation)
- Detecting data contamination: unusually low PPL on a benchmark suggests the model saw that data during training
- Data deduplication: skip adding new data if its PPL is low (model already knows it)
- Detecting abnormal text: very high PPL = gibberish or unusual ideas
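The contamination, deduplication, and abnormal-text use cases all reduce to thresholding perplexity. The sketch below assumes per-document perplexities have already been computed; the function name, document IDs, scores, and both thresholds are hypothetical and would need tuning per model and domain.

```python
def triage_by_perplexity(ppl_by_doc, low=5.0, high=1000.0):
    """Bucket documents by perplexity using two illustrative thresholds."""
    buckets = {"possibly_seen": [], "typical": [], "abnormal": []}
    for doc_id, ppl in ppl_by_doc.items():
        if ppl < low:
            # Very predictable: candidate for contamination, or redundant data.
            buckets["possibly_seen"].append(doc_id)
        elif ppl > high:
            # Very surprising: gibberish or genuinely unusual text.
            buckets["abnormal"].append(doc_id)
        else:
            buckets["typical"].append(doc_id)
    return buckets

# Made-up scores for three hypothetical documents.
scores = {"benchmark_q1": 1.8, "news_article": 42.0, "keyboard_mash": 25000.0}
buckets = triage_by_perplexity(scores)
```

A low score is only a signal, not proof: highly structured or formulaic text can also be very predictable without ever having been in the training set.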
Exact Evaluation
Produces unambiguous, reproducible scores.
Functional Correctness
Evaluates whether the output does what it’s supposed to do.
- Best metric for code generation: run the code, check test cases
- pass@k: fraction of problems for which at least one of k generated code samples passes all test cases
- pass@1 ≤ pass@3 ≤ pass@10 in expectation (more samples = more chances)
- Also applicable to game bots, optimization tasks, workflow automation
- Benchmarks: HumanEval (OpenAI), MBPP (Google), Spider/BIRD-SQL/WikiSQL (text-to-SQL)
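One common way to estimate pass@k is the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n ≥ k samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The notes above do not specify an estimator, so treat this as one reasonable choice; the per-problem counts below are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed.

    Probability that a random size-k subset of the n samples contains
    at least one passing sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# One hypothetical problem: 10 samples generated, 3 passed all tests.
p1 = pass_at_k(10, 3, 1)    # 1 - 7/10
p3 = pass_at_k(10, 3, 3)    # 1 - 35/120
p10 = pass_at_k(10, 3, 10)  # all samples drawn, so at least one passes

# Three hypothetical problems as (n, c) pairs.
benchmark_pass_at_1 = mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

The estimate rises with k for a fixed (n, c), which matches the ordering of pass@1, pass@3, and pass@10 noted in the list.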
HumanEval example: generate a gcd(num1, num2) function → run against 7 test cases to verify.
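A functional-correctness check for the gcd example might look like the sketch below. The candidate solutions and the four test cases are made up for illustration (they are not the benchmark's actual tests), and `passes_all` is a hypothetical helper. Running generated code with `exec` is unsafe outside a sandbox; this sketch omits sandboxing and timeouts for brevity.

```python
# A hypothetical model-generated solution, held as source text.
candidate_source = '''
def gcd(num1, num2):
    while num2:
        num1, num2 = num2, num1 % num2
    return abs(num1)
'''

# A hypothetical buggy solution for contrast.
buggy_source = '''
def gcd(num1, num2):
    return min(num1, num2)
'''

# Illustrative (args, expected) pairs.
test_cases = [((12, 18), 6), ((17, 5), 1), ((0, 9), 9), ((100, 75), 25)]

def passes_all(source, func_name, cases):
    """Return True only if the generated function passes every test case."""
    namespace = {}
    try:
        exec(source, namespace)  # WARNING: sandbox this in real use
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

candidate_ok = passes_all(candidate_source, "gcd", test_cases)
buggy_ok = passes_all(buggy_source, "gcd", test_cases)
```

The result is binary per sample, which is what makes it easy to aggregate into pass@k.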
Similarity Measurements Against Reference Data
Four ways to measure similarity:
- Human/AI judgment
- Exact match
- Lexical similarity
- Semantic similarity
Exact match: binary; works only for short, unambiguous responses (math answers, trivia). Variant: accept output containing the reference. Fails for long-form tasks (many valid phrasings of the same idea).
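Exact match and its "contains the reference" variant can be sketched as follows. The normalization steps (lowercasing, stripping punctuation, collapsing whitespace) are an assumption; real benchmarks each define their own rules.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(output, reference):
    """Strict variant: the normalized strings must be identical."""
    return normalize(output) == normalize(reference)

def contains_match(output, reference):
    """Lenient variant: the normalized output must contain the reference."""
    return normalize(reference) in normalize(output)

strict = exact_match("The answer is 42.", "42")       # False
lenient = contains_match("The answer is 42.", "42")   # True
false_positive = contains_match("142", "42")          # True, although 142 is wrong
```

The last line shows why the lenient variant needs care: substring containment can accept outputs that merely include the reference characters.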
Lexical similarity: overlap of tokens/n-grams.
- Fuzzy matching / edit distance: Levenshtein distance (deletion, insertion, substitution; some include transposition)
- N-gram similarity: what % of n-grams from reference appear in generated text
- Common metrics: BLEU, ROUGE, METEOR++, TER, CIDEr
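- Limitation: good responses not in the reference set get low scores (Adept found its Fuyu model scored poorly on a benchmark because some correct answers were missing from the reference data). Also: BLEU doesn’t correlate well with functional correctness in coding.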
Semantic similarity (embedding similarity):
- Convert texts to embedding vectors → compute cosine similarity
- “What’s up?” and “How are you?” are lexically different but semantically similar
- Metrics: BERTScore (BERT embeddings), MoverScore
- Requires good embedding model; different embedding models give different results
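Once texts are embedded, the similarity computation itself is short. The sketch below uses tiny hand-written vectors as stand-ins for real embeddings; the values are invented, and an actual embedding model would produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" (not produced by any real model).
whats_up = [0.9, 0.1, 0.2]
how_are_you = [0.8, 0.2, 0.3]
stock_prices = [0.1, 0.9, 0.1]

greeting_similarity = cosine_similarity(whats_up, how_are_you)
unrelated_similarity = cosine_similarity(whats_up, stock_prices)
```

Because the score depends entirely on the vectors, swapping the embedding model changes the numbers, so scores from different embedding models should not be compared directly.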
Introduction to Embedding
- Embedding = vector representation that captures meaning
- Size: 100 to 10,000 dimensions
- Common models: BERT (768/1024 dims), OpenAI text-embedding-3-small (1536), CLIP (512)
- MTEB (Massive Text Embedding Benchmark) evaluates embedding quality across tasks
- Multimodal embeddings: CLIP maps text+images to same space; ULIP adds 3D point clouds; ImageBind handles 6 modalities
AI as a Judge
Using AI to evaluate AI outputs. Also called LLM as a judge.
Why it works:
- Fast, cheap, flexible — can evaluate any criteria
- Doesn’t require reference data → usable in production
- GPT-4 ↔ human agreement: 85% (vs. human-human: 81%) on MT-Bench
- AlpacaEval: 0.98 correlation between AI judges and LMSYS Chatbot Arena human rankings
- Can explain its decisions, useful for auditing
How to use AI judges:
- Evaluate response quality given the question (1-5 scale)
- Compare generated vs. reference response (True/False)
- Compare two generated responses (A vs. B): useful for preference data, test-time compute, and comparative evaluation
Prompting tips:
- Clearly state task, criteria, and scoring system
- AI judges work better with classification than numerical scoring
- Discrete (1-5) > continuous (0.0-1.0) for numerical scoring
- Include examples of each score level with justifications
- Include rubric with examples → consistency improves from 65% to 77.5%
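The prompting tips can be combined into a prompt template plus a strict parser for the judge's reply. This sketch makes no API call; the criterion wording, the example record, and both function names are hypothetical.

```python
import re

def build_judge_prompt(question, answer, scored_examples):
    """Assemble a judge prompt stating task, criterion, scale, and scored examples."""
    lines = [
        "You are grading an answer to a question.",
        "Criterion: factual accuracy and relevance to the question.",
        "Scoring: an integer from 1 (unusable) to 5 (excellent).",
        "",
        "Examples:",
    ]
    for ex in scored_examples:
        lines += [
            f"Question: {ex['question']}",
            f"Answer: {ex['answer']}",
            f"Reasoning: {ex['why']}",
            f"Score: {ex['score']}",
            "",
        ]
    lines += [
        f"Question: {question}",
        f"Answer: {answer}",
        "Give brief reasoning, then end with 'Score: <integer>'.",
    ]
    return "\n".join(lines)

def parse_score(judge_output, low=1, high=5):
    """Return the last 'Score: N' in the reply, or None if missing or out of range."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if low <= score <= high else None

# A made-up scored example to anchor the top of the scale.
examples = [{"question": "What is 2 + 2?", "answer": "4",
             "why": "Correct and directly answers the question.", "score": 5}]
prompt = build_judge_prompt("What is the capital of France?", "Paris", examples)
parsed = parse_score("Reasoning: correct and concise.\nScore: 5")
```

Returning None for unparseable or out-of-range replies makes judge failures visible instead of silently turning them into scores.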
Limitations of AI Judges
Inconsistency: Same judge, same input → potentially different scores. Mitigate (not fully eliminate) by setting temperature=0, including examples, and pinning a fixed model version.
Criteria ambiguity: MLflow, Ragas, LlamaIndex all have “faithfulness” criterion but use different prompts and scoring systems — their scores are not comparable. Rule: don’t trust AI judges without knowing the model and prompt.
Cost and latency: Using GPT-4 for both generation and evaluation → 2× API calls (one to generate, one to judge). Three criteria → 4× calls (one generation plus three evaluations). Mitigate with: spot-checking (evaluate subset), weaker judge models.
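The call-count arithmetic, and the effect of spot-checking, can be made explicit. The sketch assumes one judge call per criterion per judged response; `calls_per_response` is a hypothetical helper, and it counts calls rather than tokens or dollars.

```python
def calls_per_response(n_criteria, judged_fraction=1.0):
    """Expected API calls per generated response.

    One call to generate, plus one judge call per criterion on the
    fraction of responses that are actually evaluated.
    """
    return 1 + n_criteria * judged_fraction

one_criterion = calls_per_response(1)                              # 2 calls
three_criteria = calls_per_response(3)                             # 4 calls
spot_checked = calls_per_response(3, judged_fraction=0.01)         # about 1.03 calls
```

Judging 1% of responses on three criteria adds roughly 3% to the call volume, versus 300% when every response is judged.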
Biases:
- Self-bias: models favor their own outputs (Claude-v1: 25% higher win rate for itself; GPT-4: 10%)
- First-position bias: favors first answer in pairwise comparison (opposite of humans’ recency bias)
- Verbosity bias: favors longer responses even when shorter ones are correct; GPT-4 is less prone than GPT-3.5
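One common mitigation for first-position bias, not described in the notes above, is to run each pairwise comparison twice with the order swapped and accept a winner only when both runs agree. The sketch below assumes a `judge` callable that returns "first" or "second"; both stub judges are hypothetical stand-ins for a real model call.

```python
def debiased_compare(judge, prompt, resp_a, resp_b):
    """Compare A and B in both orders; disagreement counts as a tie."""
    run1 = judge(prompt, resp_a, resp_b)  # A shown first
    run2 = judge(prompt, resp_b, resp_a)  # B shown first
    if run1 == "first" and run2 == "second":
        return "A"
    if run1 == "second" and run2 == "first":
        return "B"
    return "tie"

def always_first(prompt, first, second):
    """Stub judge with extreme first-position bias."""
    return "first"

def prefers_longer(prompt, first, second):
    """Stub judge with verbosity bias: picks the longer response."""
    return "first" if len(first) >= len(second) else "second"

position_biased = debiased_compare(always_first, "q", "short", "a much longer answer")
verbosity_biased = debiased_compare(prefers_longer, "q", "short", "a much longer answer")
```

Swapping neutralizes the position-biased stub (it yields a tie) but not the verbosity-biased one (B wins in both orders), so each bias needs to be measured and handled separately.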
What Models Can Act as Judges?
- Stronger judge: better judgment; use a fast/cheap model to generate, stronger model to evaluate subset
- Self-evaluation (self-critique): can catch obvious errors and prompt revision; useful for sanity checks
- Weaker judge: judging is easier than generating (anyone can judge a song; not everyone can write one); small specialized judges can outperform large general judges for specific tasks
Specialized judge types:
- Reward models (e.g., Google Cappy, 360M params): score (prompt, response) pairs — 0 to 1
- Reference-based judges (e.g., BLEURT, Prometheus): compare generated vs. reference
- Preference models (e.g., PandaLM, JudgeLM): predict which of two responses users prefer
Ranking Models with Comparative Evaluation
Pointwise evaluation: score each model independently → rank by scores.
Comparative evaluation: pit models against each other → compute ranking from match outcomes.
Comparative is generally easier for subjective outputs — easier to say “A is better than B” than to give A a score.
Process: for each prompt, pick two models → evaluator votes for winner → aggregate win rates → rating algorithm (Elo, Bradley-Terry, TrueSkill) → ranking.
LMSYS Chatbot Arena: anyone visits the site, enters a prompt, gets responses from two anonymous models, votes, then model names revealed. Bradley-Terry (not Elo) used because Elo is sensitive to ordering.
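To make the rating step concrete, here is a minimal Bradley-Terry fit using the standard iterative (minorization-maximization) update. It is a sketch of the general algorithm, not the leaderboard's actual implementation, and the win counts are made up. It assumes every model has at least one win.

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times model i beat model j.
    Returns strengths normalized to sum to 1.
    """
    models = list(wins)
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[i].get(j, 0) for j in models if j != i)
            # Each opponent contributes (games played) / (sum of current strengths).
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength

# Hypothetical match outcomes among three models (10 matches per pair).
wins = {
    "A": {"B": 7, "C": 8},
    "B": {"A": 3, "C": 6},
    "C": {"A": 2, "B": 4},
}
strengths = bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

The fit uses only aggregate win counts, so the order in which matches happened cannot affect the result; an online Elo update, by contrast, processes matches one at a time. Under this model the predicted probability that i beats j is strength[i] / (strength[i] + strength[j]).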
Challenges:
- Scalability: n models → n(n-1)/2 model pairs; grows quadratically. LMSYS (Jan 2024): 57 models → 1,596 pairs; 244K comparisons ≈ 153 per pair (low)
- Transitivity assumption: if A > B and B > C then A > C — may not hold for human preferences or when different evaluators/prompts are used
- Quality control: crowdsourced votes may not fact-check; simple prompts (“hello”) don’t differentiate models; no support for RAG context
- Comparative ≠ absolute: B winning 51% of its matches against A doesn’t tell you how much better B is, or whether either model is “good enough”
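The scalability figures in the list above follow from simple arithmetic, which the short sketch below reproduces using the model and comparison counts given in the notes. It assumes comparisons are spread evenly across pairs, which real traffic is not.

```python
def n_pairs(n_models):
    """Number of distinct model pairs: n(n-1)/2."""
    return n_models * (n_models - 1) // 2

pairs = n_pairs(57)                        # 1596 pairs
comparisons_per_pair = 244_000 / pairs     # roughly 153 on average

# Quadratic growth: doubling the number of models roughly quadruples the pairs.
pairs_if_doubled = n_pairs(114)
```

With uneven traffic, some pairs will have far fewer than the average number of comparisons, which makes their relative ranking less certain.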
Future: comparative evaluation scales better than benchmarks (never saturates); hard to game; remains relevant as models surpass human generation ability.
Vs. A/B testing: A/B testing = one model per user at a time; comparative = both models shown simultaneously.
Key Takeaways
- Evaluation is the biggest bottleneck in AI adoption; invest systematically, not ad hoc
- Language modeling metrics (perplexity, cross entropy) are cheap proxies for model capability but become less reliable after post-training
- Exact evaluation (functional correctness, lexical/semantic similarity) is deterministic; AI as a judge is subjective but flexible and increasingly common
- AI judges are only as good as the model + prompt + scoring system used; always version and lock them
- Biases in AI judges (self-bias, first-position, verbosity) must be measured and mitigated
- Comparative evaluation via leaderboards is hard to game and never saturates — valuable supplement to benchmarks
- Combine methods: cheap classifier on 100% + expensive AI judge on 1% for cost-effective production evaluation