Chapter 3: Evaluation Methodology

Why Evaluation is Hard for Foundation Models

Increasing intelligence → harder to evaluate: Evaluating PhD-level math solutions requires PhD-level expertise; verifying complex summaries requires reading the source
Open-ended outputs: No exhaustive list of correct responses; can’t compare against ground truths like close-ended tasks
Black-box models: Training data and architecture are often hidden; only outputs are observable
Rapidly saturating benchmarks: GLUE (2018) → SuperGLUE (2019) → NaturalInstructions (2021) → Super-NaturalInstructions (2022) → MMLU → MMLU-Pro
Expanded scope: General-purpose models must be evaluated across all possible tasks, including ones they could do that we haven’t discovered yet

Investment gap: Evaluation tools represent a small fraction of AI engineering repos vs. modeling and orchestration tools.

Language Modeling Metrics

All four metrics (cross entropy, perplexity, BPC, BPB) are interconvertible.

Entropy: measures how much information a token carries on average. Higher entropy = more information per token = harder to predict.

Cross entropy H(P,Q): how difficult it is for a model (distribution Q) to predict data from distribution P.

H(P, Q) = H(P) + KL(P || Q)
Perfect model: KL = 0, so H(P,Q) = H(P)
Training objective: minimize cross entropy
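
The decomposition above can be checked numerically. This is a minimal sketch in which the two distributions `p` and `q` are made-up values chosen only for illustration:

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits; same assumption on q as above."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical "true" data distribution P
q = [0.25, 0.25, 0.5]  # hypothetical model distribution Q

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

# H(P, Q) equals H(P) + KL(P || Q), up to floating-point error.
print(h_p, kl, h_pq)
```

When Q matches P exactly, `kl_divergence(p, p)` is 0 and the cross entropy collapses to the entropy of the data.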

BPC (bits-per-character): cross entropy normalized by characters instead of tokens; comparable across different tokenization schemes.

BPB (bits-per-byte): further normalized to bytes; most standardized unit.
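
A rough sketch of how the three normalizations relate. It assumes the total cross entropy of a text is already known in bits, and it counts bytes using UTF-8; the function name, the sample text, and the numbers are all hypothetical:

```python
def normalize_bits(total_bits, n_tokens, text):
    """Express one total cross entropy (in bits) per token, per character, and per byte."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))  # assumption: bytes are counted in UTF-8
    return {
        "bits_per_token": total_bits / n_tokens,
        "bpc": total_bits / n_chars,
        "bpb": total_bits / n_bytes,
    }

# Hypothetical numbers: 10 characters, 12 UTF-8 bytes, split into 5 tokens.
sample_text = "na\u00efve caf\u00e9"
stats = normalize_bits(total_bits=30.0, n_tokens=5, text=sample_text)
print(stats)
```

Only the per-token figure depends on the tokenizer. The per-character and per-byte figures depend on the text alone, which is why they can be compared across tokenization schemes.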

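Perplexity (PPL): the exponential of cross entropy. PPL = 2^H(P,Q) when H is measured in bits, or e^H(P,Q) when it is measured in nats (the unit PyTorch and TensorFlow use)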

Intuitively: the average number of options the model has when predicting the next token
PPL = 4 means the model is choosing among ~4 equally likely tokens on average
Lower = better (less uncertainty)
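
The "number of options" intuition can be reproduced with a small sketch. The probabilities below are invented; in practice they would be the probabilities a model assigned to each token that actually occurred:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability the model gave each observed token."""
    avg_nll_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll_nats)  # e^H, with H measured in nats

uniform_over_four = [0.25] * 8  # model is always torn between 4 equally likely tokens
confident = [0.9] * 8           # model is nearly sure of every token

ppl_uniform = perplexity(uniform_over_four)
ppl_confident = perplexity(confident)
print(ppl_uniform, ppl_confident)

# The same cross entropy in bits or nats gives the same perplexity.
h_bits = 2.0
h_nats = h_bits * math.log(2)
print(2 ** h_bits, math.exp(h_nats))
```

The uniform case comes out at roughly 4, and the confident model gets a perplexity close to 1.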

Perplexity rules of thumb:

More structured data (HTML) → lower expected perplexity
Larger vocabulary → higher perplexity
Longer context → lower perplexity (more info to predict from)
Post-training typically increases perplexity (model optimizes for tasks, not token prediction)
Quantization can change perplexity in unexpected ways

Perplexity use cases:

Proxy for model capability (before downstream evaluation)
Detecting data contamination: low PPL on a benchmark → the model likely saw that data during training
Data deduplication: skip adding new data if its PPL is low (model already knows it)
Detecting abnormal text: very high PPL = gibberish or unusual ideas
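
The deduplication use case boils down to a threshold filter. In this sketch, `perplexity_fn`, the threshold, and the scores are all hypothetical stand-ins; a real setup would compute perplexity with the model being trained:

```python
def select_novel_documents(docs, perplexity_fn, threshold):
    """Keep only documents the model finds surprising (perplexity at or above threshold)."""
    return [doc for doc in docs if perplexity_fn(doc) >= threshold]

# Hypothetical, precomputed perplexities standing in for a real model call.
fake_scores = {
    "boilerplate the model has seen many times": 3.2,
    "text from a new domain": 41.0,
}

kept = select_novel_documents(list(fake_scores), fake_scores.get, threshold=20.0)
print(kept)
```

The same comparison with the inequality flipped flags candidates for contamination checks (suspiciously low perplexity on benchmark data).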

Exact Evaluation

Produces unambiguous, reproducible scores.

Functional Correctness

Evaluates whether the output does what it’s supposed to do.

Best metric for code generation: run the code, check test cases
pass@k: fraction of problems solved when k code samples are generated per problem; a problem counts as solved if at least one of its k samples passes all test cases
- pass@1 ≤ pass@3 ≤ pass@10 (more samples = more chances)
Also applicable to game bots, optimization tasks, workflow automation
Benchmarks: HumanEval (OpenAI), MBPP (Google), Spider/BIRD-SQL/WikiSQL (text-to-SQL)
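
One common way to estimate pass@k is the estimator described in the HumanEval paper: generate n ≥ k samples per problem, count the c samples that pass, and compute 1 − C(n−c, k) / C(n, k). The notes above don't specify an estimator, so treat this as one reasonable sketch rather than the only definition; the sample counts are made up:

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed (requires k <= n)."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of k, so some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 passed all tests.
p1 = pass_at_k(n=10, c=3, k=1)
p3 = pass_at_k(n=10, c=3, k=3)
p10 = pass_at_k(n=10, c=3, k=10)
print(p1, p3, p10)
```

The benchmark-level score is the mean of this per-problem value across all problems.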

HumanEval example: generate a gcd(num1, num2) function → run against 7 test cases to verify.
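
A toy version of that check is sketched below. `candidate_gcd` stands in for model-generated code, and the four test cases are invented for illustration (they are not the benchmark's actual cases):

```python
def candidate_gcd(num1, num2):
    """Stand-in for a model-generated solution (Euclid's algorithm)."""
    while num2:
        num1, num2 = num2, num1 % num2
    return abs(num1)

# Hypothetical test cases: (arguments, expected result).
test_cases = [((12, 18), 6), ((17, 5), 1), ((0, 9), 9), ((100, 75), 25)]

def passes_all(func, cases):
    """Functional correctness: every case must return the expected value without raising."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True

print(passes_all(candidate_gcd, test_cases))
```

Real harnesses run generated code in a sandbox with time limits, since it is untrusted; this sketch skips that for brevity.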

Similarity Measurements Against Reference Data

Four ways to measure similarity:

Human/AI judgment
Exact match
Lexical similarity
Semantic similarity

Exact match: binary; works only for short, unambiguous responses (math answers, trivia). Variant: accept output containing the reference. Fails for long-form tasks (many valid phrasings of the same idea).
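
Both the strict check and the "contains the reference" variant fit in a few lines. The normalization step (lowercasing, collapsing whitespace) is an assumption of this sketch, not something the notes prescribe:

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())

def exact_match(output, reference):
    return normalize(output) == normalize(reference)

def contains_match(output, reference):
    """Variant: accept any output that contains the reference."""
    return normalize(reference) in normalize(output)

print(exact_match("Paris", " paris "))           # same answer, different formatting
print(exact_match("The answer is 5", "5"))       # strict match rejects the extra words
print(contains_match("The answer is 5", "5"))    # the variant accepts it
print(contains_match("The answer is 15", "5"))   # but it also accepts a wrong answer
```

The last line shows why the containment variant needs care: substring matching can accept outputs that merely include the reference characters.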

Lexical similarity: overlap of tokens/n-grams.

Fuzzy matching / edit distance: Levenshtein distance (deletion, insertion, substitution; some include transposition)
N-gram similarity: what % of n-grams from reference appear in generated text
Common metrics: BLEU, ROUGE, METEOR++, TER, CIDEr
Limitation: good responses that are missing from the reference set get low scores (see Adept’s Fuyu case study). Also, BLEU doesn’t correlate with functional correctness in coding.
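
Both lexical measures can be sketched in plain Python. The edit distance here allows deletion, insertion, and substitution only (no transposition), and the n-gram score follows the definition above (share of reference n-grams found in the generated text). Whitespace tokenization and the example sentences are assumptions for illustration:

```python
def levenshtein(a, b):
    """Edit distance with deletion, insertion, and substitution (no transposition)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

def ngram_overlap(reference, generated, n=2):
    """Fraction of the reference's n-grams that also appear in the generated text."""
    ref, gen = reference.split(), generated.split()
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    gen_ngrams = {tuple(gen[i:i + n]) for i in range(len(gen) - n + 1)}
    if not ref_ngrams:
        return 0.0
    return sum(g in gen_ngrams for g in ref_ngrams) / len(ref_ngrams)

distance = levenshtein("kitten", "sitting")
overlap = ngram_overlap("the cat sat on the mat", "the cat lay on the mat")
print(distance, overlap)
```

Changing one word out of six costs two of the five reference bigrams, which illustrates how sharply n-gram scores penalize valid rephrasings.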

Semantic similarity (embedding similarity):

Convert texts to embedding vectors → compute cosine similarity
“What’s up?” and “How are you?” are lexically different but semantically similar
Metrics: BERTScore (BERT embeddings), MoverScore
Requires good embedding model; different embedding models give different results
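
Once two texts are embedded, the comparison itself is just cosine similarity. The vectors below are tiny made-up examples; real ones would come from an embedding model and have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings".
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.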

Introduction to Embedding

Embedding = vector representation that captures meaning
Size: 100 to 10,000 dimensions
Common models: BERT (768/1024 dims), OpenAI text-embedding-3-small (1536), CLIP (512)
MTEB (Massive Text Embedding Benchmark) evaluates embedding quality across tasks
Multimodal embeddings: CLIP maps text+images to same space; ULIP adds 3D point clouds; ImageBind handles 6 modalities

AI as a Judge

Using AI to evaluate AI outputs. Also called LLM as a judge.

Why it works:

Fast, cheap, flexible — can evaluate any criteria
Doesn’t require reference data → usable in production
GPT-4 ↔ human agreement: 85% (vs. human-human: 81%) on MT-Bench
AlpacaEval: 0.98 correlation between AI judges and LMSYS Chatbot Arena human rankings
Can explain its decisions, useful for auditing

How to use AI judges:

Evaluate response quality given the question (1-5 scale)
Compare generated vs. reference response (True/False)
Compare two generated responses (A vs. B) — useful for preference data, test time compute, comparative evaluation

Prompting tips:

Clearly state task, criteria, and scoring system
AI judges work better with classification than numerical scoring
Discrete (1-5) > continuous (0.0-1.0) for numerical scoring
Include examples of each score level with justifications
Include rubric with examples → consistency improves from 65% to 77.5%
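
As an illustration of these tips, a judge prompt can be assembled from a task statement, a criterion, a discrete scale, and a rubric with examples. The function name, the wording of the prompt, and the rubric content are all hypothetical; no particular judge framework is implied:

```python
def build_judge_prompt(question, response, criterion, rubric):
    """rubric maps each integer score to (description, example, justification)."""
    lines = [
        "You are evaluating a response to a question.",
        f"Criterion: {criterion}",
        f"Scoring system: reply with a single integer from {min(rubric)} to {max(rubric)}.",
        "Rubric:",
    ]
    for score in sorted(rubric):
        description, example, justification = rubric[score]
        lines.append(f"  {score}: {description}")
        lines.append(f"     Example: {example}")
        lines.append(f"     Why: {justification}")
    lines += ["", f"Question: {question}", f"Response: {response}", "Score:"]
    return "\n".join(lines)

# Hypothetical rubric covering three points of a 1-5 scale.
rubric = {
    1: ("Off topic", "Q: capital of France? A: I like cheese.", "Ignores the question."),
    3: ("Partially relevant", "Q: capital of France? A: France is in Europe.", "Related but no answer."),
    5: ("Fully relevant", "Q: capital of France? A: Paris.", "Answers directly."),
}

prompt = build_judge_prompt("What is the capital of Japan?", "Tokyo.", "relevance", rubric)
print(prompt)
```

Keeping the prompt in code makes it straightforward to version alongside the judge model name.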

Limitations of AI Judges

Inconsistency: Same judge, same input → potentially different scores. Fix: set temperature=0, include examples, use fixed model version.

Criteria ambiguity: MLflow, Ragas, and LlamaIndex all have a “faithfulness” criterion but use different prompts and scoring systems, so their scores are not comparable. Rule: don’t trust AI judges without knowing the model and prompt.

Cost and latency: Using GPT-4 for both generation and evaluation → 2× API cost. Three criteria → 4× calls. Mitigate with: spot-checking (evaluate subset), weaker judge models.
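
The call-count arithmetic, including the effect of spot-checking, can be written out directly. This sketch assumes one judge call per criterion and that a fixed fraction of responses is judged; both the function name and the 10% sample rate are illustrative:

```python
def calls_per_response(n_criteria, sample_rate=1.0):
    """Average model calls per user request: 1 generation + judged criteria."""
    return 1 + sample_rate * n_criteria

full = calls_per_response(n_criteria=3)                       # judge every response
spot_checked = calls_per_response(n_criteria=3, sample_rate=0.1)  # judge 10% of responses
print(full, spot_checked)
```

Judging every response on three criteria quadruples the calls, while spot-checking a tenth of them adds about 30% on top of generation.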

Biases:

Self-bias: models favor their own outputs (Claude-v1: 25% higher win rate for itself; GPT-4: 10%)
First-position bias: favors first answer in pairwise comparison (opposite of humans’ recency bias)
Verbosity bias: favors longer responses even when shorter ones are correct; GPT-4 is less prone than GPT-3.5
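
One simple way to measure first-position bias is to run each pairwise comparison twice with the order swapped and see whether the verdict survives. The swap check is a common practice rather than something these notes prescribe, and the `judge` callables below are toy stand-ins for a real model call:

```python
def order_robust_compare(judge, prompt, answer_a, answer_b):
    """Ask the judge in both orders; return 'A', 'B', or 'tie' if the verdict flips."""
    verdict_ab = judge(prompt, answer_a, answer_b)  # returns "first" or "second"
    verdict_ba = judge(prompt, answer_b, answer_a)
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else "tie"

def always_first_judge(prompt, first, second):
    """Toy judge with pure first-position bias."""
    return "first"

def longer_is_better_judge(prompt, first, second):
    """Toy judge with pure verbosity bias (order-independent)."""
    return "first" if len(first) >= len(second) else "second"

biased = order_robust_compare(always_first_judge, "q", "short", "a much longer answer")
stable = order_robust_compare(longer_is_better_judge, "q", "short", "a much longer answer")
print(biased, stable)
```

The share of comparisons that come back as "tie" gives a rough position-bias rate for a judge. Note that the check does nothing about verbosity bias: the length-loving toy judge passes it cleanly.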

What Models Can Act as Judges?

Stronger judge: better judgment; use a fast/cheap model to generate, stronger model to evaluate subset
Self-evaluation (self-critique): can catch obvious errors and prompt revision; useful for sanity checks
Weaker judge: judging is easier than generating (anyone can judge a song; not everyone can write one); small specialized judges can outperform large general judges for specific tasks

Specialized judge types:

Reward models (e.g., Google Cappy, 360M params): score (prompt, response) pairs — 0 to 1
Reference-based judges (e.g., BLEURT, Prometheus): compare generated vs. reference
Preference models (e.g., PandaLM, JudgeLM): predict which of two responses users prefer

Ranking Models with Comparative Evaluation

Pointwise evaluation: score each model independently → rank by scores.

Comparative evaluation: pit models against each other → compute ranking from match outcomes.

Comparative is generally easier for subjective outputs — easier to say “A is better than B” than to give A a score.

Process: for each prompt, pick two models → evaluator votes for winner → aggregate win rates → rating algorithm (Elo, Bradley-Terry, TrueSkill) → ranking.
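
The last step of that pipeline can be sketched with a basic Bradley-Terry fit, using the standard iterative update strength_i ← wins_i / Σ_j n_ij / (strength_i + strength_j). The match counts are invented, and the sketch assumes every model has at least one win (otherwise its strength collapses to zero):

```python
def bradley_terry(wins, n_iters=200):
    """wins[(i, j)] = number of matches in which model i beat model j."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iters):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}  # keep strengths summing to 1
    return strength

# Hypothetical match outcomes among three models, 10 matches per pair.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}

strength = bradley_terry(wins)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking, strength)
```

Under this model, the predicted probability that A beats B is `strength["A"] / (strength["A"] + strength["B"])`. Because the fit uses all match counts at once, the result does not depend on the order in which matches were played.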

LMSYS Chatbot Arena: anyone can visit the site, enter a prompt, get responses from two anonymous models, and vote; the model names are revealed only after the vote. Bradley-Terry (not Elo) is used because Elo is sensitive to the order of the comparisons.

Challenges:

Scalability: n models → n(n-1)/2 model pairs; grows quadratically. LMSYS (Jan 2024): 57 models (1,596 pairs) with 244K comparisons ≈ 153 comparisons per pair (low)
Transitivity assumption: if A > B and B > C then A > C — may not hold for human preferences or when different evaluators/prompts are used
Quality control: crowdsourced votes may not fact-check; simple prompts (“hello”) don’t differentiate models; no support for RAG context
Comparative ≠ absolute: knowing that B wins 51% of its matches against A doesn’t tell you how much better B is or whether either model is “good enough”
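
The scalability figures can be reproduced with a few lines of arithmetic, using the LMSYS numbers quoted in these notes (57 models, 244K comparisons):

```python
def n_model_pairs(n_models):
    """Number of distinct pairs among n models: n(n-1)/2."""
    return n_models * (n_models - 1) // 2

pairs = n_model_pairs(57)
avg_per_pair = 244_000 / pairs
print(pairs, round(avg_per_pair))

# Quadratic growth: doubling the number of models roughly quadruples the pairs.
print(n_model_pairs(10), n_model_pairs(20))
```

With 1,596 pairs, 244K votes average out to only about 153 comparisons per pair.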

Future: comparative evaluation scales better than benchmarks (never saturates); hard to game; remains relevant as models surpass human generation ability.

Vs. A/B testing: A/B testing = one model per user at a time; comparative = both models shown simultaneously.

Key Takeaways

Evaluation is the biggest bottleneck in AI adoption; invest systematically, not ad hoc
Language modeling metrics (perplexity, cross entropy) are cheap proxies for model capability, but they become less reliable after post-training
Exact evaluation (functional correctness, lexical/semantic similarity) is deterministic; AI as a judge is subjective but flexible and increasingly common
AI judges are only as good as the model + prompt + scoring system used; always version and lock them
Biases in AI judges (self-bias, first-position, verbosity) must be measured and mitigated
Comparative evaluation via leaderboards is hard to game and never saturates — valuable supplement to benchmarks
Combine methods: cheap classifier on 100% + expensive AI judge on 1% for cost-effective production evaluation

Study Notes by Niladri & AI
