Chapter 3: Evaluation Methodology

Why Evaluation is Hard for Foundation Models

  1. Increasing intelligence → harder to evaluate: Evaluating PhD-level math solutions requires PhD-level expertise; verifying complex summaries requires reading the source
  2. Open-ended outputs: No exhaustive list of correct responses; can’t compare against ground truths like close-ended tasks
  3. Black-box models: Training data and architecture are often hidden; only outputs are observable
  4. Rapidly saturating benchmarks: each gets replaced by a harder successor: GLUE (2018) → SuperGLUE (2019); NaturalInstructions (2021) → Super-NaturalInstructions (2022); MMLU (2020) → MMLU-Pro (2024)
  5. Expanded scope: General-purpose models must be evaluated across all possible tasks, including ones they could do that we haven’t discovered yet

Investment gap: Evaluation tools represent a small fraction of AI engineering repos vs. modeling and orchestration tools.


Language Modeling Metrics

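All four metrics (cross entropy, perplexity, BPC, BPB) are interconvertible: given any one of them, plus the text's average characters per token and bytes per character, the others can be computed.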

Entropy: measures how much information a token carries on average. Higher entropy = more information per token = harder to predict.

Cross entropy H(P,Q): how difficult it is for a model (distribution Q) to predict data from distribution P.

  • H(P, Q) = H(P) + KL(P || Q)
  • Perfect model: KL = 0, so H(P,Q) = H(P)
  • Training objective: minimize cross entropy
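The decomposition above can be checked numerically. The following is a minimal sketch over a toy 3-token vocabulary; the two distributions are made-up numbers chosen only for illustration.

```python
import math

def entropy(p):
    """H(P) in bits: average information per outcome drawn from P."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: cost of predicting data from P using model Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits: extra cost incurred because Q differs from P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions (illustrative values, not real data).
P = [0.5, 0.25, 0.25]   # true distribution
Q = [0.25, 0.25, 0.5]   # model's distribution

h_p = entropy(P)
h_pq = cross_entropy(P, Q)
kl = kl_divergence(P, Q)

print(f"H(P)={h_p}, KL(P||Q)={kl}, H(P,Q)={h_pq}")
```

Here H(P) = 1.5 bits and KL(P || Q) = 0.25 bits, which add up to H(P, Q) = 1.75 bits. Replacing Q with P drives the KL term to zero, matching the "perfect model" case.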

BPC (bits-per-character): cross entropy normalized by characters instead of tokens; comparable across different tokenization schemes.

BPB (bits-per-byte): further normalized to bytes; most standardized unit.

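Perplexity (PPL): the exponential of cross entropy: 2^H(P,Q) when H is measured in bits, or e^H(P,Q) when it is measured in nats (the unit PyTorch and TensorFlow use)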

  • Intuitively: the average number of options the model has when predicting the next token
  • PPL = 4 means the model is choosing among ~4 equally likely tokens on average
  • Lower = better (less uncertainty)
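To make the conversions between the four metrics concrete, here is a small sketch. The per-token probabilities and the corpus statistics (`chars_per_token`, `bytes_per_char`) are assumed values for illustration, not measurements.

```python
import math

# Hypothetical probabilities a model assigned to each actual next token.
token_probs = [0.25, 0.25, 0.25, 0.25]

# Cross entropy = average negative log-likelihood per token.
ce_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
ce_bits = ce_nats / math.log(2)        # convert nats to bits

# Perplexity is the same number whichever unit is used.
ppl_from_nats = math.exp(ce_nats)
ppl_from_bits = 2 ** ce_bits

# Assumed corpus statistics (illustrative only).
chars_per_token = 2.0
bytes_per_char = 1.0                   # e.g. ASCII text stored as UTF-8

bpc = ce_bits / chars_per_token        # bits per character
bpb = bpc / bytes_per_char             # bits per byte

print(ce_bits, ppl_from_bits, bpc, bpb)
```

With every token predicted at probability 0.25, the cross entropy is 2 bits per token and the perplexity is 4, which matches the "choosing among ~4 equally likely tokens" intuition.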

Perplexity rules of thumb:

  • More structured data (HTML) → lower expected perplexity
  • Larger vocabulary → higher perplexity
  • Longer context → lower perplexity (more info to predict from)
  • Post-training typically increases perplexity (model optimizes for tasks, not token prediction)
  • Quantization can change perplexity in unexpected ways

Perplexity use cases:

  • Proxy for model capability (before downstream evaluation)
  • Detecting data contamination: unusually low PPL on a benchmark suggests the model saw that data during training
  • Data deduplication: skip adding new data if its PPL is low (model already knows it)
  • Detecting abnormal text: very high PPL = gibberish or unusual ideas

Exact Evaluation

Produces unambiguous, reproducible scores.

Functional Correctness

Evaluates whether the output does what it’s supposed to do.

  • Best metric for code generation: run the code, check test cases
  • pass@k: fraction of problems for which at least one of the k code samples generated for that problem passes all test cases
    • pass@1 ≤ pass@3 ≤ pass@10 (more samples = more chances)
  • Also applicable to game bots, optimization tasks, workflow automation
  • Benchmarks: HumanEval (OpenAI), MBPP (Google), Spider/BIRD-SQL/WikiSQL (text-to-SQL)
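As a sketch of how pass@k might be computed in practice: rather than generating exactly k samples, a common approach is to generate n ≥ k samples per problem, count the c that pass, and estimate the chance that a random subset of k contains at least one passing sample. The formula below is the estimator described in the paper that introduced HumanEval; it is not spelled out in the notes above, and the `results` data is hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Probability that a random size-k subset of the n samples contains
    at least one passing sample: 1 - C(n - c, k) / C(n, k).
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: (samples generated, samples passing all tests).
results = [(10, 0), (10, 1), (10, 5), (10, 10)]

scores = {}
for k in (1, 3, 10):
    # Benchmark-level pass@k is the average over problems.
    scores[k] = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {scores[k]:.3f}")
```

The scores are non-decreasing in k, since extra samples can only add chances to pass.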

HumanEval example: generate a gcd(num1, num2) function → run against 7 test cases to verify.
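A functional-correctness check of this kind can be sketched as follows. The candidate code stands in for a model's output, and the test cases are illustrative; they are not the actual HumanEval tests. The helper name `passes_all_tests` is hypothetical.

```python
# A candidate solution as a model might return it (hypothetical output).
generated_code = """
def gcd(num1, num2):
    while num2:
        num1, num2 = num2, num1 % num2
    return num1
"""

# Illustrative test cases: ((arguments), expected result).
test_cases = [((12, 18), 6), ((7, 13), 1), ((0, 5), 5), ((100, 10), 10)]

def passes_all_tests(code: str, func_name: str, cases) -> bool:
    """Run the code in a fresh namespace and check every test case."""
    namespace = {}
    try:
        # Only safe inside a sandbox: generated code is untrusted.
        exec(code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, crashes, and missing functions count as failures.
        return False

print(passes_all_tests(generated_code, "gcd", test_cases))
```

A real harness would also need timeouts and process isolation, which this sketch leaves out.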

Similarity Measurements Against Reference Data

Four ways to measure similarity:

  1. Human/AI judgment
  2. Exact match
  3. Lexical similarity
  4. Semantic similarity

Exact match: binary; works only for short, unambiguous responses (math answers, trivia). Variant: accept output containing the reference. Fails for long-form tasks (many valid phrasings of the same idea).
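A minimal sketch of exact match and its "contains the reference" variant. The normalization step (lowercasing, collapsing whitespace) is an assumption; what to normalize depends on the task.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def exact_match(output: str, reference: str) -> bool:
    """Strict variant: output must equal the reference after normalization."""
    return normalize(output) == normalize(reference)

def contains_match(output: str, reference: str) -> bool:
    """Lenient variant: accept any output that contains the reference."""
    return normalize(reference) in normalize(output)

print(exact_match("  Paris ", "paris"))             # True
print(exact_match("The answer is 5.", "5"))         # False
print(contains_match("The answer is 5.", "5"))      # True
print(contains_match("The answer is 15.", "5"))     # True, a false positive
```

The last line shows why the lenient variant needs care: a wrong answer that happens to contain the reference string is accepted.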

Lexical similarity: overlap of tokens/n-grams.

  • Fuzzy matching / edit distance: Levenshtein distance (deletion, insertion, substitution; some include transposition)
  • N-gram similarity: what % of n-grams from reference appear in generated text
  • Common metrics: BLEU, ROUGE, METEOR++, TER, CIDEr
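  • Limitation: good responses missing from the reference set get low scores. Adept found that its Fuyu model scored poorly on a benchmark because correct answers were absent from the reference data, not because the model's outputs were wrong. Also: BLEU doesn’t correlate well with functional correctness in coding.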
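Two of the lexical measures above, sketched in plain Python: Levenshtein distance (without transposition) and the share of reference n-grams found in the generated text. Whitespace tokenization is a simplifying assumption; BLEU, ROUGE, and the other named metrics add weighting and penalties not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def ngram_overlap(reference: str, generated: str, n: int = 2) -> float:
    """Fraction of the reference's n-grams that also appear in the generated text."""
    def ngrams(text):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    ref, gen = ngrams(reference), set(ngrams(generated))
    if not ref:
        return 0.0
    return sum(g in gen for g in ref) / len(ref)

print(levenshtein("kitten", "sitting"))                                     # 3
print(ngram_overlap("my cats scare the mice", "my cats eat the mice"))      # 0.5
```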

Semantic similarity (embedding similarity):

  • Convert texts to embedding vectors → compute cosine similarity
  • “What’s up?” and “How are you?” are lexically different but semantically similar
  • Metrics: BERTScore (BERT embeddings), MoverScore
  • Requires good embedding model; different embedding models give different results
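The cosine-similarity step can be sketched as below. The 3-dimensional vectors are made up for illustration; in practice they would come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings" (illustrative values, not produced by any model).
embeddings = {
    "What's up?": [0.9, 0.1, 0.2],
    "How are you?": [0.8, 0.2, 0.3],
    "Invoice overdue": [0.1, 0.9, 0.1],
}

sim_greetings = cosine_similarity(embeddings["What's up?"], embeddings["How are you?"])
sim_unrelated = cosine_similarity(embeddings["What's up?"], embeddings["Invoice overdue"])

print(f"greetings: {sim_greetings:.3f}, unrelated: {sim_unrelated:.3f}")
```

Because the score depends entirely on the vectors, swapping in a different embedding model can change which pairs look similar.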

Introduction to Embedding

  • Embedding = vector representation that captures meaning
  • Size: typically 100 to 10,000 dimensions
  • Common models: BERT (768/1024 dims), OpenAI text-embedding-3-small (1536), CLIP (512)
  • MTEB (Massive Text Embedding Benchmark) evaluates embedding quality across tasks
  • Multimodal embeddings: CLIP maps text+images to same space; ULIP adds 3D point clouds; ImageBind handles 6 modalities

AI as a Judge

Using AI to evaluate AI outputs. Also called LLM as a judge.

Why it works:

  • Fast, cheap, flexible — can evaluate any criteria
  • Doesn’t require reference data → usable in production
  • GPT-4 ↔ human agreement: 85% (vs. human-human: 81%) on MT-Bench
  • AlpacaEval: 0.98 correlation between AI judges and LMSYS Chatbot Arena human rankings
  • Can explain its decisions, useful for auditing

How to use AI judges:

  1. Evaluate response quality given the question (1-5 scale)
  2. Compare generated vs. reference response (True/False)
  3. Compare two generated responses (A vs. B): useful for preference data, test-time compute, and comparative evaluation

Prompting tips:

  • Clearly state task, criteria, and scoring system
  • AI judges work better with classification than numerical scoring
  • Discrete (1-5) > continuous (0.0-1.0) for numerical scoring
  • Include examples of each score level with justifications
  • Include a rubric with examples: adding evaluation examples to the prompt raised GPT-4's consistency from 65% to 77.5%
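The tips above might translate into a judge prompt like the following sketch. The prompt wording, the criteria, the example pairs, and the helper names (`build_judge_prompt`, `parse_score`) are all hypothetical; no model is called here, so the judge's reply is passed in as a plain string.

```python
import re

# Hypothetical prompt: states the task, criteria, a discrete 1-5 scale,
# and scored examples with justifications.
JUDGE_PROMPT = """You are evaluating the quality of an answer to a question.

Criteria: the answer should be correct, relevant, and concise.
Scoring: an integer from 1 (very poor) to 5 (excellent).

Examples:
Question: What is the capital of France?
Answer: Paris.
Score: 5 (correct and concise)

Question: What is the capital of France?
Answer: France is a country in Europe.
Score: 2 (related but does not answer the question)

Now evaluate:
Question: {question}
Answer: {answer}
Respond with "Score: <1-5>" followed by a one-sentence justification.
"""

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the template with the pair to be judged."""
    return JUDGE_PROMPT.format(question=question, answer=answer)

def parse_score(judge_output: str):
    """Return the 1-5 score from the judge's reply, or None if it is malformed."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt("What is 2 + 2?", "4")
print(parse_score("Score: 4. Correct but could cite the rule used."))
```

Returning `None` for malformed replies keeps parsing failures visible instead of silently counting them as a score.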

Limitations of AI Judges

Inconsistency: the same judge given the same input can return different scores. Mitigations: set temperature=0, include examples in the prompt, and pin a fixed model version.

Criteria ambiguity: MLflow, Ragas, LlamaIndex all have “faithfulness” criterion but use different prompts and scoring systems — their scores are not comparable. Rule: don’t trust AI judges without knowing the model and prompt.

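Cost and latency: Using GPT-4 for both generation and evaluation doubles the number of calls, roughly doubling API cost. Judging three criteria with separate prompts means 1 generation call + 3 judge calls per response, i.e. 4× the calls. Mitigate with spot-checking (evaluating only a subset of responses) or weaker, cheaper judge models.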
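The call-count arithmetic, including the effect of spot-checking, can be sketched as follows. It assumes one generation call per request and one judge call per criterion; `total_calls` and the request volume are hypothetical.

```python
def total_calls(num_requests: int, num_criteria: int, sample_rate: float = 1.0) -> float:
    """Expected model calls: one generation per request, plus one judge
    call per criterion on the sampled fraction of requests."""
    generation_calls = num_requests
    judge_calls = num_requests * sample_rate * num_criteria
    return generation_calls + judge_calls

requests = 1000  # illustrative volume

one_criterion = total_calls(requests, 1) / requests            # 2x baseline
three_criteria = total_calls(requests, 3) / requests           # 4x baseline
spot_checked = total_calls(requests, 3, sample_rate=0.01) / requests

print(one_criterion, three_criteria, round(spot_checked, 2))
```

Judging only a 1% sample on three criteria brings the overhead down from 4× to about 1.03× the generation-only call count.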

Biases:

  • Self-bias: models favor their own outputs (Claude-v1: 25% higher win rate for itself; GPT-4: 10%)
  • First-position bias: favors first answer in pairwise comparison (opposite of humans’ recency bias)
  • Verbosity bias: favors longer responses even when shorter ones are correct; GPT-4 is less prone than GPT-3.5
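One common way to measure and reduce first-position bias, not described in the notes above, is to ask the judge twice with the answer order swapped and keep only verdicts that agree. The sketch below assumes a judge is any callable returning "first" or "second"; the two stand-in judges are toy functions, not real models.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns
# "first" or "second". This interface is an assumption for the sketch.
Judge = Callable[[str, str, str], str]

def debiased_compare(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge twice with the order swapped; accept only consistent verdicts."""
    verdict_ab = judge(question, a, b)   # a shown first
    verdict_ba = judge(question, b, a)   # b shown first
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"   # verdict flipped with the order: likely position bias

# Stand-in judges for demonstration only (no model is called).
def always_first(question, x, y):
    return "first"

def prefers_shorter(question, x, y):
    return "first" if len(x) <= len(y) else "second"

print(debiased_compare(always_first, "q", "short", "a much longer answer"))     # tie
print(debiased_compare(prefers_shorter, "q", "short", "a much longer answer"))  # A
```

The rate of "tie" results over a dataset gives a rough measure of how position-sensitive a judge is.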

What Models Can Act as Judges?

  • Stronger judge: better judgment; use a fast/cheap model to generate, stronger model to evaluate subset
  • Self-evaluation (self-critique): can catch obvious errors and prompt revision; useful for sanity checks
  • Weaker judge: judging is easier than generating (anyone can judge a song; not everyone can write one); small specialized judges can outperform large general judges for specific tasks

Specialized judge types:

  • Reward models (e.g., Google Cappy, 360M params): score (prompt, response) pairs — 0 to 1
  • Reference-based judges (e.g., BLEURT, Prometheus): compare generated vs. reference
  • Preference models (e.g., PandaLM, JudgeLM): predict which of two responses users prefer

Ranking Models with Comparative Evaluation

Pointwise evaluation: score each model independently → rank by scores.

Comparative evaluation: pit models against each other → compute ranking from match outcomes.

Comparative is generally easier for subjective outputs — easier to say “A is better than B” than to give A a score.

Process: for each prompt, pick two models → evaluator votes for winner → aggregate win rates → rating algorithm (Elo, Bradley-Terry, TrueSkill) → ranking.
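As one instance of the rating step, here is a minimal Elo sketch. The K-factor of 32, the starting rating of 1000, and the match list are conventional or made-up values chosen for illustration. Running the same matches in a different order yields different ratings, which shows Elo's order sensitivity.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo's predicted probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def run_elo(matches, k: float = 32, initial: float = 1000):
    """Process (winner, loser) pairs sequentially and return final ratings."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, initial)
        rl = ratings.setdefault(loser, initial)
        gain = k * (1 - expected_score(rw, rl))  # bigger reward for upsets
        ratings[winner] = rw + gain
        ratings[loser] = rl - gain               # zero-sum update
    return ratings

# Hypothetical match outcomes as (winner, loser).
matches = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

forward = run_elo(matches)
backward = run_elo(list(reversed(matches)))

print({m: round(r, 1) for m, r in sorted(forward.items())})
print({m: round(r, 1) for m, r in sorted(backward.items())})
```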

LMSYS Chatbot Arena: anyone can visit the site, enter a prompt, get responses from two anonymous models, and vote; the model names are revealed only after the vote. The leaderboard uses Bradley-Terry rather than Elo because Elo ratings are sensitive to the order in which comparisons are processed.
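A Bradley-Terry fit depends only on win counts, not on the order of matches. The sketch below uses a standard iterative (minorization-maximization) update for the model's strengths; it is an illustration under made-up match data, not a description of how Chatbot Arena implements its leaderboard.

```python
from collections import defaultdict

def fit_bradley_terry(matches, iterations: int = 200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Model: P(i beats j) = s_i / (s_i + s_j).
    Update: s_i = wins_i / sum_j n_ij / (s_i + s_j), then renormalize.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    models = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    models = sorted(models)

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        new = {}
        for i in models:
            denom = 0.0
            for j in models:
                n_ij = pair_counts.get(frozenset((i, j)), 0) if j != i else 0
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(new.values())
        # Normalize so the strengths average to 1 (the scale is arbitrary).
        strength = {m: s * len(models) / total for m, s in new.items()}
    return strength

# Hypothetical outcomes: each stronger model wins 3 of 4 against the next.
matches = ([("A", "B")] * 3 + [("B", "A")] +
           [("B", "C")] * 3 + [("C", "B")] +
           [("A", "C")] * 3 + [("C", "A")])

strengths = fit_bradley_terry(matches)
shuffled = fit_bradley_terry(list(reversed(matches)))
print({m: round(s, 3) for m, s in strengths.items()})
```

Reversing the match list gives the same strengths, unlike a sequential rating update.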

Challenges:

  • Scalability: n models → n(n-1)/2 model pairs; grows quadratically. LMSYS (Jan 2024): 57 models (1,596 pairs) with 244K comparisons ≈ 153 comparisons per pair on average (low)
  • Transitivity assumption: if A > B and B > C then A > C — may not hold for human preferences or when different evaluators/prompts are used
  • Quality control: crowdsourced votes may not fact-check; simple prompts (“hello”) don’t differentiate models; no support for RAG context
  • Comparative ≠ absolute: knowing that B wins 51% of its matches against A doesn’t tell you how much better B is or whether either is “good enough”
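The LMSYS figures in the scalability bullet can be checked with a few lines of arithmetic; `num_pairs` is a helper name introduced here.

```python
def num_pairs(n_models: int) -> int:
    """Number of distinct model pairs: n(n-1)/2."""
    return n_models * (n_models - 1) // 2

pairs = num_pairs(57)               # distinct pairs among 57 models
avg_per_pair = 244_000 / pairs      # comparisons spread evenly over pairs

print(pairs, round(avg_per_pair))   # 1596 153

# Quadratic growth: doubling the model count roughly quadruples the pairs.
print(num_pairs(114) / pairs)
```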

Future: comparative evaluation scales better than benchmarks (never saturates); hard to game; remains relevant as models surpass human generation ability.

Vs. A/B testing: A/B testing = one model per user at a time; comparative = both models shown simultaneously.


Key Takeaways

  • Evaluation is the biggest bottleneck in AI adoption; invest systematically, not ad hoc
  • Language modeling metrics (perplexity, cross entropy) are cheap proxies for model capability but become less reliable after post-training
  • Exact evaluation (functional correctness, lexical/semantic similarity) is deterministic; AI as a judge is subjective but flexible and increasingly common
  • AI judges are only as good as the model + prompt + scoring system used; always version and lock them
  • Biases in AI judges (self-bias, first-position, verbosity) must be measured and mitigated
  • Comparative evaluation via leaderboards is hard to game and never saturates — valuable supplement to benchmarks
  • Combine methods: cheap classifier on 100% of responses + expensive AI judge on a 1% sample for cost-effective production evaluation