Chapter 4: Evaluate AI Systems
Evaluation-Driven Development
Define evaluation criteria before building — analogous to test-driven development.
Why: companies hesitate to deploy applications whose ROI is unclear. The most common applications in production are those with clear evaluation criteria:
- Recommender systems: engagement/purchase rates
- Fraud detection: money saved
- Code generation: functional correctness (automated)
- Classification tasks (intent, sentiment): accuracy, F1
Evaluation is the biggest bottleneck to AI adoption. Building reliable evaluation pipelines unlocks applications that currently seem impossible.
Four buckets for evaluation criteria:
- Domain-specific capability
- Generation capability
- Instruction-following capability
- Cost and latency
Evaluation Criteria
Domain-Specific Capability
Does the model have the knowledge needed for your task? Evaluate with domain-specific benchmarks.
Code generation: functional correctness is primary. Also consider:
- Efficiency: BIRD-SQL measures runtime vs. ground truth query
- Readability: requires subjective evaluation (AI as a judge)
Non-coding domains: mostly multiple-choice questions (MCQ), which made up 75% of the tasks in EleutherAI’s lm-evaluation-harness as of April 2024.
MCQ metrics: accuracy, point systems for multi-answer, F1/precision/recall for classification.
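The metrics named above can be computed in a few lines. This is a minimal sketch, assuming gold and predicted labels arrive as parallel lists; the function names and sample labels are illustrative and not taken from any benchmark harness.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 with one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical answers to six multiple-choice questions.
gold = ["A", "B", "A", "C", "A", "B"]
pred = ["A", "A", "A", "C", "B", "B"]

print("accuracy:", accuracy(gold, pred))  # 4 of 6 answers match
print("P/R/F1 for option A:", precision_recall_f1(gold, pred, "A"))
```

Accuracy suits single-answer MCQs; per-class precision, recall, and F1 are more informative when classes are imbalanced.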
MCQ limitations: performance is sensitive to small prompt changes (an extra space, a “Choices:” prefix), and MCQs test knowledge and reasoning rather than generation capabilities.
Generation Capability
Traditional NLG metrics (still useful for weak models, creative writing, low-resource languages):
- Fluency: grammatically correct and natural
- Coherence: well-structured logically
- Faithfulness: (translation) how faithful to original
- Relevance: (summarization) captures most important aspects
Modern LLMs rarely fail on fluency/coherence, so those metrics matter less now.
New metrics for foundation models:
Factual Consistency
Two settings:
- Local: output vs. explicitly provided context (summarization, customer support, business analysis)
- Global: output vs. open knowledge (general chatbots, fact-checking)
Evaluation approaches:
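- AI as a judge (Liu et al. 2023; Luo et al. 2023): GPT-3.5 and GPT-4 outperform previous methods at measuring factual consistency; the finetuned GPT-judge from the TruthfulQA paper predicts human judgments of truthfulness with 90–96% accuracy
- Self-verification (SelfCheckGPT): generate N additional responses and measure how consistent the original response is with them; if the original disagrees with the resampled responses, it is likely hallucinated. Effective but expensive, since every evaluation needs N extra generations.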
- Knowledge-augmented (SAFE by Google DeepMind):
  - Decompose response into individual statements
  - Make each statement self-contained
  - Send fact-checking queries to Google Search
  - AI determines consistency with search results
- Textual entailment (NLI): classify (premise, hypothesis) as entailment, contradiction, or neutral; specialized models like DeBERTa-v3-base-mnli-fever-anli (184M params)
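The self-verification idea can be sketched as follows: compare the original response against several resampled responses and flag low agreement. This is only an illustration. Token-level Jaccard overlap is a crude stand-in for a real consistency scorer (an NLI model or an AI judge would be more appropriate), and the function names, threshold, and example sentences are all assumptions.

```python
import re


def tokens(text):
    """Lowercased word set; a deliberately crude text representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def agreement_score(original, samples):
    """Mean similarity between the original response and each resampled response."""
    return sum(jaccard(original, s) for s in samples) / len(samples)


def likely_hallucinated(original, samples, threshold=0.5):
    """Flag the original when it agrees poorly with the resampled responses.

    The 0.5 threshold is an arbitrary illustrative value, not a recommended setting.
    """
    return agreement_score(original, samples) < threshold


original = "The Eiffel Tower was completed in 1889."
consistent = ["the eiffel tower was completed in 1889",
              "The Eiffel Tower was completed in 1889!"]
inconsistent = ["It opened in 1925.", "Construction finished around 1901."]

print(likely_hallucinated(original, consistent))
print(likely_hallucinated(original, inconsistent))
```

In practice the resampled responses would come from the same model at a non-zero temperature, which is where the extra cost of this method comes from.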
TruthfulQA benchmark: 817 questions across 38 categories that some humans answer incorrectly because of misconceptions; its finetuned GPT-judge predicts human judgments of truthfulness with 90–96% accuracy; human expert baseline is 94%.
Hallucination tips: models tend to hallucinate on niche knowledge and on questions about things that don’t exist (“What did X say about Y?” when X never said anything about Y).
Safety
Categories of unsafe content:
- Inappropriate language (profanity, explicit content)
- Harmful recommendations/tutorials
- Hate speech (racist, sexist, homophobic)
- Violence (threats, graphic detail)
- Stereotypes (female nurses, male CEOs)
- Political/religious bias: Feng et al. (2023) found that GPT-4 leans left-libertarian while Llama leans more authoritarian
Evaluation tools: general-purpose AI judges (GPT, Claude, Gemini), OpenAI moderation endpoint, Meta’s Llama Guard, Facebook hate speech model, Perspective API, toxicity classifiers.
Benchmarks: RealToxicityPrompts (100K prompts likely to elicit toxic output), BOLD (bias in open-ended generation).
Instruction-Following Capability
How well does the model follow the instructions given to it? Distinct from domain knowledge — a model may know sentiment analysis but output “HAPPY” instead of “POSITIVE”.
IFEval (Google): 25 automatically verifiable instruction types:
- Keyword inclusion/exclusion/frequency
- Language, length constraints (paragraphs, words, sentences)
- Format constraints (JSON, bullet points, title, sections)
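Instructions of this kind are attractive because plain code can verify them without a judge model. The checks below are a minimal sketch in that spirit. They are not IFEval's actual implementation, and every function name is hypothetical.

```python
import json
import re


def includes_keywords(response, keywords):
    """True if every keyword appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)


def excludes_keywords(response, keywords):
    """True if none of the forbidden keywords appear in the response."""
    lowered = response.lower()
    return not any(k.lower() in lowered for k in keywords)


def within_word_limit(response, limit):
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def is_valid_json(response):
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def has_bullet_count(response, expected):
    """True if the response contains exactly `expected` '-' or '*' bullet lines."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected


json_reply = '{"sentiment": "POSITIVE"}'
bullet_reply = "Summary:\n- point one\n- point two\n- point three"

print(is_valid_json(json_reply), includes_keywords(json_reply, ["positive"]))
print(has_bullet_count(bullet_reply, 3), within_word_limit(bullet_reply, 10))
```

A private instruction-following benchmark can start as a list of (prompt, check) pairs built from the instructions your application actually uses.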
INFOBench (Qin et al., 2024): broader view — also includes:
- Content constraints (“discuss only climate change”)
- Linguistic guidelines (“use Victorian English”)
- Style rules (“use a respectful tone”)
INFOBench evaluation: each instruction is checked against a rubric of yes/no questions, with AI judges used for verification.
Key insight: poor performance can come from weak instruction following or from weak domain capability, and the two are hard to disentangle. Always curate your own instruction-following benchmark for your specific instructions.
Roleplaying: 8th most common LMSYS use case; important for NPCs, AI companions, writing assistants.
- Evaluate both style and knowledge consistency
- RoleLLM benchmark: similarity scores + AI judges
- CharacterEval: human annotators + reward model, 5-point scale
Cost and Latency
Optimize along multiple objectives — Pareto optimization.
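With several objectives, a candidate is worth keeping only if no other candidate is at least as good on every objective and strictly better on one. A minimal sketch of that filter follows, assuming each candidate is described by a quality score (higher is better) plus cost and latency (lower is better). The model names and numbers are made up for illustration.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    no_worse = (a["quality"] >= b["quality"]
                and a["cost"] <= b["cost"]
                and a["latency"] <= b["latency"])
    strictly_better = (a["quality"] > b["quality"]
                       or a["cost"] < b["cost"]
                       or a["latency"] < b["latency"])
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical candidates: quality score, cost per 1M tokens, latency in seconds.
models = [
    {"name": "large", "quality": 0.90, "cost": 10.0, "latency": 2.0},
    {"name": "medium", "quality": 0.85, "cost": 3.0, "latency": 1.0},
    {"name": "small", "quality": 0.70, "cost": 0.5, "latency": 0.3},
    {"name": "dominated", "quality": 0.80, "cost": 4.0, "latency": 1.5},
]

print([m["name"] for m in pareto_front(models)])
```

Here "dominated" drops out because "medium" beats it on all three axes; choosing among the remaining three is a business decision, not a technical one.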
Key latency metrics: TTFT (time to first token), time per token, time per query.
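These three metrics can be derived from timestamps recorded while streaming a response. The sketch below assumes you have the request start time and one arrival time per output token, all in seconds. The function name and the definition of time per token as the mean gap after the first token are illustrative choices.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, mean time per output token, and total time per query.

    request_start: time the request was sent (seconds).
    token_times: arrival time of each output token (seconds, ascending).
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    n = len(token_times)
    ttft = token_times[0] - request_start
    # Mean gap between consecutive tokens after the first one arrives.
    per_token = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_start
    return {"ttft": ttft, "time_per_token": per_token, "time_per_query": total}


# Hypothetical trace: first token after 0.5 s, then one token every 0.25 s.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Separating TTFT from time per token matters because users of a streaming interface feel the first far more than the second.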
Cost models:
- API providers: charge per token (input + output)
- Self-hosted: cost is compute (fixed hardware); cost per token decreases with scale
- At high scale, self-hosted can be much cheaper than APIs
Common GPU memory configurations (16 GB, 24 GB, 48 GB, 80 GB) drive model size choices, which helps explain why many models have 7B or 65B parameters.
Model Selection
Model Selection Workflow
- Filter hard attributes (can’t/won’t change): licenses, data privacy, model size, your own policies
- Use public information (benchmarks, leaderboards) to narrow candidates
- Run private experiments with your own evaluation pipeline
- Monitor production and collect feedback
Model Build vs. Buy (Open Source vs. API)
“Open source” terminology:
- Open model: weights + training data public
- Open weight: weights public, training data hidden (majority of “open source” models)
- This book uses “open source” to mean open-weight
License questions to ask:
- Commercial use allowed?
- Restrictions for large-scale use? (Llama-2/3: >700M MAU needs special license)
- Can outputs be used to train other models? (Llama licenses don’t allow it; Mistral originally didn’t but later changed its license)
7 axes for the API vs. self-host decision:
| Axis | API | Self-hosted |
|---|---|---|
| Data privacy | Must send data externally; risk of leaks (e.g., Samsung banned ChatGPT after employees leaked internal data) | Data stays internal |
| Data lineage | Commercial contracts protect you from IP risk | Open source may expose you if model has IP issues |
| Performance | Strongest models will be proprietary | Best open source lags behind commercial |
| Functionality | Scaling, function calling, structured outputs out of box; usually no logprobs | Logprobs accessible; may lack function calling/structured outputs |
| Cost | Per-token; expensive at scale | Engineering + hardware cost; cheaper per token at scale |
| Control/transparency | Rate limits; risk of losing access; unpredictable model updates | Can freeze model; full control; more transparent versioning |
| On-device deployment | Impossible without internet | Possible but challenging |
Key rule: at some scale, self-hosted is cheaper. Recalculate regularly.
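The break-even point can be estimated with simple arithmetic. The sketch below assumes a linear cost model: API cost is purely per token, and self-hosting has a fixed monthly cost (hardware plus engineering) and a smaller per-token marginal cost. All prices are hypothetical placeholders to replace with your own figures.

```python
def monthly_cost_api(tokens, price_per_million):
    """API cost: pay per token, no fixed cost."""
    return tokens / 1e6 * price_per_million


def monthly_cost_self_hosted(tokens, fixed_monthly, marginal_per_million=0.0):
    """Self-hosted cost: fixed hardware/engineering cost plus a marginal cost per token."""
    return fixed_monthly + tokens / 1e6 * marginal_per_million


def break_even_tokens(price_per_million, fixed_monthly, marginal_per_million=0.0):
    """Monthly token volume above which self-hosting is cheaper, or None if never."""
    saving_per_million = price_per_million - marginal_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly / saving_per_million * 1e6


# Hypothetical numbers: $10 per 1M API tokens, $20,000/month fixed, $2 per 1M marginal.
threshold = break_even_tokens(10.0, 20_000.0, 2.0)
print(f"break-even at {threshold:,.0f} tokens per month")

volume = 5e9  # 5B tokens per month
print(monthly_cost_api(volume, 10.0), monthly_cost_self_hosted(volume, 20_000.0, 2.0))
```

Because API prices, hardware prices, and traffic all change, rerunning this calculation periodically is what "recalculate regularly" amounts to.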
Navigating Public Benchmarks
Benchmark selection challenges:
- Thousands of benchmarks exist; Google BIG-bench has 214 alone
- Different leaderboards select different benchmarks — not standardized
- Hugging Face leaderboard (2023): 6 benchmarks (ARC-C, MMLU, HellaSwag, TruthfulQA, WinoGrande, GSM-8K)
- HELM leaderboard: 10 benchmarks — only 2 overlap with Hugging Face
Benchmark correlation matters: ARC-C, MMLU, WinoGrande are highly correlated (r > 0.85); don’t need all three. TruthfulQA is only moderately correlated with others.
Aggregation: Hugging Face averages scores; HELM uses mean win rate. Both have limitations (equal weight, no context on what’s harder).
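The choice of aggregation can change which model ranks first. The sketch below compares a plain average with a simplified reading of mean win rate: for each benchmark, the fraction of other models a model beats, averaged over benchmarks. Treat that definition as an assumption rather than HELM's exact procedure; the models and scores are invented.

```python
def average_score(scores):
    """Mean score across benchmarks for each model."""
    return {m: sum(b.values()) / len(b) for m, b in scores.items()}


def mean_win_rate(scores):
    """Per benchmark, the fraction of other models beaten; averaged over benchmarks."""
    models = list(scores)
    benchmarks = list(next(iter(scores.values())))
    result = {}
    for m in models:
        others = [o for o in models if o != m]
        rates = []
        for b in benchmarks:
            wins = sum(scores[m][b] > scores[o][b] for o in others)
            rates.append(wins / len(others))
        result[m] = sum(rates) / len(rates)
    return result


# Hypothetical scores: model_a is far ahead on one benchmark and behind on two.
scores = {
    "model_a": {"bench_1": 90, "bench_2": 40, "bench_3": 41},
    "model_b": {"bench_1": 50, "bench_2": 45, "bench_3": 45},
    "model_c": {"bench_1": 40, "bench_2": 42, "bench_3": 43},
}

avg = average_score(scores)
win = mean_win_rate(scores)
print("best by average:", max(avg, key=avg.get))
print("best by mean win rate:", max(win, key=win.get))
```

Averaging rewards one large margin, while win rate rewards being ahead often, so the two disagree here. Neither reflects which benchmarks matter for your application.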
Data contamination — models trained on benchmark data:
- Most contamination is unintentional (internet scraping picks up public benchmarks)
- Schaeffer (2023) demonstrated this satirically: a 1M-parameter model trained only on benchmark data achieves near-perfect scores
- Detection methods: n-gram overlap (accurate but expensive; requires training data access), perplexity (cheap but less accurate)
- OpenAI found 13 benchmarks in which at least 40% of the data appeared in GPT-3’s training data
- Best practice: disclose % contamination and report performance on clean samples separately
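N-gram overlap detection and the reporting practice above can be sketched together. The code assumes you have access to the training documents, splits on whitespace instead of using a real tokenizer, and uses a short n-gram length so the toy example triggers. The function names and data are illustrative.

```python
def ngrams(text, n):
    """Set of word n-grams in a text (whitespace tokenization, lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(training_docs, n):
    """Collect every n-gram that occurs anywhere in the training data."""
    index = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index


def contamination_report(benchmark, training_docs, n=5):
    """Flag benchmark samples that share any n-gram with the training data."""
    index = build_index(training_docs, n)
    flagged = [s for s in benchmark if ngrams(s, n) & index]
    clean = [s for s in benchmark if not (ngrams(s, n) & index)]
    pct = 100 * len(flagged) / len(benchmark)
    return {"contaminated_pct": pct, "flagged": flagged, "clean": clean}


# Toy data for illustration only.
training_docs = ["the quick brown fox jumps over the lazy dog near the river"]
benchmark = [
    "Why does the quick brown fox jumps over the fence",
    "What is the capital of France and its population",
]

report = contamination_report(benchmark, training_docs)
print(f"{report['contaminated_pct']:.0f}% contaminated; "
      f"{len(report['clean'])} clean sample(s) to score separately")
```

Indexing every n-gram of a real training corpus is what makes this method expensive; the clean subset is what you would report performance on separately.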
Public benchmarks saturate fast — Hugging Face updated their leaderboard within a year, replacing GSM-8K with MATH lvl 5, MMLU with MMLU-Pro.
Bottom line: public benchmarks narrow the candidate list but won’t find the best model for your use case. You need your own private evaluation pipeline.
Design Your Evaluation Pipeline
Step 1: Evaluate All Components
Evaluate end-to-end AND each component independently.
Example: Resume employer extraction (PDF → text → extract employer).
- If end-to-end fails, you don’t know if it’s the PDF parsing or the extraction
- Evaluate each step separately
Turn-based vs. task-based evaluation:
- Turn-based: quality of each response
- Task-based: did the system accomplish the goal? How many turns?
- Task-based is more important but harder to define boundaries for
- Example: BIG-bench’s “twenty questions” game → evaluate whether model guesses concept within N turns
Step 2: Create an Evaluation Guideline
Most important step. Define:
- What the application should do AND what it should NOT do (out-of-scope inputs)
- What makes a “good” response (not just “correct”)
- LinkedIn Job Assessment: “You are a terrible fit” may be correct but is a bad response — good response explains gap and path to improvement
For each criterion:
- Choose a scoring system: binary, 1-5, 0-1, categorical
- Create rubric with examples: what does score 1, 3, 5 look like and why?
- Validate rubric with humans: if humans can’t follow it, refine it
Map evaluation metrics to business metrics, for example:
- Factual consistency 80% → automate 30% of requests
- Factual consistency 90% → automate 50%
- Factual consistency 98% → automate 90%
Define usefulness threshold: minimum score for the application to be useful at all.
LangChain State of AI 2023: average of 2.3 criteria per application.
Step 3: Define Evaluation Methods and Data
Selecting methods:
- Different criteria may need different methods (toxicity classifier + AI judge for factual consistency + semantic similarity for relevance)
- Mix methods: run a cheap classifier on 100% of the data and an expensive AI judge on 1% to balance cost against confidence
- Use logprobs when available (for classification confidence, perplexity-based scores)
- Human evaluation as North Star: LinkedIn manually evaluates up to 500 daily conversations
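Mixing a cheap check on all traffic with an expensive judge on a small sample might look like the sketch below. Both evaluators are stubs: `keyword_toxicity_check` and `stub_judge` are hypothetical stand-ins for a real classifier and a real AI judge, and the 1% rate simply mirrors the example above.

```python
import random


def evaluate_traffic(responses, cheap_check, expensive_judge, judge_rate=0.01, seed=0):
    """Run the cheap check on every response and the judge on a random sample."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    records = []
    for response in responses:
        record = {"response": response, "cheap": cheap_check(response), "judge": None}
        if rng.random() < judge_rate:
            record["judge"] = expensive_judge(response)
        records.append(record)
    return records


def keyword_toxicity_check(response):
    """Stub classifier: passes anything that lacks a blocked word."""
    return "idiot" not in response.lower()


def stub_judge(response):
    """Stub AI judge: returns a score in [0, 1]; a real one would call a model."""
    return 1.0 if response.strip() else 0.0


responses = [f"answer {i}" for i in range(1000)]
records = evaluate_traffic(responses, keyword_toxicity_check, stub_judge)
judged = [r for r in records if r["judge"] is not None]
print(f"cheap checks: {len(records)}, judged by the expensive judge: {len(judged)}")
```

Random sampling keeps the judged subset representative; you could also oversample responses the cheap check flags, at the price of a biased sample.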
Annotating evaluation data:
- Use actual production data if possible (natural labels preferred)
- Slice data by tiers, traffic source, usage patterns, error-prone scenarios, out-of-scope examples
- Watch for Simpson’s paradox: model A may outperform model B on every slice yet underperform overall, which can happen when the two models are measured on different mixes of slice sizes
- Evaluation data annotations can be reused for finetuning data
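Simpson's paradox is easier to believe with numbers. In the hypothetical results below, each model was measured on a different mix of easy and hard examples (as can happen with live traffic). Model A wins on both slices but loses on the aggregate. The counts are invented for illustration.

```python
# (correct, total) per slice; the two models saw different slice mixes.
results = {
    "model_a": {"easy": (81, 87), "hard": (192, 263)},
    "model_b": {"easy": (234, 270), "hard": (55, 80)},
}


def slice_accuracy(model, slice_name):
    """Accuracy of one model on one slice."""
    correct, total = results[model][slice_name]
    return correct / total


def overall_accuracy(model):
    """Accuracy of one model pooled over all of its slices."""
    correct = sum(c for c, _ in results[model].values())
    total = sum(t for _, t in results[model].values())
    return correct / total


for name in ("easy", "hard"):
    print(name, round(slice_accuracy("model_a", name), 3),
          round(slice_accuracy("model_b", name), 3))
print("overall", round(overall_accuracy("model_a"), 3),
      round(overall_accuracy("model_b"), 3))
```

Model A looks worse overall only because most of its examples came from the hard slice, which is why per-slice results should be reported alongside the aggregate.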
How much data is needed:
| Score difference to detect | Sample size for 95% confidence |
|---|---|
| 30% | ~10 |
| 10% | ~100 |
| 3% | ~1,000 |
| 1% | ~10,000 |
Rule of thumb: each 3× decrease in the score difference you want to detect requires roughly 10× more samples.
- Median benchmark in lm-evaluation-harness: 1,000 examples
- Inverse Scaling Prize: 300 minimum, prefer 1,000+
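The table is consistent with the required sample size growing as the inverse square of the difference to detect, since sampling error shrinks with the square root of the sample size. The helper below encodes that reading as n ≈ 1/d². This formula is an observation fitted to the table above, not one stated in the source, so treat it as an order-of-magnitude estimate only.

```python
def rough_sample_size(diff):
    """Rough number of samples needed to detect a score difference `diff` (0-1 scale).

    Assumes n scales as 1 / diff**2, which reproduces the table approximately.
    """
    if not 0 < diff <= 1:
        raise ValueError("diff must be in (0, 1]")
    return round(1 / diff ** 2)


for diff in (0.30, 0.10, 0.03, 0.01):
    print(f"{diff:.0%} difference -> about {rough_sample_size(diff):,} samples")
```

Under this reading, a 3× smaller difference needs 3² = 9× more samples, which the rule of thumb rounds to 10×.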
Bootstrap validation: draw N samples with replacement from your evaluation set multiple times; if results vary wildly across bootstraps → need a larger evaluation set.
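A bootstrap check can be sketched with the standard library alone. The code assumes each evaluation example yields a numeric score (here 1 for pass, 0 for fail), resamples the set with replacement, and reports a percentile interval for the mean. The function names, resample count, and data are illustrative.

```python
import random
import statistics


def bootstrap_means(scores, n_resamples=500, seed=0):
    """Mean score of each bootstrap resample (same size as the original set)."""
    rng = random.Random(seed)
    n = len(scores)
    return [statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)]


def bootstrap_interval(scores, n_resamples=500, alpha=0.05, seed=0):
    """Approximate (1 - alpha) percentile interval for the mean score."""
    means = sorted(bootstrap_means(scores, n_resamples, seed))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Two hypothetical evaluation sets with the same 70% pass rate but different sizes.
small_set = [1] * 14 + [0] * 6
large_set = [1] * 1400 + [0] * 600

small_lo, small_hi = bootstrap_interval(small_set)
large_lo, large_hi = bootstrap_interval(large_set)
print(f"20 examples:   {small_lo:.2f} to {small_hi:.2f}")
print(f"2000 examples: {large_lo:.2f} to {large_hi:.2f}")
```

If the interval on your evaluation set is wider than the difference you need to detect between two models, the set is too small for that comparison.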
Evaluating your evaluation pipeline:
- Do better responses get higher scores?
- Do better metrics lead to better business outcomes?
- Is the pipeline reproducible? (Set AI judge temperature to 0)
- Are your metrics correlated? (Highly correlated → remove one; uncorrelated → interesting insight or untrustworthy metrics)
- What cost/latency does evaluation add?
Iteration: update criteria, rubrics, examples as needs evolve, but maintain consistency for trend-tracking. Track all evaluation variables in experiment logs.
Key Takeaways
- Evaluation-driven development: define criteria before building; tie AI metrics to business metrics
- Four evaluation buckets: domain capability, generation capability, instruction-following, cost/latency
- For factual consistency: AI judges, self-verification, SAFE (search-augmented), and NLI classifiers all have trade-offs; combine them
- Open source vs. API isn’t a one-time decision — re-evaluate as scale and needs change; data privacy and performance are the top decision factors
- Public benchmarks help filter bad models but not find the best one for you; contamination makes them unreliable for fine comparisons
- Build your own evaluation pipeline: evaluate each component independently, define clear rubrics, slice your data, validate with bootstrapping
- Sample size: detecting 3% difference needs ~1,000 examples; detecting 1% needs ~10,000