Chapter 4: Evaluate AI Systems

Evaluation-Driven Development

Define evaluation criteria before building — analogous to test-driven development.

Why: companies avoid deploying applications with unclear ROI. The most common applications in production are those with clear evaluation criteria:

Recommender systems: engagement/purchase rates
Fraud detection: money saved
Code generation: functional correctness (automated)
Classification tasks (intent, sentiment): accuracy, F1

Evaluation is the biggest bottleneck to AI adoption. Building reliable evaluation pipelines unlocks applications that currently seem impossible.

Four buckets for evaluation criteria:

Domain-specific capability
Generation capability
Instruction-following capability
Cost and latency

Evaluation Criteria

Domain-Specific Capability

Does the model have the knowledge needed for your task? Evaluate with domain-specific benchmarks.

Code generation: functional correctness is primary. Also consider:

Efficiency: BIRD-SQL measures runtime vs. ground truth query
Readability: requires subjective evaluation (AI as a judge)
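A minimal sketch of a functional-correctness check, assuming generated code is scored by running it against test cases. The helper `passes_tests`, the sample functions, and the test cases are hypothetical; a real harness would run untrusted code in a sandbox with timeouts.

```python
from typing import Any, Sequence, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (positional args, expected return value)


def passes_tests(source: str, func_name: str, test_cases: Sequence[TestCase]) -> bool:
    """Return True only if the generated function passes every test case."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code; only do this inside a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical model outputs for the task "write add(a, b)".
correct = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

print(passes_tests(correct, "add", tests))  # True
print(passes_tests(buggy, "add", tests))    # False
```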

Non-coding domains: mostly evaluated with multiple-choice questions (MCQ); as of April 2024, 75% of the tasks in Eleuther’s lm-evaluation-harness are multiple-choice.

MCQ metrics: accuracy, point systems for multi-answer, F1/precision/recall for classification.
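The metrics above can be computed directly from gold and predicted labels. This is a plain-Python sketch; the function names and the toy labels are illustrative, and libraries such as scikit-learn provide equivalent functions.

```python
from typing import Sequence, Tuple


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items where the prediction matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], positive: str
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration.
gold = ["A", "B", "A", "C", "A"]
pred = ["A", "A", "A", "C", "B"]

print(accuracy(gold, pred))                  # 3 of 5 correct
print(precision_recall_f1(gold, pred, "A"))  # 2 TP, 1 FP, 1 FN
```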

MCQ limitations: performance is sensitive to small prompt changes (an extra space, adding a “Choices:” prefix), and MCQs test knowledge and reasoning, not generation capabilities.

Generation Capability

Traditional NLG metrics (still useful for weak models, creative writing, low-resource languages):

Fluency: grammatically correct and natural
Coherence: well-structured logically
Faithfulness: (translation) how faithful to original
Relevance: (summarization) captures most important aspects

Modern LLMs rarely fail on fluency/coherence, so those metrics matter less now.

New metrics for foundation models:

Factual Consistency

Two settings:

Local: output vs. explicitly provided context (summarization, customer support, business analysis)
Global: output vs. open knowledge (general chatbots, fact-checking)

Evaluation approaches:

AI as a judge (Liu et al. 2023; Luo et al. 2023): GPT-3.5/4 outperform previous methods; GPT-judge achieves 90–96% accuracy at predicting human judgment of truthfulness
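Self-verification (SelfCheckGPT): generate N additional responses and measure how consistent the original response is with them; low consistency suggests the original is hallucinated. Effective, but can be prohibitively expensive because every evaluation requires N extra generations.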
Knowledge-augmented (SAFE by Google DeepMind):
1. Decompose response into individual statements
2. Make each statement self-contained
3. Send fact-checking queries to Google Search
4. AI determines consistency with search results
Textual entailment (NLI): classify (premise, hypothesis) as entailment, contradiction, or neutral; specialized models like DeBERTa-v3-base-mnli-fever-anli (184M params)
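A rough sketch of the self-verification idea above, assuming the N extra responses have already been sampled. Token-level Jaccard overlap is used here only as a crude stand-in for a real consistency scorer (such as an NLI model or an AI judge); it is not how SelfCheckGPT itself scores agreement, and the function names and example strings are made up.

```python
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(original: str, samples: Sequence[str]) -> float:
    """Average agreement between the original response and the extra samples.

    A low score suggests the original response may be hallucinated.
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    return sum(jaccard(original, s) for s in samples) / len(samples)


original = "The Eiffel Tower was completed in 1889"
agreeing = [
    "The Eiffel Tower was completed in 1889",
    "The Eiffel Tower was completed in 1889 in Paris",
]
conflicting = [
    "It opened in 1925",
    "Construction finished around 1850",
]

print(consistency_score(original, agreeing))     # high: samples agree
print(consistency_score(original, conflicting))  # low: samples disagree
```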

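TruthfulQA benchmark: 817 questions across 38 categories that some humans answer incorrectly because of false beliefs or misconceptions. A finetuned GPT-judge predicts human truthfulness judgments with 90–96% accuracy; the human expert baseline on the benchmark is 94%.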

Hallucination tips: models tend to hallucinate on niche knowledge and on questions about things that don’t exist (“What did X say about Y?” when X never said anything about Y).

Safety

Categories of unsafe content:

Inappropriate language (profanity, explicit content)
Harmful recommendations/tutorials
Hate speech (racist, sexist, homophobic)
Violence (threats, graphic detail)
Stereotypes (female nurses, male CEOs)
Political/religious bias: Feng et al. (2023) found that GPT-4 leans left-libertarian while Llama leans more authoritarian

Evaluation tools: general-purpose AI judges (GPT, Claude, Gemini), OpenAI moderation endpoint, Meta’s Llama Guard, Facebook hate speech model, Perspective API, toxicity classifiers.

Benchmarks: RealToxicityPrompts (100K prompts likely to elicit toxic output), BOLD (bias in open-ended generation).

Instruction-Following Capability

How well does the model follow the instructions given to it? Distinct from domain knowledge — a model may know sentiment analysis but output “HAPPY” instead of “POSITIVE”.

IFEval (Google): 25 automatically verifiable instruction types:

Keyword inclusion/exclusion/frequency
Language, length constraints (paragraphs, words, sentences)
Format constraints (JSON, bullet points, title, sections)
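Instructions like these are "automatically verifiable" because each can be checked with a small program. The checkers below are a hypothetical sketch of that idea, not IFEval's own code; the function names and sample responses are assumptions.

```python
import json
from typing import Sequence


def includes_keywords(response: str, keywords: Sequence[str]) -> bool:
    """Keyword inclusion: every keyword must appear (case-insensitive)."""
    text = response.lower()
    return all(k.lower() in text for k in keywords)


def excludes_keywords(response: str, keywords: Sequence[str]) -> bool:
    """Keyword exclusion: no forbidden keyword may appear."""
    text = response.lower()
    return not any(k.lower() in text for k in keywords)


def within_word_limit(response: str, max_words: int) -> bool:
    """Length constraint: at most max_words whitespace-separated words."""
    return len(response.split()) <= max_words


def is_valid_json(response: str) -> bool:
    """Format constraint: the whole response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json('{"sentiment": "POSITIVE"}'))  # True
print(is_valid_json("sentiment: POSITIVE"))        # False
print(within_word_limit("one two three", 2))       # False
```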

INFOBench (Qin et al., 2024): broader view — also includes:

Content constraints (“discuss only climate change”)
Linguistic guidelines (“use Victorian English”)
Style rules (“use a respectful tone”)
Evaluated via yes/no question rubrics per instruction; AI judges used for verification

Key insight: poor performance can stem from weak instruction-following or from weak domain capability, and the two are hard to disentangle. Curate your own instruction-following benchmark using the specific instructions your application relies on.

Roleplaying: 8th most common LMSYS use case; important for NPCs, AI companions, writing assistants.

Evaluate both style and knowledge consistency
RoleLLM benchmark: similarity scores + AI judges
CharacterEval: human annotators + reward model, 5-point scale

Cost and Latency

Optimize along multiple objectives — Pareto optimization.
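One way to read "Pareto optimization" here: discard any candidate that another candidate beats or ties on every objective and strictly beats on at least one. The sketch below assumes three objectives (quality, cost, latency); the model names and numbers are invented.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float  # higher is better
    cost: float     # lower is better
    latency: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.quality > b.quality or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Invented numbers for illustration only.
candidates = [
    Candidate("model-a", quality=0.90, cost=10.0, latency=2.0),
    Candidate("model-b", quality=0.85, cost=3.0, latency=0.8),
    Candidate("model-c", quality=0.80, cost=5.0, latency=1.5),  # dominated by model-b
    Candidate("model-d", quality=0.70, cost=1.0, latency=0.3),
]

print([c.name for c in pareto_front(candidates)])  # ['model-a', 'model-b', 'model-d']
```

The choice among the remaining candidates still depends on which trade-off the application can tolerate.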

Key latency metrics: TTFT (time to first token), time per token, time per query.
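A sketch of how these three metrics could be measured for a streaming response, assuming the model output arrives as an iterator of tokens. `fake_stream` is a hypothetical stand-in for a real streaming API call, and its delays are arbitrary.

```python
import time
from typing import Dict, Iterable, Iterator


def measure_latency(token_stream: Iterable[str]) -> Dict[str, float]:
    """Measure TTFT, average time per subsequent token, and total time per query."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in token_stream:
        n_tokens += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first_token_at
    return {
        "ttft": first_token_at - start,
        # Averaged over the tokens generated after the first one.
        "time_per_token": decode_time / (n_tokens - 1) if n_tokens > 1 else 0.0,
        "time_per_query": end - start,
        "tokens": n_tokens,
    }


def fake_stream(n_tokens: int = 5, first_delay: float = 0.05,
                token_delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stream: a slow first token, then faster subsequent tokens."""
    time.sleep(first_delay)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(token_delay)
        yield "tok"


print(measure_latency(fake_stream()))
```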

Cost models:

API providers: charge per token (input + output)
Self-hosted: cost is compute (fixed hardware); cost per token decreases with scale
At high scale, self-hosted can be much cheaper than APIs
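A back-of-the-envelope comparison of the two cost models, under the simplifying assumption that self-hosting has a fixed monthly cost regardless of volume. The price and cost figures are hypothetical, not quotes from any provider.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """API cost grows linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(fixed_monthly_cost: float, price_per_million_tokens: float) -> float:
    """Monthly token volume above which the fixed-cost deployment is cheaper."""
    return fixed_monthly_cost / price_per_million_tokens * 1_000_000


# Hypothetical figures for illustration only.
API_PRICE_PER_MILLION = 2.0    # USD per 1M tokens (input + output blended)
SELF_HOSTED_MONTHLY = 6_000.0  # USD per month for hardware plus engineering

threshold = break_even_tokens(SELF_HOSTED_MONTHLY, API_PRICE_PER_MILLION)
print(f"Break-even at {threshold:,.0f} tokens per month")

for volume in (1e8, 1e9, 1e10):
    api = api_monthly_cost(volume, API_PRICE_PER_MILLION)
    cheaper = "self-hosted" if api > SELF_HOSTED_MONTHLY else "API"
    print(f"{volume:,.0f} tokens: API ${api:,.0f} vs self-hosted ${SELF_HOSTED_MONTHLY:,.0f} -> {cheaper}")
```

In practice the self-hosted cost also steps up as more hardware is added, so the break-even point should be recomputed as volume grows.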

Common GPU memory configurations (16 GB, 24 GB, 48 GB, 80 GB) drive model size choices, which may explain why many models have 7B or 65B parameters: they are sized to fit these configurations.
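A rough arithmetic check of that claim, counting weights only (no activations, KV cache, or framework overhead). The bytes-per-parameter values are assumptions: 2 bytes corresponds to 16-bit weights and 1 byte to 8-bit weights; the notes do not say which precision is intended.

```python
from typing import Optional, Sequence


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the weights alone: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def smallest_fitting_gpu(
    params_billion: float,
    bytes_per_param: float,
    gpu_sizes_gb: Sequence[int] = (16, 24, 48, 80),
) -> Optional[int]:
    """Smallest listed GPU memory size that holds the weights, or None if none does."""
    needed = weight_memory_gb(params_billion, bytes_per_param)
    for size in sorted(gpu_sizes_gb):
        if needed <= size:
            return size
    return None


print(smallest_fitting_gpu(7, 2))   # 16: 14 GB of 16-bit weights
print(smallest_fitting_gpu(65, 1))  # 80: 65 GB of 8-bit weights
print(smallest_fitting_gpu(65, 2))  # None: 130 GB exceeds a single 80 GB GPU
```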

Model Selection

Model Selection Workflow

1. Filter out models whose hard attributes (those you can’t or won’t change) don’t work for you: licenses, data privacy, model size, your own policies
2. Use public information (benchmarks, leaderboards) to narrow the candidates
3. Run private experiments with your own evaluation pipeline
4. Monitor production and collect feedback

Model Build vs. Buy (Open Source vs. API)

“Open source” terminology:

Open model: weights + training data public
Open weight: weights public, training data hidden (majority of “open source” models)
This book uses “open source” to mean open-weight

License questions to ask:

Commercial use allowed?
Restrictions for large-scale use? (Llama-2/3: >700M MAU needs special license)
Can outputs be used to train other models? (Llama licenses don’t allow it; Mistral changed its license to allow it)

7 axes for the API vs. self-host decision:

Axis	API	Self-hosted
Data privacy	Must send data externally; risk of leaks (e.g., Samsung → banned ChatGPT)	Data stays internal
Data lineage	Commercial contracts protect you from IP risk	Open source may expose you if model has IP issues
Performance	Strongest models will be proprietary	Best open source lags behind commercial
Functionality	Scaling, function calling, structured outputs out of box; usually no logprobs	Logprobs accessible; may lack function calling/structured outputs
Cost	Per-token; expensive at scale	Engineering + hardware cost; cheaper per token at scale
Control/transparency	Rate limits; risk of losing access; unpredictable model updates	Can freeze model; full control; more transparent versioning
On-device deployment	Impossible without internet	Possible but challenging

Key rule: at some scale, self-hosted is cheaper. Recalculate regularly.

Navigating Public Benchmarks

Benchmark selection challenges:

Thousands of benchmarks exist; Google’s BIG-bench alone has 214
Different leaderboards select different benchmarks — not standardized
Hugging Face leaderboard (2023): 6 benchmarks (ARC-C, MMLU, HellaSwag, TruthfulQA, WinoGrande, GSM-8K)
HELM leaderboard: 10 benchmarks — only 2 overlap with Hugging Face

Benchmark correlation matters: ARC-C, MMLU, WinoGrande are highly correlated (r > 0.85); don’t need all three. TruthfulQA is only moderately correlated with others.
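To check whether two benchmarks are redundant for a set of candidate models, compute the Pearson correlation of their scores across models. The scores below are invented for illustration; `bench_1`, `bench_2`, and `bench_3` are placeholder names, not real benchmark results.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# One score per model on each benchmark (made-up numbers, five models).
scores = {
    "bench_1": [0.50, 0.60, 0.70, 0.80, 0.90],
    "bench_2": [0.40, 0.52, 0.59, 0.71, 0.80],
    "bench_3": [0.45, 0.30, 0.50, 0.35, 0.48],
}

print(round(pearson(scores["bench_1"], scores["bench_2"]), 3))  # close to 1: redundant
print(round(pearson(scores["bench_1"], scores["bench_3"]), 3))  # weak: adds information
```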

Aggregation: Hugging Face averages scores; HELM uses mean win rate. Both have limitations (equal weight, no context on what’s harder).
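The two aggregation schemes can rank the same models differently. The sketch below uses a simplified reading of mean win rate (the fraction of other models a model beats on a benchmark, averaged over benchmarks); the models and scores are invented.

```python
from typing import Dict, List

Scores = Dict[str, List[float]]  # model name -> one score per benchmark


def mean_score(scores: Scores) -> Dict[str, float]:
    """Plain average across benchmarks; every benchmark gets equal weight."""
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


def mean_win_rate(scores: Scores) -> Dict[str, float]:
    """Per benchmark, the fraction of other models beaten; then averaged."""
    models = list(scores)
    n_benchmarks = len(scores[models[0]])
    result = {}
    for model in models:
        others = [m for m in models if m != model]
        rates = [
            sum(scores[model][i] > scores[o][i] for o in others) / len(others)
            for i in range(n_benchmarks)
        ]
        result[model] = sum(rates) / n_benchmarks
    return result


# Invented scores: model_x is far ahead on one benchmark, slightly behind on two.
scores = {
    "model_x": [0.95, 0.50, 0.50],
    "model_y": [0.60, 0.55, 0.55],
    "model_z": [0.55, 0.52, 0.52],
}

avg = mean_score(scores)
win = mean_win_rate(scores)
print(max(avg, key=avg.get))  # model_x: one large margin lifts its average
print(max(win, key=win.get))  # model_y: it wins more head-to-head comparisons
```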

Data contamination — models trained on benchmark data:

Most contamination is unintentional (internet scraping picks up public benchmarks)
Schaeffer (2023) satirically demonstrated: 1M-param model trained only on benchmark data achieves near-perfect scores
Detection methods: n-gram overlap (accurate but expensive; requires training data access), perplexity (cheap but less accurate)
OpenAI found 13 benchmarks with at least 40% of their data in GPT-3’s training data
Best practice: disclose % contamination and report performance on clean samples separately
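A minimal sketch of n-gram overlap detection, assuming you have access to the training text. A benchmark sample is flagged if any of its n-grams also appears in the training corpus. The whitespace tokenization, the value of n, and the example strings are assumptions chosen for brevity.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All n-grams of whitespace tokens in the text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark: Sequence[str], training_corpus: Sequence[str], n: int = 5
) -> Tuple[float, List[str]]:
    """Return the fraction of flagged benchmark samples and the flagged samples."""
    training_ngrams: Set[Tuple[str, ...]] = set()
    for document in training_corpus:
        training_ngrams |= ngrams(document, n)
    flagged = [s for s in benchmark if not ngrams(s, n).isdisjoint(training_ngrams)]
    return len(flagged) / len(benchmark), flagged


# Made-up data for illustration.
training_corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
benchmark = [
    "What animal does the quick brown fox jumps over in the story",
    "How many legs does a spider have in total",
]

rate, flagged = contamination_rate(benchmark, training_corpus, n=5)
clean = [s for s in benchmark if s not in flagged]
print(f"contaminated: {rate:.0%}")  # report this percentage
print(clean)                        # and evaluate on the clean samples separately
```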

Public benchmarks saturate fast — Hugging Face updated their leaderboard within a year, replacing GSM-8K with MATH lvl 5, MMLU with MMLU-Pro.

Bottom line: public benchmarks narrow the candidate list but won’t find the best model for your use case. You need your own private evaluation pipeline.

Design Your Evaluation Pipeline

Step 1: Evaluate All Components

Evaluate end-to-end AND each component independently.

Example: Resume employer extraction (PDF → text → extract employer).

If end-to-end fails, you don’t know if it’s the PDF parsing or the extraction
Evaluate each step separately

Turn-based vs. task-based evaluation:

Turn-based: quality of each response
Task-based: did the system accomplish the goal? How many turns?
Task-based is more important but harder to define boundaries for
Example: BIG-bench’s “twenty questions” game → evaluate whether model guesses concept within N turns

Step 2: Create an Evaluation Guideline

Most important step. Define:

What the application should do AND what it should NOT do (out-of-scope inputs)
What makes a “good” response (not just “correct”)
- LinkedIn Job Assessment: “You are a terrible fit” may be correct but is a bad response — good response explains gap and path to improvement

For each criterion:

Choose a scoring system: binary, 1-5, 0-1, categorical
Create rubric with examples: what does score 1, 3, 5 look like and why?
Validate rubric with humans: if humans can’t follow it, refine it

Map evaluation metrics to business metrics:

Factual consistency 80% → automate 30% of requests
Factual consistency 90% → automate 50%
Factual consistency 98% → automate 90%

Define usefulness threshold: minimum score for the application to be useful at all.

LangChain State of AI 2023: average of 2.3 criteria per application.

Step 3: Define Evaluation Methods and Data

Selecting methods:

Different criteria may need different methods (toxicity classifier + AI judge for factual consistency + semantic similarity for relevance)
Mix methods: run a cheap classifier on 100% of the data and an expensive AI judge on 1% to balance coverage and cost
Use logprobs when available (for classification confidence, perplexity-based scores)
Human evaluation as North Star: LinkedIn manually evaluates up to 500 daily conversations
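One possible way to implement the "cheap check on everything, expensive judge on a small sample" mix is to hash each request ID into a bucket, so the same request is always either in or out of the judged sample. The checker and judge below are hypothetical stubs; a real pipeline would call a classifier and an AI judge.

```python
import hashlib
from typing import Callable, Dict


def in_judge_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


def evaluate(
    request_id: str,
    response: str,
    cheap_check: Callable[[str], float],
    expensive_judge: Callable[[str], float],
    judge_rate: float = 0.01,
) -> Dict[str, float]:
    """Always run the cheap check; run the expensive judge only on the sample."""
    scores = {"cheap": cheap_check(response)}
    if in_judge_sample(request_id, judge_rate):
        scores["judge"] = expensive_judge(response)
    return scores


def stub_cheap_check(response: str) -> float:
    """Hypothetical stand-in for a toxicity classifier: 1.0 means 'looks fine'."""
    return 0.0 if "idiot" in response.lower() else 1.0


def stub_judge(response: str) -> float:
    """Hypothetical stand-in for an AI judge call."""
    return 1.0


results = [evaluate(f"req-{i}", "hello", stub_cheap_check, stub_judge) for i in range(10_000)]
judged = sum("judge" in r for r in results)
print(f"{judged} of {len(results)} responses were sent to the judge")
```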

Annotating evaluation data:

Use actual production data if possible (natural labels preferred)
Slice data by tiers, traffic source, usage patterns, error-prone scenarios, out-of-scope examples
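Watch for Simpson’s paradox: model A may outperform model B on every slice yet underperform overall when the two models’ samples are distributed differently across slices (for example, A is tested mostly on hard cases and B mostly on easy ones)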
Evaluation data annotations can be reused for finetuning data
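A small numeric illustration of Simpson's paradox with made-up counts: model A wins on both slices, but because most of A's samples fall in the hard slice and most of B's in the easy slice, B looks better overall.

```python
from typing import Dict, Tuple

# (correct, total) per slice; the counts are invented for illustration.
results: Dict[str, Dict[str, Tuple[int, int]]] = {
    "model_a": {"easy": (81, 87), "hard": (192, 263)},
    "model_b": {"easy": (234, 270), "hard": (55, 80)},
}


def slice_rate(model: str, slice_name: str) -> float:
    """Accuracy of one model on one slice."""
    correct, total = results[model][slice_name]
    return correct / total


def overall_rate(model: str) -> float:
    """Accuracy of one model pooled across all slices."""
    correct = sum(c for c, _ in results[model].values())
    total = sum(t for _, t in results[model].values())
    return correct / total


for name in ("easy", "hard"):
    print(name, round(slice_rate("model_a", name), 3), round(slice_rate("model_b", name), 3))
print("overall", round(overall_rate("model_a"), 3), round(overall_rate("model_b"), 3))
```

Reporting per-slice scores next to the overall score, and evaluating both models on the same samples, avoids being misled this way.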

How much data is needed:

Score difference to detect	Sample size for 95% confidence
30%	~10
10%	~100
3%	~1,000
1%	~10,000

Rule: 3× smaller difference → 10× more samples.
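The values in the table are close to n ≈ 1/d², where d is the score difference as a fraction. This is an observation about the numbers above rather than a formula stated in the notes, so treat the helper below as a rough rule of thumb.

```python
def approx_sample_size(score_difference: float) -> int:
    """Rough sample size for 95% confidence, assuming n is about 1 / d**2."""
    if not 0 < score_difference <= 1:
        raise ValueError("score_difference must be a fraction in (0, 1]")
    return round(1 / score_difference ** 2)


for d in (0.30, 0.10, 0.03, 0.01):
    print(f"{d:.0%} difference -> ~{approx_sample_size(d):,} samples")
```

Because n scales with 1/d², shrinking the difference by 3× multiplies the sample size by 9, which matches the "roughly 10×" rule.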

Median benchmark in lm-evaluation-harness: 1,000 examples
Inverse Scaling Prize: 300 minimum, prefer 1,000+

Bootstrap validation: draw N samples with replacement from your evaluation set multiple times; if results vary wildly across bootstraps → need a larger evaluation set.
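A sketch of the bootstrap check described above, using only the standard library. The per-example scores are made up (1 = pass, 0 = fail); the number of resamples and the 95% percentile interval are arbitrary choices for illustration.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_means(scores: Sequence[float], n_resamples: int = 500, seed: int = 0) -> List[float]:
    """Mean score of each resample drawn with replacement from the evaluation set."""
    rng = random.Random(seed)
    n = len(scores)
    return [statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)]


def bootstrap_interval(
    scores: Sequence[float], n_resamples: int = 500, seed: int = 0
) -> Tuple[float, float]:
    """Approximate 95% percentile interval of the bootstrapped mean."""
    means = sorted(bootstrap_means(scores, n_resamples, seed))
    lower = means[int(0.025 * n_resamples)]
    upper = means[min(int(0.975 * n_resamples), n_resamples - 1)]
    return lower, upper


# Two made-up evaluation sets with the same 70% pass rate but different sizes.
small_set = [1.0] * 14 + [0.0] * 6
large_set = [1.0] * 1400 + [0.0] * 600

small_lo, small_hi = bootstrap_interval(small_set)
large_lo, large_hi = bootstrap_interval(large_set)
print(f"20 examples:   {small_lo:.2f} to {small_hi:.2f}")
print(f"2000 examples: {large_lo:.2f} to {large_hi:.2f}")
```

A wide interval on the small set is the signal that the evaluation set is too small to trust small score differences.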

Evaluating your evaluation pipeline:

Do better responses get higher scores?
Do better metrics lead to better business outcomes?
Is the pipeline reproducible? (Set AI judge temperature to 0)
Are your metrics correlated? (Highly correlated → remove one; uncorrelated → interesting insight or untrustworthy metrics)
What cost/latency does evaluation add?

Iteration: update criteria, rubrics, examples as needs evolve, but maintain consistency for trend-tracking. Track all evaluation variables in experiment logs.

Key Takeaways

Evaluation-driven development: define criteria before building; tie AI metrics to business metrics
Four evaluation buckets: domain capability, generation capability, instruction-following, cost/latency
For factual consistency: AI judges, self-verification, SAFE (search-augmented), and NLI classifiers all have trade-offs; combine them
Open source vs. API isn’t a one-time decision — re-evaluate as scale and needs change; data privacy and performance are the top decision factors
Public benchmarks help filter bad models but not find the best one for you; contamination makes them unreliable for fine comparisons
Build your own evaluation pipeline: evaluate each component independently, define clear rubrics, slice your data, validate with bootstrapping
Sample size: detecting 3% difference needs ~1,000 examples; detecting 1% needs ~10,000

Study Notes by Niladri & AI
