Module 08 References
Frameworks and Tools
- RAGAS Framework
https://docs.ragas.io
The primary reference for RAG evaluation metrics: faithfulness, answer relevancy,
context precision, context recall. Includes a Python SDK and integration guides for
LangChain and LlamaIndex pipelines.
- LangSmith
https://docs.smith.langchain.com
Tracing, evaluation, and dataset management for LangChain and LangGraph applications.
Best suited to teams already using the LangChain ecosystem.
- Langfuse
https://langfuse.com/docs
Open-source LLM observability platform. Model-agnostic, self-hostable. Strong prompt
management and human annotation workflows.
- Arize Phoenix
https://phoenix.arize.com
ML observability with LLM support. Embeddings visualizer for finding failure clusters.
Works with OpenTelemetry for vendor-neutral tracing.
Official Documentation
- Anthropic Testing Guide
https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
Anthropic’s guide to building test suites for Claude-based systems. Covers golden
datasets, automated evaluation, and the empirical prompting methodology.
Research Papers
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
https://arxiv.org/abs/2306.05685
The foundational paper on using LLMs to evaluate other LLMs. Identifies key biases
(position, verbosity, self-enhancement) and proposes mitigation strategies. Essential
reading before building any LLM judge.
Key Concepts Quick Reference
RAGAS Metrics
| Metric | What it measures | Low score diagnosis |
|---|---|---|
| Faithfulness | Are answer claims supported by context? | LLM is hallucinating |
| Answer Relevancy | Does the answer address the question? | LLM is off-topic |
| Context Precision | Were retrieved chunks useful? | Retriever is noisy |
| Context Recall | Did retrieval find all needed info? | Retriever is missing docs |
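
As a rough illustration of what two of these scores mean, the sketch below treats faithfulness and context recall as simple supported-over-total ratios. It assumes the per-claim verdicts have already been produced (in practice by an LLM step); the function name `supported_fraction` and the sample verdicts are hypothetical and this is not the RAGAS SDK.

```python
from typing import Sequence


def supported_fraction(verdicts: Sequence[bool]) -> float:
    """Share of items judged as supported; 0.0 when there is nothing to judge."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: one per claim extracted from the generated answer,
# True if the retrieved context supports that claim.
answer_claims_supported = [True, True, False, True]
faithfulness = supported_fraction(answer_claims_supported)

# Hypothetical verdicts: one per statement in the reference answer,
# True if the retrieved context contains it.
reference_statements_found = [True, False]
context_recall = supported_fraction(reference_statements_found)

print(f"faithfulness={faithfulness:.2f}, context_recall={context_recall:.2f}")
```

Here one unsupported claim out of four pulls faithfulness down to 0.75 (a generation problem), while one missing reference statement out of two gives a context recall of 0.50 (a retrieval problem).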
Evaluation Types
| Type | When to use | Cost | Speed |
|---|---|---|---|
| Exact match | Short, deterministic outputs | Low | Fast |
| Semantic similarity | Open-ended, paraphrase-tolerant | Low | Fast |
| LLM-as-judge | Complex quality dimensions | Medium | Medium |
| Human eval | Ground truth, calibration | High | Slow |
| A/B test | Production prompt comparison | Medium | Days |
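
The cheapest row in the table, exact match, can be sketched in a few lines. The normalization rules below (lowercasing, stripping punctuation, collapsing whitespace) are an assumption chosen for illustration; the helper names and sample pairs are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the two strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def exact_match_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(exact_match(p, r) for p, r in pairs) / len(pairs)


# Hypothetical (prediction, reference) pairs.
pairs = [("Paris.", "paris"), ("42", "forty-two"), ("  Yes ", "yes")]
print(f"exact match rate: {exact_match_rate(pairs):.2f}")
```

The second pair shows the limit of this approach: "42" and "forty-two" mean the same thing but do not match, which is where semantic similarity or an LLM judge becomes the better fit.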
LLM Judge Biases and Mitigations
| Bias | Mitigation |
|---|---|
| Verbosity bias | Add explicit conciseness criterion; penalize padding |
| Position bias | Swap A/B order; average the scores from both orderings |
| Self-enhancement (self-serving) bias | Judge with a different model than the one that produced the output |
| Rubric drift | Add few-shot examples anchoring each score level |
| Anchoring on reference wording | State explicitly in the rubric that paraphrases of the reference answer are acceptable |
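
The position-bias mitigation can be sketched as follows. The `Judge` signature, `debiased_preference`, and the stub judge are all hypothetical; a real judge would be an LLM call returning a preference for whichever answer it is shown first.

```python
from typing import Callable

# Assumed interface: a judge takes (question, first_answer, second_answer)
# and returns its preference for the FIRST answer shown, in [0, 1].
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str,
                        answer_a: str, answer_b: str) -> float:
    """Preference for answer_a, averaged over both presentation orders."""
    a_shown_first = judge(question, answer_a, answer_b)
    b_shown_first = judge(question, answer_b, answer_a)
    # In the second call the score is for answer_b, so flip it before averaging.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def position_biased_stub(question: str, first: str, second: str) -> float:
    """Stand-in for an LLM judge with no real preference (0.5)
    plus a fixed 0.2 bonus for whichever answer appears first."""
    return 0.5 + 0.2


score = debiased_preference(position_biased_stub, "Capital of France?",
                            "Paris", "It is Paris.")
print(f"debiased preference for A: {score:.2f}")
```

A single call to the stub would report 0.7 in favor of whichever answer came first; averaging both orderings cancels the bonus and recovers the neutral 0.5. This costs two judge calls per comparison.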
Minimum Golden Dataset Sizes
| Stage | Recommended size |
|---|---|
| Early development | 20 examples |
| Pre-production | 100 examples |
| Production system | 500+ examples |
| High-stakes (medical/legal) | 1000+ with expert annotation |
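
One way to see why larger datasets matter is the sampling uncertainty on a measured pass rate. The sketch below uses the normal-approximation 95% margin of error for a proportion; the 0.8 pass rate is an assumed example value, and the approximation is itself rough at small sizes such as 20.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a pass rate measured on n examples."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


# Dataset sizes taken from the table above; 0.8 is an assumed pass rate.
for n in (20, 100, 500, 1000):
    print(f"n={n:>4}: +/- {margin_of_error(0.8, n):.3f}")
```

Under these assumptions the margin shrinks from about 0.175 at 20 examples to about 0.025 at 1000, so a 3-point change in pass rate is indistinguishable from noise on a 20-example set but starts to become visible at the larger sizes.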
Related Modules in This Curriculum
- Module 02: RAG pipeline — the system being evaluated in exercises 1-4
- Module 06: Prompt engineering — understanding what changes to test in A/B experiments
- Module 09: Advanced agents — agent trajectory evaluation extends this module’s concepts