Module 08 References

Frameworks and Tools

RAGAS Framework
https://docs.ragas.io
The primary reference for RAG evaluation metrics: faithfulness, answer relevancy,
context precision, and context recall. Includes a Python SDK and integration guides
for LangChain and LlamaIndex pipelines.

LangSmith
https://docs.smith.langchain.com
Tracing, evaluation, and dataset management for LangChain and LangGraph applications.
A natural fit for teams already using the LangChain ecosystem.

Langfuse
https://langfuse.com/docs
Open-source LLM observability platform. Model-agnostic and self-hostable, with strong
prompt management and human annotation workflows.

Arize Phoenix
https://phoenix.arize.com
ML observability with LLM support. Includes an embeddings visualizer for finding
failure clusters and works with OpenTelemetry for vendor-neutral tracing.

Anthropic Testing Guide
https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
Anthropic’s guide to building test suites for Claude-based systems. Covers golden
datasets, automated evaluation, and the empirical prompting methodology.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
https://arxiv.org/abs/2306.05685
The foundational paper on using LLMs to evaluate other LLMs. Identifies key biases
(verbosity, position, self-enhancement) and proposes mitigation strategies. Essential
reading before building any LLM judge.

Metric	What it measures	Low-score diagnosis
Faithfulness	Are answer claims supported by context?	LLM is hallucinating
Answer Relevancy	Does the answer address the question?	LLM is off-topic
Context Precision	Are relevant chunks ranked above irrelevant ones?	Retriever is noisy
Context Recall	Did retrieval find all needed info?	Retriever is missing docs
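
The three ratio-style metrics in the table above can be sketched from hand-labeled booleans. This is a simplified illustration, not the RAGAS API: the function names are hypothetical, the labels here are supplied by hand (RAGAS derives them with an LLM), and the library's exact formulas may differ in detail. Answer relevancy is omitted because it relies on embeddings rather than labels.

```python
from typing import Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall(reference_claim_found: Sequence[bool]) -> float:
    """Fraction of reference-answer claims that can be traced to the retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision(chunk_relevant: Sequence[bool]) -> float:
    """Rank-aware precision: mean of precision@k over the ranks that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Hand-labeled example (assumed data): 3 of 4 answer claims are supported,
# and the retriever ranked an irrelevant chunk first.
print(faithfulness([True, True, False, True]))
print(context_precision([False, True, True]))
```

Because context precision is rank-aware, the same set of chunks scores higher when the relevant ones come first, which is why a low score can point at ranking as well as at noise.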

Type	When to use	Cost	Speed
Exact match	Short, deterministic outputs	Low	Fast
Semantic similarity	Open-ended, paraphrase-tolerant	Low	Fast
LLM-as-judge	Complex quality dimensions	Medium	Medium
Human eval	Ground truth, calibration	High	Slow
A/B test	Production prompt comparison	Medium	Days
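
The two cheapest rows of the table above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the normalization rule is one arbitrary choice, the embedding vectors are assumed to come from whatever embedding model the pipeline already uses, and the 0.8 threshold is a hypothetical value that would need calibrating against human labels.

```python
import math
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not fail a match."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after normalization; suited to short, deterministic outputs."""
    return normalize(prediction) == normalize(reference)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantically_similar(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """Pass when the embeddings are close enough; the default threshold is an assumption."""
    return cosine_similarity(a, b) >= threshold
```

Exact match fails on any paraphrase, while the similarity check tolerates rewording but can pass answers that are topically close and factually wrong, which is where the LLM-as-judge and human rows come in.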

Bias	Mitigation
Verbosity bias	Add explicit conciseness criterion; penalize padding
Position bias	Swap A/B order; take average of both orderings
Self-enhancement bias	Use a different model as judge than the system model
Rubric drift	Add few-shot examples anchoring each score level
Anchoring (to reference wording)	State explicitly that paraphrases of the reference answer are acceptable
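
The position-bias mitigation in the table above (swap the order, average both runs) can be sketched as follows. The judge signature is a hypothetical convention chosen for this example: it returns a score in [0, 1], where higher means the judge prefers whichever answer was shown first. A real judge would wrap an LLM call; the toy judge below only exists to show the bias cancelling.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> score in [0, 1],
# where a higher score means the judge prefers the answer shown first.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Average the judge's preference for answer_a over both presentation orders."""
    a_first = judge(question, answer_a, answer_b)  # preference for A when A is shown first
    b_first = judge(question, answer_b, answer_a)  # preference for B when B is shown first
    # Convert the second run into a preference for A, then average the two runs.
    return (a_first + (1.0 - b_first)) / 2.0


def first_slot_judge(question: str, first: str, second: str) -> float:
    """Toy stand-in for an LLM judge that always favors whatever is shown first."""
    return 0.9


print(debiased_preference(first_slot_judge, "q", "answer A", "answer B"))
```

With the toy judge, the averaged preference comes out at roughly 0.5, a tie, even though each single run strongly favored the first slot. A stricter variant counts a win only when both orderings agree and treats disagreement as a tie.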

Related Modules

Module 02: RAG pipeline (the system being evaluated in exercises 1-4)
Module 06: Prompt engineering (understanding what changes to test in A/B experiments)
Module 09: Advanced agents (agent trajectory evaluation extends this module’s concepts)