Module 08 References

Frameworks and Tools

  • RAGAS Framework
    https://docs.ragas.io
    The primary reference for RAG evaluation metrics: faithfulness, answer relevancy,
    context precision, and context recall. Includes a Python SDK and integration guides
    for LangChain and LlamaIndex pipelines.

  • LangSmith
    https://docs.smith.langchain.com
    Tracing, evaluation, and dataset management for LangChain and LangGraph applications.
    A natural fit for teams already using the LangChain ecosystem.

  • LangFuse
    https://langfuse.com/docs
    Open-source LLM observability platform. Model-agnostic, self-hostable. Strong prompt
    management and human annotation workflows.

  • Arize Phoenix
    https://phoenix.arize.com
    ML observability with LLM support. Embeddings visualizer for finding failure clusters.
    Works with OpenTelemetry for vendor-neutral tracing.

Research Papers

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
    https://arxiv.org/abs/2306.05685
    The foundational paper on using LLMs to evaluate other LLMs. Identifies key biases
    (verbosity, position, and self-enhancement, the last called self-serving bias in
    this module) and proposes mitigation strategies. Essential reading before building
    any LLM judge.

Key Concepts Quick Reference

RAGAS Metrics

Metric             What it measures                         Low score diagnosis
-----------------  ---------------------------------------  -------------------------
Faithfulness       Are answer claims supported by context?  LLM is hallucinating
Answer Relevancy   Does the answer address the question?    LLM is off-topic
Context Precision  Were retrieved chunks useful?            Retriever is noisy
Context Recall     Did retrieval find all needed info?      Retriever is missing docs
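
The diagnosis column above can be turned into a small triage helper. The following is a minimal sketch, not part of the RAGAS SDK: the metric key names, the `diagnose` function, and the 0.7 threshold are all assumptions chosen for illustration, and the scores are made-up example values.

```python
# Hypothetical cutoff for a "low" score; tune it against your own golden dataset.
LOW_SCORE_THRESHOLD = 0.7

# Diagnoses copied from the RAGAS Metrics table; the key names are illustrative.
DIAGNOSES = {
    "faithfulness": "LLM is hallucinating",
    "answer_relevancy": "LLM is off-topic",
    "context_precision": "Retriever is noisy",
    "context_recall": "Retriever is missing docs",
}


def diagnose(scores, threshold=LOW_SCORE_THRESHOLD):
    """Return one diagnosis string per known metric scoring below the threshold."""
    return [
        f"{metric}: {DIAGNOSES[metric]}"
        for metric, value in scores.items()
        if metric in DIAGNOSES and value < threshold
    ]


# Made-up scores for a single evaluation run.
example_scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.41,
    "context_recall": 0.83,
}

for line in diagnose(example_scores):
    print(line)
```

Splitting the result by metric points at the component to fix: the two answer-side metrics implicate generation, while the two context-side metrics implicate retrieval.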

Evaluation Types

Type                 When to use                      Cost    Speed
-------------------  -------------------------------  ------  ------
Exact match          Short, deterministic outputs     Low     Fast
Semantic similarity  Open-ended, paraphrase-tolerant  Low     Fast
LLM-as-judge         Complex quality dimensions       Medium  Medium
Human eval           Ground truth, calibration        High    Slow
A/B test             Production prompt comparison     Medium  Days
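
The two cheapest evaluation types above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions: the whitespace and case normalization in `exact_match` is one possible convention, and `cosine_similarity` expects embedding vectors produced elsewhere by whatever embedding model you use (none is called here). Both function names are hypothetical.

```python
import math


def exact_match(prediction, reference):
    """Exact match after lowercasing and collapsing whitespace (an assumed convention)."""
    def normalize(text):
        return " ".join(text.lower().split())
    return normalize(prediction) == normalize(reference)


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity to anything.
    return dot / (norm_a * norm_b)


# Exact match suits short, deterministic outputs.
print(exact_match("  Paris ", "paris"))
print(exact_match("Paris, France", "paris"))

# Semantic similarity compares embeddings; these vectors are toy stand-ins.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The second `exact_match` call fails even though the answer is correct, which is the case where a paraphrase-tolerant method such as semantic similarity becomes the better choice.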

LLM Judge Biases and Mitigations

Bias               Mitigation
-----------------  ----------------------------------------------------
Verbosity bias     Add explicit conciseness criterion; penalize padding
Position bias      Swap A/B order; take average of both orderings
Self-serving bias  Use a different model as judge than the system model
Rubric drift       Add few-shot examples anchoring each score level
Anchoring          Explicitly state paraphrases are acceptable
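
The position-bias mitigation above (swap the A/B order and average both orderings) can be sketched as follows. The `judge` callable is hypothetical: it is assumed to take a question and two answers and return a preference in [0, 1] for whichever answer is shown first. A real judge would wrap an LLM call; the toy judge here exists only to show the effect.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Average a pairwise judge's preference for answer_a over both orderings.

    `judge(question, first, second)` is a hypothetical callable returning a
    float in [0, 1]: the judge's preference for `first` over `second`.
    """
    forward = judge(question, answer_a, answer_b)   # answer_a shown first
    backward = judge(question, answer_b, answer_a)  # answer_b shown first
    # `backward` is a preference for answer_b, so invert it before averaging.
    return (forward + (1.0 - backward)) / 2.0


def first_slot_biased_judge(question, first, second):
    """Toy judge that always favors whichever answer appears first."""
    return 0.8


score = debiased_preference(
    first_slot_biased_judge, "What is RAG?", "answer A", "answer B"
)
print(score)
```

With a judge that favors the first slot regardless of content, the two orderings cancel and the averaged preference comes out neutral, so a purely positional preference no longer decides the comparison.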

Minimum Golden Dataset Sizes

Stage                        Recommended size
---------------------------  ----------------------------
Early development            20 examples
Pre-production               100 examples
Production system            500+ examples
High-stakes (medical/legal)  1000+ with expert annotation

Related Modules

  • Module 02: RAG pipeline (the system being evaluated in exercises 1-4)
  • Module 06: Prompt engineering (understanding what changes to test in A/B experiments)
  • Module 09: Advanced agents (agent trajectory evaluation extends this module’s concepts)