Implementation Guide — Evaluation Examples

This directory originally contained Python example scripts that were removed from this archive.
An AI assistant can recreate them by following the instructions below.


llm_as_judge.py

Purpose: Use Claude as an automated evaluator (LLM-as-judge) to score model outputs.

What to implement:

  1. Define a rubric with dimensions: correctness (0–3), completeness (0–3), conciseness (0–2), tone (0–2).
  2. Create a judge prompt: “Given the question, reference answer, and candidate answer, score the candidate on each dimension. Return JSON: {correctness, completeness, conciseness, tone, reasoning}.”
  3. For each of 5 hardcoded sample QA pairs, have the judge score the responses of two “models” (these can be hardcoded strings simulating different quality levels).
  4. Parse scores with json.loads(), aggregate per-model averages, and print a comparison table.
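  5. Show how to detect position bias: ask the judge to compare both candidates in one prompt, repeat with the candidate order swapped, and flag cases where the preferred candidate changes.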
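
Steps 1, 2, and 4 can be sketched roughly as follows. This is a minimal sketch, not the removed script: the helper names (build_judge_prompt, parse_scores, average_scores, format_table), the prompt field layout, and the SAMPLE_REPLIES strings are all hypothetical. The hardcoded replies stand in for what the judge would return, so the block runs without an API key.

```python
import json

# Rubric from step 1: dimension name -> maximum score.
RUBRIC = {"correctness": 3, "completeness": 3, "conciseness": 2, "tone": 2}


def build_judge_prompt(question, reference, candidate):
    """Assemble the judge prompt from step 2 (the field layout is an assumption)."""
    dims = ", ".join(f"{name} (0-{top})" for name, top in RUBRIC.items())
    return (
        "Given the question, reference answer, and candidate answer, "
        f"score the candidate on each dimension: {dims}.\n"
        "Return JSON: {correctness, completeness, conciseness, tone, reasoning}.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}"
    )


def parse_scores(raw):
    """Parse the judge's JSON reply and check every score against the rubric range."""
    data = json.loads(raw)
    scores = {}
    for name, top in RUBRIC.items():
        value = data[name]
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= top:
            raise ValueError(f"{name} must be an integer in 0..{top}, got {value!r}")
        scores[name] = value
    return scores


def average_scores(score_dicts):
    """Average each rubric dimension over a list of parsed score dicts."""
    n = len(score_dicts)
    return {name: sum(s[name] for s in score_dicts) / n for name in RUBRIC}


def format_table(per_model):
    """Render per-model averages as a fixed-width comparison table."""
    header = f"{'model':<10}" + "".join(f"{name:>14}" for name in RUBRIC) + f"{'total':>8}"
    lines = [header]
    for model, avg in per_model.items():
        row = f"{model:<10}" + "".join(f"{avg[name]:>14.2f}" for name in RUBRIC)
        lines.append(row + f"{sum(avg.values()):>8.2f}")
    return "\n".join(lines)


# Hypothetical judge replies (illustration only, not real model output).
SAMPLE_REPLIES = {
    "model_a": [
        '{"correctness": 3, "completeness": 2, "conciseness": 2, "tone": 2, "reasoning": "matches reference"}',
        '{"correctness": 2, "completeness": 3, "conciseness": 1, "tone": 2, "reasoning": "correct but wordy"}',
    ],
    "model_b": [
        '{"correctness": 1, "completeness": 1, "conciseness": 2, "tone": 1, "reasoning": "partly wrong"}',
        '{"correctness": 2, "completeness": 0, "conciseness": 2, "tone": 1, "reasoning": "misses key points"}',
    ],
}

per_model = {
    model: average_scores([parse_scores(reply) for reply in replies])
    for model, replies in SAMPLE_REPLIES.items()
}
print(format_table(per_model))
```

In the real script, the strings in SAMPLE_REPLIES would be replaced by the text of the judge's response to build_judge_prompt(...). Validating the range in parse_scores is a design choice that surfaces malformed judge output early instead of letting it skew the averages.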

How to run: python llm_as_judge.py
Dependencies: anthropic
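
For the position-bias check in step 5 of llm_as_judge.py, one possible approach is to ask a pairwise question twice with the candidates in swapped slots and compare the verdicts. The sketch below assumes a pairwise prompt with "Answer A" / "Answer B" slots and a judge passed in as a plain callable (prompt in, "A" or "B" out); both are illustrative choices, and the two stub judges stand in for a real Claude call.

```python
def pairwise_prompt(question, first, second):
    """Hypothetical pairwise prompt: the judge must answer 'A' or 'B'."""
    return (
        f"Question: {question}\n"
        f"Answer A: {first}\n"
        f"Answer B: {second}\n"
        "Which answer is better? Reply with exactly A or B."
    )


def check_position_bias(judge, question, cand_1, cand_2):
    """Query the judge twice with swapped order and map verdicts back to candidates."""
    forward = judge(pairwise_prompt(question, cand_1, cand_2)).strip().upper()
    reverse = judge(pairwise_prompt(question, cand_2, cand_1)).strip().upper()
    # In the forward prompt slot A holds cand_1; in the reverse prompt it holds cand_2.
    winner_forward = "cand_1" if forward == "A" else "cand_2"
    winner_reverse = "cand_2" if reverse == "A" else "cand_1"
    return {
        "forward": winner_forward,
        "reverse": winner_reverse,
        "consistent": winner_forward == winner_reverse,
    }


def always_first_judge(prompt):
    """Stub judge with maximal position bias: always prefers slot A."""
    return "A"


def longer_answer_judge(prompt):
    """Stub judge that prefers the longer answer, whatever slot it is in."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    return "A" if len(fields["Answer A"]) >= len(fields["Answer B"]) else "B"


question = "What is the capital of France?"
short, long_ = "Paris", "Paris is the capital of France."
print(check_position_bias(always_first_judge, question, short, long_))
print(check_position_bias(longer_answer_judge, question, short, long_))
```

A verdict that flips when the order is swapped ("consistent" is False) points at position bias for that pair. Counting such flips across all sample pairs gives a rough bias rate.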


rag_eval_simple.py

Purpose: Evaluate a RAG pipeline on faithfulness and answer relevance.

What to implement:

  1. Define a small golden dataset: 5 {question, context_chunks[], ground_truth_answer} entries (hardcoded).
  2. Faithfulness: For each generated answer, ask Claude: “Does this answer contain only information supported by the given context? Score 0 (hallucinated) or 1 (faithful).”
  3. Answer relevance: Ask Claude: “Does this answer address the question? Score 0–2.”
  4. Context recall: For each sentence of the ground-truth answer, check whether it can be derived from the retrieved chunks (simple keyword overlap or another LLM call), and report the fraction of supported sentences.
  5. Print a per-question breakdown and overall scores.
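
The keyword-overlap variant of context recall (step 4) could look something like the sketch below. The stopword list, the 0.6 overlap threshold, the regex-based sentence splitter, and the function names are all assumptions chosen for illustration. A real script might prefer an LLM call, since keyword overlap misses paraphrases.

```python
import re

# Small illustrative stopword list; a real script would likely use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "in", "on", "to", "and", "it", "by", "at"}


def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_recall(ground_truth, chunks, threshold=0.6):
    """Fraction of ground-truth sentences whose content words mostly occur in the chunks."""
    chunk_words = content_words(" ".join(chunks))
    sentences = split_sentences(ground_truth)
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = content_words(sentence)
        if words and len(words & chunk_words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


# Hypothetical example entry (illustration only).
ground_truth = "The Eiffel Tower is in Paris. It was completed in 1889."
partial_chunks = ["The Eiffel Tower is a wrought-iron tower in Paris, France."]
full_chunks = partial_chunks + ["Construction was completed in 1889."]

print(context_recall(ground_truth, partial_chunks))  # only the first sentence is covered
print(context_recall(ground_truth, full_chunks))     # both sentences are covered
```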

How to run: python rag_eval_simple.py
Dependencies: anthropic
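
The faithfulness and answer-relevance steps of rag_eval_simple.py (steps 2, 3, and 5) might be wired together as sketched below. Several things here are assumptions rather than part of the original script: the generated_answer field on each entry, the "Reply with the number only" instruction appended to each prompt, the helper names, and the canned-reply judge that stands in for Claude so the example runs offline.

```python
FAITHFULNESS_PROMPT = (
    "Does this answer contain only information supported by the given context? "
    "Score 0 (hallucinated) or 1 (faithful). Reply with the number only.\n\n"
    "Context:\n{context}\n\nAnswer: {answer}"
)
RELEVANCE_PROMPT = (
    "Does this answer address the question? Score 0-2. Reply with the number only.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)


def parse_score(reply, max_score):
    """Convert a judge reply to an int and check it lies in 0..max_score."""
    score = int(reply.strip())
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} outside 0..{max_score}")
    return score


def evaluate(entries, ask):
    """Score every entry; `ask` is any callable mapping a prompt string to a reply string."""
    rows = []
    for entry in entries:
        context = "\n".join(entry["context_chunks"])
        answer = entry["generated_answer"]
        faith = parse_score(ask(FAITHFULNESS_PROMPT.format(context=context, answer=answer)), 1)
        rel = parse_score(ask(RELEVANCE_PROMPT.format(question=entry["question"], answer=answer)), 2)
        rows.append({"question": entry["question"], "faithfulness": faith, "relevance": rel})
    return rows


def summarize(rows):
    """Mean faithfulness (0..1) and mean relevance (0..2) across all questions."""
    n = len(rows)
    return {
        "faithfulness": sum(r["faithfulness"] for r in rows) / n,
        "relevance": sum(r["relevance"] for r in rows) / n,
    }


def make_canned_judge(replies):
    """Stand-in for the Claude call: returns the given replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


# Hypothetical entries; generated_answer would come from the RAG pipeline under test.
ENTRIES = [
    {"question": "What is the capital of France?",
     "context_chunks": ["Paris is the capital of France."],
     "generated_answer": "Paris."},
    {"question": "When was the Eiffel Tower completed?",
     "context_chunks": ["The Eiffel Tower was completed in 1889."],
     "generated_answer": "It was completed in 1999."},
]

rows = evaluate(ENTRIES, make_canned_judge(["1", "2", "0", "1"]))
for row in rows:
    print(f"{row['question']:<40} faithfulness={row['faithfulness']} relevance={row['relevance']}")
print(summarize(rows))
```

Passing the judge in as a callable keeps the scoring logic testable without network access; the real script would supply a function that sends the prompt to Claude and returns the reply text.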