Implementation Guide — Evaluation Examples
This directory originally contained Python example scripts that were removed from this archive.
An AI assistant can recreate them by following the instructions below.
llm_as_judge.py
Purpose: Use Claude as an automated evaluator (LLM-as-judge) to score model outputs.
What to implement:
- Define a rubric with dimensions:
correctness (0–3),completeness (0–3),conciseness (0–2),tone (0–2). - Create a judge prompt: “Given the question, reference answer, and candidate answer, score the candidate on each dimension. Return JSON:
{correctness, completeness, conciseness, tone, reasoning}.” - Run 5 sample QA pairs (hardcoded) through two “model” responses (can be hardcoded strings simulating different quality levels).
- Parse scores with
json.loads(), aggregate per-model averages, and print a comparison table. - Show how to detect position bias by swapping candidate order in the prompt.
How to run: python llm_as_judge.py
Dependencies: anthropic
rag_eval_simple.py
Purpose: Evaluate a RAG pipeline on faithfulness and answer relevance.
What to implement:
- Define a small golden dataset: 5
{question, context_chunks[], ground_truth_answer}entries (hardcoded). - Faithfulness: For each generated answer, ask Claude: “Does this answer contain only information supported by the given context? Score 0 (hallucinated) or 1 (faithful).”
- Answer relevance: Ask Claude: “Does this answer address the question? Score 0–2.”
- Context recall: For each ground-truth answer sentence, check if it could be derived from retrieved chunks (simple keyword overlap or another LLM call).
- Print a per-question breakdown and overall scores.
How to run: python rag_eval_simple.py
Dependencies: anthropic