Implementation Guide — Evaluation Examples

This directory originally contained Python example scripts that were removed from this archive.
An AI assistant can recreate them by following the instructions below.

`llm_as_judge.py`

Purpose: Use Claude as an automated evaluator (LLM-as-judge) to score model outputs.

What to implement:

Define a rubric with dimensions: correctness (0–3), completeness (0–3), conciseness (0–2), tone (0–2).
Create a judge prompt: “Given the question, reference answer, and candidate answer, score the candidate on each dimension. Return JSON: {correctness, completeness, conciseness, tone, reasoning}.”
Run 5 sample QA pairs (hardcoded) through two “model” responses (can be hardcoded strings simulating different quality levels).
Parse scores with json.loads(), aggregate per-model averages, and print a comparison table.
Show how to detect position bias by swapping candidate order in the prompt.

How to run: python llm_as_judge.py
Dependencies: anthropic

`rag_eval_simple.py`

Purpose: Evaluate a RAG pipeline on faithfulness and answer relevance.

What to implement:

Define a small golden dataset: 5 {question, context_chunks[], ground_truth_answer} entries (hardcoded).
Faithfulness: For each generated answer, ask Claude: “Does this answer contain only information supported by the given context? Score 0 (hallucinated) or 1 (faithful).”
Answer relevance: Ask Claude: “Does this answer address the question? Score 0–2.”
Context recall: For each ground-truth answer sentence, check if it could be derived from retrieved chunks (simple keyword overlap or another LLM call).
Print a per-question breakdown and overall scores.

How to run: python rag_eval_simple.py
Dependencies: anthropic

Study Notes by Niladri & AI

Explorer

IMPLEMENTATION_GUIDE

Implementation Guide — Evaluation Examples

`llm_as_judge.py`

`rag_eval_simple.py`

Graph View

Table of Contents