RAG Exercises

Five exercises ranging from implementation to system design. Each is designed to
take 30–90 minutes and build directly on concepts from the main README.

Exercise 1: RAG Over a Local Folder of Text Files

Goal: Build a complete RAG pipeline that reads .txt files from a directory,
chunks them, indexes them, and answers questions.

Concepts practiced: Document loading, chunking, indexing, retrieval, generation.

What to Build

Create a script solutions/ex1_folder_rag.py that:

1. Accepts a folder path as a CLI argument: python ex1_folder_rag.py ./docs/
2. Recursively finds all .txt files in the folder
3. Reads each file and applies a chunking strategy of your choice
4. Adds each chunk to a Chroma collection, storing the source filename as metadata
5. Enters an interactive loop: prompts user for a query, retrieves top-3 chunks,
   generates an answer with Claude, prints source filenames with the answer
6. Handles edge cases: empty folder, no matching files, very short files

Starter Code Skeleton

import sys
import pathlib
import chromadb
import anthropic
 
def load_txt_files(folder: str) -> list[dict]:
    """Returns list of {filename, text} dicts."""
    ...
 
def chunk_document(text: str, filename: str) -> list[dict]:
    """Returns list of {id, text, source} dicts."""
    ...
 
def build_index(chunks: list[dict]) -> chromadb.Collection:
    ...
 
def answer_query(query: str, collection, client: anthropic.Anthropic) -> str:
    ...
 
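def main():
    folder = sys.argv[1] if len(sys.argv) > 1 else "./docs"
    chunks = []
    for doc in load_txt_files(folder):
        chunks.extend(chunk_document(doc["text"], doc["filename"]))
    collection = build_index(chunks)
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    while True:
        query = input("\nQuery (or 'quit'): ").strip()
        if query.lower() == "quit":
            break
        # answer_query takes the client as its third argument
        print(answer_query(query, collection, client))


if __name__ == "__main__":
    main()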
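
One possible way to fill in load_txt_files from the skeleton is sketched below. It assumes the files are UTF-8 text and skips files shorter than MIN_CHARS as a way of handling the "very short files" edge case; MIN_CHARS and its value are hypothetical and not part of the exercise.

```python
import pathlib

MIN_CHARS = 20  # hypothetical cutoff for "very short" files; tune for your corpus


def load_txt_files(folder: str) -> list[dict]:
    """Return a list of {"filename", "text"} dicts for every .txt file under folder."""
    root = pathlib.Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {folder}")

    docs = []
    # rglob descends into subfolders; sorting keeps the order deterministic.
    for path in sorted(root.rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if len(text) < MIN_CHARS:
            continue  # skip empty or near-empty files
        docs.append({"filename": str(path.relative_to(root)), "text": text})
    return docs
```

An empty folder or a folder with no .txt files yields an empty list, so main can check for that and exit with a message before building the index.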

Test Dataset

Create a solutions/test_docs/ folder with 3–5 .txt files on any topic
(Wikipedia article excerpts work well).

Evaluation Criteria

Does it correctly load and chunk multiple files?
Does the answer mention which source file the information came from?
Does it gracefully handle a query that no document can answer?

Exercise 2: Compare Three Chunking Strategies on the Same Document

Goal: See concretely how chunking affects retrieval quality.

Concepts practiced: Fixed-size chunking, sentence chunking, semantic chunking,
retrieval evaluation.

What to Build

Create solutions/ex2_chunking_comparison.py:

1. Load one long document (at least 2000 words). A Wikipedia article export works well.
2. Implement three chunking strategies:
   - Fixed-size: 300-character chunks, 30-character overlap
   - Sentence-based: group sentences into chunks of max 5 sentences
   - Semantic: split when cosine similarity between adjacent sentence embeddings
     drops below 0.5
3. For each strategy, build a separate Chroma collection
4. Define 5 test queries that have clear, specific answers in the document
5. For each query and each chunking strategy:
   - Retrieve top-3 chunks
   - Print whether the relevant passage was in the top-3 (manual inspection)
   - Print the retrieved chunk that contains the answer (if any)
6. Print a summary table:

Query                       | Fixed  | Sentence | Semantic
"Who founded X?"            |  YES   |   YES    |   YES
"What happened in 1995?"    |   NO   |   YES    |   YES
"How does Y work?"          |  YES   |   NO     |   YES
...
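
The three strategies can be sketched in plain Python as a starting point. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sentence splitter is a naive regex that will mis-split abbreviations, and semantic_chunks expects sentence embeddings that you compute separately with whatever embedding model you use.

```python
import math
import re


def fixed_size_chunks(text: str, size: int = 300, overlap: int = 30) -> list[str]:
    """Slide a window of `size` characters, advancing by size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += size - overlap
    return chunks


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, max_sentences: int = 5) -> list[str]:
    """Group consecutive sentences into chunks of at most max_sentences."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str], vectors: list[list[float]], threshold: float = 0.5
) -> list[str]:
    """Start a new chunk wherever adjacent sentence embeddings fall below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            groups.append([])
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]
```

Because fixed_size_chunks cuts on character offsets, a sentence that straddles offset 300 ends up divided between two chunks, with only the 30 overlapping characters shared.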

Key Observation to Find

Expect fixed-size chunking to produce at least one “split” failure: a relevant
passage that spans a chunk boundary, so no single chunk contains it in full.
Document this in a code comment.

Extension Challenge

Add a fourth chunking strategy: hierarchical (parent=800 chars, child=150 chars).
Use child chunks for indexing but return parent chunks to the LLM.
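
A rough sketch of the parent/child bookkeeping follows. It assumes character-based splitting with no overlap, and the names hierarchical_chunks and parents_for_hits are hypothetical. The idea is to index only the child texts and keep a parent_id on each child so that retrieved children can be mapped back to their parents.

```python
def hierarchical_chunks(
    text: str, parent_size: int = 800, child_size: int = 150
) -> tuple[dict[str, str], list[dict]]:
    """Split text into parents, then split each parent into children."""
    parents: dict[str, str] = {}
    children: list[dict] = []
    for p_idx, p_start in enumerate(range(0, len(text), parent_size)):
        parent_id = f"p{p_idx}"
        parent_text = text[p_start:p_start + parent_size]
        parents[parent_id] = parent_text
        for c_idx, c_start in enumerate(range(0, len(parent_text), child_size)):
            children.append({
                "id": f"{parent_id}-c{c_idx}",
                "text": parent_text[c_start:c_start + child_size],
                "parent_id": parent_id,  # store this as metadata in the index
            })
    return parents, children


def parents_for_hits(hits: list[dict], parents: dict[str, str]) -> list[str]:
    """Map retrieved children to unique parent texts, keeping rank order."""
    seen: list[str] = []
    for child in hits:
        if child["parent_id"] not in seen:
            seen.append(child["parent_id"])
    return [parents[pid] for pid in seen]
```

Deduplicating parents matters because several top-ranked children often share one parent, and sending the same 800-character parent to the LLM twice wastes context.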

Exercise 3: Add Hybrid Retrieval to the Basic RAG Example

Goal: Extend examples/basic_rag.py with BM25 and reciprocal rank fusion (RRF).

Concepts practiced: Sparse retrieval, BM25 scoring, RRF, comparing retrieval methods.

What to Build

Create solutions/ex3_hybrid_upgrade.py starting from the basic RAG code:

Add a BM25Retriever class wrapping rank_bm25.BM25Okapi
Implement reciprocal_rank_fusion(bm25_results, dense_results, k=60)
In the main query loop, run both retrievers and fuse with RRF
Add a --mode CLI flag: python ex3_hybrid_upgrade.py --mode dense|bm25|hybrid
For a fixed set of test queries, print which chunks each mode retrieves
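
For reference, RRF scores each chunk by summing 1 / (k + rank) over every ranked list it appears in. The sketch below uses the signature named above and assumes each argument is a list of chunk IDs ordered best first; the return type (a list of (chunk_id, score) pairs) is a choice made here, not something the exercise prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    bm25_results: list[str], dense_results: list[str], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse two rankings; each input is a list of chunk IDs, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_results, dense_results):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A chunk that appears in both lists accumulates two terms, so it tends to outrank a chunk that only one retriever found, even if that retriever ranked it first.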

Specific Test Cases

Design at least two queries that illustrate the difference:

Query A (BM25 wins): A query containing a specific term that appears verbatim
in only one chunk. Dense retrieval may match a semantically related but wrong chunk.
Example: if your corpus contains “BDFL” as an acronym, query for “BDFL” — BM25 will
find it; dense may not.

Query B (Dense wins): A conceptual query where the answer uses different
vocabulary. Example: query for “Python’s design principle around readable code” when
the document says “emphasizes code readability” (no direct keyword overlap).

What to Submit

A markdown comment block at the top of the file explaining which mode won for each
test query and why.

Exercise 4: Implement a Simple Faithfulness Evaluator

Goal: Build a lightweight faithfulness checker without using RAGAS.

Concepts practiced: LLM-as-judge evaluation, faithfulness metric, prompt engineering.

What to Build

Create solutions/ex4_faithfulness_eval.py:

A faithfulness evaluator takes (question, context, answer) and returns a score 0–1
representing what fraction of the claims in the answer are supported by the context.

Step 1: Claim Extraction
Use Claude to extract atomic claims from the answer:

def extract_claims(answer: str) -> list[str]:
    """
    Prompt Claude to break the answer into atomic factual claims.
    Example: "Python was created in 1991 by Guido van Rossum."
    → ["Python was created in 1991", "Python was created by Guido van Rossum"]
    """
    ...

Step 2: Claim Verification
For each claim, ask Claude: “Is this claim supported by the context? Answer YES or NO.”

def verify_claim(claim: str, context: str) -> bool:
    ...
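
The part of verification that does not depend on the API can be sketched as follows. This is only one possible shape: check_claim, parse_verdict, VERIFY_PROMPT and the ask_llm parameter are all hypothetical names. ask_llm stands for any function that sends a prompt string to Claude and returns the reply text, which keeps the parsing logic testable without network access.

```python
from typing import Callable

# Hypothetical prompt template built around the YES/NO question above.
VERIFY_PROMPT = (
    "Context:\n{context}\n\n"
    "Claim: {claim}\n\n"
    "Is this claim supported by the context? Answer YES or NO."
)


def parse_verdict(reply: str) -> bool:
    """Count the claim as supported only if the reply starts with YES."""
    return reply.strip().upper().startswith("YES")


def check_claim(claim: str, context: str, ask_llm: Callable[[str], str]) -> bool:
    """Build the prompt, call the injected LLM function, parse the verdict."""
    prompt = VERIFY_PROMPT.format(context=context, claim=claim)
    return parse_verdict(ask_llm(prompt))
```

In the real script, verify_claim(claim, context) could delegate to check_claim with a function that wraps your Claude call. Treating anything other than a leading YES as unsupported is a conservative default: an ambiguous reply lowers the score instead of inflating it.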

Step 3: Score

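def faithfulness_score(claims: list[str], context: str) -> float:
    """Fraction of claims that verify_claim marks as supported by the context."""
    supported = sum(verify_claim(c, context) for c in claims)
    # An answer with no extractable claims scores 0.0 rather than dividing by zero.
    return supported / len(claims) if claims else 0.0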

Step 4: Test It
Define 3 test cases:

One perfectly faithful answer (all claims in context)
One partially faithful answer (some claims from context, some hallucinated)
One hallucinated answer (claims not present in context)

Expected output:

Test 1 — Perfect faithfulness:
  Claims: ["Python was released in 1991", "Created by Guido van Rossum"]
  Supported: 2/2
  Score: 1.00

Test 2 — Partial faithfulness:
  Claims: ["Python was released in 1991", "Created by Tim Peters"]
  Supported: 1/2
  Score: 0.50

Test 3 — Hallucination:
  Claims: ["Python was released in 1975", "Created by Dennis Ritchie"]
  Supported: 0/2
  Score: 0.00

Extension Challenge

Run your faithfulness evaluator on the output of examples/basic_rag.py for 10
different queries. Report the average faithfulness score. Identify any queries where
Claude hallucinated despite correct context being retrieved.

Exercise 5 (Interview Simulation): Design a RAG System for a Legal Firm’s 50-Year Document Archive

This is a mock system design interview. No code required — write a design document.

Format: Write your answer in solutions/ex5_design.md. Aim for 600–1000 words.
Treat it as if you are speaking in a 45-minute interview.

The Prompt

“A law firm has 50 years of case files, contracts, briefs, and legal opinions —
approximately 2 million documents, mostly scanned PDFs. They want to build an
internal assistant that associates can query to find relevant precedents, contract
clauses, and case history. Walk me through every decision you would make.”

Questions to Address

1. Document Ingestion

How do you handle scanned PDFs? What OCR tooling would you use?
How do you handle document quality variation (old scans, handwritten notes)?
How do you extract metadata (case number, date, parties, jurisdiction)?

2. Chunking Strategy

What chunking strategy would you choose for legal documents?
Legal documents have natural structure: clauses, sections, paragraphs. How do you
exploit that structure?
How would you handle very long briefs (200+ pages)?

3. Embedding Model

Would you use a general-purpose embedding model or a legal-domain-specific one?
What are the trade-offs of a model fine-tuned on legal text (e.g., LegalBERT)?
Would you use a hosted API or self-hosted model? Why?

4. Vector Database

Which vector database would you choose for 2M documents?
How would you handle metadata filtering? (e.g., “only search contracts from 2018–2023
in California jurisdiction”)
How would you handle access control? (not all associates should see all documents)

5. Retrieval Strategy

Would you use hybrid retrieval? Why?
Legal search heavily relies on exact citations (“42 U.S.C. § 1983”). How does this
affect your choice?
Would you implement reranking? What model?

6. Failure Modes

A user asks: “Find all cases where a non-compete clause was ruled unenforceable in
New York.” This is a multi-hop query. How does your system handle it?
How do you handle the case where the relevant precedent exists but uses different
terminology than the query?

7. Evaluation

How would you evaluate this system with domain experts?
What metrics would you track in production?

8. Compliance and Privacy

The documents contain privileged attorney-client communications. What safeguards
do you need?
How do you ensure the system does not “leak” one client’s documents when answering
questions about another client?

Evaluation Rubric (Self-Score)

Area                     | Strong Answer
OCR + ingestion          | Mentions PDF parsing, OCR quality, metadata extraction strategy
Chunking                 | Exploits legal document structure (sections, clauses)
Embeddings               | Discusses legal-domain models vs general-purpose trade-offs
Vector DB                | Names a specific DB and justifies with scale + filtering needs
Retrieval                | Hybrid + reranking + citation-exact matching (BM25 critical here)
Multi-hop                | Agentic retrieval or iterative query decomposition
Evaluation               | Mentions domain expert golden set + production metrics
Privacy + access control | Row-level security, metadata-based filtering, no cross-client leak

A strong answer addresses all 8 areas, explains the reasoning behind each choice
(not just what, but why), and acknowledges trade-offs honestly.

Study Notes by Niladri & AI
