RAG Exercises

Five exercises ranging from implementation to system design. Each is designed to
take 30–90 minutes and build directly on concepts from the main README.


Exercise 1: RAG Over a Local Folder of Text Files

Goal: Build a complete RAG pipeline that reads .txt files from a directory,
chunks them, indexes them, and answers questions.

Concepts practiced: Document loading, chunking, indexing, retrieval, generation.

What to Build

Create a script solutions/ex1_folder_rag.py that:

  1. Accepts a folder path as a CLI argument: python ex1_folder_rag.py ./docs/
  2. Recursively finds all .txt files in the folder
  3. Reads each file and applies a chunking strategy of your choice
  4. Adds each chunk to a Chroma collection, storing the source filename as metadata
  5. Enters an interactive loop: prompts the user for a query, retrieves the top-3 chunks,
    generates an answer with Claude, and prints the source filenames alongside the answer
  6. Handles edge cases: empty folder, no matching files, very short files

Starter Code Skeleton

import sys
import pathlib
import chromadb
import anthropic
 
def load_txt_files(folder: str) -> list[dict]:
    """Returns list of {filename, text} dicts."""
    ...
 
def chunk_document(text: str, filename: str) -> list[dict]:
    """Returns list of {id, text, source} dicts."""
    ...
 
def build_index(chunks: list[dict]) -> chromadb.Collection:
    ...
 
def answer_query(query: str, collection, client: anthropic.Anthropic) -> str:
    ...
 
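def main():
    folder = sys.argv[1] if len(sys.argv) > 1 else "./docs"
    chunks = []
    for doc in load_txt_files(folder):
        chunks.extend(chunk_document(doc["text"], doc["filename"]))
    collection = build_index(chunks)
    # The client reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()
    while True:
        query = input("\nQuery (or 'quit'): ").strip()
        if query.lower() == "quit":
            break
        # answer_query takes the client as its third argument.
        print(answer_query(query, collection, client))


if __name__ == "__main__":
    main()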
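
One possible way to fill in load_txt_files from the skeleton above is sketched here. It is a minimal version, not the reference solution: the min_chars parameter and the choice to store paths relative to the folder are assumptions added for illustration, and they cover the "empty folder" and "very short files" edge cases only in the simplest way.

```python
import pathlib


def load_txt_files(folder: str, min_chars: int = 1) -> list[dict]:
    """Return a list of {filename, text} dicts for every .txt file under folder."""
    root = pathlib.Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {folder}")

    docs = []
    # rglob walks subdirectories; sorting keeps the order stable across runs.
    for path in sorted(root.rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if len(text) < min_chars:
            continue  # skip empty or whitespace-only files (min_chars is a hypothetical knob)
        docs.append({"filename": path.relative_to(root).as_posix(), "text": text})
    return docs
```

An empty result list is a signal for main() to print a message and exit instead of building an empty index.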

Test Dataset

Create a solutions/test_docs/ folder with 3–5 .txt files on any topic
(Wikipedia article excerpts work well).

Evaluation Criteria

  • Does it correctly load and chunk multiple files?
  • Does the answer mention which source file the information came from?
  • Does it gracefully handle a query that no document can answer?
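
The last two criteria depend mostly on how the prompt is assembled. The sketch below shows one way to do it, assuming each retrieved hit is a dict with text and source keys (matching the chunk dicts in the skeleton). The function names and the exact refusal wording are hypothetical.

```python
def build_prompt(query: str, hits: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved hits ({text, source} dicts, best first)."""
    if hits:
        context = "\n\n".join(f"[{h['source']}]\n{h['text']}" for h in hits)
    else:
        context = "(no documents retrieved)"
    return (
        "Answer the question using only the context below. "
        "Cite the source filename in brackets for each fact. "
        "If the context does not contain the answer, reply exactly: "
        '"I could not find this in the indexed documents."\n\n'
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def sources_line(hits: list[dict]) -> str:
    """List each source filename once, in retrieval order."""
    seen = list(dict.fromkeys(h["source"] for h in hits))
    return "Sources: " + (", ".join(seen) if seen else "none")
```

Printing sources_line(hits) after the model's answer covers the second criterion even when the model forgets to cite.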

Exercise 2: Compare Three Chunking Strategies on the Same Document

Goal: See concretely how chunking affects retrieval quality.

Concepts practiced: Fixed-size chunking, sentence chunking, semantic chunking,
retrieval evaluation.

What to Build

Create solutions/ex2_chunking_comparison.py:

  1. Load one long document (at least 2000 words). A Wikipedia article export works well.
  2. Implement three chunking strategies:
    • Fixed-size: 300-character chunks, 30-character overlap
    • Sentence-based: group sentences into chunks of max 5 sentences
    • Semantic: split when cosine similarity between adjacent sentence embeddings
      drops below 0.5
  3. For each strategy, build a separate Chroma collection
  4. Define 5 test queries that have clear, specific answers in the document
  5. For each query and each chunking strategy:
    • Retrieve top-3 chunks
    • Print whether the relevant passage was in the top-3 (manual inspection)
    • Print the retrieved chunk that contains the answer (if any)
  6. Print a summary table:
Query                       | Fixed  | Sentence | Semantic
"Who founded X?"            |  YES   |   YES    |   YES
"What happened in 1995?"    |   NO   |   YES    |   YES
"How does Y work?"          |  YES   |   NO     |   YES
...
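
The three strategies from step 2 can be sketched in plain Python as follows. This is a rough starting point under stated assumptions: the regex sentence splitter is naive (it breaks on abbreviations), and semantic_chunks takes precomputed sentence embeddings as a parameter so that any embedding model can be plugged in. All function names are hypothetical.

```python
import math
import re


def fixed_size_chunks(text: str, size: int = 300, overlap: int = 30) -> list[str]:
    """Slide a window of `size` characters, advancing by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += size - overlap
    return chunks


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, max_sentences: int = 5) -> list[str]:
    """Group consecutive sentences into chunks of at most max_sentences."""
    sents = split_sentences(text)
    return [" ".join(sents[i:i + max_sentences])
            for i in range(0, len(sents), max_sentences)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], embeddings: list[list[float]],
                    threshold: float = 0.5) -> list[str]:
    """Start a new chunk whenever adjacent sentence similarity drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([])
        groups[-1].append(sent)
    return [" ".join(g) for g in groups]
```

Keeping the embedding call outside semantic_chunks also makes the splitting logic testable with hand-written vectors.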

Key Observation to Find

You should observe that fixed-size chunking produces at least one “split” failure —
a relevant passage that spans a chunk boundary and gets partially cut off. Document
this in a comment.
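
A quick way to confirm a split failure is to check whether the answer passage survives intact in any single chunk. The sketch below uses a synthetic text and a hypothetical helper, answer_is_intact, to show the effect. With a 30-character overlap, a span of up to 30 characters always fits inside one chunk, while a longer span can be cut at a boundary.

```python
def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows, advancing by size - overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


def answer_is_intact(chunks: list[str], answer: str) -> bool:
    """True if at least one chunk contains the whole answer passage."""
    return any(answer in chunk for chunk in chunks)


# Synthetic document: the 40-character answer starts at offset 265,
# so it straddles the end of the first 300-character window.
answer = "Guido van Rossum created Python in 1991."
text = "-" * 265 + answer + "-" * 300

split = answer_is_intact(fixed_size_chunks(text, size=300, overlap=30), answer)
saved = answer_is_intact(fixed_size_chunks(text, size=300, overlap=60), answer)
print(split, saved)  # False True
```

Running the same check over your five test queries gives a concrete list of split failures to document.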

Extension Challenge

Add a fourth chunking strategy: hierarchical (parent=800 chars, child=150 chars).
Use child chunks for indexing but return parent chunks to the LLM.
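
A minimal sketch of the parent/child bookkeeping, assuming plain character windows with no overlap. The ID scheme and function names are hypothetical. The point is that each child records its parent's ID, so a hit on a child can be swapped for the larger parent before the text is sent to the LLM.

```python
def hierarchical_chunks(text: str, parent_size: int = 800,
                        child_size: int = 150) -> tuple[dict, list[dict]]:
    """Return (parents, children): parents maps id -> text, children carry a parent_id."""
    parents: dict[str, str] = {}
    children: list[dict] = []
    for p_idx, p_start in enumerate(range(0, len(text), parent_size)):
        parent_id = f"p{p_idx}"
        parent_text = text[p_start:p_start + parent_size]
        parents[parent_id] = parent_text
        for c_idx, c_start in enumerate(range(0, len(parent_text), child_size)):
            children.append({
                "id": f"{parent_id}-c{c_idx}",
                "text": parent_text[c_start:c_start + child_size],
                "parent_id": parent_id,
            })
    return parents, children


def parents_for_hits(hit_children: list[dict], parents: dict) -> list[str]:
    """Map retrieved children to parent texts, deduplicated in rank order."""
    ordered_ids = list(dict.fromkeys(c["parent_id"] for c in hit_children))
    return [parents[pid] for pid in ordered_ids]
```

Only the children are embedded and indexed; storing parent_id as chunk metadata lets you look the parent up after retrieval.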


Exercise 3: Add Hybrid Retrieval to the Basic RAG Example

Goal: Extend examples/basic_rag.py with BM25 + RRF fusion.

Concepts practiced: Sparse retrieval, BM25 scoring, RRF, comparing retrieval methods.

What to Build

Create solutions/ex3_hybrid_upgrade.py starting from the basic RAG code:

  1. Add a BM25Retriever class wrapping rank_bm25.BM25Okapi
  2. Implement reciprocal_rank_fusion(bm25_results, dense_results, k=60)
  3. In the main query loop, run both retrievers and fuse with RRF
  4. Add a --mode CLI flag: python ex3_hybrid_upgrade.py --mode dense|bm25|hybrid
  5. For a fixed set of test queries, print which chunks each mode retrieves
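
Steps 2 and 4 can be sketched as below, assuming each retriever returns a list of chunk IDs ordered best first. RRF scores a chunk as the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1. The alphabetical tie-break and the build_parser helper are assumptions added to keep the output deterministic and the example self-contained.

```python
import argparse
from collections import defaultdict


def reciprocal_rank_fusion(bm25_results: list[str], dense_results: list[str],
                           k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs; returns IDs sorted by fused score."""
    scores: dict[str, float] = defaultdict(float)
    for results in (bm25_results, dense_results):
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hybrid retrieval demo")
    parser.add_argument("--mode", choices=["dense", "bm25", "hybrid"],
                        default="hybrid", help="which retriever(s) to use")
    return parser


fused = reciprocal_rank_fusion(["a", "b", "c"], ["c", "a", "d"])
print(fused)  # ['a', 'c', 'b', 'd']
```

Chunks that appear in both lists ("a" and "c") outrank chunks that appear in only one, even when the single appearance is at a higher rank.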

Specific Test Cases

Design at least two queries that illustrate the difference:

Query A (BM25 wins): A query containing a specific term that appears verbatim
in only one chunk. Dense retrieval may match a semantically related but wrong chunk.
Example: if your corpus contains “BDFL” as an acronym, query for “BDFL” — BM25 will
find it; dense may not.

Query B (Dense wins): A conceptual query where the answer uses different
vocabulary. Example: query for “Python’s design principle around readable code” when
the document says “emphasizes code readability” (no direct keyword overlap).

What to Submit

A markdown comment block at the top of the file explaining which mode won for each
test query and why.


Exercise 4: Implement a Simple Faithfulness Evaluator

Goal: Build a lightweight faithfulness checker without using RAGAS.

Concepts practiced: LLM-as-judge evaluation, faithfulness metric, prompt engineering.

What to Build

Create solutions/ex4_faithfulness_eval.py:

A faithfulness evaluator takes (question, context, answer) and returns a score between
0 and 1: the fraction of the claims in the answer that are supported by the context.

Step 1: Claim Extraction
Use Claude to extract atomic claims from the answer:

def extract_claims(answer: str) -> list[str]:
    """
    Prompt Claude to break the answer into atomic factual claims.
    Example: "Python was created in 1991 by Guido van Rossum."
    → ["Python was created in 1991", "Python was created by Guido van Rossum"]
    """
    ...

Step 2: Claim Verification
For each claim, ask Claude: “Is this claim supported by the context? Answer YES or NO.”

def verify_claim(claim: str, context: str) -> bool:
    ...

Step 3: Score

def faithfulness_score(claims: list[str], context: str) -> float:
    supported = sum(verify_claim(c, context) for c in claims)
    return supported / len(claims) if claims else 0.0
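
To test the scoring logic without calling the API, one option is to pass the judge in as a function. The sketch below is a variation on the code above, not a replacement: the extra judge parameter, parse_verdict, and substring_judge are all hypothetical names. substring_judge is a deliberately crude stand-in for the Claude call.

```python
from typing import Callable


def parse_verdict(reply: str) -> bool:
    """Treat the judge's reply as supported only if it starts with YES."""
    return reply.strip().upper().startswith("YES")


def faithfulness_score(claims: list[str], context: str,
                       judge: Callable[[str, str], str]) -> float:
    """Fraction of claims the judge marks as supported by the context."""
    if not claims:
        return 0.0  # same convention as the exercise: no claims scores 0
    supported = sum(parse_verdict(judge(claim, context)) for claim in claims)
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> str:
    """Offline stand-in for the LLM: YES only if the claim appears verbatim."""
    return "YES" if claim.lower() in context.lower() else "NO"


context = "Python was released in 1991. It was created by Guido van Rossum."
claims = ["Python was released in 1991", "created by Tim Peters"]
print(faithfulness_score(claims, context, substring_judge))  # 0.5
```

Swapping substring_judge for a function that prompts Claude leaves the scoring code unchanged.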

Step 4: Test It
Define 3 test cases:

  • One perfectly faithful answer (all claims in context)
  • One partially faithful answer (some claims from context, some hallucinated)
  • One hallucinated answer (claims not present in context)

Expected output:

Test 1 — Perfect faithfulness:
  Claims: ["Python was released in 1991", "Created by Guido van Rossum"]
  Supported: 2/2
  Score: 1.00

Test 2 — Partial faithfulness:
  Claims: ["Python was released in 1991", "Created by Tim Peters"]
  Supported: 1/2
  Score: 0.50

Test 3 — Hallucination:
  Claims: ["Python was released in 1975", "Created by Dennis Ritchie"]
  Supported: 0/2
  Score: 0.00

Extension Challenge

Run your faithfulness evaluator on the output of examples/basic_rag.py for 10
different queries. Report the average faithfulness score. Identify any queries where
Claude hallucinated despite correct context being retrieved.


Exercise 5: Mock System Design Interview

This is a mock system design interview. No code is required; write a design document instead.

Format: Write your answer in solutions/ex5_design.md. Aim for 600–1000 words.
Treat it as if you are speaking in a 45-minute interview.

The Prompt

“A law firm has 50 years of case files, contracts, briefs, and legal opinions —
approximately 2 million documents, mostly scanned PDFs. They want to build an
internal assistant that associates can query to find relevant precedents, contract
clauses, and case history. Walk me through every decision you would make.”

Questions to Address

1. Document Ingestion

  • How do you handle scanned PDFs? What OCR tooling would you use?
  • How do you handle document quality variation (old scans, handwritten notes)?
  • How do you extract metadata (case number, date, parties, jurisdiction)?

2. Chunking Strategy

  • What chunking strategy would you choose for legal documents?
  • Legal documents have natural structure: clauses, sections, paragraphs. How do you
    exploit that structure?
  • How would you handle very long briefs (200+ pages)?

3. Embedding Model

  • Would you use a general-purpose embedding model or a legal-domain-specific one?
  • What are the trade-offs of a model fine-tuned on legal text (e.g., LegalBERT)?
  • Would you use a hosted API or self-hosted model? Why?

4. Vector Database

  • Which vector database would you choose for 2M documents?
  • How would you handle metadata filtering? (e.g., “only search contracts from 2018–2023
    in California jurisdiction”)
  • How would you handle access control? (not all associates should see all documents)
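
To make the filtering and access-control questions concrete, here is a small in-memory sketch. It assumes each chunk carries metadata fields such as doc_type, year, jurisdiction, and client_id (all hypothetical names) and that the filter runs before similarity search, so restricted chunks never reach the ranking step or the LLM. A production system would push the same predicate into the vector database's own filter mechanism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    doc_type: str       # e.g. "contract", "brief"
    year: int
    jurisdiction: str   # e.g. "CA"
    client_id: str      # used for access control


def allowed_chunks(chunks: list[ChunkMeta], user_clients: set[str],
                   doc_type: Optional[str] = None,
                   years: Optional[tuple[int, int]] = None,
                   jurisdiction: Optional[str] = None) -> list[ChunkMeta]:
    """Apply access control first, then the optional metadata filters."""
    result = []
    for c in chunks:
        if c.client_id not in user_clients:
            continue  # the user is not cleared for this client's documents
        if doc_type is not None and c.doc_type != doc_type:
            continue
        if years is not None and not (years[0] <= c.year <= years[1]):
            continue
        if jurisdiction is not None and c.jurisdiction != jurisdiction:
            continue
        result.append(c)
    return result
```

For the example query in the bullet above, the call would be allowed_chunks(chunks, user_clients, doc_type="contract", years=(2018, 2023), jurisdiction="CA").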

5. Retrieval Strategy

  • Would you use hybrid retrieval? Why?
  • Legal search heavily relies on exact citations (“42 U.S.C. § 1983”). How does this
    affect your choice?
  • Would you implement reranking? What model?
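
The citation question is easier to reason about with a tokenizer in hand. The sketch below compares a naive word tokenizer with one that keeps a U.S. Code citation together as a single token. The regex covers only the "42 U.S.C. § 1983" pattern and is an illustrative assumption; a real system would need patterns for case reporters, state codes, and other citation styles.

```python
import re

# Matches citations shaped like "42 U.S.C. § 1983" (illustrative pattern only).
CITATION = re.compile(r"\d+\s+U\.S\.C\.\s+§+\s*\d+[a-z]?", re.IGNORECASE)


def naive_tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation and symbols are dropped."""
    return re.findall(r"\w+", text.lower())


def citation_aware_tokens(text: str) -> list[str]:
    """Like naive_tokens, but each matched citation stays one token."""
    tokens, pos = [], 0
    for match in CITATION.finditer(text):
        tokens.extend(naive_tokens(text[pos:match.start()]))
        tokens.append(re.sub(r"\s+", " ", match.group().lower()))
        pos = match.end()
    tokens.extend(naive_tokens(text[pos:]))
    return tokens


sentence = "Claims under 42 U.S.C. § 1983 require state action."
print(naive_tokens(sentence))
# ['claims', 'under', '42', 'u', 's', 'c', '1983', 'require', 'state', 'action']
print(citation_aware_tokens(sentence))
# ['claims', 'under', '42 u.s.c. § 1983', 'require', 'state', 'action']
```

With the naive tokenizer, a BM25 query for the citation matches any chunk containing "42", "u", "s", "c", or "1983" separately; keeping the citation as one token makes exact-citation matches rank first.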

6. Failure Modes

  • A user asks: “Find all cases where a non-compete clause was ruled unenforceable in
    New York.” This is a multi-hop query. How does your system handle it?
  • How do you handle the case where the relevant precedent exists but uses different
    terminology than the query?

7. Evaluation

  • How would you evaluate this system with domain experts?
  • What metrics would you track in production?

8. Compliance and Privacy

  • The documents contain privileged attorney-client communications. What safeguards
    do you need?
  • How do you ensure the system does not “leak” one client’s documents when answering
    questions about another client?

Evaluation Rubric (Self-Score)

Area                     | Strong Answer
OCR + ingestion          | Mentions PDF parsing, OCR quality, metadata extraction strategy
Chunking                 | Exploits legal document structure (sections, clauses)
Embeddings               | Discusses legal-domain models vs general-purpose trade-offs
Vector DB                | Names a specific DB and justifies with scale + filtering needs
Retrieval                | Hybrid + reranking + citation-exact matching (BM25 critical here)
Multi-hop                | Agentic retrieval or iterative query decomposition
Evaluation               | Mentions domain expert golden set + production metrics
Privacy + access control | Row-level security, metadata-based filtering, no cross-client leak

A strong answer addresses all 8 areas, explains the reasoning behind each choice
(not just what, but why), and acknowledges trade-offs honestly.