02 — Retrieval-Augmented Generation (RAG)

A comprehensive deep-dive into RAG: from first principles to production system design.
This module is structured for engineers who want interview-ready expertise and the
ability to design and debug RAG systems at scale.

1. What is RAG and Why
2. Chunking Strategies
3. Embedding Models
4. Vector Databases — Trade-off Matrix
5. Retrieval Strategies
6. Advanced RAG Patterns
7. RAG Evaluation (RAGAS)
8. When RAG Fails (and What To Do)
9. Interview Flashcards

1. What is RAG and Why

The Core Problem

Large Language Models are trained on a static snapshot of the world. Once training ends,
their knowledge is frozen. This creates several hard problems for production systems:

Knowledge Cutoff
A model trained through early 2024 knows nothing about events, documents, or data
produced after that date. For enterprise use cases — legal filings, product docs,
internal wikis, real-time pricing — this is a showstopper.

Hallucination on Specific Facts
LLMs are probabilistic next-token predictors. When asked about specific facts not
well-represented in training data (internal docs, niche domains, recent events),
the model generates plausible-sounding but incorrect answers. This is not a bug
you can patch — it is a property of the architecture.

The Cost and Inflexibility of Fine-Tuning
Fine-tuning on private data costs time, money, and compute. More importantly, it
does not solve the problem elegantly:

Fine-tuned knowledge still decays and becomes stale
Fine-tuning teaches the model style and behavior, not reliable fact recall
You cannot easily audit “why did the model say X” from fine-tuned weights
Re-training is expensive every time documents change

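RAG addresses all three of these problems by externalizing knowledge into a searchable
index and injecting relevant facts into the prompt at query time. It reduces
hallucination rather than eliminating it: the model can still misread or ignore the
retrieved context.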

The RAG Pipeline

At a high level, RAG has two phases:

Indexing Phase (offline): Documents are processed, chunked, embedded into vectors,
and stored in a vector database. This is done once (or on a schedule).

Query Phase (online): A user query comes in, gets embedded, similar chunks are
retrieved from the vector DB, and those chunks are stuffed into the LLM prompt as
grounding context before the model generates its answer.

INDEXING PHASE (offline)
========================

  Raw Documents
  (PDFs, HTML, TXT, etc.)
        |
        v
  +------------+
  |  Document  |
  |   Loader   |
  +------------+
        |
        v
  +------------+
  |  Chunking  |  <-- split into ~512 token pieces
  |  Strategy  |
  +------------+
        |
        v
  +------------------+
  |  Embedding Model |  <-- e.g. text-embedding-3-small
  +------------------+
        |
        v  (dense vectors)
  +------------------+
  |  Vector Database |  <-- Chroma, Pinecone, Qdrant, pgvector
  |  (index stored)  |
  +------------------+


QUERY PHASE (online)
====================

  User Query
  "What are the refund policies?"
        |
        v
  +------------------+
  |  Embedding Model |  <-- same model as indexing
  +------------------+
        |
        v  (query vector)
  +------------------+       +------------------+
  |  Vector Database |------>|  Top-K Chunks    |
  |  ANN search      |       |  (e.g. top 3-5)  |
  +------------------+       +------------------+
                                      |
                                      v
                         +------------------------+
                         |  Prompt Construction   |
                         |                        |
                         |  System: You are a ... |
                         |  Context:              |
                         |    Chunk 1: ...        |
                         |    Chunk 2: ...        |
                         |    Chunk 3: ...        |
                         |  Question: {user query}|
                         +------------------------+
                                      |
                                      v
                              +---------------+
                              |  LLM (Claude, |
                              |  GPT-4, etc.) |
                              +---------------+
                                      |
                                      v
                              Final Answer to User
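
The prompt construction step in the diagram can be sketched as a small helper. This is a minimal illustration, not a prescribed template: the function name build_prompt and the exact instruction wording are assumptions made for the example.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt: instructions, numbered context chunks, question."""
    # Number the chunks so the model (and a human auditor) can refer back to them.
    context = "\n".join(
        f"Chunk {i}: {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


prompt = build_prompt(
    "What are the refund policies?",
    ["Refunds are accepted within 30 days.", "Gift cards are non-refundable."],
)
```

The returned string would be sent as the user message (or split into system and user parts) in whatever LLM client you use.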

RAG vs Fine-Tuning — When to Use Which

Criteria	RAG	Fine-Tuning
Knowledge is dynamic/updated	RAG (re-index, not retrain)	Poor fit
Need to cite sources	RAG (chunks are traceable)	Not possible
Teaching a new reasoning style	Poor fit	Fine-tuning
Domain-specific vocabulary	Partial (retrieval helps)	Fine-tuning adapts the model to domain terms (the tokenizer itself is unchanged)
Cost	Pay per query (embeddings)	Large upfront training cost
Auditability	Excellent (see the chunks)	Black box
Latency sensitivity	Adds retrieval latency	No retrieval step

The most common production pattern is both: fine-tune for style/format/reasoning,
use RAG for facts/knowledge.

2. Chunking Strategies

Chunking is the most underestimated variable in a RAG pipeline. Poor chunking
guarantees poor retrieval, regardless of how good your embedding model is.

The goal: each chunk should be semantically self-contained — a reader should
understand it without reading surrounding text.

2.1 Fixed-Size Chunking

Split text into chunks of N characters or N tokens, with an optional overlap.

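def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Sizes are in characters here; swap in a tokenizer to count tokens instead.
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance `start` (infinite loop)
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, re-reading the last `overlap` chars
    return chunks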

Pros:

Dead simple to implement
Predictable chunk sizes = predictable embedding costs
Uniform index entry sizes

Cons:

Completely ignores document structure
Will split sentences mid-thought, breaking semantic coherence
A chunk about “the company was founded in 1999” might be cut to “the company was
founded in 19” — useless after embedding

When to use: Quick prototypes, when you need a baseline, or when documents are
structurally homogeneous (e.g., database rows).

2.2 Sentence-Based Chunking

Respect sentence boundaries. Use an NLP library (spaCy, NLTK) or regex to detect
sentence endings.

import re
 
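def sentence_chunk(text: str, max_sentences: int = 5, overlap: int = 1) -> list[str]:
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    # Naive sentence splitter; use spaCy for production
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    # Step by (max_sentences - overlap) so consecutive chunks share `overlap` sentences
    for i in range(0, len(sentences), max_sentences - overlap):
        chunk = " ".join(sentences[i : i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks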

Pros:

Preserves natural language units
Each chunk is readable by humans and models alike
Better semantic coherence than character-splitting

Cons:

Sentence length varies wildly — some chunks may be too short, others too long
Technical documents with long sentences can still produce huge chunks
Requires an accurate sentence tokenizer for non-English text

2.3 Semantic Chunking (Split on Meaning Shift)

Split when the meaning changes, not on arbitrary size boundaries. The technique:
embed each sentence, compute cosine similarity between adjacent sentences, and split
at points where similarity drops sharply (a “semantic discontinuity”).

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
def semantic_chunk(text: str, threshold: float = 0.4) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = text.split(". ")
    embeddings = model.encode(sentences)
 
    chunks = []
    current_chunk = [sentences[0]]
 
    for i in range(1, len(sentences)):
        sim = cosine_similarity(
            embeddings[i - 1].reshape(1, -1),
            embeddings[i].reshape(1, -1)
        )[0][0]
 
        if sim < threshold:
            # Meaning has shifted — start a new chunk
            chunks.append(". ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
 
    if current_chunk:
        chunks.append(". ".join(current_chunk))
 
    return chunks

Pros:

Chunks align with actual topic changes
Much better retrieval precision — chunks are topically coherent
Works well across different document types

Cons:

Computationally expensive (embed every sentence during indexing)
The threshold parameter requires tuning per domain
Can produce very short or very long chunks if similarity is extreme

When to use: Long-form documents (research papers, reports, books) where topics
shift gradually.

2.4 Hierarchical / Parent-Document Chunking

Store two levels: parent chunks (large, ~1000 tokens, rich context) and
child chunks (small, ~128 tokens, high precision for retrieval).

During retrieval: search using child chunk embeddings (precise), but return the
parent chunk to the LLM (full context).

Document
    |
    +-- Parent Chunk 1 (~1000 tokens)
    |       |
    |       +-- Child Chunk 1a (~128 tokens)  <-- indexed for search
    |       +-- Child Chunk 1b (~128 tokens)  <-- indexed for search
    |       +-- Child Chunk 1c (~128 tokens)  <-- indexed for search
    |
    +-- Parent Chunk 2 (~1000 tokens)
            |
            +-- Child Chunk 2a (~128 tokens)
            +-- Child Chunk 2b (~128 tokens)

def hierarchical_chunk(text: str, parent_size: int = 1000, child_size: int = 128):
    parent_chunks = fixed_size_chunk(text, chunk_size=parent_size, overlap=0)
    index = []  # what gets embedded and searched
    store = {}  # what gets returned to LLM
 
    for i, parent in enumerate(parent_chunks):
        store[f"parent_{i}"] = parent
        children = fixed_size_chunk(parent, chunk_size=child_size, overlap=20)
        for j, child in enumerate(children):
            index.append({
                "id": f"parent_{i}_child_{j}",
                "parent_id": f"parent_{i}",
                "text": child,  # embed this
            })
 
    return index, store

Pros:

Best of both worlds: precise retrieval + rich LLM context
Significantly reduces the “context fragmentation” problem
Works especially well for structured documents

Cons:

More complex to implement and maintain
Requires a two-tier storage strategy
More index entries = higher embedding cost

2.5 Chunk Overlap — Why and How Much

Overlap means the last N tokens of chunk i are repeated as the first N tokens of
chunk i+1. This prevents losing context at chunk boundaries.

WITHOUT overlap:
  Chunk 1: "...The Python programming language was"
  Chunk 2: "created by Guido van Rossum in 1991..."

  A query for "who created Python" may not match Chunk 2 alone.

WITH overlap (shortened to a few words for display):
  Chunk 1: "...The Python programming language was created by Guido"
  Chunk 2: "language was created by Guido van Rossum in 1991..."

  Now either chunk can answer the question.

Recommended overlap: 10–15% of chunk size.

For 512-token chunks: 50–75 tokens of overlap
For 256-token chunks: 25–40 tokens of overlap

Beware: too much overlap (>25%) wastes index space and can cause near-duplicate
chunks to rank equally, diluting retrieval diversity.
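
To make the overlap trade-off concrete, the sketch below computes an overlap size from a percentage and estimates how many chunks a document produces. The helper names overlap_tokens and chunk_count are hypothetical, and the count assumes a simple sliding window with a fixed stride.

```python
import math


def overlap_tokens(chunk_size: int, fraction: float = 0.10) -> int:
    """Overlap size for a given chunk size, e.g. 10% of 512 tokens."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return round(chunk_size * fraction)


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # how far each new chunk advances
    return math.ceil((total_tokens - overlap) / stride)


# A 10,000-token document with 512-token chunks:
no_overlap = chunk_count(10_000, 512, overlap=0)
ten_percent = chunk_count(10_000, 512, overlap=overlap_tokens(512, 0.10))
quarter = chunk_count(10_000, 512, overlap=overlap_tokens(512, 0.25))
```

Under these assumptions the chunk count grows from 20 (no overlap) to 22 (10%) to 26 (25%), which is the index-space cost the warning above refers to.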

2.6 Chunk Size Tuning

This is empirical — there is no universally correct answer.

Chunk Size	Precision	Context	Cost	Best For
64–128 tokens	Very high	Low	High	Fact lookup, FAQ, Q&A systems
256–512 tokens	High	Medium	Med	General RAG (most common)
512–1024 tokens	Medium	High	Low	Summarization, long-form answer
1000+ tokens	Low	Very high	Low	Context-heavy reasoning tasks

Rule of thumb: Start with 512 tokens and 10% overlap. Measure retrieval quality
with RAGAS (see Section 7). Tune based on context recall and context precision scores.

Key insight: embedding models have a context window too. Limits vary widely:
all-MiniLM-L6-v2 caps at 256 tokens, many BERT-based models at 512, and newer models
such as bge-m3 at 8192. If your chunks exceed the model’s max token limit, the end of
the chunk is typically truncated before embedding, so you lose information silently.
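
A cheap guard against silent truncation is to flag oversized chunks before embedding. The sketch below is an illustration only: find_oversized_chunks and approx_token_count are hypothetical helpers, and the whitespace word count is a rough stand-in for the embedding model's real tokenizer, which usually yields more tokens than words.

```python
from typing import Callable


def approx_token_count(text: str) -> int:
    # Assumption: whitespace-separated words as a crude proxy for tokens.
    # Pass the model's own tokenizer for an accurate count.
    return len(text.split())


def find_oversized_chunks(
    chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> list[int]:
    """Return the indices of chunks that exceed the embedding model's limit."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


chunks = ["short chunk", "word " * 300]
too_long = find_oversized_chunks(chunks, max_tokens=256)
```

Running this during indexing turns a silent quality loss into an explicit list of chunks to re-split.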

3. Embedding Models

An embedding model converts text into a dense numeric vector. Two texts with similar
meaning produce vectors that are close in high-dimensional space (measured by cosine
similarity or dot product). This is the foundation of semantic search.

"What is Python?"  -->  [0.12, -0.34, 0.89, 0.02, ...]  (768 dimensions)
"Python programming language"  -->  [0.14, -0.31, 0.91, 0.03, ...]
cosine_similarity = 0.97  (very close)

"What is Python?"  -->  [0.12, -0.34, 0.89, 0.02, ...]
"I enjoy eating pizza"  -->  [-0.44, 0.12, -0.23, 0.67, ...]
cosine_similarity = 0.12  (very far)
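
For reference, cosine similarity itself is only a few lines. The sketch below uses plain Python lists and tiny 2-dimensional vectors for readability; real embeddings have hundreds of dimensions and are normally handled with NumPy or the vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
```

Because the score depends only on direction, vector length does not matter. If embeddings are normalized to unit length, the dot product gives the same ranking.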

3.1 OpenAI Embedding Models

text-embedding-3-small

Dimensions: 1536 (can be truncated to 512 or 256 with dimensions param)
Cost: $0.02 per 1M tokens
Performance: Excellent for English, solid multilingual
Latency: ~50–100ms per API call
Best for: Production systems where cost matters; most use cases

text-embedding-3-large

Dimensions: 3072 (can be truncated)
Cost: $0.13 per 1M tokens (6.5x more expensive)
Performance: about 2 points higher MTEB average than text-embedding-3-small (64.6 vs 62.3 in OpenAI’s published results)
Best for: When retrieval quality is critical and you can afford the cost

Legacy: text-embedding-ada-002

Dimensions: 1536
Outperformed by text-embedding-3-small on most benchmarks, at roughly one fifth of the price
Avoid for new projects

3.2 Open-Source Embedding Models

BAAI/bge-m3

Architecture: built on XLM-RoBERTa (large variant)
Dimensions: 1024
Strengths: Multilingual (100+ languages), supports dense + sparse + multi-vector
Context window: 8192 tokens (excellent for long documents)
Best for: Non-English documents, hybrid retrieval, privacy-sensitive use cases

nomic-embed-text

Dimensions: 768
Context window: 8192 tokens
Fully open-source (Apache 2.0), can run locally
Competitive with OpenAI ada-002 on MTEB benchmarks
Best for: Local deployment, open-source stacks

all-MiniLM-L6-v2

Dimensions: 384
Context window: 256 tokens
Extremely fast (runs on CPU)
Lower quality but fine for prototypes
Best for: Local development, resource-constrained environments

3.3 Dimension Trade-offs

More dimensions = more expressive representations, but:

Dimensions	Storage per 1M chunks (float32)	Search speed	Quality
384	~1.5 GB	Fastest	Good
768	~3.0 GB	Fast	Better
1536	~6.1 GB	Medium	Great
3072	~12.3 GB	Slower	Best

At 10M documents, the storage difference between 384 and 3072 dimensions is ~107 GB.
This matters for production planning.
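
The storage figures follow from simple arithmetic: vectors × dimensions × bytes per value. The sketch below reproduces them. It counts raw float32 vectors only and ignores index overhead (HNSW graph links, metadata), which varies by database. The helper name index_size_gb is hypothetical.

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes); float32 uses 4 bytes per value."""
    return num_vectors * dimensions * bytes_per_value / 1e9


for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(1_000_000, dims):.2f} GB per 1M chunks")

# Gap between the smallest and largest option at 10M chunks
gap_10m = index_size_gb(10_000_000, 3072) - index_size_gb(10_000_000, 384)
```

Halving bytes_per_value (float16) or using quantization shrinks these numbers proportionally, which is one reason many vector databases offer compressed storage modes.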

3.4 Asymmetric Embeddings: Query vs Document

Some embedding models are trained with asymmetric pairs: queries and documents are
treated differently because their linguistic structure differs.

A query: "What year was Python created?"
A document: "Python was created by Guido van Rossum and first released in 1991."

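For asymmetric models (such as the E5 family, nomic-embed-text, or Cohere embed), you
pass a different prefix or input type to the encoder depending on whether you are
indexing documents or searching. bge-m3, by contrast, needs no query instruction.

# E5 asymmetric usage: the "query: " and "passage: " prefixes are part of its training format
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/e5-base-v2")
documents = ["Python was created by Guido van Rossum and first released in 1991."]
 
# At indexing time
doc_embeddings = model.encode([f"passage: {d}" for d in documents], normalize_embeddings=True)
 
# At query time
query_embedding = model.encode(["query: What year was Python created?"], normalize_embeddings=True)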

Symmetric models (MiniLM) treat queries and documents identically. Asymmetric models
generally outperform symmetric ones for question-answering workloads.

3.5 Local vs API Embeddings

Factor	Local Model (e.g. bge-m3)	API (e.g. OpenAI)
Cost	Compute only (one-time)	Per-token pricing
Latency	No network hop	50–150ms per call
Privacy	Data never leaves your infra	Data sent to vendor
Quality	Competitive (bge-m3 = strong)	Very high (esp. large model)
Setup complexity	High (GPU/CPU tuning)	Trivial (API key)
Throughput	Limited by hardware	Rate limits apply
Best for	Enterprise, regulated, volume	Startups, prototyping, SaaS

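For large corpora that are re-embedded often, local embedding can pay for itself
within months compared to API costs. The break-even point depends on token volume,
re-indexing frequency, and hardware cost.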

4. Vector Databases — Trade-off Matrix

Database	Hosted / Self-hosted	Scale	Metadata Filtering	Hybrid Search	Cost Model	Best For
Pinecone	Hosted only	Very large	Yes (robust)	Yes	Pay per pod/query	Production SaaS, fastest start
Weaviate	Both	Large	Yes (GraphQL)	Yes (BM25)	Open-source + cloud	Complex queries, knowledge graphs
Chroma	Self-hosted	Small–medium	Yes (basic)	No (dense only)	Free / open-source	Local dev, prototypes, testing
Qdrant	Both	Large	Yes (rich)	Yes	Open-source + cloud	High-perf self-hosted production
pgvector	Self-hosted	Medium	Yes (full SQL)	Partial	PostgreSQL costs	Existing Postgres stack
FAISS	Self-hosted (library)	Very large	No (manual)	No	Free (Meta)	Research, batch processing

Notes:

Pinecone: Easiest to get to production. Serverless tier is generous. No
operational burden. The downside: vendor lock-in and data leaves your infra.
Qdrant: Strong performance benchmarks, excellent filtering API, Docker-based
self-hosting is trivial. Best choice for self-hosted production.
Chroma: Do not use in production at scale. It shines for local development —
zero config, pure Python, runs in-memory or on disk.
pgvector: If your team already manages PostgreSQL, pgvector lets you keep
everything in one system. ACID transactions, joins, full SQL filtering. Trade-off:
ANN performance is weaker than dedicated vector DBs at very large scale.
FAISS: Meta’s library, not a database. No persistence layer, no filtering, no
server. Use when you are building a custom retrieval system with full control.
Weaviate: Unique because it stores both the vector and the object, supports
cross-references between objects (near graph DB). Best for knowledge-graph-style
RAG where documents have relationships.

5. Retrieval Strategies

5.1 Dense Retrieval (Semantic Similarity)

The baseline RAG retrieval method. Embed the query, find the K nearest vectors in the
index using approximate nearest neighbor (ANN) search (cosine similarity or dot product).

import chromadb
 
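client = chromadb.Client()  # in-memory; an index saved on disk needs chromadb.PersistentClient
collection = client.get_or_create_collection("docs")  # get_collection raises if "docs" is missing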
 
query_results = collection.query(
    query_texts=["What year was Python released?"],
    n_results=5,
)

How ANN works: Exact nearest neighbor search scales as O(n * d) — unusable at
scale. ANN algorithms (HNSW, IVF, ScaNN) trade a small accuracy loss for ~100x speed
gains. HNSW (Hierarchical Navigable Small World) is the most common: it builds a
layered graph where higher layers are sparser and lower layers are dense. Search
traverses top to bottom.
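
For comparison, the exact search that ANN replaces is easy to write down. The sketch below scores the query against every stored vector, which is where the O(n * d) cost comes from. The function names cosine and exact_top_k are hypothetical, and pure Python is used for clarity rather than speed.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(
    query: list[float], vectors: list[list[float]], k: int
) -> list[tuple[int, float]]:
    """Brute-force nearest neighbors: n similarity computations of d multiplications each."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return scored[:k]


index = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = exact_top_k([1.0, 0.0], index, k=2)
```

Exact search is still useful as ground truth: comparing an ANN index's results against exact_top_k on a sample of queries gives a direct estimate of the recall lost to approximation.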

Strengths: Captures semantic meaning, synonyms, paraphrases
Weaknesses: Poor at exact keyword matches, proper nouns, model numbers, IDs

5.2 Sparse Retrieval (BM25)

BM25 is a probabilistic keyword-based ranking function. It scores a document based on
how often query terms appear in it (term frequency), normalized by document length and
weighted by how rare each term is across the corpus (inverse document frequency, IDF).

BM25(q, d) = SUM over query terms t:
    IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d| / avgdl))

where:
  IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
  TF(t,d) = frequency of term t in document d
  k1 = 1.5  (term saturation, higher = slower saturation)
  b = 0.75  (length normalization, 1 = full, 0 = none)

Strengths: Exact keyword matches, proper nouns, serial numbers, dates
Weaknesses: No semantic understanding; “car” and “automobile” are completely different
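
The formula above can be sketched directly in Python. This is a minimal illustration that assumes lowercase whitespace tokenization. The function name bm25_scores is hypothetical, and production systems typically rely on a search engine or a dedicated BM25 library instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query using the BM25 formula above."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n or 1.0
    # df(t): number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = ["the car is red", "the automobile is fast", "bananas are yellow"]
scores = bm25_scores("car", docs)
```

Only the first document gets a non-zero score for the query "car". The "automobile" document scores zero, which is the semantic blind spot listed under Weaknesses.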

5.3 Hybrid Retrieval: RRF (Reciprocal Rank Fusion)

Combine sparse and dense results using Reciprocal Rank Fusion. Each result is scored
by its rank position across both retrieval systems.

def reciprocal_rank_fusion(
    dense_results: list[str],
    sparse_results: list[str],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    RRF score = SUM over each ranker: 1 / (k + rank)
    k=60 is the empirically recommended constant.
    """
    scores: dict[str, float] = {}
 
    for rank, doc_id in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
 
    for rank, doc_id in enumerate(sparse_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
 
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Why k=60? It dampens the effect of extreme rank positions. Rank 1 gives 1/61,
rank 2 gives 1/62 — the difference is small. This makes the fusion robust to one
ranker being very confident while the other is unsure.

When does hybrid beat pure dense?

Technical documentation (product codes, model numbers)
Legal text (exact clause references)
Medical literature (drug names, ICD codes)
Any domain with important proper nouns

Published comparisons on BEIR generally find that hybrid retrieval (BM25 + dense + RRF)
matches or beats pure dense retrieval on nDCG@10 for most datasets, though the size of
the gain varies widely by dataset and embedding model.

5.4 Reranking

After retrieving top-K candidates (e.g., 20 chunks), a reranker scores each
chunk-query pair and re-orders them. Return only the top-N (e.g., 5) to the LLM.

Query ──> Initial Retrieval ──> 20 candidates ──> Reranker ──> Top 5 ──> LLM
                (fast, ANN)                    (slower, precise)

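Why rerank? First-stage retrieval uses a bi-encoder: the query and each document are
embedded independently, so the score cannot model fine-grained interactions between
them, and ANN search adds a further approximation on top. A reranker uses a
cross-encoder (processes query + document together), which is much more accurate but
too slow to scan the full corpus. The two-stage approach gets the best of both.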

Cohere Rerank API:

import cohere
 
co = cohere.Client("YOUR_API_KEY")
 
results = co.rerank(
    query="What are Python's key features?",
    documents=[chunk.text for chunk in retrieved_chunks],
    top_n=5,
    model="rerank-english-v3.0",
)
 
reranked_chunks = [retrieved_chunks[r.index] for r in results.results]

ColBERT: An alternative reranking approach using late interaction. Instead of
computing a single similarity score between query and document embeddings, ColBERT
computes token-level interactions (MaxSim operator). Available via Vespa, RAGatouille.
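
The MaxSim operator is simple to express: for each query token, take its best match among the document's tokens, then sum those maxima. The sketch below assumes token embeddings are already computed and normalized (so a dot product acts as cosine similarity). The function name maxsim_score and the toy 2-dimensional vectors are illustrative only.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late interaction: sum over query tokens of the max similarity to any doc token."""
    if not doc_vecs:
        raise ValueError("document must contain at least one token embedding")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query_tokens, doc_tokens)  # 1.0 + 0.5 = 1.5
```

Because document token embeddings can be computed offline, only the query needs encoding at search time. That is what makes late interaction cheaper than a full cross-encoder.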

When is reranking worth the latency (~100–200ms extra)?

When retrieval precision matters more than speed
When your initial retrieval pool is noisy
Production Q&A, customer support, legal search

5.5 Multi-Query Retrieval

Generate N rephrased versions of the original query, retrieve for each, and union
the result sets (deduplicating by chunk ID).

def generate_query_variants(query: str, llm_client, n: int = 3) -> list[str]:
    prompt = f"""Generate {n} different ways to ask the following question.
Each variant should approach the question from a different angle.
Output one variant per line, no numbering.
 
Question: {query}"""
 
    response = llm_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
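    # Drop blank lines and stray whitespace the model may emit between variants
    variants = [v.strip() for v in response.content[0].text.splitlines() if v.strip()]
    return [query] + variants[:n]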

Why this works: The original query may not share vocabulary with the way relevant
chunks are written. Multiple variants increase the chance of at least one variant
matching the document’s phrasing.

Cost: 1 extra LLM call to generate variants + N embedding calls. Worth it for
complex queries. Not worth it for simple, well-specified queries.
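
The retrieve-and-union step can be sketched as follows. The retriever is passed in as a callable so the example stays independent of any particular vector database. Its signature (query and result count in, list of (chunk_id, text) pairs out) is an assumption made for this illustration.

```python
from typing import Callable

Hit = tuple[str, str]  # (chunk_id, chunk_text)


def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str, int], list[Hit]],
    n_results: int = 5,
) -> list[Hit]:
    """Run every query variant and union the hits, deduplicating by chunk ID."""
    seen: set[str] = set()
    merged: list[Hit] = []
    for query in queries:
        for chunk_id, text in retrieve(query, n_results):
            if chunk_id not in seen:  # keep only the first occurrence of each chunk
                seen.add(chunk_id)
                merged.append((chunk_id, text))
    return merged
```

The merged list can grow up to len(queries) * n_results entries, so it is usually passed through a reranker or truncated before prompt construction.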

5.6 HyDE — Hypothetical Document Embeddings

Instead of embedding the raw query, ask the LLM to generate a hypothetical document
that would answer the query, then embed that hypothetical document for retrieval.

def hyde_retrieve(query: str, llm_client, collection) -> list[str]:
    # Step 1: Generate a hypothetical answer
    hyde_prompt = f"""Write a short paragraph that would be a perfect answer to the
following question. Do not indicate uncertainty — write as if you know the answer.
 
Question: {query}"""
 
    hypothetical_doc = llm_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": hyde_prompt}]
    ).content[0].text
 
    # Step 2: Embed the hypothetical document, not the query
    results = collection.query(
        query_texts=[hypothetical_doc],
        n_results=5,
    )
    return results["documents"][0]  # chunk texts for the single query

Why HyDE works: Queries and documents often have very different linguistic
structure (“What year was Python created?” vs “Python was first released in 1991…”).
The hypothetical document shares the register, vocabulary, and format of actual
corpus documents, leading to better ANN matches.

When to use HyDE:

When queries are very short and vague
When there is a large vocabulary mismatch between queries and documents
Technical or academic domains where documents use formal language

Caution: If the LLM generates a hallucinated hypothetical that is factually wrong,
you might retrieve completely irrelevant chunks. HyDE works best when the LLM has
strong prior knowledge of the domain structure.

5.7 Parent-Document Retrieval

Index small child chunks for precise retrieval, but return their parent chunk (larger
context) to the LLM. See Section 2.4 for the chunking setup.

def parent_doc_retrieve(query: str, child_index, parent_store, n_children: int = 5):
    # Search using small child embeddings
    child_results = child_index.query(query_texts=[query], n_results=n_children)
 
    # Look up parent IDs, best-ranked first, dropping duplicates (a set would lose rank order)
    parent_ids = list(dict.fromkeys(
        metadata["parent_id"] for metadata in child_results["metadatas"][0]
    ))
 
    # Return full parent chunks in rank order
    return [parent_store[pid] for pid in parent_ids]

6. Advanced RAG Patterns

6.1 Query Rewriting / Expansion

Before retrieval, preprocess the query to improve match quality.

Query Expansion: Add synonyms or related terms to the query.

Original:  "car insurance claim process"
Expanded:  "car automobile vehicle insurance claim process procedure steps"

Query Rewriting: Rephrase for clarity. Useful when the user query is ambiguous,
uses pronouns (“What did he say about it?”), or references prior conversation turns.

rewrite_prompt = """Given this conversation history and follow-up question,
rewrite the follow-up as a standalone question.
 
History: {history}
Follow-up: {question}
Standalone question:"""

This is critical for conversational RAG where the user asks follow-up questions
that only make sense in context of the prior exchange.

6.2 Step-Back Prompting

For abstract or complex queries, first ask a more general “step-back” question, retrieve
for that, then combine both sets of results.

User query:  "Why did Microsoft acquire Activision Blizzard?"
Step-back:   "What are Microsoft's strategic interests in the gaming market?"

The step-back retrieves higher-level context that makes the specific answer
interpretable. Implementation: two retrieval passes + merge + send to LLM.

6.3 Contextual Compression

After retrieval, the retrieved chunks may contain a lot of irrelevant text. Send each
chunk to a lightweight LLM call to extract only the sentences relevant to the query.

Original Chunk (512 tokens):
  "Python was created by Guido van Rossum. It was first released in 1991.
  Python's design philosophy emphasizes code readability with the use of
  significant indentation. Guido's favorite color is unknown. The language
  provides constructs that enable clear programming on both small and large
  scales..."

Compressed (for query "when was Python released"):
  "Python was first released in 1991."

Reduces context window usage and improves signal-to-noise in the final prompt.
Trade-off: extra LLM call per retrieved chunk = additional latency.

6.4 FLARE — Forward-Looking Active Retrieval

Instead of retrieving everything upfront, FLARE lets the LLM draft its answer one
sentence at a time and triggers retrieval when the draft contains low-probability
(uncertain) tokens. It then retrieves fresh context, regenerates that sentence, and
continues.

LLM starts generating:
  "Python was created by Guido van Rossum and first released in..."
  [LLM is uncertain about the year, so retrieval is triggered]
  [Retrieves chunk: "Python 0.9.0 was first released in February 1991"]
  [LLM continues]: "...1991..."

FLARE is powerful but complex to implement. It requires access to the model’s token
probabilities, which is available for self-hosted models but not always via API.
Best suited for long-form generation tasks where not all facts are needed upfront.
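
The control flow can be sketched without committing to a specific model API. In the illustration below, generate_sentence and retrieve are hypothetical callables that you would back with a real LLM (one that exposes token probabilities) and a real retriever. The 0.4 threshold is an arbitrary placeholder, not a recommended value.

```python
from typing import Callable

# generate_sentence(question, answer_so_far, context) -> (next_sentence, token_probs)
# An empty sentence signals that the answer is complete.
GenerateFn = Callable[[str, list[str], list[str]], tuple[str, list[float]]]


def needs_retrieval(token_probs: list[float], threshold: float) -> bool:
    """Trigger retrieval if any token in the draft sentence is low-confidence."""
    return any(p < threshold for p in token_probs)


def flare_generate(
    question: str,
    generate_sentence: GenerateFn,
    retrieve: Callable[[str], list[str]],
    threshold: float = 0.4,
    max_sentences: int = 10,
) -> str:
    answer: list[str] = []
    for _ in range(max_sentences):
        # Draft the next sentence without extra context
        sentence, probs = generate_sentence(question, answer, [])
        if not sentence:
            break
        if needs_retrieval(probs, threshold):
            # Use the uncertain draft as the search query, then regenerate with context
            context = retrieve(sentence)
            sentence, _ = generate_sentence(question, answer, context)
        if sentence:
            answer.append(sentence)
    return " ".join(answer)
```

Confident sentences cost one generation call. Uncertain ones cost two calls plus a retrieval, so overall latency depends on how often the threshold is crossed.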

6.5 Agentic RAG

The LLM is given retrieval as a tool and decides autonomously:

Whether to retrieve at all
What query to issue
Whether to retrieve again based on the result
How many retrieval rounds are needed

tools = [
    {
        "name": "retrieve_documents",
        "description": "Search the knowledge base for relevant information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "n_results": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    }
]
 
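# One call returns one turn. When response.stop_reason == "tool_use", run the tool,
# append a tool_result message, and call again until the model stops requesting tools.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,  # required by the Messages API
    tools=tools,
    messages=[{"role": "user", "content": user_question}]
)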

Agentic RAG handles multi-hop questions naturally: the LLM retrieves a first fact,
reads it, formulates a follow-up query, retrieves more, and synthesizes the answer.

7. RAG Evaluation (RAGAS)

RAGAS (Retrieval-Augmented Generation Assessment) provides a framework for evaluating
RAG pipelines without manual annotation of every answer.

The Four Core Metrics

1. Faithfulness
Does the generated answer contain only claims that are supported by the retrieved context?
Detects hallucination. Score: 0 to 1.

Context: "Python was released in 1991."
Answer: "Python was released in 1991 and created by Tim Peters."
Faithfulness = 0.5 (second claim is unsupported/wrong)

2. Answer Relevancy
Does the answer actually address the user’s question?
Tests whether the LLM drifts from the question.

Computed by: generate N questions from the answer, measure semantic similarity to the
original question. Low similarity = LLM answered something adjacent but not the real question.

3. Context Precision
Of the retrieved chunks, what fraction were actually useful for answering the question?
Measures retrieval precision — are we retrieving junk along with gold?

4. Context Recall
Were all the facts needed to answer the question present in the retrieved context?
Measures retrieval completeness — are we missing critical information?

Context recall = (claims in the ground-truth answer supported by the retrieved context) / (total claims in the ground-truth answer)
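
Both faithfulness and context recall reduce to the same ratio once each claim has a verdict. In RAGAS the verdicts come from an LLM judge. The sketch below supplies them by hand to show only the arithmetic, and the helper name fraction_supported is hypothetical.

```python
def fraction_supported(verdicts: list[bool]) -> float:
    """Share of claims judged as supported by the retrieved context."""
    if not verdicts:
        raise ValueError("need at least one claim to score")
    return sum(verdicts) / len(verdicts)


# Faithfulness: claims extracted from the GENERATED answer, checked against context.
# "released in 1991" is supported; "created by Tim Peters" is not.
faithfulness_example = fraction_supported([True, False])  # 0.5

# Context recall: claims extracted from the GROUND-TRUTH answer, checked against context.
context_recall_example = fraction_supported([True, True, False])
```

The only difference between the two metrics is where the claims come from, which is why context recall needs a golden answer and faithfulness does not.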

Building a Golden Evaluation Set

Select 50–200 representative questions from your domain
Have domain experts write ideal answers
Run your RAG pipeline on each question
Score with RAGAS metrics
Identify the weakest metric and optimize for it

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
 
eval_data = {
    "question": ["What year was Python created?", ...],
    "answer": ["Python was created in 1991.", ...],         # RAG output
    "contexts": [["Python was released in 1991...", ...], ...],  # retrieved chunks
    "ground_truth": ["Python was first released in 1991.", ...], # golden answer
}
 
result = evaluate(Dataset.from_dict(eval_data), metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])
print(result)

8. When RAG Fails (and What To Do)

Failure 1: Bad Retrieval → Hallucination

Symptom: The LLM produces a confident answer that is not in the corpus.
Root cause: The right chunks were never retrieved — the LLM had to invent.
Diagnosis: Check context recall score. Manually inspect retrieved chunks for a
failing query.

Fixes:

Improve chunking (semantic chunking, appropriate chunk size)
Add hybrid retrieval (BM25 captures keywords that dense misses)
Add multi-query retrieval (query rephrasing increases recall)
Use HyDE for abstract queries
Add a “cannot answer” instruction: “If the context does not contain the answer, say so.”

Failure 2: Context Window Overflow

Symptom: Retrieved content is too long to fit in the prompt. Either you truncate
(losing information) or you hit API token limits.
Root cause: Too many retrieved chunks, or chunks are too large.

Fixes:

Reduce K (retrieve fewer chunks, rely on reranking for quality)
Use contextual compression to shrink each chunk before sending
Use parent-document retrieval with deduplication: several matching child chunks often share one parent, so you send fewer, more coherent passages
Switch to a model with a larger context window
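
One way to make "reduce K" concrete is to pack chunks into a fixed token budget instead of a fixed count. The sketch below assumes the chunks are already sorted best-first and uses whitespace word count as a stand-in tokenizer. In practice you would pass your model's real tokenizer as `count_tokens`; `pack_chunks` is a hypothetical helper.

```python
def pack_chunks(ranked_chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked chunks that fit within max_tokens."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # this chunk does not fit; a shorter one further down may
        selected.append(chunk)
        used += cost
    return selected, used


chunks = ["a b c d", "e f g h i j", "k l"]  # ranked best-first
selected, used = pack_chunks(chunks, max_tokens=7)
print(selected, used)
```

Skipping an oversized chunk rather than stopping keeps the budget well used, at the cost of sometimes preferring a lower-ranked chunk.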

Failure 3: Stale Embeddings

Symptom: Queries about recently added or updated documents return old, irrelevant results.
Root cause: Documents were added or changed in the corpus but never embedded (or re-embedded) and indexed.

Fixes:

Implement a change-detection pipeline: checksums on source docs, trigger re-indexing on change
Incremental indexing: only re-embed changed or new documents
Track last_indexed timestamps per document
For rapidly changing data: consider caching + invalidation rather than batch re-index
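
The checksum-based change detection above can be sketched with the standard library. The document store and the record of previously indexed hashes are plain dicts here, and `plan_reindex` is a hypothetical helper. A real pipeline would persist the hashes next to the `last_indexed` timestamps.

```python
import hashlib


def content_hash(text):
    """Stable checksum of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, indexed_hashes):
    """Compare current docs against the hashes recorded at last indexing.

    documents:      {doc_id: current text}
    indexed_hashes: {doc_id: hash stored when the doc was last embedded}
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete


docs = {"a": "v1", "b": "v2 (edited)", "c": "new doc"}
indexed = {"a": content_hash("v1"), "b": content_hash("v2"), "d": content_hash("removed")}

to_embed, to_delete = plan_reindex(docs, indexed)
print(to_embed, to_delete)
```

Only changed and new documents are re-embedded, which is the incremental indexing the list describes.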

Failure 4: Query-Document Semantic Mismatch

Symptom: Dense retrieval misses obviously relevant chunks. Example: user asks
“Who founded Python?” but documents use the phrase “Guido van Rossum created Python.”
Root cause: The embedding model encodes queries and documents in slightly different
parts of the space; or domain-specific vocabulary mismatch.

Fixes:

Add BM25 hybrid retrieval (keyword overlap doesn’t care about semantics)
Use HyDE (hypothetical document is in document-space language)
Use query expansion / rewriting
Fine-tune the embedding model on your domain data

Failure 5: Multi-Hop Questions

Symptom: “What is the nationality of the person who created Python?” requires two
reasoning steps: (1) find who created Python, (2) find their nationality.
Single-pass retrieval cannot handle this in one shot.

Fixes:

Iterative retrieval: answer step 1, use that answer as input to step 2
Agentic RAG: let the LLM issue multiple retrieval calls
Knowledge graph augmentation: store entity relationships explicitly
Pre-compute common multi-hop paths during indexing
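
Iterative retrieval can be sketched as a loop in which each sub-question is filled in with the previous answer. The `retrieve` and `answer` callables stand in for your retriever and LLM. The toy versions below are lookups in a hard-coded dict, and all of these names are assumptions made for illustration.

```python
def iterative_answer(sub_questions, retrieve, answer):
    """Answer sub-questions in order; '{prev}' is replaced by the previous answer."""
    prev = ""
    trace = []
    for template in sub_questions:
        query = template.format(prev=prev)
        context = retrieve(query)      # one retrieval call per hop
        prev = answer(query, context)  # intermediate answer feeds the next hop
        trace.append((query, prev))
    return prev, trace


# Toy stand-ins for a retriever and an LLM.
FACTS = {
    "Who created Python?": "Guido van Rossum",
    "What is the nationality of Guido van Rossum?": "Dutch",
}


def toy_retrieve(query):
    return [FACTS.get(query, "")]


def toy_answer(query, context):
    return context[0]


final, trace = iterative_answer(
    ["Who created Python?", "What is the nationality of {prev}?"],
    toy_retrieve,
    toy_answer,
)
print(final)
```

In an agentic setup the LLM would generate the sub-questions itself; here they are fixed to keep the control flow visible.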

Failure 6: The “Lost in the Middle” Problem

Research (Liu et al., 2023) shows that LLMs use information most reliably when it
appears at the beginning or end of a long context. Information in the middle of a
large context window is systematically under-utilized.

Fix: Reranking ensures the most relevant chunks are placed first. If sending many
chunks, put the highest-ranked chunks at the top and bottom, not the middle. Some
systems use recursive summarization to compress the middle context.
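
The "top and bottom, not the middle" placement can be sketched as a simple reordering of a best-first list. The alternating scheme below is one possible choice, not a prescribed algorithm, and `order_for_long_context` is a hypothetical name.

```python
def order_for_long_context(ranked_chunks):
    """Reorder best-first chunks so the strongest sit at the edges of the prompt.

    Even-ranked positions fill the front, odd-ranked positions fill the back
    in reverse, which pushes the weakest chunks toward the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ordered = order_for_long_context(["r1", "r2", "r3", "r4", "r5"])
print(ordered)
```

With five chunks, the best one is first, the second-best is last, and the lowest-ranked one lands in the middle.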

9. Interview Flashcards

Q1: What is RAG and when would you use it over fine-tuning?

RAG (Retrieval-Augmented Generation) is a technique where relevant documents are
retrieved from an external knowledge base and injected into the LLM prompt as context
before generation. Use RAG when: knowledge changes frequently, you need citation/
auditability, or you have domain-specific factual data not well covered in training.
Use fine-tuning when: you need to change the model’s behavior or style, the task
requires a new reasoning pattern, or you want to internalize a stable, unchanging body
of knowledge efficiently. In practice, many production systems use both.

Q2: Explain chunking strategies and their trade-offs.

Fixed-size: simple but breaks semantic context at arbitrary boundaries. Sentence-based:
respects language units but variable chunk sizes. Semantic chunking: splits on meaning
shifts (embedding similarity drops), most coherent but computationally expensive.
Hierarchical: stores small chunks for retrieval precision but returns parent chunks for
LLM context — best of both worlds. Key insight: chunk size trades off precision vs
context. Small chunks (128 tokens) enable precise matching; large chunks (1000 tokens)
provide richer context. Overlap (10–15%) prevents losing information at boundaries.
Rule of thumb: 512 tokens, 10% overlap for most cases.
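
Fixed-size chunking with overlap is short enough to sketch in full. The function works on a list of tokens that has already been produced by some tokenizer. The defaults mirror the rule of thumb above (512 tokens, about 10% overlap), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=51):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list(range(10))
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk when the overlap is large enough.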

Q3: What is hybrid retrieval and why is it better than pure semantic search?

Hybrid retrieval combines dense (semantic/embedding-based) and sparse (BM25/keyword)
retrieval, then merges results using Reciprocal Rank Fusion (RRF). Dense retrieval
captures semantic meaning and synonyms but struggles with exact keyword matches and
proper nouns. BM25 excels at exact matches but has no semantic understanding. Hybrid
gets both: “Python 3.11 release notes” is found via keywords (BM25) while “Python
programming language history” is found via semantics (dense). On benchmarks such as BEIR,
hybrid is often reported to outperform either alone by roughly 5–15% nDCG, though the gain varies by dataset.
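
Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. Here is a minimal sketch. The constant `k=60` is a commonly used default and is an assumption here, as are the document IDs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings into one list.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["d1", "d2", "d3"]  # hypothetical dense-retrieval ranking
bm25 = ["d3", "d1", "d4"]   # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

Documents that appear in both lists (d1, d3) rise above those found by only one retriever, which is the behaviour hybrid retrieval relies on.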

Q4: What is HyDE and when would you use it?

HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query
using the LLM, then embeds that answer for retrieval — instead of embedding the raw
query. It works because queries and documents have different linguistic structure:
a question and a passage that answers it may not be close in embedding space, but the
hypothetical answer is. Use HyDE when: queries are short and vague, there is a large
vocabulary mismatch between query and document styles, or in academic/technical domains.
Caution: if the LLM hallucinates the hypothetical, retrieval degrades.

Q5: How do you evaluate a RAG pipeline?

Use RAGAS metrics: (1) Faithfulness — does the answer only make claims supported by
retrieved context? (2) Answer relevancy — does the answer address the question?
(3) Context precision — are the retrieved chunks relevant? (4) Context recall — did
retrieval find all the necessary information? Build a golden evaluation set: 50–200
manually curated question/answer pairs. Run the pipeline on all questions, score with
RAGAS, identify the lowest-scoring metric, and optimize that component. For production,
add LLM-as-judge evaluation for edge cases and log + sample real user interactions.

Q6: What is reranking and when does it help?

Reranking is a two-stage process: (1) fast ANN retrieval of top-K candidates (~20),
(2) a cross-encoder model scores each candidate against the query and re-orders them.
A cross-encoder processes the full query-document pair jointly (unlike bi-encoder
embeddings that encode separately), giving much more accurate relevance scores. It adds
~100–200ms latency but significantly improves precision. Use reranking when: retrieval
quality is critical, the initial retrieval pool is large and noisy, or users notice
irrelevant answers. Services: Cohere Rerank API. Open-source: ms-marco-MiniLM cross-encoders; ColBERT (a late-interaction model rather than a cross-encoder) is a related option.
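
The two-stage shape can be sketched independently of any particular model. The first stage and the pair scorer are passed in as callables. The toy scorer below counts shared words and merely stands in for a cross-encoder. All names are hypothetical.

```python
def retrieve_then_rerank(query, first_stage, score_pair, candidates_k=20, final_k=5):
    """Stage 1: cheap candidate retrieval. Stage 2: score each (query, doc) pair jointly."""
    candidates = first_stage(query, candidates_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]


DOCS = ["python release notes", "snake care guide", "python 3.11 release notes"]


def toy_first_stage(query, k):
    return DOCS[:k]  # pretend this is an ANN search returning k candidates


def toy_score(query, doc):
    return len(set(query.split()) & set(doc.split()))  # stand-in for a cross-encoder


top = retrieve_then_rerank(
    "python 3.11 release notes", toy_first_stage, toy_score, candidates_k=3, final_k=2
)
print(top)
```

The scorer runs once per candidate, which is why the candidate pool is kept small (around 20) and why reranking adds latency.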

Q7: How do you handle a question that requires information from multiple chunks?

Multi-hop questions require reasoning across multiple pieces of evidence. Approaches:
(1) Agentic RAG — give the LLM retrieval as a tool and let it call it multiple times;
(2) iterative retrieval — answer step 1, use the intermediate answer to form step 2
query; (3) multi-query retrieval with union — generate sub-questions, retrieve for
each, merge context; (4) knowledge graph — pre-compute entity relationships so
multi-hop traversal is a graph query. The lost-in-the-middle problem also applies:
put the most relevant chunks first and last in the context.

Q8: What is the “lost in the middle” problem and how does it affect RAG?

Research (Liu et al., 2023) showed that LLMs use information most reliably when it is at
the beginning or end of long prompts, and systematically underweight information in
the middle. For RAG, this means: if the answer-bearing chunk is buried in position 5
of 10 retrieved chunks, the model may ignore it. Fix: use reranking to put the most
relevant chunks first. If you must send many chunks, consider placing critical chunks
at position 1 and position K (not in the middle). Contextual compression also helps by
reducing chunk size, fitting more signal in the same context budget.

Q9: How would you design a RAG system for a 10M-document corpus?

Architecture decisions at this scale:

Embedding: Batch processing pipeline (e.g., Spark/Ray) with a hosted or local
embedding model. Incremental re-indexing with change detection.
Vector DB: Pinecone (managed) or Qdrant (self-hosted with sharding). Need
distributed ANN index across nodes.
Chunking: Semantic or hierarchical chunking with metadata (source, date, section).
At ~512 tokens per chunk and roughly two chunks per document on average, that is ~20M index entries.
Retrieval: Hybrid (BM25 + dense) with metadata pre-filtering. Reranking for
top-20 → top-5.
Latency budget: Embedding query (~50ms), ANN search (~20ms), reranking (~150ms),
LLM generation (~1–2s). Total: ~1.2–2.2s, acceptable for most use cases.
Observability: Log every retrieval + answer pair. Sample for RAGAS evaluation.
Track cache hit rates, p99 latency, per-query costs.
Freshness: Document ingestion queue, re-index on change, TTL on stale documents.
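
Two back-of-the-envelope numbers help when defending this design: raw vector storage and the end-to-end latency range. The sketch below assumes 1536-dimensional float32 embeddings, which is an assumed value to be replaced with your model's actual dimension. The latency figures are the ones listed above.

```python
def raw_vector_storage_gb(num_vectors, dims, bytes_per_value=4):
    """Size of the raw vectors only; index overhead and metadata come on top."""
    return num_vectors * dims * bytes_per_value / 1e9


# ~20M chunks, assumed 1536 dims, float32
storage_gb = raw_vector_storage_gb(20_000_000, 1536)

# (low, high) estimates in milliseconds per stage, taken from the budget above
stages_ms = {
    "embed_query": (50, 50),
    "ann_search": (20, 20),
    "rerank": (150, 150),
    "llm_generation": (1000, 2000),
}
latency_low_ms = sum(low for low, _ in stages_ms.values())
latency_high_ms = sum(high for _, high in stages_ms.values())

print(f"{storage_gb:.2f} GB, {latency_low_ms}-{latency_high_ms} ms")
```

Under these assumptions the raw vectors alone are on the order of 120 GB, which is why a sharded or managed index comes up at this scale. Generation dominates the latency budget.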

Q10: What is parent-document retrieval?

Parent-document retrieval is a two-tier chunking strategy that separates the unit used
for indexing/search from the unit sent to the LLM. Small child chunks (~128 tokens)
are embedded and indexed — their compact size makes them precise search targets. When
a child chunk is retrieved, its parent chunk (~1000 tokens) is fetched from a document
store and sent to the LLM instead, providing full context around the matched text.
This solves the tension between retrieval precision (favors small chunks) and generation
quality (favors large context). The trade-off: extra storage (child index + parent
store) and more complex retrieval logic.
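
The two tiers can be sketched with plain data structures. Children are indexed for search and each one carries the ID of its parent. Word-overlap scoring stands in for embedding search, chunk sizes are counted in words rather than tokens, and both helper names are hypothetical.

```python
def build_child_index(parents, child_size):
    """Split each parent into small children; keep a pointer back to the parent."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append((" ".join(words[start:start + child_size]), parent_id))
    return children


def retrieve_parents(query, children, parents, top_k=2):
    """Search over children, then return their (deduplicated) parent chunks."""
    q = set(query.lower().split())
    ranked = sorted(children, key=lambda c: len(q & set(c[0].lower().split())), reverse=True)
    seen, result = set(), []
    for _, parent_id in ranked[:top_k]:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


parents = {
    "p1": "Python was created by Guido van Rossum and first released in 1991",
    "p2": "BM25 is a sparse keyword ranking function used in search",
}
children = build_child_index(parents, child_size=4)
context = retrieve_parents("who created python", children, parents, top_k=2)
print(context)
```

The match happens on the small child "Python was created by", but the LLM receives the whole parent, including the release year that the child alone does not contain.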

Quick Reference

Chunk size rule of thumb:
  256–512 tokens for most RAG use cases
  10–15% overlap
  Semantic or sentence chunking preferred over fixed-size

Retrieval rule of thumb:
  Retrieve top-20, rerank to top-5
  Use hybrid (BM25 + dense) by default
  Add HyDE for complex/abstract queries

Evaluation rule of thumb:
  Faithfulness > 0.85 = acceptable
  Context recall > 0.80 = acceptable
  If recall is low → fix retrieval
  If faithfulness is low → fix prompt / add grounding instruction

Vector DB selection:
  Local dev → Chroma
  Production self-hosted → Qdrant
  Production managed → Pinecone
  Existing Postgres → pgvector

Next: see examples/basic_rag.py for a working implementation, and exercises/README.md for hands-on practice.

Study Notes by Niladri & AI
