02 — Retrieval-Augmented Generation (RAG)
A comprehensive deep-dive into RAG: from first principles to production system design.
This module is structured for engineers who want interview-ready expertise and the
ability to design and debug RAG systems at scale.
Table of Contents
- What is RAG and Why
- Chunking Strategies
- Embedding Models
- Vector Databases — Trade-off Matrix
- Retrieval Strategies
- Advanced RAG Patterns
- RAG Evaluation (RAGAS)
- When RAG Fails (and What To Do)
- Interview Flashcards
1. What is RAG and Why
The Core Problem
Large Language Models are trained on a static snapshot of the world. Once training ends,
their knowledge is frozen. This creates several hard problems for production systems:
Knowledge Cutoff
A model trained through early 2024 knows nothing about events, documents, or data
produced after that date. For enterprise use cases — legal filings, product docs,
internal wikis, real-time pricing — this is a showstopper.
Hallucination on Specific Facts
LLMs are probabilistic next-token predictors. When asked about specific facts not
well-represented in training data (internal docs, niche domains, recent events),
the model generates plausible-sounding but incorrect answers. This is not a bug
you can patch — it is a property of the architecture.
The Cost and Inflexibility of Fine-Tuning
Fine-tuning on private data costs time, money, and compute. More importantly, it
does not solve the problem elegantly:
- Fine-tuned knowledge still decays and becomes stale
- Fine-tuning teaches the model style and behavior, not reliable fact recall
- You cannot easily audit “why did the model say X” from fine-tuned weights
- Re-training is expensive every time documents change
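RAG addresses all three of these problems by externalizing knowledge into a searchable
index and injecting relevant facts into the prompt at query time. It reduces hallucination
rather than eliminating it: the model can still misread or ignore the retrieved context.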
The RAG Pipeline
At a high level, RAG has two phases:
Indexing Phase (offline): Documents are processed, chunked, embedded into vectors,
and stored in a vector database. This is done once (or on a schedule).
Query Phase (online): A user query comes in, gets embedded, similar chunks are
retrieved from the vector DB, and those chunks are stuffed into the LLM prompt as
grounding context before the model generates its answer.
INDEXING PHASE (offline)
========================
Raw Documents
(PDFs, HTML, TXT, etc.)
|
v
+------------+
| Document |
| Loader |
+------------+
|
v
+------------+
| Chunking | <-- split into ~512 token pieces
| Strategy |
+------------+
|
v
+------------------+
| Embedding Model | <-- e.g. text-embedding-3-small
+------------------+
|
v (dense vectors)
+------------------+
| Vector Database | <-- Chroma, Pinecone, Qdrant, pgvector
| (index stored) |
+------------------+
QUERY PHASE (online)
====================
User Query
"What are the refund policies?"
|
v
+------------------+
| Embedding Model | <-- same model as indexing
+------------------+
|
v (query vector)
+------------------+ +------------------+
| Vector Database |------>| Top-K Chunks |
| ANN search | | (e.g. top 3-5) |
+------------------+ +------------------+
|
v
+------------------------+
| Prompt Construction |
| |
| System: You are a ... |
| Context: |
| Chunk 1: ... |
| Chunk 2: ... |
| Chunk 3: ... |
| Question: {user query}|
+------------------------+
|
v
+---------------+
| LLM (Claude, |
| GPT-4, etc.) |
+---------------+
|
v
Final Answer to User
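The two phases in the diagram can be sketched end to end in a few lines. This is a toy under loud assumptions: `embed` is a hypothetical stand-in that builds unit-length word-count vectors (a real system would call an embedding model), the index is a plain Python list rather than a vector database, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors equals their cosine."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def build_index(chunks: list[str]) -> list[tuple[str, dict[str, float]]]:
    """Indexing phase: embed every chunk once and keep (text, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, index: list[tuple[str, dict[str, float]]], k: int = 3) -> list[str]:
    """Query phase: embed the query the same way, then rank chunks by cosine."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Place the retrieved chunks in the prompt as grounding context."""
    context = "\n".join(f"Chunk {i + 1}: {c}" for i, c in enumerate(chunks))
    return f"Context:\n{context}\nQuestion: {query}"


index = build_index([
    "Shipping takes five business days.",
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
])
top_chunks = retrieve("What are the refund policies?", index, k=1)
prompt = build_prompt("What are the refund policies?", top_chunks)
print(prompt)
```

Note that word-count vectors only match exact words ("policies" does not match "policy" here); closing that gap is what a learned embedding model is for.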
RAG vs Fine-Tuning — When to Use Which
| Criteria | RAG | Fine-Tuning |
|---|---|---|
| Knowledge is dynamic/updated | RAG (re-index, not retrain) | Poor fit |
| Need to cite sources | RAG (chunks are traceable) | Not possible |
| Teaching a new reasoning style | Poor fit | Fine-tuning |
| Domain-specific vocabulary | Partial (retrieval helps) | Fine-tuning adapts the model to domain terms (the tokenizer itself is unchanged) |
| Cost | Pay per query (embeddings) | Large upfront training cost |
| Auditability | Excellent (see the chunks) | Black box |
| Latency sensitivity | Adds retrieval latency | No retrieval step |
The most common production pattern is both: fine-tune for style/format/reasoning,
use RAG for facts/knowledge.
2. Chunking Strategies
Chunking is the most underestimated variable in a RAG pipeline. Poor chunking
guarantees poor retrieval, regardless of how good your embedding model is.
The goal: each chunk should be semantically self-contained — a reader should
understand it without reading surrounding text.
2.1 Fixed-Size Chunking
Split text into chunks of N characters or N tokens, with an optional overlap.
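def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # Without this guard the loop below would never advance
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # overlap ensures continuity
    return chunks
Pros: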
- Dead simple to implement
- Predictable chunk sizes = predictable embedding costs
- Uniform index entry sizes
Cons:
- Completely ignores document structure
- Will split sentences mid-thought, breaking semantic coherence
- A chunk about “the company was founded in 1999” might be cut to “the company was
founded in 19” — useless after embedding
When to use: Quick prototypes, when you need a baseline, or when documents are
structurally homogeneous (e.g., database rows).
2.2 Sentence-Based Chunking
Respect sentence boundaries. Use an NLP library (spaCy, NLTK) or regex to detect
sentence endings.
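import re
def sentence_chunk(text: str, max_sentences: int = 5, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of max_sentences that share `overlap` sentences."""
    if not 0 <= overlap < max_sentences:
        # A zero or negative step would make range() fail or never advance
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    # Naive sentence splitter; use spaCy for production
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences - overlap):
        chunk = " ".join(sentences[i : i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks
Pros: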
- Preserves natural language units
- Each chunk is readable by humans and models alike
- Better semantic coherence than character-splitting
Cons:
- Sentence length varies wildly — some chunks may be too short, others too long
- Technical documents with long sentences can still produce huge chunks
- Requires an accurate sentence tokenizer for non-English text
2.3 Semantic Chunking (Split on Meaning Shift)
Split when the meaning changes, not on arbitrary size boundaries. The technique:
embed each sentence, compute cosine similarity between adjacent sentences, and split
at points where similarity drops sharply (a “semantic discontinuity”).
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def semantic_chunk(text: str, threshold: float = 0.4) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Naive splitter: drops the ". " separators, which the joins below restore
    sentences = [s for s in text.split(". ") if s]
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(
            embeddings[i - 1].reshape(1, -1),
            embeddings[i].reshape(1, -1)
        )[0][0]
        if sim < threshold:
            # Meaning has shifted: start a new chunk
            chunks.append(". ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(". ".join(current_chunk))
    return chunks
Pros:
- Chunks align with actual topic changes
- Much better retrieval precision — chunks are topically coherent
- Works well across different document types
Cons:
- Computationally expensive (embed every sentence during indexing)
- The threshold parameter requires tuning per domain
- Can produce very short or very long chunks if similarity is extreme
When to use: Long-form documents (research papers, reports, books) where topics
shift gradually.
2.4 Hierarchical / Parent-Document Chunking
Store two levels: parent chunks (large, ~1000 tokens, rich context) and
child chunks (small, ~128 tokens, high precision for retrieval).
During retrieval: search using child chunk embeddings (precise), but return the
parent chunk to the LLM (full context).
Document
|
+-- Parent Chunk 1 (~1000 tokens)
| |
| +-- Child Chunk 1a (~128 tokens) <-- indexed for search
| +-- Child Chunk 1b (~128 tokens) <-- indexed for search
| +-- Child Chunk 1c (~128 tokens) <-- indexed for search
|
+-- Parent Chunk 2 (~1000 tokens)
|
+-- Child Chunk 2a (~128 tokens)
+-- Child Chunk 2b (~128 tokens)
def hierarchical_chunk(text: str, parent_size: int = 1000, child_size: int = 128):
    # Sizes here are in characters because fixed_size_chunk slices characters;
    # swap in a token-based splitter to match the token counts quoted above.
    parent_chunks = fixed_size_chunk(text, chunk_size=parent_size, overlap=0)
    index = []  # what gets embedded and searched
    store = {}  # what gets returned to LLM
    for i, parent in enumerate(parent_chunks):
        store[f"parent_{i}"] = parent
        children = fixed_size_chunk(parent, chunk_size=child_size, overlap=20)
        for j, child in enumerate(children):
            index.append({
                "id": f"parent_{i}_child_{j}",
                "parent_id": f"parent_{i}",
                "text": child,  # embed this
            })
    return index, store
Pros:
- Best of both worlds: precise retrieval + rich LLM context
- Significantly reduces the “context fragmentation” problem
- Works especially well for structured documents
Cons:
- More complex to implement and maintain
- Requires a two-tier storage strategy
- More index entries = higher embedding cost
2.5 Chunk Overlap — Why and How Much
Overlap means the last N tokens of chunk i are repeated as the first N tokens of
chunk i+1. This prevents losing context at chunk boundaries.
WITHOUT overlap:
Chunk 1: "...The Python programming language was"
Chunk 2: "created by Guido van Rossum in 1991..."
A query for "who created Python" may not match Chunk 2 alone.
WITH overlap (30 tokens):
Chunk 1: "...The Python programming language was created by Guido"
Chunk 2: "language was created by Guido van Rossum in 1991..."
Now either chunk can answer the question.
Recommended overlap: 10–15% of chunk size.
- For 512-token chunks: 50–75 tokens of overlap
- For 256-token chunks: 25–40 tokens of overlap
Beware: too much overlap (>25%) wastes index space and can cause near-duplicate
chunks to rank equally, diluting retrieval diversity.
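The cost of overlap is easy to quantify with a small calculation. The sketch below is illustrative only: `chunk_spans` is a hypothetical helper that works on token offsets and, unlike a naive sliding loop, stops as soon as the document is fully covered.

```python
def chunk_spans(n_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for overlapping fixed-size chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    spans = []
    start = 0
    stride = chunk_size - overlap  # how far each new chunk advances
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the document is fully covered; no redundant tail chunk
        start += stride
    return spans


doc_tokens = 2048
for overlap in (51, 256):  # roughly 10% and 50% of a 512-token chunk
    spans = chunk_spans(doc_tokens, 512, overlap)
    embedded = sum(end - start for start, end in spans)
    print(f"overlap={overlap}: {len(spans)} chunks, "
          f"{embedded / doc_tokens:.2f}x tokens embedded")
```

For the same 2048-token document, moving from about 10% to 50% overlap raises the chunk count from 5 to 7, and every extra chunk is mostly text that is already indexed.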
2.6 Chunk Size Tuning
This is empirical — there is no universally correct answer.
| Chunk Size | Precision | Context | Cost | Best For |
|---|---|---|---|---|
| 64–128 tokens | Very high | Low | High | Fact lookup, FAQ, Q&A systems |
| 256–512 tokens | High | Medium | Med | General RAG (most common) |
| 512–1024 tokens | Medium | High | Low | Summarization, long-form answer |
| 1000+ tokens | Low | Very high | Low | Context-heavy reasoning tasks |
Rule of thumb: Start with 512 tokens and 10% overlap. Measure retrieval quality
with RAGAS (see Section 7). Tune based on context recall and context precision scores.
Key insight: embedding models have a context window too. Many cap at 512 tokens or
fewer (all-MiniLM-L6-v2 stops at 256), while newer models such as bge-m3 accept 8192.
If your chunks exceed the model's max token limit, the end of the chunk is typically
truncated before embedding, so you lose information silently.
3. Embedding Models
An embedding model converts text into a dense numeric vector. Two texts with similar
meaning produce vectors that are close in high-dimensional space (measured by cosine
similarity or dot product). This is the foundation of semantic search.
"What is Python?" --> [0.12, -0.34, 0.89, 0.02, ...] (768 dimensions)
"Python programming language" --> [0.14, -0.31, 0.91, 0.03, ...]
cosine_similarity = 0.97 (very close)
"What is Python?" --> [0.12, -0.34, 0.89, 0.02, ...]
"I enjoy eating pizza" --> [-0.44, 0.12, -0.23, 0.67, ...]
cosine_similarity = 0.12 (very far)
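Cosine similarity itself is a short formula: the dot product divided by the product of the vector lengths. Here is a minimal pure-Python sketch; the vectors reuse only the four components printed above, so the resulting numbers are illustrative and will not equal the figures quoted for the full 768-dimensional vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


what_is_python = [0.12, -0.34, 0.89, 0.02]
python_language = [0.14, -0.31, 0.91, 0.03]
eating_pizza = [-0.44, 0.12, -0.23, 0.67]

print(cosine_similarity(what_is_python, python_language))  # close to 1
print(cosine_similarity(what_is_python, eating_pizza))     # much lower
```

If vectors are normalized to unit length at indexing time, the dot product alone gives the same ranking, which is why many vector stores offer both metrics.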
3.1 OpenAI Embedding Models
text-embedding-3-small
- Dimensions: 1536 (can be truncated to 512 or 256 with the dimensions param)
- Cost: $0.02 per 1M tokens
- Performance: Excellent for English, solid multilingual
- Latency: ~50–100ms per API call
- Best for: Production systems where cost matters; most use cases
text-embedding-3-large
- Dimensions: 3072 (can be truncated)
- Cost: $0.13 per 1M tokens (6.5x more expensive)
- Performance: higher MTEB average than text-embedding-3-small (64.6 vs 62.3 in OpenAI's launch figures)
- Best for: When retrieval quality is critical and you can afford the cost
Legacy: text-embedding-ada-002
- Dimensions: 1536
- Outperformed by text-embedding-3-small on most benchmarks at one-fifth the price ($0.10 vs $0.02 per 1M tokens)
- Avoid for new projects
3.2 Open-Source Embedding Models
BAAI/bge-m3
- Architecture: built on XLM-RoBERTa (large variant)
- Dimensions: 1024
- Strengths: Multilingual (100+ languages), supports dense + sparse + multi-vector
- Context window: 8192 tokens (excellent for long documents)
- Best for: Non-English documents, hybrid retrieval, privacy-sensitive use cases
nomic-embed-text
- Dimensions: 768
- Context window: 8192 tokens
- Fully open-source (Apache 2.0), can run locally
- Competitive with OpenAI ada-002 on MTEB benchmarks
- Best for: Local deployment, open-source stacks
all-MiniLM-L6-v2
- Dimensions: 384
- Context window: 256 tokens
- Extremely fast (runs on CPU)
- Lower quality but fine for prototypes
- Best for: Local development, resource-constrained environments
3.3 Dimension Trade-offs
More dimensions = more expressive representations, but:
| Dimensions | Storage per 1M chunks (float32) | Search speed | Quality |
|---|---|---|---|
| 384 | ~1.5 GB | Fastest | Good |
| 768 | ~3.0 GB | Fast | Better |
| 1536 | ~6.1 GB | Medium | Great |
| 3072 | ~12.3 GB | Slower | Best |
At 10M chunks, the storage difference between 384 and 3072 dimensions is ~107.5 GB.
This matters for production planning.
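The table values follow from simple arithmetic: vectors × dimensions × 4 bytes per float32 value. A small sketch, where `index_size_gb` is a hypothetical helper that counts raw vector storage only (ANN index structures such as an HNSW graph add overhead on top):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for n_vectors float32 vectors, in decimal gigabytes."""
    return n_vectors * dims * bytes_per_value / 1e9


for dims in (384, 768, 1536, 3072):
    print(f"{dims:>4} dims: {index_size_gb(1_000_000, dims):.2f} GB per 1M chunks")

# Gap between the smallest and largest option at 10M chunks
gap_gb = index_size_gb(10_000_000, 3072) - index_size_gb(10_000_000, 384)
print(f"gap at 10M chunks: {gap_gb:.2f} GB")
```

Passing `bytes_per_value=2` or `1` gives a rough estimate for float16 or int8 quantized vectors.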
3.4 Asymmetric Embeddings: Query vs Document
Some embedding models are trained with asymmetric pairs: queries and documents are
treated differently because their linguistic structure differs.
A query: "What year was Python created?"
A document: "Python was created by Guido van Rossum and first released in 1991."
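For asymmetric models (such as the BGE v1.5 family, E5, or Cohere embed), you pass a
different prefix or mode to the encoder depending on whether you are indexing documents
or searching. bge-m3 itself needs no query instruction, so this example uses
bge-base-en-v1.5:
# BGE v1.5 asymmetric usage
from FlagEmbedding import FlagModel
model = FlagModel(
    "BAAI/bge-base-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)
documents = ["Python was created by Guido van Rossum and first released in 1991."]
# At indexing time: documents are encoded without the instruction
doc_embeddings = model.encode(documents)
# At query time: encode_queries prepends the instruction to each query
query_embedding = model.encode_queries(["What year was Python created?"])
Symmetric models (MiniLM) treat queries and documents identically. Asymmetric models
generally outperform symmetric ones for question-answering workloads.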
3.5 Local vs API Embeddings
| Factor | Local Model (e.g. bge-m3) | API (e.g. OpenAI) |
|---|---|---|
| Cost | Compute only (one-time) | Per-token pricing |
| Latency | No network hop | 50–150ms per call |
| Privacy | Data never leaves your infra | Data sent to vendor |
| Quality | Competitive (bge-m3 = strong) | Very high (esp. large model) |
| Setup complexity | High (GPU/CPU tuning) | Trivial (API key) |
| Throughput | Limited by hardware | Rate limits apply |
| Best for | Enterprise, regulated, volume | Startups, prototyping, SaaS |
For 1M+ document corpora, local embedding often pays for itself in 2–3 months
compared to API costs.
4. Vector Databases — Trade-off Matrix
| Database | Hosted / Self-hosted | Scale | Metadata Filtering | Hybrid Search | Cost Model | Best For |
|---|---|---|---|---|---|---|
| Pinecone | Hosted only | Very large | Yes (robust) | Yes | Pay per pod/query | Production SaaS, fastest start |
| Weaviate | Both | Large | Yes (GraphQL) | Yes (BM25) | Open-source + cloud | Complex queries, knowledge graphs |
| Chroma | Self-hosted | Small–medium | Yes (basic) | No (dense only) | Free / open-source | Local dev, prototypes, testing |
| Qdrant | Both | Large | Yes (rich) | Yes | Open-source + cloud | High-perf self-hosted production |
| pgvector | Self-hosted | Medium | Yes (full SQL) | Partial | PostgreSQL costs | Existing Postgres stack |
| FAISS | Self-hosted (library) | Very large | No (manual) | No | Free (Meta) | Research, batch processing |
Notes:
- Pinecone: Easiest to get to production. Serverless tier is generous. No
operational burden. The downside: vendor lock-in and data leaves your infra.
- Qdrant: Strong performance benchmarks, excellent filtering API, Docker-based
self-hosting is trivial. Best choice for self-hosted production.
- Chroma: Do not use in production at scale. It shines for local development:
zero config, pure Python, runs in-memory or on disk.
- pgvector: If your team already manages PostgreSQL, pgvector lets you keep
everything in one system. ACID transactions, joins, full SQL filtering. Trade-off:
ANN performance is weaker than dedicated vector DBs at very large scale.
- FAISS: Meta’s library, not a database. No persistence layer, no filtering, no
server. Use when you are building a custom retrieval system with full control.
- Weaviate: Unique because it stores both the vector and the object, supports
cross-references between objects (near graph DB). Best for knowledge-graph-style
RAG where documents have relationships.
5. Retrieval Strategies
5.1 Dense Retrieval (Semantic Similarity)
The baseline RAG retrieval method. Embed the query, find the K nearest vectors in the
index using approximate nearest neighbor (ANN) search (cosine similarity or dot product).
import chromadb
client = chromadb.Client()
collection = client.get_collection("docs")  # assumes a "docs" collection already exists
query_results = collection.query(
    query_texts=["What year was Python released?"],
    n_results=5,
)
How ANN works: Exact nearest neighbor search scales as O(n * d) per query, which is
unusable at scale. ANN algorithms (HNSW, IVF, ScaNN) trade a small accuracy loss for
large speed gains (often around 100x). HNSW (Hierarchical Navigable Small World) is the
most common: it builds a layered graph where higher layers are sparser and lower layers
are dense. Search traverses top to bottom.
Strengths: Captures semantic meaning, synonyms, paraphrases
Weaknesses: Poor at exact keyword matches, proper nouns, model numbers, IDs
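To see what ANN indexes are approximating, it helps to look at the exact baseline. A minimal brute-force sketch (the function name and toy vectors are illustrative): every query touches all n vectors of d dimensions, which is the O(n * d) cost mentioned above.

```python
import heapq


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Score every stored vector against the query and return the k best ids."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)  # dot-product similarity
        for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(exact_top_k([1.0, 0.0], vectors, k=2))
```

Brute force is still a reasonable choice for small corpora, and it doubles as ground truth when measuring how much recall an ANN index gives up.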
5.2 Sparse Retrieval (BM25)
BM25 is a probabilistic keyword-based ranking function. It scores a document based on
how often query terms appear in it, normalized by document length and corpus-wide
term frequency (IDF).
BM25(q, d) = SUM over query terms t:
IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d| / avgdl))
where:
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
TF(t,d) = frequency of term t in document d
k1 = 1.5 (term saturation, higher = slower saturation)
b = 0.75 (length normalization, 1 = full, 0 = none)
Strengths: Exact keyword matches, proper nouns, serial numbers, dates
Weaknesses: No semantic understanding; “car” and “automobile” are completely different
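The formula above translates almost line for line into code. This is a minimal sketch assuming lowercase whitespace tokenization and an in-memory corpus; production systems would normally rely on a search engine or a dedicated BM25 library instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query using the BM25 formula above."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # df(t): number of documents containing term t
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "the car is red",
    "the automobile is fast",
    "error code e4521 means overheating",
]
print(bm25_scores("car", docs))    # only the first document scores above zero
print(bm25_scores("e4521", docs))  # exact identifier match on the third document
```

The first query shows the weakness noted above (the "automobile" document scores zero for "car"), and the second shows the strength: an exact identifier match.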
5.3 Hybrid Retrieval: RRF (Reciprocal Rank Fusion)
Combine sparse and dense results using Reciprocal Rank Fusion. Each result is scored
by its rank position across both retrieval systems.
def reciprocal_rank_fusion(
    dense_results: list[str],
    sparse_results: list[str],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    RRF score = SUM over each ranker: 1 / (k + rank)
    k=60 is the empirically recommended constant.
    """
    scores: dict[str, float] = {}
    # enumerate() is 0-based, so rank + 1 is the 1-based rank position
    for rank, doc_id in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Why k=60? It dampens the effect of extreme rank positions. Rank 1 gives 1/61,
rank 2 gives 1/62, so the difference is small. This makes the fusion robust to one
ranker being very confident while the other is unsure.
When does hybrid beat pure dense?
- Technical documentation (product codes, model numbers)
- Legal text (exact clause references)
- Medical literature (drug names, ICD codes)
- Any domain with important proper nouns
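Published comparisons on BEIR frequently show hybrid retrieval (BM25 + dense + RRF)
ahead of pure dense retrieval on nDCG@10, though the margin varies widely by dataset
and embedding model. Measure it on your own corpus before committing.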
5.4 Reranking
After retrieving top-K candidates (e.g., 20 chunks), a reranker scores each
chunk-query pair and re-orders them. Return only the top-N (e.g., 5) to the LLM.
Query ──> Initial Retrieval ──> 20 candidates ──> Reranker ──> Top 5 ──> LLM
(fast, ANN) (slower, precise)
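Why rerank? First-stage retrieval embeds the query and each document independently
(a bi-encoder), so it can only compare them through a single vector distance, and ANN
search approximates even that. A reranker uses a cross-encoder (processes query +
document together), which is much more accurate but too slow to scan the full corpus.
The two-stage approach gets the best of both.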
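The two-stage shape is independent of any particular vendor. Below is a minimal sketch in which both stages are injected as functions; `toy_first_stage` and `toy_pair_score` are hypothetical stand-ins (a real system would plug in ANN search and a cross-encoder or rerank API).

```python
from typing import Callable


def two_stage_retrieve(
    query: str,
    first_stage: Callable[[str, int], list[str]],
    score_pair: Callable[[str, str], float],
    pool_size: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Fetch a wide candidate pool cheaply, then re-order it with a precise scorer."""
    candidates = first_stage(query, pool_size)
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_n]


corpus = [
    "python is named after monty python",
    "java was released in 1995",
    "python was released in 1991",
]


def toy_first_stage(query: str, n: int) -> list[str]:
    """Stand-in for ANN search: returns the first n documents, unranked."""
    return corpus[:n]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: counts words shared by query and document."""
    return float(len(set(query.split()) & set(doc.split())))


top = two_stage_retrieve("when was python released", toy_first_stage, toy_pair_score, top_n=2)
print(top)
```

The pair scorer runs only on `pool_size` candidates, so its per-pair cost stays bounded no matter how large the corpus grows.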
Cohere Rerank API:
import cohere
co = cohere.Client("YOUR_API_KEY")
# retrieved_chunks: first-stage results; each item is assumed to expose a .text attribute
results = co.rerank(
    query="What are Python's key features?",
    documents=[chunk.text for chunk in retrieved_chunks],
    top_n=5,
    model="rerank-english-v3.0",
)
# r.index points back into the list that was passed as `documents`
reranked_chunks = [retrieved_chunks[r.index] for r in results.results]
ColBERT: An alternative reranking approach using late interaction. Instead of
computing a single similarity score between one query embedding and one document
embedding, ColBERT keeps an embedding per token and computes token-level interactions
(MaxSim operator). Available via Vespa, RAGatouille.
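The MaxSim operator is compact enough to write out. A minimal sketch using toy 2-dimensional token vectors (the real model produces learned per-token embeddings; the numbers here are made up for illustration): each query token takes its best-matching document token, and those maxima are summed.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the max similarity
    to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first

print(maxsim(query_tokens, doc_a_tokens))
print(maxsim(query_tokens, doc_b_tokens))
```

Because document token vectors can be computed at indexing time, only the cheap max-and-sum step runs per query, which is what makes late interaction faster than a full cross-encoder.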
When is reranking worth the latency (~100–200ms extra)?
- When retrieval precision matters more than speed
- When your initial retrieval pool is noisy
- Production Q&A, customer support, legal search
5.5 Multi-Query Retrieval
Generate N rephrased versions of the original query, retrieve for each, and union
the result sets (deduplicating by chunk ID).
def generate_query_variants(query: str, llm_client, n: int = 3) -> list[str]:
    """Return the original query plus up to n LLM-generated rephrasings."""
    prompt = f"""Generate {n} different ways to ask the following question.
Each variant should approach the question from a different angle.
Output one variant per line, no numbering.
Question: {query}"""
    response = llm_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Drop blank lines the model may emit between variants
    variants = [v.strip() for v in response.content[0].text.splitlines() if v.strip()]
    return [query] + variants[:n]
Why this works: The original query may not share vocabulary with the way relevant
chunks are written. Multiple variants increase the chance of at least one variant
matching the document’s phrasing.
Cost: 1 extra LLM call to generate variants + N embedding calls. Worth it for
complex queries. Not worth it for simple, well-specified queries.
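The retrieve-and-union step can be sketched separately from the LLM call. This assumes a hypothetical `retrieve(query, k)` callable that returns `(chunk_id, text)` pairs; `toy_retrieve` and its canned results exist only to make the example runnable.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str, int], list[tuple[str, str]]],
    k: int = 5,
) -> list[tuple[str, str]]:
    """Run every query variant and union the results, deduplicating by chunk id."""
    merged: dict[str, str] = {}
    for q in queries:
        for chunk_id, text in retrieve(q, k):
            merged.setdefault(chunk_id, text)  # keep the first occurrence only
    return list(merged.items())


canned_results = {
    "who created python": [
        ("c1", "Guido van Rossum created Python."),
        ("c2", "Python is a programming language."),
    ],
    "python author": [
        ("c1", "Guido van Rossum created Python."),
        ("c3", "Van Rossum worked at CWI."),
    ],
}


def toy_retrieve(query: str, k: int) -> list[tuple[str, str]]:
    """Stand-in for a vector search call."""
    return canned_results.get(query, [])[:k]


merged = multi_query_retrieve(list(canned_results), toy_retrieve)
print([chunk_id for chunk_id, _ in merged])
```

A plain union discards rank information; feeding the per-variant lists into a rank-fusion step is a common alternative when ordering matters.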
5.6 HyDE — Hypothetical Document Embeddings
Instead of embedding the raw query, ask the LLM to generate a hypothetical document
that would answer the query, then embed that hypothetical document for retrieval.
def hyde_retrieve(query: str, llm_client, collection) -> list[str]:
    # Step 1: Generate a hypothetical answer
    hyde_prompt = f"""Write a short paragraph that would be a perfect answer to the
following question. Do not indicate uncertainty; write as if you know the answer.
Question: {query}"""
    hypothetical_doc = llm_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": hyde_prompt}],
    ).content[0].text
    # Step 2: Embed the hypothetical document, not the query
    results = collection.query(
        query_texts=[hypothetical_doc],
        n_results=5,
    )
    # Chroma returns one list of documents per query text; we sent a single query
    return results["documents"][0]
Why HyDE works: Queries and documents often have very different linguistic
structure (“What year was Python created?” vs “Python was first released in 1991…”).
The hypothetical document shares the register, vocabulary, and format of actual
corpus documents, leading to better ANN matches.
When to use HyDE:
- When queries are very short and vague
- When there is a large vocabulary mismatch between queries and documents
- Technical or academic domains where documents use formal language
Caution: If the LLM generates a hallucinated hypothetical that is factually wrong,
you might retrieve completely irrelevant chunks. HyDE works best when the LLM has
strong prior knowledge of the domain structure.
5.7 Parent-Document Retrieval
Index small child chunks for precise retrieval, but return their parent chunk (larger
context) to the LLM. See Section 2.4 for the chunking setup.
def parent_doc_retrieve(query: str, child_index, parent_store, n_children: int = 5):
    # Search using small child embeddings
    child_results = child_index.query(query_texts=[query], n_results=n_children)
    # Collect parent IDs in rank order; dict keys deduplicate while keeping that order
    # (a set would return parents in arbitrary order)
    parent_ids = dict.fromkeys(
        metadata["parent_id"] for metadata in child_results["metadatas"][0]
    )
    # Return full parent chunks (deduplicated)
    return [parent_store[pid] for pid in parent_ids]
6. Advanced RAG Patterns
6.1 Query Rewriting / Expansion
Before retrieval, preprocess the query to improve match quality.
Query Expansion: Add synonyms or related terms to the query.
Original: "car insurance claim process"
Expanded: "car automobile vehicle insurance claim process procedure steps"
Query Rewriting: Rephrase for clarity. Useful when the user query is ambiguous,
uses pronouns (“What did he say about it?”), or references prior conversation turns.
rewrite_prompt = """Given this conversation history and follow-up question,
rewrite the follow-up as a standalone question.
History: {history}
Follow-up: {question}
Standalone question:"""
This is critical for conversational RAG where the user asks follow-up questions
that only make sense in context of the prior exchange.
6.2 Step-Back Prompting
For abstract or complex queries, first ask a more general “step-back” question, retrieve
for that, then combine both sets of results.
User query: "Why did Microsoft acquire Activision Blizzard?"
Step-back: "What are Microsoft's strategic interests in the gaming market?"
The step-back retrieves higher-level context that makes the specific answer
interpretable. Implementation: two retrieval passes + merge + send to LLM.
6.3 Contextual Compression
After retrieval, the retrieved chunks may contain a lot of irrelevant text. Send each
chunk to a lightweight LLM call to extract only the sentences relevant to the query.
Original Chunk (512 tokens):
"Python was created by Guido van Rossum. It was first released in 1991.
Python's design philosophy emphasizes code readability with the use of
significant indentation. Guido's favorite color is unknown. The language
provides constructs that enable clear programming on both small and large
scales..."
Compressed (for query "when was Python released"):
"Python was first released in 1991."
Reduces context window usage and improves signal-to-noise in the final prompt.
Trade-off: extra LLM call per retrieved chunk = additional latency.
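The shape of the compression step can be sketched without an LLM. Note the swap: the code below uses a crude keyword-overlap filter as a stand-in for the lightweight LLM call described above, so it shows the data flow, not the quality of a real compressor. `compress_chunk` and `STOPWORDS` are hypothetical names.

```python
import re

STOPWORDS = {"when", "was", "it", "the", "a", "is", "of", "in"}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_chunk(chunk: str, query: str) -> str:
    """Keep only sentences that share a non-stopword term with the query."""
    query_terms = words(query) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if query_terms & words(s)]
    return " ".join(kept)


chunk = (
    "Python was created by Guido van Rossum. It was first released in 1991. "
    "Python's design philosophy emphasizes code readability. "
    "Guido's favorite color is unknown."
)
print(compress_chunk(chunk, "When was it first released?"))
```

A keyword filter misses paraphrases that an LLM extractor would catch, which is the main reason to pay for the extra call.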
6.4 FLARE — Forward-Looking Active Retrieval
Instead of retrieving everything upfront, FLARE lets the LLM draft its answer one
sentence at a time and triggers retrieval when the draft contains low-probability
tokens, a sign that the model is uncertain. The tentative sentence is used as the
search query, and the sentence is then regenerated with the retrieved context.
LLM starts generating:
"Python was created by Guido van Rossum and first released in..."
[LLM is uncertain about the year: triggers retrieval]
[Retrieves chunk: "Python 0.9.0 was first released in February 1991"]
[LLM continues]: "...1991..."
FLARE is powerful but complex to implement. It requires access to the model’s token
probabilities, which is available for self-hosted models but not always via API.
Best suited for long-form generation tasks where not all facts are needed upfront.
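The control loop behind this idea can be sketched with stub functions. Everything here is an assumption for illustration: `draft` stands in for a model call that returns a tentative sentence plus per-token probabilities, `retrieve` stands in for the search call, and the threshold is arbitrary.

```python
from typing import Callable, Optional

Draft = Optional[tuple[str, list[float]]]


def flare_generate(
    draft: Callable[[list[str], list[str]], Draft],
    retrieve: Callable[[str], list[str]],
    threshold: float = 0.5,
    max_sentences: int = 10,
) -> str:
    """Generate sentence by sentence, retrieving only when confidence drops."""
    answer: list[str] = []
    context: list[str] = []
    for _ in range(max_sentences):
        step = draft(answer, context)
        if step is None:  # the model has nothing more to say
            break
        sentence, token_probs = step
        if token_probs and min(token_probs) < threshold:
            # Low confidence: use the tentative sentence as the search query,
            # then regenerate the sentence with the new context.
            context = context + retrieve(sentence)
            step = draft(answer, context)
            if step is None:
                break
            sentence, _ = step
        answer.append(sentence)
    return " ".join(answer)


def toy_draft(answer: list[str], context: list[str]) -> Draft:
    """Stub model: unsure about the number until context is available."""
    if answer:
        return None
    if context:
        return "The warranty lasts 24 months.", [0.9, 0.9, 0.9, 0.95, 0.9]
    return "The warranty lasts 12 months.", [0.9, 0.9, 0.9, 0.2, 0.9]


def toy_retrieve(query: str) -> list[str]:
    return ["Warranty period: 24 months from the date of purchase."]


result = flare_generate(toy_draft, toy_retrieve)
print(result)
```

The stub makes the trigger visible: the first draft contains a low-probability token, retrieval runs once, and the regenerated sentence replaces the guess.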
6.5 Agentic RAG
The LLM is given retrieval as a tool and decides autonomously:
- Whether to retrieve at all
- What query to issue
- Whether to retrieve again based on the result
- How many retrieval rounds are needed
tools = [
    {
        "name": "retrieve_documents",
        "description": "Search the knowledge base for relevant information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "n_results": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    }
]
# `client` is an anthropic.Anthropic() instance and `user_question` is the user's text.
# A single call returns at most one round of tool requests. Run it in a loop:
# execute each tool_use block, append a tool_result message, and call again
# until the model stops requesting tools.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,  # required by the Messages API
    tools=tools,
    messages=[{"role": "user", "content": user_question}],
)
Agentic RAG handles multi-hop questions naturally: the LLM retrieves a first fact,
reads it, formulates a follow-up query, retrieves more, and synthesizes the answer.
7. RAG Evaluation (RAGAS)
RAGAS (Retrieval-Augmented Generation Assessment) provides a framework for evaluating
RAG pipelines without manual annotation of every answer.
The Four Core Metrics
1. Faithfulness
Does the generated answer contain only claims that are supported by the retrieved context?
Detects hallucination. Score: 0 to 1.
Context: "Python was released in 1991."
Answer: "Python was released in 1991 and created by Tim Peters."
Faithfulness = 0.5 (second claim is unsupported/wrong)
2. Answer Relevancy
Does the answer actually address the user’s question?
Tests whether the LLM drifts from the question.
Computed by: generate N questions from the answer, measure semantic similarity to the
original question. Low similarity = LLM answered something adjacent but not the real question.
3. Context Precision
Of the retrieved chunks, what fraction were actually useful for answering the question?
Measures retrieval precision — are we retrieving junk along with gold?
4. Context Recall
Were all the facts needed to answer the question present in the retrieved context?
Measures retrieval completeness — are we missing critical information?
Context recall = (facts in the ground-truth answer that appear in the retrieved context) / (total facts in the ground-truth answer)
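Once each claim or chunk has been judged, the retrieval and faithfulness metrics reduce to simple ratios. The sketch below assumes the judgments (booleans) already exist, typically produced by an LLM judge or a human; the function names are illustrative and this is not the RAGAS implementation. The rank-aware precision variant is included because it also rewards placing useful chunks early.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall(fact_in_context: list[bool]) -> float:
    """Share of needed facts that were present in the retrieved context."""
    return sum(fact_in_context) / len(fact_in_context) if fact_in_context else 0.0


def context_precision(chunk_useful: list[bool]) -> float:
    """Plain fraction of retrieved chunks that were useful."""
    return sum(chunk_useful) / len(chunk_useful) if chunk_useful else 0.0


def rank_aware_precision(chunk_useful: list[bool]) -> float:
    """Mean of precision@k taken at each useful chunk's rank."""
    hits, total = 0, 0.0
    for k, useful in enumerate(chunk_useful, start=1):
        if useful:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# The faithfulness example above: one supported claim, one unsupported
print(faithfulness([True, False]))
# Four retrieved chunks in rank order; the 1st and 3rd were useful
print(context_precision([True, False, True, False]))
print(rank_aware_precision([True, False, True, False]))
```

The hard part in practice is producing the boolean judgments reliably, not the arithmetic.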
Building a Golden Evaluation Set
- Select 50–200 representative questions from your domain
- Have domain experts write ideal answers
- Run your RAG pipeline on each question
- Score with RAGAS metrics
- Identify the weakest metric and optimize for it
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
# One entry per evaluated question; extend each list with the rest of your golden set
eval_data = {
    "question": ["What year was Python created?"],
    "answer": ["Python was created in 1991."],  # RAG output
    "contexts": [["Python was released in 1991..."]],  # retrieved chunks
    "ground_truth": ["Python was first released in 1991."],  # golden answer
}
# The metrics are scored by an LLM judge, so that model's credentials must be configured
result = evaluate(Dataset.from_dict(eval_data), metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])
print(result)
8. When RAG Fails (and What To Do)
Failure 1: Bad Retrieval → Hallucination
Symptom: The LLM produces a confident answer that is not in the corpus.
Root cause: The right chunks were never retrieved — the LLM had to invent.
Diagnosis: Check context recall score. Manually inspect retrieved chunks for a
failing query.
Fixes:
- Improve chunking (semantic chunking, appropriate chunk size)
- Add hybrid retrieval (BM25 captures keywords that dense misses)
- Add multi-query retrieval (query rephrasing increases recall)
- Use HyDE for abstract queries
- Add a “cannot answer” instruction: “If the context does not contain the answer, say so.”
Failure 2: Context Window Overflow
Symptom: Retrieved content is too long to fit in the prompt. Either you truncate
(losing information) or you hit API token limits.
Root cause: Too many retrieved chunks, or chunks are too large.
Fixes:
- Reduce K (retrieve fewer chunks, rely on reranking for quality)
- Use contextual compression to shrink each chunk before sending
- Use parent-document retrieval with deduplication: several matching children collapse into one parent, so related context is sent once instead of as overlapping fragments
- Switch to a model with a larger context window
Failure 3: Stale Embeddings
Symptom: Queries about recently added documents return old, irrelevant results.
Root cause: Documents were added to the corpus but never re-embedded and indexed.
Fixes:
- Implement a change-detection pipeline: checksums on source docs, trigger re-indexing on change
- Incremental indexing: only re-embed changed or new documents
- Track last_indexed timestamps per document
- For rapidly changing data: consider caching + invalidation rather than batch re-index
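The checksum-based change detection in the fixes above can be sketched with the standard library. The names `content_hash` and `plan_reindex` are hypothetical, and the stored hashes are assumed to live alongside the index (for example in chunk metadata).

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Compare current documents with the hashes stored at last indexing.

    Returns (ids to embed again, ids to delete from the index).
    """
    to_embed = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_embed, to_delete


indexed = {
    "faq": content_hash("Refunds take 5 days."),
    "pricing": content_hash("Plan A costs $10."),
    "old-policy": content_hash("Deprecated."),
}
current = {
    "faq": "Refunds take 5 days.",        # unchanged
    "pricing": "Plan A costs $12.",       # changed
    "onboarding": "Welcome to the team.",  # new
}
to_embed, to_delete = plan_reindex(current, indexed)
print(to_embed, to_delete)
```

Only the changed and new documents are embedded again, which keeps incremental indexing cost proportional to the size of the change rather than the corpus.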
Failure 4: Query-Document Semantic Mismatch
Symptom: Dense retrieval misses obviously relevant chunks. Example: user asks
“Who founded Python?” but documents use the phrase “Guido van Rossum created Python.”
Root cause: The embedding model encodes queries and documents in slightly different
parts of the space; or domain-specific vocabulary mismatch.
Fixes:
- Add BM25 hybrid retrieval (keyword overlap doesn’t care about semantics)
- Use HyDE (hypothetical document is in document-space language)
- Use query expansion / rewriting
- Fine-tune the embedding model on your domain data
Failure 5: Multi-Hop Questions
Symptom: “What is the nationality of the person who created Python?” requires two
reasoning steps: (1) find who created Python, (2) find their nationality.
Single-pass retrieval cannot handle this in one shot.
Fixes:
- Iterative retrieval: answer step 1, use that answer as input to step 2
- Agentic RAG: let the LLM issue multiple retrieval calls
- Knowledge graph augmentation: store entity relationships explicitly
- Pre-compute common multi-hop paths during indexing
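Iterative retrieval can be sketched as a loop that feeds each intermediate answer into the next sub-question. The `retrieve` and `answer` callables are placeholders for your retriever and LLM call, and the `{prev}` placeholder convention is an assumption made for this example:

```python
from typing import Callable


def iterative_answer(
    sub_questions: list[str],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str, list[str]], str],
) -> str:
    """Answer sub-questions in order, substituting each answer into the next one."""
    previous = ""
    for template in sub_questions:
        # "{prev}" in a sub-question is replaced by the previous step's answer.
        question = template.replace("{prev}", previous)
        previous = answer(question, retrieve(question))
    return previous
```

For the example above, the sub-questions would be "Who created Python?" followed by "What is the nationality of {prev}?". In an agentic setup the LLM produces these sub-questions itself instead of receiving them as a fixed list.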
Failure 6: The “Lost in the Middle” Problem
Research (Liu et al., 2023) shows that LLMs use information most reliably when it appears
at the beginning or end of a long context. Information in the middle of a large context
window is systematically under-utilized.
Fix: Reranking ensures the most relevant chunks are placed first. If sending many
chunks, put the highest-ranked chunks at the top and bottom, not the middle. Some
systems use recursive summarization to compress the middle context.
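One way to place the strongest chunks at the top and bottom is to alternate them between the front and the back of the context. A small sketch, assuming the input list is already sorted best-first by the reranker:

```python
def reorder_for_long_context(ranked_chunks: list[str]) -> list[str]:
    """Place the best chunks at the start and end, the weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked_chunks):
        # Even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]
```

With five chunks ranked r1 to r5, the resulting order is r1, r3, r5, r4, r2: the two best chunks sit at the edges and the weakest in the middle.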
9. Interview Flashcards
Q1: What is RAG and when would you use it over fine-tuning?
RAG (Retrieval-Augmented Generation) is a technique where relevant documents are
retrieved from an external knowledge base and injected into the LLM prompt as context
before generation. Use RAG when: knowledge changes frequently, you need citation/
auditability, or you have domain-specific factual data not well covered in training.
Use fine-tuning when: you need to change the model’s behavior or style, the task
requires a new reasoning pattern, or you want to internalize a stable, unchanging body
of knowledge efficiently. In practice, many production systems use both.
Q2: Explain chunking strategies and their trade-offs.
Fixed-size: simple but breaks semantic context at arbitrary boundaries. Sentence-based:
respects language units but variable chunk sizes. Semantic chunking: splits on meaning
shifts (embedding similarity drops), most coherent but computationally expensive.
Hierarchical: stores small chunks for retrieval precision but returns parent chunks for
LLM context — best of both worlds. Key insight: chunk size trades off precision vs
context. Small chunks (128 tokens) enable precise matching; large chunks (1000 tokens)
provide richer context. Overlap (10–15%) prevents losing information at boundaries.
Rule of thumb: 512 tokens, 10% overlap for most cases.
Q3: What is hybrid retrieval and why is it better than pure semantic search?
Hybrid retrieval combines dense (semantic/embedding-based) and sparse (BM25/keyword)
retrieval, then merges results using Reciprocal Rank Fusion (RRF). Dense retrieval
captures semantic meaning and synonyms but struggles with exact keyword matches and
proper nouns. BM25 excels at exact matches but has no semantic understanding. Hybrid
gets both: “Python 3.11 release notes” is found via keywords (BM25) while “Python
programming language history” is found via semantics (dense). On retrieval benchmarks such
as BEIR, hybrid often outperforms either method alone (roughly 5–15% nDCG, depending on
the dataset).
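Reciprocal Rank Fusion itself is only a few lines. A minimal sketch, assuming each retriever returns a best-first list of document ids; k = 60 is the commonly used default constant, not a tuned value:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

RRF uses only ranks, not raw scores, so BM25 scores and cosine similarities never need to be normalized onto a common scale.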
Q4: What is HyDE and when would you use it?
HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query
using the LLM, then embeds that answer for retrieval — instead of embedding the raw
query. It works because queries and documents have different linguistic structure:
a question and a passage that answers it may not be close in embedding space, but the
hypothetical answer is. Use HyDE when: queries are short and vague, there is a large
vocabulary mismatch between query and document styles, or in academic/technical domains.
Caution: if the LLM hallucinates the hypothetical, retrieval degrades.
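The HyDE flow can be sketched with three injected callables. `generate`, `embed`, and `search` stand in for your LLM, embedding model, and vector store; their signatures and the prompt wording are assumptions made for this example:

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    k: int = 5,
) -> list[str]:
    """Embed a hypothetical answer instead of the raw query, then search with it."""
    hypothetical = generate(f"Write a short passage that answers the question: {query}")
    # The hypothetical passage, not the query, is what gets embedded.
    return search(embed(hypothetical), k)
```

The only difference from plain dense retrieval is the extra `generate` call, which is also where the added latency and cost come from.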
Q5: How do you evaluate a RAG pipeline?
Use RAGAS metrics: (1) Faithfulness — does the answer only make claims supported by
retrieved context? (2) Answer relevancy — does the answer address the question?
(3) Context precision — are the retrieved chunks relevant? (4) Context recall — did
retrieval find all the necessary information? Build a golden evaluation set: 100–200
manually curated question/answer pairs. Run the pipeline on all questions, score with
RAGAS, identify the lowest-scoring metric, and optimize that component. For production,
add LLM-as-judge evaluation for edge cases and log + sample real user interactions.
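The "identify the lowest-scoring metric" step can be automated once scores are available. A small sketch, assuming the scores come from an earlier evaluation run as a plain dict; the thresholds and actions are the rule-of-thumb values from the Quick Reference in this module, and the names are illustrative:

```python
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80}
ACTIONS = {
    "faithfulness": "fix prompt / add grounding instruction",
    "context_recall": "fix retrieval",
}


def triage(scores: dict[str, float]) -> list[str]:
    """Suggested actions for metrics at or below threshold, largest gap first."""
    gaps = {
        metric: THRESHOLDS[metric] - scores[metric]
        for metric in THRESHOLDS
        if metric in scores and scores[metric] <= THRESHOLDS[metric]
    }
    return [ACTIONS[metric] for metric in sorted(gaps, key=lambda m: gaps[m], reverse=True)]
```

Usage: `triage({"faithfulness": 0.9, "context_recall": 0.6})` points at retrieval, since only context recall is below its threshold.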
Q6: What is reranking and when does it help?
Reranking is a two-stage process: (1) fast ANN retrieval of top-K candidates (~20),
(2) a cross-encoder model scores each candidate against the query and re-orders them.
A cross-encoder processes the full query-document pair jointly (unlike bi-encoder
embeddings that encode separately), giving much more accurate relevance scores. It adds
~100–200ms latency but significantly improves precision. Use reranking when: retrieval
quality is critical, the initial retrieval pool is large and noisy, or users notice
irrelevant answers. Services: Cohere Rerank API. Open-source: ms-marco-MiniLM
cross-encoders; ColBERT is a related option (a late-interaction model, not a cross-encoder).
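The second stage can be sketched independently of any particular model. Here `score` is any callable that rates a query-document pair, and `overlap_score` is a toy stand-in for a real cross-encoder, used only so the example runs without external dependencies:

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-order first-stage candidates by a pairwise relevance score."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_n]


def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words found in the document.
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(doc.lower().split())) / len(query_words)
```

In a real system, `score` would wrap a cross-encoder forward pass or a rerank API call, and `candidates` would be the top-20 results from ANN or hybrid retrieval.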
Q7: How do you handle a question that requires information from multiple chunks?
Multi-hop questions require reasoning across multiple pieces of evidence. Approaches:
(1) Agentic RAG — give the LLM retrieval as a tool and let it call it multiple times;
(2) iterative retrieval — answer step 1, use the intermediate answer to form step 2
query; (3) multi-query retrieval with union — generate sub-questions, retrieve for
each, merge context; (4) knowledge graph — pre-compute entity relationships so
multi-hop traversal is a graph query. The lost-in-the-middle problem also applies:
put the most relevant chunks first and last in the context.
Q8: What is the “lost in the middle” problem and how does it affect RAG?
Research (Liu et al. 2023) showed that LLMs use information most reliably when it sits at
the beginning or end of a long prompt, and systematically underweight information in
the middle. For RAG, this means: if the answer-bearing chunk is buried in position 5
of 10 retrieved chunks, the model may ignore it. Fix: use reranking to put the most
relevant chunks first. If you must send many chunks, consider placing critical chunks
at position 1 and position K (not in the middle). Contextual compression also helps by
reducing chunk size, fitting more signal in the same context budget.
Q9: How would you design a RAG system for a 10M-document corpus?
Architecture decisions at this scale:
- Embedding: Batch processing pipeline (e.g., Spark/Ray) with a hosted or local
embedding model. Incremental re-indexing with change detection.
- Vector DB: Pinecone (managed) or Qdrant (self-hosted with sharding). Need
distributed ANN index across nodes.
- Chunking: Semantic or hierarchical chunking with metadata (source, date, section).
At ~512 tokens per chunk and about two chunks per document, that is ~20M index entries.
- Retrieval: Hybrid (BM25 + dense) with metadata pre-filtering. Reranking for
top-20 → top-5.
- Latency budget: Embedding query (~50ms), ANN search (~20ms), reranking (~150ms),
LLM generation (~1–2s). Total: ~2s acceptable for most use cases.
- Observability: Log every retrieval + answer pair. Sample for RAGAS evaluation.
Track cache hit rates, p99 latency, per-query costs.
- Freshness: Document ingestion queue, re-index on change, TTL on stale documents.
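The sizing and latency numbers are simple arithmetic and worth checking explicitly. In this back-of-envelope sketch, two chunks per document is what the ~20M figure implies for 10M documents, while the 1536-dimension float32 embedding is purely an assumption for illustration:

```python
def index_size(num_docs: int, chunks_per_doc: int, dims: int, bytes_per_value: int = 4):
    """Return (index entries, raw vector storage in GB), ignoring index overhead."""
    entries = num_docs * chunks_per_doc
    return entries, entries * dims * bytes_per_value / 1e9


def latency_range_ms(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latency across sequential stages."""
    return sum(lo for lo, _ in stages.values()), sum(hi for _, hi in stages.values())


entries, raw_gb = index_size(num_docs=10_000_000, chunks_per_doc=2, dims=1536)
budget = latency_range_ms({
    "embed_query": (50, 50),
    "ann_search": (20, 20),
    "rerank": (150, 150),
    "llm_generation": (1000, 2000),
})
```

Under these assumptions the raw vectors alone take roughly 123 GB before any index overhead, and the end-to-end latency lands between about 1.2s and 2.2s, with LLM generation dominating.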
Q10: What is parent-document retrieval?
Parent-document retrieval is a two-tier chunking strategy that separates the unit used
for indexing/search from the unit sent to the LLM. Small child chunks (~128 tokens)
are embedded and indexed — their compact size makes them precise search targets. When
a child chunk is retrieved, its parent chunk (~1000 tokens) is fetched from a document
store and sent to the LLM instead, providing full context around the matched text.
This solves the tension between retrieval precision (favors small chunks) and generation
quality (favors large context). The trade-off: doubled storage (child index + parent
store), and more complex retrieval logic.
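The two tiers can be sketched with plain dictionaries. This example splits on whitespace words rather than real tokens and uses an in-memory dict as the document store; both are simplifications, and the function names are illustrative:

```python
def build_parent_child_index(parents: dict[str, str], child_size: int = 128):
    """Split each parent into small child chunks and map child id -> parent id."""
    children: dict[str, str] = {}
    child_to_parent: dict[str, str] = {}
    for parent_id, text in parents.items():
        words = text.split()
        for n, start in enumerate(range(0, len(words), child_size)):
            child_id = f"{parent_id}#{n}"
            children[child_id] = " ".join(words[start:start + child_size])
            child_to_parent[child_id] = parent_id
    return children, child_to_parent


def fetch_parents(hit_child_ids: list[str], child_to_parent: dict[str, str],
                  parents: dict[str, str]) -> list[str]:
    """Return the parent text for each hit, deduplicated, preserving hit order."""
    seen: set[str] = set()
    result: list[str] = []
    for child_id in hit_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result
```

Only `children` is embedded and searched; `fetch_parents` runs after retrieval and decides what the LLM actually sees. Deduplication matters because several child hits frequently belong to the same parent.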
Quick Reference
Chunk size rule of thumb:
256–512 tokens for most RAG use cases
10–15% overlap
Semantic or sentence chunking preferred over fixed-size
Retrieval rule of thumb:
Retrieve top-20, rerank to top-5
Use hybrid (BM25 + dense) by default
Add HyDE for complex/abstract queries
Evaluation rule of thumb:
Faithfulness > 0.85 = acceptable
Context recall > 0.80 = acceptable
If recall is low → fix retrieval
If faithfulness is low → fix prompt / add grounding instruction
Vector DB selection:
Local dev → Chroma
Production self-hosted → Qdrant
Production managed → Pinecone
Existing Postgres → pgvector
Next: see examples/basic_rag.py for a working implementation, and exercises/README.md for hands-on practice.