Chapter 6: RAG and Agents

RAG (Retrieval-Augmented Generation)

Why RAG

Models have limited context length and may hallucinate without grounding
RAG = retrieve relevant information per query → feed to model → better responses
Term coined by Lewis et al. (2020); retrieve-then-generate pattern introduced by Chen et al. (2017)
RAG ≠ just a context length workaround: even with long context, more tokens = more cost, and models use information buried in the middle of a long context less reliably (the “needle in a haystack” problem)
Data grows over time → will always need retrieval even as context lengths grow
Anthropic guidance: if knowledge base < 200K tokens (~500 pages), include it all in prompt — no RAG needed

RAG = context construction for AI, just as feature engineering = context construction for classical ML.

RAG Architecture

Two components:

Retriever: indexes data + retrieves relevant chunks per query
Generator: generates response based on retrieved context

Modern RAG: retriever and generator are often trained separately; many production systems use off-the-shelf retrievers and models.

Indexing: process data into a searchable form.
Querying: find chunks most relevant to each query.
Chunking: split documents into manageable pieces to avoid arbitrarily long contexts.

Retrieval Algorithms

Term-based retrieval (sparse):

Keywords/lexical matching
TF-IDF: term frequency × inverse document frequency — weights terms by how informative they are (common terms like “the” get lower weight)
BM25: refined TF-IDF, normalizes by document length; still competitive as a baseline
Elasticsearch: uses inverted index (term → documents containing it)
Fast, strong out-of-the-box; fewer parameters to tune; misses semantic meaning (e.g., cannot tell “transformer” the electrical device from “transformer” the ML architecture)
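
The term-based scoring described above can be sketched in a few lines. This is a minimal illustration, not how Elasticsearch implements it: the tokenizer is a naive whitespace split, `k1=1.5` and `b=0.75` are commonly used defaults assumed here, the IDF uses the `log(1 + ...)` variant, and the three documents are made-up sample data.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer; real systems also strip punctuation and stem.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a high IDF; terms found in most documents get a low one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more occurrences to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the transformer is an electric device",
    "the transformer architecture uses attention",
    "the cat sat on the mat",
]
scores = bm25_scores("transformer attention", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Note that both of the first two documents match “transformer” purely lexically; only the extra term “attention” separates them, which is the semantic blind spot noted above.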

Embedding-based retrieval (dense/semantic):

Convert query and documents to embeddings → find nearest neighbors
Requires vector database for efficient nearest-neighbor search
k-NN: exact but slow for large datasets
ANN (approximate nearest neighbor) libraries: FAISS (Facebook), ScaNN (Google), Annoy (Spotify), Hnswlib. Common underlying algorithms:
- LSH: hashes similar vectors to same bucket
- HNSW: multi-layer graph; high accuracy + fast queries but slow/memory-intensive to build
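- Product Quantization: split each vector into subvectors and replace each with a compact code, trading some accuracy for much lower memory use
- IVF (inverted file index): group vectors with K-means, then search only the clusters nearest the query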
Can significantly outperform term-based with finetuning; more expensive (embedding generation, vector storage/query)
Vector DB spending can reach one-fifth to one-half of what a team spends on model APIs
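
As a rough illustration of the exact k-NN baseline that ANN methods approximate, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors and document names are invented for the example; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, doc_vecs, k=2):
    """Exact k-NN: compare the query with every document, keep the top k."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings.
doc_vecs = {
    "power_grid": [0.9, 0.1, 0.0],
    "attention_paper": [0.1, 0.9, 0.2],
    "cooking": [0.0, 0.1, 0.9],
}
query_vec = [0.2, 0.8, 0.1]
top = knn(query_vec, doc_vecs, k=2)
print(top)
```

The full scan costs one similarity computation per document per query, which is why large collections fall back on approximate indexes.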

Retrieval evaluation metrics:

Context precision: % of retrieved docs that are relevant
Context recall: % of relevant docs that are retrieved
NDCG, MAP, MRR: ranking quality
End-to-end evaluation: does the retriever help the generator produce better answers?
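
For a single query, the first two metrics and the reciprocal rank behind MRR can be computed as below. This is a sketch that assumes binary relevance labels are available; the document IDs are placeholders.

```python
def context_precision(retrieved, relevant):
    # Share of retrieved documents that are relevant.
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    # Share of relevant documents that were retrieved.
    if not relevant:
        return 0.0
    return sum(1 for d in relevant if d in retrieved) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant document; MRR is the mean of this over queries.
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output
relevant = {"d1", "d2", "d5"}          # ground-truth labels

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
rr = reciprocal_rank(retrieved, relevant)
print(precision, recall, rr)
```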

Hybrid search: combine term-based + embedding-based retrieval:

Sequential: BM25 fetches candidates → vector search reranks
Parallel (ensemble): multiple retrievers → Reciprocal Rank Fusion (RRF) combines rankings
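
A minimal sketch of Reciprocal Rank Fusion for the parallel case: each document earns `1 / (k + rank)` from every ranking it appears in, and the sums decide the fused order. The constant `k=60` is a commonly used default assumed here, and the two input rankings are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by several retrievers accumulate the most score.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["a", "b", "c"]     # hypothetical term-based results
vector_ranking = ["b", "d", "a"]   # hypothetical embedding-based results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because only ranks are used, the retrievers' raw scores never need to be put on a common scale.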

Retrieval Optimization

Chunking strategy:

Equal-length chunks: by character (2048), word (512), sentence, paragraph
Recursive chunking: section → paragraph → sentence (preserves related content)
Language-specific splitters: code, Chinese text, Q&A pairs
Overlapping chunks: avoid cutting off context at boundaries
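Chunk size trade-offs: smaller chunks = more (and more diverse) chunks fit in the context, but less context per chunk and more embeddings to compute and store; larger chunks = more context per chunk, but fewer fit, so coverage may suffer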
No universal best chunk size — experiment
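
Equal-length chunking with overlap might look like the sketch below. It splits on whitespace and counts words; the function name and the small sizes in the demo are illustrative choices, not recommended settings.

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk begins with the last word of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.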

Reranking:

First retrieval: cheap, high-recall; second stage: expensive, high-precision
Time-based reranking: recent docs get higher weight
Models better at content at beginning/end of context → place most relevant chunks there
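
One simple way to act on the position effect is to interleave reranked chunks so the strongest sit at the two ends of the context and the weakest in the middle. The function below is a hypothetical helper written for this note, not a standard library routine.

```python
def order_for_context(chunks_by_relevance):
    """Take chunks sorted most-relevant-first; return them with the best at the edges."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th... go to the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the 2nd-best chunk ends up last.
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]   # c1 is the most relevant
ordered = order_for_context(ranked)
print(ordered)
```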

Query rewriting:

Rewrite ambiguous/conversational queries to be self-contained
“How about Emily Doe?” → “When was the last time Emily Doe bought from us?”
Use another AI model with prompt: “rewrite the last user input to reflect what the user is actually asking”

Contextual retrieval (Anthropic):

Augment each chunk with AI-generated context (50-100 tokens) explaining the chunk’s role in the original document
Prepend context to chunk before indexing
Improves retrieval accuracy for chunks that lack standalone context

RAG beyond text:

Multimodal RAG: retrieve images/video using captions/metadata or CLIP multimodal embeddings
Tabular RAG: text-to-SQL → execute SQL → generate response from results
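
The tabular flow can be sketched with an in-memory SQLite database. The `generate_sql` function is a stand-in for the text-to-SQL model call and simply returns a hard-coded query; the `orders` table, its rows, and the answer template are all invented for illustration.

```python
import sqlite3

def generate_sql(question):
    # Placeholder for the text-to-SQL step; a real system would prompt a model
    # with the question and the table schema.
    return "SELECT MAX(order_date) FROM orders WHERE customer = 'Emily Doe'"

# Made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Emily Doe", "2024-01-15"),
        ("Emily Doe", "2024-03-02"),
        ("John Roe", "2024-02-10"),
    ],
)

# 1) text-to-SQL, 2) execute SQL, 3) generate a response from the result.
sql = generate_sql("When was the last time Emily Doe bought from us?")
(last_order,) = conn.execute(sql).fetchone()
answer = f"Emily Doe last bought from us on {last_order}."
print(answer)
```

In practice the generated SQL should run with read-only permissions, since model-written queries are untrusted input.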

Agents

Agent Overview

An agent = anything that perceives its environment and acts upon it (Russell & Norvig, 1995).

Characterized by its environment (game, internet, kitchen, road) and tool inventory
AI = brain that processes task, plans, executes tools, reflects
ChatGPT is an agent (web search, code execution, image generation)
RAG systems are agents (retriever is a tool)

Why agents need stronger models:

Compound accuracy: at 95% accuracy per step, a 10-step task succeeds end-to-end about 0.95^10 ≈ 60% of the time; a 100-step task about 0.95^100 ≈ 0.6%
Higher stakes: write actions can have severe consequences
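
The compounding arithmetic is easy to check. The sketch assumes every step succeeds independently with the same probability, which is a simplification.

```python
def end_to_end_accuracy(per_step_accuracy, steps):
    # With independent steps, all of them must succeed: p ** n.
    return per_step_accuracy ** steps

ten_steps = end_to_end_accuracy(0.95, 10)
hundred_steps = end_to_end_accuracy(0.95, 100)
print(f"{ten_steps:.1%}, {hundred_steps:.1%}")
```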

Tools

Three categories:

Knowledge augmentation: text retriever, SQL executor, web search, email reader, inventory API
Capability extension: calculator, calendar, timezone converter, code interpreter, text-to-image, OCR, translator
- Chameleon (Lu et al., 2023): GPT-4 + 13 tools → +11.37% on ScienceQA, +17% on TabMWP
Write actions: SQL write, email send, bank transfer, database update
- Enable full workflow automation but require human oversight and security measures

Function calling: tool use feature supported by most model providers; model decides which tool and what parameters to use.
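
On the application side, function calling usually reduces to parsing the model's structured output and dispatching to a registered function. The sketch below assumes the model emits a JSON object with `name` and `arguments` fields; that shape, the tool names, and the hard-coded `model_output` are illustrative and do not reproduce any specific provider's API.

```python
import json

def add(a, b):
    return a + b

def convert_timezone(hour, offset):
    # Shift an hour of the day by a timezone offset, wrapping around midnight.
    return (hour + offset) % 24

# Tool inventory: maps the names exposed to the model onto Python callables.
TOOLS = {"add": add, "convert_timezone": convert_timezone}

def execute_tool_call(raw_call):
    """Parse a model-produced tool call and run the matching function."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Hypothetical model output choosing a tool and its parameters.
model_output = '{"name": "convert_timezone", "arguments": {"hour": 22, "offset": 5}}'
result = execute_tool_call(model_output)
print(result)
```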

Tool selection principles:

Compare agent performance with different tool sets
Ablation study: remove tool → see performance drop
Plot tool use distribution — some tools rarely used
Different models have different tool preferences (in the Chameleon experiments, GPT-4 favored knowledge retrieval while ChatGPT favored image captioning)
More tools = more capabilities but harder to use efficiently; start with minimum viable set

Planning

Planning process:

Plan generation: decompose task into action sequence
Reflection/validation: evaluate plan quality; iterate if bad
Execution: invoke tools following plan
Reflection: evaluate outcomes; correct errors; repeat if needed

Decoupled planning and execution:

Generate plan → validate → execute (vs. chain-of-thought which does all in one)
Multi-agent: planner agent + validator agent + executor agent
Validation: AI judge, heuristics (invalid tools, too many steps), human experts
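
The heuristic checks can run before any tool is executed. In this sketch a plan is assumed to be a list of dicts with a `tool` key, and `max_steps=10` is an arbitrary limit; both are illustrative choices.

```python
def validate_plan(plan, tool_inventory, max_steps=10):
    """Return a list of heuristic errors; an empty list means the plan passes."""
    errors = []
    if len(plan) > max_steps:
        errors.append(f"too many steps: {len(plan)} > {max_steps}")
    for i, step in enumerate(plan, start=1):
        if step["tool"] not in tool_inventory:
            errors.append(f"step {i}: invalid tool {step['tool']!r}")
    return errors

tools = {"search", "calculator"}
plan = [
    {"tool": "search", "args": {"query": "Q3 revenue"}},
    {"tool": "send_fax", "args": {}},   # not in the inventory
]
errors = validate_plan(plan, tools)
print(errors)
```

A plan that fails here can be sent back to the planner with the error list, without spending any tool calls.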

Foundation models as planners (debate):

LeCun and Kambhampati: autoregressive LLMs can’t truly plan
Counter-argument: LLMs can predict action outcomes (contain world model) + can backtrack by revising path
Practical: even if imperfect planners, can still be part of a larger planning system

Plan generation techniques:

Prompt with tool descriptions + examples (few-shot)
Natural language plans (more robust to tool API changes; needs translator)
Hierarchical planning: high-level plan → detailed sub-plans

Control flows:

Sequential, Parallel, If-statement, For-loop
Parallel execution can drastically reduce latency for tasks with many independent steps
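
To illustrate the latency point, the sketch below runs three independent steps in a thread pool. The `fetch` function and its 0.1-second sleep simulate I/O-bound tool calls; the source names are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(source):
    # Simulated I/O-bound tool call.
    time.sleep(0.1)
    return f"result from {source}"

sources = ["web_search", "sql_query", "inventory_api"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    # map() preserves input order even though the calls overlap in time.
    results = list(pool.map(fetch, sources))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s (sequential would take about {0.1 * len(sources):.1f}s)")
```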

ReAct framework (Yao et al., 2022): interleave Thought → Action → Observation at each step. Pattern:

Thought 1: [reasoning]
Act 1: [action]
Observation 1: [result]
...
Act N: Finish [answer]
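
The loop behind this pattern can be sketched with a scripted stand-in for the model. Everything here is illustrative: `scripted_model` returns canned text instead of calling an LLM, the `name[args]` action syntax is an assumed convention, and the only tool is a toy `multiply`.

```python
def multiply(a, b):
    return a * b

TOOLS = {"multiply": multiply}

def scripted_model(transcript):
    # Hypothetical stand-in for an LLM: the reply depends only on whether
    # the first observation is already in the transcript.
    if "Observation 1:" not in transcript:
        return "Thought 1: I need the product of 6 and 7.\nAct 1: multiply[6, 7]"
    return "Thought 2: The observation answers the question.\nAct 2: Finish[42]"

def run_react(question, model, max_steps=5):
    transcript = f"Question: {question}"
    for step in range(1, max_steps + 1):
        output = model(transcript)
        transcript += "\n" + output
        # The last line of the output looks like "Act N: name[args]".
        action = output.splitlines()[-1].split(": ", 1)[1]
        name, _, args = action.partition("[")
        name, args = name.strip(), args.rstrip("]")
        if name == "Finish":
            return args, transcript
        # Run the tool and feed the result back as an observation.
        observation = TOOLS[name](*(int(x) for x in args.split(",")))
        transcript += f"\nObservation {step}: {observation}"
    raise RuntimeError("no Finish action within max_steps")

answer, transcript = run_react("What is 6 times 7?", scripted_model)
print(transcript)
```

The `max_steps` cap matters in practice: without it, a model that never emits a Finish action would loop indefinitely.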

Reflexion (Shinn et al., 2023): after evaluation, agent reflects on what went wrong → generates new plan. Allows learning from mistakes within a session.

Agent Failure Modes and Evaluation

Planning failures:

Invalid tool
Valid tool, invalid parameters
Valid tool, wrong parameter values
Goal failure (doesn’t achieve the goal, violates constraints)
Time failure (task completed after deadline)
Reflection error (convinced task is done when it isn’t)

Tool failures:

Tool gives wrong output
Translation errors (natural language plan → executable commands)
Missing tools for the task

Efficiency metrics:

Average steps per task
Average cost per task
Time per action

Evaluation process: create (task, tool inventory) tuples → generate K plans per task → measure: % of plans that are valid, average number of plans generated before the first valid one, % of tool calls that are valid, and distribution of error types.
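
For one (task, tool inventory) pair, the plan-level metrics could be computed as below. The sketch simplifies a plan to a list of tool names and treats a plan as valid when every tool exists in the inventory; parameter checks and averaging across tasks are left out, and the sample plans are invented.

```python
def evaluate_plans(plans, tool_inventory):
    """Summarize K generated plans (assumed non-empty) for a single task."""
    validity = [all(tool in tool_inventory for tool in plan) for plan in plans]
    calls = [tool for plan in plans for tool in plan]
    return {
        "valid_plan_rate": sum(validity) / len(plans),
        # 1-based index of the first valid plan, or None if none was valid.
        "plans_to_first_valid": validity.index(True) + 1 if True in validity else None,
        "valid_call_rate": sum(tool in tool_inventory for tool in calls) / len(calls),
    }

tools = {"search", "calculator"}
plans = [
    ["search", "bing_maps"],     # invalid: unknown tool
    ["search", "calculator"],
    ["calculator"],
]
report = evaluate_plans(plans, tools)
print(report)
```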

Memory

Three memory mechanisms:

Type               | Mechanism                     | Persistence               | Speed              | Capacity
Internal knowledge | Model weights (training data) | Permanent (until retrain) | Instant            | Fixed
Short-term         | Context window                | Session only              | Fast               | Limited (context length)
Long-term          | External storage (RAG)        | Across sessions           | Slower (retrieval) | Unlimited

Rule of thumb:

Information needed for ALL tasks → bake into model via training/finetuning
Information needed RARELY → long-term memory
Immediate/task-specific information → short-term memory

Memory benefits:

Manage information overflow within session
Persist personalization across sessions (e.g., “recommend based on books I loved”)
Boost consistency (model remembers its previous answers)
Maintain data structural integrity (can store structured data outside unstructured context)

Short-term memory management strategies:

FIFO (first in, first out): simple but risks losing important early context
Summarization: compress old messages into summary + add back key entities
Reflection-based (Liu et al., 2023): after each action, decide whether new info should be inserted into memory, merged with existing memory, or used to replace outdated memory
Contradiction handling: keep newer info, use AI to judge, or keep both (use case dependent)
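
The first two strategies can be contrasted in a short sketch. `FifoMemory` simply drops the oldest message, while `SummarizingMemory` folds it into a running summary. The `naive_summarize` function is a placeholder that keeps the first three words of each evicted message; a real system would call a model there.

```python
from collections import deque

class FifoMemory:
    """Keep only the most recent messages; older ones are silently dropped."""
    def __init__(self, max_messages):
        self.messages = deque(maxlen=max_messages)

    def add(self, message):
        self.messages.append(message)

class SummarizingMemory:
    """Fold evicted messages into a running summary instead of dropping them."""
    def __init__(self, max_messages, summarize):
        self.max_messages = max_messages
        self.summarize = summarize
        self.summary = ""
        self.messages = []

    def add(self, message):
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            oldest = self.messages.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    def context(self):
        prefix = [f"Summary: {self.summary}"] if self.summary else []
        return prefix + self.messages

def naive_summarize(summary, message):
    # Placeholder summarizer: keeps the first three words of the evicted message.
    snippet = " ".join(message.split()[:3])
    return f"{summary}; {snippet}" if summary else snippet

history = ["my name is Ada", "I like sci-fi books", "recommend something new"]

fifo = FifoMemory(max_messages=2)
summarizing = SummarizingMemory(max_messages=2, summarize=naive_summarize)
for message in history:
    fifo.add(message)
    summarizing.add(message)

print(list(fifo.messages))
print(summarizing.context())
```

With FIFO the user's name is gone after the third message; the summarizing variant still carries a trace of it in the summary line.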

Key Takeaways

RAG = context construction per query; remains valuable even as context lengths grow (efficiency and cost)
Term-based retrieval (BM25/Elasticsearch) is a strong, cheap baseline; embedding-based adds power but costs more; hybrid search combines the best of both
Chunking strategy significantly impacts retrieval quality — experiment with size, overlap, and contextual augmentation
Agents = models + tool inventory + planning; the more tools, the more capable but harder to manage
Agents are only as good as their planners; reflection and error correction are crucial for multi-step tasks
Tool write actions enable full automation but require careful security measures (human approval, isolation)
Memory system = three tiers (internal, short-term, long-term) — use each based on usage frequency and persistence needs
Agents have unique failure modes (tool call errors, goal failures, reflection errors) — evaluate each independently

Study Notes by Niladri & AI
