Chapter 6: RAG and Agents
RAG (Retrieval-Augmented Generation)
Why RAG
- Models have limited context length and may hallucinate without grounding
- RAG = retrieve relevant information per query → feed to model → better responses
- Term coined by Lewis et al. (2020); retrieve-then-generate pattern introduced by Chen et al. (2017)
- RAG ≠ just a context-length workaround: even with long context, more tokens = more cost, and the model gets less focused (information buried in the middle of a long context is easily missed, as “needle in a haystack” tests show)
- Data grows over time → will always need retrieval even as context lengths grow
- Anthropic guidance: if knowledge base < 200K tokens (~500 pages), include it all in prompt — no RAG needed
RAG = context construction for AI, just as feature engineering = context construction for classical ML.
RAG Architecture
Two components:
- Retriever: indexes data + retrieves relevant chunks per query
- Generator: generates response based on retrieved context
Modern RAG: retriever and generator are often trained separately; many production systems use off-the-shelf retrievers.
Indexing: process data into a searchable form.
Querying: find chunks most relevant to each query.
Chunking: split documents into manageable pieces to avoid arbitrarily long contexts.
Retrieval Algorithms
Term-based retrieval (sparse):
- Keywords/lexical matching
- TF-IDF: term frequency × inverse document frequency — weights terms by how informative they are (common terms like “the” get lower weight)
- BM25: refined TF-IDF, normalizes by document length; still competitive as a baseline
- Elasticsearch: uses inverted index (term → documents containing it)
- Fast, strong out-of-the-box; fewer parameters to tune; misses semantic meaning (e.g., cannot tell “transformer” the electrical device from “transformer” the ML architecture)
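The BM25 bullet above can be made concrete with a minimal sketch. It assumes pre-tokenized documents, a commonly used smoothed IDF variant, and typical default values `k1=1.5` and `b=0.75`; the function name `bm25_scores` and the toy documents are illustrative, not from any particular library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms get a higher weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up for illustration).
docs = [
    "the transformer architecture uses attention".split(),
    "the power transformer steps down the voltage".split(),
    "the cat sat on the mat".split(),
]
scores = bm25_scores("transformer attention".split(), docs)
```

The first document matches both query terms and scores highest; the second matches only “transformer”; the third matches nothing and scores zero. Note that the lexical match on “transformer” cannot separate the two senses of the word.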
Embedding-based retrieval (dense/semantic):
- Convert query and documents to embeddings → find nearest neighbors
- Requires vector database for efficient nearest-neighbor search
- k-NN: exact but slow for large datasets
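- ANN (approximate nearest neighbor) libraries: FAISS (Facebook), ScaNN (Google), Annoy (Spotify), Hnswlib
- LSH (locality-sensitive hashing): hashes similar vectors into the same bucket
- HNSW (hierarchical navigable small world): multi-layer graph; high accuracy + fast queries but slow/memory-intensive to build
- Product quantization: compresses each vector by splitting it into subvectors and encoding each with a short code
- IVF (inverted file index): clusters vectors with k-means, then searches only the clusters closest to the query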
- Can significantly outperform term-based with finetuning; more expensive (embedding generation, vector storage/query)
- Vector DB cost can be 1/5 to 1/2 of total model API spending
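As a baseline for the ANN methods above, exact k-NN can be sketched in a few lines. This is a minimal illustration that assumes cosine similarity and tiny hand-made 3-dimensional vectors; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn(query_vec, doc_vecs, k=2):
    """Exact k-NN: score every document, keep the top k. Cost grows linearly with corpus size."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings, made up for illustration.
doc_vecs = {
    "ml_transformer": [0.9, 0.1, 0.0],
    "power_transformer": [0.1, 0.9, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
nearest = knn([0.8, 0.2, 0.0], doc_vecs, k=2)
```

Because every query scans all vectors, this approach does not scale; ANN indexes trade a little accuracy for much faster lookups.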
Retrieval evaluation metrics:
- Context precision: % of retrieved docs that are relevant
- Context recall: % of relevant docs that are retrieved
- NDCG, MAP, MRR: ranking quality
- End-to-end evaluation: does the retriever help the generator produce better answers?
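The set-based metrics above are simple to compute once relevance labels exist. The sketch below assumes a single query with a ranked list of retrieved IDs and a labeled set of relevant IDs; the document IDs are made up. MRR is the mean of `reciprocal_rank` over many queries.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return sum(d in relevant for d in retrieved) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return sum(d in retrieved for d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output (illustrative)
relevant = {"d1", "d2", "d5"}          # labeled ground truth (illustrative)

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
rr = reciprocal_rank(retrieved, relevant)           # first relevant hit is at rank 2
```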
Hybrid search: combine term-based + embedding-based retrieval:
- Sequential: BM25 fetches candidates → vector search reranks
- Parallel (ensemble): multiple retrievers → Reciprocal Rank Fusion (RRF) combines rankings
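Reciprocal Rank Fusion can be sketched as follows: each document receives 1 / (k + rank) from every ranking it appears in, and the sums determine the fused order. The constant `k=60` is a commonly used default and is an assumption here, as are the toy rankings.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["a", "b", "c"]     # illustrative term-based results
vector_ranking = ["b", "d", "a"]   # illustrative embedding-based results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that appear near the top of both lists (“b” and “a” here) rise above documents that only one retriever found.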
Retrieval Optimization
Chunking strategy:
- Equal-length chunks: by character (2048), word (512), sentence, paragraph
- Recursive chunking: section → paragraph → sentence (preserves related content)
- Language-specific splitters: code, Chinese text, Q&A pairs
- Overlapping chunks: avoid cutting off context at boundaries
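- Chunk size trade-offs: smaller chunks = more chunks fit in the context (more diverse info) but more embeddings to generate, store, and search; larger chunks = more context per chunk but fewer fit in the context, so coverage may suffer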
- No universal best chunk size — experiment
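Equal-length chunking with overlap can be sketched as below. This is a minimal word-based version; the function name `chunk_words` and the default sizes are assumptions for illustration, and a production splitter would usually respect sentence or section boundaries.

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = chunk_words("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each chunk starts with the last word of the previous one, so content at a boundary is never seen without its neighbor.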
Reranking:
- First stage: a cheap, high-recall retriever fetches candidates; second stage: a more expensive, higher-precision reranker reorders them
- Time-based reranking: recent docs get higher weight
- Models better at content at beginning/end of context → place most relevant chunks there
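One possible way to act on the position effect above is to interleave chunks so that the strongest ones land at the two ends of the context. The sketch below is a heuristic illustration, not a prescribed method; `order_for_context` is a hypothetical helper that expects chunks sorted from most to least relevant.

```python
def order_for_context(chunks_by_relevance):
    """Place the most relevant chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th... go to the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = order_for_context(["c1", "c2", "c3", "c4", "c5"])  # c1 is most relevant
```

Here `c1` opens the context, `c2` closes it, and the least relevant chunk `c5` sits in the middle.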
Query rewriting:
- Rewrite ambiguous/conversational queries to be self-contained
- “How about Emily Doe?” → “When was the last time Emily Doe bought from us?”
- Use another AI model with prompt: “rewrite the last user input to reflect what the user is actually asking”
Contextual retrieval (Anthropic):
- Augment each chunk with AI-generated context (50-100 tokens) explaining the chunk’s role in the original document
- Prepend context to chunk before indexing
- Improves retrieval accuracy for chunks that lack standalone context
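The prepend-before-indexing step can be sketched as follows. The context generator is passed in as a callable; `fake_generate_context` is a hypothetical stand-in for a model call and simply returns a template string, and the sample document and chunk are made up.

```python
def contextualize_chunks(document_title, chunks, generate_context):
    """Prepend a short generated context to each chunk before it is indexed."""
    indexed = []
    for chunk in chunks:
        context = generate_context(document_title, chunk)
        indexed.append(f"{context}\n\n{chunk}")
    return indexed

def fake_generate_context(document_title, chunk):
    # Hypothetical stand-in: a real system would ask a model to situate
    # the chunk within the full document in roughly 50-100 tokens.
    return f"This chunk is from '{document_title}'."

chunks = ["Revenue grew 3% over the previous quarter."]
indexed = contextualize_chunks("ACME Q2 report", chunks, fake_generate_context)
```

The augmented text, not the bare chunk, is what gets embedded or added to the term index, so a query mentioning the document can still match a chunk that never names it.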
RAG beyond text:
- Multimodal RAG: retrieve images/video using captions/metadata or CLIP multimodal embeddings
- Tabular RAG: text-to-SQL → execute SQL → generate response from results
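The tabular RAG flow can be sketched with an in-memory SQLite database. The text-to-SQL step is stubbed: `text_to_sql` is a hypothetical placeholder that returns a fixed query where a real system would call a model. The table, rows, and prompt wording are made up for illustration.

```python
import sqlite3

def text_to_sql(question):
    # Hypothetical stand-in for a text-to-SQL model; returns a fixed query.
    return "SELECT SUM(amount) FROM orders WHERE customer = 'Emily Doe'"

# Step 0: a toy table to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Emily Doe", 20.0), ("Emily Doe", 15.5), ("John Roe", 9.0)],
)

# Step 1: text-to-SQL. Step 2: execute. Step 3: build the generation prompt.
question = "How much has Emily Doe spent with us?"
sql = text_to_sql(question)
(total,) = conn.execute(sql).fetchone()
prompt = f"Question: {question}\nSQL result: {total}\nAnswer using only the SQL result."
```

In practice, model-generated SQL should run with read-only permissions, since executing arbitrary queries is itself a write-action risk.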
Agents
Agent Overview
An agent = anything that perceives its environment and acts upon it (Russell & Norvig, 1995).
- Characterized by its environment (game, internet, kitchen, road) and tool inventory
- AI = brain that processes task, plans, executes tools, reflects
- ChatGPT is an agent (web search, code execution, image generation)
- RAG systems are agents (retriever is a tool)
Why agents need stronger models:
- Compound error: at 95% accuracy per step, a 10-step task succeeds about 0.95^10 ≈ 60% of the time; a 100-step task only about 0.95^100 ≈ 0.6%
- Higher stakes: write actions can have severe consequences
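The compounding effect can be checked with one line of arithmetic. The sketch assumes every step succeeds independently with the same probability, which is a simplification.

```python
def end_to_end_accuracy(step_accuracy, n_steps):
    """Probability that all steps succeed, assuming independent steps."""
    return step_accuracy ** n_steps

acc_10 = end_to_end_accuracy(0.95, 10)    # roughly 0.60
acc_100 = end_to_end_accuracy(0.95, 100)  # roughly 0.006
```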
Tools
Three categories:
- Knowledge augmentation: text retriever, SQL executor, web search, email reader, inventory API
- Capability extension: calculator, calendar, timezone converter, code interpreter, text-to-image, OCR, translator (e.g., Chameleon (Lu et al., 2023): GPT-4 + 13 tools → +11.37% on ScienceQA, +17% on TabMWP)
- Write actions: SQL write, email send, bank transfer, database update; these enable full workflow automation but require human oversight and security measures
Function calling: tool use feature supported by most model providers; model decides which tool and what parameters to use.
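On the application side, function calling usually reduces to parsing the model's chosen tool and arguments and dispatching to real code. The sketch below assumes a hypothetical JSON shape (`name`, `arguments`); each provider defines its own format, and the two tools are toy examples.

```python
import json

def add(a, b):
    return a + b

def to_upper(text):
    return text.upper()

# Tool inventory: the names the model is allowed to call.
TOOLS = {"add": add, "to_upper": to_upper}

def execute_tool_call(raw_call):
    """Parse a model-produced tool call (hypothetical JSON shape) and run it."""
    call = json.loads(raw_call)
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"invalid tool: {name}")
    return TOOLS[name](**args)

result = execute_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```

A call with wrong argument names raises a `TypeError` at dispatch time, which is one way the "valid tool, invalid parameters" failure surfaces.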
Tool selection principles:
- Compare agent performance with different tool sets
- Ablation study: remove tool → see performance drop
- Plot tool use distribution — some tools rarely used
- Different models have different tool preferences (in Lu et al., 2023, GPT-4 favors knowledge retrieval; ChatGPT favors image captioning)
- More tools = more capabilities but harder to use efficiently; start with minimum viable set
Planning
Planning process:
- Plan generation: decompose task into action sequence
- Reflection/validation: evaluate plan quality; iterate if bad
- Execution: invoke tools following plan
- Reflection: evaluate outcomes; correct errors; repeat if needed
Decoupled planning and execution:
- Generate plan → validate → execute (vs. chain-of-thought which does all in one)
- Multi-agent: planner agent + validator agent + executor agent
- Validation: AI judge, heuristics (invalid tools, too many steps), human experts
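The heuristic checks mentioned above (invalid tools, too many steps) can be sketched as a small validator that runs before any tool is executed. The plan format, a list of dicts with a `tool` key, and the `max_steps` default are assumptions for illustration.

```python
def validate_plan(plan, tool_inventory, max_steps=10):
    """Return a list of problems; an empty list means the plan passes the heuristics."""
    problems = []
    if len(plan) > max_steps:
        problems.append(f"too many steps: {len(plan)} > {max_steps}")
    for i, step in enumerate(plan, start=1):
        if step["tool"] not in tool_inventory:
            problems.append(f"step {i}: invalid tool '{step['tool']}'")
    return problems

inventory = {"web_search", "summarize"}
good_plan = [{"tool": "web_search"}, {"tool": "summarize"}]
bad_plan = [{"tool": "web_search"}, {"tool": "book_flight"}]

good_problems = validate_plan(good_plan, inventory)
bad_problems = validate_plan(bad_plan, inventory)
```

Only plans with no reported problems move on to execution; rejected plans go back to the planner along with the problem list.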
Foundation models as planners (debate):
- LeCun and Kambhampati: autoregressive LLMs can’t truly plan
- Counter-argument: LLMs can predict action outcomes (contain world model) + can backtrack by revising path
- Practical: even if imperfect planners, can still be part of a larger planning system
Plan generation techniques:
- Prompt with tool descriptions + examples (few-shot)
- Natural language plans (more robust to tool API changes; need a translator to convert each step into executable commands)
- Hierarchical planning: high-level plan → detailed sub-plans
Control flows:
- Sequential, Parallel, If-statement, For-loop
- Parallel execution can drastically reduce latency for tasks with many independent steps
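The latency benefit of the parallel control flow can be sketched with a thread pool. The tool calls are simulated with `time.sleep`; `fetch` and the source names are placeholders rather than real APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(source):
    time.sleep(0.1)  # simulate the I/O latency of a tool call
    return f"result from {source}"

sources = ["web_search", "sql_executor", "inventory_api"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    # map() runs the calls concurrently and returns results in input order.
    results = list(pool.map(fetch, sources))
parallel_seconds = time.perf_counter() - start
```

Run sequentially, the three calls would take at least 0.3 seconds; run in parallel, the total is close to the slowest single call. This only works when the steps do not depend on each other's outputs.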
ReAct framework (Yao et al., 2022): interleave Thought → Action → Observation at each step. Pattern:
Thought 1: [reasoning]
Act 1: [action]
Observation 1: [result]
...
Act N: Finish [answer]
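The pattern above can be sketched as a loop that appends each Thought, Act, and Observation to a running transcript. This is a toy illustration: `scripted_model` is a hypothetical stand-in for an LLM call that follows a fixed two-step policy, and the only tool is a multiplier.

```python
def react_loop(model, tools, question, max_steps=5):
    """Run Thought -> Act -> Observation steps until the model emits Finish."""
    transcript = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        thought, action, arg = model(transcript)
        transcript += f"Thought {step}: {thought}\nAct {step}: {action} [{arg}]\n"
        if action == "Finish":
            return arg, transcript
        observation = tools[action](arg)
        transcript += f"Observation {step}: {observation}\n"
    return None, transcript  # step budget exhausted without an answer

def scripted_model(transcript):
    # Hypothetical stand-in for an LLM: call the tool once, then finish.
    if "Observation 1:" not in transcript:
        return "I should multiply 12 by 7.", "multiply", "12,7"
    last_observation = transcript.strip().splitlines()[-1].split(": ", 1)[1]
    return "The tool returned the product.", "Finish", last_observation

def multiply(arg):
    a, b = (int(x) for x in arg.split(","))
    return str(a * b)

answer, transcript = react_loop(scripted_model, {"multiply": multiply}, "What is 12 * 7?")
```

The `max_steps` cap is a simple guard against an agent that never decides it is finished.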
Reflexion (Shinn et al., 2023): after evaluation, agent reflects on what went wrong → generates new plan. Allows learning from mistakes within a session.
Agent Failure Modes and Evaluation
Planning failures:
- Invalid tool
- Valid tool, invalid parameters (e.g., wrong number or names of arguments)
- Valid tool, valid parameters, but incorrect parameter values
- Goal failure (doesn’t achieve the goal, violates constraints)
- Time failure (task completed after deadline)
- Reflection error (convinced task is done when it isn’t)
Tool failures:
- Tool gives wrong output
- Translation errors (natural language plan → executable commands)
- Missing tools for the task
Efficiency metrics:
- Average steps per task
- Average cost per task
- Time per action
Evaluation process: create (task, tool inventory) tuples → generate K plans per task → measure: % of plans that are valid, average number of plans generated to get a valid one, % of tool calls that are valid, and the distribution of error types.
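Some of these measurements can be sketched for a single task as below. The sketch simplifies validity to "every tool named in the plan exists in the inventory"; a fuller check would also cover parameters and goals. The plans and inventory are made up.

```python
def evaluate_plans(plans, tool_inventory):
    """Each plan is a list of tool names; a plan is valid if every tool exists."""
    validity = [all(tool in tool_inventory for tool in plan) for plan in plans]
    calls = [tool for plan in plans for tool in plan]
    return {
        "valid_plan_rate": sum(validity) / len(plans),
        # 1-based index of the first valid plan, or None if no plan is valid.
        "plans_to_first_valid": next(
            (i for i, ok in enumerate(validity, start=1) if ok), None
        ),
        "tool_call_validity_rate": sum(t in tool_inventory for t in calls) / len(calls),
    }

inventory = {"search", "summarize"}
plans = [["search", "fly"], ["search", "summarize"], ["summarize"]]  # K = 3
report = evaluate_plans(plans, inventory)
```

Averaging `plans_to_first_valid` across many tasks gives the "plans needed to get a valid one" figure.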
Memory
Three memory mechanisms:
| Type | Mechanism | Persistence | Speed | Capacity |
|---|---|---|---|---|
| Internal knowledge | Model weights (training data) | Permanent (until retrain) | Instant | Fixed |
| Short-term | Context window | Session only | Fast | Limited (context length) |
| Long-term | External storage (RAG) | Across sessions | Slower (retrieval) | Unlimited |
Rule of thumb:
- Information needed for ALL tasks → bake into model via training/finetuning
- Information needed RARELY → long-term memory
- Immediate/task-specific information → short-term memory
Memory benefits:
- Manage information overflow within session
- Persist personalization across sessions (e.g., “recommend based on books I loved”)
- Boost consistency (model remembers its previous answers)
- Maintain the structural integrity of data (structured data such as tables can be stored outside the unstructured context)
Short-term memory management strategies:
- FIFO (first in, first out): simple but risks losing important early context
- Summarization: compress old messages into summary + add back key entities
- Reflection-based (Liu et al., 2023): after each action, decide whether new info should be inserted into memory, merged with existing memory, or used to replace outdated info
- Contradiction handling: keep newer info, use AI to judge, or keep both (use case dependent)
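The FIFO strategy and its main risk can be sketched with a bounded queue. `FifoMemory` is a hypothetical class for illustration, and it counts messages rather than tokens to keep the example short.

```python
from collections import deque

class FifoMemory:
    """Keep only the most recent messages; the oldest is evicted first."""

    def __init__(self, max_messages):
        self.messages = deque(maxlen=max_messages)

    def add(self, message):
        self.messages.append(message)  # silently drops the oldest when full

    def context(self):
        return list(self.messages)

memory = FifoMemory(max_messages=3)
for message in [
    "My name is Ana.",
    "I like sci-fi.",
    "Any new releases?",
    "What about audiobooks?",
]:
    memory.add(message)
context = memory.context()
```

After the fourth message, the user's name is gone from the context. Summarization-based strategies address this by compressing evicted messages instead of discarding them.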
Key Takeaways
- RAG = context construction per query; remains valuable even as context lengths grow (efficiency and cost)
- Term-based retrieval (BM25/Elasticsearch) is a strong, cheap baseline; embedding-based adds power but costs more; hybrid search combines the best of both
- Chunking strategy significantly impacts retrieval quality — experiment with size, overlap, and contextual augmentation
- Agents = models + tool inventory + planning; the more tools, the more capable but harder to manage
- Agents are only as good as their planners; reflection and error correction are crucial for multi-step tasks
- Tool write actions enable full automation but require careful security measures (human approval, isolation)
- Memory system = three tiers (internal, short-term, long-term) — use each based on usage frequency and persistence needs
- Agents have unique failure modes (tool call errors, goal failures, reflection errors) — evaluate each independently