Chapter 6: RAG and Agents

RAG (Retrieval-Augmented Generation)

Why RAG

  • Models have limited context length and may hallucinate without grounding
  • RAG = retrieve relevant information per query → feed to model → better responses
  • Term coined by Lewis et al. (2020); retrieve-then-generate pattern introduced by Chen et al. (2017)
  • RAG ≠ just a context length workaround: even with long context, more tokens = more cost + a less focused model; models also tend to use information in the middle of a long context poorly (as “needle in a haystack” tests show)
  • Data grows over time → will always need retrieval even as context lengths grow
  • Anthropic guidance: if knowledge base < 200K tokens (~500 pages), include it all in prompt — no RAG needed

RAG = context construction for AI, just as feature engineering = context construction for classical ML.

RAG Architecture

Two components:

  1. Retriever: indexes data + retrieves relevant chunks per query
  2. Generator: generates response based on retrieved context

Modern RAG: retriever and generator are often trained separately; many production systems use off-the-shelf retrievers.

Indexing: process data into a searchable form.
Querying: find chunks most relevant to each query.
Chunking: split documents into manageable pieces to avoid arbitrarily long contexts.

Retrieval Algorithms

Term-based retrieval (sparse):

  • Keywords/lexical matching
  • TF-IDF: term frequency × inverse document frequency — weights terms by how informative they are (common terms like “the” get lower weight)
  • BM25: refined TF-IDF, normalizes by document length; still competitive as a baseline
  • Elasticsearch: uses inverted index (term → documents containing it)
  • Fast, strong out-of-the-box; fewer parameters to tune; misses semantic meaning (e.g., cannot tell “transformer” the electrical device from “transformer” the ML architecture)
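
A minimal sketch of BM25 scoring, assuming whitespace tokenization, a Lucene-style IDF, and the commonly used defaults k1 = 1.5 and b = 0.75 (all assumptions, not taken from the chapter):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with one common BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by length (b).
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the transformer architecture uses attention",
    "the power transformer steps down voltage",
    "the cat sat on the mat",
]
scores = bm25_scores("transformer attention", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

Both “transformer” documents get a nonzero score because the match is purely lexical; only the extra query term separates them.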

Embedding-based retrieval (dense/semantic):

  • Convert query and documents to embeddings → find nearest neighbors
  • Requires vector database for efficient nearest-neighbor search
  • k-NN: exact but slow for large datasets
  • ANN libraries: FAISS (Facebook), ScaNN (Google), Annoy (Spotify), Hnswlib; common ANN algorithms:
    • LSH: hashes similar vectors to same bucket
    • HNSW: multi-layer graph; high accuracy + fast queries but slow/memory-intensive to build
    • Product Quantization: compresses each vector into a compact code by splitting it into subvectors and quantizing each
    • IVF: clusters vectors with K-means, then searches only the clusters nearest the query
  • Can significantly outperform term-based with finetuning; more expensive (embedding generation, vector storage/query)
  • Vector DB spending can be substantial: reportedly one-fifth to one-half of what some companies spend on model APIs
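
A minimal sketch of exact k-NN search with cosine similarity. The 3-dimensional vectors are toy stand-ins for real embeddings, and the function names are illustrative:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vec, doc_vecs, k=2):
    """Exact k-NN: compare the query against every stored vector."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings"; real ones would come from an embedding model.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.3, 0.1],
]
top = knn_search([1.0, 0.2, 0.0], doc_vecs, k=2)
```

Every query scans all stored vectors, which is why exact k-NN gets slow on large collections and ANN indexes trade some accuracy for speed.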

Retrieval evaluation metrics:

  • Context precision: % of retrieved docs that are relevant
  • Context recall: % of relevant docs that are retrieved
  • NDCG, MAP, MRR: ranking quality
  • End-to-end evaluation: does the retriever help the generator produce better answers?
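
A minimal sketch of the set-based metrics above for a single query, assuming documents are identified by IDs and the relevant set is known (helper names are illustrative). MRR is the mean of the reciprocal rank over many queries:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in relevant if d in retrieved) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d4"]   # ranked retriever output
relevant = {"d1", "d4", "d9"}          # ground-truth relevant documents

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
rr = reciprocal_rank(retrieved, relevant)           # first relevant hit is at rank 2
```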

Hybrid search: combine term-based + embedding-based retrieval:

  • Sequential: BM25 fetches candidates → vector search reranks
  • Parallel (ensemble): multiple retrievers → Reciprocal Rank Fusion (RRF) combines rankings
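
A minimal sketch of Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is a commonly used default and an assumption here:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["a", "b", "c"]      # hypothetical term-based results
vector_ranking = ["b", "c", "d"]    # hypothetical embedding-based results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that appear in both lists ("b", "c") end up ahead of documents that rank first in only one list ("a").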

Retrieval Optimization

Chunking strategy:

  • Equal-length chunks: by character (2048), word (512), sentence, paragraph
  • Recursive chunking: section → paragraph → sentence (preserves related content)
  • Language-specific splitters: code, Chinese text, Q&A pairs
  • Overlapping chunks: avoid cutting off context at boundaries
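  • Chunk size trade-offs: smaller chunks let more (and more diverse) chunks fit in context but can split related information and mean more chunks to embed, store, and search; larger chunks keep more context per chunk but fewer fit, so coverage may suffer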
  • No universal best chunk size — experiment
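
A minimal sketch of equal-length, word-based chunking with overlap. The function name and the default sizes are illustrative assumptions:

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into chunks of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
```

Each chunk repeats the last word of the previous one, so content at a boundary appears in both neighbors.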

Reranking:

  • First retrieval: cheap, high-recall; second stage: expensive, high-precision
  • Time-based reranking: recent docs get higher weight
  • Models better at content at beginning/end of context → place most relevant chunks there
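
One possible way to act on the last point, sketched under the assumption that chunks arrive sorted from most to least relevant (the function name is illustrative): alternate chunks between the front and the back so the weakest land in the middle.

```python
def order_for_context(chunks_by_relevance):
    """Put the strongest chunks at the edges and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]

ordered = order_for_context(["c1", "c2", "c3", "c4", "c5"])
```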

Query rewriting:

  • Rewrite ambiguous/conversational queries to be self-contained
  • “How about Emily Doe?” → “When was the last time Emily Doe bought from us?”
  • Use another AI model with prompt: “rewrite the last user input to reflect what the user is actually asking”

Contextual retrieval (Anthropic):

  • Augment each chunk with AI-generated context (50-100 tokens) explaining the chunk’s role in the original document
  • Prepend context to chunk before indexing
  • Improves retrieval accuracy for chunks that lack standalone context

RAG beyond text:

  • Multimodal RAG: retrieve images/video using captions/metadata or CLIP multimodal embeddings
  • Tabular RAG: text-to-SQL → execute SQL → generate response from results
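
A minimal sketch of the tabular RAG flow using SQLite. The table, its rows, and the hardcoded query are hypothetical; in practice a model would generate the SQL from the question and the schema, and another model call would phrase the final answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Emily Doe", "2024-01-05"),
        ("Emily Doe", "2024-03-12"),
        ("John Roe", "2024-02-01"),
    ],
)

# Step 1 (text-to-SQL): stand-in for model-generated SQL.
generated_sql = "SELECT MAX(order_date) FROM orders WHERE customer = 'Emily Doe'"

# Step 2: execute the SQL against the table.
(last_order,) = conn.execute(generated_sql).fetchone()

# Step 3: generate the response from the result (a template stands in here).
answer = f"Emily Doe last bought from us on {last_order}."
conn.close()
```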

Agents

Agent Overview

An agent = anything that perceives its environment and acts upon it (Russell & Norvig, 1995).

  • Characterized by its environment (game, internet, kitchen, road) and tool inventory
  • AI = brain that processes task, plans, executes tools, reflects
  • ChatGPT is an agent (web search, code execution, image generation)
  • RAG systems are agents (retriever is a tool)

Why agents need stronger models:

  • Compound accuracy: at 95% accuracy per step, end-to-end accuracy is 0.95^10 ≈ 60% over 10 steps and 0.95^100 ≈ 0.6% over 100 steps
  • Higher stakes: write actions can have severe consequences
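
The compounding arithmetic can be checked in a few lines, assuming each step succeeds independently with the same probability:

```python
def end_to_end_accuracy(per_step_accuracy, n_steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** n_steps

ten_steps = end_to_end_accuracy(0.95, 10)       # roughly 0.60
hundred_steps = end_to_end_accuracy(0.95, 100)  # roughly 0.006
```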

Tools

Three categories:

  1. Knowledge augmentation: text retriever, SQL executor, web search, email reader, inventory API
  2. Capability extension: calculator, calendar, timezone converter, code interpreter, text-to-image, OCR, translator
    • Chameleon (Lu et al., 2023): GPT-4 + 13 tools → +11.37% on ScienceQA over the best published few-shot result, +17% on TabMWP
  3. Write actions: SQL write, email send, bank transfer, database update
    • Enable full workflow automation but require human oversight and security measures

Function calling: tool use feature supported by most model providers; model decides which tool and what parameters to use.
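
A minimal sketch of the application side of function calling: the model emits a tool name and parameters, and the application looks up and runs the tool. The JSON shape, the tool names, and the registry are assumptions for illustration; real provider APIs differ in detail, and no provider is called here:

```python
import json

def get_weather(city):
    # Hypothetical tool; a real one would call a weather service.
    return f"Sunny in {city}"

def add(a, b):
    return a + b

# Tool inventory: maps tool names the model may use to Python callables.
TOOLS = {"get_weather": get_weather, "add": add}

def execute_tool_call(tool_call_json):
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Stand-in for what a model might emit.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
result = execute_tool_call(model_output)
```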

Tool selection principles:

  • Compare agent performance with different tool sets
  • Ablation study: remove tool → see performance drop
  • Plot tool use distribution — some tools rarely used
  • Different models have different tool preferences (e.g., GPT-4 appears to favor retrieval; ChatGPT appears to favor image captioning)
  • More tools = more capabilities but harder to use efficiently; start with minimum viable set

Planning

Planning process:

  1. Plan generation: decompose task into action sequence
  2. Reflection/validation: evaluate plan quality; iterate if bad
  3. Execution: invoke tools following plan
  4. Reflection: evaluate outcomes; correct errors; repeat if needed

Decoupled planning and execution:

  • Generate plan → validate → execute (vs. chain-of-thought which does all in one)
  • Multi-agent: planner agent + validator agent + executor agent
  • Validation: AI judge, heuristics (invalid tools, too many steps), human experts
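
A minimal sketch of heuristic plan validation, assuming a plan is a list of steps that each name a tool. The plan format, the step limit, and the tool names are hypothetical:

```python
def validate_plan(plan, tool_inventory, max_steps=10):
    """Return a list of problems found; an empty list means the plan passes."""
    errors = []
    if len(plan) > max_steps:
        errors.append(f"too many steps: {len(plan)} > {max_steps}")
    for i, step in enumerate(plan, start=1):
        tool = step["tool"]
        if tool not in tool_inventory:
            errors.append(f"step {i}: invalid tool '{tool}'")
    return errors

tools = {"search", "calculator"}
good_plan = [{"tool": "search"}, {"tool": "calculator"}]
bad_plan = [{"tool": "search"}, {"tool": "send_email"}]

good_errors = validate_plan(good_plan, tools)
bad_errors = validate_plan(bad_plan, tools)
```

Only plans with no reported errors would be passed on to execution; the rest go back to the planner.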

Foundation models as planners (debate):

  • LeCun and Kambhampati: autoregressive LLMs can’t truly plan
  • Counter-argument: LLMs can predict action outcomes (contain world model) + can backtrack by revising path
  • Practical: even if imperfect planners, can still be part of a larger planning system

Plan generation techniques:

  • Prompt with tool descriptions + examples (few-shot)
  • Natural language plans (more robust to tool API changes; needs translator)
  • Hierarchical planning: high-level plan → detailed sub-plans

Control flows:

  • Sequential, Parallel, If-statement, For-loop
  • Parallel execution can drastically reduce latency for tasks with many independent steps
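
A minimal sketch of parallel control flow using threads. `fetch` is a hypothetical I/O-bound tool call, simulated here with a short sleep:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(source):
    # Hypothetical slow tool call (e.g., a network request).
    time.sleep(0.2)
    return f"result from {source}"

sources = ["web", "database", "email"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    # map() runs the calls concurrently and returns results in input order.
    results = list(pool.map(fetch, sources))
elapsed = time.perf_counter() - start
```

Because the three calls are independent, the total wait is close to one call rather than the sum of all three.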

ReAct framework (Yao et al., 2022): interleave Thought → Action → Observation at each step. Pattern:

Thought 1: [reasoning]
Act 1: [action]
Observation 1: [result]
...
Act N: Finish [answer]
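
A minimal sketch of the loop behind this pattern. `scripted_model` is a hardcoded stand-in for an LLM, and the single toy tool is hypothetical; a real agent would parse the model's text output instead:

```python
def calculator(expression):
    # Toy tool: supports only "a + b" with integers.
    left, right = expression.split("+")
    return str(int(left) + int(right))

TOOLS = {"calculator": calculator}

def scripted_model(transcript):
    """Stand-in for an LLM: returns (thought, action, argument)."""
    if "Observation" not in transcript:
        return ("I need to add the numbers.", "calculator", "2 + 3")
    last_observation = transcript.strip().splitlines()[-1].split(": ", 1)[1]
    return ("I have the result.", "Finish", last_observation)

def react_loop(question, model, max_steps=5):
    transcript = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        thought, action, argument = model(transcript)
        transcript += f"Thought {step}: {thought}\n"
        transcript += f"Act {step}: {action} [{argument}]\n"
        if action == "Finish":
            return argument, transcript
        # Run the chosen tool and feed its result back as an observation.
        observation = TOOLS[action](argument)
        transcript += f"Observation {step}: {observation}\n"
    return None, transcript  # step budget exhausted without an answer

answer, transcript = react_loop("What is 2 + 3?", scripted_model)
```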

Reflexion (Shinn et al., 2023): after evaluation, agent reflects on what went wrong → generates new plan. Allows learning from mistakes within a session.

Agent Failure Modes and Evaluation

Planning failures:

  • Invalid tool
  • Valid tool, invalid parameters
  • Valid tool, wrong parameter values
  • Goal failure (doesn’t achieve the goal, violates constraints)
  • Time failure (task completed after deadline)
  • Reflection error (convinced task is done when it isn’t)

Tool failures:

  • Tool gives wrong output
  • Translation errors (natural language plan → executable commands)
  • Missing tools for the task

Efficiency metrics:

  • Average steps per task
  • Average cost per task
  • Time per action

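Evaluation process: create (task, tool inventory) tuples → generate K plans per tuple → measure: % of valid plans, average number of plans generated to get one valid plan, % of valid tool calls, and the distribution of error types (invalid tool, invalid parameters, wrong parameter values).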


Memory

Three memory mechanisms:

Type               | Mechanism                     | Persistence               | Speed              | Capacity
Internal knowledge | Model weights (training data) | Permanent (until retrain) | Instant            | Fixed
Short-term         | Context window                | Session only              | Fast               | Limited (context length)
Long-term          | External storage (RAG)        | Across sessions           | Slower (retrieval) | Unlimited

Rule of thumb:

  • Information needed for ALL tasks → bake into model via training/finetuning
  • Information needed RARELY → long-term memory
  • Immediate/task-specific information → short-term memory

Memory benefits:

  • Manage information overflow within session
  • Persist personalization across sessions (e.g., “recommend based on books I loved”)
  • Boost consistency (model remembers its previous answers)
  • Maintain data structural integrity (can store structured data outside unstructured context)

Short-term memory management strategies:

  • FIFO (first in, first out): simple but risks losing important early context
  • Summarization: compress old messages into summary + add back key entities
  • Reflection-based (Liu et al., 2023): after each action, decide whether new info should be inserted into memory, merged with existing memory, or used to replace outdated info
  • Contradiction handling: keep newer info, use AI to judge, or keep both (use case dependent)
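
A minimal sketch combining FIFO eviction with summarization. The class name is illustrative, and the "summarizer" is a placeholder that truncates evicted messages; a real system would ask a model to summarize them:

```python
from collections import deque

class ShortTermMemory:
    """Keeps the newest messages; older ones are evicted first (FIFO)."""

    def __init__(self, max_messages=3):
        self.messages = deque()
        self.max_messages = max_messages
        self.summary = ""

    def add(self, message):
        self.messages.append(message)
        while len(self.messages) > self.max_messages:
            evicted = self.messages.popleft()
            # Placeholder summarizer: keep the first 20 characters.
            self.summary = (self.summary + " " + evicted[:20]).strip()

    def context(self):
        """Summary of evicted messages (if any) followed by recent messages."""
        parts = [f"Summary: {self.summary}"] if self.summary else []
        return parts + list(self.messages)

memory = ShortTermMemory(max_messages=2)
for msg in ["My name is Ana", "I like sci-fi", "Recommend a book"]:
    memory.add(msg)
context = memory.context()
```

Plain FIFO would have dropped the user's name entirely; the summary line keeps a compressed trace of it in context.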

Key Takeaways

  • RAG = context construction per query; remains valuable even as context lengths grow (efficiency and cost)
  • Term-based retrieval (BM25/Elasticsearch) is a strong, cheap baseline; embedding-based adds power but costs more; hybrid search combines the best of both
  • Chunking strategy significantly impacts retrieval quality — experiment with size, overlap, and contextual augmentation
  • Agents = models + tool inventory + planning; the more tools, the more capable but harder to manage
  • Agents are only as good as their planners; reflection and error correction are crucial for multi-step tasks
  • Tool write actions enable full automation but require careful security measures (human approval, isolation)
  • Memory system = three tiers (internal, short-term, long-term) — use each based on usage frequency and persistence needs
  • Agents have unique failure modes (tool call errors, goal failures, reflection errors) — evaluate each independently