Module 05: Agent Memory

Memory is what separates a stateless question-answering system from an agent that can operate over time, learn from past interactions, and maintain coherent context across sessions. This module covers the full landscape of memory patterns used in production agent systems.


1. Why Memory Matters for Agents

The Core Challenge: Stateless LLMs + State Management

Large language models are fundamentally stateless. Every API call is independent — the model has no awareness of previous calls unless you explicitly provide that history as input. This is both a constraint and a design opportunity.

When you call client.messages.create(...), you are not continuing a conversation — you are making a fresh inference request. The illusion of memory is entirely the responsibility of your application layer.

This creates the core challenge of agent development: you must decide what state to carry, how to compress it, when to persist it, and when to retrieve it. These decisions have direct consequences on:

  • Cost: Every token in the context window is a token you pay for. Carrying 50,000 tokens of history on every call is expensive.
  • Quality: Irrelevant context degrades model performance. A well-curated memory outperforms a full dump of everything.
  • Latency: Retrieving and injecting memory takes time, especially if it involves a vector DB query.
  • Coherence: If the agent “forgets” something the user told it earlier, trust breaks down.
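
To make the cost point concrete, here is a minimal sketch of how input tokens add up when the full history is resent on every call. The function name and the flat tokens-per-turn figure are illustrative assumptions, not measurements, and the model treats call k as carrying roughly k turns of history.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, fixed_overhead: int = 0) -> int:
    """
    Approximate total input tokens across `turns` calls when every call
    resends the full history so far plus a fixed overhead (system prompt, tools).

    Call k carries about k turns of history, so the history part sums to
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    history_total = tokens_per_turn * turns * (turns + 1) // 2
    return history_total + fixed_overhead * turns


# 200 turns at ~200 tokens each: the final call carries about 40,000 tokens,
# but the session as a whole has sent far more input than that.
print(cumulative_input_tokens(turns=200, tokens_per_turn=200))  # 4020000
```

The growth is quadratic in the number of turns, which is why the compression strategies later in this module matter even when a single call still fits in the window.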

What Agents Need to Remember

An agent may need to hold onto several distinct categories of information simultaneously:

| Category         | Example                                                 | Typical Lifetime           |
|------------------|---------------------------------------------------------|----------------------------|
| Facts            | “User’s name is Priya”                                  | Persistent across sessions |
| Preferences      | “User prefers Python over JavaScript”                   | Persistent, updateable     |
| History          | “We discussed authentication last Tuesday”              | Session → long-term        |
| Task state       | “We are on step 3 of a 7-step data pipeline”            | Session-scoped             |
| Working memory   | “The last function I wrote was parse_invoice()”         | Within-turn                |
| Episodic context | “Last time we tried this approach, it failed because X” | Long-term                  |

Designing memory means deciding which of these categories you need, and for each one, choosing the right storage and retrieval mechanism.


2. Memory Taxonomy

A complete taxonomy of agent memory covers six types. Understanding the distinction between them is essential for both system design and interviews.

2.1 In-Context Memory (Scratchpad)

Everything currently in the LLM’s context window is in-context memory. This includes:

  • The system prompt
  • The conversation history (messages array)
  • Tool call results
  • Any injected background knowledge

Characteristics:

  • Zero retrieval latency — the model reasons directly over it
  • Highest reliability — no retrieval step that can fail or return irrelevant results
  • Hard upper limit — every model has a context window cap (Claude 3.5 Sonnet: 200K tokens)
  • Cost scales linearly — more context = more input tokens = higher cost

When to use it: For everything that must be reliably available for the current turn. System prompt instructions, the last few turns of conversation, and any tool outputs from the current task.

When it breaks down: Long-running conversations, agents that need access to information from dozens of past sessions, or tasks where the relevant background knowledge is too large to fit.

# The simplest form of in-context memory: just keep the messages array
messages = [
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Hi Alex, how can I help?"},
    {"role": "user", "content": "What's my name?"},
]
# The model can answer "Your name is Alex" because it's all in context

2.2 Conversation History

Multi-turn conversation is the most common form of short-term agent memory. The messages array is passed back and forth, growing with each turn.

The budget problem: A conversation that has been running for an hour might have 200 turns. At ~200 tokens per turn, that is 40,000 tokens just for history — before you even count the system prompt, tool definitions, or the current query.

Truncation strategies:

Sliding window — Keep only the last N turns. Simple and predictable. Loses older context abruptly.

MAX_HISTORY = 10  # keep last 10 messages (5 turns)
if len(messages) > MAX_HISTORY:
    messages = messages[-MAX_HISTORY:]

Summarization — When the window fills up, compress older messages into a single summary message. Preserves semantic content at the cost of detail.

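# When history gets long, summarize everything except the most recent half-window
if len(messages) > MAX_HISTORY:
    keep = MAX_HISTORY // 2
    old_messages = messages[:-keep]
    summary = summarize_messages(old_messages)  # LLM call (helper not shown)
    # The Anthropic Messages API has no "system" role inside `messages`
    # (the system prompt is a separate parameter), so inject the summary
    # as a clearly labeled user message instead.
    messages = [{"role": "user", "content": f"[Earlier conversation summary] {summary}"}] \
               + messages[-keep:]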

Importance scoring — Assign a relevance score to each message. Discard low-scoring messages first. More sophisticated but requires a scoring step.

Hybrid / Hierarchical — Recent turns are kept verbatim, medium-old turns are summarized, ancient turns are reduced to key facts. This mirrors how human memory works.

Managing the context budget:

A useful mental model is a “context budget” with fixed allocations:

  • System prompt: ~2,000 tokens (fixed)
  • Tool definitions: ~1,000-3,000 tokens (fixed per tool set)
  • Working memory / injected facts: ~2,000 tokens
  • Conversation history: whatever is left before the cap
  • Current turn + buffer for response: ~4,000 tokens minimum

Track this budget explicitly in production agents. Don’t wait for an API error to tell you you’ve exceeded the window.
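
One way to track the budget explicitly is a small helper that subtracts the fixed allocations from the window and tells you when history needs compressing. This is a sketch under assumptions: the class and field names are hypothetical, the defaults are the illustrative figures from the list above, and the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Rough heuristic (~4 characters per token); swap in a real tokenizer in production."""
    return len(text) // 4


@dataclass
class ContextBudget:
    context_window: int = 200_000   # model cap (assumed; check your model's limit)
    system_prompt: int = 2_000
    tool_definitions: int = 3_000
    working_memory: int = 2_000
    response_buffer: int = 4_000

    def history_allowance(self) -> int:
        """Tokens left for conversation history after the fixed allocations."""
        fixed = (self.system_prompt + self.tool_definitions
                 + self.working_memory + self.response_buffer)
        return max(0, self.context_window - fixed)

    def needs_compression(self, messages: list[dict]) -> bool:
        """True when the estimated history size exceeds its allowance."""
        used = sum(estimate_tokens(m.get("content", "")) for m in messages)
        return used > self.history_allowance()


budget = ContextBudget(context_window=20_000)
print(budget.history_allowance())  # 9000
```

Calling needs_compression() before every request lets you trigger summarization or truncation ahead of time instead of reacting to a failed call.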

2.3 External Memory (Vector Store)

External memory stores information outside the context window and retrieves it on demand using semantic search. The most common implementation uses embeddings and a vector database.

How it works:

  1. When a fact or interaction is worth remembering, embed it and store the vector + original text in a vector DB.
  2. At query time, embed the current query and perform approximate nearest-neighbor (ANN) search.
  3. The top-K most semantically similar memories are retrieved and injected into the context.
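
The three steps above can be sketched with nothing but the standard library. Two loud assumptions: toy_embed is a bag-of-words stand-in for a real embedding model (so it only captures word overlap, not meaning), and the search is an exact brute-force scan rather than ANN. All names here are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    def __init__(self):
        self.entries: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        # Step 1: embed and store the vector alongside the original text
        self.entries.append((toy_embed(text), text))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        # Step 2: embed the query and score it against every stored vector
        q = toy_embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self.entries]
        # Step 3: return the top-K most similar memories
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("user prefers tabs over spaces")
store.add("user works at Acme Corp")
store.add("deployment requires two approvals")
print(store.search("does the user like tabs or spaces", top_k=1)[0][1])
# user prefers tabs over spaces
```

In a real system you would replace toy_embed with an embedding model and the list scan with a vector database index; the write and read paths keep the same shape.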

Write triggers — when should you save to external memory?

  • User corrects the agent (“No, I prefer tabs over spaces”)
  • User shares a personal preference or fact (“I work at Acme Corp”)
  • A task is completed that may be referenced later
  • The agent makes an inference that should be cached
  • At the end of a session (write-back pattern)

Read triggers — when should you query external memory?

  • At session start: load recent + high-importance memories into the system prompt
  • When the current query seems to reference past context (“like we discussed before”)
  • When the agent is about to perform a task it has done before
  • Proactively: when semantic similarity between query and stored memories is high

Practical considerations:

  • Embedding quality matters: use the same embedding model for write and read
  • Metadata filtering: tag memories with user_id, date, topic, etc. to enable filtered retrieval
  • Memory eviction: old or low-relevance memories should eventually be pruned
  • Deduplication: avoid storing the same fact multiple times
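
Metadata filtering and deduplication can be illustrated without any vector machinery. The sketch below is an in-memory stand-in for the metadata layer of a vector DB; the class, its method names, and the normalization rule are assumptions for illustration, and it only catches exact duplicates after case and whitespace normalization (near-duplicates would need a similarity check).

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


class MemoryStore:
    """In-memory stand-in for a vector DB's metadata layer (no embeddings here)."""

    def __init__(self):
        self.records: dict[str, dict] = {}

    @staticmethod
    def _key(user_id: str, text: str) -> str:
        # Normalize case and whitespace so trivially different copies collide
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(f"{user_id}:{normalized}".encode()).hexdigest()

    def add(self, user_id: str, text: str, topic: str) -> bool:
        """Store a memory; return False if this user already has the same text."""
        key = self._key(user_id, text)
        if key in self.records:
            return False
        self.records[key] = {
            "user_id": user_id,
            "text": text,
            "topic": topic,
            "created": datetime.now(timezone.utc),
        }
        return True

    def filter(self, user_id: str, topic: Optional[str] = None) -> list[dict]:
        """Metadata filter to apply before (or alongside) similarity search."""
        return [
            r for r in self.records.values()
            if r["user_id"] == user_id and (topic is None or r["topic"] == topic)
        ]
```

Scoping the dedup key by user_id keeps one user's facts from suppressing another's, which is the same cross-contamination concern that motivates filtered retrieval.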

2.4 Episodic Memory

Episodic memory is a timestamped log of what happened — a history of events, not just facts. Think of it as an agent’s diary.

Examples:

  • “On 2024-03-15, the user asked me to refactor the auth module. I completed it. The PR was #142.”
  • “On 2024-03-16, the user was frustrated because I gave incorrect SQL syntax for PostgreSQL.”
  • “On 2024-03-17, a deployment failed because I forgot to check the staging environment first.”

Why it matters: Episodic memory lets agents learn from experience. An agent that can recall “last time I tried approach X on task Y, it failed because Z” can avoid repeating mistakes.

Implementation approaches:

  • Simple: append to a JSONL log file, retrieve via keyword search
  • Advanced: embed episodes and retrieve via semantic search
  • Structured: store in a relational DB with time range queries
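
As a minimal sketch of the simple approach, the following appends episodes to a JSONL file and reads them back by time range instead of keyword. The function names and record fields are hypothetical, and the code assumes all timestamps are either consistently naive or consistently timezone-aware, since Python cannot compare the two.

```python
import json
from datetime import datetime
from pathlib import Path


def log_episode(path: Path, timestamp: datetime, event: str, outcome: str) -> None:
    """Append one episode as a single JSON line."""
    record = {"timestamp": timestamp.isoformat(), "event": event, "outcome": outcome}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def episodes_between(path: Path, start: datetime, end: datetime) -> list[dict]:
    """Return episodes whose timestamp falls in [start, end], oldest first."""
    if not path.exists():
        return []
    matches = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if start <= datetime.fromisoformat(record["timestamp"]) <= end:
                matches.append(record)
    return sorted(matches, key=lambda r: datetime.fromisoformat(r["timestamp"]))
```

A full file scan is fine for a diary of a few thousand entries; beyond that, the relational option with an index on the timestamp column is the natural next step.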

2.5 Semantic Memory

Semantic memory stores facts and relationships — the agent’s “knowledge base.” Unlike episodic memory, it is de-temporalized. It records what is true, not what happened when.

Examples:

  • User profile: name, role, preferences, timezone
  • Domain knowledge: “This codebase uses FastAPI and PostgreSQL”
  • Learned facts: “The user’s deployment pipeline requires two approvals before merge”

Implementation: Knowledge graphs (Neo4j, etc.) are the purest form, but most practical implementations use structured JSON files, a SQL table, or vector DB with metadata.
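
For the SQL-table option, a sketch using Python's built-in sqlite3 module might look like this. The table layout (subject, attribute, value) and the function names are assumptions chosen for illustration; the point is that writing a fact overwrites the old value, because semantic memory tracks what is currently true rather than when it changed.

```python
import sqlite3


def open_fact_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a fact table keyed by (subject, attribute)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS facts (
               subject   TEXT NOT NULL,
               attribute TEXT NOT NULL,
               value     TEXT NOT NULL,
               PRIMARY KEY (subject, attribute)
           )"""
    )
    return conn


def set_fact(conn: sqlite3.Connection, subject: str, attribute: str, value: str) -> None:
    """Insert or overwrite a fact; the previous value is not kept."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO facts (subject, attribute, value) VALUES (?, ?, ?)",
            (subject, attribute, value),
        )


def get_facts(conn: sqlite3.Connection, subject: str) -> dict[str, str]:
    """Return all known attributes for a subject as a dict."""
    rows = conn.execute(
        "SELECT attribute, value FROM facts WHERE subject = ?", (subject,)
    ).fetchall()
    return dict(rows)


conn = open_fact_store()
set_fact(conn, "user", "language", "JavaScript")
set_fact(conn, "user", "language", "Python")   # correction replaces the old value
print(get_facts(conn, "user"))  # {'language': 'Python'}
```

If you also need the history of changes, log the update to episodic memory at the same time instead of versioning this table.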

2.6 Procedural Memory

Procedural memory encodes how to do things — workflows, playbooks, templates. For agents, this often takes the form of stored prompts, system prompt snippets, or code templates.

Examples:

  • “When asked to debug Python code, always start by asking for the full traceback”
  • A stored ReAct reasoning template
  • A code generation prompt that the agent has learned works well for a particular task

Implementation: Usually stored as prompt templates in a file or database. Retrieved by task type matching.
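
Retrieval by task type can be as simple as a dictionary lookup. In this sketch the playbook texts, the task-type keys, and the get_procedure helper are all hypothetical; a production system would more likely load templates from files or a database.

```python
from string import Template

# Hypothetical playbooks keyed by task type
PLAYBOOKS: dict[str, Template] = {
    "debug_python": Template(
        "Before proposing a fix for $target, ask for the full traceback."
    ),
    "code_review": Template(
        "Review $target for correctness first, then style."
    ),
}

# Fallback used when no playbook matches the task type
DEFAULT_PLAYBOOK = Template("Work on $target step by step.")


def get_procedure(task_type: str, target: str) -> str:
    """Look up the playbook for a task type and fill in the target."""
    template = PLAYBOOKS.get(task_type, DEFAULT_PLAYBOOK)
    return template.substitute(target=target)


print(get_procedure("debug_python", "parse_invoice()"))
# Before proposing a fix for parse_invoice(), ask for the full traceback.
```

The returned string would typically be appended to the system prompt for the duration of that task.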


3. Memory Compression Strategies

When context grows too large, you must compress it without losing critical information. These strategies are not mutually exclusive — production systems combine them.

Strategy 1: Sliding Window

Keep only the last N messages. Drop anything older than the window.

def sliding_window(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """
    Keep only the last `max_turns` complete exchanges (user + assistant pairs).
    Always preserve the system message if present.
    """
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
 
    # A "turn" is one user message + one assistant message = 2 items
    max_messages = max_turns * 2
    trimmed = non_system[-max_messages:] if len(non_system) > max_messages else non_system
 
    return system_msgs + trimmed

Trade-off: Simple and predictable, but can lose critical early context (e.g., the user’s core requirements stated in turn 1).

Strategy 2: Summarization

When the window is full, compress old messages into a running summary.

def summarize_old_turns(
    client,
    messages: list[dict],
    keep_recent: int = 6,
    model: str = "claude-haiku-4-5-20251001"
) -> list[dict]:
    """
    Compress all but the last `keep_recent` messages into a summary.
    """
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
 
    if len(non_system) <= keep_recent:
        return messages  # Nothing to compress
 
    to_summarize = non_system[:-keep_recent]
    recent = non_system[-keep_recent:]
 
    # Build a prompt to summarize the old messages
    summary_prompt = "Summarize the following conversation history concisely. " \
                     "Preserve key facts, decisions, user preferences, and any " \
                     "unresolved issues:\n\n"
    for msg in to_summarize:
        summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n"
 
    summary_response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    summary_text = summary_response.content[0].text
 
    summary_message = {
        "role": "user",
        "content": f"[CONVERSATION SUMMARY — earlier messages compressed]\n{summary_text}"
    }
 
    return system_msgs + [summary_message] + recent

Trade-off: Preserves semantic content but loses detail. The summary quality depends on the summarization model. Adds one LLM call per compression step.

Strategy 3: Importance Scoring

Score each message and drop low-scoring ones first.

from typing import Optional


def importance_scored_window(
    messages: list[dict],
    max_tokens: int = 8000,
    high_value_patterns: Optional[list[str]] = None
) -> list[dict]:
    """
    Retain messages by importance score. Always keep system prompt and recent messages.
    Retained messages are returned in their original chronological order.
    """
    if high_value_patterns is None:
        high_value_patterns = [
            "my name is", "i prefer", "always", "never",
            "requirement", "must", "deadline", "error", "failed"
        ]

    def score(msg: dict) -> float:
        content = msg.get("content", "").lower()
        base = 1.0
        # Recency bonus would normally be computed by index
        for pattern in high_value_patterns:
            if pattern in content:
                base += 2.0
        return base

    def estimate_tokens(msg: dict) -> int:
        # Rough heuristic (~4 characters per token).
        # In production, use a real token counter.
        return len(msg.get("content", "")) // 4

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # Always keep the last 4 messages
    pinned = non_system[-4:]
    candidates = non_system[:-4]

    # Rank candidate positions by score, highest first
    ranked = sorted(range(len(candidates)),
                    key=lambda i: score(candidates[i]), reverse=True)

    # Add back by importance until we approach the token budget
    estimated_tokens = sum(estimate_tokens(m) for m in pinned)
    kept_indices = []
    for i in ranked:
        msg_tokens = estimate_tokens(candidates[i])
        if estimated_tokens + msg_tokens < max_tokens:
            kept_indices.append(i)
            estimated_tokens += msg_tokens
        else:
            break

    # Restore chronological order: inserting by score would scramble the conversation
    kept = [candidates[i] for i in sorted(kept_indices)]
    return system_msgs + kept + pinned

Strategy 4: Hierarchical Compression

Maintain three tiers of memory at different compression levels.

class HierarchicalMemory:
    """
    Three tiers:
    - recent: last N turns, verbatim
    - summary: compressed representation of older turns
    - key_facts: extracted atomic facts from oldest turns
    """
    def __init__(self, recent_turns: int = 4):
        self.recent: list[dict] = []
        self.summary: str = ""
        self.key_facts: list[str] = []
        self.recent_turns = recent_turns
 
    def to_messages(self) -> list[dict]:
        """Reconstruct a messages array for injection into context."""
        injected = []
        if self.key_facts:
            injected.append({
                "role": "user",
                "content": "[KEY FACTS FROM EARLIER]\n" + "\n".join(f"- {f}" for f in self.key_facts)
            })
        if self.summary:
            injected.append({
                "role": "user",
                "content": f"[EARLIER CONVERSATION SUMMARY]\n{self.summary}"
            })
        return injected + self.recent
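
The class above shows how the tiers are read back, but not how messages move between them. One possible write path is sketched below. TieredMemory, naive_summarize, and remember_fact are hypothetical names, and naive_summarize is only a placeholder for an LLM summarization call so that the example runs offline.

```python
from typing import Callable


def naive_summarize(previous_summary: str, messages: list[dict]) -> str:
    """Placeholder for an LLM call: appends truncated messages to the old summary."""
    lines = [f"{m['role']}: {m['content'][:80]}" for m in messages]
    return "\n".join(filter(None, [previous_summary, *lines]))


class TieredMemory:
    def __init__(
        self,
        recent_turns: int = 4,
        summarize: Callable[[str, list[dict]], str] = naive_summarize,
    ):
        self.recent: list[dict] = []
        self.summary: str = ""
        self.key_facts: list[str] = []
        self.max_recent = recent_turns * 2  # one turn = user + assistant message
        self.summarize = summarize

    def add(self, message: dict) -> None:
        """Append a message; demote any overflow from `recent` into `summary`."""
        self.recent.append(message)
        overflow = len(self.recent) - self.max_recent
        if overflow > 0:
            demoted, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarize(self.summary, demoted)

    def remember_fact(self, fact: str) -> None:
        """Record an atomic fact once (fact extraction is left to the caller)."""
        if fact not in self.key_facts:
            self.key_facts.append(fact)
```

Demoting one message at a time keeps the sketch short but means an LLM-backed summarizer would be called on every overflow; batching demotions (for example, a whole turn at a time) is a reasonable refinement.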

4. Memory Write Patterns

Knowing when to persist memory is as important as knowing how.

Write-Through

Save every interaction to persistent storage in real time, as it happens.

def write_through(store: list, message: dict) -> None:
    """Persist immediately on every message."""
    store.append(message)
    persist_to_disk(store)  # e.g., append to JSONL file

Pros: Never lose data, even on crash.
Cons: High I/O, can become a bottleneck at scale.
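
The snippet above leaves the persistence helper undefined. As one possible shape for it, the sketch below appends only the new message to a JSONL file and forces it to disk before returning. The names append_message and load_messages are hypothetical, and this variant takes a single message rather than the whole store so that each write stays small.

```python
import json
import os
from pathlib import Path


def append_message(path: Path, message: dict) -> None:
    """Append one message as a JSON line and push it to disk before returning."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")
        f.flush()              # flush Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write it to the storage device


def load_messages(path: Path) -> list[dict]:
    """Rebuild the message list after a restart."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The fsync call is where most of the I/O cost comes from; dropping it trades some crash safety for throughput.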

Write-Back

Save at the end of the session.

class SessionMemory:
    def __init__(self):
        self.buffer: list[dict] = []
 
    def add(self, message: dict) -> None:
        self.buffer.append(message)  # in-memory only
 
    def flush(self) -> None:
        """Call when session ends."""
        persist_to_disk(self.buffer)
        self.buffer.clear()

Pros: Minimal I/O during session.
Cons: Lose all data if the process crashes before flush().
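
You can narrow the crash window by making the flush automatic. The sketch below wraps a write-back buffer in a context manager so the flush runs even when the session body raises an exception. BufferedSession, session, and the sink callable are hypothetical names; this does not help with a hard kill or power loss, where no cleanup code runs at all.

```python
from contextlib import contextmanager


class BufferedSession:
    """Write-back buffer; `sink` is any callable that persists a list of messages."""

    def __init__(self, sink):
        self.buffer: list[dict] = []
        self.sink = sink

    def add(self, message: dict) -> None:
        self.buffer.append(message)  # in-memory only

    def flush(self) -> None:
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


@contextmanager
def session(sink):
    """Yield a buffer and flush it on exit, including when the body raises."""
    mem = BufferedSession(sink)
    try:
        yield mem
    finally:
        mem.flush()


saved: list[dict] = []
with session(saved.extend) as mem:
    mem.add({"role": "user", "content": "hi"})
print(len(saved))  # 1
```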

Selective Write

Only save when certain conditions are met — a preference was expressed, a fact was learned, the user made a correction.

def selective_write(message: dict, memory_store: list) -> bool:
    """Return True if message was saved."""
    content = message.get("content", "").lower()
    save_triggers = [
        "i prefer", "my name is", "always use", "never use",
        "actually,", "that's wrong", "remember that"
    ]
    if any(trigger in content for trigger in save_triggers):
        memory_store.append({
            "content": message["content"],
            "timestamp": message.get("timestamp"),
            "type": "preference_or_fact"
        })
        return True
    return False

Debounced Write

Batch writes to avoid hammering storage on every keystroke or token.

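import time

class DebouncedMemory:
    def __init__(self, delay_seconds: float = 2.0):
        self.buffer: list[dict] = []
        # monotonic() is unaffected by system clock adjustments
        self.last_write = time.monotonic()
        self.delay = delay_seconds

    def add(self, message: dict) -> None:
        self.buffer.append(message)
        # Write at most once per `delay` seconds; messages in between are batched
        if time.monotonic() - self.last_write > self.delay:
            self.flush()

    def flush(self) -> None:
        """Call this at session end too: nothing flushes the final batch automatically."""
        if self.buffer:
            persist_to_disk(self.buffer)
            self.buffer.clear()
        self.last_write = time.monotonic()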

5. Memory Read Patterns

How and when you retrieve memory is just as important as how you store it.

Load at Session Start

At session initialization, query external memory and inject the most relevant facts into the system prompt. The agent has this context from turn 1.

def build_system_prompt(user_id: str, memory_store) -> str:
    memories = memory_store.get_recent(user_id, limit=10)
    memory_block = "\n".join(f"- {m['content']}" for m in memories)
    return f"""You are a helpful assistant.
 
Known facts about this user:
{memory_block}
 
Use this context naturally in the conversation."""

Pros: Agent always has baseline context. No retrieval latency mid-conversation.
Cons: Adds tokens to every call. May include stale or irrelevant facts.

Lazy Retrieval

Only fetch from external memory when the query explicitly or implicitly demands it.

def maybe_retrieve_memory(query: str, memory_store) -> list[str]:
    """Only retrieve if the query looks like it references past context."""
    past_references = ["before", "last time", "earlier", "you mentioned", "we discussed", "remember"]
    # Plain substring match: cheap, but "before" also fires on unrelated uses of the word
    if any(ref in query.lower() for ref in past_references):
        return memory_store.search(query, top_k=3)
    return []

Pros: Saves tokens and latency on most turns.
Cons: May miss relevant memories when the user does not explicitly signal they’re referencing the past.

Proactive Retrieval

Retrieve memory on every turn based on semantic similarity, not just explicit past-references.

async def proactive_retrieve(
    query: str, memory_store, top_k: int = 3, threshold: float = 0.70
) -> list[dict]:
    """
    Always run a semantic search. Keep only results whose similarity exceeds `threshold`.
    """
    results = await memory_store.semantic_search(query, top_k=top_k)
    return [r for r in results if r["similarity"] > threshold]

Pros: Catches relevant memories even when the user gives no explicit signal that they are referring to the past.
Cons: Extra latency and cost on every turn. Risk of injecting irrelevant noise if threshold is too low.


6. Claude Code Memory System as Reference Implementation

Claude Code ships with a memory system that is worth studying as a reference implementation. Understanding it gives you design patterns you can apply to your own systems.

Directory Layout

~/.claude/
└── projects/
    └── <project-hash>/
        └── memory/
            ├── MEMORY.md          # index file listing all memory entries
            ├── user/              # facts about the user
            ├── feedback/          # corrections and preferences
            ├── project/           # project-specific facts
            └── reference/         # long-term reference material

Memory Types

  • user: Personal facts — name, role, communication style preferences.
  • feedback: Corrections and preferences — “don’t use semicolons in Python”, “always explain your reasoning”.
  • project: Project-scoped facts — tech stack, conventions, architecture decisions.
  • reference: External reference material the agent should have on hand.

MEMORY.md as an Index

MEMORY.md serves as a searchable index. It contains frontmatter metadata for each memory entry plus a brief summary. When the agent needs memory, it reads the index first to decide which entries to fully load.

---
type: feedback
id: f001
created: 2024-03-15
tags: [code-style, python]
---
User prefers snake_case and dislikes semicolons in Python.

The Two-Step Write Pattern

When new information worth saving is detected, Claude Code does not simply drop it into a new file and move on. It follows a two-step pattern:

  1. Write the memory entry to the appropriate subdirectory with full content and frontmatter.
  2. Update MEMORY.md to add the new entry to the index.

This ensures the index stays consistent. If the process is interrupted after step 1 but before step 2, the orphaned file can be detected on next startup by comparing directory contents to the index.
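
To see the pattern and the orphan check in miniature, consider the following sketch. It is not Claude Code's implementation: the one-line-per-entry index format, the function names, and the file naming are assumptions made for illustration only.

```python
from pathlib import Path

INDEX_NAME = "MEMORY.md"


def write_memory(root: Path, mem_type: str, entry_id: str, body: str) -> Path:
    """Step 1: write the entry file. Step 2: add it to the index."""
    entry_path = root / mem_type / f"{entry_id}.md"
    entry_path.parent.mkdir(parents=True, exist_ok=True)
    entry_path.write_text(body, encoding="utf-8")            # step 1

    rel = entry_path.relative_to(root).as_posix()
    with (root / INDEX_NAME).open("a", encoding="utf-8") as index:
        index.write(f"- {rel}\n")                             # step 2
    return entry_path


def find_orphans(root: Path) -> list[str]:
    """Entry files on disk that the index does not mention."""
    index_path = root / INDEX_NAME
    indexed = set()
    if index_path.exists():
        for line in index_path.read_text(encoding="utf-8").splitlines():
            if line.startswith("- "):
                indexed.add(line[2:].strip())
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*.md")
        if p.name != INDEX_NAME
    }
    return sorted(on_disk - indexed)
```

Running find_orphans() at startup gives you the list of entries to re-index (or delete) after an interrupted write.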

Lessons for Your Own Systems

  1. Separate your index from your data. The index (MEMORY.md) should be cheap to read and tell you whether a full fetch is needed. Data files can be large.
  2. Use frontmatter metadata. Type, timestamp, tags, and source let you filter without reading full content.
  3. Make the two-step pattern atomic with a lock or transaction. In a distributed system, use a database transaction. In a single-process system, a file lock suffices.
  4. Partition by memory type. User facts and project facts have different eviction policies and different audiences. Keep them separate.
  5. Keep an eviction strategy. Without pruning, memory grows unboundedly. Define TTLs or a max count per type.
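
Lessons 4 and 5 combine naturally: once memories are partitioned by type, each type can carry its own eviction policy. The sketch below applies a TTL and a max count per type. The policy values, the POLICIES table, and the entry fields are hypothetical examples, not recommendations.

```python
from datetime import datetime, timedelta

# Hypothetical per-type policies: (time-to-live or None, max entries)
POLICIES = {
    "user": (None, 100),
    "feedback": (timedelta(days=180), 200),
    "project": (timedelta(days=90), 500),
}
DEFAULT_POLICY = (None, 1000)  # applied to types without an explicit policy


def evict(entries: list[dict], now: datetime) -> list[dict]:
    """
    Apply each type's TTL, then cap its count by keeping the newest entries.
    Each entry needs a "type" key and a "created" key holding a datetime.
    """
    kept: list[dict] = []
    for mem_type in sorted({e["type"] for e in entries}):
        ttl, max_count = POLICIES.get(mem_type, DEFAULT_POLICY)
        of_type = [e for e in entries if e["type"] == mem_type]
        if ttl is not None:
            of_type = [e for e in of_type if now - e["created"] <= ttl]
        of_type.sort(key=lambda e: e["created"], reverse=True)  # newest first
        kept.extend(of_type[:max_count])
    return kept
```

Passing `now` in as a parameter, rather than calling datetime.now() inside, keeps the function deterministic and easy to test.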

7. Interview Flashcards

Q1: What are the different types of agent memory?

A: There are six main types:

  1. In-context (scratchpad): everything currently in the LLM’s prompt window. Zero latency, most reliable, but limited by context window size and cost.
  2. Conversation history: the multi-turn messages array. Managed through truncation strategies like sliding windows, summarization, and importance scoring.
  3. External memory (vector store): facts and past interactions stored in a vector DB, retrieved via semantic search. Enables cross-session memory.
  4. Episodic memory: timestamped log of events (what happened, when). Enables the agent to learn from past experience.
  5. Semantic memory: de-temporalized facts and relationships (what is true). User profiles, domain knowledge, preferences.
  6. Procedural memory: how-to knowledge, stored as prompts or templates.

Q2: How do you handle context window limits in a long-running conversation?

A: Several complementary strategies exist:

  • Sliding window: Keep only the last N messages. Simple but loses early context.
  • Summarization: When the window fills, compress older messages into a summary using an LLM call. Preserves semantic content at lower token cost.
  • Importance scoring: Assign relevance scores to each message; drop low-scoring ones first. More sophisticated but requires scoring logic.
  • Hierarchical compression: Maintain three tiers — recent messages verbatim, medium-old as summaries, oldest as key facts.
  • External memory offload: Move older context to a vector store and retrieve it on demand rather than carrying it in every call.

In production, track your context budget explicitly (system prompt + tools + history + buffer) and trigger compression proactively rather than reactively.


Q3: What is the difference between episodic and semantic memory?

A: The distinction comes from cognitive psychology applied to agents:

Episodic memory is event-based and temporal — it records what happened when. Example: “On March 15, the user asked me to refactor the auth module. I completed it. The PR was #142.” It preserves the narrative of past interactions.

Semantic memory is fact-based and atemporal — it records what is true, independent of when it was learned. Example: “The user’s name is Alex. They work at Acme Corp. They prefer Python.” It is the agent’s knowledge base.

In implementation terms: episodic memory is typically an append-only log retrieved by time range or event type. Semantic memory is a structured store (JSON, SQL, knowledge graph) retrieved by entity or topic.


Q4: When should memory be written eagerly vs lazily?

A: It depends on the importance and expected reuse of the information.

Write eagerly (write-through) when:

  • Information is unique and non-reproducible (a user preference expressed once)
  • Crash recovery is important
  • The information is clearly high-value (a correction, a new fact)

Write lazily (write-back or selective) when:

  • You need to minimize I/O overhead during a session
  • Most information is transient and not worth persisting
  • You can afford to lose the session’s data if the process crashes

Best practice: Use selective writes during a session (only persist trigger events like preferences and corrections) and a write-back flush at session end for anything else worth keeping. Avoid write-through for every single token — it creates unnecessary load.


Q5: How do you implement a sliding window memory with summarization fallback?

A: The pattern is:

  1. After each turn, check if len(messages) > MAX_WINDOW.
  2. If not exceeded, do nothing — just append the new messages.
  3. If exceeded, check if the oldest messages can be summarized (they are “old enough” and not already a summary).
  4. Call the LLM to summarize the oldest half of the non-system messages.
  5. Replace those messages with a single summary message.
  6. Continue.

Key implementation detail: the summary message should be clearly labeled (e.g., a system or user message prefixed with [SUMMARY OF EARLIER CONVERSATION]) so the model does not confuse it with a live user turn. Also preserve any system messages separately — they should not be summarized away.
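
A compact version of this pattern, written so it runs offline, could look like the sketch below. The compact function, the SUMMARY_TAG constant, and fake_summarize are hypothetical; in a real agent the summarize argument would wrap an LLM call. This sketch takes the simplest approach to an existing summary: if one falls inside the oldest half, it is folded into the new summary.

```python
from typing import Callable

SUMMARY_TAG = "[SUMMARY OF EARLIER CONVERSATION]"


def compact(
    messages: list[dict],
    max_window: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """
    Return `messages` unchanged while within `max_window`; otherwise replace
    the oldest half of the non-system messages with one labeled summary.
    """
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    if len(non_system) <= max_window:
        return messages

    half = len(non_system) // 2
    oldest, rest = non_system[:half], non_system[half:]
    summary = {"role": "user", "content": f"{SUMMARY_TAG}\n{summarize(oldest)}"}
    # System messages are preserved separately and never summarized
    return system_msgs + [summary] + rest


def fake_summarize(msgs: list[dict]) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return f"{len(msgs)} earlier messages omitted"
```

Call compact() after each turn; because the summary counts as a single message, the list shrinks by roughly half each time the window overflows.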


Q6: How does vector-based memory retrieval work?

A: Vector-based retrieval has three components:

  1. Embedding: Convert text (a memory entry or a query) into a high-dimensional vector using an embedding model. Texts that are semantically similar map to nearby points in vector space.

  2. Storage: Store the vectors alongside their original text and metadata in a vector database (e.g., Pinecone, Qdrant, Chroma, pgvector).

  3. Retrieval: At query time, embed the query, then perform approximate nearest-neighbor (ANN) search to find the K stored vectors closest to the query vector. Return their original text.

The key insight is that “closest” in vector space means “most semantically similar,” not “most lexically similar.” This lets you retrieve “the user prefers concise answers” when the query is “how should I format my response?” — even though none of those words appear in the stored memory.

Practical considerations: use the same embedding model for writing and reading; add metadata filters (user_id, date range, memory type) to avoid cross-contamination; tune the similarity threshold to balance recall vs. noise.


What’s Next

  • Work through the examples in examples/ to see sliding window memory and external JSON-based memory in action.
  • Complete the exercises in exercises/README.md.
  • See references.md for deeper reading on MemGPT and production memory architectures.