Module 05: Memory — Exercises

Work through these exercises in order. They build on each other. Each exercise has a stated objective, a set of tasks, and a “stretch goal” for going deeper.


Exercise 1: Implement Importance-Scored Sliding Window

Objective: Replace the basic sliding window in conversation_memory.py with an importance-scored version that retains high-value messages even when they are older.

Background: The naive sliding window drops messages purely by age. This is fine for casual chat but breaks for agents that must remember instructions stated early in a conversation (“Always respond in Spanish”, “The project deadline is March 15”).

Tasks:

  1. Define a scoring function score_message(msg: dict) -> float that assigns higher scores to messages containing:

    • User preferences (keywords: “prefer”, “always”, “never”, “must”)
    • Corrections (keywords: “actually”, “that’s wrong”, “I meant”)
    • Explicit facts (keywords: “my name is”, “I am”, “I work at”)
    • Long messages (more content likely means more information)
  2. Modify the memory management in chat() so that instead of a pure sliding window, you:

    • Always keep the last 4 messages (recency bias)
    • Score all remaining messages
    • Fill remaining budget with highest-scoring messages
    • Stay within a MAX_TOKENS budget (estimate: 4 characters per token)
  3. Write a test: create a messages list where turn 1 contains “My name is Alex and I prefer Python”, and turns 2-15 are filler. Verify that after applying your function, the first message is retained.

Stretch goal: Add a decay factor so that importance scores decay over time. A preference stated 50 turns ago should score slightly lower than the same preference stated 5 turns ago (but still higher than filler).


Exercise 2: Build a Simple Vector Memory Without a Vector DB

Objective: Implement semantic memory retrieval using only Python’s standard library — no vector database required. This exercise teaches you what vector search is doing under the hood.

Background: Real vector search uses high-dimensional embeddings and approximate nearest-neighbor algorithms. But the core intuition is simple: represent text as a bag-of-words vector and compute cosine similarity.

Tasks:

  1. Implement text_to_vector(text: str) -> dict[str, float]:

    • Tokenize the text (lowercase, split on non-alphanumeric characters)
    • Remove stopwords (a provided list is fine)
    • Return a dictionary mapping each token to its term frequency (TF): count of occurrences / total tokens
  2. Implement cosine_similarity(vec_a: dict, vec_b: dict) -> float:

    • Compute the dot product: sum of (a[k] * b[k]) for k in both vectors
    • Compute magnitudes: sqrt(sum of v^2) for each vector
    • Return dot_product / (magnitude_a * magnitude_b), handle division by zero
  3. Build a TFMemoryStore class with:

    • save(content: str, metadata: dict): store the content, metadata, and its TF vector
    • search(query: str, top_k: int) -> list[dict]: compute similarity to all stored entries, return top_k
  4. Demo: save 10 facts (e.g., tech stack preferences, project details), then query with different phrasings and observe which facts are retrieved.

Stretch goal: Implement TF-IDF instead of plain TF. IDF (inverse document frequency) down-weights common words across all entries, improving retrieval precision. Formula: tfidf(t, d) = tf(t, d) * log(N / df(t)) where N is total documents and df(t) is documents containing term t.


Exercise 3: Multi-Tier Hierarchical Memory

Objective: Implement the HierarchicalMemory class from the README and integrate it into a working chat loop.

Background: The hierarchical approach is used in systems like MemGPT. Recent messages stay verbatim, older messages get summarized, oldest messages are distilled to key facts. This mirrors human memory and delivers the best cost/quality trade-off for long-running agents.

Tasks:

  1. Complete the HierarchicalMemory class with:

    • add_turn(user_msg: str, assistant_msg: str): add a new turn to recent
    • compress(): when recent grows beyond N turns, move the oldest turn to summary (summarize it with an LLM call) or, if the summary is already large, extract key facts from the summary into key_facts
    • to_context() -> list[dict]: serialize all three tiers into a messages array suitable for injection
  2. The to_context() output should be ordered: key_facts injection → summary injection → recent verbatim messages.

  3. Build a chat loop that uses HierarchicalMemory and prints which tier each message is in after each turn.

  4. Test with a 20-turn conversation and verify:

    • Facts from the first turn appear in key_facts by turn 20
    • The agent can correctly answer “What’s my name?” even when that information is in key_facts

Stretch goal: Make the compression policy configurable. Support “compress on token limit” (trigger when estimated token count exceeds a threshold) in addition to “compress on turn count”.


Exercise 4: Session Boundary Memory with Write-Back

Objective: Build a memory system that persists state across Python process restarts — simulating what production agents need for multi-session continuity.

Background: Every time a user starts a new session with an agent, the in-context messages array is empty. External memory bridges this gap: at session start you load relevant memories; at session end you flush new information.

Tasks:

  1. Define a Session class that:

    • On __init__: loads recent memories from a JSON file (or creates the file if missing); builds a system prompt injecting those memories
    • During operation: uses a selective-write strategy to save high-value messages immediately (preferences, corrections, explicit facts)
    • On close(): runs a final extraction pass on the full session to catch anything the selective write missed; writes to file with deduplication
  2. The JSON file schema should include:

    {
      "version": 1,
      "entries": [
        {
          "id": "...",
          "content": "...",
          "category": "fact|preference|correction",
          "created_at": "ISO-8601 timestamp",
          "session_id": "...",
          "keywords": ["..."]
        }
      ]
    }
  3. Write a test script test_session_persistence.py that:

    • Creates Session A, has a short conversation sharing facts
    • Closes Session A (triggering flush)
    • Creates Session B (new Python process simulated by re-instantiating)
    • Verifies Session B can answer questions about information from Session A

Stretch goal: Add a “memory freshness” system: memories older than 30 days get a staleness flag, and the agent asks to confirm whether they’re still accurate on next use.


Exercise 5: Interview Simulation — Design a Personal Assistant Memory System

Objective: Practice designing a complete memory system under interview conditions. This is an open-ended system design question commonly asked for senior/staff AI engineer roles.

Prompt:

You are designing the memory system for a personal AI assistant that users interact with daily. The assistant needs to:

  • Remember user preferences across sessions
  • Recall specific conversations from up to 6 months ago
  • Know what tasks were completed and what is outstanding
  • Understand the user’s current projects and context
  • Operate within a 100K token context window
  • Serve 10,000 users, each with up to 10,000 conversation turns in history

Design the memory system. Consider: storage, retrieval, write strategy, cost, latency, and privacy.

How to approach this exercise:

Work through the following questions (write your answers in a file or speak them aloud to a study partner):

  1. Requirements clarification (5 minutes):

    • What is the P50/P99 latency budget for a response?
    • Do users ever share sessions (multi-user context)?
    • Are there regulatory constraints on storing conversation data?
  2. Storage layer (10 minutes):

    • What storage technologies would you use for each memory type?
    • How would you partition data by user?
    • What is your retention and eviction policy?
  3. Retrieval design (10 minutes):

    • How does the session start? What gets injected into the system prompt?
    • How do you decide what to retrieve mid-conversation?
    • How do you handle retrieval latency (pre-fetch vs. lazy)?
  4. Write strategy (5 minutes):

    • Write-through, write-back, or selective? Justify.
    • What is your deduplication strategy?
  5. Scaling and cost (5 minutes):

    • What is the cost per user per month (estimate)?
    • Where are the hot spots?
    • What would you do first if you needed to reduce costs by 50%?
  6. Failure modes (5 minutes):

    • What happens if the vector DB is unavailable?
    • What if memory retrieval returns irrelevant results?
    • How do you detect and fix “memory poisoning” (incorrect facts saved)?

Evaluation rubric:

DimensionStrong AnswerWeak Answer
Memory taxonomyUses all 5 types correctlyOnly mentions “save the chat history”
RetrievalDescribes hybrid: inject at start + lazy mid-conversationSays “always load everything”
Write strategySelective + write-back hybrid with justificationWrite-through everything
Cost awarenessEstimates tokens, proposes compressionNo mention of cost
Failure handlingGraceful degradation describedNo failure mode discussion

Deliverable: Write a 1-2 page design document covering all five areas above. Time yourself: you have 35 minutes, as in a real interview.