Module 05: Memory — Exercises
Work through these exercises in order. They build on each other. Each exercise has a stated objective, a set of tasks, and a “stretch goal” for going deeper.
Exercise 1: Implement Importance-Scored Sliding Window
Objective: Replace the basic sliding window in conversation_memory.py with an importance-scored version that retains high-value messages even when they are older.
Background: The naive sliding window drops messages purely by age. This is fine for casual chat but breaks for agents that must remember instructions stated early in a conversation (“Always respond in Spanish”, “The project deadline is March 15”).
Tasks:
-
Define a scoring function
score_message(msg: dict) -> floatthat assigns higher scores to messages containing:- User preferences (keywords: “prefer”, “always”, “never”, “must”)
- Corrections (keywords: “actually”, “that’s wrong”, “I meant”)
- Explicit facts (keywords: “my name is”, “I am”, “I work at”)
- Long messages (more content likely means more information)
-
Modify the memory management in
chat()so that instead of a pure sliding window, you:- Always keep the last 4 messages (recency bias)
- Score all remaining messages
- Fill remaining budget with highest-scoring messages
- Stay within a
MAX_TOKENSbudget (estimate: 4 characters per token)
-
Write a test: create a messages list where turn 1 contains “My name is Alex and I prefer Python”, and turns 2-15 are filler. Verify that after applying your function, the first message is retained.
Stretch goal: Add a decay factor so that importance scores decay over time. A preference stated 50 turns ago should score slightly lower than the same preference stated 5 turns ago (but still higher than filler).
Exercise 2: Build a Simple Vector Memory Without a Vector DB
Objective: Implement semantic memory retrieval using only Python’s standard library — no vector database required. This exercise teaches you what vector search is doing under the hood.
Background: Real vector search uses high-dimensional embeddings and approximate nearest-neighbor algorithms. But the core intuition is simple: represent text as a bag-of-words vector and compute cosine similarity.
Tasks:
-
Implement
text_to_vector(text: str) -> dict[str, float]:- Tokenize the text (lowercase, split on non-alphanumeric characters)
- Remove stopwords (a provided list is fine)
- Return a dictionary mapping each token to its term frequency (TF): count of occurrences / total tokens
-
Implement
cosine_similarity(vec_a: dict, vec_b: dict) -> float:- Compute the dot product: sum of (a[k] * b[k]) for k in both vectors
- Compute magnitudes: sqrt(sum of v^2) for each vector
- Return dot_product / (magnitude_a * magnitude_b), handle division by zero
-
Build a
TFMemoryStoreclass with:save(content: str, metadata: dict): store the content, metadata, and its TF vectorsearch(query: str, top_k: int) -> list[dict]: compute similarity to all stored entries, return top_k
-
Demo: save 10 facts (e.g., tech stack preferences, project details), then query with different phrasings and observe which facts are retrieved.
Stretch goal: Implement TF-IDF instead of plain TF. IDF (inverse document frequency) down-weights common words across all entries, improving retrieval precision. Formula: tfidf(t, d) = tf(t, d) * log(N / df(t)) where N is total documents and df(t) is documents containing term t.
Exercise 3: Multi-Tier Hierarchical Memory
Objective: Implement the HierarchicalMemory class from the README and integrate it into a working chat loop.
Background: The hierarchical approach is used in systems like MemGPT. Recent messages stay verbatim, older messages get summarized, oldest messages are distilled to key facts. This mirrors human memory and delivers the best cost/quality trade-off for long-running agents.
Tasks:
-
Complete the
HierarchicalMemoryclass with:add_turn(user_msg: str, assistant_msg: str): add a new turn torecentcompress(): whenrecentgrows beyondNturns, move the oldest turn tosummary(summarize it with an LLM call) or, if the summary is already large, extract key facts from the summary intokey_factsto_context() -> list[dict]: serialize all three tiers into a messages array suitable for injection
-
The
to_context()output should be ordered: key_facts injection → summary injection → recent verbatim messages. -
Build a chat loop that uses
HierarchicalMemoryand prints which tier each message is in after each turn. -
Test with a 20-turn conversation and verify:
- Facts from the first turn appear in
key_factsby turn 20 - The agent can correctly answer “What’s my name?” even when that information is in
key_facts
- Facts from the first turn appear in
Stretch goal: Make the compression policy configurable. Support “compress on token limit” (trigger when estimated token count exceeds a threshold) in addition to “compress on turn count”.
Exercise 4: Session Boundary Memory with Write-Back
Objective: Build a memory system that persists state across Python process restarts — simulating what production agents need for multi-session continuity.
Background: Every time a user starts a new session with an agent, the in-context messages array is empty. External memory bridges this gap: at session start you load relevant memories; at session end you flush new information.
Tasks:
-
Define a
Sessionclass that:- On
__init__: loads recent memories from a JSON file (or creates the file if missing); builds a system prompt injecting those memories - During operation: uses a selective-write strategy to save high-value messages immediately (preferences, corrections, explicit facts)
- On
close(): runs a final extraction pass on the full session to catch anything the selective write missed; writes to file with deduplication
- On
-
The JSON file schema should include:
{ "version": 1, "entries": [ { "id": "...", "content": "...", "category": "fact|preference|correction", "created_at": "ISO-8601 timestamp", "session_id": "...", "keywords": ["..."] } ] } -
Write a test script
test_session_persistence.pythat:- Creates Session A, has a short conversation sharing facts
- Closes Session A (triggering flush)
- Creates Session B (new Python process simulated by re-instantiating)
- Verifies Session B can answer questions about information from Session A
Stretch goal: Add a “memory freshness” system: memories older than 30 days get a staleness flag, and the agent asks to confirm whether they’re still accurate on next use.
Exercise 5: Interview Simulation — Design a Personal Assistant Memory System
Objective: Practice designing a complete memory system under interview conditions. This is an open-ended system design question commonly asked for senior/staff AI engineer roles.
Prompt:
You are designing the memory system for a personal AI assistant that users interact with daily. The assistant needs to:
- Remember user preferences across sessions
- Recall specific conversations from up to 6 months ago
- Know what tasks were completed and what is outstanding
- Understand the user’s current projects and context
- Operate within a 100K token context window
- Serve 10,000 users, each with up to 10,000 conversation turns in history
Design the memory system. Consider: storage, retrieval, write strategy, cost, latency, and privacy.
How to approach this exercise:
Work through the following questions (write your answers in a file or speak them aloud to a study partner):
-
Requirements clarification (5 minutes):
- What is the P50/P99 latency budget for a response?
- Do users ever share sessions (multi-user context)?
- Are there regulatory constraints on storing conversation data?
-
Storage layer (10 minutes):
- What storage technologies would you use for each memory type?
- How would you partition data by user?
- What is your retention and eviction policy?
-
Retrieval design (10 minutes):
- How does the session start? What gets injected into the system prompt?
- How do you decide what to retrieve mid-conversation?
- How do you handle retrieval latency (pre-fetch vs. lazy)?
-
Write strategy (5 minutes):
- Write-through, write-back, or selective? Justify.
- What is your deduplication strategy?
-
Scaling and cost (5 minutes):
- What is the cost per user per month (estimate)?
- Where are the hot spots?
- What would you do first if you needed to reduce costs by 50%?
-
Failure modes (5 minutes):
- What happens if the vector DB is unavailable?
- What if memory retrieval returns irrelevant results?
- How do you detect and fix “memory poisoning” (incorrect facts saved)?
Evaluation rubric:
| Dimension | Strong Answer | Weak Answer |
|---|---|---|
| Memory taxonomy | Uses all 5 types correctly | Only mentions “save the chat history” |
| Retrieval | Describes hybrid: inject at start + lazy mid-conversation | Says “always load everything” |
| Write strategy | Selective + write-back hybrid with justification | Write-through everything |
| Cost awareness | Estimates tokens, proposes compression | No mention of cost |
| Failure handling | Graceful degradation described | No failure mode discussion |
Deliverable: Write a 1-2 page design document covering all five areas above. Time yourself: you have 35 minutes, as in a real interview.