Project 1: Personal Knowledge Assistant

A command-line RAG system that indexes your personal notes/docs folder and lets you query it conversationally using Claude.


What It Does

  • ingest.py walks a directory, chunks .txt/.md/.pdf files, embeds them, and stores in a local Chroma vector database. Idempotent: files are hashed and skipped if already indexed.
  • query.py starts an interactive REPL. You type a question, the system retrieves the 5 most relevant chunks, and streams a Claude-powered answer. Conversation history (last 10 turns) is maintained for follow-up questions.

Architecture

Your docs folder
      |
      v
  ingest.py
  |-- Walk directory (.txt / .md / .pdf)
  |-- Chunk text (~500 tokens, 50-token overlap)
  |-- Hash check -> skip if already indexed
  |-- Embed with OpenAI text-embedding-3-small (or sentence-transformers fallback)
  `-- Store in Chroma DB (./chroma_db)

  query.py  <--  you type a question
  |-- Embed query
  |-- Retrieve top-5 chunks from Chroma
  |-- Build prompt (system + context + conversation history + query)
  |-- Stream response from Claude (claude-haiku-4-5-20251001)
  `-- Append turn to conversation history -> loop

Skills Covered

ModuleConcept
02 — RAGDocument chunking, embedding, vector retrieval, context injection
03 — AgentsConversational REPL loop
05 — MemoryRolling conversation history (last 10 turns)
09 — ProductionIdempotent ingestion, env-var config, graceful error handling

Setup

1. Install dependencies

cd projects/01-personal-knowledge-assistant
pip install -r requirements.txt

2. Create a .env file

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...      # Optional: falls back to sentence-transformers if omitted
DOCS_DIR=./docs            # Directory to index  (default: ./docs)
CHROMA_DIR=./chroma_db     # Vector DB location  (default: ./chroma_db)
COLLECTION_NAME=knowledge  # Chroma collection   (default: knowledge)

If OPENAI_API_KEY is not set the system falls back to sentence-transformers/all-MiniLM-L6-v2 (local, no API key needed).

3. Add your documents

mkdir docs
cp ~/Documents/notes/*.md docs/
cp ~/Downloads/some-paper.pdf docs/

Nested folder structures are fine — ingest.py walks recursively.


Usage

Index your documents

python ingest.py
Using OpenAI embeddings (text-embedding-3-small)
Scanning: ./docs
  [ 1/12] notes/project-alpha.md       ->  8 chunks
  [ 2/12] notes/meeting-2025-03.md     ->  0 chunks  (skipped: already indexed)
  [12/12] papers/attention-is-all.pdf  -> 24 chunks
Indexed 47 new chunks from 10 documents (2 skipped, already up to date)

Query interactively

python query.py
Knowledge Assistant ready. Type 'quit' to exit.
Loaded 47 chunks from collection 'knowledge'.

You: What did we decide in the March meeting about the API design?
Assistant: Based on your notes from the March 14 meeting, the team decided to...

You: What were the action items?
Assistant: Following up on the API design decision, the action items were...

You: quit
Goodbye!

Extension Ideas

IdeaEffortDescription
Web UIMediumWrap query.py in FastAPI + a Streamlit or React frontend
Slack botMediumReplace the CLI REPL with a Slack Bolt app — each DM is a query
Scheduled re-indexingSmallCron job or GitHub Action that re-runs ingest.py when your notes repo changes
Multi-userMediumStore user IDs in Chroma metadata; filter by user on retrieval
Source highlightingSmallReturn the source filename and page/line alongside each answer
Hybrid searchMediumCombine vector similarity with BM25 keyword search for better recall

Interview Demo Script

  1. Run ingest.py on your own notes folder — show real output.
  2. Ask a question whose answer spans two different documents.
  3. Ask a follow-up question that only makes sense given the previous answer (demonstrates history).
  4. Re-run ingest.py and show “0 new chunks” because hashes match (demonstrates idempotency).
  5. Explain the architecture diagram — why chunking, why overlap, why embeddings.

File Reference

FilePurpose
ingest.pyIndex documents into Chroma
query.pyInteractive RAG REPL
requirements.txtPython dependencies
.envAPI keys and config (never commit this)
./docs/Put your documents here
./chroma_db/Auto-created; stores the vector index