Project 1: Personal Knowledge Assistant
A command-line RAG system that indexes your personal notes/docs folder and lets you query it conversationally using Claude.
What It Does
ingest.pywalks a directory, chunks.txt/.md/.pdffiles, embeds them, and stores in a local Chroma vector database. Idempotent: files are hashed and skipped if already indexed.query.pystarts an interactive REPL. You type a question, the system retrieves the 5 most relevant chunks, and streams a Claude-powered answer. Conversation history (last 10 turns) is maintained for follow-up questions.
Architecture
Your docs folder
|
v
ingest.py
|-- Walk directory (.txt / .md / .pdf)
|-- Chunk text (~500 tokens, 50-token overlap)
|-- Hash check -> skip if already indexed
|-- Embed with OpenAI text-embedding-3-small (or sentence-transformers fallback)
`-- Store in Chroma DB (./chroma_db)
query.py <-- you type a question
|-- Embed query
|-- Retrieve top-5 chunks from Chroma
|-- Build prompt (system + context + conversation history + query)
|-- Stream response from Claude (claude-haiku-4-5-20251001)
`-- Append turn to conversation history -> loop
Skills Covered
| Module | Concept |
|---|---|
| 02 — RAG | Document chunking, embedding, vector retrieval, context injection |
| 03 — Agents | Conversational REPL loop |
| 05 — Memory | Rolling conversation history (last 10 turns) |
| 09 — Production | Idempotent ingestion, env-var config, graceful error handling |
Setup
1. Install dependencies
cd projects/01-personal-knowledge-assistant
pip install -r requirements.txt2. Create a .env file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-... # Optional: falls back to sentence-transformers if omitted
DOCS_DIR=./docs # Directory to index (default: ./docs)
CHROMA_DIR=./chroma_db # Vector DB location (default: ./chroma_db)
COLLECTION_NAME=knowledge # Chroma collection (default: knowledge)
If OPENAI_API_KEY is not set the system falls back to sentence-transformers/all-MiniLM-L6-v2 (local, no API key needed).
3. Add your documents
mkdir docs
cp ~/Documents/notes/*.md docs/
cp ~/Downloads/some-paper.pdf docs/Nested folder structures are fine — ingest.py walks recursively.
Usage
Index your documents
python ingest.pyUsing OpenAI embeddings (text-embedding-3-small)
Scanning: ./docs
[ 1/12] notes/project-alpha.md -> 8 chunks
[ 2/12] notes/meeting-2025-03.md -> 0 chunks (skipped: already indexed)
[12/12] papers/attention-is-all.pdf -> 24 chunks
Indexed 47 new chunks from 10 documents (2 skipped, already up to date)
Query interactively
python query.pyKnowledge Assistant ready. Type 'quit' to exit.
Loaded 47 chunks from collection 'knowledge'.
You: What did we decide in the March meeting about the API design?
Assistant: Based on your notes from the March 14 meeting, the team decided to...
You: What were the action items?
Assistant: Following up on the API design decision, the action items were...
You: quit
Goodbye!
Extension Ideas
| Idea | Effort | Description |
|---|---|---|
| Web UI | Medium | Wrap query.py in FastAPI + a Streamlit or React frontend |
| Slack bot | Medium | Replace the CLI REPL with a Slack Bolt app — each DM is a query |
| Scheduled re-indexing | Small | Cron job or GitHub Action that re-runs ingest.py when your notes repo changes |
| Multi-user | Medium | Store user IDs in Chroma metadata; filter by user on retrieval |
| Source highlighting | Small | Return the source filename and page/line alongside each answer |
| Hybrid search | Medium | Combine vector similarity with BM25 keyword search for better recall |
Interview Demo Script
- Run
ingest.pyon your own notes folder — show real output. - Ask a question whose answer spans two different documents.
- Ask a follow-up question that only makes sense given the previous answer (demonstrates history).
- Re-run
ingest.pyand show “0 new chunks” because hashes match (demonstrates idempotency). - Explain the architecture diagram — why chunking, why overlap, why embeddings.
File Reference
| File | Purpose |
|---|---|
ingest.py | Index documents into Chroma |
query.py | Interactive RAG REPL |
requirements.txt | Python dependencies |
.env | API keys and config (never commit this) |
./docs/ | Put your documents here |
./chroma_db/ | Auto-created; stores the vector index |