Exercises — 00 Foundations
Work through these in order. They are designed to move from mechanical verification (running code, counting things) to conceptual synthesis (explaining to others, making design decisions).
Tip: Write your answers in a scratchpad or notebook alongside the exercises. The act of writing forces precision.
Exercise 1 — Token Counting (Mechanical)
Difficulty: Beginner
Time: 20–30 min
Prerequisite: examples/tokenization_demo.py installed and runnable
Task
Count the tokens in each of the five inputs below using tiktoken (cl100k_base encoding). Then explain in one sentence why each count comes out the way it does.
Inputs to tokenize:
A. "The quick brown fox jumps over the lazy dog."
B. "SELECT * FROM users WHERE email = 'alice@example.com' AND created_at > '2024-01-01';"
C. "私は毎日コーヒーを飲みます。" (Japanese: "I drink coffee every day.")
D. "9999999999999 + 1111111111111 = ?"
E. "def quicksort(arr):\n if len(arr) <= 1: return arr\n pivot = arr[0]\n left = [x for x in arr[1:] if x < pivot]\n right = [x for x in arr[1:] if x >= pivot]\n return quicksort(left) + [pivot] + quicksort(right)"
What to write down
For each input, record:
- Raw token count
- Approximate word count (count manually)
- Tokens-per-word ratio
- Your one-sentence explanation
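The measurements above can be collected with a small helper. The sketch below assumes the tiktoken package is installed; `token_report` and `cl100k_encoder` are illustrative names and are not taken from examples/tokenization_demo.py. The whitespace word count is only a rough default, so pass your manual count for inputs such as the Japanese sentence.

```python
from typing import Callable, Optional, Sequence


def cl100k_encoder() -> Callable[[str], Sequence[int]]:
    """Return the cl100k_base encode function (assumes tiktoken is installed)."""
    import tiktoken  # imported lazily so token_report works with any encoder

    return tiktoken.get_encoding("cl100k_base").encode


def token_report(
    text: str,
    encode: Callable[[str], Sequence[int]],
    word_count: Optional[int] = None,
) -> dict:
    """Record token count, word count, and tokens-per-word for one input."""
    tokens = len(encode(text))
    # Whitespace splitting is a rough stand-in for a manual word count.
    words = word_count if word_count is not None else len(text.split())
    ratio = round(tokens / words, 2) if words else None
    return {"tokens": tokens, "words": words, "tokens_per_word": ratio}
```

For input A, `token_report("The quick brown fox jumps over the lazy dog.", cl100k_encoder())` returns the three numbers to record. For input C, pass `word_count=` with the count you made by hand, since Japanese text has no spaces to split on.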
Expected insight
You should observe:
- English prose averages roughly 0.75 words per token, or about 1.3 tokens per word.
- SQL uses symbols and quoted strings that tokenize differently from prose.
- Japanese and other non-Latin scripts use significantly more tokens per “word”.
- Large numbers are split at surprising boundaries — the model sees fragments, not the number.
- Python code, despite being only six short lines, has a high token count because punctuation, indentation, and operators each consume tokens.
Bonus
Compute the cost of sending input D as a message to GPT-4o (use $5 per million input tokens) vs. Claude 3.5 Sonnet (check current pricing at docs.anthropic.com). Why might the cost differ even for identical text?
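The arithmetic for the bonus is a single multiplication. The sketch below uses the $5 per million input tokens given above; the function name and the token count are placeholders, so substitute the count you measured for input D and the current price for whichever model you compare against.

```python
def input_cost_usd(token_count: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `token_count` input tokens at a per-million-token price."""
    return token_count * usd_per_million_tokens / 1_000_000


# The $5 rate comes from the exercise text; the token count is a placeholder
# to be replaced with your own measurement for input D.
placeholder_tokens = 20
print(f"${input_cost_usd(placeholder_tokens, 5.0):.6f}")
```

Keep in mind that the token count itself is an input to this formula: the same text can produce a different count under a different tokenizer.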
Exercise 2 — Transformer Block: Information Flow Diagram
Difficulty: Intermediate
Time: 30–45 min
Prerequisite: Read Section 1 of the README
Task
Without looking at the README, draw the full information flow through a single transformer block using ASCII art or plain text notation. Your diagram must show:
- The input token embedding (with positional encoding added)
- Layer normalization (position: before or after the sublayer?)
- Multi-head attention with Q, K, V
- The residual connection after attention
- The second layer normalization
- The feed-forward network (with dimensionality: d_model → 4×d_model → d_model)
- The second residual connection
- The output passed to the next block
Additional questions to answer in text:
- Why are residual connections important? What problem do they solve?
- What would happen if you removed layer normalization?
- In a decoder-only model, what change is made to the attention step that is not shown in a plain encoder diagram?
- Why does the FFN expand to 4× the model dimension before projecting back? What is the intuition?
Self-check
Compare your diagram to the one in Section 1.6 of the README. Identify any steps you missed or misplaced. The goal is not to reproduce it exactly — it is to know the structure well enough to reconstruct it from memory.
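Once you have compared diagrams, it can help to see the same flow as code. The sketch below is a minimal NumPy version of one decoder block under several assumptions that may not match the README: pre-LN ordering (normalization before each sublayer), a single attention head for brevity, a ReLU feed-forward network, layer norm without learned gain or bias, and random weights. All function and variable names are illustrative.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_self_attention(x, w_q, w_k, w_v, w_o):
    """Single-head scaled dot-product attention with a causal mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                   # (n, n) score matrix
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True marks future positions
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v @ w_o


def transformer_block(x, params):
    """Pre-LN ordering: h = x + Attn(LN(x)), then out = h + FFN(LN(h))."""
    h = x + causal_self_attention(layer_norm(x), *params["attn"])  # first residual
    w1, w2 = params["ffn"]
    ffn_out = np.maximum(0.0, layer_norm(h) @ w1) @ w2  # d_model -> 4*d_model -> d_model
    return h + ffn_out                                  # second residual


rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
params = {
    "attn": [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)],
    "ffn": (
        rng.normal(scale=0.1, size=(d_model, 4 * d_model)),
        rng.normal(scale=0.1, size=(4 * d_model, d_model)),
    ),
}
x = rng.normal(size=(n_tokens, d_model))  # stands in for embeddings + positional encoding
out = transformer_block(x, params)
print(out.shape)
```

Because of the causal mask, editing the last row of `x` leaves every earlier row of `out` unchanged, which is a quick way to check your answer to the decoder-only question.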
Exercise 3 — Architecture Decision: Chatbot for Legal Documents
Difficulty: Intermediate
Time: 30–45 min
Prerequisite: Read Section 5 of the README (Fine-tuning vs Prompting vs RAG)
Scenario
A mid-sized law firm wants to build an internal chatbot that can:
- Answer questions about their specific clients’ contracts and case files (thousands of documents, updated weekly).
- Summarize legal precedents from a curated database of 50,000 public court cases.
- Respond only in formal legal writing style — never casual, never first-person.
- When asked about a specific clause, always cite the exact document and page number.
Task
For each of the four capabilities above, independently decide: prompting, RAG, or fine-tuning? Or a combination?
Write your reasoning for each decision. Use the decision tree from Section 5.3 as a guide.
Evaluation criteria
Your answer should address:
- Why the knowledge is or is not too large/dynamic for prompting.
- Whether behavior (style) is better baked into weights or enforced via prompt.
- How citations/verifiability affect the choice between RAG and fine-tuning.
- What the failure mode is if you make the wrong choice (e.g., fine-tuning on documents that change weekly).
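One way to think about the citation criterion: with retrieval, every chunk handed to the model can carry its source metadata, so the answer can point at a document and page, whereas fine-tuned weights keep no such pointer. The sketch below is a toy illustration only. The `Chunk` fields, the sample contract text, and the word-overlap scoring are all hypothetical stand-ins for a real embedding model and vector database.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    document: str  # hypothetical file name
    page: int
    text: str


def retrieve(query: str, chunks: List[Chunk], top_k: int = 1) -> List[Chunk]:
    """Rank chunks by word overlap with the query (a toy stand-in for vector search)."""
    query_words = set(query.lower().split())

    def overlap(chunk: Chunk) -> int:
        return len(query_words & set(chunk.text.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]


def build_prompt(query: str, retrieved: List[Chunk]) -> str:
    """Put each chunk's source next to its text so the model can cite it."""
    context = "\n".join(f"[{c.document}, p. {c.page}] {c.text}" for c in retrieved)
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\n\nQuestion: {query}"
    )


# Made-up sample data for illustration.
corpus = [
    Chunk("acme_services_agreement.pdf", 12,
          "Either party may terminate with 30 days written notice."),
    Chunk("acme_services_agreement.pdf", 3,
          "Fees are invoiced monthly in arrears."),
]
question = "How can a party terminate the agreement?"
hits = retrieve(question, corpus)
print(build_prompt(question, hits))
```

When the source documents change weekly, only the corpus needs re-indexing in a design like this; the model itself is untouched.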
Suggested answer structure
Capability 1: [Your choice]
Reason: ...
Failure mode if wrong: ...
Capability 2: [Your choice]
Reason: ...
...
Stretch goal
Sketch the system architecture that satisfies all four requirements simultaneously. What components do you need? (e.g., vector DB, embedding model, fine-tuned model checkpoint, prompt template)
Exercise 4 — Run the Sampling Demo and Explain Variance
Difficulty: Beginner-Intermediate
Time: 20–30 min (API key required)
Prerequisite: examples/sampling_params_demo.py, an ANTHROPIC_API_KEY
Setup
# From the 00-foundations directory:
cp .env.example .env # if .env.example exists, or create .env
echo "ANTHROPIC_API_KEY=sk-ant-YOUR_KEY" >> .env
# Install dependencies:
pip install anthropic python-dotenv
# Run the demo:
python examples/sampling_params_demo.py
Task
Run the demo and observe the outputs. Then answer these questions in writing:
Part A — Temperature comparison
- Do the outputs at temperature=0 look more or less “generic” than at temperature=1.0? Why?
- Does temperature=0 guarantee the exact same output if you run the script twice? Try it. What do you observe?
- Which temperature would you use for a chatbot that needs to generate legal contract summaries? Which for a creative naming brainstorm? Justify each.
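For intuition while answering Part A, temperature can be viewed as dividing the logits before the softmax. The sketch below uses made-up logits for four hypothetical tokens and treats temperature 0 as greedy argmax. That is a common convention, not a statement about how any provider implements sampling, and it does not model the other sources of nondeterminism you may see when calling a real API.

```python
import math
import random
from typing import List


def temperature_softmax(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0:
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens: List[str], probs: List[float], rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["the", "a", "this", "every"]  # hypothetical candidate tokens
logits = [2.0, 1.0, 0.5, 0.1]           # made-up scores

rng = random.Random(0)
for t in (0, 0.5, 1.0):
    probs = temperature_softmax(logits, t)
    draws = [sample(tokens, probs, rng) for _ in range(5)]
    print(t, [round(p, 2) for p in probs], draws)
```

Lower temperatures concentrate probability on the top-scoring token, which is why low-temperature outputs tend to repeat the most likely phrasing.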
Part B — Variance at temperature=1.0
- Look at the three outputs from the variance runs. Are they meaningfully different, or just slightly rephrased? Why does the degree of variance matter for a production system?
- If a user reports “the chatbot gave different answers to the same question” — is this a bug? Under what circumstances is it acceptable, and when is it a problem?
Part C — Reflection
- The demo uses max_tokens=120. How would increasing this to 1000 change the cost and the observed variance? (Think about this before changing the code, then verify by editing the script.)
- Why does the Anthropic API accept temperature in [0, 1] while OpenAI’s API goes up to 2.0? What might Anthropic’s internal sampling implementation do differently?
Exercise 5 — Interview Simulation: Explain Self-Attention at Two Levels
Difficulty: Advanced (synthesis + communication)
Time: 45–60 min
Prerequisite: Read all of Section 1 and the Interview Flashcards in the README
This exercise simulates a real-world technical interview scenario. Strong LLM engineers must be able to explain the same concept at multiple levels of abstraction — to a non-technical stakeholder, and to a peer who will probe the details.
Part A — Explain Self-Attention to a 5-Year-Old (2 minutes max)
Write a spoken explanation (as if you are speaking aloud) that a 5-year-old could follow. No math. No jargon. Use an analogy.
Constraints:
- No use of the words: “matrix”, “vector”, “embedding”, “dimension”, “softmax”
- Must convey the core idea: the model looks at all words at once and figures out which ones are related
- Must use a concrete, relatable analogy (e.g., drawing arrows between related words in a sentence, or a group of friends deciding who to listen to)
Example opening (do not copy — write your own):
“Imagine you’re reading a mystery book and you see the word ‘it’…”
Part B — Explain Self-Attention to a Senior Engineer (5 minutes max)
Write a technical explanation suitable for a senior SWE who knows linear algebra and ML basics but has not studied transformers. Cover:
- The Q/K/V projection matrices and what they represent conceptually.
- The scaled dot-product formula and why we scale by sqrt(d_k).
- Why attention is O(n²) and what that means for practical context limits.
- How causal masking works for decoder models.
- One specific failure mode or limitation of standard self-attention (your choice — e.g., fixed context length, sensitivity to token order via position encoding, quadratic memory).
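Two of the points above can be checked numerically before you write them up. The sketch below assumes query and key components are independent with zero mean and unit variance, which is an idealization. Under that assumption the raw dot product has variance close to d_k, and dividing by sqrt(d_k) brings it back near 1. The second helper simply counts the n × n attention scores one head computes. Both function names are illustrative.

```python
import math
import random
import statistics


def dot_product_variance(d_k: int, trials: int = 2000, seed: int = 0) -> tuple:
    """Estimate the variance of q·k before and after scaling by sqrt(d_k)."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)


def attention_score_count(n_tokens: int) -> int:
    """Every token attends to every token, so one head computes n * n scores."""
    return n_tokens * n_tokens


for d_k in (16, 64):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(d_k, round(raw_var, 1), round(scaled_var, 2))

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))  # doubling n quadruples the count
```

Large unscaled scores push the softmax toward a near one-hot distribution, where gradients become very small; the scaling keeps the scores in a range where the softmax stays responsive.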
Part C — Self-evaluation
After writing both explanations, answer:
- Which explanation was harder to write? Why?
- What is the core insight that both explanations must convey, regardless of audience?
- If the interviewer asks “What is multi-head attention and why do we need it?” — write your 60-second answer here.
Why this exercise matters
The ability to explain self-attention at multiple levels is a reliable signal in interviews. A candidate who can only recite the formula likely memorized it. A candidate who can explain it to a 5-year-old and then derive the O(n²) complexity on the spot has understood it.
This exercise also builds the metacognitive skill of knowing when you truly understand something vs. when you have pattern-matched on the words.
Completion Checklist
Mark each exercise complete when you can do the following without referring to notes:
- Exercise 1: State the typical tokens-per-word ratio for English prose, code, and Japanese text without running the script.
- Exercise 2: Draw the transformer block from memory with all residual connections and layer norms in the right places.
- Exercise 3: Given any chatbot scenario, justify your architecture choice in under 2 minutes.
- Exercise 4: Explain in one sentence what temperature controls and give a concrete example of when to use 0 vs 1.0.
- Exercise 5: Deliver the 5-year-old explanation without hesitation, then give the O(n²) derivation on the spot.
When all five are checked, you are ready for the 01-prompting-fundamentals module.