Exercises — 00 Foundations

Work through these in order. They are designed to move from mechanical verification (running code, counting things) to conceptual synthesis (explaining to others, making design decisions).

Tip: Write your answers in a scratchpad or notebook alongside the exercises. The act of writing forces precision.


Exercise 1 — Token Counting (Mechanical)

Difficulty: Beginner
Time: 20–30 min
Prerequisite: examples/tokenization_demo.py runs without errors (tiktoken installed)

Task

Count the tokens in each of the five inputs below using tiktoken (cl100k_base encoding). Then, for each input, write a one-sentence explanation of why the count comes out as it does.

Inputs to tokenize:

A. "The quick brown fox jumps over the lazy dog."

B. "SELECT * FROM users WHERE email = 'alice@example.com' AND created_at > '2024-01-01';"

C. "私は毎日コーヒーを飲みます。" (Japanese: "I drink coffee every day.")

D. "9999999999999 + 1111111111111 = ?"

E. "def quicksort(arr):\n    if len(arr) <= 1: return arr\n    pivot = arr[0]\n    left = [x for x in arr[1:] if x < pivot]\n    right = [x for x in arr[1:] if x >= pivot]\n    return quicksort(left) + [pivot] + quicksort(right)"

What to write down

For each input, record:

  • Raw token count
  • Approximate word count (count manually)
  • Tokens-per-word ratio
  • Your one-sentence explanation
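
The measurements above could be collected with something like the following. This is a minimal sketch, assuming tiktoken is installed; the helper names token_stats, token_pieces, and cl100k_encoder are hypothetical and are not taken from examples/tokenization_demo.py.

```python
from typing import Callable, Dict, List


def token_stats(text: str, encode: Callable[[str], List[int]]) -> Dict[str, float]:
    """Return token count, whitespace word count, and tokens-per-word ratio."""
    n_tokens = len(encode(text))
    # Whitespace splitting is only a rough word count. Input C (Japanese)
    # has no spaces, so count its words by hand as the exercise asks.
    n_words = len(text.split())
    ratio = n_tokens / n_words if n_words else float("nan")
    return {"tokens": n_tokens, "words": n_words, "tokens_per_word": ratio}


def token_pieces(text: str, enc) -> List[str]:
    """Decode each token on its own to show where the boundaries fall."""
    return [enc.decode([t]) for t in enc.encode(text)]


def cl100k_encoder():
    """Build the cl100k_base encoder (fetches the vocabulary on first use)."""
    import tiktoken  # pip install tiktoken

    return tiktoken.get_encoding("cl100k_base")


# Usage (needs tiktoken and, on the first run, network access):
#   enc = cl100k_encoder()
#   print(token_stats("The quick brown fox jumps over the lazy dog.", enc.encode))
#   print(token_pieces("9999999999999 + 1111111111111 = ?", enc))
```

Passing the encoder in as a plain callable keeps token_stats independent of any one tokenizer, which is handy when comparing encodings in the bonus question.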

Expected insight

You should observe:

  1. English prose averages roughly 1 token per 0.75 words (about 1.3 tokens per word).
  2. SQL uses symbols and quoted strings that tokenize differently from prose.
  3. Japanese and other non-Latin scripts use significantly more tokens per “word”.
  4. Large numbers are split at surprising boundaries — the model sees fragments, not the number.
  5. Python code, despite being only six lines, has a high token count due to punctuation, indentation, and operators.

Bonus

Compute the cost of sending input D as a message to GPT-4o (use $5 per million input tokens) vs. Claude 3.5 Sonnet (check current pricing at docs.anthropic.com). Why might the cost differ even for identical text?
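
The arithmetic for the bonus is a single multiplication. A small sketch, assuming you have already measured the token count yourself: the value 15 below is a placeholder, not a measured result, and the function name input_cost_usd is hypothetical.

```python
def input_cost_usd(n_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending n_tokens input tokens at a given per-million price."""
    return n_tokens * usd_per_million_tokens / 1_000_000


# Placeholder: replace with the count you measured for input D.
n_tokens_d = 15  # hypothetical value, not a measured result

# $5 per million input tokens is the figure given in the exercise text.
print(f"{input_cost_usd(n_tokens_d, 5.0):.8f} USD")
```

Run it once per provider, supplying the token count and the price that apply to that provider.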


Exercise 2 — Transformer Block: Information Flow Diagram

Difficulty: Intermediate
Time: 30–45 min
Prerequisite: Read Section 1 of the README

Task

Without looking at the README, draw the full information flow through a single transformer block using ASCII art or plain text notation. Your diagram must show:

  1. The input token embedding (with positional encoding added)
  2. Layer normalization (position: before or after the sublayer?)
  3. Multi-head attention with Q, K, V
  4. The residual connection after attention
  5. The second layer normalization
  6. The feed-forward network (with dimensionality: d_model → 4×d_model → d_model)
  7. The second residual connection
  8. The output passed to the next block

Additional questions to answer in text:

  • Why are residual connections important? What problem do they solve?
  • What would happen if you removed layer normalization?
  • In a decoder-only model, what change is made to the attention step that is not shown in a plain encoder diagram?
  • Why does the FFN expand to 4× the model dimension before projecting back? What is the intuition?

Self-check

Compare your diagram to the one in Section 1.6 of the README. Identify any steps you missed or misplaced. The goal is not to reproduce it exactly — it is to know the structure well enough to reconstruct it from memory.
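
Once your diagram is drawn, one more way to check it is to trace the same eight steps in code. The sketch below is an illustration under stated assumptions, not the README's reference implementation: it uses NumPy, places layer normalization before each sublayer (the "pre-LN" arrangement; the original Transformer placed it after), uses a single attention head, a ReLU activation, and omits biases and learned normalization parameters.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def self_attention(x, w_q, w_k, w_v, w_o):
    """Single-head, unmasked scaled dot-product attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o


def transformer_block(x, p):
    """Pre-LN block: h = x + Attn(LN(x)), then out = h + FFN(LN(h))."""
    # First sublayer: layer norm, attention, residual connection.
    h = x + self_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    # Second sublayer: layer norm, expand d_model -> 4*d_model, ReLU.
    hidden = np.maximum(0.0, layer_norm(h) @ p["w_1"])
    # Project 4*d_model -> d_model and add the second residual connection.
    return h + hidden @ p["w_2"]


rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
shapes = {
    "w_q": (d_model, d_model), "w_k": (d_model, d_model),
    "w_v": (d_model, d_model), "w_o": (d_model, d_model),
    "w_1": (d_model, 4 * d_model), "w_2": (4 * d_model, d_model),
}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(n_tokens, d_model))  # stands in for embeddings plus positional encoding
out = transformer_block(x, params)
print(x.shape, "->", out.shape)  # the block preserves the (n_tokens, d_model) shape
```

Because input and output shapes match, blocks can be stacked by feeding out into the next transformer_block call.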


Exercise 3 — Fine-tuning vs Prompting vs RAG

Difficulty: Intermediate
Time: 30–45 min
Prerequisite: Read Section 5 of the README (Fine-tuning vs Prompting vs RAG)

Scenario

A mid-sized law firm wants to build an internal chatbot that can:

  1. Answer questions about their specific clients’ contracts and case files (thousands of documents, updated weekly).
  2. Summarize legal precedents from a curated database of 50,000 public court cases.
  3. Respond only in formal legal writing style — never casual, never first-person.
  4. When asked about a specific clause, always cite the exact document and page number.

Task

For each of the four capabilities above, independently decide: prompting, RAG, or fine-tuning? Or a combination?

Write your reasoning for each decision. Use the decision tree from Section 5.3 as a guide.

Evaluation criteria

Your answer should address:

  • Why the knowledge is or is not too large/dynamic for prompting.
  • Whether behavior (style) is better baked into weights or enforced via prompt.
  • How citations/verifiability affect the choice between RAG and fine-tuning.
  • What the failure mode is if you make the wrong choice (e.g., fine-tuning on documents that change weekly).

Suggested answer structure

Capability 1: [Your choice]
  Reason: ...
  Failure mode if wrong: ...

Capability 2: [Your choice]
  Reason: ...
  ...

Stretch goal

Sketch the system architecture that satisfies all four requirements simultaneously. What components do you need? (e.g., vector DB, embedding model, fine-tuned model checkpoint, prompt template)
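
As a starting point for the stretch goal, the sketch below illustrates just one mechanic: carrying document and page metadata alongside retrieved text so that a citation can be exact. Everything in it is hypothetical. The Chunk type, the keyword-overlap retriever (a stand-in for an embedding model and vector DB), the sample corpus, and the prompt wording are all invented for illustration, and it takes no position on how the other three capabilities should be handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    doc_id: str  # contract or case identifier
    page: int    # page the text was extracted from
    text: str


def retrieve(query: str, chunks: List[Chunk], k: int = 2) -> List[Chunk]:
    """Toy retriever: rank chunks by words shared with the query.

    A real system would use an embedding model and a vector DB here.
    """
    q_words = set(query.lower().split())

    def overlap(chunk: Chunk) -> int:
        return len(q_words & set(chunk.text.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]


CITATION_RULE = "Cite every claim as [doc_id, p. page] using only the sources provided."


def build_prompt(query: str, retrieved: List[Chunk]) -> str:
    """Assemble a prompt whose sources are tagged with citation metadata."""
    sources = "\n".join(f"[{c.doc_id}, p. {c.page}] {c.text}" for c in retrieved)
    return f"{CITATION_RULE}\n\nSources:\n{sources}\n\nQuestion: {query}"


corpus = [
    Chunk("contract-001", 4, "The termination clause requires ninety days written notice."),
    Chunk("contract-001", 9, "Payment is due within thirty days of invoice."),
    Chunk("case-042", 2, "The court held that notice by email was insufficient."),
]
question = "What notice does the termination clause require?"
top = retrieve(question, corpus)
print(build_prompt(question, top))
```

The design point is that the page number travels with the text from ingestion through to the prompt; a model cannot cite metadata it was never shown.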


Exercise 4 — Run the Sampling Demo and Explain Variance

Difficulty: Beginner-Intermediate
Time: 20–30 min
Prerequisite: examples/sampling_params_demo.py and an ANTHROPIC_API_KEY

Setup

# From the 00-foundations directory:
cp .env.example .env          # if .env.example exists; otherwise create .env by hand
echo "ANTHROPIC_API_KEY=sk-ant-YOUR_KEY" >> .env

# Install dependencies:
pip install anthropic python-dotenv

# Run the demo:
python examples/sampling_params_demo.py

Task

Run the demo and observe the outputs. Then answer these questions in writing:

Part A — Temperature comparison

  1. Do the outputs at temperature=0 look more or less “generic” than at temperature=1.0? Why?
  2. Does temperature=0 guarantee the exact same output if you run the script twice? Try it. What do you observe?
  3. Which temperature would you use for a chatbot that needs to generate legal contract summaries? Which for a creative naming brainstorm? Justify each.
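
To reason about the questions above, it can help to see the textbook form of temperature sampling: divide the logits by the temperature, then apply softmax. The sketch below uses only the standard library and made-up logits. It shows the general technique, not how any particular provider implements it, and treating temperature=0 as a pure argmax is an assumption made here to avoid dividing by zero.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature == 0:
        # Assumption: treat T=0 as greedy decoding, all mass on the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token; higher temperatures spread it across the alternatives.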

Part B — Variance at temperature=1.0

  1. Look at the three outputs from the variance runs. Are they meaningfully different, or just slightly rephrased? Why does the degree of variance matter for a production system?
  2. If a user reports “the chatbot gave different answers to the same question” — is this a bug? Under what circumstances is it acceptable, and when is it a problem?

Part C — Reflection

  1. The demo uses max_tokens=120. How would increasing this to 1000 change the cost and the observed variance? (Think about this before changing the code — then verify by editing the script.)
  2. Why does the Anthropic API accept temperature in [0, 1] while OpenAI’s API goes up to 2.0? What might Anthropic’s internal sampling implementation do differently?


Exercise 5 — Interview Simulation: Explain Self-Attention at Two Levels

Difficulty: Advanced (synthesis + communication)
Time: 45–60 min
Prerequisite: Read all of Section 1 and the Interview Flashcards in the README

This exercise simulates a real-world technical interview scenario. Strong LLM engineers must be able to explain the same concept at multiple levels of abstraction — to a non-technical stakeholder, and to a peer who will probe the details.

Part A — Explain Self-Attention to a 5-Year-Old (2 minutes max)

Write a spoken explanation (as if you are speaking aloud) that a 5-year-old could follow. No math. No jargon. Use an analogy.

Constraints:

  • No use of the words: “matrix”, “vector”, “embedding”, “dimension”, “softmax”
  • Must convey the core idea: the model looks at all words at once and figures out which ones are related
  • Must use a concrete, relatable analogy (e.g., arrows connecting the words in a sentence, or a group of friends deciding who to listen to)

Example opening (do not copy — write your own):

“Imagine you’re reading a mystery book and you see the word ‘it’…”

Part B — Explain Self-Attention to a Senior Engineer (5 minutes max)

Write a technical explanation suitable for a senior SWE who knows linear algebra and ML basics but has not studied transformers. Cover:

  1. The Q/K/V projection matrices and what they represent conceptually.
  2. The scaled dot-product formula and why we scale by sqrt(d_k).
  3. Why attention is O(n²) and what that means for practical context limits.
  4. How causal masking works for decoder models.
  5. One specific failure mode or limitation of standard self-attention (your choice — e.g., fixed context length, sensitivity to token order via position encoding, quadratic memory).
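
Points 1 to 4 can be grounded in a few lines of code before you write the prose. The following is a minimal single-head sketch in NumPy with random weights, assuming the standard formulation softmax(QKᵀ / sqrt(d_k)) V; the function name and the shapes are chosen for illustration only.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (n, d_model); w_q, w_k, w_v: (d_model, d_k). Returns (output, weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # (n, n) score matrix: one entry per pair of positions, hence O(n^2).
    scores = q @ k.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # position i cannot see j > i
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

Each row of weights sums to 1, and the first token can attend only to itself, so its output equals its own value vector. Removing the mask turns this into the encoder-style attention that a plain encoder diagram shows.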

Part C — Self-evaluation

After writing both explanations, answer:

  1. Which explanation was harder to write? Why?
  2. What is the core insight that both explanations must convey, regardless of audience?
  3. If the interviewer asks “What is multi-head attention and why do we need it?” — write your 60-second answer here.

Why this exercise matters

The ability to explain self-attention at multiple levels is a reliable signal in interviews. A candidate who can only recite the formula likely memorized it. A candidate who can explain it to a 5-year-old and then derive the O(n²) complexity on the spot has understood it.

This exercise also builds the metacognitive skill of knowing when you truly understand something vs. when you have pattern-matched on the words.


Completion Checklist

Mark each exercise complete when you can do the following without referring to notes:

  • Exercise 1: State the approximate tokens-per-word ratio for English prose, code, and Japanese text without running the script.
  • Exercise 2: Draw the transformer block from memory with all residual connections and layer norms in the right places.
  • Exercise 3: Given any chatbot scenario, justify your architecture choice in under 2 minutes.
  • Exercise 4: Explain in one sentence what temperature controls and give a concrete example of when to use 0 vs 1.0.
  • Exercise 5: Deliver the 5-year-old explanation without hesitation, then give the O(n²) derivation on the spot.

When all five are checked, you are ready for the 01-prompting-fundamentals module.