References — 00 Foundations

Curated references for the foundational LLM module. Ordered from most essential to supplementary. Read in this order if you are starting from scratch.


Essential Reading

Attention Is All You Need

Vaswani et al., Google Brain / Google Research, 2017

The paper that introduced the Transformer architecture. Proposes self-attention as a replacement for recurrence and convolution in sequence-to-sequence models. Introduces multi-head attention, positional encodings, and the encoder-decoder structure. Achieved state-of-the-art results on WMT 2014 English-to-German and English-to-French translation. Read the abstract and sections 3.1–3.3 (the model architecture and the attention mechanism) at minimum. Nearly every modern LLM descends from the architecture described in this paper.

Why it matters for engineers: The Q/K/V formulation, the 1/sqrt(d_k) scaling factor, and the reason attention costs O(n²) in sequence length all originate in this paper.
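
As a rough illustration of the formulation above, here is a minimal pure-Python sketch of single-head scaled dot-product attention. The function names and the toy matrices are assumptions made for this example, not code from the paper; real implementations use batched tensor libraries.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q and K are n x d_k lists, V is n x d_v. Returns (output, weights)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # One score per (query, key) pair: this n x n matrix is the O(n^2) cost.
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    # Each output row is a weighted average of the value rows.
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return output, weights

# Toy inputs (illustrative values only): 3 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of attn sums to 1, and the n x n shape of attn is why cost grows quadratically with sequence length.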


The Illustrated Transformer

Jay Alammar, 2018

The single best visual explainer of the Transformer architecture. Walks through self-attention, multi-head attention, positional encodings, and the full encoder-decoder flow with animated GIFs and step-by-step diagrams. If you find the math in the original paper opaque, read this first. It is one of the most widely linked resources in ML education.

Why it matters for engineers: Builds geometric intuition for what Q, K, and V actually do. Essential companion to the original paper.


The Illustrated GPT-2

Jay Alammar, 2019

Extends the Illustrated Transformer to GPT-2 specifically: covers the decoder-only architecture, causal masking, the language model head, and autoregressive generation. Explains the difference between encoder and decoder attention patterns visually. The same decoder-only design underlies GPT-style chat models, although the internals of proprietary models such as Claude and GPT-4 are not published in detail.

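Why it matters for engineers: GPT-2 shares its core design (stacked decoder blocks, causal self-attention, next-token prediction) with modern decoder-only LLMs, which differ mainly in scale and in components such as positional encoding and normalization. Understanding GPT-2 means understanding the inference loop used by chat models.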
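
The two ideas this entry highlights, causal masking and the autoregressive loop, can be sketched in a few lines. This is a toy illustration under stated assumptions: toy_next_token_logits is a hypothetical stand-in for a real model forward pass, and greedy selection is only one of several decoding strategies.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def toy_next_token_logits(tokens, vocab_size):
    """Hypothetical stand-in for a model: favors (last token + 1)."""
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

def greedy_generate(prompt, max_new_tokens, vocab_size, eos_id):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(tokens, vocab_size)
        next_id = max(range(vocab_size), key=lambda t: logits[t])
        tokens.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token appears
            break
    return tokens

mask = causal_mask(3)
generated = greedy_generate(prompt=[2, 3], max_new_tokens=10, vocab_size=8, eos_id=7)
```

The mask is lower-triangular, so no position can see later tokens. The loop feeds its own output back in as input, which is what "autoregressive" means in practice.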


Tooling and APIs

tiktoken

OpenAI, open source

OpenAI’s fast BPE tokenizer library, written in Rust with Python bindings. Used by GPT-3.5, GPT-4, and related models. Supports multiple encodings: cl100k_base (GPT-4 and GPT-3.5-turbo), p50k_base (Codex and text-davinci-002/003), r50k_base (original GPT-3 models). Provides encoding_for_model() for automatic encoding selection. Essential for token counting before API calls.

Usage: pip install tiktoken — see examples/tokenization_demo.py for worked examples.
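
A minimal sketch of token counting before an API call might look like the following. It is not the contents of examples/tokenization_demo.py. The helper names (load_encoding, count_tokens, fits_budget) and the budget parameter are assumptions made for this example; only tiktoken.get_encoding and the encoding's encode method are tiktoken's own API.

```python
def load_encoding(name="cl100k_base"):
    """Load a tiktoken encoding by name (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helpers below work without it
    return tiktoken.get_encoding(name)

def count_tokens(text, encoding):
    """Count tokens with any object exposing encode(str) -> list of ints."""
    return len(encoding.encode(text))

def fits_budget(text, encoding, max_prompt_tokens):
    """True when the prompt stays within a caller-chosen token budget."""
    return count_tokens(text, encoding) <= max_prompt_tokens

# Typical use (needs tiktoken installed; the first call may download
# the encoding file and cache it locally):
#   enc = load_encoding("cl100k_base")
#   print(count_tokens("Hello, world!", enc))
#   print(fits_budget("Hello, world!", enc, max_prompt_tokens=8000))
```

Passing the encoding in as an argument keeps the counting helpers independent of any one tokenizer, which is convenient when comparing encodings.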


Tiktokenizer — Interactive Tokenizer Playground

Community tool

A web-based interface for tokenizing arbitrary text with different encodings. Highlights each token in a different color, shows token IDs on hover, and displays total token count in real time. Faster than running code for quick intuition-building. Particularly useful for visualizing how numbers, code, and non-English text tokenize differently.

Usage: Paste any text. Select cl100k_base to match GPT-4 token counts. Claude uses a different tokenizer, so treat the result as only a rough estimate for Claude models. No API key required.


Claude Model Specifications and Context Windows

Anthropic

Official reference for available Claude models: context window sizes, pricing tiers, capability comparisons, and recommended use cases. Use this to select the right model for cost vs. capability trade-offs. Updated when new models are released. Check here for the current Haiku, Sonnet, and Opus tier options.

Usage: Reference before choosing a model for a new project; bookmark for pricing lookups.


Key Research Papers

Lost in the Middle: How Language Models Use Long Contexts

Liu et al., 2023

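Empirical study showing that LLMs use information at the beginning and end of long contexts most reliably, and that accuracy degrades significantly when the relevant information sits in the middle (a U-shaped performance curve). Tested on multi-document QA, where the relevant document was placed at different positions, and on a synthetic key-value retrieval task. The pattern held across GPT-3.5-Turbo, Claude 1.3, and open-weight models (MPT-30B-Instruct, LongChat-13B); GPT-4 was evaluated only on a smaller subset and showed the same trend. Popularized the phrase “lost in the middle” as shorthand for this retrieval failure mode.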

Why it matters for engineers: Directly affects RAG system design (where to place retrieved chunks), conversation history management (what to truncate when budget runs low), and prompt structure (put critical instructions at top and bottom).
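
One common way to act on this finding in a RAG pipeline is to reorder retrieved chunks so the highest-ranked ones sit at the edges of the context. The sketch below is an assumption-laden illustration, not a method taken from the paper. The function name is hypothetical, and the input is assumed to be sorted most relevant first.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Put the most relevant chunks at the start and end of the context.

    Input must be sorted most relevant first. Chunks are dealt alternately
    to the front and the back, so the least relevant land in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

# "A" is the most relevant chunk and "E" the least (illustrative labels).
ranked = ["A", "B", "C", "D", "E"]
ordered = reorder_for_long_context(ranked)
print(ordered)  # ['A', 'C', 'E', 'D', 'B']
```

Whether this reordering helps depends on the model and task, so it is worth measuring on your own retrieval workload.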


Supplementary Papers (Further Reading)

The following papers are referenced in the README but are intentionally left unlinked to avoid fabricated URLs. Search for the exact titles on arxiv.org or Google Scholar:

  • GPT-3: Language Models are Few-Shot Learners (Brown et al., 2020). Establishes in-context learning as a scaling phenomenon; popularizes zero/one/few-shot terminology for LLMs.

  • InstructGPT: Training language models to follow instructions with human feedback (Ouyang et al., 2022). Describes the three-stage RLHF pipeline (SFT → reward model → PPO) that became the standard alignment recipe for chat LLMs.

  • Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic). Proposes using AI-generated preference data guided by a written constitution to reduce reliance on human labelers for alignment.

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022). Shows how to compute exact attention in O(n) memory (vs. O(n²) when the full score matrix is materialized) using hardware-aware tiling, enabling practical long-context training and inference.

  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021). Introduces RoPE, the positional encoding scheme used by LLaMA, GPT-NeoX, Mistral, and many other open-weight models.
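
The O(n) memory claim in the FlashAttention entry above rests on a numerical trick usually called online softmax: scores are processed block by block while a running maximum and running sums are rescaled. The sketch below shows only that rescaling, for a single query row with scalar values, in plain Python. It is not the FlashAttention GPU kernel, and the function names and sample numbers are assumptions made for illustration.

```python
import math

def attention_row_full(scores, values):
    """Reference: softmax(scores) dot values, holding every score at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def attention_row_streaming(scores, values, block_size):
    """Same result, one block at a time, with constant-size running state."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier accumulators so they are relative to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.0]  # illustrative attention scores
values = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative scalar values
full = attention_row_full(scores, values)
streamed = attention_row_streaming(scores, values, block_size=2)
```

Both functions return the same number up to floating-point error, which is why the tiled computation counts as exact attention rather than an approximation.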


How to Use These References

| Your goal | Start here |
|---|---|
| Understand the math of attention | Attention Is All You Need §3.1–3.3 |
| Build visual intuition for transformers | The Illustrated Transformer |
| Understand decoder/autoregressive generation | The Illustrated GPT-2 |
| Count tokens before an API call | tiktoken repo + tokenization_demo.py |
| Quickly visualize how a string tokenizes | tiktokenizer.vercel.app |
| Pick the right Claude model for your use case | docs.anthropic.com/en/docs/about-claude/models |
| Design a RAG system for long documents | Lost in the Middle paper |
| Understand RLHF / chat model training | InstructGPT paper |
| Understand Claude’s alignment training | Constitutional AI paper |