References — 00 Foundations
Curated references for the foundational LLM module. Ordered from most essential to supplementary. Read in this order if you are starting from scratch.
Essential Reading
Attention Is All You Need
Vaswani et al., Google Brain / Google Research, 2017
The paper that introduced the Transformer architecture. Proposes self-attention as a replacement for recurrence and convolution in sequence-to-sequence models. Introduces multi-head attention, positional encodings, and the encoder-decoder structure. Achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks. Read the abstract and sections 3.1–3.3 (the attention mechanism) at minimum. Nearly every modern LLM descends from this architecture.
Why it matters for engineers: The Q/K/V formulation, the 1/sqrt(d_k) scaling factor, and the reason attention costs O(n²) in sequence length n all come from this paper.
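The three ideas named above (Q/K/V, the 1/sqrt(d_k) scale, and the quadratic cost) can be seen together in a few lines. The following is a minimal sketch of scaled dot-product attention for a single head, written with plain Python lists so it runs without dependencies; the function names and the toy inputs are illustrative assumptions, not code from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: n x d_k lists of floats; V: n x d_v. Returns (output, weights)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # One score per (query, key) pair: n * n entries in total,
    # which is where the O(n^2) cost in sequence length comes from.
    scores = [
        [scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights

# Toy inputs (assumed for illustration): 3 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and the 3 x 3 shape of `weights` is the quadratic term: doubling the sequence length quadruples the number of scores.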
The Illustrated Transformer
Jay Alammar, 2018
The single best visual explainer of the Transformer architecture. Walks through self-attention, multi-head attention, positional encodings, and the full encoder-decoder flow with animated GIFs and step-by-step diagrams. If you find the math in the original paper opaque, read this first. It is one of the most widely linked resources in ML education.
Why it matters for engineers: Builds geometric intuition for what Q, K, and V actually do. Essential companion to the original paper.
The Illustrated GPT-2
Jay Alammar, 2019
Extends the Illustrated Transformer to GPT-2 specifically: covers the decoder-only architecture, causal masking, the language model head, and autoregressive generation. Explains the difference between encoder and decoder attention patterns visually. The same concepts carry over to current chat models such as Claude and GPT-4, although their exact internals are not public.
Why it matters for engineers: GPT-2 follows the same decoder-only blueprint as modern LLMs. Details such as positional encoding and normalization have changed since, but the autoregressive inference loop is the same one chat models run today.
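Causal masking and the autoregressive loop are easier to remember once written out. The sketch below assumes a hypothetical `next_token_logits` callable standing in for a real model; `toy_model`, the vocabulary size, and the EOS id are made up for illustration and have nothing to do with GPT-2's actual weights or vocabulary.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id):
    """Append the arg-max token one step at a time until EOS or the budget runs out."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical stand-in for a model: always prefers (last token + 1) mod vocab size.
def toy_model(tokens, vocab_size=5):
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

generated = greedy_generate(toy_model, prompt_ids=[0], max_new_tokens=10, eos_id=4)
```

The loop feeds the whole sequence back in at every step, which is why real implementations cache keys and values; greedy arg-max is used here only because it is the simplest decoding rule, not because chat models necessarily use it.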
Tooling and APIs
tiktoken
OpenAI, open source
OpenAI’s fast BPE tokenizer library, with a core written in Rust and Python bindings. Used by GPT-3.5, GPT-4, and related models. Supports multiple encodings: cl100k_base (GPT-3.5 Turbo and GPT-4), p50k_base (Codex and text-davinci-002/003), r50k_base (original GPT-3 models such as davinci); newer releases add o200k_base for GPT-4o. Provides encoding_for_model() for automatic encoding selection. Essential for token counting before API calls.
Usage: pip install tiktoken; see examples/tokenization_demo.py for worked examples.
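As a rough illustration of token counting before an API call, the sketch below wraps tiktoken behind a small budget check. `tiktoken.encoding_for_model` and `Encoding.encode` are the library's own calls; `tiktoken_counter`, `fits_budget`, and `whitespace_counter` are hypothetical helper names introduced here, and this is not the content of examples/tokenization_demo.py.

```python
def tiktoken_counter(model="gpt-4"):
    """Return a function that counts tokens with tiktoken's encoding for `model`."""
    import tiktoken  # third-party: pip install tiktoken
    encoding = tiktoken.encoding_for_model(model)
    return lambda text: len(encoding.encode(text))

def fits_budget(text, max_tokens, count_tokens):
    """True when `text` uses at most `max_tokens` according to `count_tokens`."""
    return count_tokens(text) <= max_tokens

def whitespace_counter(text):
    # Crude stand-in so the sketch runs without tiktoken installed;
    # real BPE token counts will differ from a whitespace word count.
    return len(text.split())

ok = fits_budget("count these five words please", 8, whitespace_counter)
```

With tiktoken installed, the same check becomes `fits_budget(prompt, 8000, tiktoken_counter("gpt-4"))`, where 8000 is an arbitrary example budget. Passing the counter in as an argument keeps the budget logic independent of any one tokenizer.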
Tiktokenizer — Interactive Tokenizer Playground
Community tool
A web-based interface for tokenizing arbitrary text with different encodings. Highlights each token in a different color, shows token IDs on hover, and displays total token count in real time. Faster than running code for quick intuition-building. Particularly useful for visualizing how numbers, code, and non-English text tokenize differently.
Usage: Paste any text. Select cl100k_base to match GPT-4 token counts. Claude uses a different tokenizer, so treat these counts as only a rough estimate for Claude models. No API key required.
Claude Model Specifications and Context Windows
Anthropic
Official reference for available Claude models: context window sizes, pricing tiers, capability comparisons, and recommended use cases. Use this to select the right model for cost vs. capability trade-offs. Updated when new models are released. Check here for the current Haiku, Sonnet, and Opus tier options.
Usage: Reference before choosing a model for a new project; bookmark for pricing lookups.
Key Research Papers
Lost in the Middle: How Language Models Use Long Contexts
Liu et al., 2023
Empirical study showing that LLMs perform best when the relevant information sits at the beginning or end of a long context, and noticeably worse when it sits in the middle, giving a U-shaped accuracy curve. Tested on multi-document QA, where the relevant document was placed at different positions, and on a key-value retrieval task. The pattern held across the closed and open-weight models tested, including GPT-3.5-Turbo and Claude. Introduced the phrase “lost in the middle” as shorthand for this retrieval failure mode.
Why it matters for engineers: Directly affects RAG system design (where to place retrieved chunks), conversation history management (what to truncate when budget runs low), and prompt structure (put critical instructions at top and bottom).
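One common way to act on the chunk-placement advice is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. The sketch below is one possible heuristic, not a method from the paper; `reorder_for_long_context` is a hypothetical helper name, and it assumes the retriever already returns chunks sorted most relevant first.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the edges and the least relevant in the middle.

    `chunks_by_relevance` is ordered most relevant first.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate between front and back so relevance falls off toward the middle.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ordered = reorder_for_long_context(["r1", "r2", "r3", "r4", "r5"])
# ordered == ["r1", "r3", "r5", "r4", "r2"]
```

Here `r1` and `r2`, the two best chunks, end up first and last, while `r5`, the weakest, lands in the middle. Whether this helps in a given system is an empirical question worth measuring on your own data.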
Supplementary Papers (Further Reading)
The following papers are referenced in the README but are deliberately left unlinked here to avoid fabricated URLs. Search for the exact titles on arxiv.org or Google Scholar:
- GPT-3: Language Models are Few-Shot Learners (Brown et al., 2020). Establishes in-context learning as a scaling phenomenon; introduces zero/one/few-shot terminology for LLMs.
- InstructGPT: Training language models to follow instructions with human feedback (Ouyang et al., 2022). Describes the RLHF pipeline (SFT → reward model → PPO) that became the standard alignment technique for chat LLMs.
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic). Proposes using AI-generated preference data guided by a written constitution to reduce reliance on human labelers for alignment.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022). Shows how to compute exact attention with memory linear in sequence length, O(n) instead of O(n²), using hardware-aware tiling, which makes long-context training and inference practical.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021). Introduces RoPE, the positional encoding scheme used by LLaMA, GPT-NeoX, Mistral, and many other modern open-weight models.
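To make the O(n²) versus O(n) memory distinction in the FlashAttention entry concrete, here is a back-of-the-envelope calculation of what materializing the full score matrix costs. It assumes 2 bytes per element (fp16) and counts a single head of a single sequence; `attention_matrix_bytes` is a hypothetical helper, and the sequence lengths are arbitrary examples.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to materialize one seq_len x seq_len score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_element

for n in (1_024, 8_192, 65_536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB per head")
```

Under these assumptions the matrix grows from 2 MiB at n = 1,024 to 128 MiB at n = 8,192 and 8,192 MiB at n = 65,536, before multiplying by the number of heads and the batch size. Avoiding that materialization is the point of the tiling approach.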
How to Use These References
| Your goal | Start here |
|---|---|
| Understand the math of attention | Attention Is All You Need §3.1–3.3 |
| Build visual intuition for transformers | The Illustrated Transformer |
| Understand decoder/autoregressive generation | The Illustrated GPT-2 |
| Count tokens before an API call | tiktoken repo + tokenization_demo.py |
| Quickly visualize how a string tokenizes | tiktokenizer.vercel.app |
| Pick the right Claude model for your use case | docs.anthropic.com/en/docs/about-claude/models |
| Design a RAG system for long documents | Lost in the Middle paper |
| Understand RLHF / chat model training | InstructGPT paper |
| Understand Claude’s alignment training | Constitutional AI paper |