References: Prompt Engineering
A curated reading list organized by category. Start with the Official Documentation and Foundational Papers sections, then explore the guides, courses, and tools as needed.
Official Documentation
Anthropic Prompt Engineering Guide
- URL: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
- What it covers: Anthropic’s first-party recommendations for prompting Claude — XML tags, system prompts, prefilling, avoiding hallucinations, handling long documents
- Why read it: These are the canonical best practices for the model you are most likely working with. Anthropic documents Claude-specific behaviors that differ from other models.
- Best sections: “Be clear and direct”, “Use XML tags”, “Long context tips”, “Extended thinking”
Foundational Papers
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- URL: https://arxiv.org/abs/2201.11903
- Authors: Wei et al. (Google Brain), 2022
- What it covers: Introduces chain-of-thought prompting — showing that providing intermediate reasoning steps as few-shot examples dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks
- Key result: Few-shot CoT on PaLM 540B reached about 57% accuracy on GSM8K math problems, versus about 18% for standard few-shot prompting, roughly a 3x improvement
- Why read it: The foundational paper for the single most impactful prompting technique. Every engineer using LLMs for multi-step tasks should understand this.
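As a minimal sketch of the prompt format, the snippet below assembles a few-shot CoT prompt in which each exemplar spells out its reasoning before the answer. The `Exemplar` class and `build_cot_prompt` helper are hypothetical names for illustration, and the exemplar is modeled on the arithmetic word problems used in the paper; sending the prompt to a model is left out.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    reasoning: str  # intermediate steps written out in natural language
    answer: str


def build_cot_prompt(exemplars, question):
    """Format exemplars with their reasoning, then append the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The final block is left open so the model continues with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


EXEMPLARS = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        reasoning=(
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        answer="11",
    )
]

prompt = build_cot_prompt(
    EXEMPLARS,
    "A baker has 3 trays of 12 rolls and sells 10. How many rolls are left?",
)
print(prompt)
```

Standard few-shot prompting would use the same layout with the `reasoning` text removed, which is the baseline the paper compares against.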
Self-Consistency Improves Chain of Thought Reasoning in Language Models
- URL: https://arxiv.org/abs/2203.11171
- Authors: Wang et al. (Google Brain), 2022
- What it covers: Proposes sampling multiple diverse reasoning paths and taking a majority vote on the final answer, instead of using greedy decoding on a single CoT path
- Key result: Self-consistency with 40 sampled paths improved CoT accuracy on GSM8K by 17.9 percentage points
- Why read it: Critical for production systems where single-pass accuracy is insufficient. Also introduces the concept of reasoning path diversity as a signal of answer confidence.
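The voting step can be sketched in a few lines. This example assumes the completions have already been sampled (for instance, at a nonzero temperature) and that the final answer is the last number in each completion; `extract_answer` and `self_consistent_answer` are hypothetical helper names, and the sample completions are made up for illustration.

```python
import re
from collections import Counter


def extract_answer(completion):
    """Return the last number in a completion, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistent_answer(completions):
    """Majority vote over final answers, plus the winner's share of the votes."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Stand-ins for reasoning paths sampled from a model.
SAMPLED = [
    "3 trays of 12 is 36. 36 - 10 = 26. The answer is 26.",
    "12 * 3 = 36 rolls, minus 10 sold leaves 26.",
    "3 + 12 = 15, and 15 - 10 = 5. The answer is 5.",
]

answer, share = self_consistent_answer(SAMPLED)
print(answer, round(share, 2))
```

The vote share is a rough proxy for confidence: a low share suggests the sampled paths disagree and the answer deserves a second look.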
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- URL: https://arxiv.org/abs/2305.10601
- Authors: Yao et al. (Princeton/Google DeepMind), 2023
- What it covers: Frames problem solving as search over a tree of partial solutions, where the model can evaluate and backtrack — a generalization of CoT that enables non-linear reasoning
- Key result: With GPT-4, ToT solved 74% of “Game of 24” problems vs 4% for standard CoT
- Why read it: Foundational for understanding agent architectures and structured reasoning. Conceptually underpins modern agent planning frameworks.
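The search idea can be illustrated without a model. The sketch below is a beam-limited breadth-first search in the spirit of ToT: `expand` proposes next steps and `score` ranks partial solutions, both standing in for what would be model calls in the paper's setup. The toy task (reach 10 from 1 using "+3" and "*2") and all function names are assumptions chosen for illustration.

```python
def tree_search(root, expand, score, is_goal, beam_width=3, max_depth=5):
    """Breadth-first search that keeps only the best partial solutions per level."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        for state in candidates:
            if is_goal(state):
                return state
        # Pruning: branches outside the top `beam_width` are abandoned here.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        if not frontier:
            break
    return None


TARGET = 10


def expand(state):
    """Propose the next steps from a partial solution (value, path)."""
    value, path = state
    return [(value + 3, path + ("+3",)), (value * 2, path + ("*2",))]


def score(state):
    """Heuristic stand-in for model self-evaluation: closer to the target is better."""
    return -abs(TARGET - state[0])


def is_goal(state):
    return state[0] == TARGET


solution = tree_search((1, ()), expand, score, is_goal)
print(solution)
```

Plain CoT corresponds to following a single path with no pruning or comparison between alternatives; the beam is what lets weaker branches be dropped in favor of more promising ones.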
ReAct: Synergizing Reasoning and Acting in Language Models
- URL: https://arxiv.org/abs/2210.03629
- Authors: Yao et al. (Princeton), 2022
- What it covers: Interleaves reasoning traces and task-specific actions (e.g., search queries, API calls), allowing models to dynamically adjust plans based on observations
- Key result: ReAct outperforms pure reasoning (CoT) and pure acting approaches on knowledge-intensive tasks like HotpotQA and FEVER
- Why read it: The direct precursor to modern tool-using agent patterns. Understanding ReAct is essential for the 04-agents module.
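A minimal sketch of the thought/action/observation loop follows. The `lookup` tool, its in-memory facts, and the `scripted_model` stub are all hypothetical; the stub replays fixed steps so the control flow can be run without an API key, and a real system would replace it with a model call.

```python
import re


def lookup(term):
    """Hypothetical tool: a tiny in-memory knowledge base."""
    facts = {
        "author of Dune": "Frank Herbert",
        "birth year of Frank Herbert": "1920",
    }
    return facts.get(term, "no result")


TOOLS = {"lookup": lookup}
ACTION = re.compile(r"Action: (\w+)\[(.*?)\]")


def react_loop(model, question, max_steps=5):
    """Alternate model steps with tool calls until the model emits finish[...]."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        match = ACTION.search(step)
        if match is None:
            continue  # a thought with no action; ask the model for another step
        name, arg = match.groups()
        if name == "finish":
            return arg, transcript
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        # The observation is fed back so the next thought can react to it.
        transcript += f"Observation: {observation}\n"
    return None, transcript


def scripted_model(transcript):
    """Stand-in for an LLM: picks the next scripted step from the observation count."""
    script = [
        "Thought: I need the author first.\nAction: lookup[author of Dune]",
        "Thought: Now I need the birth year.\nAction: lookup[birth year of Frank Herbert]",
        "Thought: I have what I need.\nAction: finish[1920]",
    ]
    return script[transcript.count("Observation:")]


answer, transcript = react_loop(scripted_model, "When was the author of Dune born?")
print(transcript)
```

The `max_steps` cap is a design choice worth keeping in real agents, since a model that never emits a finish action would otherwise loop indefinitely.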
Large Language Models Are Zero-Shot Reasoners
- URL: https://arxiv.org/abs/2205.11916
- Authors: Kojima et al. (University of Tokyo), 2022
- What it covers: Discovers that “Let’s think step by step” as a zero-shot prompt suffix elicits CoT reasoning without any examples — the famous “zero-shot CoT” result
- Key result: Zero-shot CoT with “Let’s think step by step” improved MultiArith accuracy from 17.7% to 78.7%
- Why read it: Presents the evidence behind one of the most widely used prompting phrases. Short, readable paper, and highly recommended.
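The paper's two-stage pipeline (elicit reasoning, then extract the answer) can be sketched as two prompt builders. The function names and the default answer cue are assumptions for illustration; the paper varies the extraction cue by answer format.

```python
TRIGGER = "Let's think step by step."


def reasoning_prompt(question):
    """Stage 1: the trigger phrase is appended where the answer would begin."""
    return f"Q: {question}\nA: {TRIGGER}"


def answer_prompt(question, reasoning, cue="Therefore, the answer is"):
    """Stage 2: feed the generated reasoning back and cue a short final answer."""
    return f"{reasoning_prompt(question)} {reasoning}\n{cue}"


question = "A baker has 3 trays of 12 rolls and sells 10. How many rolls are left?"
stage_one = reasoning_prompt(question)

# Stand-in for the model's stage-1 completion.
reasoning = "3 trays of 12 rolls is 36 rolls. Selling 10 leaves 26."
stage_two = answer_prompt(question, reasoning)
print(stage_two)
```

No exemplars appear anywhere in either prompt, which is what distinguishes this from the few-shot CoT setup.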
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- URL: https://arxiv.org/abs/2205.10625
- Authors: Zhou et al. (Google Research), 2022
- What it covers: A two-stage prompting strategy: first decompose a complex problem into sub-problems, then solve them in sequence using previous answers as context
- Why read it: Addresses a key limitation of standard CoT — handling problems that require solving simpler prerequisites first. Directly applicable to multi-step software engineering tasks.
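The two stages can be sketched as a short loop. Here `least_to_most` is a hypothetical helper and `stub_model` is a canned stand-in for an LLM so the flow can be run offline; the prompt wording is an assumption, not the paper's exact template.

```python
def least_to_most(model, problem):
    """Decompose a problem, then solve subquestions in order, reusing earlier answers."""
    # Stage 1: ask for simpler subquestions, easiest first.
    plan = model(
        "Break this problem into simpler subquestions, one per line, "
        f"easiest first:\n{problem}"
    )
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage 2: each prompt carries every previously answered subquestion.
    context = f"Problem: {problem}\n"
    answer = None
    for sub in subquestions:
        answer = model(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer, context


def stub_model(prompt):
    """Stand-in for an LLM: returns canned text based on the prompt it receives."""
    if prompt.startswith("Break this problem"):
        return "How many rolls are baked?\nHow many rolls are left after selling 10?"
    if "How many rolls are left" in prompt.split("Q:")[-1]:
        return "36 - 10 = 26"
    return "3 * 12 = 36"


final_answer, context = least_to_most(
    stub_model,
    "A baker has 3 trays of 12 rolls and sells 10. How many rolls are left?",
)
print(context)
```

The answer to the last subquestion is treated as the answer to the original problem, which assumes the decomposition ends with the original question or an equivalent of it.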
Comprehensive Guides & Courses
Prompting Guide (promptingguide.ai)
- URL: https://www.promptingguide.ai
- What it covers: Comprehensive reference for all major prompting techniques with examples across multiple models. Covers zero/few-shot, CoT, ToT, ReAct, self-consistency, and more.
- Why use it: Best single reference for looking up a specific technique quickly. Well-maintained with citations to original papers.
- Best sections: Techniques overview, Model-specific guides, Applications
Learn Prompting
- URL: https://learnprompting.org
- What it covers: Interactive, structured course on prompt engineering from basics to advanced topics. Covers defensive prompting, image prompting, and agent design.
- Why use it: More approachable than reading papers. Good for filling in gaps after you’ve read the core papers.
- Recommended path: “Basics” → “Intermediate” → “Prompt Hacking”
Security & Adversarial Prompting
Prompt Injection Resources (Brex)
- URL: https://github.com/brexhq/prompt-security
- What it covers: Practical guide from Brex’s engineering team on prompt injection attacks and defenses in production LLM systems. Includes attack taxonomy, real-world examples, and mitigation strategies.
- Why read it: One of the most practically useful security resources for engineers building LLM-powered products. Written by practitioners, not just researchers.
Prompt Injection Attacks Against GPT-3 (Simon Willison)
- URL: https://simonwillison.net/2022/Sep/12/prompt-injection/
- What it covers: The blog post that popularized the term “prompt injection” — explains the attack with clear examples and analogies to SQL injection
- Why read it: Clear, concise introduction to the attack surface. Good for explaining the concept to non-ML engineers.
Tools & Infrastructure
Anthropic Console
- URL: https://console.anthropic.com
- What it is: Anthropic’s web IDE for prompt development. Features: prompt editor with version history, side-by-side model comparison, token counting, test case library.
- Best for: Rapid prototyping, sharing prompts with team, exploring model differences
LangSmith
- URL: https://smith.langchain.com
- What it is: LLM observability platform. Features: run tracing, prompt versioning, dataset management, evaluation pipelines, A/B testing.
- Best for: Production systems requiring full observability and systematic prompt evaluation
OpenAI Evals (framework, not model-specific)
- URL: https://github.com/openai/evals
- What it is: Open-source evaluation framework. Useful for building automated prompt evaluation suites even if you’re not using OpenAI models.
- Best for: Building systematic evals for prompt regression testing
Recommended Reading Order
For a software engineer building prompt engineering skills:
- Start here: Anthropic Prompt Engineering Guide (docs) — 30 minutes
- Core technique: “Large Language Models Are Zero-Shot Reasoners” (Kojima et al.) — 20 minutes
- Core technique: “Chain-of-Thought Prompting Elicits Reasoning” (Wei et al.) — 40 minutes
- Security: Brex prompt security repo README — 20 minutes
- Reference: Bookmark promptingguide.ai for technique lookup
- Deep dive: “Self-Consistency” (Wang et al.) and “Tree of Thoughts” (Yao et al.) when needed for your use case
- Production: LangSmith docs when you need observability and evals infrastructure