References: Prompt Engineering

A curated reading list organized by category. Start with the Official Docs and Papers sections, then explore the courses and tools as needed.


Official Documentation

Anthropic Prompt Engineering Guide

  • URL: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
  • What it covers: Anthropic’s first-party recommendations for prompting Claude — XML tags, system prompts, prefilling, avoiding hallucinations, handling long documents
  • Why read it: These are the canonical best practices for the model you are most likely working with. Anthropic documents Claude-specific behaviors that differ from other models.
  • Best sections: “Be clear and direct”, “Use XML tags”, “Long context tips”, “Extended thinking”

Foundational Papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

  • URL: https://arxiv.org/abs/2201.11903
  • Authors: Wei et al. (Google Brain), 2022
  • What it covers: Introduces chain-of-thought prompting — showing that providing intermediate reasoning steps as few-shot examples dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks
  • Key result: Few-shot CoT on GPT-3 (540B) achieved 57% accuracy on GSM8K math problems, vs 17% for standard few-shot — a 3x improvement
  • Why read it: The foundational paper for the single most impactful prompting technique. Every engineer using LLMs for multi-step tasks should understand this.

Self-Consistency Improves Chain of Thought Reasoning in Language Models

  • URL: https://arxiv.org/abs/2203.11171
  • Authors: Wang et al. (Google Brain), 2022
  • What it covers: Proposes sampling multiple diverse reasoning paths and taking a majority vote on the final answer, instead of using greedy decoding on a single CoT path
  • Key result: Self-consistency with 40 sampled paths improved CoT performance by 17.9% on GSM8K
  • Why read it: Critical for production systems where single-pass accuracy is insufficient. Also introduces the concept of reasoning path diversity as a signal of answer confidence.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

  • URL: https://arxiv.org/abs/2305.10601
  • Authors: Yao et al. (Princeton/Google DeepMind), 2023
  • What it covers: Frames problem solving as search over a tree of partial solutions, where the model can evaluate and backtrack — a generalization of CoT that enables non-linear reasoning
  • Key result: ToT solved 74% of “Game of 24” problems vs 4% for standard CoT
  • Why read it: Foundational for understanding agent architectures and structured reasoning. Conceptually underpins modern agent planning frameworks.

ReAct: Synergizing Reasoning and Acting in Language Models

  • URL: https://arxiv.org/abs/2210.03629
  • Authors: Yao et al. (Princeton), 2022
  • What it covers: Interleaves reasoning traces and task-specific actions (e.g., search queries, API calls), allowing models to dynamically adjust plans based on observations
  • Key result: ReAct outperforms pure reasoning (CoT) and pure acting approaches on knowledge-intensive tasks like HotpotQA and FEVER
  • Why read it: The direct precursor to modern tool-using agent patterns. Understanding ReAct is essential for the 04-agents module.

Large Language Models Are Zero-Shot Reasoners

  • URL: https://arxiv.org/abs/2205.11916
  • Authors: Kojima et al. (University of Tokyo), 2022
  • What it covers: Discovers that “Let’s think step by step” as a zero-shot prompt suffix elicits CoT reasoning without any examples — the famous “zero-shot CoT” result
  • Key result: Zero-shot CoT with “Let’s think step by step” improved MultiArith accuracy from 17.7% to 78.7%
  • Why read it: Explains the mechanism behind the single most useful prompting phrase. Short, readable paper — highly recommended.

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

  • URL: https://arxiv.org/abs/2205.10625
  • Authors: Zhou et al. (Google Research), 2022
  • What it covers: A two-stage prompting strategy: first decompose a complex problem into sub-problems, then solve them in sequence using previous answers as context
  • Why read it: Addresses a key limitation of standard CoT — handling problems that require solving simpler prerequisites first. Directly applicable to multi-step software engineering tasks.

Comprehensive Guides & Courses

Prompting Guide (promptingguide.ai)

  • URL: https://www.promptingguide.ai
  • What it covers: Comprehensive reference for all major prompting techniques with examples across multiple models. Covers zero/few-shot, CoT, ToT, ReAct, self-consistency, and more.
  • Why use it: Best single reference for looking up a specific technique quickly. Well-maintained with citations to original papers.
  • Best sections: Techniques overview, Model-specific guides, Applications

Learn Prompting

  • URL: https://learnprompting.org
  • What it covers: Interactive, structured course on prompt engineering from basics to advanced topics. Covers defensive prompting, image prompting, and agent design.
  • Why use it: More approachable than reading papers. Good for filling in gaps after you’ve read the core papers.
  • Recommended path: Start with “Basics” → “Intermediate” → “Prompt Hacking” section

Security & Adversarial Prompting

Prompt Injection Resources (Brex)

  • URL: https://github.com/brexhq/prompt-security
  • What it covers: Practical guide from Brex’s engineering team on prompt injection attacks and defenses in production LLM systems. Includes attack taxonomy, real-world examples, and mitigation strategies.
  • Why read it: One of the most practically useful security resources for engineers building LLM-powered products. Written by practitioners, not just researchers.

Prompt Injection Attacks Against GPT-3 (Simon Willison)

  • URL: https://simonwillison.net/2022/Sep/12/prompt-injection/
  • What it covers: The blog post that popularized the term “prompt injection” — explains the attack with clear examples and analogies to SQL injection
  • Why read it: Clear, concise introduction to the attack surface. Good for explaining the concept to non-ML engineers.

Tools & Infrastructure

Anthropic Console

  • URL: https://console.anthropic.com
  • What it is: Anthropic’s web IDE for prompt development. Features: prompt editor with version history, side-by-side model comparison, token counting, test case library.
  • Best for: Rapid prototyping, sharing prompts with team, exploring model differences

LangSmith

  • URL: https://smith.langchain.com
  • What it is: LLM observability platform. Features: run tracing, prompt versioning, dataset management, evaluation pipelines, A/B testing.
  • Best for: Production systems requiring full observability and systematic prompt evaluation

OpenAI Evals (framework, not model-specific)

  • URL: https://github.com/openai/evals
  • What it is: Open-source evaluation framework. Useful for building automated prompt evaluation suites even if you’re not using OpenAI models.
  • Best for: Building systematic evals for prompt regression testing

For a software engineer building prompt engineering skills:

  1. Start here: Anthropic Prompt Engineering Guide (docs) — 30 minutes
  2. Core technique: “Large Language Models Are Zero-Shot Reasoners” (Kojima et al.) — 20 minutes
  3. Core technique: “Chain-of-Thought Prompting Elicits Reasoning” (Wei et al.) — 40 minutes
  4. Security: Brex prompt security repo README — 20 minutes
  5. Reference: Bookmark promptingguide.ai for technique lookup
  6. Deep dive: “Self-Consistency” (Wang et al.) and “Tree of Thoughts” (Yao et al.) when needed for your use case
  7. Production: LangSmith docs when you need observability and evals infrastructure