Interview Question Bank

For each question: key points a strong answer must hit, a model answer, and follow-up probes.


Fundamentals (10 questions)

Q1: Explain how the transformer attention mechanism works.

Key points: Q/K/V matrices, dot-product similarity, softmax normalization, weighted sum of values, why it’s O(n²) in sequence length, multi-head attention.

Model answer: Attention computes, for each token, a weighted average of all other tokens’ value vectors. The weight for each pair is the dot product of the query vector (what this token is looking for) and the key vector (what that token offers), scaled by √d and passed through softmax. Multi-head attention runs this in parallel across H subspaces, letting the model attend to different relationship types simultaneously — syntax in one head, coreference in another. The O(n²) cost comes from computing all n² pair-wise similarities.

Follow-ups: Why scale by √d? What is the KV cache and how does it reduce inference cost? How does causal masking work in decoder models?


Q2: What is tokenization and why does it matter for LLM applications?

Key points: Subword units (BPE), token ≠ word, context window measured in tokens, cost is per-token, non-English text is token-dense, numbers and code tokenize differently.

Model answer: Tokenization splits text into subword units using algorithms like Byte Pair Encoding. A token is roughly 0.75 English words on average, but this varies enormously — a single emoji can be 1–3 tokens, a Chinese character is often 1–2 tokens, and large numbers split into many tokens. This matters for three reasons: (1) API cost is per token, so dense encodings cost more, (2) context windows are measured in tokens, so you must estimate token budgets before sending requests, (3) tasks that manipulate individual characters (counting, reversal) are harder because the model reasons over tokens, not characters.

Follow-ups: How would you count tokens before sending an API request? Why does tiktoken give different counts than the model’s reported usage? What is the “tokenization tax” for code vs prose?


Q3: What is the difference between fine-tuning, prompting, and RAG?

Key points: Fine-tuning bakes knowledge into weights (expensive, static), prompting injects context at runtime (flexible, cheap), RAG retrieves relevant docs dynamically (handles large/fresh knowledge bases). Decision factors: data size, freshness, cost, latency.

Model answer: Prompting is the baseline: you shape behavior through the input text. No training, no infra, instant iteration. RAG adds a retrieval step — you embed a query, find relevant chunks in a vector store, and inject them into the prompt. Best when knowledge is too large for the context window, frequently updated, or proprietary. Fine-tuning modifies the model weights via gradient descent. Best for teaching the model a consistent style, format, or domain-specific reasoning pattern — things that can’t easily be expressed in a prompt. In practice: try prompting first, add RAG if the knowledge base is large, fine-tune only if neither is sufficient.

Follow-ups: When would you combine RAG and fine-tuning? What are the risks of fine-tuning on low-quality data? How do you evaluate whether fine-tuning actually improved things?


Q4: How does temperature affect LLM output?

Key points: Temperature scales logits before softmax (higher = flatter distribution = more randomness), 0 ≈ greedy (most likely token), >1 = more uniform/creative/chaotic, affects diversity not accuracy.

Model answer: Temperature T divides each logit by T before the softmax. At T=0, the model always picks the highest-probability token (greedy decoding) — deterministic but potentially repetitive. At T=1, the distribution is unchanged from training. At T>1, the distribution flattens — low-probability tokens become more likely, increasing diversity but also nonsense. For production tasks requiring consistency (classification, extraction, code), use T=0 or 0.1. For creative tasks, try T=0.7–1.0. Note: T=0 is only approximately deterministic in practice due to floating-point nondeterminism.

Follow-ups: How does top-p interact with temperature? When would you use temperature >1? What is the difference between temperature and repetition penalty?


Q5: What is the KV cache and how does it affect latency?

Key points: Caches key and value matrices from previous tokens so they don’t need to be recomputed, reduces prefill cost on repeated prefixes, Anthropic prompt caching extends this across requests.

Model answer: During autoregressive generation, each new token needs attention over all previous tokens. Naively, this requires recomputing K and V for every previous token at every step — O(n²) total work. The KV cache stores K and V matrices as they’re computed so each new token only needs to compute its own K, V, and Q and then attend over the cached values. This reduces generation from O(n²) to O(n). Anthropic’s prompt caching extends this idea to the API level: if your system prompt is identical across requests, the server caches its K/V state and charges ~10x less for those tokens while reducing time-to-first-token.

Follow-ups: What fills up GPU memory in a long-context deployment? How does KV cache interact with batching? What is the “cache warming” problem?


Q6: Explain encoder-only vs decoder-only transformers.

Key points: Encoder (BERT) = bidirectional attention, good for understanding/classification. Decoder (GPT) = causal/unidirectional attention, good for generation. Encoder-decoder (T5) = both, good for seq2seq tasks.

Model answer: Encoder-only models (BERT, RoBERTa) use bidirectional attention — each token attends to all others. Great for tasks requiring full-sentence understanding: classification, NER, embeddings. They don’t generate text naturally. Decoder-only models (GPT, Claude, Llama) use causal masking — each token only attends to previous tokens. This makes them natural text generators: predict the next token, sample it, append it, repeat. Encoder-decoder models (T5, BART) use an encoder to build a representation of the input, then a decoder to generate output attending to both the encoder output and previous generated tokens. Good for translation, summarization.

Follow-ups: Why are decoder-only models now the dominant architecture even for classification? What is prefix LM and where does it fit? Why does bidirectional attention not work for generation?


Q7: What happens when a prompt exceeds the context window?

Key points: Hard error at the API level, must truncate or compress; strategies: sliding window, summarization, RAG, hierarchical processing.

Model answer: The API returns an error and refuses to process the request. You must either truncate (simplest, but loses information), compress (summarize old conversation turns), or restructure (use RAG to pull only relevant chunks rather than including everything). For long documents: chunk and retrieve. For long conversations: maintain a running summary and a sliding window of recent turns. For agent traces: compress intermediate steps into a status summary. The “lost in the middle” research shows that even within the context window, information in the middle of a long context is retrieved less reliably than information at the beginning or end.

Follow-ups: How do you decide what to truncate when you must? How do you implement conversation summarization without losing critical facts? What is the “lost in the middle” problem and how do you design around it?


Q8: What is instruction tuning and why was it important?

Key points: Supervised fine-tuning on (instruction, response) pairs, made models follow directions instead of just completing text, enabled chat interfaces, InstructGPT/RLHF further aligned outputs with human preferences.

Model answer: Base language models are trained to predict the next token on web text. They’re good at completion but don’t inherently follow instructions — if you prompt “Write a poem about cats,” the base model might continue “Write a poem about cats. Here are some examples…” rather than writing the poem. Instruction tuning fine-tunes the model on thousands of (instruction, desired response) pairs. The model learns to treat the prompt as an instruction to follow, not text to complete. InstructGPT added RLHF (reinforcement learning from human feedback) on top, further aligning outputs with human preferences for helpfulness, harmlessness, and honesty. This made models dramatically more useful in practice.

Follow-ups: What are the downsides of instruction tuning? What is RLHF and what problem does it solve that supervised fine-tuning doesn’t? What is Constitutional AI (Anthropic’s approach)?


Q9: How does beam search differ from sampling?

Key points: Beam search keeps top-k hypotheses at each step (deterministic, high probability, but repetitive), sampling draws from the distribution (stochastic, diverse, can be incoherent).

Model answer: Beam search maintains a “beam” of k candidate sequences. At each step, it expands all k candidates with all possible next tokens and keeps the k sequences with highest cumulative probability. It’s deterministic and tends to produce grammatical, high-probability text — but often generic and repetitive. Pure sampling draws a token randomly according to the probability distribution, then repeats. It’s more diverse and creative but can be incoherent. In practice, LLMs now mostly use sampling with constraints: top-p (only sample from tokens that make up the top P% of probability mass) or top-k (only sample from the k most likely tokens), combined with temperature scaling.

Follow-ups: Why did beam search fall out of favor for dialogue/creative tasks? What is the “degeneration” problem with beam search? When would you still use beam search today?


Q10: What is the “lost in the middle” problem?

Key points: LLMs retrieve information at the start and end of long contexts better than information in the middle, affects RAG (put most relevant chunks first or last), measured in a 2023 paper.

Model answer: A 2023 study (“Lost in the Middle,” Liu et al.) found that when a relevant document is placed in the middle of a long context window, LLMs perform significantly worse at using it compared to when it’s placed at the start or end. Performance forms a U-shape with context position. This has direct implications for RAG: (1) put the most relevant retrieved chunks first or last, not in the middle of a pile of context, (2) don’t retrieve more chunks than necessary just because you have context space — irrelevant context in the middle degrades performance, (3) reranking retrieved chunks by relevance (not just similarity score) helps because the top-ranked chunk ends up at the top of the context.

Follow-ups: How does this affect how you structure a RAG prompt? Does this problem get better with longer context models? How do you test whether your RAG system is hitting this problem?


RAG (8 questions)

Q11: How would you chunk a 100-page PDF for RAG?

Key points: Don’t use fixed-size blindly, consider document structure (sections/paragraphs), semantic chunking > arbitrary splits, overlap for context continuity, test chunk sizes empirically (256–512 tokens common starting point).

Model answer: First, extract structure: a PDF with clear section headings should be chunked at section boundaries, not arbitrary character counts. For dense technical text, I’d use 300–500 token chunks with 50-token overlap. For narrative text, sentence-based chunking preserves meaning better. I’d avoid splitting in the middle of sentences or code blocks. I’d also create metadata per chunk (page number, section title, document ID) for citation and filtering. Most importantly, I’d evaluate the chunking: run 20 representative queries, check if retrieved chunks actually contain the answer, and tune chunk size if retrieval precision is low.

Follow-ups: How do you handle tables in PDFs? What is parent-document retrieval and when would you use it? How do you handle PDFs where the logical structure doesn’t match the visual layout?


Key points: Combines dense (semantic/embedding) and sparse (BM25/keyword) retrieval via RRF fusion, better for keyword-heavy queries, proper nouns, codes/IDs, technical terms.

Model answer: Dense retrieval finds semantically similar documents even with different words — good for concept-level queries. Sparse retrieval (BM25) does exact and near-exact keyword matching — good for specific product names, error codes, identifiers, or when the user uses the exact words that appear in the document. Hybrid retrieval runs both, then merges the ranked lists using Reciprocal Rank Fusion: score = Σ 1/(k + rank_i) for each system. This is robust because a document ranked #1 by one system and #15 by another still beats a document ranked #10 by both. In practice, hybrid retrieval consistently outperforms either system alone, especially for heterogeneous query types.

Follow-ups: How do you weight the two systems in hybrid retrieval? What is RRF and why is it preferred over score normalization? When would pure BM25 outperform hybrid?


Q13: How do you evaluate a RAG pipeline?

Key points: RAGAS metrics (faithfulness, answer relevancy, context precision, context recall), golden dataset, automated LLM-as-judge for scale, human review for calibration.

Model answer: I’d use the RAGAS framework with four metrics: faithfulness (does the answer come from the retrieved context — catches hallucination), answer relevancy (does it actually answer the question), context precision (are the retrieved chunks relevant — measures retrieval quality), and context recall (did we retrieve all information needed — measures coverage). I’d build a golden dataset of 50–100 question/answer/ground-truth-context triples, use an LLM as the evaluator for scale, and manually review a sample to calibrate. I’d run this as a regression suite — every time I change chunking, embedding model, or retrieval strategy, I re-run and compare the metrics.

Follow-ups: How do you build a good golden dataset? What are the failure modes of LLM-as-judge for RAG evaluation? How do you detect when your RAG system is hallucinating?


Q14: What is HyDE and when would you use it?

Key points: Hypothetical Document Embeddings — generate a fake answer, embed it, use that embedding for retrieval instead of the query embedding. Helps when query and documents are semantically distant (question vs answer asymmetry).

Model answer: The intuition behind HyDE is that a question (“What causes type 2 diabetes?”) and its answer (“Type 2 diabetes is caused by…”) are semantically different, so the query embedding may not match the document embeddings well. HyDE generates a hypothetical answer to the query using an LLM (without access to the real documents), embeds that hypothetical answer, and uses that embedding for retrieval. The hypothesis is in the same “answer space” as the documents, so it retrieves better. It’s especially useful for complex questions, abstract concepts, and domains where query-document semantic mismatch is high. The downside: an extra LLM call per query (cost + latency).

Follow-ups: What if the hypothetical document is wrong or hallucinated — does it hurt retrieval? How does HyDE compare to multi-query retrieval? When would you not use HyDE?


Q15: How do you handle a question that requires information from multiple documents?

Key points: Multi-hop retrieval, iterative/agentic RAG, retrieve → partial answer → re-retrieve, or retrieve more chunks and hope they’re all present.

Model answer: Single-pass RAG retrieves chunks for the original query, but multi-hop questions (“What did the CEO who founded company X say about Y?”) require first finding company X’s founder, then finding what they said. Options: (1) retrieve more chunks (k=10 instead of 3) and hope both are captured — simple but expensive; (2) iterative retrieval: answer partially, identify missing information, re-query; (3) agentic RAG: the LLM decides when to query and what to query for, building up context over multiple retrieval steps. The agentic approach is most powerful but adds latency and complexity. For predictable multi-hop patterns, you can also use a query decomposition step: break the question into sub-questions, retrieve for each, then synthesize.

Follow-ups: How do you know when iterative retrieval should stop? What is FLARE (Forward-Looking Active Retrieval)? How do you prevent context explosion in multi-hop retrieval?


Q16: What’s the difference between dense and sparse retrieval?

Key points: Dense = embedding-based semantic similarity, sparse = keyword/BM25 term frequency, dense handles synonyms/concepts, sparse handles exact matches, both have failure modes.

Model answer: Dense retrieval embeds queries and documents into vectors and finds nearest neighbors by cosine similarity. It understands semantic equivalence — “automobile” and “car” have similar vectors. But it can miss exact matches — a query for “SKU-4892” might not retrieve the document mentioning “SKU-4892” if that code is rare in training data. Sparse retrieval (BM25) is based on term frequency and inverse document frequency — it scores documents by how often query terms appear in them relative to the corpus. It excels at exact matches and rare terms but fails for synonyms and concept-level queries. This is why hybrid retrieval combining both consistently outperforms either alone.

Follow-ups: How does sparse retrieval handle multi-word phrases? What is SPLADE and how does it combine dense and sparse ideas? When would you choose sparse-only retrieval?


Q17: How do you handle stale embeddings in a production RAG system?

Key points: Re-embed on document update, hash-based change detection, incremental indexing, embedding model versioning.

Model answer: Stale embeddings happen when documents change but their stored embeddings weren’t updated. Three problems: (1) document content changed but old embedding still matches, (2) embedding model was upgraded and old embeddings aren’t comparable to new ones, (3) new documents were added but not indexed. Solutions: (1) hash document content at index time, re-embed when hash changes; (2) maintain a version tag per embedding and run a migration job when you upgrade the embedding model — rebuild the index; (3) run incremental indexing on a schedule or event-triggered (webhook on document update). In practice, I’d use a pipeline: document store → change detection → embedding job → vector store upsert, with monitoring for index freshness.

Follow-ups: How do you handle embedding model upgrades without a full re-index? What is the risk of mixing embeddings from different model versions in the same index? How do you monitor index freshness?


Q18: What is reranking and what are the trade-offs?

Key points: Two-stage retrieval — retrieve broad set (recall), rerank with stronger model (precision), Cohere Rerank / ColBERT, adds latency and cost but improves precision significantly.

Model answer: First-stage retrieval (dense/BM25) optimizes for recall — retrieve the top-50 candidates quickly. Reranking then applies a stronger, slower model to the candidates to re-score them for relevance, keeping only the top 3–5 for the final prompt. Cohere Rerank is a cross-encoder that jointly encodes the query and each candidate, giving much better relevance scores than the query embedding alone. ColBERT uses late interaction — more efficient than cross-encoders. The trade-off: reranking adds 100–500ms latency and cost, but significantly improves the quality of what goes into the LLM’s context. Worth it when retrieval precision is the bottleneck (wrong chunks are being included) rather than coverage (correct chunks are being missed).

Follow-ups: What is the difference between a bi-encoder and a cross-encoder? When is reranking not worth the cost? How do you measure whether reranking actually helped?


Agents (8 questions)

Q19: What is the ReAct pattern?

Key points: Interleaves Reasoning traces (Thought) with Actions (tool calls) and Observations (results), from a 2022 paper, better than pure reasoning chains because it grounds reasoning in real information.

Model answer: ReAct (Reasoning + Acting) is a prompting pattern where the model alternates between writing a reasoning trace and taking an action. A typical step looks like: “Thought: I need to find the current price of AAPL. Action: search_web(‘AAPL stock price’). Observation: AAPL is currently trading at $182. Thought: Now I can answer the question.” This is better than pure chain-of-thought because reasoning is grounded in real retrieved information rather than the model’s potentially stale parametric knowledge. It’s better than pure tool-calling without reasoning because the intermediate thoughts help the model plan and course-correct. Almost every production agent framework implements some variant of ReAct.

Follow-ups: What are the failure modes of ReAct? How does ReAct compare to Plan-and-Execute? What happens when a tool call returns an error in a ReAct loop?


Q20: How do you prevent an agent from looping infinitely?

Key points: Hard step limit, detect repeated states, timeout, budget guard on API cost, human-in-the-loop interrupt.

Model answer: Multiple layers: (1) Hard step limit — max 20 iterations, then force a final answer with whatever the agent has. (2) Loop detection — hash the (action, input) pair at each step; if you see it twice, break and report. (3) Timeout — wall-clock time limit, not just step count. (4) Cost guard — if cumulative API spend exceeds $X for a single task, abort. (5) Progress check — every 5 steps, have the agent assess whether it’s making progress; if it reports no progress, escalate to human. In practice, step limits catch 95% of cases. Loop detection handles the specific “agent keeps calling the same tool with the same args” failure mode.

Follow-ups: How do you communicate a graceful stop to the user (partial results vs error)? What’s the right step limit for different task types? How do you log agent loops for debugging?


Q21: How do you handle tool call errors?

Key points: Return error info in tool_result (not raise exception), let the LLM decide how to recover, log all errors, retry transient errors, don’t silently swallow errors.

Model answer: When a tool fails, return the error information in the tool_result content block rather than raising an exception — the LLM is part of the error handling loop. A tool_result with {"error": "API timeout after 5s", "suggestion": "try again or use a different approach"} lets the model decide: retry, try a different tool, or tell the user it can’t complete the task. For transient errors (timeouts, rate limits), implement automatic retry with backoff inside the tool before returning. For permanent errors (invalid input, resource not found), return a clear error message. Always log: tool name, input, error type, and stack trace for debugging.

Follow-ups: What is the difference between a tool error and an agent failure? How do you prevent an agent from retrying a tool that keeps failing? How do you surface tool errors to end users?


Q22: What makes a good tool description?

Key points: Verb-noun naming, clear description of what it does AND when to use it AND what it returns, typed schemas with per-field descriptions, examples in the description, specify edge cases.

Model answer: A tool description is part of the model’s context — it’s essentially a prompt. A bad description: search(query: str) -> str. A good description: “Search the company knowledge base for relevant documentation. Use this when the user asks about internal processes, policies, or product specifications. Returns the top 3 matching document excerpts with their titles and URLs. Will return an empty list if no relevant documents are found. Do not use for general internet searches — use web_search for that.” The key elements: (1) what it does, (2) when to use it vs alternatives, (3) what the return value looks like, (4) edge cases and failure modes. Per-field descriptions in the JSON schema are equally important.

Follow-ups: How do you test whether a tool description is good? What happens when two tools have overlapping descriptions? How do you handle tools that have complex conditional behavior?


Q23: When would you NOT use an agent?

Key points: When the task structure is known and fixed (use a chain), when latency is critical, when the output must be deterministic, when cost matters and a single call suffices.

Model answer: Agents have real overhead: multiple API calls, unpredictable latency, higher cost, and harder-to-debug behavior. Don’t use an agent when: (1) the steps are known in advance — use a chain or pipeline; (2) the task can be done in a single LLM call — agents add unnecessary complexity; (3) latency is critical (< 1s response time) — agents rarely achieve this; (4) you need deterministic, auditable behavior — agents are inherently non-deterministic; (5) the user needs to trust the output and you can’t explain the agent’s reasoning. A single well-crafted prompt often outperforms a simple ReAct agent on bounded tasks.

Follow-ups: How do you know when a task is “complex enough” for an agent? What is the cost penalty for adding a ReAct loop vs a single call? Can you make an agent’s behavior more deterministic?


Q24: How do parallel tool calls work?

Key points: LLM returns multiple tool_use blocks in one response, client runs all tools concurrently, sends all tool_results back in one user message, model then continues.

Model answer: When Claude decides multiple tools can be called independently, it returns them all in a single assistant response — multiple tool_use blocks with different IDs. The client is responsible for running them in parallel (e.g., asyncio.gather()), collecting all results, and sending them back as a single user message containing all tool_result blocks, each with the matching tool_use_id. The model then reads all results together and continues. This is more efficient than sequential tool calls — a task requiring weather for 3 cities can make one round trip instead of three. The client must run tools concurrently to capture this benefit.

Follow-ups: What happens if one parallel tool call fails but others succeed? How do you handle tool calls that have dependencies on each other? Does Claude always parallelize when it can?


Q25: What is plan-and-execute?

Key points: Two-phase approach — first call generates a complete step-by-step plan, second loop executes each step, better for complex tasks requiring upfront reasoning, predictable structure.

Model answer: Instead of having the agent decide what to do at each step (ReAct), plan-and-execute separates planning from execution. Phase 1: send the task to the model with a “create a step-by-step plan” instruction. The model outputs a numbered plan: “1. Search for X, 2. Calculate Y from the result, 3. Format as Z.” Phase 2: execute each step in sequence, using tool calls as needed. Advantages: the full plan is visible for human review before execution; you can validate/modify the plan; the executor doesn’t need to reason about what to do next; the structure is predictable. Disadvantages: the plan can’t adapt to unexpected tool results mid-execution. Best for tasks with predictable structure.

Follow-ups: How do you handle when the plan needs to change mid-execution? How is plan-and-execute implemented in LangGraph? When would you prefer ReAct over plan-and-execute?


Q26: How do you test agent behavior reliably?

Key points: Golden trace datasets, deterministic tool mocking, input/output pair testing, trajectory evaluation, LLM-as-judge for open-ended responses.

Model answer: Agents are hard to test because they’re non-deterministic. Strategies: (1) Mock all tools deterministically — fixed inputs → fixed outputs. This makes the agent’s behavior depend only on the LLM. (2) Build a golden trace dataset: for representative inputs, record the expected sequence of tool calls and final output. Test that the agent takes the right actions. (3) Evaluate outcomes, not trajectories for open-ended tasks — use LLM-as-judge to score whether the final output is correct, regardless of path taken. (4) Property-based testing: define invariants that should always hold (e.g., “agent should never call delete_file without first reading it”) and check them across many inputs. (5) Always test failure paths: broken tools, empty results, malformed responses.

Follow-ups: How do you handle the LLM’s inherent randomness in tests? What is trajectory evaluation and how do you score it? How do you test for prompt injection vulnerabilities in agents?


Production (8 questions)

Q27: How does Anthropic prompt caching work?

Key points: Marks a prefix with cache_control: ephemeral, API caches K/V state for ~5 min, hits are ~10x cheaper input tokens and ~3x faster TTFT, must be ≥ 1024 tokens (Haiku) or 2048 tokens (Sonnet/Opus).

Model answer: You mark a content block with "cache_control": {"type": "ephemeral"}. The API caches the K/V attention state for that prefix for approximately 5 minutes. Subsequent requests with the same prefix up to the cache marker are served from cache — at roughly 10% the cost of uncached tokens and with reduced time-to-first-token. The cache is per-API-key. Best practice: put your stable system prompt and any static context (tools, documents) at the top of the request and mark them for caching, then put the variable user message at the end. Minimum cacheable size is 1024 tokens (Haiku) or 2048 tokens (Sonnet/Opus). The response’s usage object shows cache_read_input_tokens vs cache_creation_input_tokens.

Follow-ups: How do you calculate the cost savings from prompt caching? What is the minimum token threshold for caching? Can you cache tool definitions?


Q28: How would you implement retry with rate limits?

Key points: Exponential backoff with jitter, read Retry-After header if present, separate retry for transient (429, 5xx) vs permanent (4xx) errors, max retries cap.

Model answer:

import time, random
def with_retry(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # jitter
            time.sleep(wait)
        except APIError as e:
            if e.status_code < 500:
                raise  # don't retry 4xx
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

Key points: exponential backoff (2^attempt seconds) prevents thundering herd; jitter (random 0–1s) prevents synchronized retries from multiple clients; only retry transient errors (429, 5xx), not permanent ones (400, 401, 404); respect Retry-After header if present; cap at 5 retries.

Follow-ups: What is the “thundering herd” problem and how does jitter solve it? How do you implement a token bucket rate limiter client-side? How do you handle rate limits in a high-concurrency async system?


Q29: How do you track LLM costs per user?

Key points: Extract token counts from response usage field, multiply by model rates, aggregate by user_id in a database, add to every log line, alert on anomalies.

Model answer: Every Anthropic API response includes usage.input_tokens, usage.output_tokens, and usage.cache_read_input_tokens. After each call, compute cost = input_tokens × input_rate + output_tokens × output_rate - cache_read_tokens × (input_rate - cache_rate). Log this with user_id, feature_name, model, and timestamp. Aggregate in a database (or ClickHouse for scale). Alert when a user’s cost exceeds a budget or when aggregate cost spikes unexpectedly — the latter often indicates a bug (infinite loop, prompt injection causing long outputs). Expose per-user cost data to help users understand their consumption and to enforce quotas.

Follow-ups: How do you set per-user cost budgets? What does a cost spike usually indicate? How do you attribute costs to features vs users when one feature serves many users?


Q30: When would you use the Batch API?

Key points: 50% cost discount, async processing (up to 24h), best for offline eval runs, bulk document processing, nightly jobs — not for user-facing, latency-sensitive tasks.

Model answer: The Batch API is for workloads where you can tolerate up to 24 hours of latency in exchange for 50% cost reduction. Submit a JSONL file of requests, poll for completion, download results. Use cases: running your eval suite overnight, processing 10,000 documents for indexing, generating embeddings at scale, A/B testing prompts offline. Avoid for anything user-facing or time-sensitive. The workflow: create a batch → get a batch ID → poll every 30min until status = “completed” → download the output JSONL. You can also have partial results if some requests fail.

Follow-ups: How do you handle partial failures in a batch job? What is the cost difference in dollars for a typical 10,000-request eval run? How do you monitor a long-running batch job?


Q31: What’s the difference between TTFT and total response time?

Key points: TTFT = time to first token (critical for streaming UX), total = TTFT + generation time, TTFT dominated by prefill (input processing), generation dominated by number of output tokens.

Model answer: TTFT (time to first token) is the time from sending the request to receiving the first output token — this is what the user perceives as “responsiveness” in a streaming interface. Prompt caching directly reduces TTFT by skipping recomputation of cached prefixes. Total response time = TTFT + (output_tokens × time_per_token). For a 100-token output at 50 tokens/second, generation adds 2 seconds on top of TTFT. For UX, TTFT matters most — users tolerate slow streaming better than a long blank wait. For batch/offline jobs, total time matters. Monitor both P50 and P99 — P99 latency determines the worst case that some percentage of users experience.

Follow-ups: How does prompt length affect TTFT? How does output length affect total response time? What can you do architecturally to reduce P99 latency?


Q32: How would you deploy an LLM API to handle 10K req/day?

Key points: 10K/day ≈ 7 req/min average (very manageable), containerized service, async framework, connection pooling, prompt caching, horizontal scaling via load balancer.

Model answer: 10K req/day is roughly 7 requests/minute average — quite modest. A single async Python service (FastAPI + asyncio) can handle this easily. Architecture: FastAPI app in a Docker container on Cloud Run or Fargate (serverless, zero cost at idle), Anthropic SDK with async client and connection pool, prompt caching enabled for static prefixes (reduces cost ~60%), Redis for response caching on identical inputs, structured logging to CloudWatch/Datadog. For burst handling, add a queue (SQS) in front of workers so sudden spikes don’t cause timeouts. At this scale, cost optimization (prompt caching, right model tier) matters more than infrastructure complexity.

Follow-ups: How does the answer change at 1M req/day? How do you handle cold starts in a serverless deployment? When would you switch from Cloud Run to a persistent ECS/EKS deployment?


Q33: What would you put in structured logs for every LLM call?

Key points: Request (model, temp, prompt hash, user_id, feature), response (output_tokens, input_tokens, cache_tokens, latency, finish_reason), plus error info, trace_id for correlation.

Model answer:

{
  "timestamp": "2025-04-14T10:00:00Z",
  "trace_id": "abc-123",
  "user_id": "u-456",
  "feature": "rag_query",
  "model": "claude-haiku-4-5-20251001",
  "input_tokens": 1250,
  "output_tokens": 180,
  "cache_read_tokens": 900,
  "cost_usd": 0.00032,
  "ttft_ms": 210,
  "total_ms": 890,
  "finish_reason": "end_turn",
  "prompt_hash": "sha256:abc...",
  "error": null
}

Include trace_id for correlating multi-step agent calls. Never log raw prompt/response content in production logs (PII risk) — log the hash. Do log finish_reason (“max_tokens” indicates truncation, “stop_sequence” is normal, “tool_use” means the model wants to call a tool).

Follow-ups: How do you monitor for anomalous token usage patterns? What do you do when finish_reason is “max_tokens”? How do you correlate agent trace logs across multiple LLM calls?


Q34: How do you A/B test a prompt change safely?

Key points: Hash user_id to assign variant deterministically, run both variants simultaneously, collect outcome metrics (user rating, task completion), use statistical significance test before rolling out.

Model answer: Use deterministic assignment: variant = "B" if hash(user_id) % 100 < 10 else "A" — this assigns 10% of users to variant B consistently across sessions. Run for enough traffic to reach statistical significance (use a power calculation upfront). Collect business metrics: thumbs up/down, task completion rate, session length — not just technical metrics. Monitor for regressions: if variant B has higher cost or higher error rate, that’s also data. After reaching significance (typically p < 0.05 with a two-proportion z-test), roll out the winner. Log which variant served each request so you can analyze retroactively.

Follow-ups: How do you handle users who switch devices mid-experiment? How do you run multiple prompt experiments simultaneously without interaction effects? What is a “novelty effect” in A/B testing and how do you account for it?


Architecture & Design (6 questions)

Q35: Design a customer support bot. What components does it need?

Key points: Intent classification, RAG over knowledge base, escalation path, conversation memory, guardrails, human handoff, logging.

Model answer: Core components: (1) Intent classifier — route to FAQ/RAG, account lookup, or human escalation based on query type; (2) RAG system over product documentation, FAQ, and policy docs — most queries answered here; (3) Account tool — authenticated lookup of order status, billing (requires user verification first); (4) Conversation memory — last 10 turns + a user profile summary (name, account tier, recent issues); (5) Guardrails — don’t discuss competitors, don’t make promises outside policy, don’t discuss sensitive topics; (6) Escalation path — if confidence is low or user expresses frustration 2+ times, offer human agent; (7) Logging — full transcript per conversation for QA sampling and model improvement.

Follow-ups: How do you handle a user who is trying to extract sensitive information about other users? How do you measure customer support bot quality? How do you handle 10 languages?


Q36: How would you add AI features to an existing app without breaking it?

Key points: Feature flags, async sidecars, graceful degradation, shadow mode (log but don’t serve), incremental rollout, evals before launch.

Model answer: (1) Feature flags — wrap every AI feature so it can be disabled instantly without a deploy; (2) Async/non-blocking — don’t put LLM calls in the critical path if the feature is “nice to have”; (3) Graceful degradation — if the AI call fails or times out, fall back to the non-AI behavior silently; (4) Shadow mode — call the model and log its output but don’t show it to users yet; validate quality offline; (5) Incremental rollout — 1% → 10% → 50% → 100%, with monitoring at each step; (6) Build evals before launch — you need a baseline to detect if something goes wrong post-launch.

Follow-ups: How do you handle increased latency from LLM calls in a web app? What’s your rollback plan if the AI feature degrades user experience? How do you A/B test an AI feature vs. the non-AI baseline?


Q37: When would you use LangGraph vs building your own agent loop?

Key points: LangGraph for complex multi-step workflows needing persistence/checkpoints/human-in-the-loop; bare SDK for simple agents, when debugging matters, when LangGraph’s abstraction gets in the way.

Model answer: Build your own first. A while-loop with tool calls is 50 lines of Python and fully transparent. Use LangGraph when you need: (1) Persistence — the ability to pause an agent mid-execution and resume it later (checkpointers); (2) Human-in-the-loop — interrupt execution at specific nodes for approval; (3) Complex conditional flows — multiple branches based on agent state; (4) Cycles — the graph needs to loop back, not just proceed linearly. LangGraph’s overhead is real: harder to debug, tied to LangChain ecosystem, steeper learning curve. For production agents at non-trivial scale, LangGraph’s persistence and monitoring integrations are worth it. For prototypes and simple agents, bare SDK is better.

Follow-ups: What does a LangGraph checkpointer enable that a simple while-loop doesn’t? How do you debug a LangGraph agent that’s stuck? What is the performance overhead of LangGraph vs bare SDK?


Q38: How do you design a multi-agent system resilient to individual agent failures?

Key points: Timeout per agent, retry with backoff, fallback agents, partial results assembly, circuit breaker for persistently failing agents, orchestrator owns failure handling.

Model answer: The orchestrator must own all failure handling — subagents shouldn’t communicate failures to each other. Design: (1) Timeout — every subagent call has a deadline (e.g., 30s); (2) Retry — transient failures get 2 retries with backoff; (3) Fallback — if a specialist agent fails after retries, fall back to a generalist agent or a cached result; (4) Partial assembly — if 2 of 3 parallel research agents succeed, synthesize from the 2 and note the missing coverage rather than failing entirely; (5) Circuit breaker — if an agent fails 5 times in a row, mark it unavailable and route around it; (6) Idempotency — design subagent tasks so they can be safely retried without side effects.

Follow-ups: How do you handle a subagent that returns a result but the result is wrong? How do you implement circuit breaker in a multi-agent system? How do you monitor subagent failure rates in production?


Q39: What’s your approach to evaluation-driven development for LLM systems?

Key points: Build eval harness before shipping, establish baseline, test every change against baseline, block deploys on metric regression, treat evals as tests in CI.

Model answer: Before writing the first prompt, define how you’ll measure success. Build a golden dataset of 30–50 representative examples with expected outputs. Write an evaluation script that scores new outputs against expectations (LLM-as-judge, exact match, or task-specific metrics). Now you have a baseline. Every change — prompt edit, model upgrade, chunking tweak — runs through this eval. Treat a metric drop > 5% as a failing test that blocks the deploy. Over time, expand the golden dataset with edge cases and failure cases you discover in production. This mirrors TDD: the eval suite is your test suite; the LLM system is your code.

Follow-ups: How do you handle flaky evals due to LLM nondeterminism? How do you build evals for subjective tasks? What is the minimum eval size before you trust the metrics?


Q40: How would you migrate from GPT-4 to Claude without regressions?

Key points: Build evals before migration, run both in parallel (shadow mode), compare on golden dataset, check prompts for model-specific formatting, roll out incrementally.

Model answer: (1) Build a comprehensive eval suite first — this is the safety net. Without evals, you can’t detect regressions. (2) Audit prompts for model-specific assumptions — GPT-4 and Claude respond differently to certain patterns. Claude prefers XML tags for structure; GPT-4 prompts often use markdown differently. (3) Run shadow mode: for 1–2 weeks, call both models on every request, log both outputs, but only serve GPT-4. Compare outputs offline. (4) Fix prompt issues identified in shadow comparison. (5) Run the golden eval on Claude — compare to GPT-4 baseline. Target ≥ 95% of GPT-4 metric scores. (6) Incremental rollout: 5% → 25% → 50% → 100% with monitoring. (7) Keep GPT-4 in the code path for 2 weeks after 100% rollout in case you need to roll back.

Follow-ups: What prompt patterns work well with Claude but not GPT-4? How do you handle different tokenization in cost estimation? What is the biggest risk in a model migration?


Quick Reference: Key Numbers to Know

TopicNumberContext
Avg tokens per English word~1.33(750 words ≈ 1000 tokens)
Cache minimum (Haiku)1,024 tokensSmaller = not cached
Cache minimum (Sonnet/Opus)2,048 tokensSmaller = not cached
Cache TTL~5 minutesEphemeral cache
Cache cost savings~90%On cached tokens
Batch API discount50%vs real-time
Typical chunk size256–512 tokensStarting point for RAG
Chunk overlap10–20%e.g., 50 tokens for 512-chunk
RAGAS metrics4Faithfulness, relevancy, precision, recall
ReAct max steps (typical)10–20Tune per task
Output tokens cost vs input3–5xModel dependent