Module 12: AI Engineering Interview Prep

This module prepares you for the full spectrum of AI engineering interviews — from LLM fundamentals to production system design. The material targets mid-to-senior software engineering roles where you’ll be expected to build, deploy, and reason deeply about AI systems.


1. How AI Engineering Interviews Work

Types of Interviews

Technical Coding
Usually a 45–60 minute session where you implement something hands-on: a chunking function, a retry wrapper around an LLM API, a simple RAG pipeline, or a tool-calling agent loop. The bar is not “does it work” but “does it work the way someone who understands LLMs would write it.” Expect to handle edge cases like empty responses, malformed JSON from a model, or rate limit errors.

System Design
Typically 45–60 minutes. You’ll be given a vague product prompt (“design a customer support bot”) and expected to drive a structured conversation: ask clarifying questions, define requirements, sketch architecture, walk through components, discuss trade-offs, and call out failure modes. These interviews heavily reward candidates who proactively bring up cost, latency, and evaluation — most candidates forget all three.

ML/LLM Fundamentals
Conceptual questions testing whether you actually understand how these systems work, not just how to call an API. Expect questions on transformer architecture, tokenization, fine-tuning vs RAG, context windows, and inference optimization. At senior levels, expect questions on training dynamics, RLHF, and evaluation methodology.

LLM-Specific / Applied
A newer category that blends coding and conceptual. Examples: “Show me how you’d write a prompt that reliably extracts structured JSON from a user message,” or “Walk me through how you’d debug a RAG pipeline that’s returning irrelevant results.” These reward direct hands-on experience.

Behavioral
Standard STAR-format questions but AI-specific: failures you’ve had shipping AI systems, how you handle non-deterministic test failures, how you’ve measured model quality. See Section 5 for model answers.


What Interviewers Are Actually Evaluating

Depth of understanding over surface familiarity
Can you explain why RAG works, not just that it works? Can you explain what happens at inference time when the KV cache is warm vs cold? Interviewers probe one level deeper than your initial answer to see if you have genuine understanding.

Trade-off reasoning
Every design decision has a cost. Interviewers want to hear: “I’d use semantic chunking over fixed-size chunking because X, but the trade-off is Y.” Candidates who only describe the upside of every choice signal inexperience.

Practical experience
“Have you actually shipped this?” shows up as specificity. Candidates with real experience say things like “in my experience, reranking adds about 200–400ms of latency but can cut irrelevant retrieval by 30%.” Candidates without it give generic answers that could come from a blog post.

Failure awareness
Strong candidates proactively call out what can go wrong: prompt injection, context length overflow, model hallucination on edge cases, retrieval returning stale content. If you only describe the happy path, that is a signal you haven’t operated these systems in production.

Communication and structure
AI systems are genuinely complex. Interviewers test whether you can explain them to a non-expert (product manager, eng manager) and to an expert (another senior engineer). Practice both registers.


Red Flags Interviewers Look For

  • Buzzword soup without depth: “I’d use RAG with semantic search and a vector database and chain-of-thought and an agent loop.” Cool — now explain what each of those does at the component level.
  • “Just use LangChain/LlamaIndex for that”: Framework names are not answers. If you can’t explain what LangChain is doing under the hood, you don’t understand the problem. Say “I’d use LangChain’s retriever abstraction because it handles the embedding → similarity search → result formatting pipeline, and here’s what that pipeline actually does…”
  • Not knowing failure modes: If an interviewer asks “what could go wrong with this design?” and you can only name one thing, that’s a flag.
  • Treating LLMs as magic boxes: “The model figures it out” is not an acceptable answer for any detailed question about LLM behavior.
  • Ignoring cost and latency entirely: Production systems have budgets. If you never mention cost or latency in a system design, you’re designing in a vacuum.
  • Vague evaluation answers: “I’d evaluate it by seeing if the outputs look good” is a failing answer. Name specific metrics, evaluation datasets, and human review strategies.

2. Top 10 Concepts You Must Know Cold

1. Transformer Attention Mechanism

The transformer’s attention mechanism allows every token in a sequence to attend to every other token by computing a weighted combination of value vectors, where the weights are determined by the similarity between query and key vectors. The core formula is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Multi-head attention runs this in parallel across multiple learned subspaces, allowing the model to attend to different aspects of the input simultaneously. Self-attention is what gives transformers their ability to capture long-range dependencies that RNNs struggle with. The sqrt(d_k) scaling prevents the dot products from growing large and causing vanishing gradients in the softmax.

2. Retrieval-Augmented Generation (RAG)

RAG is an architecture that grounds LLM responses in retrieved external knowledge by: embedding the user query, finding semantically similar document chunks in a vector store, inserting those chunks into the prompt as context, and generating a response conditioned on that context. It solves the core problem that LLMs have static knowledge frozen at training time and hallucinate when asked about facts they weren’t trained on. The retrieval step is a separate IR (information retrieval) problem that must be evaluated independently from the generation step. Production RAG systems must handle chunking strategy, embedding freshness, retrieval quality, and context window limits.

3. Agents and Tool Use

An LLM agent is a system where the model acts as a reasoning engine that iteratively decides which tools to call, observes the results, and continues reasoning until it can answer the original task. The model doesn’t execute tools — it outputs a structured description of which tool to call with which arguments, and the surrounding code executes that and feeds the result back. This loop continues until the model produces a final answer. Key design concerns: tool descriptions must be precise (the model reads them as instructions), error handling must prevent infinite loops, and latency multiplies with each tool call round-trip.

4. Prompt Engineering and Prompt Caching

A prompt is not just the user’s question — it’s a carefully engineered combination of a system prompt (role, rules, output format), few-shot examples, retrieved context, and the user turn. Prompt engineering is the practice of iterating on this combination to produce reliable, high-quality outputs. Prompt caching (available in Anthropic’s API) allows the prefix of a prompt (typically the system prompt and static context) to be cached at the KV level so that repeated calls with the same prefix don’t recompute the expensive attention over that prefix — reducing cost by up to 90% and latency by up to 85% for the cached portion.

5. Fine-tuning vs RAG

Fine-tuning adapts model weights via gradient descent on a task-specific dataset — it’s the right choice when you need the model to change its style, format, or behavior (e.g., always respond in a specific persona, follow a specific JSON schema reliably, or reason in a domain-specific way). RAG is the right choice when you need the model to have access to specific, updateable, or private knowledge. Fine-tuning is expensive, requires data collection, and the knowledge it encodes is frozen at training time — you can’t update it without retraining. RAG is cheaper to update (just re-embed new documents) but adds retrieval latency and retrieval can fail. In practice, most production systems should start with RAG.

6. Evaluation

Evaluation is the hardest unsolved problem in LLM engineering. You need: (1) a golden dataset of inputs with expected outputs or expected attributes, (2) metrics that actually correlate with what you care about (relevance, faithfulness, groundedness, factuality), (3) a way to run evals continuously so you catch regressions. For RAG: evaluate retrieval (recall@k, MRR) separately from generation (faithfulness, answer relevance). For agents: evaluate final task completion rate, not just individual tool call accuracy. LLM-as-judge is a common pattern but it’s noisy — always calibrate judge agreement against human labels.

7. Context Windows and “Lost in the Middle”

The context window is the maximum number of tokens a model can process in a single forward pass — this includes system prompt, conversation history, retrieved context, and the model’s response. Long context doesn’t mean perfect recall — models tend to have stronger recall for information at the beginning and end of the context than in the middle (the “lost in the middle” phenomenon). For production systems, this means: put the most important instructions at the start of the system prompt, put the most relevant retrieved chunks first in context, and don’t assume a 200K context window eliminates the need for good retrieval.

8. Streaming

Streaming returns tokens to the client as they are generated rather than waiting for the full response. This dramatically improves perceived latency — TTFT (time to first token) is typically under a second even for responses that take 10–20 seconds to complete. Implementation requires SSE (server-sent events) or chunked HTTP responses. Applications must handle stream interruption, partial responses, and the fact that structured data (like JSON) isn’t parseable until the stream closes. Always stream in production user-facing applications. The Anthropic SDK provides a streaming API that handles connection management.

9. Multi-Agent Systems

Multi-agent systems distribute work across multiple specialized LLM instances that communicate via structured messages. Common patterns: orchestrator/subagent (one model plans and delegates), pipeline (each agent transforms the output of the previous), and parallel fan-out (multiple agents work simultaneously on different subtasks). Key design challenges: shared state management, failure propagation (one agent’s bad output corrupts downstream agents), cost multiplication (N agents × M turns = N×M LLM calls), and evaluation (harder because failures can be latent). The primary reason to use multi-agent over single-agent is specialization, parallelism, or context window limits.

10. KV Cache and Inference Latency

During transformer inference, the key-value pairs for all previous tokens are cached in GPU memory so that each new token only needs to compute attention over its own query against the already-cached keys/values from the context. Without the KV cache, generation would require O(n^2) attention for a sequence of n tokens — it would become exponentially slower as responses grew. With the cache, each new token generation is O(n). The practical implication: the first token is slow (prefill phase, the full prompt is processed), subsequent tokens are fast (decode phase, one token at a time with cached KV). Latency optimization focuses on reducing prefill time (prompt caching, smaller prompts) and increasing decode throughput (batching, quantization).


3. Framework for Answering System Design Questions

Use this structure for every system design question. Don’t skip steps even under time pressure — tell the interviewer which step you’re on.

Step 1: Clarify Requirements (2–3 minutes)

Always ask before designing. The prompt is always underspecified. Questions to ask:

  • What are the input/output types? (text, documents, structured data, multimodal?)
  • What’s the scale? (users, requests per day, document count)
  • What are the latency requirements? (interactive chat vs. async batch)
  • What’s the accuracy/quality bar? (95% accuracy acceptable? 99%? What happens on failure?)
  • What’s the cost budget? (startup with tight margins vs. enterprise with cost flexibility)
  • Is this greenfield or integrating into an existing system?
  • What data exists for training/evaluation?

State your assumptions explicitly: “I’ll assume this is a B2B SaaS product with 500 internal users, ~1000 queries/day, sub-5-second latency requirement, and an existing document management system.”

Step 2: High-Level Design (5 minutes)

Sketch the system as boxes and arrows. Name the major components without detailing them yet. The goal is to align with the interviewer on scope before diving deep.

Step 3: Component Deep Dive (15–20 minutes)

Walk through each component in detail. For each component explain:

  • What it does
  • What technology/approach you’d use and why
  • What the interface looks like (input/output)
  • What could go wrong

Step 4: Trade-offs (5 minutes)

For every significant design decision, state the alternative and explain why you chose what you chose. “I’m using semantic chunking over fixed-size chunking because it preserves logical units, but the trade-off is slower ingestion and more complex implementation.”

Step 5: Failure Modes (5 minutes)

Proactively cover: what fails, how it fails, and how you’d detect and recover. At minimum cover:

  • Data quality failures
  • Model failures (hallucination, refusals, degradation)
  • Infrastructure failures (vector DB downtime, LLM API outage)
  • Security failures (prompt injection, data leakage)

Step 6: Scaling (3–5 minutes)

How does the system change as you go from 1K → 10K → 100K → 1M queries/day? What breaks first? What would you add?

Always mention:

  • Cost: rough estimate per query, monthly total at stated scale
  • Latency: P50 and P99, where bottlenecks are
  • Evaluation: how you’d measure whether the system is working
  • Monitoring: what you’d alert on in production

4. Common Interview Mistakes

Using Frameworks as a Crutch

Saying “I’d use LangChain for the retrieval pipeline” without explaining what LangChain is doing is the most common mistake. Interviewers hear this constantly and it signals you’ve read docs but haven’t built anything. Always explain the underlying mechanism: “I’d use LangChain’s retrieval chain — which under the hood embeds the query, does an ANN search against the vector store, retrieves the top-k chunks, formats them into a prompt template, and calls the LLM. I’m choosing LangChain here because the abstraction saves implementation time, but I’d watch out for its default chunking strategy which may not be optimal for our document types.”

Ignoring Costs

LLM APIs are not free. At Claude 3.5 Sonnet pricing, 1 million input tokens costs 6,000/month. Prompt caching can cut this to under $1,000/month. Interviewers notice when candidates never mention cost. Always include a rough cost estimate and strategies to reduce it.

Not Knowing Failure Modes

“The system will retrieve relevant documents and generate a good answer” is not a complete design. What happens when:

  • The query is out of domain and no relevant documents exist?
  • The LLM hallucinates a fact not in the retrieved context?
  • The vector DB returns the same document 5 times because similar documents are near-duplicates?
  • A user crafts a prompt injection attack in the document content?
    Always enumerate failure modes and their mitigations.

Vague Evaluation Answers

“I’d evaluate it manually and see if the outputs look good” fails. A strong answer names: the dataset (golden Q&A pairs curated by domain experts), the metrics (retrieval recall@5, faithfulness score against source documents, answer relevance via LLM-as-judge calibrated against human labels), the evaluation frequency (every PR that changes the prompt or retrieval logic), and the threshold for shipping (>85% faithfulness on the golden set, no regression on the previous version).

Forgetting About Security

For any customer-facing LLM system, prompt injection is a real attack vector. A malicious user can embed instructions in a document or input that override your system prompt. Mitigations include: input sanitization, using structured output formats (harder to hijack than free text), keeping tool permissions minimal (principle of least privilege), and output filtering. If you design a customer support bot without mentioning prompt injection, that’s a flag.

Over-Engineering the First Version

Candidates sometimes design a 15-component system when the interviewer wanted a pragmatic MVP. Read the room. If they ask “how would you get this working in 4 weeks,” your answer should be simple and shipping-focused. If they ask “design this for scale,” go deep. Ask clarifying questions about timeline and constraints before designing.


5. Behavioral Questions for AI Roles

”Tell me about an AI system you built and what went wrong.”

What they’re testing: Real production experience, intellectual honesty, ability to learn from failure, failure mode awareness.

Model answer structure:

  1. Brief description of the system and its goal
  2. What specifically went wrong (be specific, not vague)
  3. How you detected the problem
  4. What you did to fix it
  5. What you’d do differently

Example: “I built a RAG-based internal knowledge base Q&A system. The first version looked great in development — answers were accurate on the test queries I’d written. In production, we found that about 20% of queries returned irrelevant results because users were asking questions in natural language that didn’t overlap well with how the documents were written. The retrieval was failing silently — the LLM would get irrelevant chunks and either hallucinate a confident answer or give a vague non-answer. I detected it through user feedback, then added logging of the top-3 retrieved chunks for every query so I could audit what the model was actually seeing. I fixed it by adding HyDE (hypothetical document embedding) for retrieval, adding a reranker, and building a small golden evaluation set I could run on every prompt change. I’d do it differently by building the evaluation pipeline on day one, not week six."


"How do you decide when to use LLMs vs traditional ML?”

What they’re testing: Judgment, cost-consciousness, avoiding LLM-for-everything thinking, understanding of when LLMs add genuine value.

Model answer:
“I use a simple decision tree. First, is the task fundamentally about language understanding, generation, or reasoning over unstructured text? If yes, LLMs are worth considering. If the task is classification, regression, or prediction on structured/tabular data, traditional ML is almost always better (cheaper, faster, more interpretable, easier to evaluate). Second, do I need generalization to unseen inputs with no training data? LLMs are strong here. Third, what are the latency and cost constraints? An LLM call is 100–500ms and costs fractions of a cent — fine for interactive use, but at 10M calls/day the cost adds up and a distilled model or traditional classifier is worth engineering. Fourth, what are the failure mode consequences? For high-stakes decisions (fraud, medical), I want interpretable models I can audit, not LLMs with opaque reasoning. In practice, I often use LLMs to prototype a solution fast, measure quality, and then decide whether to optimize with traditional ML or keep the LLM for production."


"How do you stay current with AI developments?”

What they’re testing: Genuine intellectual engagement with the field, signal vs. noise filtering, breadth and depth balance.

Model answer:
“I have a tiered system. For primary sources, I follow Anthropic’s documentation updates and model release notes directly — release notes often contain specific behavioral changes that affect production systems. I skim arXiv daily using a filter on cs.AI, cs.LG, and cs.CL — I don’t read every paper, but I read abstracts and deep-dive on maybe 2–3 papers per week that are directly relevant to what I’m building. For applied engineering, I follow the Anthropic cookbook on GitHub because it has concrete implementation patterns. I run LLM Arena occasionally to calibrate my intuitions about relative model capabilities. I’m skeptical of most AI Twitter and most newsletters — the signal-to-noise ratio is low and there’s enormous hype amplification. I prioritize papers with code over pure theory, and I try to implement something from every paper I read seriously — even a toy version — because the act of implementation surfaces nuances that reading doesn’t."


"Describe a time you disagreed with your team about an AI approach.”

What they’re testing: Communication, technical conviction backed by evidence, ability to build consensus, knowing when to defer.

Model answer structure:

  1. What the disagreement was specifically (be precise)
  2. What your position was and why (evidence-based)
  3. How you handled the disagreement (ran an experiment, presented data, escalated appropriately)
  4. What the outcome was
  5. What you learned

Example: “My team wanted to fine-tune a model to improve our customer support classification accuracy. My position was that fine-tuning was premature — we hadn’t invested in prompt engineering or few-shot examples yet, and I suspected we could close the gap without the expense and maintenance burden of fine-tuning. Rather than argue, I proposed we spend one sprint running a structured prompt engineering experiment with a test set of 500 classified examples. We went from 78% to 91% accuracy with prompt changes alone, which was close enough to the fine-tuning target (94%) that the team agreed the ROI on fine-tuning wasn’t there yet. What I learned: technical disagreements are best resolved with a time-boxed experiment rather than a debate.”