Chapter 10: AI Engineering Architecture and User Feedback

AI Engineering Architecture (Incremental Build)

Start simple and add components as needs arise:

Step 0 (baseline): query → model API → response (no context, no guardrails)

Step 1: Enhance Context
Add retrieval mechanisms: text/image/tabular retrieval, web search, tool use.
Context construction = feature engineering for foundation models.

Model API providers differ in what they support for context construction: limits on the number of documents, available retrieval algorithms, chunk sizes, and whether functions can be executed in parallel.

Step 2: Put in Guardrails

Input guardrails:

  • Leaking private info to external APIs: PII detection and masking/blocking
    • Common PII: phone numbers, bank accounts, human faces, proprietary keywords
    • Masking pattern: replace PII with placeholder → pass to model → unmask in response using reverse dictionary
  • Defending against prompt attacks (see Chapter 5)
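
The masking pattern above can be sketched as follows. This is a minimal illustration, not a production detector: the regex `PHONE_RE` only covers one US-style phone format, and the `[PHONE_n]` placeholder scheme and function names are assumptions made for the example.

```python
import re

# Assumption: a single US-style phone pattern; real systems need broader PII detectors.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def mask_pii(text):
    """Replace each phone number with a placeholder; return masked text and reverse dictionary."""
    reverse = {}

    def _sub(match):
        placeholder = f"[PHONE_{len(reverse) + 1}]"
        reverse[placeholder] = match.group(0)
        return placeholder

    return PHONE_RE.sub(_sub, text), reverse

def unmask(text, reverse):
    """Put the original values back into a model response."""
    for placeholder, original in reverse.items():
        text = text.replace(placeholder, original)
    return text

masked, reverse = mask_pii("Call me at 555-123-4567 tomorrow.")
model_response = "Sure, I will call [PHONE_1]."  # stand-in for the external model's reply
restored = unmask(model_response, reverse)
```

Only `masked` would be sent to the external API; the reverse dictionary stays inside your own system.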

Output guardrails:

  • Catch quality failures: malformatted JSON, hallucinations, bad responses
  • Catch security failures: toxic content, PII in output, code execution risks, brand damage
  • Policies: retry logic (sends query again — probabilistic model may produce different result), fallback to human operators (on sentiment detection, after N turns)
  • Trade-off: adding guardrails increases latency; streaming mode makes output guardrails harder
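
A retry policy for format failures might look like the sketch below. The function name, the `call_model` callable, and the default of three attempts are all assumptions for illustration; `call_model` stands in for whatever client you use.

```python
import json

def generate_with_retry(call_model, query, required_keys, max_attempts=3):
    """Retry until the output parses as a JSON object with the required keys, else return None."""
    for _ in range(max_attempts):
        raw = call_model(query)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again, the next sample may differ
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            return parsed
    return None  # caller applies a fallback, e.g. hand off to a human operator

# Stand-in model that returns truncated JSON once, then a valid object.
_outputs = iter(['{"answer": ', '{"answer": "Hanoi"}'])
result = generate_with_retry(lambda q: next(_outputs), "capital of Vietnam?", {"answer"})
```

Each retry adds a full model call to the request's latency, which is one concrete form of the trade-off noted above.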

Step 3: Add Model Router and Gateway

Router: intent classifier routes queries to appropriate model/solution:

  • Cost optimization: route simple queries to cheaper models
  • Quality optimization: route specialized queries to specialized models
  • Scope enforcement: detect and reject out-of-scope queries; detect ambiguous queries and ask for clarification
  • Context adjustment: route to models with appropriate context lengths

Routers are typically smaller models (GPT-2, BERT, Llama 7B, or custom classifiers). They run on every query, so they should be fast and cheap.
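
As a rough sketch of the routing idea, the example below uses a keyword matcher in place of a trained intent classifier. The intents, keywords, and model names are hypothetical.

```python
# Hypothetical model names; None means "reject without calling any model".
ROUTES = {
    "billing": "specialized-billing-model",
    "general": "cheap-small-model",
    "out_of_scope": None,
}

def classify_intent(query):
    """Keyword matcher standing in for a small classifier model."""
    q = query.lower()
    if any(word in q for word in ("invoice", "refund", "charge")):
        return "billing"
    if any(word in q for word in ("election", "vote")):
        return "out_of_scope"
    return "general"

def route(query):
    intent = classify_intent(query)
    model = ROUTES[intent]
    reply = None if model else "Sorry, I can't help with that topic."
    return {"intent": intent, "model": model, "reply": reply}
```

In practice `classify_intent` would be replaced by one of the small models listed above; the routing table around it can stay the same.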

Gateway: unified interface to different models (OpenAI, Gemini, Anthropic, self-hosted):

  • Single point of API key management → prevents key leakage
  • Fine-grained access control (which user/app can use which model)
  • Rate limiting and usage monitoring
  • Fallback policies (API failure → retry or route to backup)
  • Load balancing, logging, analytics
  • Prompt caching and guardrail integration (some gateways)

Examples: Portkey, MLflow AI Gateway, Kong, Cloudflare.
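
The fallback and logging behavior of a gateway can be illustrated with a toy class. This is not the API of any product listed above; `Gateway`, `complete`, and the provider callables are made up for the sketch.

```python
class Gateway:
    """Toy gateway: one interface over several providers, tried in order, with a call log."""

    def __init__(self, providers):
        # providers: list of (name, callable); the callables stand in for real API clients
        self.providers = providers
        self.log = []

    def complete(self, prompt):
        for name, call in self.providers:
            try:
                response = call(prompt)
                self.log.append((name, "ok"))
                return response
            except Exception as exc:
                self.log.append((name, f"error: {exc}"))  # record the failure, try the next provider
        raise RuntimeError("all providers failed")

def flaky(prompt):
    raise TimeoutError("primary timed out")

gateway = Gateway([("primary", flaky), ("backup", lambda p: f"echo: {p}")])
answer = gateway.complete("hello")
```

Because every call passes through `complete`, this is also the natural place to attach key management, rate limiting, and usage accounting.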

Step 4: Reduce Latency with Caches

Exact caching: store (query, response) pairs; reuse exact matches.

  • Useful for: repeated queries, chain-of-thought (expensive multi-step queries), vector search results
  • Eviction policies: LRU, LFU, FIFO
  • Cache leakage risk: user-specific context included in response must not be cached for other users
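
A minimal exact cache with LRU eviction is sketched below. Keying on `(user_id, query)` is one assumed way to avoid cache leakage: user-specific responses are stored under that user's ID, and responses safe to share could use a common key such as `None`.

```python
from collections import OrderedDict

class ExactCache:
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, user_id, query):
        key = (user_id, query)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, user_id, query, response):
        key = (user_id, query)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
```

Swapping the eviction line is enough to experiment with FIFO; LFU would need per-key hit counts.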

Semantic caching: reuse cached responses for semantically similar queries.

  • “What’s the capital of Vietnam?” → “What’s the capital city of Vietnam?” → same answer
  • Process: generate query embedding → vector search → if similarity > threshold, return cached response
  • Risks: requires high-quality embeddings + reliable vector search + correct threshold
  • More prone to failure than exact caching; evaluate before adopting
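
The semantic caching process can be sketched with toy components. The bag-of-words `embed` function and the linear scan are stand-ins for a real embedding model and vector search, and the 0.8 threshold is an arbitrary assumption that would need tuning.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; a real system would call an embedding model.
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs; linear scan stands in for vector search

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) > self.threshold:
            return best[1]
        return None
```

A threshold set too low returns wrong answers for merely related queries, and one set too high rarely hits; this is why evaluating on real traffic comes before adoption.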

Step 5: Add Agent Patterns

Loop patterns: generated response feeds back into the system if task not complete.
Write actions: compose email, place order, initiate bank transfer — vastly more capable but vastly more risky.

Each added component = more capabilities + more failure modes. Balance capability with complexity.
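
One way to picture the loop pattern with a safeguard on write actions is the sketch below. The action names, the `plan_step`/`execute`/`approve` callables, and the step budget are all hypothetical.

```python
WRITE_ACTIONS = {"send_email", "place_order"}  # hypothetical names for risky write actions

def run_agent(plan_step, execute, approve, max_steps=5):
    """Loop until the planner returns 'done'; write actions run only if approved."""
    history = []
    for _ in range(max_steps):
        action, arg = plan_step(history)  # stand-in for a model call that sees prior results
        if action == "done":
            return arg, history
        if action in WRITE_ACTIONS and not approve(action, arg):
            history.append((action, arg, "blocked"))
            continue
        history.append((action, arg, execute(action, arg)))
    return None, history  # step budget exhausted without finishing
```

The step budget and the approval hook are two simple ways to contain the extra failure modes that loops and write actions introduce.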


Monitoring and Observability

Monitoring = tracking system information (metrics, logs).
Observability = instrumenting the system so that its internal state can be inferred from its outputs: when something breaks, you can find what broke without shipping new code.

Key DevOps metrics:

  • MTTD (Mean Time to Detection)
  • MTTR (Mean Time to Response)
  • CFR (Change Failure Rate) — high CFR → redesign evaluation pipeline

Evaluation and monitoring must work together: a system that scores well on a metric during evaluation should also score well on it in production monitoring, and issues found in monitoring should feed back into the evaluation pipeline.

What to Monitor

Format failures (easiest): invalid JSON, missing expected keys.

Quality failures: factual consistency, relevance, conciseness, creativity — computed with AI judges.

Safety failures: toxicity, PII in outputs, abnormal prompt patterns (potential attacks).

User behavior signals:

  • How often do users stop generation halfway?
  • Average turns per conversation
  • Average input/output tokens
  • Output token distribution over time

Latency metrics: TTFT, TPOT, total latency — track per user, track p90/p95/p99.
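
The latency metrics can be computed from token arrival timestamps, as in the sketch below. It assumes TPOT is averaged over the tokens after the first one and uses a nearest-rank percentile; both are common conventions rather than the only definitions.

```python
import math

def ttft_tpot(request_time, token_times):
    """token_times: timestamps in seconds at which each output token arrived."""
    ttft = token_times[0] - request_time  # time to first token
    n = len(token_times)
    # time per output token, averaged over the tokens after the first
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot

def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

Tail percentiles matter because a mean can look healthy while the slowest few percent of requests are unusable.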

Cost metrics: tokens/second, cost per request, cache hit rate.

Logs and Traces

Logs: append-only record of events; “log everything” — configs, prompts, outputs, tool calls, intermediate steps.

Traces: reconstructed timeline of a complete request through all system components; shows each step’s latency and cost.
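
A trace can be approximated by timing each step of a request and appending the result to a shared list. The `span` helper and the record fields below are assumptions for illustration, not the interface of any tracing tool.

```python
import time
from contextlib import contextmanager

trace = []  # one request's timeline; a real system would key this by request ID

@contextmanager
def span(name):
    """Record how long the wrapped step took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append({"step": name, "latency_s": time.perf_counter() - start})

with span("retrieval"):
    docs = ["doc1"]  # stand-in for a retrieval call
with span("generation"):
    answer = "..."   # stand-in for a model call
```

Per-step cost (for example, tokens consumed) could be added to each record in the same way.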

Debugging process: metrics reveal an anomaly → logs show what happened → correlate the logged errors with the rest of the system to locate and fix the cause.

Tools: LangSmith (trace visualization), Datadog, New Relic.

Drift Detection

Monitor for:

  • System prompt changes: detect when templates or prompts are updated unexpectedly
  • User behavior changes: users learn to phrase queries differently over time
  • Underlying model changes: API providers may update models silently; different GPT versions show significant benchmark differences
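
Two simple drift checks are sketched below: a fingerprint to detect unexpected prompt template changes, and a relative shift in the mean of a metric between a baseline window and a recent window. The function names are hypothetical, and what counts as a significant shift is left to you.

```python
import hashlib

def fingerprint(prompt_template):
    """Stable hash of a prompt template, stored at deploy time."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()

def prompt_changed(expected_fingerprint, current_template):
    return fingerprint(current_template) != expected_fingerprint

def mean_shift(baseline, recent):
    """Relative change in the mean of a metric, e.g. output token counts per response."""
    base = sum(baseline) / len(baseline)
    return (sum(recent) / len(recent) - base) / base
```

A sudden shift in output length with no prompt change on your side is one hint that user behavior or the underlying model has changed.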

AI Pipeline Orchestration

Orchestrator = specifies how components work together; ensures data flows between components in the correct format.

Two functions:

  1. Components definition: declare models, data sources, tools, evaluation monitors
  2. Chaining: compose functions (query preprocessing → retrieval → prompt construction → model → evaluation → return/route)
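
Chaining can be as simple as function composition over a shared state dictionary, as in this sketch. The component functions are hypothetical stand-ins; orchestration frameworks offer richer versions of the same idea.

```python
def chain(*steps):
    """Compose steps left to right; each step takes and returns a dict of pipeline state."""
    def run(state):
        for step in steps:
            state = step(state)
        return state
    return run

# Hypothetical stand-in components.
def preprocess(s):
    return {**s, "query": s["query"].strip()}

def retrieve(s):
    return {**s, "docs": ["Hanoi is the capital of Vietnam."]}

def build_prompt(s):
    return {**s, "prompt": "\n".join(s["docs"]) + "\nQ: " + s["query"]}

def call_model(s):
    return {**s, "response": "Hanoi"}  # replace with a real model call

pipeline = chain(preprocess, retrieve, build_prompt, call_model)
out = pipeline({"query": "  What is the capital of Vietnam? "})
```

Starting with plain composition like this keeps every step visible, which makes bugs easier to find before an orchestrator's abstractions are added.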

When to use: later stages of development; orchestrators add abstraction that can hide bugs.

Evaluation criteria:

  • Integration: does it support your models, databases, tools?
  • Extensibility: can you add unsupported components easily?
  • Complex pipeline support: branching, parallel execution, error handling
  • Performance: no hidden API calls; minimal latency overhead; scales with traffic

Examples: LangChain, LlamaIndex, Flowise, Langflow, Haystack.


User Feedback

User feedback is proprietary data → competitive advantage (data flywheel). A product that launches early and attracts users can collect data to keep improving its models, which makes it harder for competitors to catch up.

Two types of feedback:

  • Explicit: requested by app (thumbs up/down, star rating, yes/no)
  • Implicit: inferred from user actions (purchases, clicks, edits, session duration)

Natural Language Feedback

Extracted from content of conversational messages:

Early termination: user stops generation → conversation likely not going well.

Error correction: “No, I meant…” / rephrasing attempts → model misunderstood intent.

  • User edits: original response = losing response; edited = winning response → preference data for RLHF.
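
Turning a user edit into a preference record could look like the sketch below. The field names `chosen` and `rejected` are an assumed schema, not a required format.

```python
def preference_pair(prompt, model_response, user_edit):
    """Return a preference record, or None if the user left the response unchanged."""
    if user_edit.strip() == model_response.strip():
        return None  # no edit, so no preference signal
    return {"prompt": prompt, "chosen": user_edit, "rejected": model_response}
```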

Complaints: “You’re wrong”, “Too cliche”, “Too short” → identify specific failure modes.

Sentiment: frustration/disappointment in messages → model not meeting user needs.

Model refusal rate: model says “Sorry, I don’t know” or “As a language model…” → user likely unhappy.

Other Conversational Feedback

  • Regeneration: user generates another response → first may be unsatisfactory (or curiosity)
  • Comparison after regeneration: explicit choice between old and new → strong preference signal
  • Conversation organization: delete (strong negative), rename (positive), share, bookmark
  • Conversation length: long for AI companion = positive; long for customer support = negative (inefficiency)
  • Dialogue diversity: long + repetitive → user stuck in loop

Feedback Design

When to collect:

  • During onboarding (calibrate preferences; optional to avoid friction)
  • When model fails (allow downvote, regenerate, transfer to human)
  • When model is uncertain (show two options side-by-side for comparative feedback)
  • After remarkable successes (optional; helps identify high-impact features)

How to collect:

  • Seamlessly integrate into user workflow (Midjourney: select/vary/regenerate; GitHub Copilot: Tab to accept)
  • Make feedback easy to ignore (non-intrusive)
  • Never ask users to evaluate what they can’t understand (e.g., don’t ask users to pick the better of two answers to a math problem they can’t verify)
  • Use context alongside feedback (preceding 5-10 turns) for debugging — requires user consent
  • Explain how feedback is used → motivates higher-quality feedback
  • Show incentives (personalization) and privacy guarantees (data won’t leave device)
  • Private vs. public signals: X reportedly saw likes increase after making them private, which suggests users are more candid when their signals are not public

Feedback Limitations

Biases:

  • Leniency bias: users rate positively to avoid conflict or extra work; “5 stars unless angry” pattern
    • Fix: replace numerical scales with semantic descriptions
  • Randomness: users click randomly on long side-by-side comparisons they don’t read
  • Position bias: users favor first suggestion → randomize positions to mitigate
  • Preference biases: length preference (longer = better), recency bias (last answer preferred)
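
Position randomization for side-by-side comparisons can be sketched as follows. The helper names and the left/right layout are assumptions for the example.

```python
import random

def present_pair(response_a, response_b, rng=random):
    """Randomize display order and remember which side each response landed on."""
    order = [("a", response_a), ("b", response_b)]
    rng.shuffle(order)
    return {"left": order[0], "right": order[1]}

def record_choice(presentation, side):
    """Map the clicked side ('left' or 'right') back to the original response label."""
    return presentation[side][0]
```

Logging the displayed side along with the choice also lets you measure how strong the position bias is in your own product.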

Degenerate feedback loops:

  • Model shows popular content → popular content gets clicks → model shows more of it → amplifies initial bias
  • “Sycophancy” (Sharma et al., 2023): models trained on human feedback learn to tell users what they want to hear rather than what’s accurate
  • Solution: analyze feedback distribution, monitor for drift, don’t incorporate feedback indiscriminately
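
Analyzing the feedback distribution before training on it can start with something as small as the sketch below. The 0.8 cutoff for flagging a lenient rating set is an arbitrary assumption.

```python
from collections import Counter

def rating_distribution(ratings):
    """Share of each rating value, e.g. {1: 0.25, 5: 0.75}."""
    counts = Counter(ratings)
    total = len(ratings)
    return {r: counts[r] / total for r in sorted(counts)}

def looks_lenient(ratings, top=5, max_share=0.8):
    """Flag a rating set where the top score dominates; the cutoff is an assumption."""
    return rating_distribution(ratings).get(top, 0.0) > max_share
```

Comparing this distribution across time windows is one way to monitor for the drift mentioned above.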

Key Takeaways

  • Build AI architecture incrementally; each step adds capability and failure modes
  • Context construction (Step 1) is almost always the first enhancement that matters
  • Guardrails (Step 2) protect against both input and output risks; balance security with false refusal rate
  • Model gateway provides a unified, secure, observable interface to all models
  • Caching (especially exact caching) can significantly reduce latency and cost for repeated queries
  • Build observability in from the start; log everything; design metrics around failure modes
  • User feedback is proprietary data that feeds the data flywheel; design for quality collection
  • Conversational feedback (natural language errors, edits, early termination) is abundant but noisy
  • Feedback biases (leniency, position, length) and degenerate feedback loops are real risks; monitor distributions
  • AI engineering is moving closer to product engineering — the data flywheel and user experience are the primary competitive advantages