Chapter 10: AI Engineering Architecture and User Feedback
AI Engineering Architecture (Incremental Build)
Start simple and add components as needs arise:
Step 0 (baseline): query → model API → response (no context, no guardrails)
Step 1: Enhance Context
Add retrieval mechanisms: text/image/tabular retrieval, web search, tool use.
Context construction = feature engineering for foundation models.
Providers differ in what they support and limit: number of documents, retrieval algorithms, chunk sizes, parallel function execution.
Step 2: Put in Guardrails
Input guardrails:
- Leaking private info to external APIs: PII detection and masking/blocking
- Common sensitive data: phone numbers, bank accounts, human faces, proprietary keywords (company IP)
- Masking pattern: replace PII with placeholder → pass to model → unmask in response using reverse dictionary
- Defending against prompt attacks (see Chapter 5)
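The masking pattern above can be sketched as follows. This is a minimal illustration, not a production PII detector: the two regular expressions, the placeholder format, and the function names `mask_pii` and `unmask` are assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def mask_pii(text):
    """Replace each PII match with a placeholder; return masked text and a reverse dictionary."""
    reverse = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"[{label}_{len(reverse) + 1}]"
            reverse[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, reverse

def unmask(text, reverse):
    """Restore the original values in the model's response using the reverse dictionary."""
    for placeholder, original in reverse.items():
        text = text.replace(placeholder, original)
    return text
```

Only the masked text is sent to the external API; the reverse dictionary stays inside the application and is applied to the response before it reaches the user.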
Output guardrails:
- Catch quality failures: malformatted JSON, hallucinations, bad responses
- Catch security failures: toxic content, PII in output, code execution risks, brand damage
- Policies: retry logic (resend the query; a probabilistic model may produce a different result), fallback to human operators (e.g., when negative sentiment is detected or after N turns)
- Trade-off: adding guardrails increases latency; streaming mode makes output guardrails harder
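A minimal sketch of an output guardrail with retry logic, assuming the expected output is a JSON object. The function name `generate_with_retry`, the `call_model` callable, and the shape of the returned dictionary are hypothetical choices for this example.

```python
import json

def generate_with_retry(call_model, prompt, required_keys, max_attempts=3):
    """Call the model until the output is a JSON object with required_keys, or give up."""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)  # call_model is any callable: prompt -> str
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformatted JSON: retry, the next sample may differ
        if isinstance(parsed, dict) and all(key in parsed for key in required_keys):
            return {"ok": True, "attempts": attempt, "data": parsed}
    # All attempts failed: the caller can fall back to a human operator here.
    return {"ok": False, "attempts": max_attempts, "data": None}
```

Each retry adds a full model call to the latency, which is the trade-off noted above.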
Step 3: Add Model Router and Gateway
Router: intent classifier routes queries to appropriate model/solution:
- Cost optimization: route simple queries to cheaper models
- Quality optimization: route specialized queries to specialized models
- Scope enforcement: detect and reject out-of-scope queries; detect ambiguous queries and ask for clarification
- Context adjustment: route to models with appropriate context lengths
Routers are typically smaller models (GPT-2, BERT, Llama 7B or custom classifiers). Should be fast and cheap.
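The routing decisions above might look like the sketch below. The keyword rules stand in for a small intent classifier, and the model names in `ROUTES` are hypothetical placeholders, not real endpoints.

```python
# Hypothetical route table; the target names are placeholders.
ROUTES = {
    "billing": "specialized-billing-model",
    "simple": "small-cheap-model",
    "complex": "large-general-model",
}

def classify_intent(query):
    """Toy stand-in for a small classifier model (e.g., a fine-tuned BERT)."""
    q = query.lower()
    if not q.strip():
        return "ambiguous"
    if any(word in q for word in ("invoice", "refund", "charge")):
        return "billing"
    if any(word in q for word in ("election", "vote")):
        return "out_of_scope"
    return "simple" if len(q.split()) <= 8 else "complex"

def route(query):
    """Map a query to an action: reject, ask for clarification, or call a model."""
    intent = classify_intent(query)
    if intent == "out_of_scope":
        return {"action": "reject", "target": None}
    if intent == "ambiguous":
        return {"action": "clarify", "target": None}
    return {"action": "call_model", "target": ROUTES[intent]}
```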
Gateway: unified interface to different models (OpenAI, Gemini, Anthropic, self-hosted):
- Single point of API key management → prevents key leakage
- Fine-grained access control (which user/app can use which model)
- Rate limiting and usage monitoring
- Fallback policies (API failure → retry or route to backup)
- Load balancing, logging, analytics
- Prompt caching and guardrail integration (some gateways)
Examples: Portkey, MLflow AI Gateway, Kong, Cloudflare.
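To make the gateway responsibilities concrete, here is a small sketch covering access control, fallback, and usage tracking. It does not mirror the API of any gateway product listed above; the `Gateway` class and its provider callables are assumptions for illustration.

```python
class Gateway:
    """Toy gateway: one entry point in front of several model providers."""

    def __init__(self, providers, allowed_models):
        self.providers = providers      # ordered list of (name, callable prompt -> str)
        self.allowed = allowed_models   # dict: user -> set of provider names
        self.usage = {}                 # (user, provider) -> successful request count

    def complete(self, user, prompt):
        errors = []
        for name, call in self.providers:
            if name not in self.allowed.get(user, set()):
                continue  # access control: skip providers this user may not use
            try:
                response = call(prompt)
            except Exception as exc:  # provider failure: fall through to the next one
                errors.append(f"{name}: {exc}")
                continue
            self.usage[(user, name)] = self.usage.get((user, name), 0) + 1
            return {"provider": name, "response": response}
        raise RuntimeError(f"no provider available for {user}: {errors}")
```

In a real deployment the provider callables would hold the API keys, so application code never sees them.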
Step 4: Reduce Latency with Caches
Exact caching: store (query, response) pairs; reuse exact matches.
- Useful for: repeated queries, chain-of-thought (expensive multi-step queries), vector search results
- Eviction policies: LRU, LFU, FIFO
- Cache leakage risk: user-specific context included in response must not be cached for other users
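A minimal sketch of an exact cache with LRU eviction. Scoping the key by `user_id` is one possible way to avoid the leakage risk described above; the class name `ExactCache` and its interface are assumptions for this example.

```python
from collections import OrderedDict

class ExactCache:
    """Exact-match cache with LRU eviction, keyed by (user_id, query)."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, query, user_id=None):
        key = (user_id, query)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, query, response, user_id=None):
        key = (user_id, query)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
```

Responses that contain no user-specific context can be stored with `user_id=None` and shared across users; anything personalized should be stored under the user's own key.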
Semantic caching: reuse cached responses for semantically similar queries.
- Example: “What’s the capital of Vietnam?” and “What’s the capital city of Vietnam?” should return the same cached answer
- Process: generate query embedding → vector search → if similarity > threshold, return cached response
- Risks: requires high-quality embeddings + reliable vector search + correct threshold
- More prone to failure than exact caching; evaluate before adopting
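The lookup process above can be sketched as follows. The bag-of-words `toy_embed` is only a stand-in so the example runs without a model; a real system would use a learned embedding model and a vector index instead of a linear scan. The threshold of 0.85 is an arbitrary assumption.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: word counts. Not a substitute for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.85):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        """Return the cached response of the most similar query if it clears the threshold."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(q, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A threshold set too low returns wrong answers for merely related queries, and one set too high rarely hits, which is why the threshold needs evaluation on real traffic.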
Step 5: Add Agent Patterns
Loop patterns: generated response feeds back into the system if task not complete.
Write actions: compose email, place order, initiate bank transfer — vastly more capable but vastly more risky.
Each added component = more capabilities + more failure modes. Balance capability with complexity.
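As an illustration of the loop pattern and the added risk of write actions, the sketch below runs a plan-and-execute loop with a step limit and an approval gate. The names `run_agent`, `plan_step`, `execute`, `approve`, and the contents of `WRITE_ACTIONS` are hypothetical.

```python
# Hypothetical set of actions that change the outside world.
WRITE_ACTIONS = {"send_email", "place_order", "bank_transfer"}

def run_agent(task, plan_step, execute, approve, max_steps=5):
    """Feed results back into the planner until it finishes or the step limit is hit."""
    history = []
    for _ in range(max_steps):
        step = plan_step(task, history)  # planner sees all earlier steps and results
        if step["action"] == "finish":
            return {"done": True, "answer": step["answer"], "history": history}
        if step["action"] in WRITE_ACTIONS and not approve(step):
            history.append((step, "blocked: approval denied"))
            continue
        history.append((step, execute(step)))
    return {"done": False, "answer": None, "history": history}
```

The step limit bounds cost when the loop never converges, and the approval callback is one place to put a human in front of risky write actions.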
Monitoring and Observability
Monitoring = tracking system information (metrics, logs).
Observability = instrumenting system so internal state can be inferred from outputs — when something breaks, you can find what broke without shipping new code.
Key DevOps metrics:
- MTTD (Mean Time to Detection)
- MTTR (Mean Time to Response)
- CFR (Change Failure Rate) — high CFR → redesign evaluation pipeline
Evaluation and monitoring must work together: a model that scores well in evaluation should also score well on the same metrics in monitoring; issues found in monitoring should feed back into the evaluation pipeline.
What to Monitor
Format failures (easiest): invalid JSON, missing expected keys.
Quality failures: factual consistency, relevance, conciseness, creativity — computed with AI judges.
Safety failures: toxicity, PII in outputs, abnormal prompt patterns (potential attacks).
User behavior signals:
- How often do users stop generation halfway?
- Average turns per conversation
- Average input/output tokens
- Output token distribution over time
Latency metrics: TTFT, TPOT, total latency — track per user, track p90/p95/p99.
Cost metrics: tokens/second, cost per request, cache hit rate.
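A small sketch of how the latency metrics might be computed from raw timestamps. It assumes TPOT is averaged over the tokens after the first one and uses the nearest-rank definition of a percentile; the function names are illustrative.

```python
import math

def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT, and total latency from token arrival timestamps (seconds)."""
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    n = len(token_times)
    # TPOT: average gap between consecutive output tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "total": total}

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Tail percentiles such as p99 surface slow requests that an average would hide.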
Logs and Traces
Logs: append-only record of events; “log everything” — configs, prompts, outputs, tool calls, intermediate steps.
Traces: reconstructed timeline of a complete request through all system components; shows each step’s latency and cost.
Debugging process: metrics → spot anomaly → logs → identify what happened → correlate log errors back to metrics to confirm the right issue was found.
Tools: LangSmith (trace visualization), Datadog, New Relic.
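A trace can be thought of as an ordered list of spans, one per component. The sketch below is a hand-rolled illustration and does not reflect the API of any tool named above; the `Trace` class and its fields are assumptions.

```python
import time
from contextlib import contextmanager

class Trace:
    """Collects one span per pipeline step for a single request."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        start = time.perf_counter()
        record = {"name": name, **attrs}
        try:
            yield record  # the step can attach extra fields, e.g. token counts or cost
        finally:
            record["latency_s"] = time.perf_counter() - start
            self.spans.append(record)
```

Wrapping each step (retrieval, generation, guardrails) in `with trace.span("step_name") as s:` yields a per-request timeline whose entries can be written to the logs.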
Drift Detection
Monitor for:
- System prompt changes: detect when templates or prompts are updated unexpectedly
- User behavior changes: users learn to phrase queries differently over time
- Underlying model changes: API providers may update models silently; different GPT versions show significant benchmark differences
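Two simple checks along these lines are sketched below: fingerprinting the system prompt to detect unexpected edits, and comparing average output length against a baseline as a rough signal of user or model drift. The function names and the 25% tolerance are assumptions for illustration.

```python
import hashlib
from statistics import mean

def fingerprint(prompt_template):
    """Stable hash of a prompt template, recorded at deployment time."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()

def prompt_changed(current_template, deployed_fingerprint):
    """True if the template in use no longer matches what was deployed."""
    return fingerprint(current_template) != deployed_fingerprint

def length_drift(baseline_tokens, recent_tokens, tolerance=0.25):
    """True if mean output length moved more than `tolerance` relative to the baseline."""
    base = mean(baseline_tokens)
    return abs(mean(recent_tokens) - base) / base > tolerance
```

A shift in output length does not say what changed, only that something did; it is a trigger for looking at logs, not a diagnosis.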
AI Pipeline Orchestration
Orchestrator = specifies how components work together; ensures data flows between components in the correct format.
Two functions:
- Components definition: declare models, data sources, tools, evaluation monitors
- Chaining: compose functions (query preprocessing → retrieval → prompt construction → model → evaluation → return/route)
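Chaining can be illustrated without any framework as plain function composition over a shared state dictionary. Every step below is a stub, and `call_model` simply echoes the query; none of this reflects the API of a particular orchestrator.

```python
from functools import reduce

def chain(*steps):
    """Compose steps left to right; each step takes and returns a state dict."""
    return lambda state: reduce(lambda acc, step: step(acc), steps, state)

def preprocess(state):
    return {**state, "query": state["query"].strip()}

def retrieve(state):
    # Stub retrieval: a real step would query a vector store or search API.
    return {**state, "docs": [f"doc about {state['query']}"]}

def build_prompt(state):
    context = "\n".join(state["docs"])
    return {**state, "prompt": f"Context:\n{context}\n\nQuestion: {state['query']}"}

def call_model(state):
    # Stub model call: echoes the query instead of calling a model API.
    return {**state, "response": f"echo: {state['query']}"}

pipeline = chain(preprocess, retrieve, build_prompt, call_model)
```

Writing the chain by hand first makes it easier to judge later what an orchestrator adds and what it hides.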
When to use: later stages of development; orchestrators add abstraction that can hide bugs.
Evaluation criteria:
- Integration: does it support your models, databases, tools?
- Extensibility: can you add unsupported components easily?
- Complex pipeline support: branching, parallel execution, error handling
- Performance: no hidden API calls; minimal latency overhead; scales with traffic
Examples: LangChain, LlamaIndex, Flowise, Langflow, Haystack.
User Feedback
User feedback is proprietary data → competitive advantage (data flywheel). A product that launches quickly and attracts users early can collect data to keep improving its models, making it hard for competitors to catch up.
Two types of feedback:
- Explicit: requested by app (thumbs up/down, star rating, yes/no)
- Implicit: inferred from user actions (purchases, clicks, edits, session duration)
Natural Language Feedback
Extracted from content of conversational messages:
Early termination: user stops generation → conversation likely not going well.
Error correction: “No, I meant…” / rephrasing attempts → model misunderstood intent.
- User edits: original response = losing response; edited = winning response → preference data for RLHF.
Complaints: “You’re wrong”, “Too cliche”, “Too short” → identify specific failure modes.
Sentiment: frustration/disappointment in messages → model not meeting user needs.
Model refusal rate: model says “Sorry, I don’t know” or “As a language model…” → user likely unhappy.
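The user-edit signal above maps directly onto a preference pair. A minimal sketch, assuming a record format with `prompt`, `chosen`, and `rejected` fields (the field names are a common convention, used here as an assumption):

```python
def edit_to_preference(prompt, original, edited):
    """Turn a user edit into a preference example: edited wins, original loses."""
    if original.strip() == edited.strip():
        return None  # no meaningful edit, so no preference signal
    return {"prompt": prompt, "chosen": edited, "rejected": original}
```

Not every edit signals a quality problem; a user may simply be adapting the response to a new purpose, so these pairs are worth filtering before training.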
Other Conversational Feedback
- Regeneration: user generates another response → first may be unsatisfactory (or curiosity)
- Comparison after regeneration: explicit choice between old and new → strong preference signal
- Conversation organization: delete (strong negative), rename (positive), share, bookmark
- Conversation length: long for AI companion = positive; long for customer support = negative (inefficiency)
- Dialogue diversity: long + repetitive → user stuck in loop
Feedback Design
When to collect:
- During onboarding (calibrate preferences; optional to avoid friction)
- When model fails (allow downvote, regenerate, transfer to human)
- When model is uncertain (show two options side-by-side for comparative feedback)
- After remarkable successes (optional; helps identify high-impact features)
How to collect:
- Seamlessly integrate into user workflow (Midjourney: select/vary/regenerate; GitHub Copilot: Tab to accept)
- Make feedback easy to ignore (non-intrusive)
- Never ask users to evaluate what they can’t understand (e.g., don’t ask users to pick the better of two answers to a math question they cannot verify)
- Use context alongside feedback (preceding 5-10 turns) for debugging — requires user consent
- Explain how feedback is used → motivates higher-quality feedback
- Show incentives (personalization) and privacy guarantees (data won’t leave device)
- Private vs. public signals: X reported an increase in likes after making them private (users may be more candid when signals are not public)
Feedback Limitations
Biases:
- Leniency bias: users rate positively to avoid conflict or extra work; “5 stars unless angry” pattern
- Fix: replace numerical scales with semantic descriptions
- Randomness: users click randomly on long side-by-side comparisons they don’t read
- Position bias: users favor first suggestion → randomize positions to mitigate
- Preference biases: length preference (longer = better), recency bias (last answer preferred)
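Position randomization can be sketched as below: the display order is shuffled per comparison, and the user's choice is mapped back to the underlying response label. The function names and the `"a"`/`"b"` labels are assumptions for illustration.

```python
import random

def present_pair(response_a, response_b, rng=random):
    """Return the two responses in random display order, each tagged with its label."""
    options = [("a", response_a), ("b", response_b)]
    rng.shuffle(options)
    return options

def record_choice(options, chosen_position):
    """Map the position the user clicked (0 or 1) back to the response label."""
    return options[chosen_position][0]
```

Logging the display position alongside the choice also makes it possible to measure how strong the position bias is.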
Degenerate feedback loops:
- Model shows popular content → popular content gets clicks → model shows more of it → amplifies initial bias
- “Sycophancy” (Sharma et al., 2023): models trained on human feedback learn to tell users what they want to hear rather than what’s accurate
- Solution: analyze feedback distribution, monitor for drift, don’t incorporate feedback indiscriminately
Key Takeaways
- Build AI architecture incrementally; each step adds capability and failure modes
- Context construction (Step 1) is almost always the first enhancement that matters
- Guardrails (Step 2) protect against both input and output risks; balance security with false refusal rate
- Model gateway provides a unified, secure, observable interface to all models
- Caching (especially exact caching) delivers large latency and cost reductions for repeated queries
- Build observability in from the start; log everything; design metrics around failure modes
- User feedback is proprietary data that feeds the data flywheel; design for quality collection
- Conversational feedback (natural language errors, edits, early termination) is abundant but noisy
- Feedback biases (leniency, position, length) and degenerate feedback loops are real risks; monitor distributions
- AI engineering is moving closer to product engineering — the data flywheel and user experience are the primary competitive advantages