Chapter 10: AI Engineering Architecture and User Feedback

AI Engineering Architecture (Incremental Build)

Start simple and add components as needs arise:

Step 0 (baseline): query → model API → response (no context, no guardrails)

Step 1: Enhance Context
Add retrieval mechanisms: text/image/tabular retrieval, web search, tool use.
Context construction = feature engineering for foundation models.

Model API providers differ in what they support for context construction: document count limits, retrieval algorithms, chunk sizes, and parallel function execution.

Step 2: Put in Guardrails

Input guardrails:

Leaking private info to external APIs: PII detection and masking/blocking
- Common PII: phone numbers, bank accounts, human faces, proprietary keywords
- Masking pattern: replace PII with placeholder → pass to model → unmask in response using reverse dictionary
Defending against prompt attacks (see Chapter 5)
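The masking pattern above (placeholder → model → unmask with a reverse dictionary) can be sketched as follows. This is a minimal illustration, not a production detector: the two regular expressions and the names `PII_PATTERNS`, `mask_pii`, and `unmask` are assumptions made for this example, and real PII detection needs much broader coverage.

```python
import re

# Hypothetical patterns for illustration only; real detectors cover far more cases.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\-\s]{7,}\d"),
}

def mask_pii(text):
    """Replace detected PII with placeholders; return masked text and a reverse dictionary."""
    reverse = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"[{label}_{len(reverse) + 1}]"
            reverse[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, reverse

def unmask(text, reverse):
    """Restore the original values in the model's response."""
    for placeholder, original in reverse.items():
        text = text.replace(placeholder, original)
    return text
```

Only the masked text is sent to the external API; the reverse dictionary stays inside your own system and is applied to the response before it is shown to the user.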

Output guardrails:

Catch quality failures: malformatted JSON, hallucinations, bad responses
Catch security failures: toxic content, PII in output, code execution risks, brand damage
Policies: retry logic (send the query again, since a probabilistic model may produce a different result); fallback to human operators (for example, when negative sentiment is detected or after N turns)
Trade-off: adding guardrails increases latency; streaming mode makes output guardrails harder
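A retry policy for format failures might look like the sketch below. The function name `generate_with_retry`, the `call_model` callable, and the `required_keys` parameter are assumptions for illustration; `call_model` stands in for whatever model client you use.

```python
import json

def generate_with_retry(call_model, query, required_keys, max_attempts=3):
    """Call the model until its output parses as a JSON object with the required keys."""
    for _ in range(max_attempts):
        raw = call_model(query)  # hypothetical callable wrapping a model API
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformatted JSON: send the query again
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            return parsed
    return None  # caller applies a fallback, e.g. transfer to a human operator
```

Each retry adds a full model call to the latency, which is one concrete form of the trade-off noted above.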

Step 3: Add Model Router and Gateway

Router: intent classifier routes queries to appropriate model/solution:

Cost optimization: route simple queries to cheaper models
Quality optimization: route specialized queries to specialized models
Scope enforcement: detect and reject out-of-scope queries; detect ambiguous queries and ask for clarification
Context adjustment: route to models with appropriate context lengths

Routers are typically small models (GPT-2, BERT, Llama 7B, or custom classifiers) because they must be fast and cheap.
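The routing decisions listed above can be sketched as a classifier followed by a lookup. In this example the keyword rules in `classify_intent` are only a stand-in for a small classifier model, and the intent labels and model names are hypothetical.

```python
def classify_intent(query):
    """Stand-in for a small classifier model; keyword rules are illustrative only."""
    q = query.lower()
    if any(word in q for word in ("refund", "invoice", "billing")):
        return "billing"
    if any(word in q for word in ("error", "bug", "traceback")):
        return "technical"
    if len(q.split()) < 3:
        return "ambiguous"
    return "out_of_scope"

# Hypothetical model names: simple queries go to a cheaper model,
# specialized queries go to a specialized model.
ROUTES = {
    "billing": "cheap-general-model",
    "technical": "specialized-code-model",
}

def route(query):
    intent = classify_intent(query)
    if intent in ROUTES:
        return {"action": "call_model", "model": ROUTES[intent]}
    if intent == "ambiguous":
        return {"action": "ask_clarification"}
    return {"action": "reject"}  # scope enforcement
```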

Gateway: unified interface to different models (OpenAI, Gemini, Anthropic, self-hosted):

Single point of API key management → prevents key leakage
Fine-grained access control (which user/app can use which model)
Rate limiting and usage monitoring
Fallback policies (API failure → retry or route to backup)
Load balancing, logging, analytics
Prompt caching and guardrail integration (some gateways)

Examples: Portkey, MLflow AI Gateway, Kong, Cloudflare.
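As a rough sketch of the fallback policy and usage monitoring a gateway provides, the class below tries providers in order and records each outcome. `ModelGateway` and its parameters are invented for this example and do not reflect the API of any of the products named above.

```python
import time

class ModelGateway:
    """Single entry point that holds provider clients and applies a fallback policy."""

    def __init__(self, providers, retries_per_provider=1, backoff_s=0.0):
        # providers: ordered list of (name, callable); earlier entries are preferred.
        self.providers = providers
        self.retries_per_provider = retries_per_provider
        self.backoff_s = backoff_s
        self.usage_log = []  # (provider name, "ok" or "error") for monitoring

    def complete(self, prompt):
        for name, call in self.providers:
            for _ in range(self.retries_per_provider + 1):
                try:
                    response = call(prompt)
                except Exception:
                    self.usage_log.append((name, "error"))
                    time.sleep(self.backoff_s)
                    continue  # retry the same provider, then move to the backup
                self.usage_log.append((name, "ok"))
                return name, response
        raise RuntimeError("all providers failed")
```

Because application code only talks to the gateway, API keys can live in the provider callables rather than in every service that needs a model.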

Step 4: Reduce Latency with Caches

Exact caching: store (query, response) pairs; reuse exact matches.

Useful for: repeated queries, chain-of-thought (expensive multi-step queries), vector search results
Eviction policies: LRU, LFU, FIFO
Cache leakage risk: user-specific context included in response must not be cached for other users
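A minimal exact cache with LRU eviction could look like this. The `ExactCache` class is a sketch written for these notes; scoping the key by `user_id` is one possible way to address the leakage risk above, not the only one.

```python
from collections import OrderedDict

class ExactCache:
    """Exact-match cache with LRU eviction; keys are scoped per user when needed."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, query, user_id=None):
        key = (user_id, query)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, query, response, user_id=None):
        key = (user_id, query)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
```

Responses that contain user-specific context are stored with a `user_id`, so another user asking the same question gets a cache miss rather than someone else's data.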

Semantic caching: reuse cached responses for semantically similar queries.

“What’s the capital of Vietnam?” → “What’s the capital city of Vietnam?” → same answer
Process: generate query embedding → vector search → if similarity > threshold, return cached response
Risks: requires high-quality embeddings + reliable vector search + correct threshold
More prone to failure than exact caching; evaluate before adopting
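The process above (embed → search → compare to a threshold) can be illustrated with a toy example. The bag-of-words `embed` function is a deliberately crude stand-in for an embedding model, the linear scan replaces a vector database, and the default threshold of 0.8 is an arbitrary assumption.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        q = embed(query)
        # Linear scan stands in for a vector search index.
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) > self.threshold:
            return best[1]
        return None
```

With a weak embedding like this one, queries that differ in a single important word can still score high, which is why the threshold and embedding quality need evaluation before adoption.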

Step 5: Add Agent Patterns

Loop patterns: generated response feeds back into the system if task not complete.
Write actions: compose email, place order, initiate bank transfer — vastly more capable but vastly more risky.
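A loop with a gate on write actions might be sketched like this. The `step_fn` and `approve` callables, the action dictionary shape, and the action names in `WRITE_ACTIONS` are all assumptions for illustration.

```python
WRITE_ACTIONS = {"send_email", "place_order", "bank_transfer"}  # hypothetical names

def run_agent(step_fn, task, approve, max_steps=5):
    """Feed each result back into the system until the task is done or steps run out."""
    history = []
    for _ in range(max_steps):
        # step_fn is a stand-in for the model choosing the next action;
        # it returns a dict like {"name": "search", "done": False}.
        action = step_fn(task, history)
        if action["name"] in WRITE_ACTIONS and not approve(action):
            history.append({"name": action["name"], "status": "blocked"})
            continue
        history.append({"name": action["name"], "status": "executed"})
        if action.get("done"):
            break
    return history
```

The `max_steps` cap and the approval callback are two simple ways to limit the extra risk that write actions introduce.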

Each added component = more capabilities + more failure modes. Balance capability with complexity.

Monitoring and Observability

Monitoring = tracking system information (metrics, logs).
Observability = instrumenting system so internal state can be inferred from outputs — when something breaks, you can find what broke without shipping new code.

Key DevOps metrics:

MTTD (Mean Time to Detection)
MTTR (Mean Time to Response)
CFR (Change Failure Rate) — high CFR → redesign evaluation pipeline
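As a worked example, these three metrics can be computed from incident and deployment records. The record fields and the function name are assumptions; MTTR is measured here from detection to resolution, which is one common reading of the term.

```python
from statistics import mean

def devops_metrics(incidents, deployments_total, deployments_failed):
    """incidents: dicts with 'occurred', 'detected', 'resolved' in minutes since a common start."""
    mttd = mean(i["detected"] - i["occurred"] for i in incidents)
    mttr = mean(i["resolved"] - i["detected"] for i in incidents)  # assumed definition
    cfr = deployments_failed / deployments_total
    return {"mttd_min": mttd, "mttr_min": mttr, "cfr": cfr}
```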

Evaluation and monitoring must work together: a system that scores well in evaluation should also score well in monitoring, and issues found in monitoring should feed back into the evaluation pipeline.

What to Monitor

Format failures (easiest): invalid JSON, missing expected keys.

Quality failures: factual consistency, relevance, conciseness, creativity — computed with AI judges.

Safety failures: toxicity, PII in outputs, abnormal prompt patterns (potential attacks).

User behavior signals:

How often do users stop generation halfway?
Average turns per conversation
Average input/output tokens
Output token distribution over time

Latency metrics: TTFT, TPOT, total latency — track per user, track p90/p95/p99.
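The latency metrics can be derived from three timestamps per request. The sketch below assumes TTFT is time to first token and TPOT is the average time per output token after the first; the function names and the nearest-rank percentile method are choices made for this example.

```python
import math

def request_latency(t_request, t_first_token, t_last_token, n_output_tokens):
    """Timestamps in seconds; returns TTFT, TPOT, and total latency."""
    ttft = t_first_token - t_request
    # Average over the tokens generated after the first one.
    tpot = (t_last_token - t_first_token) / max(n_output_tokens - 1, 1)
    return {"ttft": ttft, "tpot": tpot, "total": t_last_token - t_request}

def percentile(values, p):
    """Nearest-rank percentile, e.g. p=95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Tail percentiles matter because an average can look healthy while a small share of users waits much longer.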

Cost metrics: tokens/second, cost per request, cache hit rate.

Logs and Traces

Logs: append-only record of events; “log everything” — configs, prompts, outputs, tool calls, intermediate steps.

Traces: reconstructed timeline of a complete request through all system components; shows each step’s latency and cost.
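A trace can be assembled by timing each component as a span. The `Trace` class below is a minimal sketch with hypothetical names; tracing tools provide much richer versions of the same idea.

```python
import time
from contextlib import contextmanager

class Trace:
    """Collects per-step latency and cost for one request."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = []

    @contextmanager
    def span(self, name, cost_usd=0.0):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the step raised, so failures stay visible.
            self.spans.append({
                "name": name,
                "latency_s": time.perf_counter() - start,
                "cost_usd": cost_usd,
            })

    def summary(self):
        return {
            "request_id": self.request_id,
            "steps": [s["name"] for s in self.spans],
            "total_latency_s": sum(s["latency_s"] for s in self.spans),
            "total_cost_usd": sum(s["cost_usd"] for s in self.spans),
        }
```

Usage is `with trace.span("retrieval"): ...` around each component call, then `trace.summary()` once the request completes.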

Debugging process: metrics reveal an anomaly → logs show what happened → correlate the logged events with the anomaly to locate the root cause → fix.

Tools: LangSmith (trace visualization), Datadog, New Relic.

Drift Detection

Monitor for:

System prompt changes: detect when templates or prompts are updated unexpectedly
User behavior changes: users learn to phrase queries differently over time
Underlying model changes: API providers may update models silently; different GPT versions show significant benchmark differences
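One simple way to catch unexpected prompt or configuration changes is to fingerprint them and compare against a stored value. The function names below are hypothetical. Note that this only detects changes on your side; a provider silently updating a model behind the same identifier would need a fixed evaluation set rerun over time.

```python
import hashlib

def config_fingerprint(system_prompt, model_id):
    """Hash the prompt template and model identifier so any change is detectable."""
    payload = f"{model_id}\n{system_prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def check_for_change(stored_fingerprint, system_prompt, model_id):
    current = config_fingerprint(system_prompt, model_id)
    return {"changed": current != stored_fingerprint, "fingerprint": current}
```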

AI Pipeline Orchestration

Orchestrator = specifies how components work together; ensures data flows between components in the correct format.

Two functions:

Components definition: declare models, data sources, tools, evaluation monitors
Chaining: compose functions (query preprocessing → retrieval → prompt construction → model → evaluation → return/route)
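Chaining can be illustrated without any framework as plain function composition. Every step below is a hypothetical stand-in (keyword matching for retrieval, a formatted string for the model call); the point is only that each component receives and returns data in an agreed format.

```python
from functools import reduce

def chain(*steps):
    """Compose steps left to right; each step takes and returns a state dict."""
    def run(state):
        return reduce(lambda current, step: step(current), steps, state)
    return run

def preprocess(state):
    return {**state, "query": state["query"].strip().lower()}

def retrieve(state):
    # Toy keyword retrieval standing in for a real retriever.
    words = state["query"].split()
    docs = [d for d in state["corpus"] if any(w in d.lower() for w in words)]
    return {**state, "docs": docs}

def build_prompt(state):
    context = "\n".join(state["docs"])
    return {**state, "prompt": f"Context:\n{context}\n\nQuestion: {state['query']}"}

def call_model(state):
    # Placeholder for a model API call.
    return {**state, "response": f"(answer based on {len(state['docs'])} documents)"}

pipeline = chain(preprocess, retrieve, build_prompt, call_model)
```

Starting with explicit composition like this keeps bugs visible; an orchestrator can replace it later once the pipeline shape is stable.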

When to use: later stages of development; orchestrators add abstraction that can hide bugs.

Evaluation criteria:

Integration: does it support your models, databases, tools?
Extensibility: can you add unsupported components easily?
Complex pipeline support: branching, parallel execution, error handling
Performance: no hidden API calls; minimal latency overhead; scales with traffic

Examples: LangChain, LlamaIndex, Flowise, Langflow, Haystack.

User Feedback

User feedback is proprietary data → competitive advantage (data flywheel). A product that attracts users early and collects more feedback data can improve faster, which makes it harder for competitors to catch up.

Two types of feedback:

Explicit: requested by app (thumbs up/down, star rating, yes/no)
Implicit: inferred from user actions (purchases, clicks, edits, session duration)

Natural Language Feedback

Extracted from content of conversational messages:

Early termination: user stops generation → conversation likely not going well.

Error correction: “No, I meant…” / rephrasing attempts → model misunderstood intent.

User edits: original response = losing response; edited = winning response → preference data for RLHF.

Complaints: “You’re wrong”, “Too cliche”, “Too short” → identify specific failure modes.

Sentiment: frustration/disappointment in messages → model not meeting user needs.

Model refusal rate: model says “Sorry, I don’t know” or “As a language model…” → user likely unhappy.
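The user-edit signal above maps directly onto a preference record. This sketch assumes a simple dictionary format with `chosen` and `rejected` fields; the actual schema depends on the training pipeline you use.

```python
def edit_to_preference(prompt, original_response, edited_response):
    """Turn a user edit into a preference pair: edited wins, original loses."""
    if original_response.strip() == edited_response.strip():
        return None  # no meaningful edit, so no preference signal
    return {
        "prompt": prompt,
        "chosen": edited_response,
        "rejected": original_response,
    }
```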

Other Conversational Feedback

Regeneration: user generates another response → first may be unsatisfactory (or curiosity)
Comparison after regeneration: explicit choice between old and new → strong preference signal
Conversation organization: delete (strong negative), rename (positive), share, bookmark
Conversation length: long for AI companion = positive; long for customer support = negative (inefficiency)
Dialogue diversity: long + repetitive → user stuck in loop

Feedback Design

When to collect:

During onboarding (calibrate preferences; optional to avoid friction)
When model fails (allow downvote, regenerate, transfer to human)
When model is uncertain (show two options side-by-side for comparative feedback)
After remarkable successes (optional; helps identify high-impact features)

How to collect:

Seamlessly integrate into user workflow (Midjourney: select/vary/regenerate; GitHub Copilot: Tab to accept)
Make feedback easy to ignore (non-intrusive)
Never ask users to evaluate what they can’t understand (math questions shouldn’t be preference-based)
Use context alongside feedback (preceding 5-10 turns) for debugging — requires user consent
Explain how feedback is used → motivates higher-quality feedback
Show incentives (personalization) and privacy guarantees (data won’t leave device)
Private vs. public signals: X reported that likes increased after it made them private, which suggests users give more candid feedback when it is not publicly visible

Feedback Limitations

Biases:

Leniency bias: users rate positively to avoid conflict or extra work; “5 stars unless angry” pattern
- Fix: replace numerical scales with semantic descriptions
Randomness: users click randomly on long side-by-side comparisons they don’t read
Position bias: users favor first suggestion → randomize positions to mitigate
Preference biases: length preference (longer = better), recency bias (last answer preferred)
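Randomizing positions, as suggested for position bias, requires mapping the user's click back to the underlying response. The function names and record format below are assumptions for illustration.

```python
import random

def present_pair(responses, rng):
    """responses: dict of id -> text. Returns (id, text) pairs in random display order."""
    shown = list(responses.items())
    rng.shuffle(shown)
    return shown

def record_choice(shown, chosen_position):
    """Map the clicked position back to response ids; keep the position for bias analysis."""
    chosen_id = shown[chosen_position][0]
    rejected = [rid for rid, _ in shown if rid != chosen_id]
    return {"chosen": chosen_id, "rejected": rejected, "position": chosen_position}
```

Logging the displayed position alongside the choice lets you check later whether the first slot is still being picked more often than chance would suggest.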

Degenerate feedback loops:

Model shows popular content → popular content gets clicks → model shows more of it → amplifies initial bias
“Sycophancy” (Sharma et al., 2023): models trained on human feedback learn to tell users what they want to hear rather than what’s accurate
Solution: analyze feedback distribution, monitor for drift, don’t incorporate feedback indiscriminately

Key Takeaways

Build AI architecture incrementally; each step adds capability and failure modes
Context construction (Step 1) is almost always the first enhancement that matters
Guardrails (Step 2) protect against both input and output risks; balance security with false refusal rate
Model gateway provides a unified, secure, observable interface to all models
Caching (especially exact caching) delivers large latency and cost reductions for repeated queries
Build observability in from the start; log everything; design metrics around failure modes
User feedback is proprietary data that feeds the data flywheel; design for quality collection
Conversational feedback (natural language errors, edits, early termination) is abundant but noisy
Feedback biases (leniency, position, length) and degenerate feedback loops are real risks; monitor distributions
AI engineering is moving closer to product engineering — the data flywheel and user experience are the primary competitive advantages

Study Notes by Niladri & AI
