System Design: Production Agent System
Interview Prompt: “Design an AI agent that can manage a developer’s GitHub workflow — triage issues, draft PR descriptions, and answer questions about the codebase.”
Step 1: Clarifying Questions
Scope and permissions:
- What actions can the agent take autonomously vs. what requires human approval? (Can it close issues, or only label/comment?)
- What’s the blast radius if the agent makes a mistake? (Posting a bad comment is recoverable; force-pushing to main is not.)
- Does the agent act on behalf of a bot account or impersonate individual developers?
- Which GitHub resources does it need access to? (Issues, PRs, code, wikis, Actions?)
Scale and integration:
- How many repositories? How many developers?
- What’s the scale of issue/PR volume per day?
- Is this a standalone tool, a GitHub App, or integrated into an IDE or Slack?
- Does “answer questions about the codebase” require reading the entire codebase, or just recent changes?
User interaction model:
- Does the developer invoke the agent explicitly (slash command, button click) or does it operate autonomously on GitHub events?
- How does the developer override or correct the agent?
Data and privacy:
- Is the codebase proprietary? Can we send code to a cloud LLM API?
- Are there compliance requirements (SOC2, HIPAA) that limit what data can leave the network?
For this walkthrough, I’ll assume:
- Single organization, ~20 repositories, ~50 developers
- Agent is triggered by explicit invocation (slash commands in issue/PR comments like /triage, /draft-pr, /ask) and by GitHub webhooks on new issues
- Agent operates as a bot account with write access to issues/PRs but cannot merge PRs or push code
- The code is a proprietary SaaS product; the agent uses Anthropic’s API (data processed under data privacy agreements)
- All consequential actions go through a human-approval step
- Integration via GitHub App webhook + Slack notifications
Step 2: Requirements
Functional Requirements
Issue triage:
- Auto-label new issues with type (bug, feature, question), priority (P0–P3), and affected component
- Assign to the right team based on codebase ownership
- Identify duplicate issues and link them
- Ask clarifying questions on vague issues
PR description drafting:
- Given a PR with a diff, generate a structured description (summary, changes, testing notes)
- Link to related issues automatically
- Flag potential review concerns (large diff, changes to critical paths, missing tests)
Codebase Q&A:
- Answer questions about code: “What does the payment service do?”, “Where is the auth logic?”, “Why does this function exist?”
- Explain recent changes in a PR
- Find relevant code for a given task
Non-Functional Requirements
- Response latency: < 30 seconds for most operations (async acceptable for some)
- Actions are reversible: the agent never performs irreversible actions without human approval
- Auditability: every agent action logged with reasoning
- Cost: < $500/month at stated scale
- GitHub API compliance: stay within GitHub App rate limits
Step 3: High-Level Architecture
TRIGGER LAYER
┌───────────────────────────────────────────────────────────┐
│ │
│ GitHub Webhooks Slack / IDE Slash Commands │
│ (new issue, PR opened, (/triage #123, /ask "...") │
│ comment with /command) │ │
│ │ │ │
│ └────────┬───────┘ │
│ ▼ │
│ Event Router │
│ (parse intent, authenticate user, │
│ route to correct agent task) │
│ │
└───────────────────────────────────────────────────────────┘
│
▼
AGENT CORE
┌───────────────────────────────────────────────────────────┐
│ │
│ Task Planner (LLM) │
│ - Understands task type │
│ - Selects relevant tools │
│ - Produces step-by-step plan │
│ │ │
│ ▼ │
│ ReAct Loop │
│ ┌──────────────────────────────────────────────┐ │
│ │ Thought → Action → Observation → Thought │ │
│ │ (max iterations: 10, timeout: 60s) │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Tool Dispatcher │
│ (validates tool calls, rate limit management, │
│ error handling, retry logic) │
│ │
└───────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
GitHub Tools Code Search Memory Store
(API wrappers) (RAG over code) (Redis + pgvector)
│
▼
HUMAN-IN-THE-LOOP
┌───────────────────────────────────────────────────────────┐
│ │
│ Action Queue │
│ (consequential actions hold here pending approval) │
│ │ │
│ ▼ │
│ Slack/GitHub notification with approve/reject buttons │
│ │ │
│ ▼ │
│ Action Executor (runs approved actions) │
│ │ │
│ ▼ │
│ Audit Log (every action, its reasoning, approver) │
│ │
└───────────────────────────────────────────────────────────┘
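The Event Router's intent parsing can be sketched as follows. This is a minimal illustration, assuming the three slash commands from the walkthrough's assumptions; the `AgentTask` type, the `route_comment` function, and the regex are hypothetical names introduced here, and user authentication is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Commands come from the walkthrough's assumptions; the pattern itself is illustrative.
# The lookahead rejects look-alikes such as "/asking" or "/triage-now".
COMMAND_PATTERN = re.compile(
    r"^/(triage|draft-pr|ask)(?=\s|$)[ \t]*(.*)$", re.MULTILINE
)


@dataclass
class AgentTask:
    command: str      # "triage", "draft-pr", or "ask"
    argument: str     # free text after the command, may be empty
    invoked_by: str   # GitHub login of the commenter


def route_comment(body: str, author: str) -> Optional[AgentTask]:
    """Return an AgentTask if any line of the comment is a known slash command."""
    match = COMMAND_PATTERN.search(body)
    if match is None:
        return None  # ordinary comment, nothing to route
    return AgentTask(
        command=match.group(1),
        argument=match.group(2).strip(),
        invoked_by=author,
    )
```

For example, `route_comment("/ask where is the auth logic?", "alice")` yields a task with command `ask`, while a comment without a command yields `None` and is ignored.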
Step 4: Component Breakdown
4.1 Tool Inventory
Every tool the agent can call must have a precise name, description, input schema, and output format. The LLM reads these descriptions as instructions — imprecise tool descriptions cause incorrect tool calls.
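To make "precise name, description, input schema" concrete, here is one possible definition for add_labels, written as a Python dict in the name / description / input_schema shape that Anthropic's tool-use API accepts. The description wording and the label enum are assumptions for illustration, not a canonical schema.

```python
# Illustrative label set; a real deployment would load this from configuration.
ALLOWED_LABEL_VALUES = ["bug", "feature", "question", "P0", "P1", "P2", "P3"]

ADD_LABELS_TOOL = {
    "name": "add_labels",
    "description": (
        "Add one or more labels to a GitHub issue. Labels must come from the "
        "allowed set listed in the schema. The action is queued for human "
        "approval and is not applied immediately."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "issue_number": {
                "type": "integer",
                "minimum": 1,
                "description": "Number of the issue in the current repository.",
            },
            "labels": {
                "type": "array",
                "minItems": 1,
                "items": {"type": "string", "enum": ALLOWED_LABEL_VALUES},
                "description": "Labels to add; each must be an allowed value.",
            },
        },
        "required": ["issue_number", "labels"],
        "additionalProperties": False,
    },
}
```

Putting the allowed values in the schema as an enum means the model sees the constraint up front, rather than discovering it through a validation error.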
get_issue(issue_number: int) → Issue
Returns full issue content including title, body, labels, comments, timeline. Used as the first tool call in almost every triage task.
list_issues(filters: IssueFilters) → List[Issue]
Lists issues matching filters (label, assignee, state, date range). Used for duplicate detection and triage context.
search_issues(query: str) → List[Issue]
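Full-text GitHub issue search. Used for duplicate detection: search for issues with overlapping keywords before triaging. Keyword search alone does not find semantically similar issues; that is what the embedding lookup in long-term memory is for.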
add_labels(issue_number: int, labels: List[str]) → void
Adds labels to an issue. Labels must be from the allowed set (validated before calling). This is a consequential action — goes through approval queue.
add_comment(issue_number: int, body: str) → Comment
Posts a comment. The agent drafts the comment; a human approves before posting. Exception: automated “I’ve been asked to triage this — give me a moment” acknowledgment comments post immediately.
assign_issue(issue_number: int, assignees: List[str]) → void
Assigns issue to developers. Consequential, requires approval.
get_pull_request(pr_number: int) → PullRequest
Returns PR metadata, linked issues, and review status.
get_pull_request_diff(pr_number: int) → Diff
Returns the unified diff. Critical for PR description drafting. Note: large diffs (>500KB) need truncation or selective sampling — the agent should request only changed file paths first and then pull specific files.
search_code(query: str, repo: str) → List[CodeResult]
GitHub code search. Returns file paths and matching lines. Used for codebase Q&A.
get_file_contents(path: str, repo: str, ref: str = "main") → FileContent
Returns a specific file’s contents. Used for deep code Q&A.
get_codebase_summary(repo: str) → Summary
Custom tool (not a raw GitHub API call) — returns a cached summary of the repo: top-level directory structure, key services, README extract, recent commit history. Generated once per repo per day and cached. This is the first tool called for codebase Q&A.
get_blame(file_path: str, lines: range) → BlameResult
Returns git blame for a file section — who wrote it, when, and the commit message. Useful for understanding why code exists.
create_pull_request_comment(pr_number: int, body: str) → Comment
Posts a review comment on a PR (not a line-level comment). Used for the generated PR description.
update_pull_request_description(pr_number: int, body: str) → void
Updates the PR description body. Consequential action — requires developer approval (the PR author is notified and asked to accept or edit).
Why these tools and not others:
Tools I deliberately excluded and why:
- merge_pull_request: Too high blast radius. Merging code is an irreversible action that could break production. Never automated.
- push_to_branch: Same reasoning. No code writes.
- delete_branch: Consequential and sometimes irreversible.
- close_issue: The agent can recommend closing but a human must confirm.
- create_issue: Excluded for MVP — reduces risk of spam/noise.
The principle: the agent can read everything and draft everything, but write actions that change shared state require human confirmation.
4.2 Memory Design
The agent needs multiple types of memory with different persistence and retrieval characteristics.
Working Memory (In-context, per-session)
The current conversation thread including all tool calls and results. This is the LLM’s context window. For a complex triage task, this might be:
- System prompt (~1000 tokens)
- Task description (~200 tokens)
- Tool call history (~2000–5000 tokens)
- Total: well within 200K context window
Working memory is ephemeral — it disappears when the session ends. This is appropriate for task execution.
Short-Term Session Memory (Redis, TTL: 24 hours)
Stores state between steps of a multi-step task:
- Pending action queue (actions awaiting human approval)
- In-progress task state (partially completed triage where agent is waiting for human input)
- Rate limit state (how many GitHub API calls made this hour)
Key format: agent:session:{session_id}:state
Long-Term Knowledge Memory (pgvector, persistent)
Stores knowledge that the agent should retain across sessions:
- Codebase knowledge: Embeddings of repo summaries, key file descriptions, architecture docs. Refreshed nightly.
- Issue patterns: Embeddings of resolved issues and their resolutions. Enables “similar issues” lookup. “This issue looks similar to #1234 which was fixed by updating the auth middleware.”
- Team ownership map: Which team owns which directories/services. Used for auto-assignment. Stored as structured data, not vector.
- Historical triage decisions: “Component X issues go to team Y with priority P1” — learned from past human triage decisions. Used to calibrate automated triage.
Memory retrieval flow:
new_issue → embed issue text →
query 1: similar past issues (find duplicates)
query 2: similar resolved issues (suggest fix direction)
query 3: get team ownership for affected component
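The similar-issue queries in this flow boil down to a nearest-neighbor search over embeddings. The sketch below is an in-memory stand-in for what pgvector would do in SQL; the toy two-dimensional vectors, the `min_score` cutoff of 0.8, and the function names are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar_issues(
    query_vec: Sequence[float],
    index: Dict[int, List[float]],   # issue number -> stored embedding
    top_k: int = 3,
    min_score: float = 0.8,          # assumed cutoff for "similar enough"
) -> List[Tuple[int, float]]:
    """Return up to top_k (issue_number, score) pairs, best match first."""
    scored = [(num, cosine_similarity(query_vec, vec)) for num, vec in index.items()]
    hits = [(num, score) for num, score in scored if score >= min_score]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy index: real embeddings would have hundreds of dimensions.
toy_index = {1234: [1.0, 0.0], 77: [0.0, 1.0], 90: [0.9, 0.1]}
matches = find_similar_issues([1.0, 0.0], toy_index)
```

Here issue 1234 is an exact match, issue 90 is close, and issue 77 is filtered out by the cutoff. The team ownership lookup in query 3 is a plain keyed lookup and needs no vector search.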
User Preference Memory (PostgreSQL, persistent)
Per-developer preferences and feedback:
- Does this developer want verbose or terse agent comments?
- Did they previously approve or reject suggested labels?
- Which tools have they authorized for auto-execution (no approval needed)?
This personalizes the agent’s behavior over time.
4.3 Safety and Human-in-the-Loop Design
The key insight for agent safety: the agent should never take an action where the cost of being wrong is higher than the cost of asking a human.
Action Classification
Classify every potential action into one of four tiers:
Tier 1 — Auto-execute (no human approval):
- Reading any data (issues, code, PRs)
- Posting a “working on it” acknowledgment comment
- Caching and indexing operations
- Reporting results to the invoking user
Tier 2 — Notify and auto-execute (execute but send notification):
- Adding labels (with a known, safe label set)
- Posting drafted responses (when user has opted into auto-post)
- Creating calendar reminders
Tier 3 — Queue for approval (hold, present to human, wait):
- Assigning issues to people
- Updating PR descriptions
- Posting formal comments on behalf of the agent
- Any action that affects others beyond the invoking user
Tier 4 — Never (hardcoded off):
- Merging PRs
- Pushing code
- Deleting branches or issues
- Modifying CI/CD configurations
- Any action involving payment or billing
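One way to encode the tiers is a static lookup table in the Tool Dispatcher. In this sketch the `Tier` enum, the `TOOL_TIERS` table, and the decision strings are hypothetical names; treating unknown tools as Tier 4 is an added assumption, so that a tool nobody classified is refused rather than executed.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTO = 1       # execute without approval
    NOTIFY = 2     # execute, then notify
    APPROVAL = 3   # hold until a human approves
    NEVER = 4      # hardcoded off


# Partial table for illustration, following the tier lists above.
TOOL_TIERS = {
    "get_issue": Tier.AUTO,
    "get_file_contents": Tier.AUTO,
    "add_labels": Tier.NOTIFY,
    "assign_issue": Tier.APPROVAL,
    "update_pull_request_description": Tier.APPROVAL,
    "merge_pull_request": Tier.NEVER,
    "push_to_branch": Tier.NEVER,
}

DECISIONS = {
    Tier.AUTO: "execute",
    Tier.NOTIFY: "execute_and_notify",
    Tier.APPROVAL: "queue_for_approval",
    Tier.NEVER: "refuse",
}


def dispatch_decision(tool_name: str) -> str:
    """Map a tool name to a dispatcher decision; unclassified tools are refused."""
    tier = TOOL_TIERS.get(tool_name, Tier.NEVER)
    return DECISIONS[tier]
```

Because the table lives in code rather than in the prompt, the model cannot talk its way into a higher-privilege tier.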
The Approval Interface
For Tier 3 actions, the agent posts to Slack with:
[Agent] I'd like to take the following action on issue #123:
- Add labels: [bug, P1, auth-service]
- Assign to: @alice
My reasoning: This issue describes a login failure (bug),
affecting paying users (P1), in the authentication module (auth-service).
@alice owns the auth service per CODEOWNERS.
[Approve] [Edit] [Reject]
The [Edit] button opens an inline form where the developer can modify labels or assignee before approving. This keeps the human in control while leveraging the agent’s work.
Irreversibility Check
Before any action, check: “Can this be undone?” If no, escalate to Tier 4 or Tier 3 regardless of action type. This is a hardcoded check in the Tool Dispatcher, not a prompt instruction — you can’t prompt-engineer your way to safety.
Confidence Thresholding
For automated actions (Tier 1/2), the agent includes a self-assessed confidence score in its tool call. Below a threshold, auto-escalate to Tier 3.
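The escalation rule can be sketched as a single check. This assumes the agent reports one confidence score per proposed field and uses a 0.80 threshold; the function name and the choice to escalate when no scores are present are assumptions for illustration.

```python
from typing import Dict

CONFIDENCE_THRESHOLD = 0.80  # below this, a human decides


def needs_approval(confidences: Dict[str, float]) -> bool:
    """Return True if the action should be escalated to Tier 3.

    Escalates when any field is below the threshold, and also when the
    agent reported no scores at all (missing confidence is treated as low).
    """
    if not confidences:
        return True
    return any(score < CONFIDENCE_THRESHOLD for score in confidences.values())
```

With scores of 0.92 for labels and 0.71 for priority, `needs_approval` returns `True`, so the whole action is held for approval even though the labels alone would have passed.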
# Agent output structure for triage
{
    "proposed_labels": ["bug", "auth-service"],
    "label_confidence": 0.92,
    "proposed_priority": "P1",
    "priority_confidence": 0.71,  # lower confidence on priority
    "reasoning": "Login failure affecting paying users..."
}
# Each field gets its own confidence key; a repeated "confidence" key
# would silently overwrite the first value.
# If any confidence < 0.80, escalate to Tier 3
Step 5: Failure Mode Analysis
Infinite Loop Prevention
The problem: A ReAct agent can loop indefinitely — especially if a tool returns an unexpected result and the agent keeps retrying with variations.
Mitigations:
- Hard iteration cap: Maximum 10 tool calls per task. After 10 calls, the agent must either output a result or escalate to human.
- Repetition detection: Track tool calls in the session. If the same tool is called with the same arguments twice, immediately stop and report: “I seem to be stuck in a loop. Here’s what I’ve found so far: [partial results].”
- Timeout: 60-second wall-clock timeout per task. On timeout, output whatever partial results exist.
- Progress requirement: Every 3 tool calls, check: “Am I making progress toward the original goal?” If no, halt.
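Repetition detection needs a stable way to compare tool calls. One possible approach, sketched below, fingerprints each call as the tool name plus its arguments serialized with sorted keys, so argument order does not matter. The function names and the `(tool_name, args)` history format are assumptions for illustration.

```python
import json
from typing import Any, Dict, List, Tuple

ToolCallRecord = Tuple[str, Dict[str, Any]]  # (tool_name, args)


def call_fingerprint(tool_name: str, args: Dict[str, Any]) -> str:
    """Canonical string for a tool call; key order in args is irrelevant."""
    return tool_name + ":" + json.dumps(args, sort_keys=True, default=str)


def is_duplicate_call(
    tool_name: str, args: Dict[str, Any], history: List[ToolCallRecord]
) -> bool:
    """True if an identical call (same tool, same arguments) was already made."""
    fingerprint = call_fingerprint(tool_name, args)
    return any(call_fingerprint(name, prev) == fingerprint for name, prev in history)
```

Exact-match detection will not catch an agent that retries with trivially different arguments; the iteration cap and the timeout remain the backstop for that case.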
import time

class AgentLoop:
    MAX_ITERATIONS = 10
    TIMEOUT_SECONDS = 60

    def run(self, task):
        tool_call_history = []
        # Monotonic clock: unaffected by system clock adjustments
        start_time = time.monotonic()
        for iteration in range(self.MAX_ITERATIONS):
            # Wall-clock timeout: return whatever has been gathered so far
            if time.monotonic() - start_time > self.TIMEOUT_SECONDS:
                return self.partial_result(tool_call_history, "timeout")

            response = self.llm.call(self.build_prompt(task, tool_call_history))
            if response.is_final_answer:
                return response.answer

            tool_call = response.tool_call
            # Detect repetition
            if self.is_duplicate_call(tool_call, tool_call_history):
                return self.partial_result(tool_call_history, "repeated_call")

            result = self.execute_tool(tool_call)
            tool_call_history.append((tool_call, result))

        # Iteration cap reached without a final answer
        return self.partial_result(tool_call_history, "max_iterations")
Bad Tool Calls (Wrong Arguments, Invalid IDs)
The problem: LLMs hallucinate argument values — issue numbers that don’t exist, label names that aren’t in the allowed set, usernames that aren’t real.
Mitigations:
- Schema validation before execution: Validate every tool call against a strict Pydantic schema before dispatching. If invalid, return the validation error to the model and ask it to correct.
- Existence validation for IDs: For any ID argument (issue_number, pr_number), verify it exists before passing to the real API.
- Enum constraints for labels: The add_labels tool only accepts labels from a hardcoded allowed set. The model cannot hallucinate a new label into existence.
- Retry budget: Allow the model 2 retries on validation failure. After 2 failures, escalate to human with error details.
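The retry budget can be sketched as a small loop that feeds each validation error back to the model. Here `propose` stands in for the LLM call and `validate` for the schema check; both are hypothetical callables supplied by the caller, and the return format is an assumption.

```python
from typing import Any, Callable, Optional, Tuple

MAX_RETRIES = 2  # retries allowed after the first failed attempt


def run_with_retry_budget(
    propose: Callable[[Optional[str]], Any],   # takes last error, returns a tool call
    validate: Callable[[Any], Optional[str]],  # returns an error message or None
) -> Tuple[str, Any]:
    """Return ("execute", call) on success or ("escalate", last_error)."""
    error: Optional[str] = None
    for _attempt in range(1 + MAX_RETRIES):
        call = propose(error)       # the model sees the previous error, if any
        error = validate(call)
        if error is None:
            return ("execute", call)
    return ("escalate", error)      # budget exhausted: hand off to a human


# Usage with a fake model that fixes its mistake on the second attempt.
_proposals = iter([{"labels": ["urgent"]}, {"labels": ["bug"]}])


def fake_propose(last_error: Optional[str]) -> dict:
    return next(_proposals)


def fake_validate(call: dict) -> Optional[str]:
    bad = set(call["labels"]) - {"bug", "P1"}
    return f"Invalid labels: {sorted(bad)}" if bad else None


outcome = run_with_retry_budget(fake_propose, fake_validate)
```

Returning the error text to the model is what makes the retry useful; retrying blindly with the same prompt would mostly reproduce the same mistake.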
# Example: label validation
ALLOWED_LABELS = {"bug", "feature", "question", "P0", "P1", "P2", "P3",
                  "auth-service", "payment-service", "api", "frontend", ...}

def validate_add_labels_call(call: ToolCall):
    # Labels the model proposed that are not in the allowed set
    invalid = set(call.args.labels) - ALLOWED_LABELS
    if invalid:
        # The error is returned to the model so it can correct the call
        return ToolCallError(
            f"Invalid labels: {invalid}. Allowed labels are: {ALLOWED_LABELS}"
        )
    return None  # no error: the call is valid
Sensitive Data Exposure
The problem: The codebase may contain API keys, credentials, PII. The agent reads code and could inadvertently include this in its responses or logs.
Mitigations:
- Secret scanning before LLM input: Run a regex-based secret scanner (Gitleaks patterns) on any file content before inserting it into the prompt. Redact detected secrets.
- Response filtering: Scan agent output for patterns that look like API keys, credentials, or PII before posting.
- Scope minimization: The get_file_contents tool only has access to specific repos and branches. Configuration files (.env, secrets.yaml) are blocklisted.
- Log sanitization: Never log full prompt/response content in production logs. Log token counts, tool call names, and task IDs instead.
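A minimal sketch of the redaction and blocklist steps is shown below. The three patterns are a small, illustrative subset of well-known secret formats (an AWS access key ID, a classic GitHub personal access token, a PEM private key header), not the Gitleaks rule set; the function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Illustrative subset of patterns; a real scanner ships many more rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# File names that are never read, regardless of directory.
BLOCKLISTED_FILENAMES = {".env", "secrets.yaml"}


def is_blocklisted(path: str) -> bool:
    """True if the file name (last path component) is on the blocklist."""
    return PurePosixPath(path).name in BLOCKLISTED_FILENAMES


def redact_secrets(text: str) -> str:
    """Replace anything matching a known secret pattern with a placeholder."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

The same `redact_secrets` function can run on both sides: on file content before it enters the prompt, and on the agent's output before it is posted. Regex scanning only catches secrets with recognizable formats, so it complements the blocklist rather than replacing it.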
Cascading Rate Limit Exhaustion
The problem: One buggy task triggers 50 GitHub API calls, exhausting the hourly rate limit for the entire organization (5,000 requests/hour for GitHub Apps).
Mitigations:
- Per-task API call budget: Each task gets a budget of max 20 API calls. Tracked in the Tool Dispatcher.
- Rate limit monitoring: Check remaining rate limit before each API call. If < 500 remaining, pause and notify the operator.
- Caching layer: Cache GitHub API responses (issues, PR diffs) for 5 minutes. Repeat calls within the cache TTL don’t hit the API.
- Exponential backoff: On rate-limit responses (GitHub returns 403 or 429), back off exponentially and schedule a retry.
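The backoff schedule can be sketched as a pure function. This version doubles the delay on each attempt up to a cap and optionally applies full jitter; the base of 1 second and cap of 60 seconds are assumed values, not figures from the design.

```python
import random


def backoff_delay(
    attempt: int,          # 0 for the first retry, 1 for the second, ...
    base: float = 1.0,     # assumed starting delay in seconds
    cap: float = 60.0,     # assumed upper bound in seconds
    jitter: bool = True,
) -> float:
    """Seconds to wait before the next retry."""
    delay = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point in [0, delay] so that many
    # blocked tasks do not all retry at the same instant.
    return random.uniform(0, delay) if jitter else delay


# Deterministic schedule with jitter disabled: 1, 2, 4, 8 seconds.
schedule = [backoff_delay(n, jitter=False) for n in range(4)]
```

When the response carries the time at which the quota resets (GitHub reports it in the `x-ratelimit-reset` header), waiting until that time is usually a better choice than guessing with backoff.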
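class TaskBudgetExceeded(Exception):
    """A single task used up its per-task API call budget."""

class RateLimitLow(Exception):
    """The shared hourly quota is nearly exhausted; carries the reset time."""

class GitHubRateLimiter:
    def __init__(self):
        self.remaining = 5000       # hourly quota, decremented locally per call
        self.reset_at = None
        self.per_task_budget = {}   # session_id -> calls made so far

    def check_and_deduct(self, session_id: str) -> bool:
        used = self.per_task_budget.get(session_id, 0)
        if used >= 20:
            raise TaskBudgetExceeded()
        if self.remaining < 500:
            raise RateLimitLow(self.reset_at)
        self.per_task_budget[session_id] = used + 1
        self.remaining -= 1
        return True
Model Miscalibration on Codebase Context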
The problem: The agent confidently answers codebase questions with outdated information (the code was refactored, the function was renamed, the service was deprecated).
Detection: Add a last_indexed_at timestamp to every code summary in memory. If the code has been updated since the last index (check via GitHub commit timestamp), add a caveat: “Note: my codebase knowledge was last updated 3 days ago. The answer below may not reflect recent changes.”
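The staleness check can be sketched as a comparison of two timestamps. The function name and signature are assumptions; `last_commit_at` would come from the GitHub commit timestamp mentioned above.

```python
from datetime import datetime, timezone
from typing import Optional


def staleness_caveat(
    last_indexed_at: datetime,
    last_commit_at: datetime,
    now: datetime,
) -> Optional[str]:
    """Return a caveat string if the code changed after indexing, else None."""
    if last_commit_at <= last_indexed_at:
        return None  # the index is at least as new as the latest commit
    age_days = (now - last_indexed_at).days
    return (
        f"Note: my codebase knowledge was last updated {age_days} day(s) ago. "
        "The answer below may not reflect recent changes."
    )


# Example: indexed on Jan 7, a commit landed on Jan 9, question asked on Jan 10.
caveat = staleness_caveat(
    last_indexed_at=datetime(2024, 1, 7, tzinfo=timezone.utc),
    last_commit_at=datetime(2024, 1, 9, tzinfo=timezone.utc),
    now=datetime(2024, 1, 10, tzinfo=timezone.utc),
)
```

The caveat only appears when a newer commit exists, so answers about untouched repos are not cluttered with warnings.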
Step 6: Rate Limit Strategy
GitHub API limits: 5,000 requests/hour for authenticated GitHub Apps (can be increased with additional tokens).
Strategy for 20 repos × 50 developers:
At 50 developers each triggering 2–3 agent tasks/day, and each task making 5–10 API calls, that’s 50 × 2.5 × 7.5 = ~937 API calls/day = ~40 API calls/hour average, well within limits.
Peak usage (end of sprint, everyone creating PRs): 10x average = ~400 calls/hour. Still fine.
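The same estimate as a short calculation, using the midpoints from the text (2.5 tasks per developer per day, 7.5 API calls per task) and the 10x peak multiplier. The variable names are introduced here for illustration.

```python
DEVELOPERS = 50
TASKS_PER_DEV_PER_DAY = 2.5   # midpoint of 2-3
CALLS_PER_TASK = 7.5          # midpoint of 5-10
PEAK_MULTIPLIER = 10
HOURLY_LIMIT = 5000           # GitHub App limit used in this walkthrough

calls_per_day = DEVELOPERS * TASKS_PER_DEV_PER_DAY * CALLS_PER_TASK
avg_calls_per_hour = calls_per_day / 24
peak_calls_per_hour = avg_calls_per_hour * PEAK_MULTIPLIER
peak_share_of_limit = peak_calls_per_hour / HOURLY_LIMIT

print(f"{calls_per_day:.0f} calls/day, {avg_calls_per_hour:.0f}/hour average")
print(f"peak {peak_calls_per_hour:.0f}/hour = {peak_share_of_limit:.1%} of the limit")
```

Even at peak, the agent uses under a tenth of the hourly quota, which is why the per-task budget exists to contain bugs rather than normal load.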
When you’d hit limits:
- Bulk ingestion of all 20 repos’ code history: aggressive batching could hit limits
- A bug causing infinite loops (hence the per-task budget)
- A nightly batch job re-indexing all repos at the same time
Solutions:
- Stagger nightly indexing jobs across repos
- Per-task call budget (already described)
- Aggressive caching with appropriate TTLs
- If scale grows: use separate GitHub App tokens for different repo groups (each gets its own rate limit)
Step 7: Evaluation Strategy
Triage accuracy:
- Golden dataset: 200 historical issues manually labeled with correct labels, priority, and assignee
- Metric: exact match on labels (%), priority within 1 level (%), correct team assignment (%)
- Threshold for shipping: > 85% label accuracy, > 80% correct team assignment
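The three triage metrics and the shipping gate can be computed as follows. This is a sketch under assumed data shapes: each prediction and golden record is a dict with `labels`, `priority`, and `team` keys, and the function names are hypothetical.

```python
from typing import Dict, List

PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def evaluate_triage(predictions: List[dict], golden: List[dict]) -> Dict[str, float]:
    """Compare agent output with the golden dataset, record by record."""
    if not golden or len(predictions) != len(golden):
        raise ValueError("predictions and golden must be non-empty and aligned")
    n = len(golden)
    label_hits = priority_hits = team_hits = 0
    for pred, gold in zip(predictions, golden):
        # Exact match on the label set, ignoring order
        label_hits += set(pred["labels"]) == set(gold["labels"])
        # Priority counts as correct if it is within one level
        gap = abs(PRIORITY_RANK[pred["priority"]] - PRIORITY_RANK[gold["priority"]])
        priority_hits += gap <= 1
        team_hits += pred["team"] == gold["team"]
    return {
        "label_exact_match": label_hits / n,
        "priority_within_one": priority_hits / n,
        "team_accuracy": team_hits / n,
    }


def ready_to_ship(metrics: Dict[str, float]) -> bool:
    """Shipping gate: > 85% label accuracy and > 80% correct team assignment."""
    return metrics["label_exact_match"] > 0.85 and metrics["team_accuracy"] > 0.80
```

Exact match on labels is a strict metric: a prediction with one extra or one missing label scores zero for that issue.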
PR description quality:
- Human evaluation: developers rate generated descriptions 1–5 on completeness, accuracy, usefulness
- Quantitative proxy: does the generated description include the correct issue references?
- Collect and store: every time a developer edits an agent-generated description before accepting, store the (generated, accepted) pair — these become training signal
Codebase Q&A:
- Harder to evaluate automatically
- Monthly sample: have 5 developers each ask 10 questions, rate answers 1–5
- Track: “How often do developers ask follow-up questions?” (proxy for incomplete initial answer)
Safety metrics:
- Zero tolerance: no instance of the agent taking a Tier 4 action (these are hardcoded off, so 0 should be achievable)
- Tier 3 approval rate: what fraction of queued actions do developers approve vs. reject? High rejection rate indicates poor agent judgment.
- Monitor false confidence: instances where agent had high confidence and the human rejected the action.
Step 8: Cost Estimate
At 50 developers, ~100 agent tasks/day:
Each triage task: ~5 tool calls × 200 tokens/call return + ~2000 input tokens system prompt + ~500 tokens output = ~4000 tokens/task
Each PR description: ~1 large diff read (~10K tokens) + 3000 tokens output = ~13K tokens/task
Each codebase Q&A: ~3 tool calls + code context (~8K tokens) + 1500 tokens output = ~12K tokens/task
Mix: 50% triage, 30% PR description, 20% Q&A
Daily token consumption:
- 50 triage × 4K = 200K tokens
- 30 PR description × 13K = 390K tokens
- 20 Q&A × 12K = 240K tokens
- Total: ~830K tokens/day
At Claude 3.5 Sonnet pricing ($3/$15 per M tokens in/out):
- Input: 700K × $3/M = $2.10/day
- Output: 130K × $15/M = $1.95/day
- Total: ~$4.05/day = ~$120/month
Plus caching: most of the system prompt (1K tokens) is identical across tasks. With prompt caching, cache reads of the system prompt cost $0.30/M (10% of the base input price). At 100 tasks/day × 1K tokens = 100K cached tokens/day, that saves roughly $0.27/day, which is negligible next to the context cost.
Total estimated cost: ~$150/month including infrastructure.
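The token and cost arithmetic can be reproduced in a few lines. The sketch assumes $3 per million input tokens and $15 per million output tokens, the 700K/130K input/output split used above, and a 30-day month; the variable names are introduced here.

```python
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (assumed)

# (tasks per day, tokens per task) for each task type
task_mix = {
    "triage": (50, 4_000),
    "pr_description": (30, 13_000),
    "codebase_qa": (20, 12_000),
}

daily_tokens = sum(count * tokens for count, tokens in task_mix.values())

# Input/output split used in the estimate
input_tokens = 700_000
output_tokens = 130_000

daily_cost = (
    input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
)
monthly_cost = daily_cost * 30

print(f"{daily_tokens:,} tokens/day")
print(f"${daily_cost:.2f}/day, ${monthly_cost:.2f}/month")
```

Output tokens are about 16% of the volume but nearly half of the spend, so trimming verbose PR descriptions is the most direct cost lever.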
Step 9: What I’d Do Differently With 6 More Months
Months 1–2: Better code understanding
The get_file_contents tool is a blunt instrument. Replace it with a proper code intelligence layer: build an AST-aware code index, understand call graphs, track symbol definitions and usages across files. This dramatically improves codebase Q&A quality.
Months 2–3: Learning from corrections
Every time a human edits an agent-generated output (label, comment, PR description), store the (generated, human-edited) pair. After 500 examples, fine-tune or few-shot the model on these corrections. This creates a system that gets measurably better over time.
Months 3–4: Proactive issue detection
Instead of only responding to explicit triggers, have the agent scan open issues weekly and proactively flag: stale issues with no activity, issues that were fixed but not closed, duplicate issues that were filed separately. This moves from reactive to proactive value.
Months 4–5: Multi-repo reasoning
For organizations with microservices spread across many repos, an issue in service A might be caused by a change in service B. Build cross-repo context linking so the agent can reason across repository boundaries.
Month 6: Metrics dashboard
Build a proper dashboard showing: how many issues were triaged, accuracy rates, developer adoption (what fraction of agent suggestions were accepted), time saved estimate, and cost per task. Without this, you can’t make a business case for the system or prioritize improvements.