System Design: Production Agent System

Interview Prompt: “Design an AI agent that can manage a developer’s GitHub workflow — triage issues, draft PR descriptions, and answer questions about the codebase.”

Step 1: Clarifying Questions

Scope and permissions:

What actions can the agent take autonomously vs. what requires human approval? (Can it close issues, or only label/comment?)
What’s the blast radius if the agent makes a mistake? (Posting a bad comment is recoverable; force-pushing to main is not.)
Does the agent act on behalf of a bot account or impersonate individual developers?
Which GitHub resources does it need access to? (Issues, PRs, code, wikis, Actions?)

Scale and integration:

How many repositories? How many developers?
What’s the scale of issue/PR volume per day?
Is this a standalone tool, a GitHub App, or integrated into an IDE or Slack?
Does “answer questions about the codebase” require reading the entire codebase, or just recent changes?

User interaction model:

Does the developer invoke the agent explicitly (slash command, button click) or does it operate autonomously on GitHub events?
How does the developer override or correct the agent?

Data and privacy:

Is the codebase proprietary? Can we send code to a cloud LLM API?
Are there compliance requirements (SOC2, HIPAA) that limit what data can leave the network?

For this walkthrough, I’ll assume:

Single organization, ~20 repositories, ~50 developers
Agent is triggered by explicit invocation (slash commands in issue/PR comments like /triage, /draft-pr, /ask) and by GitHub webhooks on new issues
Agent operates as a bot account with write access to issues/PRs but cannot merge PRs or push code
Code is a proprietary SaaS product; we use Anthropic’s API (data processed under data privacy agreements)
All consequential actions go through a human-approval step
Integration via GitHub App webhook + Slack notifications

Step 2: Requirements

Functional Requirements

Issue triage:

Auto-label new issues with type (bug, feature, question), priority (P0–P3), and affected component
Assign to the right team based on codebase ownership
Identify duplicate issues and link them
Ask clarifying questions on vague issues

PR description drafting:

Given a PR with a diff, generate a structured description (summary, changes, testing notes)
Link to related issues automatically
Flag potential review concerns (large diff, changes to critical paths, missing tests)

Codebase Q&A:

Answer questions about code: “What does the payment service do?”, “Where is the auth logic?”, “Why does this function exist?”
Explain recent changes in a PR
Find relevant code for a given task

Non-Functional Requirements

Response latency: < 30 seconds for most operations (async acceptable for some)
Actions are reversible: the agent never performs irreversible actions without human approval
Auditability: every agent action logged with reasoning
Cost: < $500/month at stated scale
GitHub API compliance: stay within GitHub App rate limits

Step 3: High-Level Architecture

                    TRIGGER LAYER
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  GitHub Webhooks         Slack / IDE Slash Commands       │
│  (new issue, PR opened,  (/triage #123, /ask "...")       │
│   comment with /command) │                                │
│         │                │                                │
│         └────────┬───────┘                                │
│                  ▼                                        │
│           Event Router                                    │
│           (parse intent, authenticate user,               │
│            route to correct agent task)                   │
│                                                           │
└───────────────────────────────────────────────────────────┘
                          │
                          ▼
                    AGENT CORE
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Task Planner (LLM)                                       │
│  - Understands task type                                  │
│  - Selects relevant tools                                 │
│  - Produces step-by-step plan                             │
│         │                                                 │
│         ▼                                                 │
│  ReAct Loop                                               │
│  ┌──────────────────────────────────────────────┐        │
│  │  Thought → Action → Observation → Thought   │        │
│  │  (max iterations: 10, timeout: 60s)          │        │
│  └──────────────────────────────────────────────┘        │
│         │                                                 │
│         ▼                                                 │
│  Tool Dispatcher                                          │
│  (validates tool calls, rate limit management,            │
│   error handling, retry logic)                            │
│                                                           │
└───────────────────────────────────────────────────────────┘
          │                    │                    │
          ▼                    ▼                    ▼
   GitHub Tools         Code Search          Memory Store
   (API wrappers)       (RAG over code)      (Redis + pgvector)

                          │
                          ▼
                   HUMAN-IN-THE-LOOP
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Action Queue                                             │
│  (consequential actions hold here pending approval)       │
│         │                                                 │
│         ▼                                                 │
│  Slack/GitHub notification with approve/reject buttons    │
│         │                                                 │
│         ▼                                                 │
│  Action Executor (runs approved actions)                  │
│         │                                                 │
│         ▼                                                 │
│  Audit Log (every action, its reasoning, approver)        │
│                                                           │
└───────────────────────────────────────────────────────────┘

Step 4: Component Breakdown

4.1 Tool Inventory

Every tool the agent can call must have a precise name, description, input schema, and output format. The LLM reads these descriptions as instructions — imprecise tool descriptions cause incorrect tool calls.
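As a rough illustration of what a "precise" definition looks like, here is a minimal sketch of one tool spec. The name/description/input_schema layout is an assumption about how the tools are registered with the model, the description wording is invented for this example, and `missing_fields` is a hypothetical helper.

```python
# Hypothetical tool definition: a name, a description the model reads,
# and a JSON Schema for the arguments.
GET_ISSUE_TOOL = {
    "name": "get_issue",
    "description": (
        "Fetch one GitHub issue by number. Returns title, body, labels, "
        "comments, and timeline. Call this first in a triage task."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "issue_number": {
                "type": "integer",
                "minimum": 1,
                "description": "Issue number within the current repository.",
            }
        },
        "required": ["issue_number"],
        "additionalProperties": False,
    },
}


def missing_fields(tool: dict) -> list[str]:
    """Return the spec fields that a tool definition is missing or left empty."""
    return [k for k in ("name", "description", "input_schema") if not tool.get(k)]


print(missing_fields(GET_ISSUE_TOOL))  # []
```

Running a check like this at startup is one cheap way to catch a tool that was registered without a description.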

`get_issue(issue_number: int) → Issue`

Returns full issue content including title, body, labels, comments, timeline. Used as the first tool call in almost every triage task.

`list_issues(filters: IssueFilters) → List[Issue]`

Lists issues matching filters (label, assignee, state, date range). Used for duplicate detection and triage context.

`search_issues(query: str) → List[Issue]`

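Full-text GitHub issue search. Used as the first pass of duplicate detection: run a keyword search for issues that resemble the new one before triaging. GitHub search matches keywords; semantic similarity comes from the embedding lookup described in 4.2.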

`add_labels(issue_number: int, labels: List[str]) → void`

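Adds labels to an issue. Labels must be from the allowed set (validated before calling). This is a consequential action: a human always sees it, either as a notification after the fact or through the approval queue, depending on the tier it falls into (see 4.3).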

`add_comment(issue_number: int, body: str) → Comment`

Posts a comment. The agent drafts the comment; a human approves before posting. Exception: automated “I’ve been asked to triage this — give me a moment” acknowledgment comments post immediately.

`assign_issue(issue_number: int, assignees: List[str]) → void`

Assigns issue to developers. Consequential, requires approval.

`get_pull_request(pr_number: int) → PullRequest`

Returns PR metadata, linked issues, and review status.

`get_pull_request_diff(pr_number: int) → Diff`

Returns the unified diff. Critical for PR description drafting. Note: large diffs (>500KB) need truncation or selective sampling — the agent should request only changed file paths first and then pull specific files.
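The "paths first, then specific files" idea can be sketched as a budgeted selection step. This is only one possible policy: the smallest-patch-first ordering, the `ChangedFile` type, and the byte budget are all assumptions for illustration, and a real agent might rank files by relevance instead.

```python
from dataclasses import dataclass


@dataclass
class ChangedFile:
    path: str
    patch_bytes: int  # size of this file's patch within the diff


def select_files(files: list[ChangedFile], budget_bytes: int) -> tuple[list[str], list[str]]:
    """Greedily keep the smallest patches that fit within budget_bytes.

    Returns (included_paths, skipped_paths).
    """
    included, skipped, used = [], [], 0
    for f in sorted(files, key=lambda f: f.patch_bytes):
        if used + f.patch_bytes <= budget_bytes:
            included.append(f.path)
            used += f.patch_bytes
        else:
            skipped.append(f.path)
    return included, skipped
```

The skipped paths can still be listed in the prompt by name, so the model knows they changed even though it has not read them.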

`search_code(query: str, repo: str) → List[CodeResult]`

GitHub code search. Returns file paths and matching lines. Used for codebase Q&A.

`get_file_contents(path: str, repo: str, ref: str = "main") → FileContent`

Returns a specific file’s contents. Used for deep code Q&A.

`get_codebase_summary(repo: str) → Summary`

Custom tool (not a raw GitHub API call) — returns a cached summary of the repo: top-level directory structure, key services, README extract, recent commit history. Generated once per repo per day and cached. This is the first tool called for codebase Q&A.
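The once-per-day caching could look roughly like the sketch below. It is an in-memory stand-in (the notes do not say where the summary cache lives), and `TTLCache` and its injectable `clock` are hypothetical names chosen to keep the example testable.

```python
import time
from typing import Callable


class TTLCache:
    """Tiny in-memory cache; entries expire ttl_seconds after they are stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, object]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], object]) -> object:
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh, skip the expensive rebuild
        value = compute()
        self._entries[key] = (now, value)
        return value


# One summary per repo per day.
summary_cache = TTLCache(ttl_seconds=24 * 60 * 60)
```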

`get_blame(file_path: str, lines: range) → BlameResult`

Returns git blame for a file section — who wrote it, when, and the commit message. Useful for understanding why code exists.

`create_pull_request_comment(pr_number: int, body: str) → Comment`

Posts a review comment on a PR (not a line-level comment). Used for the generated PR description.

`update_pull_request_description(pr_number: int, body: str) → void`

Updates the PR description body. Consequential action — requires developer approval (the PR author is notified and asked to accept or edit).

Tools I deliberately excluded, and why:

merge_pull_request: Too high blast radius. Merging code is an irreversible action that could break production. Never automated.
push_to_branch: Same reasoning. No code writes.
delete_branch: Consequential and sometimes irreversible.
close_issue: The agent can recommend closing but a human must confirm.
create_issue: Excluded for MVP — reduces risk of spam/noise.

The principle: the agent can read everything and draft everything, but write actions that change shared state require human confirmation.

4.2 Memory Design

The agent needs multiple types of memory with different persistence and retrieval characteristics.

Working Memory (In-context, per-session)

The current conversation thread including all tool calls and results. This is the LLM’s context window. For a complex triage task, this might be:

System prompt (~1000 tokens)
Task description (~200 tokens)
Tool call history (~2000–5000 tokens)
Total: well within 200K context window

Working memory is ephemeral — it disappears when the session ends. This is appropriate for task execution.

Short-Term Session Memory (Redis, TTL: 24 hours)

Stores state between steps of a multi-step task:

Pending action queue (actions awaiting human approval)
In-progress task state (partially completed triage where agent is waiting for human input)
Rate limit state (how many GitHub API calls made this hour)

Key format: agent:session:{session_id}:state
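A small sketch of how the key and its value might be built. The payload field names are assumptions derived from the three items listed above; the actual stored layout is not specified in these notes.

```python
import json

SESSION_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL stated above


def session_key(session_id: str) -> str:
    """Build the key in the agent:session:{session_id}:state format."""
    return f"agent:session:{session_id}:state"


def encode_state(pending_actions: list, task_state: dict, api_calls_this_hour: int) -> str:
    """Serialize the three pieces of session state into one JSON value."""
    return json.dumps(
        {
            "pending_actions": pending_actions,
            "task_state": task_state,
            "api_calls_this_hour": api_calls_this_hour,
        },
        sort_keys=True,
    )
```

With a Redis client, the value would be written with an expiry of `SESSION_TTL_SECONDS` so abandoned sessions clean themselves up.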

Long-Term Knowledge Memory (pgvector, persistent)

Stores knowledge that the agent should retain across sessions:

Codebase knowledge: Embeddings of repo summaries, key file descriptions, architecture docs. Refreshed nightly.
Issue patterns: Embeddings of resolved issues and their resolutions. Enables “similar issues” lookup. “This issue looks similar to #1234 which was fixed by updating the auth middleware.”
Team ownership map: Which team owns which directories/services. Used for auto-assignment. Stored as structured data, not vector.
Historical triage decisions: “Component X issues go to team Y with priority P1” — learned from past human triage decisions. Used to calibrate automated triage.
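Because the ownership map is structured data, lookup can be a plain prefix match. The sketch below uses longest-prefix matching over a made-up map; the team names and paths are hypothetical, and this is a simplification rather than GitHub's CODEOWNERS matching rules.

```python
from typing import Optional

# Hypothetical ownership map: directory prefix -> owning team.
OWNERSHIP = {
    "services/auth/": "identity-team",
    "services/payments/": "payments-team",
    "services/": "platform-team",
}


def owner_for(path: str, ownership: dict[str, str] = OWNERSHIP) -> Optional[str]:
    """Return the team whose prefix is the longest match for path, or None."""
    matches = [prefix for prefix in ownership if path.startswith(prefix)]
    if not matches:
        return None
    return ownership[max(matches, key=len)]


print(owner_for("services/auth/login.py"))  # identity-team
print(owner_for("docs/README.md"))          # None
```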

Memory retrieval flow:
new_issue → embed issue text → 
  query 1: similar past issues (find duplicates)
  query 2: similar resolved issues (suggest fix direction)
  query 3: get team ownership for affected component

User Preference Memory (PostgreSQL, persistent)

Per-developer preferences and feedback:

Does this developer want verbose or terse agent comments?
Did they previously approve or reject suggested labels?
Which tools have they authorized for auto-execution (no approval needed)?

This personalizes the agent’s behavior over time.

4.3 Safety and Human-in-the-Loop Design

The key insight for agent safety: the agent should never take an action where the cost of being wrong is higher than the cost of asking a human.

Action Classification

Classify every potential action into one of four tiers:

Tier 1 — Auto-execute (no human approval):

Reading any data (issues, code, PRs)
Posting a “working on it” acknowledgment comment
Caching and indexing operations
Reporting results to the invoking user

Tier 2 — Notify and auto-execute (execute but send notification):

Adding labels (with a known, safe label set)
Posting drafted responses (when user has opted into auto-post)
Creating calendar reminders

Tier 3 — Queue for approval (hold, present to human, wait):

Assigning issues to people
Updating PR descriptions
Posting formal comments on behalf of the agent
Any action that affects others beyond the invoking user

Tier 4 — Never (hardcoded off):

Merging PRs
Pushing code
Deleting branches or issues
Modifying CI/CD configurations
Any action involving payment or billing
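One way to encode the tiers is a static lookup that the Tool Dispatcher consults before running anything. The mapping below is a partial, hypothetical example built from the lists above, and defaulting unknown tools to approval is my assumption about what "fail safe" should mean here.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTO = 1      # execute without telling anyone
    NOTIFY = 2    # execute, then send a notification
    APPROVAL = 3  # hold until a human decides
    NEVER = 4     # refuse outright


# Hypothetical mapping from tool name to tier.
TOOL_TIERS = {
    "get_issue": Tier.AUTO,
    "search_code": Tier.AUTO,
    "add_labels": Tier.NOTIFY,
    "assign_issue": Tier.APPROVAL,
    "update_pull_request_description": Tier.APPROVAL,
    "merge_pull_request": Tier.NEVER,
    "push_to_branch": Tier.NEVER,
}


def tier_for(tool_name: str) -> Tier:
    """Look up a tool's tier; unknown tools default to APPROVAL."""
    return TOOL_TIERS.get(tool_name, Tier.APPROVAL)
```

Keeping this table in code rather than in the prompt means a newly added tool cannot silently run unapproved.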

The Approval Interface

For Tier 3 actions, the agent posts to Slack with:

[Agent] I'd like to take the following action on issue #123:
  - Add labels: [bug, P1, auth-service]
  - Assign to: @alice

My reasoning: This issue describes a login failure (bug),
affecting paying users (P1), in the authentication module (auth-service).
@alice owns the auth service per CODEOWNERS.

[Approve] [Edit] [Reject]

The [Edit] button opens an inline form where the developer can modify labels or assignee before approving. This keeps the human in control while leveraging the agent’s work.
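Behind the three buttons sits a small state machine for each queued action. The sketch below is one hypothetical shape for that record; `PendingAction`, its fields, and the audit tuple format are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PendingAction:
    tool: str
    args: dict
    reasoning: str
    status: str = "pending"  # pending -> approved | rejected
    approver: Optional[str] = None
    audit: list = field(default_factory=list)

    def decide(self, approver: str, decision: str, edited_args: Optional[dict] = None) -> None:
        """Record a human decision; edited_args models the [Edit] path."""
        if self.status != "pending":
            raise ValueError("action already decided")
        if decision not in ("approve", "reject"):
            raise ValueError(f"unknown decision: {decision}")
        if edited_args is not None:
            # Keep both versions so the audit log shows what the human changed.
            self.audit.append(("edited", approver, self.args, edited_args))
            self.args = edited_args
        self.status = "approved" if decision == "approve" else "rejected"
        self.approver = approver
        self.audit.append((self.status, approver))
```

Only actions whose status is `approved` should ever reach the Action Executor, and the stored (original, edited) pairs double as correction data.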

Irreversibility Check

Before any action, check: “Can this be undone?” If not, escalate to Tier 3 or Tier 4 regardless of action type. This is a hardcoded check in the Tool Dispatcher, not a prompt instruction: you can’t prompt-engineer your way to safety.

Confidence Thresholding

For automated actions (Tier 1/2), the agent includes a self-assessed confidence score in its tool call. Below a threshold, auto-escalate to Tier 3.

# Agent output structure for triage.
# Each proposal carries its own confidence key; repeating one "confidence"
# key would silently keep only the last value.
{
  "proposed_labels": ["bug", "auth-service"],
  "labels_confidence": 0.92,
  "proposed_priority": "P1",
  "priority_confidence": 0.71,  # lower confidence on priority
  "reasoning": "Login failure affecting paying users..."
}

# If any confidence < 0.80, escalate to Tier 3
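The escalation rule itself is a one-line check. In this sketch the per-field confidence names are hypothetical; the 0.80 threshold is the one stated above.

```python
CONFIDENCE_THRESHOLD = 0.80  # value taken from the rule above


def low_confidence_fields(confidences: dict[str, float],
                          threshold: float = CONFIDENCE_THRESHOLD) -> list[str]:
    """Return the fields scored below threshold; an empty list means no escalation."""
    return sorted(name for name, score in confidences.items() if score < threshold)


flagged = low_confidence_fields({"labels": 0.92, "priority": 0.71})
print(flagged)  # ['priority']
```

Returning the offending fields, rather than a bare boolean, lets the approval message say which part of the proposal the agent was unsure about.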

Step 5: Failure Mode Analysis

Infinite Loop Prevention

The problem: A ReAct agent can loop indefinitely — especially if a tool returns an unexpected result and the agent keeps retrying with variations.

Mitigations:

Hard iteration cap: Maximum 10 tool calls per task. After 10 calls, the agent must either output a result or escalate to human.
Repetition detection: Track tool calls in the session. If the same tool is called with the same arguments twice, immediately stop and report: “I seem to be stuck in a loop. Here’s what I’ve found so far: [partial results].”
Timeout: 60-second wall-clock timeout per task. On timeout, output whatever partial results exist.
Progress requirement: Every 3 tool calls, check: “Am I making progress toward the original goal?” If no, halt.

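import time


class AgentLoop:
    """ReAct loop with four exits: final answer, repeated call, timeout, or iteration cap.

    Assumes self.llm, build_prompt, is_duplicate_call, execute_tool, and
    partial_result are provided elsewhere.
    """
    MAX_ITERATIONS = 10   # hard cap on tool calls per task
    TIMEOUT_SECONDS = 60  # wall-clock budget per task

    def run(self, task):
        tool_call_history = []
        start_time = time.monotonic()  # monotonic clock ignores system clock adjustments

        for _ in range(self.MAX_ITERATIONS):
            # The timeout is checked between steps, so one slow LLM call can still overrun it.
            if time.monotonic() - start_time > self.TIMEOUT_SECONDS:
                return self.partial_result(tool_call_history, "timeout")

            response = self.llm.call(self.build_prompt(task, tool_call_history))

            if response.is_final_answer:
                return response.answer

            tool_call = response.tool_call

            # Detect repetition: same tool with the same arguments as an earlier call
            if self.is_duplicate_call(tool_call, tool_call_history):
                return self.partial_result(tool_call_history, "repeated_call")

            result = self.execute_tool(tool_call)
            tool_call_history.append((tool_call, result))

        return self.partial_result(tool_call_history, "max_iterations")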
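The loop relies on an `is_duplicate_call` helper that is not shown. A minimal sketch of one way to write it follows, assuming each call can be reduced to a tool name plus a JSON-serializable argument dict; the standalone function signature here differs from the method form used in the class.

```python
import json


def call_signature(tool_name: str, args: dict) -> str:
    """Canonical string for a tool call; key order in args does not matter."""
    return f"{tool_name}:{json.dumps(args, sort_keys=True)}"


def is_duplicate_call(tool_name: str, args: dict, history: list[tuple[str, dict]]) -> bool:
    """history holds the (tool_name, args) pairs that were already executed."""
    seen = {call_signature(name, past_args) for name, past_args in history}
    return call_signature(tool_name, args) in seen
```

Exact matching will not catch near-duplicates such as the same search with slightly reworded queries, which is why the iteration cap and timeout remain as backstops.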

Bad Tool Calls (Wrong Arguments, Invalid IDs)

The problem: LLMs hallucinate argument values — issue numbers that don’t exist, label names that aren’t in the allowed set, usernames that aren’t real.

Mitigations:

Schema validation before execution: Validate every tool call against a strict Pydantic schema before dispatching. If invalid, return the validation error to the model and ask it to correct.
Existence validation for IDs: For any ID argument (issue_number, pr_number), verify it exists before passing to the real API.
Enum constraints for labels: The add_labels tool only accepts labels from a hardcoded allowed set. The model cannot hallucinate a new label into existence.
Retry budget: Allow the model 2 retries on validation failure. After 2 failures, escalate to human with error details.

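# Example: label validation
# ToolCall and ToolCallError are assumed to be defined by the Tool Dispatcher.
ALLOWED_LABELS = {"bug", "feature", "question", "P0", "P1", "P2", "P3",
                  "auth-service", "payment-service", "api", "frontend"}  # ... plus the remaining labels

def validate_add_labels_call(call: "ToolCall"):
    """Return a ToolCallError for labels outside the allowed set, else None."""
    invalid = set(call.args.labels) - ALLOWED_LABELS
    if invalid:
        # Sorted output keeps the error message stable across runs.
        return ToolCallError(
            f"Invalid labels: {sorted(invalid)}. Allowed labels are: {sorted(ALLOWED_LABELS)}"
        )
    return None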
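The retry budget from the list above can be sketched as a small loop around the model. Here `propose` stands in for "ask the model for arguments, optionally showing it the last error" and `validate` for the schema check; both callables and the returned dict shape are hypothetical.

```python
from typing import Callable, Optional


def call_with_retry_budget(
    propose: Callable[[Optional[str]], dict],
    validate: Callable[[dict], Optional[str]],
    max_retries: int = 2,
) -> dict:
    """Request tool arguments, feeding validation errors back to the model.

    propose(error) returns candidate arguments; error is None on the first attempt.
    validate(args) returns an error message, or None when args are valid.
    """
    error = None
    for _ in range(max_retries + 1):  # first attempt plus the allowed retries
        args = propose(error)
        error = validate(args)
        if error is None:
            return {"status": "ok", "args": args}
    # Budget spent: hand off to a human with the last error attached.
    return {"status": "escalate", "error": error}
```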

Sensitive Data Exposure

The problem: The codebase may contain API keys, credentials, PII. The agent reads code and could inadvertently include this in its responses or logs.

Mitigations:

Secret scanning before LLM input: Run a regex-based secret scanner (Gitleaks patterns) on any file content before inserting it into the prompt. Redact detected secrets.
Response filtering: Scan agent output for patterns that look like API keys, credentials, or PII before posting.
Scope minimization: The get_file_contents tool only has access to specific repos and branches. Configuration files (.env, secrets.yaml) are blocklisted.
Log sanitization: Never log full prompt/response content in production logs. Log token counts, tool call names, and task IDs instead.
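A minimal sketch of the redaction step is below. The three patterns are an illustrative subset that I chose for the example (AWS access key IDs, classic GitHub personal access tokens, PEM private key blocks); they are not the Gitleaks rule set, and a real deployment would load a maintained one.

```python
import re

# Illustrative patterns only; far from complete.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),     # AWS access key ID shape
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # GitHub personal access token shape
    re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
]


def redact_secrets(text: str) -> tuple[str, int]:
    """Replace anything matching a secret pattern; return (clean_text, match_count)."""
    total = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        total += count
    return text, total
```

The same function can run on both sides of the model: on file contents before they enter the prompt, and on the agent's draft before it is posted. A nonzero count is also worth logging as a signal, without logging the match itself.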

Cascading Rate Limit Exhaustion

The problem: A buggy task that triggers 50 GitHub API calls per run, retried or repeated across concurrent tasks, can exhaust the hourly rate limit shared by the entire organization (5,000 requests/hour for GitHub Apps).

Mitigations:

Per-task API call budget: Each task gets a budget of max 20 API calls. Tracked in the Tool Dispatcher.
Rate limit monitoring: Check remaining rate limit before each API call. If < 500 remaining, pause and notify the operator.
Caching layer: Cache GitHub API responses (issues, PR diffs) for 5 minutes. Repeat calls within the cache TTL don’t hit the API.
Exponential backoff: On rate-limit responses (HTTP 403 or 429), back off exponentially and schedule a retry.

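class TaskBudgetExceeded(Exception):
    """A single task used up its per-task API call budget."""


class RateLimitLow(Exception):
    """The shared hourly quota is nearly exhausted; carries the reset time."""


class GitHubRateLimiter:
    PER_TASK_BUDGET = 20  # max API calls per task
    LOW_WATERMARK = 500   # pause when fewer calls than this remain

    def __init__(self):
        self.remaining = 5000   # refresh from the x-ratelimit-remaining response header
        self.reset_at = None    # refresh from the x-ratelimit-reset response header
        self.per_task_budget = {}

    def check_and_deduct(self, session_id: str) -> bool:
        used = self.per_task_budget.get(session_id, 0)
        if used >= self.PER_TASK_BUDGET:
            raise TaskBudgetExceeded(session_id)
        if self.remaining < self.LOW_WATERMARK:
            raise RateLimitLow(self.reset_at)
        self.per_task_budget[session_id] = used + 1
        self.remaining -= 1
        return True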
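The backoff mitigation is not covered by the limiter class, so here is a minimal sketch of the delay calculation. The base, cap, and the `retry_delay` name are assumptions; the one firm rule is that a wait time supplied by the server should override the local schedule.

```python
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base_seconds: float = 1.0, cap_seconds: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    retry_after is the server-provided wait, if the response carried one.
    """
    if retry_after is not None:
        return retry_after  # the server's instruction wins over our schedule
    return min(cap_seconds, base_seconds * 2 ** attempt)


print([retry_delay(i) for i in range(4)])  # [1.0, 2.0, 4.0, 8.0]
```

In practice a random jitter is usually added to each delay so that many stalled tasks do not all retry at the same instant; it is left out here to keep the output deterministic.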

Model Miscalibration on Codebase Context

The problem: The agent confidently answers codebase questions with outdated information (the code was refactored, the function was renamed, the service was deprecated).

Detection: Add a last_indexed_at timestamp to every code summary in memory. If the code has been updated since the last index (check via GitHub commit timestamp), add a caveat: “Note: my codebase knowledge was last updated 3 days ago. The answer below may not reflect recent changes.”
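The detection rule could be implemented along these lines. The function name and the explicit `now` parameter are assumptions made to keep the sketch testable; the caveat wording follows the example above.

```python
from datetime import datetime, timezone
from typing import Optional


def staleness_caveat(last_indexed_at: datetime, last_commit_at: datetime,
                     now: datetime) -> Optional[str]:
    """Return a caveat if the repo changed after it was last indexed, else None."""
    if last_commit_at <= last_indexed_at:
        return None  # index is at least as new as the code
    days = (now - last_indexed_at).days
    unit = "day" if days == 1 else "days"
    return (
        f"Note: my codebase knowledge was last updated {days} {unit} ago. "
        "The answer below may not reflect recent changes."
    )


indexed = datetime(2024, 1, 1, tzinfo=timezone.utc)
committed = datetime(2024, 1, 3, tzinfo=timezone.utc)
caveat = staleness_caveat(indexed, committed, now=datetime(2024, 1, 4, tzinfo=timezone.utc))
```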

Step 6: Rate Limit Strategy

GitHub API limits: 5,000 requests/hour for authenticated GitHub Apps (can be increased with additional tokens).

Strategy for 20 repos × 50 developers:

At 50 developers each triggering 2–3 agent tasks/day, and each task making 5–10 API calls, that’s 50 × 2.5 × 7.5 = ~937 API calls/day = ~40 API calls/hour average, well within limits.

Peak usage (end of sprint, everyone creating PRs): 10x average = ~400 calls/hour. Still fine.
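The arithmetic above, written out so the assumptions are easy to change. The midpoints (2.5 tasks, 7.5 calls) and the 10x peak factor are the ones used in the text.

```python
developers = 50
tasks_per_dev_per_day = 2.5  # midpoint of 2-3
calls_per_task = 7.5         # midpoint of 5-10
hourly_limit = 5000          # GitHub App limit quoted above

calls_per_day = developers * tasks_per_dev_per_day * calls_per_task
avg_per_hour = calls_per_day / 24
peak_per_hour = avg_per_hour * 10
headroom = hourly_limit / peak_per_hour

print(calls_per_day)            # 937.5
print(round(avg_per_hour, 1))   # 39.1
print(round(peak_per_hour, 1))  # 390.6
print(round(headroom, 1))       # 12.8
```

Even at peak the system uses under a tenth of the quota, so the real risk is runaway tasks and batch jobs, not steady-state load.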

When you’d hit limits:

Bulk ingestion of all 20 repos’ code history: aggressive batching could hit limits
A bug causing infinite loops (hence the per-task budget)
A nightly batch job re-indexing all repos at the same time

Solutions:

Stagger nightly indexing jobs across repos
Per-task call budget (already described)
Aggressive caching with appropriate TTLs
If scale grows: use separate GitHub App tokens for different repo groups (each gets its own rate limit)

Step 7: Evaluation Strategy

Triage accuracy:

Golden dataset: 200 historical issues manually labeled with correct labels, priority, and assignee
Metric: exact match on labels (%), priority within 1 level (%), correct team assignment (%)
Threshold for shipping: > 85% label accuracy, > 80% correct team assignment
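A sketch of how the three metrics and the shipping gate might be computed over the golden dataset. The example record layout (`gold`/`pred` dicts with `labels`, `priority`, `team`) is an assumption for illustration.

```python
PRIORITY_ORDER = ["P0", "P1", "P2", "P3"]


def triage_metrics(examples: list[dict]) -> dict[str, float]:
    """Score predictions against gold labels.

    Each example is {"gold": {...}, "pred": {...}}; both sides carry
    "labels" (iterable of str), "priority" (one of PRIORITY_ORDER), and "team".
    """
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    exact_labels = sum(set(e["pred"]["labels"]) == set(e["gold"]["labels"]) for e in examples)
    near_priority = sum(
        abs(PRIORITY_ORDER.index(e["pred"]["priority"])
            - PRIORITY_ORDER.index(e["gold"]["priority"])) <= 1
        for e in examples
    )
    right_team = sum(e["pred"]["team"] == e["gold"]["team"] for e in examples)
    return {
        "label_exact_match": exact_labels / n,
        "priority_within_one": near_priority / n,
        "team_accuracy": right_team / n,
    }


def ready_to_ship(metrics: dict[str, float]) -> bool:
    """Apply the shipping thresholds stated above."""
    return metrics["label_exact_match"] > 0.85 and metrics["team_accuracy"] > 0.80
```

Exact match on the full label set is a strict metric; if it proves too harsh, per-label precision and recall would show which label types the agent gets wrong.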

PR description quality:

Human evaluation: developers rate generated descriptions 1–5 on completeness, accuracy, usefulness
Quantitative proxy: does the generated description include the correct issue references?
Collect and store: every time a developer edits an agent-generated description before accepting, store the (generated, accepted) pair — these become training signal

Codebase Q&A:

Harder to evaluate automatically
Monthly sample: have 5 developers each ask 10 questions, rate answers 1–5
Track: “How often do developers ask follow-up questions?” (proxy for incomplete initial answer)

Safety metrics:

Zero tolerance: no instance of the agent taking a Tier 4 action (these are hardcoded off, so 0 should be achievable)
Tier 3 approval rate: what fraction of queued actions do developers approve vs. reject? High rejection rate indicates poor agent judgment.
Monitor false confidence: instances where agent had high confidence and the human rejected the action.

Step 8: Cost Estimate

At 50 developers, ~100 agent tasks/day:

Each triage task: ~5 tool calls × 200 tokens/call return + ~2000 input tokens (system prompt plus task and issue context) + ~500 tokens output = ~4000 tokens/task

Each PR description: ~1 large diff read (~10K tokens) + 3000 tokens output = ~13K tokens/task

Each codebase Q&A: ~3 tool calls + code context (~8K tokens) + 1500 tokens output = ~12K tokens/task

Mix: 50% triage, 30% PR description, 20% Q&A

Daily token consumption:

50 triage × 4K = 200K tokens
30 PR description × 13K = 390K tokens
20 Q&A × 12K = 240K tokens
Total: ~830K tokens/day

At Claude 3.5 Sonnet pricing ($3/$15 per M tokens in/out):

Input: 700K × $3/M = $2.10/day
Output: 130K × $15/M = $1.95/day
Total: ~$4/day = **$120/month**

Plus caching: most of the system prompt (1K tokens) is identical across tasks. With prompt caching, the system prompt costs $0.30/M (10% of base price) for cache reads. At 100 tasks/day × 1K tokens = 100K cached tokens/day → negligible savings vs. the context cost.

Total estimated cost: ~$150/month including infrastructure.
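The estimate can be reproduced with a few lines, which also makes it easy to rerun with a different task mix or price. All figures come from the text above; the 700K/130K input/output split is the rounded one used there, and infrastructure cost is not modeled.

```python
TASKS_PER_DAY = {"triage": 50, "pr_description": 30, "qa": 20}
TOKENS_PER_TASK = {"triage": 4_000, "pr_description": 13_000, "qa": 12_000}

INPUT_PRICE_PER_M = 3.00    # USD per million input tokens, as quoted above
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens, as quoted above

total_tokens = sum(TASKS_PER_DAY[k] * TOKENS_PER_TASK[k] for k in TASKS_PER_DAY)

# The text rounds the daily split to 700K input and 130K output tokens.
input_tokens, output_tokens = 700_000, 130_000

daily_cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
monthly_cost = daily_cost * 30

print(total_tokens)            # 830000
print(round(daily_cost, 2))    # 4.05
print(round(monthly_cost, 2))  # 121.5
```

Note that a ReAct loop resends the growing history on every step, so measured input tokens may run higher than this per-task estimate.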

Step 9: What I’d Do Differently With 6 More Months

Months 1–2: Better code understanding
The get_file_contents tool is a blunt instrument. Replace it with a proper code intelligence layer: build an AST-aware code index, understand call graphs, track symbol definitions and usages across files. This dramatically improves codebase Q&A quality.

Months 2–3: Learning from corrections
Every time a human edits an agent-generated output (label, comment, PR description), store the (generated, human-edited) pair. After 500 examples, fine-tune or few-shot the model on these corrections. This creates a system that gets measurably better over time.

Months 3–4: Proactive issue detection
Instead of only responding to explicit triggers, have the agent scan open issues weekly and proactively flag: stale issues with no activity, issues that were fixed but not closed, duplicate issues that were filed separately. This moves from reactive to proactive value.

Months 4–5: Multi-repo reasoning
For organizations with microservices spread across many repos, an issue in service A might be caused by a change in service B. Build cross-repo context linking so the agent can reason across repository boundaries.

Month 6: Metrics dashboard
Build a proper dashboard showing: how many issues were triaged, accuracy rates, developer adoption (what fraction of agent suggestions were accepted), time saved estimate, and cost per task. Without this, you can’t make a business case for the system or prioritize improvements.

Study Notes by Niladri & AI
