Design a Tax Refund Status Checking Portal

Difficulty: Medium
Duration: 40-45 minutes
Category: Data-Intensive, Security-Critical
Key Topics: Encryption, Gen AI, Global Scale, Compliance


Problem Statement

Design a tax refund status checking portal that allows taxpayers to check the status of their filed tax returns and expected refund amount. The system should be secure, scalable, and potentially integrate AI-powered assistance.

Initial Context from Interviewer:

  • Tax refund status checking
  • Large number of users (~1K TPS)
  • Data safety critical (encryption required)
  • Gen AI use case to integrate
  • Users have already filed their returns

Step 1: Understand the Problem & Establish Scope (5-7 min)

Clarifying Questions to Ask

Functional Requirements:

  1. Q: “What information can users check?”
    A: Refund status (filed, processing, approved, rejected, refunded), amount, expected date, tracking number.

  2. Q: “Do users need to authenticate? What’s the auth mechanism?”
    A: Yes. Multi-factor auth using PAN/SSN, filing year, date of birth, phone OTP.

  3. Q: “Can users only check or also update/appeal?”
    A: Initially just check. Appeal workflow is phase 2 (out of scope for now).

  4. Q: “What’s the Gen AI use case specifically?”
    A: AI chatbot to answer common questions, explain status, provide guidance on next steps.

  5. Q: “Is this real-time status or updated periodically?”
    A: Near real-time. Backend tax processing system updates status, our system reflects within minutes.

  6. Q: “What regions do we start with and scale to?”
    A: Start with India (~100M taxpayers), expand to other countries (US, EU, APAC).

Non-Functional Requirements:

  1. Q: “What’s the expected traffic pattern? Peak periods?”
    A: Tax season peaks (Jan-Apr in India). Normal: 1K TPS, Peak: 10K TPS.

  2. Q: “What’s acceptable latency?”
    A: Status check: <500ms p99. Chatbot: <2s response time.

  3. Q: “Availability SLA?”
    A: 99.9% (8.7 hours downtime/year). Tax season must be higher (99.99%).

  4. Q: “Data retention requirements?”
    A: Keep refund data for 7 years (regulatory requirement).

  5. Q: “Security & compliance requirements?”
    A: PCI DSS for payment data, encryption at rest and in transit, audit logs, GDPR for EU users.

  6. Q: “How many concurrent users?”
    A: Peak: 50K concurrent users during tax season.

Requirements Summary

Functional:

  • ✅ User authentication (multi-factor: PAN + DOB + phone OTP)
  • ✅ Check refund status (filed → processing → approved → refunded)
  • ✅ View refund amount, expected date, tracking details
  • ✅ AI chatbot for questions and guidance
  • ✅ Multi-language support (start: English/Hindi, expand globally)
  • ✅ Integration with backend tax processing system
  • ⬜ Appeal/dispute filing (Phase 2, out of scope)
  • ⬜ Notifications (email/SMS) - nice to have

Non-Functional:

  • Scale: 1K TPS normal, 10K TPS peak, 100M users (India), expand globally
  • Latency: Status check <500ms p99, Chatbot <2s
  • Availability: 99.9% (99.99% during tax season)
  • Security: Encryption (at rest + in transit), audit logs, compliance (PCI DSS, GDPR)
  • Data Retention: 7 years
  • Consistency: Eventual consistency OK (status updated within minutes)

Step 2: Capacity Estimation (3-5 min)

Traffic Estimation

Users: 100M taxpayers in India
Active users during tax season: 60M (60% file returns)

Daily Active Users (DAU):
- Peak tax season: 5M DAU
- Normal: 500K DAU

Requests per user per day:
- Checking status: 3 times/day average (anxious users check often!)
- Chatbot queries: 1-2 times/day

Total requests/day (peak):
- Status checks: 5M users × 3 = 15M requests/day
- Chatbot: 5M users × 1.5 = 7.5M requests/day
- Total: 22.5M requests/day

QPS Calculation:
- Average QPS: 22.5M / 86,400 ≈ 260 QPS
- Peak QPS (assume 10x traffic during work hours 9am-6pm):
  - Peak: 260 × 10 = 2,600 QPS
  - Interviewer mentioned 1K TPS → matches our estimate!
  - Peak during tax season: 10K TPS (given)

Storage Estimation

User Authentication Data:
- 100M users × 500 bytes = 50 GB

Refund Records:
- 60M filings/year
- Each record: user_id, filing_year, status, amount, dates, tracking
- Size per record: 2 KB (with indexes)
- Annual: 60M × 2 KB = 120 GB/year
- 7 years retention: 120 GB × 7 = 840 GB ≈ 1 TB

Audit Logs (every access logged):
- 22.5M requests/day × 365 days = 8.2B logs/year
- Each log: 500 bytes (timestamp, user_id, action, IP, etc.)
- Annual: 8.2B × 500 bytes = 4.1 TB/year
- 7 years: 4.1 TB × 7 = 28.7 TB ≈ 30 TB

Chatbot Conversation History:
- 7.5M conversations/day × 365 = 2.7B conversations/year
- Average conversation: 5 messages × 500 bytes = 2.5 KB
- Annual: 2.7B × 2.5 KB = 6.75 TB/year
- Keep 1 year: 6.75 TB

Total Storage:
- Refund data: 1 TB
- Audit logs: 30 TB
- Chatbot history: 6.75 TB
- Total: ~38 TB (round to 40 TB with overhead)

Bandwidth Estimation

Read Traffic (status checks):
- 15M reads/day × 2 KB = 30 GB/day
- Per second: 30 GB / 86,400 ≈ 347 KB/s ≈ 2.8 Mbps

Chatbot Traffic:
- 7.5M interactions/day
- Average: 5 messages × 500 bytes = 2.5 KB per conversation
- Daily: 7.5M × 2.5 KB = 18.75 GB/day
- Per second: 217 KB/s ≈ 1.7 Mbps

Total Bandwidth:
- Read: ~3 Mbps average, 30 Mbps peak
- Write (logs, updates): ~1 Mbps
- Total: 5-40 Mbps (very manageable)

Summary

MetricValue
Users100M (India), expand globally
DAU (peak)5M
QPS (average)260 QPS
QPS (peak)2,600 QPS (10K during tax season)
Storage40 TB (with 7-year retention)
Bandwidth5-40 Mbps
Latency Target<500ms (status), <2s (chatbot)
Availability99.9% (99.99% tax season)

Step 3: High-Level Design & Iterations (20-25 min)

Design Iteration 1: Basic Architecture (MVP)

Goal: Simple, working system for India market

┌─────────────┐
│   Users     │
│ (Web/Mobile)│
└──────┬──────┘
       │ HTTPS
       │
┌──────▼──────────────────────────────────────┐
│         Load Balancer (ALB)                  │
└──────┬──────────────────────────────────────┘
       │
┌──────▼──────────────┐
│  Web Servers         │
│  (Authentication +   │
│   Status API)        │
└──────┬──────────────┘
       │
┌──────▼──────────────────────────────────────┐
│           Database (PostgreSQL)              │
│   Tables: users, refunds, audit_logs         │
│   Encryption: AES-256 at rest                │
└──────▲──────────────────────────────────────┘
       │
       │ Updates (batch sync every 5 min)
       │
┌──────┴──────────────┐
│  Tax Processing      │
│  System (Backend)    │
│  (External)          │
└─────────────────────┘

Components:

  1. Load Balancer: Distribute traffic, SSL termination
  2. Web Servers: Handle auth, status queries, business logic
  3. PostgreSQL: Store refund data, user info, audit logs
  4. Batch Sync: Periodic updates from tax backend

Pros:

  • ✅ Simple to implement
  • ✅ Strong consistency (single DB)
  • ✅ Easy to secure (encryption at rest)

Cons:

  • ❌ Single point of failure (database)
  • ❌ Doesn’t scale beyond 1K QPS
  • ❌ No caching (slow reads)
  • ❌ No Gen AI integration
  • ❌ Can’t handle 10K peak TPS

When this works: Early stage, India only, <1K QPS


Design Iteration 2: Scalable with Caching & AI

Improvements:

  • Add caching layer for read-heavy workload
  • Database replication for read scalability
  • Integrate Gen AI chatbot
  • Separate auth service
┌─────────────┐
│   Users     │
└──────┬──────┘
       │ HTTPS
       │
┌──────▼───────────────────────────────────────┐
│     CDN (CloudFront)                          │
│     - Static assets (JS, CSS, images)         │
└──────┬───────────────────────────────────────┘
       │
┌──────▼───────────────────────────────────────┐
│     API Gateway + WAF                         │
│     - Rate limiting (100 req/min per user)    │
│     - DDoS protection                         │
└──────┬───────────────────────────────────────┘
       │
       ├──────────────────┬────────────────────┐
       │                  │                    │
┌──────▼──────┐  ┌───────▼────────┐  ┌───────▼──────────┐
│   Auth      │  │  Status API    │  │  AI Chatbot API  │
│   Service   │  │  Service       │  │                  │
│ (Stateless) │  │  (Stateless)   │  │  (Gen AI)        │
└──────┬──────┘  └────────┬───────┘  └───────┬──────────┘
       │                  │                    │
       │         ┌────────▼────────┐           │
       │         │  Redis Cache     │           │
       │         │  - Hot refund    │           │
       │         │    status (TTL:  │           │
       │         │    5 min)        │           │
       │         └────────┬────────┘           │
       │                  │                    │
┌──────▼──────────────────▼────────────────────▼──────┐
│              Database Layer                          │
│  ┌──────────────┐    ┌──────────┐  ┌──────────┐    │
│  │  Primary DB  │───▶│ Replica1 │  │ Replica2 │    │
│  │ (PostgreSQL) │    │  (Read)  │  │  (Read)  │    │
│  │   (Write)    │    └──────────┘  └──────────┘    │
│  └──────▲───────┘                                    │
└─────────┼──────────────────────────────────────────┘
          │
          │ Kafka topic: refund_updates
          │
┌─────────▼────────────┐
│  Message Queue       │
│  (Kafka)             │
│  - Status updates    │
│  - Audit events      │
└─────────▲────────────┘
          │
┌─────────┴────────────┐
│  Tax Processing      │
│  System (Backend)    │
│  Publishes updates   │
└──────────────────────┘

Separate storage for logs:
┌──────────────────────┐
│  Elasticsearch       │
│  - Audit logs        │
│  - Search & analysis │
└──────────────────────┘

New Components:

  1. API Gateway + WAF: Rate limiting, DDoS protection
  2. Auth Service: Separate microservice for authentication (PAN + OTP)
  3. Redis Cache: Cache hot refund data (80% hit rate), TTL 5 min
  4. DB Replication: Primary for writes, replicas for reads
  5. Kafka: Event stream for status updates from tax backend
  6. AI Chatbot API: Connects to LLM (OpenAI, Claude, or local model)
  7. Elasticsearch: Audit logs with search capability

Data Flow:

Read Path (Status Check):

User → API Gateway → Status API → Check Redis cache
  ├─ Cache hit (80%): Return from Redis (10-20ms)
  └─ Cache miss (20%): Query DB replica → Store in cache → Return (100-200ms)

Write Path (Status Update):

Tax Backend → Kafka → Consumer Service → Update Primary DB → Invalidate cache

Chatbot Path:

User → API Gateway → AI Chatbot API → Retrieve context from DB →
  LLM API (OpenAI/Claude) → Generate response → Return to user

Pros:

  • ✅ Scales to 10K+ TPS (cache handles 80% reads)
  • ✅ High availability (DB replicas, cache)
  • ✅ Gen AI integration
  • ✅ Better security (API Gateway + WAF)
  • ✅ Audit logs searchable (Elasticsearch)

Cons:

  • ❌ Single region (can’t scale globally)
  • ❌ Cache invalidation complexity
  • ❌ DB still bottleneck for writes
  • ❌ No data residency for GDPR

When this works: India + neighboring countries, 10K peak TPS


Design Iteration 3: Global Scale (Final Architecture)

Improvements:

  • Multi-region deployment
  • Database sharding by country
  • Read replicas in multiple regions
  • Edge caching globally
  • Advanced security (encryption, key rotation)
                    ┌────────────────────┐
                    │   Global Users     │
                    │  (India, US, EU)   │
                    └──────────┬─────────┘
                               │
                    ┌──────────▼─────────────┐
                    │  Route 53 (GeoDNS)     │
                    │  Routes to nearest     │
                    │  region                │
                    └──┬────────────────┬────┘
                       │                │
            ┌──────────▼─────┐   ┌─────▼──────────┐
            │  India Region  │   │  US Region     │
            │  (Primary)     │   │  (Secondary)   │
            └──────┬─────────┘   └─────┬──────────┘
                   │                   │
        ┌──────────▼───────────────────▼──────────┐
        │         CloudFront (CDN)                 │
        │   - Global edge caching                  │
        │   - Static assets + API responses        │
        └──────────────────┬──────────────────────┘
                           │
        ┌──────────────────▼──────────────────────┐
        │      API Gateway + AWS WAF               │
        │   - Rate limiting: 100 req/min           │
        │   - DDoS protection                      │
        │   - Request signing & validation         │
        └───┬──────────────────────────────────┬───┘
            │                                  │
  ┌─────────▼────────┐              ┌─────────▼─────────┐
  │  Auth Service    │              │  Status API        │
  │  - MFA (OTP)     │              │  - Stateless       │
  │  - JWT tokens    │              │  - Read-heavy      │
  └─────────┬────────┘              └─────────┬─────────┘
            │                                  │
  ┌─────────▼────────────┐         ┌──────────▼─────────┐
  │  Redis (Elasticache) │         │  Redis (Cache)     │
  │  - Session store     │         │  - Refund data     │
  │  - Distributed       │         │  - TTL: 5 min      │
  └──────────────────────┘         │  - Hit rate: 80%   │
                                   └──────────┬─────────┘
                                              │
                             ┌────────────────▼────────────────┐
                             │      Database Sharding          │
                             │                                 │
  ┌──────────────────────────┼─────────────────────────────────┼──┐
  │  India Shard             │  US Shard          │ EU Shard      │
  │  ┌─────────┐  ┌────────┐ │  ┌──────┐ ┌─────┐ │ ┌──────┐      │
  │  │Primary  │─▶│Replica │ │  │Primary││Repl.│ │ │Primary│     │
  │  │(Write)  │  │(Read)  │ │  │(Write)││(Read)│ │ │(Write)│    │
  │  └────▲────┘  └────────┘ │  └───▲──┘ └─────┘ │ └───▲──┘     │
  └───────┼───────────────────┴──────┼────────────┴─────┼────────┘
          │                          │                   │
  ┌───────▼──────────────────────────▼───────────────────▼────┐
  │              Kafka (Multi-region)                          │
  │   Topics: refund_updates_india, refund_updates_us, etc.   │
  │   Cross-region replication for DR                          │
  └───────▲────────────────────────────────────────────────────┘
          │
  ┌───────┴──────────────┐
  │  Tax Processing      │
  │  Systems (Regional)  │
  │  - India Tax Dept    │
  │  - IRS (US)          │
  │  - EU Tax Systems    │
  └──────────────────────┘

Additional Components:

┌──────────────────────────────────────────────────┐
│  AI Chatbot Infrastructure                       │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Chatbot API    │  │ Vector DB    │           │
│  │ (Stateless)    │─▶│ (Pinecone)   │           │
│  └────────┬───────┘  │ - FAQs       │           │
│           │          │ - Tax rules  │           │
│           │          └──────────────┘           │
│  ┌────────▼───────────────────┐                 │
│  │  LLM Service               │                 │
│  │  - OpenAI GPT-4 / Claude  │                 │
│  │  - Self-hosted (fallback) │                 │
│  │  - Response caching       │                 │
│  └────────────────────────────┘                 │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Security & Compliance                           │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ KMS            │  │ Secrets Mgr  │           │
│  │ - Key rotation │  │ - DB creds   │           │
│  │ - Encryption   │  │ - API keys   │           │
│  └────────────────┘  └──────────────┘           │
│                                                  │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Audit Logs     │  │ Compliance   │           │
│  │ (Elasticsearch)│  │ - GDPR       │           │
│  │ - All access   │  │ - PCI DSS    │           │
│  └────────────────┘  └──────────────┘           │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Observability                                   │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Prometheus +   │  │ CloudWatch   │           │
│  │ Grafana        │  │ - Metrics    │           │
│  │ - Metrics      │  │ - Alarms     │           │
│  └────────────────┘  └──────────────┘           │
│                                                  │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Distributed    │  │ ELK Stack    │           │
│  │ Tracing (Jaeger│  │ - Logs       │           │
│  │ - Request flow │  │ - Analysis   │           │
│  └────────────────┘  └──────────────┘           │
└──────────────────────────────────────────────────┘

Deep Dives

1. Security & Encryption

Encryption at Rest:

Database (PostgreSQL):
- AES-256 encryption for data at rest
- Transparent Data Encryption (TDE)
- Encrypted backups

Sensitive Fields (Extra Layer):
- PAN/SSN: Encrypt with application-level encryption
- Bank account: Tokenization + encryption
- Key: AWS KMS with automatic rotation (90 days)

Example:
encrypted_pan = AES_ENCRYPT(pan, kms_key_id)
Stored as: "ENC:v1:kms-key-id:ciphertext"

Encryption in Transit:

All communication uses TLS 1.3:
- Client ↔ API Gateway: HTTPS
- API Gateway ↔ Services: TLS mutual auth
- Services ↔ Database: SSL/TLS
- Services ↔ Redis: TLS enabled

Certificate management:
- AWS Certificate Manager (ACM)
- Auto-renewal
- Strong cipher suites only

Authentication & Authorization:

Multi-Factor Authentication:
1. User enters: PAN + Filing Year + Date of Birth
2. System generates OTP → sent to registered mobile
3. User enters OTP (valid 5 min)
4. System generates JWT token (valid 30 min)
5. Refresh token for extended session (2 hours)

JWT Token Structure:
{
  "sub": "user_id",
  "pan": "XXXXX1234Y",  // masked
  "country": "IN",
  "roles": ["taxpayer"],
  "exp": 1234567890,
  "iat": 1234567890
}

Authorization:
- User can only access their own refund data
- Query: WHERE user_id = jwt.sub AND pan = jwt.pan
- Prevents horizontal privilege escalation

Audit Logging:

Every action logged:
{
  "timestamp": "2026-04-09T10:30:00Z",
  "user_id": "user123",
  "action": "VIEW_REFUND_STATUS",
  "resource": "refund:2024:user123",
  "ip_address": "203.0.113.0",
  "user_agent": "Mozilla/5.0...",
  "result": "SUCCESS",
  "latency_ms": 45
}

Retention: 7 years
Immutable: Write-once, can't be modified
Storage: Elasticsearch (searchable) + S3 (archival)

Compliance:

  • GDPR (EU users): Right to erasure, data portability, consent management
  • PCI DSS: Payment card data protection (if storing bank info)
  • SOC 2 Type II: Annual audit for security controls
  • Data residency: EU data stored in EU region, India data in India

2. Gen AI Chatbot Integration

Architecture:

User Question: "Why is my refund delayed?"

1. User → Chatbot API

2. Retrieve User Context:
   - Query DB: Get user's refund status, amount, filing date
   - Context: "User filed on 2024-02-15, status=PROCESSING"

3. Semantic Search (Vector DB):
   - Embed question: "refund delayed" → vector
   - Search Pinecone: Find top 3 similar FAQs
   - Results:
     * "Common reasons for delays"
     * "Processing times by status"
     * "How to contact support"

4. Generate Prompt:
   System: "You are a helpful tax assistant. User context: [user data]
            FAQs: [retrieved FAQs]
            Question: Why is my refund delayed?"

5. LLM API Call (OpenAI GPT-4 / Claude):
   - Max tokens: 500
   - Temperature: 0.3 (more factual)
   - Response time: 1-2s

6. Post-Processing:
   - Sanitize response (no PII leaked)
   - Add disclaimer: "This is AI-generated..."
   - Track conversation for improvement

7. Return to User

8. Log Conversation:
   - Store in DB for analytics
   - Track: useful/not useful feedback

Components:

Vector Database (Pinecone/Weaviate):

Store embeddings for:
- FAQ documents (500+ common questions)
- Tax rules and regulations
- Status explanations
- Step-by-step guides

Embedding model: text-embedding-ada-002 (OpenAI)
Similarity search: Cosine similarity
Top K: 3 most relevant results

LLM Options:

OptionProsConsCost
OpenAI GPT-4Best quality, low latencyExpensive, data privacy$30/1M tokens
Claude 3.5Great quality, longer contextSimilar cost$15/1M tokens
Self-hosted (Llama 3)Full control, privacyNeed GPU infra, lower quality$500/mo infra
Azure OpenAIEnterprise complianceVendor lock-inSimilar to OpenAI

Recommendation: Start with Claude 3.5 (good balance), fallback to self-hosted for cost savings at scale.

Response Caching:

Common questions cached:
- "How long does processing take?" → Cache for 1 hour
- "What does status X mean?" → Cache for 1 day

Cache key: hash(question + user_status)
Cache hit rate: 40-50% (significant cost savings)

Safety & Compliance:

Guardrails:
1. Input validation: Reject offensive/irrelevant questions
2. PII detection: Never return PAN/bank details in response
3. Hallucination check: Fact-check critical info (amounts, dates)
4. Disclaimer: "AI-generated, not official advice"
5. Human escalation: Complex cases → support ticket

Prompt injection prevention:
- Sanitize user input
- Use system prompts that can't be overridden
- Monitor for suspicious patterns

Cost Optimization:

7.5M chatbot queries/day:
- Average tokens per query: 500 input + 300 output = 800 tokens
- Daily tokens: 7.5M × 800 = 6B tokens
- Cost (Claude): 6B/1M × $15 = $90/day = $32K/month

With caching (50% hit rate):
- Actual LLM calls: 3.75M/day
- Cost: $16K/month

Additional savings:
- Smaller model for simple queries (Claude Haiku): $1/1M tokens
- Self-hosted for 80% of queries, LLM for complex 20%
- Estimated: $5-10K/month at scale

3. Database Sharding Strategy

Why Shard?

  • Single DB won’t scale beyond 100M users
  • Global expansion requires data residency
  • Write load increases with more countries

Sharding Key: country_code

Rationale:
✅ Natural partitioning (users belong to one country)
✅ Data residency compliance (GDPR, local laws)
✅ Query pattern: Users query their own country's data (no cross-shard)
✅ Even distribution (if countries have similar user counts)

Shard mapping:
- India (IN): Shard 1 (60M users)
- US (US): Shard 2 (40M users)
- UK (UK): Shard 3 (10M users)
- EU (multiple countries): Shards 4-7

Routing Logic:

def get_shard(user_country_code):
    shard_map = {
        'IN': 'db-india.example.com',
        'US': 'db-us.example.com',
        'UK': 'db-uk.example.com',
        'DE': 'db-eu-1.example.com',
        'FR': 'db-eu-2.example.com',
        # ...
    }
    return shard_map.get(user_country_code)
 
# In API service
def get_refund_status(user_id, country_code):
    db_conn = get_shard(country_code)
    query = "SELECT * FROM refunds WHERE user_id = ?"
    result = db_conn.execute(query, [user_id])
    return result

Challenges:

  1. Uneven distribution: India has 100M users, small countries have 1M

    • Solution: Split large shards further (India-North, India-South)
  2. User moves countries: Rare for tax data

    • Solution: Keep data in original country, add pointer
  3. Global analytics: Need cross-shard queries

    • Solution: Replicate to data warehouse (Redshift) for analytics

Replication:

Each shard has:
- 1 Primary (writes)
- 2 Replicas in same region (reads)
- 1 Cross-region replica (disaster recovery)

India Shard:
- Primary: Mumbai
- Replica 1: Mumbai (read)
- Replica 2: Delhi (read)
- DR Replica: Singapore (disaster recovery)

4. Handling Tax Season Peak Load

Problem: Traffic spikes 10x during tax season (Jan-Apr)

Strategies:

1. Auto-Scaling:

Normal: 20 web servers (handle 1K TPS)
Peak: 200 web servers (handle 10K TPS)

Auto-scaling policy:
- Scale up when: CPU > 70% for 5 min
- Scale down when: CPU < 30% for 15 min
- Min instances: 20
- Max instances: 500
- Cooldown: 5 minutes

Cost:
- Normal: 20 × $100/mo = $2,000/mo
- Peak (3 months): 200 × $100/mo = $20,000/mo
- Annual: (9 × $2K) + (3 × $20K) = $78K/year

2. Database Read Replicas:

Normal: 1 primary + 2 replicas
Peak: 1 primary + 10 replicas

Read traffic (80% of total):
- 10K TPS × 0.8 = 8K read TPS
- Each replica: 1K TPS
- Need: 8 replicas minimum, provision 10 for buffer

3. Cache Warming:

Before tax season:
- Pre-populate Redis with all active refund records
- 60M records × 2 KB = 120 GB
- Redis cluster: 150 GB (r5.xlarge × 6 nodes)

Result: 95% cache hit rate during first week

4. Rate Limiting:

Per user: 100 requests/minute
Per IP: 1000 requests/minute (office/shared IPs)

Prevents:
- Accidental DDoS (bugs in clients)
- Malicious scraping
- Individual user abuse

5. Graceful Degradation:

If database is slow (p99 > 1s):
- Return cached data (even if stale up to 1 hour)
- Show banner: "Status may be delayed during high traffic"

If chatbot service down:
- Show static FAQ page
- Disable chatbot, keep core status check working

5. Monitoring & Alerting

Key Metrics:

Golden Signals:
1. Latency: p50, p95, p99 for each endpoint
   - Status API: <500ms p99
   - Chatbot API: <2s p99

2. Traffic: Requests per second
   - Alert if > 12K TPS (20% above peak)

3. Errors: Error rate
   - Alert if > 1% (should be <0.1%)
   - 5xx errors: Infrastructure issues
   - 4xx errors: Client issues (rate limiting, auth failures)

4. Saturation: Resource utilization
   - CPU > 80%: Scale up
   - Memory > 85%: Scale up
   - DB connections > 90%: Add read replicas
   - Cache memory > 90%: Increase cache size

Business Metrics:
- Successful authentications/min
- Refund status checks/min
- Chatbot queries/min
- Average response time by endpoint
- Cache hit rate (target: >80%)

Alerting Thresholds:

Critical (PagerDuty, wake up on-call):
- API down (health check fails for 2 min)
- Error rate > 5%
- p99 latency > 2s
- Database primary down

Warning (Slack, investigate next day):
- Error rate > 1%
- p99 latency > 1s
- Cache hit rate < 70%
- Disk space > 80%

Dashboards:

Real-time Dashboard (Grafana):
- QPS by endpoint
- Latency percentiles (p50, p95, p99)
- Error rate
- Cache hit rate
- Active users
- Database query time

Ops Dashboard:
- Server CPU/memory/disk
- Database connections
- Redis memory usage
- Kafka lag
- LLM API costs (real-time burn rate)

Business Dashboard:
- Daily active users
- Refund checks by status
- Chatbot usage
- Peak hour traffic patterns

Step 4: Wrap Up (5 min)

Bottlenecks & Scaling

Current Design Limits:

ComponentCurrent CapacityBottleneck AtSolution
API Servers10K TPS50K TPSAuto-scale to 1000 instances
Redis Cache100K ops/s500K ops/sRedis Cluster (10 nodes)
Database (per shard)10K TPS (read)100K TPSMore replicas (20+)
Database (write)5K TPS10K TPSPartition by year within shard
Kafka100K msg/s1M msg/sMore partitions, more brokers
LLM API1K req/s5K req/sSelf-hosted model cluster

At 10x Scale (1B users globally):

  1. Add more shards per country (shard by state/region)
  2. Cache layer becomes distributed (Redis Cluster with 100+ nodes)
  3. Self-hosted LLM mandatory (cost prohibitive otherwise)
  4. Data warehouse for analytics (can’t query across shards)

Failure Scenarios

FailureImpactMitigationRecovery Time
Primary DB downWrites failFailover to replica (automatic)<30 seconds
Redis downCache misses spikeDegrade gracefully, DB handles load5 minutes (restart)
API Gateway downAll traffic failsMulti-region deployment, DNS failover2 minutes
LLM API downChatbot unavailableFallback to static FAQsN/A (external service)
Entire region downRegion’s users affectedRoute to another region via GeoDNS5-10 minutes
Tax backend system downNo status updatesShow last known status, queue updatesWait for backend

Multi-Region Failover:

Normal: India users → India region
Failure: India region down → Route to Singapore region
- Data replicated to DR region (async)
- Last 5 minutes of updates may be lost (eventual consistency)
- Show banner: "Service running on backup, status may be delayed"

Cost Estimation

Infrastructure Costs (Annual, India only):

Compute:
- Web servers: 20-200 instances (avg 50)
  50 × $100/mo × 12 = $60K/year
- Database: 1 primary + 4 replicas
  5 × $400/mo × 12 = $24K/year

Storage:
- Database: 1 TB × $115/month = $1.4K/year
- Audit logs: 30 TB × $25/month (S3) = $9K/year
- Chatbot history: 7 TB × $25/month = $2.1K/year

Caching:
- Redis: 150 GB cluster
  6 nodes × $137/mo × 12 = $9.8K/year

Networking:
- Load balancer: $20/mo × 12 = $240/year
- Data transfer: 10 TB/mo × $90 = $10.8K/year
- CDN: 10 TB/mo × $40 = $4.8K/year

Kafka:
- 3 brokers × $100/mo × 12 = $3.6K/year

AI/LLM:
- Claude API: $16K/mo × 12 = $192K/year (expensive!)
- OR Self-hosted: $10K/mo × 12 = $120K/year (GPU cluster)

Monitoring:
- DataDog/New Relic: $5K/year

Total (India only):
- With Claude API: ~$320K/year
- With self-hosted LLM: ~$250K/year

Global (5 countries):
- Multiply by 3-5x (not linear due to shared components)
- Estimated: $1M - $1.5M/year

Cost Optimizations:

  1. Reserved instances for base load (40% savings)
  2. Spot instances for tax season spikes (70% savings)
  3. S3 lifecycle policies (move old logs to Glacier)
  4. Self-hosted LLM for 80% of queries, API for complex 20%
  5. Aggressive caching (reduces DB load and costs)

Future Enhancements

Phase 2 Features:

  1. Notifications: Email/SMS when status changes
  2. Appeal workflow: File disputes directly
  3. Document upload: Upload missing documents
  4. Payment tracking: Track refund deposit to bank account
  5. Historical data: View past years’ refunds
  6. Predictive ETA: ML model to predict refund date

Phase 3 (Advanced):

  1. Proactive notifications: AI predicts delays, notifies user
  2. Personalized dashboard: Tax planning, refund maximization tips
  3. Voice interface: “Alexa, check my refund status”
  4. Blockchain: Immutable audit trail for compliance
  5. Real-time updates: WebSocket connection for live status

Alternative Approaches Comparison

Database Choices

OptionProsConsWhen to Use
PostgreSQL (Chosen)Strong consistency, ACID, SQL, matureHarder to scale horizontallyTransactional data, complex queries, strong consistency needed
MongoDBFlexible schema, easy horizontal scalingEventual consistency, no joinsRapidly changing schema, simple queries
DynamoDBFully managed, infinite scale, fastExpensive at scale, limited queriesServerless, unpredictable load, simple key-value
CassandraExtreme write throughput, multi-DCEventual consistency, complex setupWrite-heavy, multi-region, eventual consistency OK

Decision: PostgreSQL for strong consistency in tax data, shard by country for horizontal scaling.


Caching Strategies

StrategyProsConsWhen to Use
Cache-Aside (Chosen)Simple, cache only what’s neededCache miss penalty, stale dataRead-heavy, flexible caching
Write-ThroughCache always consistentSlower writes (double write)Strong consistency needed
Write-BehindFast writes, async to DBRisk of data lossWrite-heavy, eventual consistency OK
Refresh-AheadProactive refresh, no missesComplex, unnecessary refreshesPredictable access patterns

Decision: Cache-aside with 5-minute TTL (refund status doesn’t change frequently).


Gen AI LLM Choices

OptionQualityLatencyCostPrivacyWhen to Use
OpenAI GPT-4Excellent1-2s$30/1M tokensExternal APIBest quality, low volume
Claude 3.5 (Chosen)Excellent1-2s$15/1M tokensExternal APIGood balance, enterprise compliance
Self-Hosted Llama 3Good500ms-1s$500/mo infraFull controlHigh volume, privacy critical
Azure OpenAIExcellent1-2s$30/1M tokensEnterprise SLAExisting Azure infra
AWS BedrockGood-Excellent1-2sVariesAWS ecosystemExisting AWS infra

Decision: Claude 3.5 initially for quality, migrate to hybrid (self-hosted for simple + Claude for complex) at scale.


Authentication Methods

MethodSecurityUXCostWhen to Use
PAN + DOB + OTP (Chosen)HighModerateLowIndia (PAN common), phone available
SSN + DOB + OTPHighModerateLowUS (SSN common)
National ID + BiometricVery HighPoorHighHigh-security requirements
OAuth (Google/Apple)ModerateExcellentLowConsumer apps, less sensitive
Hardware Token (YubiKey)Very HighPoorHighEnterprise, high security

Decision: PAN + OB + OTP for India (PAN universal), adapt to local IDs for other countries.


Sharding Strategies

StrategyProsConsWhen to Use
By Country (Chosen)Data residency, no cross-shard queriesUneven distributionGlobal app, compliance needs
By User ID (Hash)Even distributionCross-shard queries, no data residencySingle country, even load
By YearTime-based queries efficientRecent years get all loadTime-series data
Hybrid (Country + Year)Best of bothMost complexVery large scale, multi-year queries

Decision: By country for data residency, add by-year within shard if needed at scale.


Monitoring Tools

ToolProsConsWhen to Use
DataDogAll-in-one, great UXExpensive ($5-20K/year)Enterprises, full observability
Prometheus + Grafana (Chosen)Open-source, flexibleSelf-managedCost-conscious, customization
New RelicEasy setup, AI insightsExpensiveQuick setup needed
CloudWatch (AWS)Integrated with AWSLimited, AWS-onlyAWS-heavy, basic needs
ELK StackPowerful log analysisComplex setupLog-heavy workloads

Decision: Prometheus + Grafana for metrics (cost-effective), ELK for logs (searchable audit logs).


Key Takeaways

Critical Decisions

  1. Security First: Multi-layer encryption, MFA, audit logs (non-negotiable for tax data)
  2. Shard by Country: Data residency compliance, natural partitioning
  3. Cache-Aside: 80% cache hit rate reduces DB load 5x
  4. Hybrid LLM: Self-hosted for common queries, API for complex (cost optimization)
  5. Auto-Scaling: 10x traffic spikes during tax season handled gracefully

What Makes This Design Good?

Scales globally: Multi-region, sharded database
Secure: Encryption, MFA, audit logs, compliance (GDPR, PCI DSS)
Cost-effective: Caching, auto-scaling, hybrid LLM approach
Resilient: Replication, failover, graceful degradation
Observable: Comprehensive monitoring, alerting, dashboards
Compliant: Data residency, audit logs, GDPR support

Interview Performance Tips

What Impressed:

  • Started with clarifying questions (functional + non-functional requirements)
  • Did capacity estimation (showed quantitative thinking)
  • Iterative design (v1 → v2 → v3, not jumping to complex solution)
  • Security deep dive (encryption, auth, audit logs)
  • Trade-off discussion (PostgreSQL vs DynamoDB, Claude vs self-hosted)
  • Cost awareness (mentioned $250K-320K/year, optimizations)
  • Failure scenarios (multi-region failover, graceful degradation)

What to Avoid:

  • Don’t ignore security (this is tax data!)
  • Don’t over-engineer Gen AI (could be simple FAQ initially)
  • Don’t forget compliance (GDPR, PCI DSS, data residency)
  • Don’t skip monitoring (critical for high-availability systems)

Practice this problem in 45 minutes! Focus on explaining trade-offs and iterating through designs.

Last Updated: 2026-04-09