Design a Tax Refund Status Checking Portal

Difficulty: Medium
Duration: 40-45 minutes
Category: Data-Intensive, Security-Critical
Key Topics: Encryption, Gen AI, Global Scale, Compliance

Problem Statement

Design a tax refund status checking portal that allows taxpayers to check the status of their filed tax returns and expected refund amount. The system should be secure, scalable, and potentially integrate AI-powered assistance.

Initial Context from Interviewer:

Tax refund status checking
Large number of users (~1K TPS)
Data safety critical (encryption required)
Gen AI use case to integrate
Users have already filed their returns

Step 1: Understand the Problem & Establish Scope (5-7 min)

Clarifying Questions to Ask

Functional Requirements:

Q: “What information can users check?”
A: Refund status (filed, processing, approved, rejected, refunded), amount, expected date, tracking number.
Q: “Do users need to authenticate? What’s the auth mechanism?”
A: Yes. Multi-factor auth using PAN/SSN, filing year, date of birth, phone OTP.
Q: “Can users only check or also update/appeal?”
A: Initially just check. Appeal workflow is phase 2 (out of scope for now).
Q: “What’s the Gen AI use case specifically?”
A: AI chatbot to answer common questions, explain status, provide guidance on next steps.
Q: “Is this real-time status or updated periodically?”
A: Near real-time. Backend tax processing system updates status, our system reflects within minutes.
Q: “What regions do we start with and scale to?”
A: Start with India (~100M taxpayers), expand to other countries (US, EU, APAC).

Non-Functional Requirements:

Q: “What’s the expected traffic pattern? Peak periods?”
A: Tax season peaks (Jan-Apr in India). Normal: 1K TPS, Peak: 10K TPS.
Q: “What’s acceptable latency?”
A: Status check: <500ms p99. Chatbot: <2s response time.
Q: “Availability SLA?”
A: 99.9% (8.7 hours downtime/year). Tax season must be higher (99.99%).
Q: “Data retention requirements?”
A: Keep refund data for 7 years (regulatory requirement).
Q: “Security & compliance requirements?”
A: PCI DSS for payment data, encryption at rest and in transit, audit logs, GDPR for EU users.
Q: “How many concurrent users?”
A: Peak: 50K concurrent users during tax season.

Requirements Summary

Functional:

✅ User authentication (multi-factor: PAN + DOB + phone OTP)
✅ Check refund status (filed → processing → approved → refunded)
✅ View refund amount, expected date, tracking details
✅ AI chatbot for questions and guidance
✅ Multi-language support (start: English/Hindi, expand globally)
✅ Integration with backend tax processing system
⬜ Appeal/dispute filing (Phase 2, out of scope)
⬜ Notifications (email/SMS) - nice to have

Non-Functional:

Scale: 1K TPS normal, 10K TPS peak, 100M users (India), expand globally
Latency: Status check <500ms p99, Chatbot <2s
Availability: 99.9% (99.99% during tax season)
Security: Encryption (at rest + in transit), audit logs, compliance (PCI DSS, GDPR)
Data Retention: 7 years
Consistency: Eventual consistency OK (status updated within minutes)

Step 2: Capacity Estimation (3-5 min)

Traffic Estimation

Users: 100M taxpayers in India
Active users during tax season: 60M (60% file returns)

Daily Active Users (DAU):
- Peak tax season: 5M DAU
- Normal: 500K DAU

Requests per user per day:
- Checking status: 3 times/day average (anxious users check often!)
- Chatbot queries: 1-2 times/day

Total requests/day (peak):
- Status checks: 5M users × 3 = 15M requests/day
- Chatbot: 5M users × 1.5 = 7.5M requests/day
- Total: 22.5M requests/day

QPS Calculation:
- Average QPS: 22.5M / 86,400 ≈ 260 QPS
- Peak QPS (assume 10x traffic during work hours 9am-6pm):
  - Peak: 260 × 10 = 2,600 QPS
  - Interviewer mentioned 1K TPS → matches our estimate!
  - Peak during tax season: 10K TPS (given)

Storage Estimation

User Authentication Data:
- 100M users × 500 bytes = 50 GB

Refund Records:
- 60M filings/year
- Each record: user_id, filing_year, status, amount, dates, tracking
- Size per record: 2 KB (with indexes)
- Annual: 60M × 2 KB = 120 GB/year
- 7 years retention: 120 GB × 7 = 840 GB ≈ 1 TB

Audit Logs (every access logged):
- 22.5M requests/day × 365 days = 8.2B logs/year
- Each log: 500 bytes (timestamp, user_id, action, IP, etc.)
- Annual: 8.2B × 500 bytes = 4.1 TB/year
- 7 years: 4.1 TB × 7 = 28.7 TB ≈ 30 TB

Chatbot Conversation History:
- 7.5M conversations/day × 365 = 2.7B conversations/year
- Average conversation: 5 messages × 500 bytes = 2.5 KB
- Annual: 2.7B × 2.5 KB = 6.75 TB/year
- Keep 1 year: 6.75 TB

Total Storage:
- Refund data: 1 TB
- Audit logs: 30 TB
- Chatbot history: 6.75 TB
- Total: ~38 TB (round to 40 TB with overhead)

Bandwidth Estimation

Read Traffic (status checks):
- 15M reads/day × 2 KB = 30 GB/day
- Per second: 30 GB / 86,400 ≈ 347 KB/s ≈ 2.8 Mbps

Chatbot Traffic:
- 7.5M interactions/day
- Average: 5 messages × 500 bytes = 2.5 KB per conversation
- Daily: 7.5M × 2.5 KB = 18.75 GB/day
- Per second: 217 KB/s ≈ 1.7 Mbps

Total Bandwidth:
- Read: ~3 Mbps average, 30 Mbps peak
- Write (logs, updates): ~1 Mbps
- Total: 5-40 Mbps (very manageable)

Summary

Metric	Value
Users	100M (India), expand globally
DAU (peak)	5M
QPS (average)	260 QPS
QPS (peak)	2,600 QPS (10K during tax season)
Storage	40 TB (with 7-year retention)
Bandwidth	5-40 Mbps
Latency Target	<500ms (status), <2s (chatbot)
Availability	99.9% (99.99% tax season)

Step 3: High-Level Design & Iterations (20-25 min)

Design Iteration 1: Basic Architecture (MVP)

Goal: Simple, working system for India market

┌─────────────┐
│   Users     │
│ (Web/Mobile)│
└──────┬──────┘
       │ HTTPS
       │
┌──────▼──────────────────────────────────────┐
│         Load Balancer (ALB)                  │
└──────┬──────────────────────────────────────┘
       │
┌──────▼──────────────┐
│  Web Servers         │
│  (Authentication +   │
│   Status API)        │
└──────┬──────────────┘
       │
┌──────▼──────────────────────────────────────┐
│           Database (PostgreSQL)              │
│   Tables: users, refunds, audit_logs         │
│   Encryption: AES-256 at rest                │
└──────▲──────────────────────────────────────┘
       │
       │ Updates (batch sync every 5 min)
       │
┌──────┴──────────────┐
│  Tax Processing      │
│  System (Backend)    │
│  (External)          │
└─────────────────────┘

Components:

Load Balancer: Distribute traffic, SSL termination
Web Servers: Handle auth, status queries, business logic
PostgreSQL: Store refund data, user info, audit logs
Batch Sync: Periodic updates from tax backend

Pros:

✅ Simple to implement
✅ Strong consistency (single DB)
✅ Easy to secure (encryption at rest)

Cons:

❌ Single point of failure (database)
❌ Doesn’t scale beyond 1K QPS
❌ No caching (slow reads)
❌ No Gen AI integration
❌ Can’t handle 10K peak TPS

When this works: Early stage, India only, <1K QPS

Design Iteration 2: Scalable with Caching & AI

Improvements:

Add caching layer for read-heavy workload
Database replication for read scalability
Integrate Gen AI chatbot
Separate auth service

┌─────────────┐
│   Users     │
└──────┬──────┘
       │ HTTPS
       │
┌──────▼───────────────────────────────────────┐
│     CDN (CloudFront)                          │
│     - Static assets (JS, CSS, images)         │
└──────┬───────────────────────────────────────┘
       │
┌──────▼───────────────────────────────────────┐
│     API Gateway + WAF                         │
│     - Rate limiting (100 req/min per user)    │
│     - DDoS protection                         │
└──────┬───────────────────────────────────────┘
       │
       ├──────────────────┬────────────────────┐
       │                  │                    │
┌──────▼──────┐  ┌───────▼────────┐  ┌───────▼──────────┐
│   Auth      │  │  Status API    │  │  AI Chatbot API  │
│   Service   │  │  Service       │  │                  │
│ (Stateless) │  │  (Stateless)   │  │  (Gen AI)        │
└──────┬──────┘  └────────┬───────┘  └───────┬──────────┘
       │                  │                    │
       │         ┌────────▼────────┐           │
       │         │  Redis Cache     │           │
       │         │  - Hot refund    │           │
       │         │    status (TTL:  │           │
       │         │    5 min)        │           │
       │         └────────┬────────┘           │
       │                  │                    │
┌──────▼──────────────────▼────────────────────▼──────┐
│              Database Layer                          │
│  ┌──────────────┐    ┌──────────┐  ┌──────────┐    │
│  │  Primary DB  │───▶│ Replica1 │  │ Replica2 │    │
│  │ (PostgreSQL) │    │  (Read)  │  │  (Read)  │    │
│  │   (Write)    │    └──────────┘  └──────────┘    │
│  └──────▲───────┘                                    │
└─────────┼──────────────────────────────────────────┘
          │
          │ Kafka topic: refund_updates
          │
┌─────────▼────────────┐
│  Message Queue       │
│  (Kafka)             │
│  - Status updates    │
│  - Audit events      │
└─────────▲────────────┘
          │
┌─────────┴────────────┐
│  Tax Processing      │
│  System (Backend)    │
│  Publishes updates   │
└──────────────────────┘

Separate storage for logs:
┌──────────────────────┐
│  Elasticsearch       │
│  - Audit logs        │
│  - Search & analysis │
└──────────────────────┘

New Components:

API Gateway + WAF: Rate limiting, DDoS protection
Auth Service: Separate microservice for authentication (PAN + OTP)
Redis Cache: Cache hot refund data (80% hit rate), TTL 5 min
DB Replication: Primary for writes, replicas for reads
Kafka: Event stream for status updates from tax backend
AI Chatbot API: Connects to LLM (OpenAI, Claude, or local model)
Elasticsearch: Audit logs with search capability

Data Flow:

Read Path (Status Check):

User → API Gateway → Status API → Check Redis cache
  ├─ Cache hit (80%): Return from Redis (10-20ms)
  └─ Cache miss (20%): Query DB replica → Store in cache → Return (100-200ms)

Write Path (Status Update):

Tax Backend → Kafka → Consumer Service → Update Primary DB → Invalidate cache

Chatbot Path:

User → API Gateway → AI Chatbot API → Retrieve context from DB →
  LLM API (OpenAI/Claude) → Generate response → Return to user

Pros:

✅ Scales to 10K+ TPS (cache handles 80% reads)
✅ High availability (DB replicas, cache)
✅ Gen AI integration
✅ Better security (API Gateway + WAF)
✅ Audit logs searchable (Elasticsearch)

Cons:

❌ Single region (can’t scale globally)
❌ Cache invalidation complexity
❌ DB still bottleneck for writes
❌ No data residency for GDPR

When this works: India + neighboring countries, 10K peak TPS

Design Iteration 3: Global Scale (Final Architecture)

Improvements:

Multi-region deployment
Database sharding by country
Read replicas in multiple regions
Edge caching globally
Advanced security (encryption, key rotation)

                    ┌────────────────────┐
                    │   Global Users     │
                    │  (India, US, EU)   │
                    └──────────┬─────────┘
                               │
                    ┌──────────▼─────────────┐
                    │  Route 53 (GeoDNS)     │
                    │  Routes to nearest     │
                    │  region                │
                    └──┬────────────────┬────┘
                       │                │
            ┌──────────▼─────┐   ┌─────▼──────────┐
            │  India Region  │   │  US Region     │
            │  (Primary)     │   │  (Secondary)   │
            └──────┬─────────┘   └─────┬──────────┘
                   │                   │
        ┌──────────▼───────────────────▼──────────┐
        │         CloudFront (CDN)                 │
        │   - Global edge caching                  │
        │   - Static assets + API responses        │
        └──────────────────┬──────────────────────┘
                           │
        ┌──────────────────▼──────────────────────┐
        │      API Gateway + AWS WAF               │
        │   - Rate limiting: 100 req/min           │
        │   - DDoS protection                      │
        │   - Request signing & validation         │
        └───┬──────────────────────────────────┬───┘
            │                                  │
  ┌─────────▼────────┐              ┌─────────▼─────────┐
  │  Auth Service    │              │  Status API        │
  │  - MFA (OTP)     │              │  - Stateless       │
  │  - JWT tokens    │              │  - Read-heavy      │
  └─────────┬────────┘              └─────────┬─────────┘
            │                                  │
  ┌─────────▼────────────┐         ┌──────────▼─────────┐
  │  Redis (Elasticache) │         │  Redis (Cache)     │
  │  - Session store     │         │  - Refund data     │
  │  - Distributed       │         │  - TTL: 5 min      │
  └──────────────────────┘         │  - Hit rate: 80%   │
                                   └──────────┬─────────┘
                                              │
                             ┌────────────────▼────────────────┐
                             │      Database Sharding          │
                             │                                 │
  ┌──────────────────────────┼─────────────────────────────────┼──┐
  │  India Shard             │  US Shard          │ EU Shard      │
  │  ┌─────────┐  ┌────────┐ │  ┌──────┐ ┌─────┐ │ ┌──────┐      │
  │  │Primary  │─▶│Replica │ │  │Primary││Repl.│ │ │Primary│     │
  │  │(Write)  │  │(Read)  │ │  │(Write)││(Read)│ │ │(Write)│    │
  │  └────▲────┘  └────────┘ │  └───▲──┘ └─────┘ │ └───▲──┘     │
  └───────┼───────────────────┴──────┼────────────┴─────┼────────┘
          │                          │                   │
  ┌───────▼──────────────────────────▼───────────────────▼────┐
  │              Kafka (Multi-region)                          │
  │   Topics: refund_updates_india, refund_updates_us, etc.   │
  │   Cross-region replication for DR                          │
  └───────▲────────────────────────────────────────────────────┘
          │
  ┌───────┴──────────────┐
  │  Tax Processing      │
  │  Systems (Regional)  │
  │  - India Tax Dept    │
  │  - IRS (US)          │
  │  - EU Tax Systems    │
  └──────────────────────┘

Additional Components:

┌──────────────────────────────────────────────────┐
│  AI Chatbot Infrastructure                       │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Chatbot API    │  │ Vector DB    │           │
│  │ (Stateless)    │─▶│ (Pinecone)   │           │
│  └────────┬───────┘  │ - FAQs       │           │
│           │          │ - Tax rules  │           │
│           │          └──────────────┘           │
│  ┌────────▼───────────────────┐                 │
│  │  LLM Service               │                 │
│  │  - OpenAI GPT-4 / Claude  │                 │
│  │  - Self-hosted (fallback) │                 │
│  │  - Response caching       │                 │
│  └────────────────────────────┘                 │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Security & Compliance                           │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ KMS            │  │ Secrets Mgr  │           │
│  │ - Key rotation │  │ - DB creds   │           │
│  │ - Encryption   │  │ - API keys   │           │
│  └────────────────┘  └──────────────┘           │
│                                                  │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Audit Logs     │  │ Compliance   │           │
│  │ (Elasticsearch)│  │ - GDPR       │           │
│  │ - All access   │  │ - PCI DSS    │           │
│  └────────────────┘  └──────────────┘           │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Observability                                   │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Prometheus +   │  │ CloudWatch   │           │
│  │ Grafana        │  │ - Metrics    │           │
│  │ - Metrics      │  │ - Alarms     │           │
│  └────────────────┘  └──────────────┘           │
│                                                  │
│  ┌────────────────┐  ┌──────────────┐           │
│  │ Distributed    │  │ ELK Stack    │           │
│  │ Tracing (Jaeger│  │ - Logs       │           │
│  │ - Request flow │  │ - Analysis   │           │
│  └────────────────┘  └──────────────┘           │
└──────────────────────────────────────────────────┘

Deep Dives

1. Security & Encryption

Encryption at Rest:

Database (PostgreSQL):
- AES-256 encryption for data at rest
- Transparent Data Encryption (TDE)
- Encrypted backups

Sensitive Fields (Extra Layer):
- PAN/SSN: Encrypt with application-level encryption
- Bank account: Tokenization + encryption
- Key: AWS KMS with automatic rotation (90 days)

Example:
encrypted_pan = AES_ENCRYPT(pan, kms_key_id)
Stored as: "ENC:v1:kms-key-id:ciphertext"

Encryption in Transit:

All communication uses TLS 1.3:
- Client ↔ API Gateway: HTTPS
- API Gateway ↔ Services: TLS mutual auth
- Services ↔ Database: SSL/TLS
- Services ↔ Redis: TLS enabled

Certificate management:
- AWS Certificate Manager (ACM)
- Auto-renewal
- Strong cipher suites only

Authentication & Authorization:

Multi-Factor Authentication:
1. User enters: PAN + Filing Year + Date of Birth
2. System generates OTP → sent to registered mobile
3. User enters OTP (valid 5 min)
4. System generates JWT token (valid 30 min)
5. Refresh token for extended session (2 hours)

JWT Token Structure:
{
  "sub": "user_id",
  "pan": "XXXXX1234Y",  // masked
  "country": "IN",
  "roles": ["taxpayer"],
  "exp": 1234567890,
  "iat": 1234567890
}

Authorization:
- User can only access their own refund data
- Query: WHERE user_id = jwt.sub AND pan = jwt.pan
- Prevents horizontal privilege escalation

Audit Logging:

Every action logged:
{
  "timestamp": "2026-04-09T10:30:00Z",
  "user_id": "user123",
  "action": "VIEW_REFUND_STATUS",
  "resource": "refund:2024:user123",
  "ip_address": "203.0.113.0",
  "user_agent": "Mozilla/5.0...",
  "result": "SUCCESS",
  "latency_ms": 45
}

Retention: 7 years
Immutable: Write-once, can't be modified
Storage: Elasticsearch (searchable) + S3 (archival)

Compliance:

GDPR (EU users): Right to erasure, data portability, consent management
PCI DSS: Payment card data protection (if storing bank info)
SOC 2 Type II: Annual audit for security controls
Data residency: EU data stored in EU region, India data in India

2. Gen AI Chatbot Integration

Architecture:

User Question: "Why is my refund delayed?"

1. User → Chatbot API

2. Retrieve User Context:
   - Query DB: Get user's refund status, amount, filing date
   - Context: "User filed on 2024-02-15, status=PROCESSING"

3. Semantic Search (Vector DB):
   - Embed question: "refund delayed" → vector
   - Search Pinecone: Find top 3 similar FAQs
   - Results:
     * "Common reasons for delays"
     * "Processing times by status"
     * "How to contact support"

4. Generate Prompt:
   System: "You are a helpful tax assistant. User context: [user data]
            FAQs: [retrieved FAQs]
            Question: Why is my refund delayed?"

5. LLM API Call (OpenAI GPT-4 / Claude):
   - Max tokens: 500
   - Temperature: 0.3 (more factual)
   - Response time: 1-2s

6. Post-Processing:
   - Sanitize response (no PII leaked)
   - Add disclaimer: "This is AI-generated..."
   - Track conversation for improvement

7. Return to User

8. Log Conversation:
   - Store in DB for analytics
   - Track: useful/not useful feedback

Components:

Vector Database (Pinecone/Weaviate):

Store embeddings for:
- FAQ documents (500+ common questions)
- Tax rules and regulations
- Status explanations
- Step-by-step guides

Embedding model: text-embedding-ada-002 (OpenAI)
Similarity search: Cosine similarity
Top K: 3 most relevant results

LLM Options:

Option	Pros	Cons	Cost
OpenAI GPT-4	Best quality, low latency	Expensive, data privacy	$30/1M tokens
Claude 3.5	Great quality, longer context	Similar cost	$15/1M tokens
Self-hosted (Llama 3)	Full control, privacy	Need GPU infra, lower quality	$500/mo infra
Azure OpenAI	Enterprise compliance	Vendor lock-in	Similar to OpenAI

Recommendation: Start with Claude 3.5 (good balance), fallback to self-hosted for cost savings at scale.

Response Caching:

Common questions cached:
- "How long does processing take?" → Cache for 1 hour
- "What does status X mean?" → Cache for 1 day

Cache key: hash(question + user_status)
Cache hit rate: 40-50% (significant cost savings)

Safety & Compliance:

Guardrails:
1. Input validation: Reject offensive/irrelevant questions
2. PII detection: Never return PAN/bank details in response
3. Hallucination check: Fact-check critical info (amounts, dates)
4. Disclaimer: "AI-generated, not official advice"
5. Human escalation: Complex cases → support ticket

Prompt injection prevention:
- Sanitize user input
- Use system prompts that can't be overridden
- Monitor for suspicious patterns

Cost Optimization:

7.5M chatbot queries/day:
- Average tokens per query: 500 input + 300 output = 800 tokens
- Daily tokens: 7.5M × 800 = 6B tokens
- Cost (Claude): 6B/1M × $15 = $90/day = $32K/month

With caching (50% hit rate):
- Actual LLM calls: 3.75M/day
- Cost: $16K/month

Additional savings:
- Smaller model for simple queries (Claude Haiku): $1/1M tokens
- Self-hosted for 80% of queries, LLM for complex 20%
- Estimated: $5-10K/month at scale

3. Database Sharding Strategy

Why Shard?

Single DB won’t scale beyond 100M users
Global expansion requires data residency
Write load increases with more countries

Sharding Key: country_code

Rationale:
✅ Natural partitioning (users belong to one country)
✅ Data residency compliance (GDPR, local laws)
✅ Query pattern: Users query their own country's data (no cross-shard)
✅ Even distribution (if countries have similar user counts)

Shard mapping:
- India (IN): Shard 1 (60M users)
- US (US): Shard 2 (40M users)
- UK (UK): Shard 3 (10M users)
- EU (multiple countries): Shards 4-7

Routing Logic:

def get_shard(user_country_code):
    shard_map = {
        'IN': 'db-india.example.com',
        'US': 'db-us.example.com',
        'UK': 'db-uk.example.com',
        'DE': 'db-eu-1.example.com',
        'FR': 'db-eu-2.example.com',
        # ...
    }
    return shard_map.get(user_country_code)
 
# In API service
def get_refund_status(user_id, country_code):
    db_conn = get_shard(country_code)
    query = "SELECT * FROM refunds WHERE user_id = ?"
    result = db_conn.execute(query, [user_id])
    return result

Challenges:

Uneven distribution: India has 100M users, small countries have 1M
- Solution: Split large shards further (India-North, India-South)
User moves countries: Rare for tax data
- Solution: Keep data in original country, add pointer
Global analytics: Need cross-shard queries
- Solution: Replicate to data warehouse (Redshift) for analytics

Replication:

Each shard has:
- 1 Primary (writes)
- 2 Replicas in same region (reads)
- 1 Cross-region replica (disaster recovery)

India Shard:
- Primary: Mumbai
- Replica 1: Mumbai (read)
- Replica 2: Delhi (read)
- DR Replica: Singapore (disaster recovery)

4. Handling Tax Season Peak Load

Problem: Traffic spikes 10x during tax season (Jan-Apr)

Strategies:

1. Auto-Scaling:

Normal: 20 web servers (handle 1K TPS)
Peak: 200 web servers (handle 10K TPS)

Auto-scaling policy:
- Scale up when: CPU > 70% for 5 min
- Scale down when: CPU < 30% for 15 min
- Min instances: 20
- Max instances: 500
- Cooldown: 5 minutes

Cost:
- Normal: 20 × $100/mo = $2,000/mo
- Peak (3 months): 200 × $100/mo = $20,000/mo
- Annual: (9 × $2K) + (3 × $20K) = $78K/year

2. Database Read Replicas:

Normal: 1 primary + 2 replicas
Peak: 1 primary + 10 replicas

Read traffic (80% of total):
- 10K TPS × 0.8 = 8K read TPS
- Each replica: 1K TPS
- Need: 8 replicas minimum, provision 10 for buffer

3. Cache Warming:

Before tax season:
- Pre-populate Redis with all active refund records
- 60M records × 2 KB = 120 GB
- Redis cluster: 150 GB (r5.xlarge × 6 nodes)

Result: 95% cache hit rate during first week

4. Rate Limiting:

Per user: 100 requests/minute
Per IP: 1000 requests/minute (office/shared IPs)

Prevents:
- Accidental DDoS (bugs in clients)
- Malicious scraping
- Individual user abuse

5. Graceful Degradation:

If database is slow (p99 > 1s):
- Return cached data (even if stale up to 1 hour)
- Show banner: "Status may be delayed during high traffic"

If chatbot service down:
- Show static FAQ page
- Disable chatbot, keep core status check working

5. Monitoring & Alerting

Key Metrics:

Golden Signals:
1. Latency: p50, p95, p99 for each endpoint
   - Status API: <500ms p99
   - Chatbot API: <2s p99

2. Traffic: Requests per second
   - Alert if > 12K TPS (20% above peak)

3. Errors: Error rate
   - Alert if > 1% (should be <0.1%)
   - 5xx errors: Infrastructure issues
   - 4xx errors: Client issues (rate limiting, auth failures)

4. Saturation: Resource utilization
   - CPU > 80%: Scale up
   - Memory > 85%: Scale up
   - DB connections > 90%: Add read replicas
   - Cache memory > 90%: Increase cache size

Business Metrics:
- Successful authentications/min
- Refund status checks/min
- Chatbot queries/min
- Average response time by endpoint
- Cache hit rate (target: >80%)

Alerting Thresholds:

Critical (PagerDuty, wake up on-call):
- API down (health check fails for 2 min)
- Error rate > 5%
- p99 latency > 2s
- Database primary down

Warning (Slack, investigate next day):
- Error rate > 1%
- p99 latency > 1s
- Cache hit rate < 70%
- Disk space > 80%

Dashboards:

Real-time Dashboard (Grafana):
- QPS by endpoint
- Latency percentiles (p50, p95, p99)
- Error rate
- Cache hit rate
- Active users
- Database query time

Ops Dashboard:
- Server CPU/memory/disk
- Database connections
- Redis memory usage
- Kafka lag
- LLM API costs (real-time burn rate)

Business Dashboard:
- Daily active users
- Refund checks by status
- Chatbot usage
- Peak hour traffic patterns

Step 4: Wrap Up (5 min)

Bottlenecks & Scaling

Current Design Limits:

Component	Current Capacity	Bottleneck At	Solution
API Servers	10K TPS	50K TPS	Auto-scale to 1000 instances
Redis Cache	100K ops/s	500K ops/s	Redis Cluster (10 nodes)
Database (per shard)	10K TPS (read)	100K TPS	More replicas (20+)
Database (write)	5K TPS	10K TPS	Partition by year within shard
Kafka	100K msg/s	1M msg/s	More partitions, more brokers
LLM API	1K req/s	5K req/s	Self-hosted model cluster

At 10x Scale (1B users globally):

Add more shards per country (shard by state/region)
Cache layer becomes distributed (Redis Cluster with 100+ nodes)
Self-hosted LLM mandatory (cost prohibitive otherwise)
Data warehouse for analytics (can’t query across shards)

Failure Scenarios

Failure	Impact	Mitigation	Recovery Time
Primary DB down	Writes fail	Failover to replica (automatic)	<30 seconds
Redis down	Cache misses spike	Degrade gracefully, DB handles load	5 minutes (restart)
API Gateway down	All traffic fails	Multi-region deployment, DNS failover	2 minutes
LLM API down	Chatbot unavailable	Fallback to static FAQs	N/A (external service)
Entire region down	Region’s users affected	Route to another region via GeoDNS	5-10 minutes
Tax backend system down	No status updates	Show last known status, queue updates	Wait for backend

Multi-Region Failover:

Normal: India users → India region
Failure: India region down → Route to Singapore region
- Data replicated to DR region (async)
- Last 5 minutes of updates may be lost (eventual consistency)
- Show banner: "Service running on backup, status may be delayed"

Cost Estimation

Infrastructure Costs (Annual, India only):

Compute:
- Web servers: 20-200 instances (avg 50)
  50 × $100/mo × 12 = $60K/year
- Database: 1 primary + 4 replicas
  5 × $400/mo × 12 = $24K/year

Storage:
- Database: 1 TB × $115/month = $1.4K/year
- Audit logs: 30 TB × $25/month (S3) = $9K/year
- Chatbot history: 7 TB × $25/month = $2.1K/year

Caching:
- Redis: 150 GB cluster
  6 nodes × $137/mo × 12 = $9.8K/year

Networking:
- Load balancer: $20/mo × 12 = $240/year
- Data transfer: 10 TB/mo × $90 = $10.8K/year
- CDN: 10 TB/mo × $40 = $4.8K/year

Kafka:
- 3 brokers × $100/mo × 12 = $3.6K/year

AI/LLM:
- Claude API: $16K/mo × 12 = $192K/year (expensive!)
- OR Self-hosted: $10K/mo × 12 = $120K/year (GPU cluster)

Monitoring:
- DataDog/New Relic: $5K/year

Total (India only):
- With Claude API: ~$320K/year
- With self-hosted LLM: ~$250K/year

Global (5 countries):
- Multiply by 3-5x (not linear due to shared components)
- Estimated: $1M - $1.5M/year

Cost Optimizations:

Reserved instances for base load (40% savings)
Spot instances for tax season spikes (70% savings)
S3 lifecycle policies (move old logs to Glacier)
Self-hosted LLM for 80% of queries, API for complex 20%
Aggressive caching (reduces DB load and costs)

Future Enhancements

Phase 2 Features:

Notifications: Email/SMS when status changes
Appeal workflow: File disputes directly
Document upload: Upload missing documents
Payment tracking: Track refund deposit to bank account
Historical data: View past years’ refunds
Predictive ETA: ML model to predict refund date

Phase 3 (Advanced):

Proactive notifications: AI predicts delays, notifies user
Personalized dashboard: Tax planning, refund maximization tips
Voice interface: “Alexa, check my refund status”
Blockchain: Immutable audit trail for compliance
Real-time updates: WebSocket connection for live status

Alternative Approaches Comparison

Database Choices

Option	Pros	Cons	When to Use
PostgreSQL (Chosen)	Strong consistency, ACID, SQL, mature	Harder to scale horizontally	Transactional data, complex queries, strong consistency needed
MongoDB	Flexible schema, easy horizontal scaling	Eventual consistency, no joins	Rapidly changing schema, simple queries
DynamoDB	Fully managed, infinite scale, fast	Expensive at scale, limited queries	Serverless, unpredictable load, simple key-value
Cassandra	Extreme write throughput, multi-DC	Eventual consistency, complex setup	Write-heavy, multi-region, eventual consistency OK

Decision: PostgreSQL for strong consistency in tax data, shard by country for horizontal scaling.

Caching Strategies

Strategy	Pros	Cons	When to Use
Cache-Aside (Chosen)	Simple, cache only what’s needed	Cache miss penalty, stale data	Read-heavy, flexible caching
Write-Through	Cache always consistent	Slower writes (double write)	Strong consistency needed
Write-Behind	Fast writes, async to DB	Risk of data loss	Write-heavy, eventual consistency OK
Refresh-Ahead	Proactive refresh, no misses	Complex, unnecessary refreshes	Predictable access patterns

Decision: Cache-aside with 5-minute TTL (refund status doesn’t change frequently).

Gen AI LLM Choices

Option	Quality	Latency	Cost	Privacy	When to Use
OpenAI GPT-4	Excellent	1-2s	$30/1M tokens	External API	Best quality, low volume
Claude 3.5 (Chosen)	Excellent	1-2s	$15/1M tokens	External API	Good balance, enterprise compliance
Self-Hosted Llama 3	Good	500ms-1s	$500/mo infra	Full control	High volume, privacy critical
Azure OpenAI	Excellent	1-2s	$30/1M tokens	Enterprise SLA	Existing Azure infra
AWS Bedrock	Good-Excellent	1-2s	Varies	AWS ecosystem	Existing AWS infra

Decision: Claude 3.5 initially for quality, migrate to hybrid (self-hosted for simple + Claude for complex) at scale.

Authentication Methods

Method	Security	UX	Cost	When to Use
PAN + DOB + OTP (Chosen)	High	Moderate	Low	India (PAN common), phone available
SSN + DOB + OTP	High	Moderate	Low	US (SSN common)
National ID + Biometric	Very High	Poor	High	High-security requirements
OAuth (Google/Apple)	Moderate	Excellent	Low	Consumer apps, less sensitive
Hardware Token (YubiKey)	Very High	Poor	High	Enterprise, high security

Decision: PAN + OB + OTP for India (PAN universal), adapt to local IDs for other countries.

Sharding Strategies

Strategy	Pros	Cons	When to Use
By Country (Chosen)	Data residency, no cross-shard queries	Uneven distribution	Global app, compliance needs
By User ID (Hash)	Even distribution	Cross-shard queries, no data residency	Single country, even load
By Year	Time-based queries efficient	Recent years get all load	Time-series data
Hybrid (Country + Year)	Best of both	Most complex	Very large scale, multi-year queries

Decision: By country for data residency, add by-year within shard if needed at scale.

Monitoring Tools

Tool	Pros	Cons	When to Use
DataDog	All-in-one, great UX	Expensive ($5-20K/year)	Enterprises, full observability
Prometheus + Grafana (Chosen)	Open-source, flexible	Self-managed	Cost-conscious, customization
New Relic	Easy setup, AI insights	Expensive	Quick setup needed
CloudWatch (AWS)	Integrated with AWS	Limited, AWS-only	AWS-heavy, basic needs
ELK Stack	Powerful log analysis	Complex setup	Log-heavy workloads

Decision: Prometheus + Grafana for metrics (cost-effective), ELK for logs (searchable audit logs).

Key Takeaways

Critical Decisions

Security First: Multi-layer encryption, MFA, audit logs (non-negotiable for tax data)
Shard by Country: Data residency compliance, natural partitioning
Cache-Aside: 80% cache hit rate reduces DB load 5x
Hybrid LLM: Self-hosted for common queries, API for complex (cost optimization)
Auto-Scaling: 10x traffic spikes during tax season handled gracefully

What Makes This Design Good?

✅ Scales globally: Multi-region, sharded database
✅ Secure: Encryption, MFA, audit logs, compliance (GDPR, PCI DSS)
✅ Cost-effective: Caching, auto-scaling, hybrid LLM approach
✅ Resilient: Replication, failover, graceful degradation
✅ Observable: Comprehensive monitoring, alerting, dashboards
✅ Compliant: Data residency, audit logs, GDPR support

Interview Performance Tips

What Impressed:

Started with clarifying questions (functional + non-functional requirements)
Did capacity estimation (showed quantitative thinking)
Iterative design (v1 → v2 → v3, not jumping to complex solution)
Security deep dive (encryption, auth, audit logs)
Trade-off discussion (PostgreSQL vs DynamoDB, Claude vs self-hosted)
Cost awareness (mentioned $250K-320K/year, optimizations)
Failure scenarios (multi-region failover, graceful degradation)

What to Avoid:

Don’t ignore security (this is tax data!)
Don’t over-engineer Gen AI (could be simple FAQ initially)
Don’t forget compliance (GDPR, PCI DSS, data residency)
Don’t skip monitoring (critical for high-availability systems)

Practice this problem in 45 minutes! Focus on explaining trade-offs and iterating through designs.

Last Updated: 2026-04-09

Study Notes by Niladri & AI

Explorer

tax-refund-portal