Design a Tax Refund Status Checking Portal
Difficulty: Medium
Duration: 40-45 minutes
Category: Data-Intensive, Security-Critical
Key Topics: Encryption, Gen AI, Global Scale, Compliance
Problem Statement
Design a tax refund status checking portal that allows taxpayers to check the status of their filed tax returns and expected refund amount. The system should be secure, scalable, and potentially integrate AI-powered assistance.
Initial Context from Interviewer:
- Tax refund status checking
- Large number of users (~1K TPS)
- Data safety critical (encryption required)
- Gen AI use case to integrate
- Users have already filed their returns
Step 1: Understand the Problem & Establish Scope (5-7 min)
Clarifying Questions to Ask
Functional Requirements:
-
Q: “What information can users check?”
A: Refund status (filed, processing, approved, rejected, refunded), amount, expected date, tracking number. -
Q: “Do users need to authenticate? What’s the auth mechanism?”
A: Yes. Multi-factor auth using PAN/SSN, filing year, date of birth, phone OTP. -
Q: “Can users only check or also update/appeal?”
A: Initially just check. Appeal workflow is phase 2 (out of scope for now). -
Q: “What’s the Gen AI use case specifically?”
A: AI chatbot to answer common questions, explain status, provide guidance on next steps. -
Q: “Is this real-time status or updated periodically?”
A: Near real-time. Backend tax processing system updates status, our system reflects within minutes. -
Q: “What regions do we start with and scale to?”
A: Start with India (~100M taxpayers), expand to other countries (US, EU, APAC).
Non-Functional Requirements:
-
Q: “What’s the expected traffic pattern? Peak periods?”
A: Tax season peaks (Jan-Apr in India). Normal: 1K TPS, Peak: 10K TPS. -
Q: “What’s acceptable latency?”
A: Status check: <500ms p99. Chatbot: <2s response time. -
Q: “Availability SLA?”
A: 99.9% (8.7 hours downtime/year). Tax season must be higher (99.99%). -
Q: “Data retention requirements?”
A: Keep refund data for 7 years (regulatory requirement). -
Q: “Security & compliance requirements?”
A: PCI DSS for payment data, encryption at rest and in transit, audit logs, GDPR for EU users. -
Q: “How many concurrent users?”
A: Peak: 50K concurrent users during tax season.
Requirements Summary
Functional:
- ✅ User authentication (multi-factor: PAN + DOB + phone OTP)
- ✅ Check refund status (filed → processing → approved → refunded)
- ✅ View refund amount, expected date, tracking details
- ✅ AI chatbot for questions and guidance
- ✅ Multi-language support (start: English/Hindi, expand globally)
- ✅ Integration with backend tax processing system
- ⬜ Appeal/dispute filing (Phase 2, out of scope)
- ⬜ Notifications (email/SMS) - nice to have
Non-Functional:
- Scale: 1K TPS normal, 10K TPS peak, 100M users (India), expand globally
- Latency: Status check <500ms p99, Chatbot <2s
- Availability: 99.9% (99.99% during tax season)
- Security: Encryption (at rest + in transit), audit logs, compliance (PCI DSS, GDPR)
- Data Retention: 7 years
- Consistency: Eventual consistency OK (status updated within minutes)
Step 2: Capacity Estimation (3-5 min)
Traffic Estimation
Users: 100M taxpayers in India
Active users during tax season: 60M (60% file returns)
Daily Active Users (DAU):
- Peak tax season: 5M DAU
- Normal: 500K DAU
Requests per user per day:
- Checking status: 3 times/day average (anxious users check often!)
- Chatbot queries: 1-2 times/day
Total requests/day (peak):
- Status checks: 5M users × 3 = 15M requests/day
- Chatbot: 5M users × 1.5 = 7.5M requests/day
- Total: 22.5M requests/day
QPS Calculation:
- Average QPS: 22.5M / 86,400 ≈ 260 QPS
- Peak QPS (assume 10x traffic during work hours 9am-6pm):
- Peak: 260 × 10 = 2,600 QPS
- Interviewer mentioned 1K TPS → matches our estimate!
- Peak during tax season: 10K TPS (given)
Storage Estimation
User Authentication Data:
- 100M users × 500 bytes = 50 GB
Refund Records:
- 60M filings/year
- Each record: user_id, filing_year, status, amount, dates, tracking
- Size per record: 2 KB (with indexes)
- Annual: 60M × 2 KB = 120 GB/year
- 7 years retention: 120 GB × 7 = 840 GB ≈ 1 TB
Audit Logs (every access logged):
- 22.5M requests/day × 365 days = 8.2B logs/year
- Each log: 500 bytes (timestamp, user_id, action, IP, etc.)
- Annual: 8.2B × 500 bytes = 4.1 TB/year
- 7 years: 4.1 TB × 7 = 28.7 TB ≈ 30 TB
Chatbot Conversation History:
- 7.5M conversations/day × 365 = 2.7B conversations/year
- Average conversation: 5 messages × 500 bytes = 2.5 KB
- Annual: 2.7B × 2.5 KB = 6.75 TB/year
- Keep 1 year: 6.75 TB
Total Storage:
- Refund data: 1 TB
- Audit logs: 30 TB
- Chatbot history: 6.75 TB
- Total: ~38 TB (round to 40 TB with overhead)
Bandwidth Estimation
Read Traffic (status checks):
- 15M reads/day × 2 KB = 30 GB/day
- Per second: 30 GB / 86,400 ≈ 347 KB/s ≈ 2.8 Mbps
Chatbot Traffic:
- 7.5M interactions/day
- Average: 5 messages × 500 bytes = 2.5 KB per conversation
- Daily: 7.5M × 2.5 KB = 18.75 GB/day
- Per second: 217 KB/s ≈ 1.7 Mbps
Total Bandwidth:
- Read: ~3 Mbps average, 30 Mbps peak
- Write (logs, updates): ~1 Mbps
- Total: 5-40 Mbps (very manageable)
Summary
| Metric | Value |
|---|---|
| Users | 100M (India), expand globally |
| DAU (peak) | 5M |
| QPS (average) | 260 QPS |
| QPS (peak) | 2,600 QPS (10K during tax season) |
| Storage | 40 TB (with 7-year retention) |
| Bandwidth | 5-40 Mbps |
| Latency Target | <500ms (status), <2s (chatbot) |
| Availability | 99.9% (99.99% tax season) |
Step 3: High-Level Design & Iterations (20-25 min)
Design Iteration 1: Basic Architecture (MVP)
Goal: Simple, working system for India market
┌─────────────┐
│ Users │
│ (Web/Mobile)│
└──────┬──────┘
│ HTTPS
│
┌──────▼──────────────────────────────────────┐
│ Load Balancer (ALB) │
└──────┬──────────────────────────────────────┘
│
┌──────▼──────────────┐
│ Web Servers │
│ (Authentication + │
│ Status API) │
└──────┬──────────────┘
│
┌──────▼──────────────────────────────────────┐
│ Database (PostgreSQL) │
│ Tables: users, refunds, audit_logs │
│ Encryption: AES-256 at rest │
└──────▲──────────────────────────────────────┘
│
│ Updates (batch sync every 5 min)
│
┌──────┴──────────────┐
│ Tax Processing │
│ System (Backend) │
│ (External) │
└─────────────────────┘
Components:
- Load Balancer: Distribute traffic, SSL termination
- Web Servers: Handle auth, status queries, business logic
- PostgreSQL: Store refund data, user info, audit logs
- Batch Sync: Periodic updates from tax backend
Pros:
- ✅ Simple to implement
- ✅ Strong consistency (single DB)
- ✅ Easy to secure (encryption at rest)
Cons:
- ❌ Single point of failure (database)
- ❌ Doesn’t scale beyond 1K QPS
- ❌ No caching (slow reads)
- ❌ No Gen AI integration
- ❌ Can’t handle 10K peak TPS
When this works: Early stage, India only, <1K QPS
Design Iteration 2: Scalable with Caching & AI
Improvements:
- Add caching layer for read-heavy workload
- Database replication for read scalability
- Integrate Gen AI chatbot
- Separate auth service
┌─────────────┐
│ Users │
└──────┬──────┘
│ HTTPS
│
┌──────▼───────────────────────────────────────┐
│ CDN (CloudFront) │
│ - Static assets (JS, CSS, images) │
└──────┬───────────────────────────────────────┘
│
┌──────▼───────────────────────────────────────┐
│ API Gateway + WAF │
│ - Rate limiting (100 req/min per user) │
│ - DDoS protection │
└──────┬───────────────────────────────────────┘
│
├──────────────────┬────────────────────┐
│ │ │
┌──────▼──────┐ ┌───────▼────────┐ ┌───────▼──────────┐
│ Auth │ │ Status API │ │ AI Chatbot API │
│ Service │ │ Service │ │ │
│ (Stateless) │ │ (Stateless) │ │ (Gen AI) │
└──────┬──────┘ └────────┬───────┘ └───────┬──────────┘
│ │ │
│ ┌────────▼────────┐ │
│ │ Redis Cache │ │
│ │ - Hot refund │ │
│ │ status (TTL: │ │
│ │ 5 min) │ │
│ └────────┬────────┘ │
│ │ │
┌──────▼──────────────────▼────────────────────▼──────┐
│ Database Layer │
│ ┌──────────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Primary DB │───▶│ Replica1 │ │ Replica2 │ │
│ │ (PostgreSQL) │ │ (Read) │ │ (Read) │ │
│ │ (Write) │ └──────────┘ └──────────┘ │
│ └──────▲───────┘ │
└─────────┼──────────────────────────────────────────┘
│
│ Kafka topic: refund_updates
│
┌─────────▼────────────┐
│ Message Queue │
│ (Kafka) │
│ - Status updates │
│ - Audit events │
└─────────▲────────────┘
│
┌─────────┴────────────┐
│ Tax Processing │
│ System (Backend) │
│ Publishes updates │
└──────────────────────┘
Separate storage for logs:
┌──────────────────────┐
│ Elasticsearch │
│ - Audit logs │
│ - Search & analysis │
└──────────────────────┘
New Components:
- API Gateway + WAF: Rate limiting, DDoS protection
- Auth Service: Separate microservice for authentication (PAN + OTP)
- Redis Cache: Cache hot refund data (80% hit rate), TTL 5 min
- DB Replication: Primary for writes, replicas for reads
- Kafka: Event stream for status updates from tax backend
- AI Chatbot API: Connects to LLM (OpenAI, Claude, or local model)
- Elasticsearch: Audit logs with search capability
Data Flow:
Read Path (Status Check):
User → API Gateway → Status API → Check Redis cache
├─ Cache hit (80%): Return from Redis (10-20ms)
└─ Cache miss (20%): Query DB replica → Store in cache → Return (100-200ms)
Write Path (Status Update):
Tax Backend → Kafka → Consumer Service → Update Primary DB → Invalidate cache
Chatbot Path:
User → API Gateway → AI Chatbot API → Retrieve context from DB →
LLM API (OpenAI/Claude) → Generate response → Return to user
Pros:
- ✅ Scales to 10K+ TPS (cache handles 80% reads)
- ✅ High availability (DB replicas, cache)
- ✅ Gen AI integration
- ✅ Better security (API Gateway + WAF)
- ✅ Audit logs searchable (Elasticsearch)
Cons:
- ❌ Single region (can’t scale globally)
- ❌ Cache invalidation complexity
- ❌ DB still bottleneck for writes
- ❌ No data residency for GDPR
When this works: India + neighboring countries, 10K peak TPS
Design Iteration 3: Global Scale (Final Architecture)
Improvements:
- Multi-region deployment
- Database sharding by country
- Read replicas in multiple regions
- Edge caching globally
- Advanced security (encryption, key rotation)
┌────────────────────┐
│ Global Users │
│ (India, US, EU) │
└──────────┬─────────┘
│
┌──────────▼─────────────┐
│ Route 53 (GeoDNS) │
│ Routes to nearest │
│ region │
└──┬────────────────┬────┘
│ │
┌──────────▼─────┐ ┌─────▼──────────┐
│ India Region │ │ US Region │
│ (Primary) │ │ (Secondary) │
└──────┬─────────┘ └─────┬──────────┘
│ │
┌──────────▼───────────────────▼──────────┐
│ CloudFront (CDN) │
│ - Global edge caching │
│ - Static assets + API responses │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ API Gateway + AWS WAF │
│ - Rate limiting: 100 req/min │
│ - DDoS protection │
│ - Request signing & validation │
└───┬──────────────────────────────────┬───┘
│ │
┌─────────▼────────┐ ┌─────────▼─────────┐
│ Auth Service │ │ Status API │
│ - MFA (OTP) │ │ - Stateless │
│ - JWT tokens │ │ - Read-heavy │
└─────────┬────────┘ └─────────┬─────────┘
│ │
┌─────────▼────────────┐ ┌──────────▼─────────┐
│ Redis (Elasticache) │ │ Redis (Cache) │
│ - Session store │ │ - Refund data │
│ - Distributed │ │ - TTL: 5 min │
└──────────────────────┘ │ - Hit rate: 80% │
└──────────┬─────────┘
│
┌────────────────▼────────────────┐
│ Database Sharding │
│ │
┌──────────────────────────┼─────────────────────────────────┼──┐
│ India Shard │ US Shard │ EU Shard │
│ ┌─────────┐ ┌────────┐ │ ┌──────┐ ┌─────┐ │ ┌──────┐ │
│ │Primary │─▶│Replica │ │ │Primary││Repl.│ │ │Primary│ │
│ │(Write) │ │(Read) │ │ │(Write)││(Read)│ │ │(Write)│ │
│ └────▲────┘ └────────┘ │ └───▲──┘ └─────┘ │ └───▲──┘ │
└───────┼───────────────────┴──────┼────────────┴─────┼────────┘
│ │ │
┌───────▼──────────────────────────▼───────────────────▼────┐
│ Kafka (Multi-region) │
│ Topics: refund_updates_india, refund_updates_us, etc. │
│ Cross-region replication for DR │
└───────▲────────────────────────────────────────────────────┘
│
┌───────┴──────────────┐
│ Tax Processing │
│ Systems (Regional) │
│ - India Tax Dept │
│ - IRS (US) │
│ - EU Tax Systems │
└──────────────────────┘
Additional Components:
┌──────────────────────────────────────────────────┐
│ AI Chatbot Infrastructure │
│ ┌────────────────┐ ┌──────────────┐ │
│ │ Chatbot API │ │ Vector DB │ │
│ │ (Stateless) │─▶│ (Pinecone) │ │
│ └────────┬───────┘ │ - FAQs │ │
│ │ │ - Tax rules │ │
│ │ └──────────────┘ │
│ ┌────────▼───────────────────┐ │
│ │ LLM Service │ │
│ │ - OpenAI GPT-4 / Claude │ │
│ │ - Self-hosted (fallback) │ │
│ │ - Response caching │ │
│ └────────────────────────────┘ │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ Security & Compliance │
│ ┌────────────────┐ ┌──────────────┐ │
│ │ KMS │ │ Secrets Mgr │ │
│ │ - Key rotation │ │ - DB creds │ │
│ │ - Encryption │ │ - API keys │ │
│ └────────────────┘ └──────────────┘ │
│ │
│ ┌────────────────┐ ┌──────────────┐ │
│ │ Audit Logs │ │ Compliance │ │
│ │ (Elasticsearch)│ │ - GDPR │ │
│ │ - All access │ │ - PCI DSS │ │
│ └────────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ Observability │
│ ┌────────────────┐ ┌──────────────┐ │
│ │ Prometheus + │ │ CloudWatch │ │
│ │ Grafana │ │ - Metrics │ │
│ │ - Metrics │ │ - Alarms │ │
│ └────────────────┘ └──────────────┘ │
│ │
│ ┌────────────────┐ ┌──────────────┐ │
│ │ Distributed │ │ ELK Stack │ │
│ │ Tracing (Jaeger│ │ - Logs │ │
│ │ - Request flow │ │ - Analysis │ │
│ └────────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────┘
Deep Dives
1. Security & Encryption
Encryption at Rest:
Database (PostgreSQL):
- AES-256 encryption for data at rest
- Transparent Data Encryption (TDE)
- Encrypted backups
Sensitive Fields (Extra Layer):
- PAN/SSN: Encrypt with application-level encryption
- Bank account: Tokenization + encryption
- Key: AWS KMS with automatic rotation (90 days)
Example:
encrypted_pan = AES_ENCRYPT(pan, kms_key_id)
Stored as: "ENC:v1:kms-key-id:ciphertext"
Encryption in Transit:
All communication uses TLS 1.3:
- Client ↔ API Gateway: HTTPS
- API Gateway ↔ Services: TLS mutual auth
- Services ↔ Database: SSL/TLS
- Services ↔ Redis: TLS enabled
Certificate management:
- AWS Certificate Manager (ACM)
- Auto-renewal
- Strong cipher suites only
Authentication & Authorization:
Multi-Factor Authentication:
1. User enters: PAN + Filing Year + Date of Birth
2. System generates OTP → sent to registered mobile
3. User enters OTP (valid 5 min)
4. System generates JWT token (valid 30 min)
5. Refresh token for extended session (2 hours)
JWT Token Structure:
{
"sub": "user_id",
"pan": "XXXXX1234Y", // masked
"country": "IN",
"roles": ["taxpayer"],
"exp": 1234567890,
"iat": 1234567890
}
Authorization:
- User can only access their own refund data
- Query: WHERE user_id = jwt.sub AND pan = jwt.pan
- Prevents horizontal privilege escalation
Audit Logging:
Every action logged:
{
"timestamp": "2026-04-09T10:30:00Z",
"user_id": "user123",
"action": "VIEW_REFUND_STATUS",
"resource": "refund:2024:user123",
"ip_address": "203.0.113.0",
"user_agent": "Mozilla/5.0...",
"result": "SUCCESS",
"latency_ms": 45
}
Retention: 7 years
Immutable: Write-once, can't be modified
Storage: Elasticsearch (searchable) + S3 (archival)
Compliance:
- GDPR (EU users): Right to erasure, data portability, consent management
- PCI DSS: Payment card data protection (if storing bank info)
- SOC 2 Type II: Annual audit for security controls
- Data residency: EU data stored in EU region, India data in India
2. Gen AI Chatbot Integration
Architecture:
User Question: "Why is my refund delayed?"
1. User → Chatbot API
2. Retrieve User Context:
- Query DB: Get user's refund status, amount, filing date
- Context: "User filed on 2024-02-15, status=PROCESSING"
3. Semantic Search (Vector DB):
- Embed question: "refund delayed" → vector
- Search Pinecone: Find top 3 similar FAQs
- Results:
* "Common reasons for delays"
* "Processing times by status"
* "How to contact support"
4. Generate Prompt:
System: "You are a helpful tax assistant. User context: [user data]
FAQs: [retrieved FAQs]
Question: Why is my refund delayed?"
5. LLM API Call (OpenAI GPT-4 / Claude):
- Max tokens: 500
- Temperature: 0.3 (more factual)
- Response time: 1-2s
6. Post-Processing:
- Sanitize response (no PII leaked)
- Add disclaimer: "This is AI-generated..."
- Track conversation for improvement
7. Return to User
8. Log Conversation:
- Store in DB for analytics
- Track: useful/not useful feedback
Components:
Vector Database (Pinecone/Weaviate):
Store embeddings for:
- FAQ documents (500+ common questions)
- Tax rules and regulations
- Status explanations
- Step-by-step guides
Embedding model: text-embedding-ada-002 (OpenAI)
Similarity search: Cosine similarity
Top K: 3 most relevant results
LLM Options:
| Option | Pros | Cons | Cost |
|---|---|---|---|
| OpenAI GPT-4 | Best quality, low latency | Expensive, data privacy | $30/1M tokens |
| Claude 3.5 | Great quality, longer context | Similar cost | $15/1M tokens |
| Self-hosted (Llama 3) | Full control, privacy | Need GPU infra, lower quality | $500/mo infra |
| Azure OpenAI | Enterprise compliance | Vendor lock-in | Similar to OpenAI |
Recommendation: Start with Claude 3.5 (good balance), fallback to self-hosted for cost savings at scale.
Response Caching:
Common questions cached:
- "How long does processing take?" → Cache for 1 hour
- "What does status X mean?" → Cache for 1 day
Cache key: hash(question + user_status)
Cache hit rate: 40-50% (significant cost savings)
Safety & Compliance:
Guardrails:
1. Input validation: Reject offensive/irrelevant questions
2. PII detection: Never return PAN/bank details in response
3. Hallucination check: Fact-check critical info (amounts, dates)
4. Disclaimer: "AI-generated, not official advice"
5. Human escalation: Complex cases → support ticket
Prompt injection prevention:
- Sanitize user input
- Use system prompts that can't be overridden
- Monitor for suspicious patterns
Cost Optimization:
7.5M chatbot queries/day:
- Average tokens per query: 500 input + 300 output = 800 tokens
- Daily tokens: 7.5M × 800 = 6B tokens
- Cost (Claude): 6B/1M × $15 = $90/day = $32K/month
With caching (50% hit rate):
- Actual LLM calls: 3.75M/day
- Cost: $16K/month
Additional savings:
- Smaller model for simple queries (Claude Haiku): $1/1M tokens
- Self-hosted for 80% of queries, LLM for complex 20%
- Estimated: $5-10K/month at scale
3. Database Sharding Strategy
Why Shard?
- Single DB won’t scale beyond 100M users
- Global expansion requires data residency
- Write load increases with more countries
Sharding Key: country_code
Rationale:
✅ Natural partitioning (users belong to one country)
✅ Data residency compliance (GDPR, local laws)
✅ Query pattern: Users query their own country's data (no cross-shard)
✅ Even distribution (if countries have similar user counts)
Shard mapping:
- India (IN): Shard 1 (60M users)
- US (US): Shard 2 (40M users)
- UK (UK): Shard 3 (10M users)
- EU (multiple countries): Shards 4-7
Routing Logic:
def get_shard(user_country_code):
shard_map = {
'IN': 'db-india.example.com',
'US': 'db-us.example.com',
'UK': 'db-uk.example.com',
'DE': 'db-eu-1.example.com',
'FR': 'db-eu-2.example.com',
# ...
}
return shard_map.get(user_country_code)
# In API service
def get_refund_status(user_id, country_code):
db_conn = get_shard(country_code)
query = "SELECT * FROM refunds WHERE user_id = ?"
result = db_conn.execute(query, [user_id])
return resultChallenges:
-
Uneven distribution: India has 100M users, small countries have 1M
- Solution: Split large shards further (India-North, India-South)
-
User moves countries: Rare for tax data
- Solution: Keep data in original country, add pointer
-
Global analytics: Need cross-shard queries
- Solution: Replicate to data warehouse (Redshift) for analytics
Replication:
Each shard has:
- 1 Primary (writes)
- 2 Replicas in same region (reads)
- 1 Cross-region replica (disaster recovery)
India Shard:
- Primary: Mumbai
- Replica 1: Mumbai (read)
- Replica 2: Delhi (read)
- DR Replica: Singapore (disaster recovery)
4. Handling Tax Season Peak Load
Problem: Traffic spikes 10x during tax season (Jan-Apr)
Strategies:
1. Auto-Scaling:
Normal: 20 web servers (handle 1K TPS)
Peak: 200 web servers (handle 10K TPS)
Auto-scaling policy:
- Scale up when: CPU > 70% for 5 min
- Scale down when: CPU < 30% for 15 min
- Min instances: 20
- Max instances: 500
- Cooldown: 5 minutes
Cost:
- Normal: 20 × $100/mo = $2,000/mo
- Peak (3 months): 200 × $100/mo = $20,000/mo
- Annual: (9 × $2K) + (3 × $20K) = $78K/year
2. Database Read Replicas:
Normal: 1 primary + 2 replicas
Peak: 1 primary + 10 replicas
Read traffic (80% of total):
- 10K TPS × 0.8 = 8K read TPS
- Each replica: 1K TPS
- Need: 8 replicas minimum, provision 10 for buffer
3. Cache Warming:
Before tax season:
- Pre-populate Redis with all active refund records
- 60M records × 2 KB = 120 GB
- Redis cluster: 150 GB (r5.xlarge × 6 nodes)
Result: 95% cache hit rate during first week
4. Rate Limiting:
Per user: 100 requests/minute
Per IP: 1000 requests/minute (office/shared IPs)
Prevents:
- Accidental DDoS (bugs in clients)
- Malicious scraping
- Individual user abuse
5. Graceful Degradation:
If database is slow (p99 > 1s):
- Return cached data (even if stale up to 1 hour)
- Show banner: "Status may be delayed during high traffic"
If chatbot service down:
- Show static FAQ page
- Disable chatbot, keep core status check working
5. Monitoring & Alerting
Key Metrics:
Golden Signals:
1. Latency: p50, p95, p99 for each endpoint
- Status API: <500ms p99
- Chatbot API: <2s p99
2. Traffic: Requests per second
- Alert if > 12K TPS (20% above peak)
3. Errors: Error rate
- Alert if > 1% (should be <0.1%)
- 5xx errors: Infrastructure issues
- 4xx errors: Client issues (rate limiting, auth failures)
4. Saturation: Resource utilization
- CPU > 80%: Scale up
- Memory > 85%: Scale up
- DB connections > 90%: Add read replicas
- Cache memory > 90%: Increase cache size
Business Metrics:
- Successful authentications/min
- Refund status checks/min
- Chatbot queries/min
- Average response time by endpoint
- Cache hit rate (target: >80%)
Alerting Thresholds:
Critical (PagerDuty, wake up on-call):
- API down (health check fails for 2 min)
- Error rate > 5%
- p99 latency > 2s
- Database primary down
Warning (Slack, investigate next day):
- Error rate > 1%
- p99 latency > 1s
- Cache hit rate < 70%
- Disk space > 80%
Dashboards:
Real-time Dashboard (Grafana):
- QPS by endpoint
- Latency percentiles (p50, p95, p99)
- Error rate
- Cache hit rate
- Active users
- Database query time
Ops Dashboard:
- Server CPU/memory/disk
- Database connections
- Redis memory usage
- Kafka lag
- LLM API costs (real-time burn rate)
Business Dashboard:
- Daily active users
- Refund checks by status
- Chatbot usage
- Peak hour traffic patterns
Step 4: Wrap Up (5 min)
Bottlenecks & Scaling
Current Design Limits:
| Component | Current Capacity | Bottleneck At | Solution |
|---|---|---|---|
| API Servers | 10K TPS | 50K TPS | Auto-scale to 1000 instances |
| Redis Cache | 100K ops/s | 500K ops/s | Redis Cluster (10 nodes) |
| Database (per shard) | 10K TPS (read) | 100K TPS | More replicas (20+) |
| Database (write) | 5K TPS | 10K TPS | Partition by year within shard |
| Kafka | 100K msg/s | 1M msg/s | More partitions, more brokers |
| LLM API | 1K req/s | 5K req/s | Self-hosted model cluster |
At 10x Scale (1B users globally):
- Add more shards per country (shard by state/region)
- Cache layer becomes distributed (Redis Cluster with 100+ nodes)
- Self-hosted LLM mandatory (cost prohibitive otherwise)
- Data warehouse for analytics (can’t query across shards)
Failure Scenarios
| Failure | Impact | Mitigation | Recovery Time |
|---|---|---|---|
| Primary DB down | Writes fail | Failover to replica (automatic) | <30 seconds |
| Redis down | Cache misses spike | Degrade gracefully, DB handles load | 5 minutes (restart) |
| API Gateway down | All traffic fails | Multi-region deployment, DNS failover | 2 minutes |
| LLM API down | Chatbot unavailable | Fallback to static FAQs | N/A (external service) |
| Entire region down | Region’s users affected | Route to another region via GeoDNS | 5-10 minutes |
| Tax backend system down | No status updates | Show last known status, queue updates | Wait for backend |
Multi-Region Failover:
Normal: India users → India region
Failure: India region down → Route to Singapore region
- Data replicated to DR region (async)
- Last 5 minutes of updates may be lost (eventual consistency)
- Show banner: "Service running on backup, status may be delayed"
Cost Estimation
Infrastructure Costs (Annual, India only):
Compute:
- Web servers: 20-200 instances (avg 50)
50 × $100/mo × 12 = $60K/year
- Database: 1 primary + 4 replicas
5 × $400/mo × 12 = $24K/year
Storage:
- Database: 1 TB × $115/month = $1.4K/year
- Audit logs: 30 TB × $25/month (S3) = $9K/year
- Chatbot history: 7 TB × $25/month = $2.1K/year
Caching:
- Redis: 150 GB cluster
6 nodes × $137/mo × 12 = $9.8K/year
Networking:
- Load balancer: $20/mo × 12 = $240/year
- Data transfer: 10 TB/mo × $90 = $10.8K/year
- CDN: 10 TB/mo × $40 = $4.8K/year
Kafka:
- 3 brokers × $100/mo × 12 = $3.6K/year
AI/LLM:
- Claude API: $16K/mo × 12 = $192K/year (expensive!)
- OR Self-hosted: $10K/mo × 12 = $120K/year (GPU cluster)
Monitoring:
- DataDog/New Relic: $5K/year
Total (India only):
- With Claude API: ~$320K/year
- With self-hosted LLM: ~$250K/year
Global (5 countries):
- Multiply by 3-5x (not linear due to shared components)
- Estimated: $1M - $1.5M/year
Cost Optimizations:
- Reserved instances for base load (40% savings)
- Spot instances for tax season spikes (70% savings)
- S3 lifecycle policies (move old logs to Glacier)
- Self-hosted LLM for 80% of queries, API for complex 20%
- Aggressive caching (reduces DB load and costs)
Future Enhancements
Phase 2 Features:
- Notifications: Email/SMS when status changes
- Appeal workflow: File disputes directly
- Document upload: Upload missing documents
- Payment tracking: Track refund deposit to bank account
- Historical data: View past years’ refunds
- Predictive ETA: ML model to predict refund date
Phase 3 (Advanced):
- Proactive notifications: AI predicts delays, notifies user
- Personalized dashboard: Tax planning, refund maximization tips
- Voice interface: “Alexa, check my refund status”
- Blockchain: Immutable audit trail for compliance
- Real-time updates: WebSocket connection for live status
Alternative Approaches Comparison
Database Choices
| Option | Pros | Cons | When to Use |
|---|---|---|---|
| PostgreSQL (Chosen) | Strong consistency, ACID, SQL, mature | Harder to scale horizontally | Transactional data, complex queries, strong consistency needed |
| MongoDB | Flexible schema, easy horizontal scaling | Eventual consistency, no joins | Rapidly changing schema, simple queries |
| DynamoDB | Fully managed, infinite scale, fast | Expensive at scale, limited queries | Serverless, unpredictable load, simple key-value |
| Cassandra | Extreme write throughput, multi-DC | Eventual consistency, complex setup | Write-heavy, multi-region, eventual consistency OK |
Decision: PostgreSQL for strong consistency in tax data, shard by country for horizontal scaling.
Caching Strategies
| Strategy | Pros | Cons | When to Use |
|---|---|---|---|
| Cache-Aside (Chosen) | Simple, cache only what’s needed | Cache miss penalty, stale data | Read-heavy, flexible caching |
| Write-Through | Cache always consistent | Slower writes (double write) | Strong consistency needed |
| Write-Behind | Fast writes, async to DB | Risk of data loss | Write-heavy, eventual consistency OK |
| Refresh-Ahead | Proactive refresh, no misses | Complex, unnecessary refreshes | Predictable access patterns |
Decision: Cache-aside with 5-minute TTL (refund status doesn’t change frequently).
Gen AI LLM Choices
| Option | Quality | Latency | Cost | Privacy | When to Use |
|---|---|---|---|---|---|
| OpenAI GPT-4 | Excellent | 1-2s | $30/1M tokens | External API | Best quality, low volume |
| Claude 3.5 (Chosen) | Excellent | 1-2s | $15/1M tokens | External API | Good balance, enterprise compliance |
| Self-Hosted Llama 3 | Good | 500ms-1s | $500/mo infra | Full control | High volume, privacy critical |
| Azure OpenAI | Excellent | 1-2s | $30/1M tokens | Enterprise SLA | Existing Azure infra |
| AWS Bedrock | Good-Excellent | 1-2s | Varies | AWS ecosystem | Existing AWS infra |
Decision: Claude 3.5 initially for quality, migrate to hybrid (self-hosted for simple + Claude for complex) at scale.
Authentication Methods
| Method | Security | UX | Cost | When to Use |
|---|---|---|---|---|
| PAN + DOB + OTP (Chosen) | High | Moderate | Low | India (PAN common), phone available |
| SSN + DOB + OTP | High | Moderate | Low | US (SSN common) |
| National ID + Biometric | Very High | Poor | High | High-security requirements |
| OAuth (Google/Apple) | Moderate | Excellent | Low | Consumer apps, less sensitive |
| Hardware Token (YubiKey) | Very High | Poor | High | Enterprise, high security |
Decision: PAN + OB + OTP for India (PAN universal), adapt to local IDs for other countries.
Sharding Strategies
| Strategy | Pros | Cons | When to Use |
|---|---|---|---|
| By Country (Chosen) | Data residency, no cross-shard queries | Uneven distribution | Global app, compliance needs |
| By User ID (Hash) | Even distribution | Cross-shard queries, no data residency | Single country, even load |
| By Year | Time-based queries efficient | Recent years get all load | Time-series data |
| Hybrid (Country + Year) | Best of both | Most complex | Very large scale, multi-year queries |
Decision: By country for data residency, add by-year within shard if needed at scale.
Monitoring Tools
| Tool | Pros | Cons | When to Use |
|---|---|---|---|
| DataDog | All-in-one, great UX | Expensive ($5-20K/year) | Enterprises, full observability |
| Prometheus + Grafana (Chosen) | Open-source, flexible | Self-managed | Cost-conscious, customization |
| New Relic | Easy setup, AI insights | Expensive | Quick setup needed |
| CloudWatch (AWS) | Integrated with AWS | Limited, AWS-only | AWS-heavy, basic needs |
| ELK Stack | Powerful log analysis | Complex setup | Log-heavy workloads |
Decision: Prometheus + Grafana for metrics (cost-effective), ELK for logs (searchable audit logs).
Key Takeaways
Critical Decisions
- Security First: Multi-layer encryption, MFA, audit logs (non-negotiable for tax data)
- Shard by Country: Data residency compliance, natural partitioning
- Cache-Aside: 80% cache hit rate reduces DB load 5x
- Hybrid LLM: Self-hosted for common queries, API for complex (cost optimization)
- Auto-Scaling: 10x traffic spikes during tax season handled gracefully
What Makes This Design Good?
✅ Scales globally: Multi-region, sharded database
✅ Secure: Encryption, MFA, audit logs, compliance (GDPR, PCI DSS)
✅ Cost-effective: Caching, auto-scaling, hybrid LLM approach
✅ Resilient: Replication, failover, graceful degradation
✅ Observable: Comprehensive monitoring, alerting, dashboards
✅ Compliant: Data residency, audit logs, GDPR support
Interview Performance Tips
What Impressed:
- Started with clarifying questions (functional + non-functional requirements)
- Did capacity estimation (showed quantitative thinking)
- Iterative design (v1 → v2 → v3, not jumping to complex solution)
- Security deep dive (encryption, auth, audit logs)
- Trade-off discussion (PostgreSQL vs DynamoDB, Claude vs self-hosted)
- Cost awareness (mentioned $250K-320K/year, optimizations)
- Failure scenarios (multi-region failover, graceful degradation)
What to Avoid:
- Don’t ignore security (this is tax data!)
- Don’t over-engineer Gen AI (could be simple FAQ initially)
- Don’t forget compliance (GDPR, PCI DSS, data residency)
- Don’t skip monitoring (critical for high-availability systems)
Practice this problem in 45 minutes! Focus on explaining trade-offs and iterating through designs.
Last Updated: 2026-04-09