System Design Interview Framework
A step-by-step approach to tackling system design interviews (based on Alex Xu’s methodology).
⏱️ Time Allocation (45 min interview)
| Phase | Time | Activities |
|---|---|---|
| Requirements | 5-7 min | Clarify scope, functional/non-functional requirements |
| High-level Design | 10-15 min | Draw initial architecture, get buy-in |
| Deep Dive | 15-20 min | Drill into 2-3 components, discuss trade-offs |
| Wrap Up | 5-7 min | Bottlenecks, monitoring, future improvements |
📋 Step-by-Step Process
Step 1: Understand the Problem (5-7 min)
Always ask clarifying questions! Never assume.
Functional Requirements
- What are the core features? (List 3-5 main use cases)
- Who are the users? (B2C, B2B, internal)
- What platforms? (Web, mobile, desktop)
- What’s the scope? (MVP vs full product)
Example Questions:
- “Should we support user authentication?”
- “Do we need to handle image uploads?”
- “Is this read-heavy or write-heavy?”
- “Do users need real-time updates?”
Non-Functional Requirements
- Scale: How many users? (DAU, MAU)
- Performance: Latency requirements? (p99 < 100ms?)
- Availability: Uptime SLA? (99.9%, 99.99%?)
- Consistency: Strong or eventual?
- Durability: Can we lose data? How much?
Key Questions:
- “How many daily active users?”
- “How many requests per second?”
- “How much data do we need to store?”
- “What’s the read/write ratio?”
- “Any specific latency requirements?”
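An availability target is easier to reason about once it is converted into a downtime budget. The sketch below does that conversion for the two SLA figures mentioned above; the function name `downtime_budget` and the 365-day year are assumptions made for illustration.

```python
# Convert an availability SLA into an allowed-downtime budget.
SECONDS_PER_DAY = 86_400


def downtime_budget(availability_pct: float, days: int = 365) -> float:
    """Return the allowed downtime in seconds over `days` days."""
    return (1 - availability_pct / 100) * days * SECONDS_PER_DAY


for sla in (99.9, 99.99):
    hours = downtime_budget(sla) / 3600
    print(f"{sla}% availability allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the SLA question belongs in the requirements phase.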
Write Down Assumptions
After clarifying, explicitly state assumptions:
- “Let’s assume 100M DAU, 10:1 read/write ratio”
- “I’ll design for 99.9% availability”
- “Let’s assume eventual consistency is acceptable”
Step 2: Back-of-Envelope Estimation (5 min)
Always do quick calculations! Shows quantitative thinking.
Traffic Estimation
DAU = 100M users
Average user makes 10 requests/day
Total daily requests = 100M × 10 = 1B requests/day
QPS = 1B / 86400 ≈ 12K QPS
Peak QPS = 12K × 2 = 24K QPS (assume 2x for peaks)
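The same arithmetic can be checked with a few lines of Python. This is only a sketch using the numbers assumed above (100M DAU, 10 requests per user per day, 2x peak factor); the helper name `estimate_qps` is hypothetical.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400


def estimate_qps(dau: int, requests_per_user: int, peak_factor: float = 2.0) -> Tuple[float, float]:
    """Return (average QPS, peak QPS) for the given daily load."""
    daily_requests = dau * requests_per_user
    average = daily_requests / SECONDS_PER_DAY
    return average, average * peak_factor


avg_qps, peak_qps = estimate_qps(dau=100_000_000, requests_per_user=10)
print(f"average ~ {avg_qps:,.0f} QPS, peak ~ {peak_qps:,.0f} QPS")
```

The exact values come out slightly below the rounded 12K and 24K figures, which is fine: in an interview, one significant digit is usually enough.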
Storage Estimation
Each post = 1KB metadata + 1MB media (average)
Posts per day = 10M
Daily storage = 10M × 1MB ≈ 10TB/day (the 1KB of metadata adds only ~10GB/day)
Yearly storage = 10TB × 365 ≈ 3.6PB/year
Bandwidth Estimation
Write: 10TB/day / 86400 ≈ 120MB/s
Read (10:1 ratio): 120MB/s × 10 = 1.2GB/s
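The storage and bandwidth figures follow from the same inputs, so they can be derived together. The sketch below assumes the decimal units from the quick reference (1 KB = 1000 bytes) and the post size, post volume, and 10:1 read/write ratio stated above; the variable names are illustrative.

```python
# Decimal units, matching the quick reference in this guide.
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15
SECONDS_PER_DAY = 86_400

posts_per_day = 10_000_000
bytes_per_post = 1 * KB + 1 * MB   # metadata + media
read_write_ratio = 10              # 10:1 reads to writes

daily_bytes = posts_per_day * bytes_per_post
yearly_bytes = daily_bytes * 365
write_bytes_per_sec = daily_bytes / SECONDS_PER_DAY
read_bytes_per_sec = write_bytes_per_sec * read_write_ratio

print(f"daily storage  ~ {daily_bytes / TB:.2f} TB")
print(f"yearly storage ~ {yearly_bytes / PB:.2f} PB")
print(f"write bandwidth ~ {write_bytes_per_sec / MB:.0f} MB/s")
print(f"read bandwidth  ~ {read_bytes_per_sec / GB:.2f} GB/s")
```

Media dominates: dropping the 1KB of metadata changes the totals by about 0.1%, so it is safe to ignore in the spoken estimate.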
Quick Reference (memorize these!):
- 1 million = 10^6
- 1 billion = 10^9
- 1 day = 86400 seconds ≈ 100K seconds
- 1 KB = 1000 bytes
- 1 MB = 1000 KB
- 1 GB = 1000 MB
Step 3: High-Level Design (10-15 min)
Start simple, iterate based on requirements.
Step 3.1: API Design (2-3 min)
Define core APIs (REST or RPC).
POST /api/v1/posts
- Create new post
- Params: userId, content, media
- Returns: postId
GET /api/v1/feed
- Get user's feed
- Params: userId, page, size
- Returns: list of posts
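To make the contract concrete, the two endpoints can be sketched as plain Python methods with no HTTP layer. Everything here is an assumption for illustration: the class name `PostApi`, the in-memory list used as storage, zero-based pages, and the simplification that every post is visible to every user.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Post:
    post_id: int
    user_id: int
    content: str
    media: Optional[str] = None


class PostApi:
    """In-memory stand-in for the two endpoints above."""

    def __init__(self) -> None:
        self._posts: List[Post] = []
        self._ids = count(1)

    def create_post(self, user_id: int, content: str, media: Optional[str] = None) -> int:
        """POST /api/v1/posts: store the post and return its postId."""
        post = Post(next(self._ids), user_id, content, media)
        self._posts.append(post)
        return post.post_id

    def get_feed(self, user_id: int, page: int = 0, size: int = 20) -> List[Post]:
        """GET /api/v1/feed: return one page of posts, newest first."""
        # Simplification: user_id is accepted but not used for filtering;
        # a real feed would restrict results to accounts the user follows.
        newest_first = self._posts[::-1]
        start = page * size
        return newest_first[start:start + size]


api = PostApi()
for i in range(5):
    api.create_post(user_id=1, content=f"post {i}")
print([p.post_id for p in api.get_feed(user_id=2, page=0, size=2)])
```

Writing the signatures down early forces decisions about pagination and return types before the architecture is drawn.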
Step 3.2: Data Model (2-3 min)
Define key entities and relationships.
User
- userId (PK)
- username
- email
- createdAt
Post
- postId (PK)
- userId (FK)
- content
- mediaUrl
- timestamp
Follow
- followerId (FK)
- followeeId (FK)
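The three entities map directly onto small record types. The sketch below uses Python dataclasses; the integer ID types, the snake_case field names, and treating (follower_id, followee_id) as the Follow key are assumptions, since the model above does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Set


@dataclass(frozen=True)
class User:
    user_id: int            # PK
    username: str
    email: str
    created_at: datetime


@dataclass(frozen=True)
class Post:
    post_id: int            # PK
    user_id: int            # FK -> User.user_id
    content: str
    media_url: str
    timestamp: datetime


@dataclass(frozen=True)
class Follow:
    follower_id: int        # FK -> User.user_id
    followee_id: int        # FK -> User.user_id


def followees_of(follows: Iterable[Follow], user_id: int) -> Set[int]:
    """Return the IDs of the accounts that `user_id` follows."""
    return {f.followee_id for f in follows if f.follower_id == user_id}


follows = {Follow(1, 2), Follow(1, 3), Follow(2, 3)}
print(sorted(followees_of(follows, 1)))
```

The Follow relation is the one that drives feed generation, so it is worth stating out loud which direction each column points.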
Step 3.3: System Architecture (5-10 min)
Draw boxes and arrows! Start with these components:
[Client] → [Load Balancer] → [Web Servers] → [Cache] → [Database]
[Web Servers] → [Message Queue] → [Workers] (async path)
Progressive refinement:
- Version 1: Single server (everything on one box)
- Version 2: Separate web and data tier
- Version 3: Add cache and CDN
- Version 4: Horizontal scaling, sharding
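For the sharding step in Version 4, the simplest routing rule is a stable hash of the partition key taken modulo the shard count. The sketch below is one possible illustration, not a recommendation: the function name `shard_for`, the `user:<id>` key format, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a stable hash: Python's built-in hash() is randomized per process
    # for strings, so it would route the same key differently after a restart.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in (1, 2, 3):
    print(f"user:{user_id} -> shard {shard_for(f'user:{user_id}', 8)}")
```

The trade-off worth mentioning is that plain modulo routing remaps most keys when the shard count changes; consistent hashing is the usual alternative when resharding needs to be cheap.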
Components to consider:
- Load Balancer (distribute traffic)
- Web/App Servers (business logic)
- Cache (Redis/Memcached)
- Database (SQL vs NoSQL?)
- CDN (static content)
- Message Queue (async processing)
- Object Storage (S3 for media)
Always explain:
- Why each component?
- What does it solve?
- What are alternatives?
Step 4: Deep Dive (15-20 min)
Interviewer will guide what to focus on. Pick 2-3 areas.
Common Deep Dive Topics
1. Database Schema & Scaling
- SQL vs NoSQL choice
- Indexing strategy
- Sharding/partitioning key
- Replication setup
2. Caching Strategy
- What to cache? (hot data)
- Cache write and invalidation policy (write-through, write-back, TTL expiry)
- Cache consistency
- Cache eviction (LRU, LFU)
3. Handling Hotspots
- Celebrity problem (one user with millions of followers)
- Solutions: Separate handling, rate limiting, fan-out on read
4. Consistency vs Availability
- CAP theorem trade-offs
- Strong vs eventual consistency
- When to choose what?
5. Scalability Bottlenecks
- Database becomes bottleneck → Sharding, read replicas
- Single point of failure → Redundancy
- Network bandwidth → CDN, compression
6. Real-time vs Batch
- Push vs pull notifications
- WebSockets vs polling
- Stream processing vs batch jobs
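Of the caching topics listed above, LRU eviction is the one most often asked to be explained in detail. The following is a minimal single-process sketch built on `collections.OrderedDict`; the class name `LRUCache` is illustrative, and a production cache such as Redis or Memcached implements eviction internally.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._items:
            return default
        self._items.move_to_end(key)        # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used key
cache.put("c", 3)    # evicts "b"
print(cache.get("b"), cache.get("a"), cache.get("c"))
```

Both `get` and `put` run in constant time, which is the property interviewers usually want to hear alongside the eviction rule.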
How to Deep Dive
Pattern:
- Identify the problem: “How do we handle celebrity users with 50M followers?”
- Propose solution: “We can fan out on read for celebrities instead of fanning out on write”
- Discuss trade-offs: “This adds read latency but reduces write amplification”
- Consider alternatives: “Alternative is to cache celebrity feeds separately”
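The celebrity example in the pattern above can be sketched as a hybrid feed: regular authors push posts into follower inboxes at write time, while celebrity posts are pulled and merged at read time. This is an in-memory illustration under stated assumptions: the class `HybridFeed`, the tiny `CELEBRITY_THRESHOLD`, and the sequence-number ordering are all hypothetical.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, kept tiny so the demo is readable


class HybridFeed:
    def __init__(self) -> None:
        self.followers = defaultdict(set)   # author -> set of followers
        self.following = defaultdict(set)   # user -> set of authors
        self.inbox = defaultdict(list)      # user -> entries pushed at write time
        self.outbox = defaultdict(list)     # author -> the author's own entries
        self._seq = 0                       # global ordering of posts

    def follow(self, follower: str, author: str) -> None:
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self._seq += 1
        entry = (self._seq, author, text)
        self.outbox[author].append(entry)
        if not self.is_celebrity(author):
            # Fan-out on write: one inbox append per follower.
            for follower in self.followers[author]:
                self.inbox[follower].append(entry)

    def timeline(self, user: str, limit: int = 10) -> list:
        # Key by sequence number so a post that was pushed earlier and is
        # now also pulled (author became a celebrity) appears only once.
        merged = {entry[0]: entry for entry in self.inbox[user]}
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity posts when the feed loads.
                merged.update((entry[0], entry) for entry in self.outbox[author])
        return [merged[seq] for seq in sorted(merged, reverse=True)][:limit]


feed = HybridFeed()
for fan in ("a", "b", "c"):
    feed.follow(fan, "star")
feed.follow("a", "bob")
feed.post("bob", "hi")
feed.post("star", "hello")
print(feed.timeline("a"))
```

A post from `star` costs one write regardless of follower count, at the price of an extra lookup on every timeline read, which is exactly the trade-off described in the pattern.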
Always ask: “Which area would you like me to focus on?”
Step 5: Wrap Up (5-7 min)
Don’t skip this! Shows you think beyond the design.
Discuss Bottlenecks
- “The database might become a bottleneck at 100K QPS”
- “We should monitor cache hit rate”
- “Network bandwidth could be an issue for video”
Monitoring & Alerting
- Key metrics to track (QPS, latency, error rate, cache hit rate)
- Alerting thresholds
- Logging and tracing
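Two of the metrics above, tail latency and error rate, are simple to compute from raw samples. The sketch below uses the nearest-rank definition of a percentile and counts HTTP 5xx responses as errors; both choices are assumptions, and real monitoring systems typically approximate percentiles from histograms instead.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses with a 5xx status code."""
    if not status_codes:
        raise ValueError("no samples")
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


latencies_ms = list(range(1, 101))          # synthetic samples: 1..100 ms
statuses = [200] * 98 + [500, 503]          # synthetic samples: 2 errors in 100
print(f"p50 = {percentile(latencies_ms, 50)} ms, p99 = {percentile(latencies_ms, 99)} ms")
print(f"error rate = {error_rate(statuses):.1%}")
```

Alerting on p99 rather than the mean matters because a small fraction of slow requests can hide behind a healthy average.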
Failure Scenarios
- “If a database replica fails, traffic routes to healthy replicas”
- “If cache goes down, we fall back to database (with rate limiting)”
- “Multi-region setup for disaster recovery”
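The cache-outage scenario above combines two ideas: falling back to the database and capping how much fallback traffic the database sees. A minimal sketch, assuming a token bucket as the rate limiter; the names `TokenBucket` and `read`, the `ConnectionError` signal for a cache outage, and the injectable clock are illustrative choices.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def read(key: str,
         cache_get: Callable[[str], Optional[str]],
         db_get: Callable[[str], str],
         limiter: TokenBucket) -> str:
    """Read through the cache; on a miss or a cache outage, hit the database if the limiter allows."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache is unreachable; fall through to the database
    if not limiter.allow():
        raise RuntimeError("database fallback is rate limited; shed this request")
    return db_get(key)
```

Shedding some requests during a cache outage is usually preferable to letting the full read load reach the database and taking it down as well.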
Future Improvements
- “We could add machine learning for personalized recommendations”
- “Implement GraphQL for flexible queries”
- “Add A/B testing framework”
Error Cases
- “Handle duplicate requests with idempotency keys”
- “Rate limiting to prevent abuse”
- “Graceful degradation if downstream services fail”
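Idempotency keys work by remembering the result of the first request and replaying it for retries. The sketch below shows the idea in memory only; the class name `IdempotentExecutor` is hypothetical, and a real service would keep the keys in a shared store with an expiry and handle concurrent requests for the same key.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Run an operation at most once per idempotency key and replay the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def execute(self, idempotency_key: str, operation: Callable[[], Any]) -> Any:
        if idempotency_key in self._results:
            # Duplicate request (for example a client retry): skip the side effect.
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result


created = []


def create_post() -> int:
    created.append("post")
    return len(created)          # pretend this is the new postId


executor = IdempotentExecutor()
first = executor.execute("req-123", create_post)
retry = executor.execute("req-123", create_post)
print(first, retry, len(created))
```

The client generates the key once per logical request and reuses it on every retry, so a timeout followed by a retry cannot create the post twice.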
🎯 The Framework Checklist
Use this for every problem:
- Clarify requirements (functional + non-functional)
- Write down assumptions (scale, SLA, consistency)
- Do estimations (QPS, storage, bandwidth)
- Define APIs (2-3 core endpoints)
- Define data model (key entities)
- Draw high-level design (start simple)
- Explain each component (why it’s needed)
- Identify bottlenecks (what will break first?)
- Deep dive (2-3 areas based on interviewer interest)
- Discuss trade-offs (pros and cons)
- Wrap up (monitoring, failure scenarios, improvements)
💬 Communication Tips
Do’s ✅
- Think out loud: Share your reasoning
- Ask questions: Engage with interviewer
- Start simple: Build complexity gradually
- Justify decisions: Explain why, not just what
- Discuss trade-offs: Every choice has pros/cons
- Be flexible: Adapt based on feedback
- Draw clearly: Label everything
- Manage time: Keep track, don’t spend 30 min on one area
Don’ts ❌
- Don’t stay silent: Interviewer can’t help if they don’t know your thinking
- Don’t jump to code: This is architecture, not coding
- Don’t over-engineer: Start with MVP
- Don’t ignore requirements: Clarify before designing
- Don’t say “I don’t know” and stop: Say “I don’t know, but here’s my reasoning…”
- Don’t argue: Listen to hints, they’re helping you
- Don’t go too deep too fast: High-level first
🎨 Drawing Conventions
Components
[Load Balancer] - Square brackets for services
(Cache) - Parentheses for data stores
{S3} - Braces for external services
Data Flow
→ Data flow direction
↔ Bidirectional
⚡ Async/message queue
🔄 Replication
Labels
Always label:
- Component names
- Protocols (HTTP, TCP, WebSocket)
- Data types (JSON, binary)
- Numbers (QPS, latency)
📝 Example: Design Twitter
Step 1: Requirements (5 min)
Functional:
- Post tweets (text, 280 chars)
- Follow users
- View timeline (tweets from followed users)
Non-Functional:
- 100M DAU
- Fast timeline load (p99 < 1s)
- Eventual consistency OK
- 99.9% availability
Step 2: Estimation (3 min)
Users: 100M DAU
Tweets: 100M tweets/day (assume 1 tweet per user per day)
Write QPS: 100M / 86400 ≈ 1200 tweets/sec
Read QPS (timeline views, 10:1 read/write): ≈ 12K QPS
Storage (text only, ~1 byte per character): 100M tweets × 280 bytes ≈ 28GB/day
Step 3: High-level (10 min)
APIs:
POST /v1/tweets
GET /v1/timeline/:userId
POST /v1/follow/:userId
Architecture:
[Client] → [LB] → [Web Servers] → [Redis Cache] → [PostgreSQL]
[Web Servers] → [Kafka] → [Timeline Service] → [Timeline Cache]
Step 4: Deep Dive (15 min)
Fan-out approaches:
- Fan-out on write (push to all followers)
- Fan-out on read (compute on timeline load)
- Hybrid (celebrities use pull, regular users use push)
Caching strategy:
- Cache timelines in Redis
- TTL of 1 hour
- Cache miss → rebuild from DB
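The caching strategy above is a cache-aside read with a TTL. The sketch below uses a plain dictionary in place of Redis so that it runs on its own; the class name `TimelineCache`, the `load_from_db` callback, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class TimelineCache:
    """Cache-aside timeline store: serve from cache, rebuild from the database on a miss or expiry."""

    def __init__(self, load_from_db: Callable[[int], List[str]],
                 ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[int, Tuple[float, List[str]]] = {}  # user_id -> (expires_at, timeline)

    def get(self, user_id: int) -> List[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]                        # cache hit
        timeline = self._load(user_id)             # cache miss: rebuild from DB
        self._entries[user_id] = (now + self._ttl, timeline)
        return timeline


db_calls = []


def load_timeline(user_id: int) -> List[str]:
    db_calls.append(user_id)
    return [f"tweet for user {user_id}"]


cache = TimelineCache(load_timeline, ttl_seconds=3600)
cache.get(7)
cache.get(7)
print(f"database was queried {len(db_calls)} time(s)")
```

With a one-hour TTL a timeline can be up to an hour stale unless new tweets also update or invalidate the cached entry, which is consistent with the eventual-consistency requirement stated in Step 1.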
Step 5: Wrap Up (7 min)
Bottlenecks: Celebrity problem, database writes
Monitoring: Tweet QPS, timeline latency, cache hit rate
Future: Recommendation algorithm, ads, analytics
🎓 Practice Recommendations
Week 1: Practice framework with 3-4 simple systems (URL shortener, Pastebin)
Week 2: Medium complexity (Twitter, Instagram, YouTube)
Week 3: Complex systems (Uber, Netflix, E-commerce)
Week 4: Mock interviews with peers, timed practice
After each practice:
- Did I clarify requirements?
- Did I do estimations?
- Did I explain my reasoning?
- Did I discuss trade-offs?
- What would I do differently?
Remember: There’s no one “correct” answer. Interviewers assess:
- Problem-solving approach
- Communication skills
- Technical knowledge
- Trade-off analysis
- Ability to handle ambiguity
Good luck! 🚀
Last Updated: 2026-04-08