System Design Interview Framework

A step-by-step approach to tackle any system design interview (based on Alex Xu’s methodology).

⏱️ Time Allocation (45 min interview)

Phase	Time	Activities
Requirements	5-7 min	Clarify scope, functional/non-functional requirements
High-level Design	10-15 min	Draw initial architecture, get buy-in
Deep Dive	15-20 min	Drill into 2-3 components, discuss trade-offs
Wrap Up	5-7 min	Bottlenecks, monitoring, future improvements

📋 Step-by-Step Process

Step 1: Understand the Problem (5-7 min)

Always ask clarifying questions! Never assume.

Functional Requirements

What are the core features? (List 3-5 main use cases)
Who are the users? (B2C, B2B, internal)
What platforms? (Web, mobile, desktop)
What’s the scope? (MVP vs full product)

Example Questions:

“Should we support user authentication?”
“Do we need to handle image uploads?”
“Is this read-heavy or write-heavy?”
“Do users need real-time updates?”

Non-Functional Requirements

Scale: How many users? (DAU, MAU)
Performance: Latency requirements? (p99 < 100ms?)
Availability: Uptime SLA? (99.9%, 99.99%?)
Consistency: Strong or eventual?
Durability: Can we lose data? How much?
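To make the availability question concrete, an SLA percentage can be turned into a yearly downtime budget. The sketch below is only an illustration: it assumes a 365-day year, and the function name `allowed_downtime_hours` is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours per year a system may be down while still meeting the SLA."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * HOURS_PER_YEAR


for sla in (99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} hours/year")
```

Under these assumptions, 99.9% allows roughly 8.76 hours of downtime per year, while 99.99% allows under one hour, which is why each extra "nine" tends to change the design.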

Key Questions:

“How many daily active users?”
“How many requests per second?”
“How much data do we need to store?”
“What’s the read/write ratio?”
“Any specific latency requirements?”

Write Down Assumptions

After clarifying, explicitly state assumptions:

“Let’s assume 100M DAU, 10:1 read/write ratio”
“I’ll design for 99.9% availability”
“Let’s assume eventual consistency is acceptable”

Step 2: Back-of-Envelope Estimation (5 min)

Always do quick calculations! Shows quantitative thinking.

Traffic Estimation

DAU = 100M users
Average user makes 10 requests/day
Total daily requests = 100M × 10 = 1B requests/day
QPS = 1B / 86400 ≈ 12K QPS
Peak QPS = 12K × 2 = 24K QPS (assume 2x for peaks)
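The same traffic arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes traffic is spread evenly over the day; `estimate_qps` and its parameters are illustrative names, and the 2x peak factor is the assumption stated above.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau: int, requests_per_user_per_day: int, peak_factor: float = 2.0):
    """Return (average QPS, peak QPS) for a uniform-traffic model."""
    daily_requests = dau * requests_per_user_per_day
    average_qps = daily_requests / SECONDS_PER_DAY
    return average_qps, average_qps * peak_factor


avg, peak = estimate_qps(dau=100_000_000, requests_per_user_per_day=10)
print(f"average ~{avg:,.0f} QPS, peak ~{peak:,.0f} QPS")
```

The exact values land slightly below the rounded 12K and 24K figures, which is fine: in an interview the order of magnitude matters more than the last digit.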

Storage Estimation

Each post = 1KB metadata + 1MB media (average)
Posts per day = 10M
Daily storage = 10M × 1MB ≈ 10TB/day
Yearly storage = 10TB × 365 ≈ 3.65PB/year

Bandwidth Estimation

Write: 10TB/day / 86400 ≈ 120MB/s
Read (10:1 ratio): 120MB/s × 10 = 1.2GB/s
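The storage and bandwidth numbers can be reproduced the same way. The sketch below uses decimal units (1 KB = 1000 bytes) and the post size, post volume, and read/write ratio assumed in this section; the variable names are illustrative.

```python
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15  # decimal units
SECONDS_PER_DAY = 86_400

posts_per_day = 10_000_000
bytes_per_post = 1 * KB + 1 * MB   # metadata + media
read_write_ratio = 10

daily_bytes = posts_per_day * bytes_per_post
yearly_bytes = daily_bytes * 365
write_bandwidth = daily_bytes / SECONDS_PER_DAY        # bytes per second
read_bandwidth = write_bandwidth * read_write_ratio    # bytes per second

print(f"daily storage   ~{daily_bytes / TB:.2f} TB")
print(f"yearly storage  ~{yearly_bytes / PB:.2f} PB")
print(f"write bandwidth ~{write_bandwidth / MB:.0f} MB/s")
print(f"read bandwidth  ~{read_bandwidth / GB:.2f} GB/s")
```

The metadata term barely moves the total, which is the usual justification for dropping it in a back-of-envelope estimate.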

Quick Reference (memorize these!):

1 million = 10^6
1 billion = 10^9
1 day = 86400 seconds ≈ 100K seconds
1 KB = 1000 bytes
1 MB = 1000 KB
1 GB = 1000 MB

Step 3: High-Level Design (10-15 min)

Start simple, iterate based on requirements.

Step 3.1: API Design (2-3 min)

Define core APIs (REST or RPC).

POST /api/v1/posts
  - Create new post
  - Params: userId, content, media
  - Returns: postId

GET /api/v1/feed
  - Get user's feed
  - Params: userId, page, size
  - Returns: list of posts
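To show what these two endpoints imply, here is a minimal in-memory sketch with no HTTP layer and no persistence. Everything beyond the endpoint names and parameters is an assumption made for illustration: the `FeedApi` class, the `following` mapping, zero-based pages, and newest-first ordering.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Post:
    post_id: int
    user_id: int
    content: str
    media: Optional[str] = None


class FeedApi:
    """In-memory stand-in for the two endpoints (illustrative only)."""

    def __init__(self, following: dict[int, set[int]]) -> None:
        self._following = following     # userId -> set of followed userIds
        self._posts: list[Post] = []    # append-only, oldest first
        self._next_id = count(1)

    def create_post(self, user_id: int, content: str, media: Optional[str] = None) -> int:
        """POST /api/v1/posts: store the post and return its postId."""
        post = Post(next(self._next_id), user_id, content, media)
        self._posts.append(post)
        return post.post_id

    def get_feed(self, user_id: int, page: int = 0, size: int = 20) -> list[Post]:
        """GET /api/v1/feed: one page of posts from followed users, newest first."""
        followed = self._following.get(user_id, set())
        visible = [p for p in reversed(self._posts) if p.user_id in followed]
        start = page * size             # assumption: pages are zero-based
        return visible[start:start + size]


api = FeedApi(following={1: {2, 3}})
api.create_post(user_id=2, content="hello")
api.create_post(user_id=3, content="world")
api.create_post(user_id=4, content="not followed")
print([p.content for p in api.get_feed(user_id=1, page=0, size=10)])
```

Offset-based paging like this is easy to explain but can skip or repeat items when new posts arrive between requests; cursor-based paging is a common alternative worth mentioning as a trade-off.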

Step 3.2: Data Model (2-3 min)

Define key entities and relationships.

User
- userId (PK)
- username
- email
- createdAt

Post
- postId (PK)
- userId (FK)
- content
- mediaUrl
- timestamp

Follow
- followerId (FK)
- followeeId (FK)
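One way to sanity-check the entities above is to write them as a relational schema. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column types, constraints, table names, and the composite primary key on the follow table are assumptions, not part of the original model.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE,
    email      TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    content    TEXT NOT NULL,
    media_url  TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(user_id),
    followee_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (follower_id, followee_id)  -- assumption: one row per pair
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(SCHEMA)

conn.execute("INSERT INTO users (user_id, username, email) VALUES (1, 'alice', 'a@example.com')")
conn.execute("INSERT INTO users (user_id, username, email) VALUES (2, 'bob', 'b@example.com')")
conn.execute("INSERT INTO follows (follower_id, followee_id) VALUES (1, 2)")  # alice follows bob
conn.execute("INSERT INTO posts (user_id, content) VALUES (2, 'first post')")

# Feed query for user 1: posts written by anyone user 1 follows.
rows = conn.execute(
    """SELECT p.content
       FROM posts p JOIN follows f ON f.followee_id = p.user_id
       WHERE f.follower_id = ?""",
    (1,),
).fetchall()
print(rows)
```

The join in the feed query is the operation that becomes expensive at scale, which motivates the precomputed timelines discussed later.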

Step 3.3: System Architecture (5-10 min)

Draw boxes and arrows! Start with these components:

[Client] → [Load Balancer] → [Web Servers] → [Cache] → [Database]
                                   ↓
                            [Message Queue] → [Workers]

Progressive refinement:

Version 1: Single server (everything on one box)
Version 2: Separate web and data tier
Version 3: Add cache and CDN
Version 4: Horizontal scaling, sharding

Components to consider:

Load Balancer (distribute traffic)
Web/App Servers (business logic)
Cache (Redis/Memcached)
Database (SQL vs NoSQL?)
CDN (static content)
Message Queue (async processing)
Object Storage (S3 for media)

Always explain:

Why each component?
What does it solve?
What are alternatives?

Step 4: Deep Dive (15-20 min)

Interviewer will guide what to focus on. Pick 2-3 areas.

Common Deep Dive Topics

1. Database Schema & Scaling

SQL vs NoSQL choice
Indexing strategy
Sharding/partitioning key
Replication setup
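As an illustration of choosing a shard from a partitioning key, the sketch below uses simple hash-modulo routing. This is one possible scheme rather than a recommendation; `shard_for` and the `user:<id>` key format are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used here purely for its stability.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Rough look at how evenly 10,000 hypothetical user keys spread over 8 shards.
distribution = Counter(shard_for(f"user:{i}", 8) for i in range(10_000))
print(sorted(distribution.items()))
```

The main weakness of modulo routing is that changing the shard count remaps most keys; consistent hashing is the usual alternative to raise when the interviewer asks about resharding.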

2. Caching Strategy

What to cache? (hot data)
Cache write and invalidation policy (write-through, write-back, TTL expiry)
Cache consistency
Cache eviction (LRU, LFU)
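LRU eviction from the list above is small enough to sketch in full. The version below is a single-process illustration built on `collections.OrderedDict`; it is not how Redis or Memcached implement eviction internally, and the class name is made up for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: OrderedDict = OrderedDict()  # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touching "a" makes "b" the eviction candidate
cache.put("c", 3)    # exceeds capacity, so "b" is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

LFU would instead track access counts, which protects long-lived hot keys at the cost of more bookkeeping.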

3. Handling Hotspots

Celebrity problem (one user with millions of followers)
Solutions: Separate handling, rate limiting, fan-out on read

4. Consistency vs Availability

CAP theorem trade-offs
Strong vs eventual consistency
When to choose what?

5. Scalability Bottlenecks

Database becomes bottleneck → Sharding, read replicas
Single point of failure → Redundancy
Network bandwidth → CDN, compression

6. Real-time vs Batch

Push vs pull notifications
WebSockets vs polling
Stream processing vs batch jobs

How to Deep Dive

Pattern:

Identify the problem: “How do we handle celebrity users with 50M followers?”
Propose solution: “We can fan out on read for celebrities instead of on write”
Discuss trade-offs: “This adds read latency but reduces write amplification”
Consider alternatives: “Alternative is to cache celebrity feeds separately”

Always ask: “Which area would you like me to focus on?”

Step 5: Wrap Up (5-7 min)

Don’t skip this! Shows you think beyond the design.

Discuss Bottlenecks

“The database might become a bottleneck at 100K QPS”
“We should monitor cache hit rate”
“Network bandwidth could be an issue for video”

Monitoring & Alerting

Key metrics to track (QPS, latency, error rate, cache hit rate)
Alerting thresholds
Logging and tracing

Failure Scenarios

“If a database replica fails, traffic routes to healthy replicas”
“If cache goes down, we fall back to database (with rate limiting)”
“Multi-region setup for disaster recovery”
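The cache-fallback scenario mentions rate limiting without naming an algorithm. A token bucket is one common choice, sketched below as an assumption rather than the method these notes prescribe. The clock is injectable so the behavior can be demonstrated without sleeping; all names are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate            # tokens added per second
        self._capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill based on elapsed time, never above capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


fake_now = [0.0]                                  # hypothetical test clock
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]        # two allowed, third rejected
fake_now[0] += 1.0                                # one second later, one token refilled
after_refill = bucket.allow()
print(burst, after_refill)
```

Placed in front of the database, a limiter like this keeps a cache outage from turning into a database outage: excess requests are rejected or served degraded instead of all hitting the fallback path.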

Future Improvements

“We could add machine learning for personalized recommendations”
“Implement GraphQL for flexible queries”
“Add A/B testing framework”

Error Cases

“Handle duplicate requests with idempotency keys”
“Rate limiting to prevent abuse”
“Graceful degradation if downstream services fail”
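The idempotency-key idea can be sketched as follows: the client sends a unique key with each logical request, and the server replays the stored result when it sees the key again. This is a single-process illustration; `IdempotentExecutor` is a hypothetical name, and a production version would need a shared store, key expiry, and an atomic insert-if-absent to handle concurrent duplicates.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs an operation at most once per idempotency key and replays the result."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}   # key -> stored result (in-memory only)

    def execute(self, idempotency_key: str, operation: Callable[[], Any]) -> Any:
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # duplicate: do not run again
        result = operation()
        self._results[idempotency_key] = result
        return result


created_posts: list[str] = []


def create_post() -> int:
    created_posts.append("hello")
    return len(created_posts)       # stands in for a generated postId


executor = IdempotentExecutor()
first = executor.execute("req-123", create_post)
retry = executor.execute("req-123", create_post)   # client retry with the same key
print(first, retry, len(created_posts))
```

The retry returns the same identifier and no second post is created, which is the property that makes client-side retries safe.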

🎯 The Framework Checklist

Use this for every problem:

💬 Communication Tips

Do’s ✅

Think out loud: Share your reasoning
Ask questions: Engage with interviewer
Start simple: Build complexity gradually
Justify decisions: Explain why, not just what
Discuss trade-offs: Every choice has pros/cons
Be flexible: Adapt based on feedback
Draw clearly: Label everything
Manage time: Keep track, don’t spend 30 min on one area

Don’ts ❌

Don’t stay silent: Interviewer can’t help if they don’t know your thinking
Don’t jump to code: This is architecture, not coding
Don’t over-engineer: Start with MVP
Don’t ignore requirements: Clarify before designing
Don’t say “I don’t know” and stop: Say “I don’t know, but here’s my reasoning…”
Don’t argue: Listen to hints; they’re helping you
Don’t go too deep too fast: High-level first

🎨 Drawing Conventions

Components

[Load Balancer]  - Square brackets for services
(Cache)          - Parentheses for data stores
{S3}             - Braces for external services

Data Flow

→  Data flow direction
↔  Bidirectional
⚡ Async/message queue
🔄 Replication

Labels

Always label:
- Component names
- Protocols (HTTP, TCP, WebSocket)
- Data types (JSON, binary)
- Numbers (QPS, latency)

📝 Example: Design Twitter

Step 1: Requirements (5 min)

Functional:

Post tweets (text, 280 chars)
Follow users
View timeline (tweets from followed users)

Non-Functional:

100M DAU
Fast timeline load (p99 < 1s)
Eventual consistency OK
99.9% availability

Step 2: Estimation (3 min)

Users: 100M DAU
Tweets: 100M tweets/day
QPS: 100M / 86400 ≈ 1200 tweets/sec
Timeline views: 10x reads → 12K QPS
Storage: 100M tweets × 280 bytes ≈ 28GB/day

Step 3: High-level (10 min)

APIs:

POST /v1/tweets
GET /v1/timeline/:userId
POST /v1/follow/:userId

Architecture:

[Client] → [LB] → [Web Servers] → [Redis Cache] → [PostgreSQL]
                        ↓
                   [Kafka] → [Timeline Service]
                        ↓
                   [Timeline Cache]

Step 4: Deep Dive (15 min)

Fan-out approaches:

Fan-out on write (push to all followers)
Fan-out on read (compute on timeline load)
Hybrid (celebrities use pull, regular users use push)
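The hybrid approach can be sketched in memory as below. The follower threshold, class name, and data structures are all assumptions chosen to keep the example small; a real system would use a much larger threshold and persistent stores.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3   # hypothetical cutoff, kept tiny for the demo


class TimelineService:
    def __init__(self) -> None:
        self.followers = defaultdict(set)          # author -> users following them
        self.following = defaultdict(set)          # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(seq, author, text)]
        self.inbox = defaultdict(list)             # user -> precomputed timeline entries
        self._seq = 0                              # monotonically increasing tweet order

    def follow(self, follower: str, author: str) -> None:
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self._seq += 1
        tweet = (self._seq, author, text)
        self.tweets_by_author[author].append(tweet)
        if not self.is_celebrity(author):
            for follower in self.followers[author]:   # fan-out on write (push)
                self.inbox[follower].append(tweet)

    def timeline(self, user: str, limit: int = 10) -> list:
        merged = {t[0]: t for t in self.inbox[user]}  # keyed by seq to avoid duplicates
        for author in self.following[user]:
            if self.is_celebrity(author):             # fan-out on read (pull)
                merged.update({t[0]: t for t in self.tweets_by_author[author]})
        return sorted(merged.values(), reverse=True)[:limit]   # newest first


svc = TimelineService()
for user in ("u1", "u2", "u3"):
    svc.follow(user, "celeb")
svc.follow("u1", "friend")
svc.post("friend", "hi")          # pushed into u1's inbox
svc.post("celeb", "big news")     # not pushed; pulled at read time
print([text for _, _, text in svc.timeline("u1")])
```

The celebrity's tweet costs no writes at post time but adds a merge step on every read by a follower, which is the trade-off to spell out during the deep dive.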

Caching strategy:

Cache timelines in Redis
TTL of 1 hour
Cache miss → rebuild from DB
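These three points describe a cache-aside pattern with expiry. The sketch below uses a plain dictionary as a stand-in for Redis and a caller-supplied loader as a stand-in for the database query; both, along with the class name and the injectable clock, are assumptions made for illustration.

```python
import time
from typing import Callable


class TimelineCache:
    """Cache-aside with TTL: serve from cache, rebuild from the DB on miss or expiry."""

    def __init__(self, load_from_db: Callable[[str], list],
                 ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list]] = {}  # user_id -> (expires_at, timeline)

    def get(self, user_id: str) -> list:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit
        timeline = self._load(user_id)           # miss or expired: rebuild from DB
        self._entries[user_id] = (now + self._ttl, timeline)
        return timeline


db_calls = []                                    # records each simulated DB query


def load_timeline(user_id: str) -> list:
    db_calls.append(user_id)
    return [f"tweet for {user_id}"]


fake_now = [0.0]                                 # hypothetical test clock
cache = TimelineCache(load_timeline, ttl_seconds=3600, clock=lambda: fake_now[0])
cache.get("u1")          # miss: loads from DB
cache.get("u1")          # hit: no DB call
fake_now[0] += 3601      # just past the 1-hour TTL
cache.get("u1")          # expired: loads again
print(len(db_calls))
```

A fixed TTL bounds staleness to one hour but does nothing for freshness within that window; pairing it with explicit invalidation on new tweets is a natural follow-up to discuss.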

Step 5: Wrap Up (7 min)

Bottlenecks: Celebrity problem, database writes
Monitoring: Tweet QPS, timeline latency, cache hit rate
Future: Recommendation algorithm, ads, analytics

🎓 Practice Recommendations

Week 1: Practice framework with 3-4 simple systems (URL shortener, Pastebin)
Week 2: Medium complexity (Twitter, Instagram, YouTube)
Week 3: Complex systems (Uber, Netflix, E-commerce)
Week 4: Mock interviews with peers, timed practice

After each practice:

Did I clarify requirements?
Did I do estimations?
Did I explain my reasoning?
Did I discuss trade-offs?
What would I do differently?

Remember: There’s no one “correct” answer. Interviewers assess:

Problem-solving approach
Communication skills
Technical knowledge
Trade-off analysis
Ability to handle ambiguity

Good luck! 🚀

Last Updated: 2026-04-08

Study Notes by Niladri & AI
