System Design Interview Framework

A step-by-step approach for tackling any system design interview (based on Alex Xu’s methodology).

⏱️ Time Allocation (45 min interview)

Phase               Time         Activities
Requirements        5-7 min      Clarify scope, functional/non-functional requirements
High-level Design   10-15 min    Draw initial architecture, get buy-in
Deep Dive           15-20 min    Drill into 2-3 components, discuss trade-offs
Wrap Up             5-7 min      Bottlenecks, monitoring, future improvements

📋 Step-by-Step Process

Step 1: Understand the Problem (5-7 min)

Always ask clarifying questions! Never assume.

Functional Requirements

  • What are the core features? (List 3-5 main use cases)
  • Who are the users? (B2C, B2B, internal)
  • What platforms? (Web, mobile, desktop)
  • What’s the scope? (MVP vs full product)

Example Questions:

  • “Should we support user authentication?”
  • “Do we need to handle image uploads?”
  • “Is this read-heavy or write-heavy?”
  • “Do users need real-time updates?”

Non-Functional Requirements

  • Scale: How many users? (DAU, MAU)
  • Performance: Latency requirements? (p99 < 100ms?)
  • Availability: Uptime SLA? (99.9%, 99.99%?)
  • Consistency: Strong or eventual?
  • Durability: Can we lose data? How much?
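
To make the availability targets above concrete, it can help to convert them into a yearly downtime budget. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative and not part of the original framework.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (assumes a non-leap year)


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    hours = allowed_downtime_seconds(target) / 3600
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% usually changes the design (redundancy, failover) rather than just the tuning.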

Key Questions:

  • “How many daily active users?”
  • “How many requests per second?”
  • “How much data do we need to store?”
  • “What’s the read/write ratio?”
  • “Any specific latency requirements?”

Write Down Assumptions

After clarifying, explicitly state assumptions:

  • “Let’s assume 100M DAU, 10:1 read/write ratio”
  • “I’ll design for 99.9% availability”
  • “Let’s assume eventual consistency is acceptable”

Step 2: Back-of-Envelope Estimation (5 min)

Always do quick calculations! Shows quantitative thinking.

Traffic Estimation

DAU = 100M users
Average user makes 10 requests/day
Total daily requests = 100M × 10 = 1B requests/day
QPS = 1B / 86400 ≈ 12K QPS
Peak QPS = 12K × 2 = 24K QPS (assume 2x for peaks)

Storage Estimation

Each post = 1KB metadata + 1MB media (average)
Posts per day = 10M
Daily storage = 10M × 1MB ≈ 10TB/day
Yearly storage = 10TB × 365 = 3,650TB ≈ 3.65PB/year

Bandwidth Estimation

Write: 10TB/day / 86400 ≈ 120MB/s
Read (10:1 ratio): 120MB/s × 10 = 1.2GB/s
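
The three estimates above can be reproduced with a few lines of arithmetic. This is only a sketch using the same assumed inputs (100M DAU, 10 requests per user, 10M posts of about 1MB each, 10:1 read/write ratio); the variable names are illustrative. The exact results come out slightly below the rounded figures quoted above, which is expected for back-of-envelope work.

```python
SECONDS_PER_DAY = 86_400

# Assumed inputs, taken from the estimates above.
dau = 100_000_000
requests_per_user_per_day = 10
peak_factor = 2
posts_per_day = 10_000_000
bytes_per_post = 1_000_000  # ~1 MB of media; the 1 KB of metadata is negligible
read_write_ratio = 10

# Traffic
avg_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * peak_factor

# Storage
daily_storage_bytes = posts_per_day * bytes_per_post
yearly_storage_bytes = daily_storage_bytes * 365

# Bandwidth
write_bandwidth = daily_storage_bytes / SECONDS_PER_DAY  # bytes per second
read_bandwidth = write_bandwidth * read_write_ratio

print(f"Average QPS:     {avg_qps:,.0f}")
print(f"Peak QPS:        {peak_qps:,.0f}")
print(f"Daily storage:   {daily_storage_bytes / 1e12:.1f} TB")
print(f"Yearly storage:  {yearly_storage_bytes / 1e15:.2f} PB")
print(f"Write bandwidth: {write_bandwidth / 1e6:.0f} MB/s")
print(f"Read bandwidth:  {read_bandwidth / 1e9:.2f} GB/s")
```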

Quick Reference (memorize these!):

  • 1 million = 10^6
  • 1 billion = 10^9
  • 1 day = 86400 seconds ≈ 100K seconds
  • 1 KB = 1000 bytes
  • 1 MB = 1000 KB
  • 1 GB = 1000 MB

Step 3: High-Level Design (10-15 min)

Start simple, iterate based on requirements.

Step 3.1: API Design (2-3 min)

Define core APIs (REST or RPC).

POST /api/v1/posts
  - Create new post
  - Params: userId, content, media
  - Returns: postId

GET /api/v1/feed
  - Get user's feed
  - Params: userId, page, size
  - Returns: list of posts
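
To show how the two endpoints above behave, here is a minimal in-memory sketch. It is not a real web service: the class name `FeedApi`, the `follow` helper, and the page/size defaults are assumptions added for illustration, and there is no persistence, authentication, or HTTP layer.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional, Set


@dataclass
class Post:
    post_id: int
    user_id: int
    content: str
    media_url: Optional[str] = None


class FeedApi:
    """In-memory stand-in for POST /api/v1/posts and GET /api/v1/feed."""

    def __init__(self) -> None:
        self._posts: List[Post] = []  # append order doubles as chronological order
        self._following: Dict[int, Set[int]] = {}
        self._next_id = count(1)

    def follow(self, follower_id: int, followee_id: int) -> None:
        # Hypothetical helper so the feed has something to show.
        self._following.setdefault(follower_id, set()).add(followee_id)

    def create_post(self, user_id: int, content: str,
                    media_url: Optional[str] = None) -> int:
        """POST /api/v1/posts: store the post and return its postId."""
        post = Post(next(self._next_id), user_id, content, media_url)
        self._posts.append(post)
        return post.post_id

    def get_feed(self, user_id: int, page: int = 1, size: int = 20) -> List[Post]:
        """GET /api/v1/feed: one page of posts from followed users, newest first."""
        followed = self._following.get(user_id, set())
        newest_first = [p for p in reversed(self._posts) if p.user_id in followed]
        start = (page - 1) * size
        return newest_first[start:start + size]
```

Offset-based paging (page, size) is the simplest option to explain in an interview; cursor-based paging is a common alternative when the feed changes while the user scrolls.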

Step 3.2: Data Model (2-3 min)

Define key entities and relationships.

User
- userId (PK)
- username
- email
- createdAt

Post
- postId (PK)
- userId (FK)
- content
- mediaUrl
- timestamp

Follow
- followerId (FK)
- followeeId (FK)
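
The entities above could be expressed as a relational schema. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the column types, the uniqueness constraints, the index, and the name `created_at` for the post timestamp are assumptions rather than part of the original model.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE,
    email      TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    content    TEXT NOT NULL,
    media_url  TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- the Post "timestamp" field
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(user_id),
    followee_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (follower_id, followee_id)
);
-- Supports "latest posts by author", the core of a feed query.
CREATE INDEX idx_posts_user_time ON posts (user_id, created_at);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(SCHEMA)

conn.executemany(
    "INSERT INTO users (user_id, username, email) VALUES (?, ?, ?)",
    [(1, "alice", "alice@example.com"), (2, "bob", "bob@example.com")],
)
conn.execute("INSERT INTO follows (follower_id, followee_id) VALUES (1, 2)")
conn.execute("INSERT INTO posts (user_id, content) VALUES (2, 'hello')")

# Feed for user 1: posts written by the users they follow, newest first.
feed = conn.execute(
    """
    SELECT p.post_id, p.content
    FROM posts AS p
    JOIN follows AS f ON f.followee_id = p.user_id
    WHERE f.follower_id = ?
    ORDER BY p.created_at DESC
    """,
    (1,),
).fetchall()
print(feed)
```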

Step 3.3: System Architecture (5-10 min)

Draw boxes and arrows! Start with these components:

[Client] → [Load Balancer] → [Web Servers] → (Cache) → (Database)
                                   ↓
                            [Message Queue] → [Workers]

Progressive refinement:

  1. Version 1: Single server (everything on one box)
  2. Version 2: Separate web and data tier
  3. Version 3: Add cache and CDN
  4. Version 4: Horizontal scaling, sharding

Components to consider:

  • Load Balancer (distribute traffic)
  • Web/App Servers (business logic)
  • Cache (Redis/Memcached)
  • Database (SQL vs NoSQL?)
  • CDN (static content)
  • Message Queue (async processing)
  • Object Storage (S3 for media)

Always explain:

  • Why each component?
  • What does it solve?
  • What are alternatives?

Step 4: Deep Dive (15-20 min)

The interviewer will usually steer what to focus on. Expect to cover 2-3 areas.

Common Deep Dive Topics

1. Database Schema & Scaling

  • SQL vs NoSQL choice
  • Indexing strategy
  • Sharding/partitioning key
  • Replication setup
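
As one illustration of choosing a sharding key, the sketch below routes a key to a shard with a stable hash. It assumes simple hash-modulo partitioning, which is only one of several options; the function name and key format are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), whose output for
    strings varies between Python processes.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("user:42", 8))
```

The trade-off worth mentioning: with hash-modulo, changing the number of shards remaps most keys, which is why consistent hashing is often raised as an alternative.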

2. Caching Strategy

  • What to cache? (hot data)
  • Cache write and invalidation policy (write-through, write-back, TTL expiry)
  • Cache consistency
  • Cache eviction (LRU, LFU)
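
To illustrate LRU eviction from the list above, here is a minimal sketch built on `collections.OrderedDict`. It is a single-process toy, not a replacement for Redis or Memcached, and the class name is illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()  # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # evicts "b"
```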

3. Handling Hotspots

  • Celebrity problem (one user with millions of followers)
  • Solutions: Separate handling, rate limiting, fan-out on read

4. Consistency vs Availability

  • CAP theorem trade-offs
  • Strong vs eventual consistency
  • When to choose what?

5. Scalability Bottlenecks

  • Database becomes bottleneck → Sharding, read replicas
  • Single point of failure → Redundancy
  • Network bandwidth → CDN, compression

6. Real-time vs Batch

  • Push vs pull notifications
  • WebSockets vs polling
  • Stream processing vs batch jobs

How to Deep Dive

Pattern:

  1. Identify the problem: “How do we handle celebrity users with 50M followers?”
  2. Propose solution: “We can fan-out on read for celebrities instead of write”
  3. Discuss trade-offs: “This adds read latency but reduces write amplification”
  4. Consider alternatives: “Alternative is to cache celebrity feeds separately”

Always ask: “Which area would you like me to focus on?”

Step 5: Wrap Up (5-7 min)

Don’t skip this! Shows you think beyond the design.

Discuss Bottlenecks

  • “The database might become a bottleneck at 100K QPS”
  • “We should monitor cache hit rate”
  • “Network bandwidth could be an issue for video”

Monitoring & Alerting

  • Key metrics to track (QPS, latency, error rate, cache hit rate)
  • Alerting thresholds
  • Logging and tracing
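
Latency metrics are usually reported as percentiles (for example p99) rather than averages. The sketch below computes one using the nearest-rank method, which is an assumption here; monitoring systems differ in how they interpolate or approximate percentiles.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic samples: 1 ms .. 100 ms
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50 = {p50} ms, p99 = {p99} ms")
```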

Failure Scenarios

  • “If a database replica fails, traffic routes to healthy replicas”
  • “If cache goes down, we fall back to database (with rate limiting)”
  • “Multi-region setup for disaster recovery”

Future Improvements

  • “We could add machine learning for personalized recommendations”
  • “Implement GraphQL for flexible queries”
  • “Add A/B testing framework”

Error Cases

  • “Handle duplicate requests with idempotency keys”
  • “Rate limiting to prevent abuse”
  • “Graceful degradation if downstream services fail”
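
The idempotency-key idea above can be sketched as a thin wrapper that replays the stored response when a client retries with the same key. This is an illustration only: the class and function names are hypothetical, the store is a plain dict, and a real service would need a shared store with expiry plus an atomic check-and-set to handle concurrent duplicates.

```python
from typing import Callable, Dict, List


class IdempotentHandler:
    """Runs the wrapped handler at most once per idempotency key."""

    def __init__(self, handler: Callable[[dict], dict]) -> None:
        self._handler = handler
        self._responses: Dict[str, dict] = {}  # key -> first response

    def handle(self, idempotency_key: str, request: dict) -> dict:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # retry: replay, no side effects
        response = self._handler(request)
        self._responses[idempotency_key] = response
        return response


created: List[str] = []


def create_post(request: dict) -> dict:
    created.append(request["content"])
    return {"postId": len(created)}


api = IdempotentHandler(create_post)
first = api.handle("key-123", {"content": "hello"})
retry = api.handle("key-123", {"content": "hello"})  # same key, e.g. after a timeout
```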

🎯 The Framework Checklist

Use this for every problem:

  • Clarify requirements (functional + non-functional)
  • Write down assumptions (scale, SLA, consistency)
  • Do estimations (QPS, storage, bandwidth)
  • Define APIs (2-3 core endpoints)
  • Define data model (key entities)
  • Draw high-level design (start simple)
  • Explain each component (why it’s needed)
  • Identify bottlenecks (what will break first?)
  • Deep dive (2-3 areas based on interviewer interest)
  • Discuss trade-offs (pros and cons)
  • Wrap up (monitoring, failure scenarios, improvements)

💬 Communication Tips

Do’s ✅

  • Think out loud: Share your reasoning
  • Ask questions: Engage with interviewer
  • Start simple: Build complexity gradually
  • Justify decisions: Explain why, not just what
  • Discuss trade-offs: Every choice has pros/cons
  • Be flexible: Adapt based on feedback
  • Draw clearly: Label everything
  • Manage time: Keep track, don’t spend 30 min on one area

Don’ts ❌

  • Don’t stay silent: Interviewer can’t help if they don’t know your thinking
  • Don’t jump to code: This is architecture, not coding
  • Don’t over-engineer: Start with MVP
  • Don’t ignore requirements: Clarify before designing
  • Don’t say “I don’t know” and stop: Say “I don’t know, but here’s my reasoning…”
  • Don’t argue: Listen to hints, they’re helping you
  • Don’t go too deep too fast: High-level first

🎨 Drawing Conventions

Components

[Load Balancer]  - Square brackets for services
(Cache)          - Parentheses for data stores
{S3}             - Braces for external services

Data Flow

→  Data flow direction
↔  Bidirectional
⚡ Async/message queue
🔄 Replication

Labels

Always label:
- Component names
- Protocols (HTTP, TCP, WebSocket)
- Data types (JSON, binary)
- Numbers (QPS, latency)

📝 Example: Design Twitter

Step 1: Requirements (5 min)

Functional:

  • Post tweets (text, 280 chars)
  • Follow users
  • View timeline (tweets from followed users)

Non-Functional:

  • 100M DAU
  • Fast timeline load (p99 < 1s)
  • Eventual consistency OK
  • 99.9% availability

Step 2: Estimation (3 min)

Users: 100M DAU
Tweets: 100M tweets/day (assume 1 tweet per user per day)
Write QPS: 100M / 86400 ≈ 1,200 tweets/sec
Read QPS: 10x writes ≈ 12K timeline views/sec
Storage: 100M tweets × 280 bytes ≈ 28GB/day (text only, assuming ~1 byte per character)

Step 3: High-level (10 min)

APIs:

POST /v1/tweets
GET /v1/timeline/:userId
POST /v1/follow/:userId

Architecture:

[Client] → [LB] → [Web Servers] → (Redis Cache) → (PostgreSQL)
                        ↓
                   [Kafka] → [Timeline Service]
                        ↓
                   (Timeline Cache)

Step 4: Deep Dive (15 min)

Fan-out approaches:

  • Fan-out on write (push to all followers)
  • Fan-out on read (compute on timeline load)
  • Hybrid (celebrities use pull, regular users use push)
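
The hybrid approach above can be sketched in a few lines. This is a toy in-memory model under stated assumptions: the class name, the tuple layout, and the tiny `CELEBRITY_THRESHOLD` are hypothetical, and a real system would use far larger thresholds and persistent stores.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, kept tiny for the example


class HybridTimeline:
    def __init__(self) -> None:
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.inbox = defaultdict(list)            # user -> tweets pushed at write time
        self.celebrity_posts = defaultdict(list)  # author -> tweets pulled at read time
        self._seq = 0                             # monotonically increasing tweet id

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author: str, text: str) -> None:
        self._seq += 1
        tweet = (self._seq, author, text)
        if self.is_celebrity(author):
            # Fan-out on read: store once, merge when a follower loads the timeline.
            self.celebrity_posts[author].append(tweet)
        else:
            # Fan-out on write: push a copy to every follower's inbox.
            for follower in self.followers[author]:
                self.inbox[follower].append(tweet)

    def timeline(self, user: str, limit: int = 10) -> list:
        tweets = list(self.inbox[user])
        for author in self.following[user]:
            tweets.extend(self.celebrity_posts.get(author, []))
        return sorted(tweets, reverse=True)[:limit]  # newest first
```

The write path for a celebrity stays O(1) regardless of follower count, at the cost of a merge step on every timeline read, which is the trade-off described in the deep-dive pattern.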

Caching strategy:

  • Cache timelines in Redis
  • TTL of 1 hour
  • Cache miss → rebuild from DB
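
The caching strategy above is essentially a cache-aside read with a TTL. The sketch below uses a plain dict as a stand-in for Redis; the class name, the `load_from_db` callback, and the injectable clock are assumptions made so the example is self-contained and testable.

```python
import time
from typing import Callable, Dict, List, Tuple


class TimelineCache:
    """Cache-aside lookup: serve from cache, rebuild from the DB on miss or expiry."""

    def __init__(self, load_from_db: Callable[[int], List[str]],
                 ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[int, Tuple[float, List[str]]] = {}  # user_id -> (expires_at, timeline)

    def get(self, user_id: int) -> List[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]             # cache hit, still fresh
        timeline = self._load(user_id)  # miss or expired: rebuild from the DB
        self._entries[user_id] = (now + self._ttl, timeline)
        return timeline
```

A one-hour TTL bounds staleness but does not remove it; pairing the TTL with explicit invalidation on new tweets is a natural follow-up to discuss.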

Step 5: Wrap Up (7 min)

Bottlenecks: Celebrity problem, database writes
Monitoring: Tweet QPS, timeline latency, cache hit rate
Future: Recommendation algorithm, ads, analytics


🎓 Practice Recommendations

Week 1: Practice the framework with 3-4 simple systems (URL shortener, Pastebin)
Week 2: Medium complexity (Twitter, Instagram, YouTube)
Week 3: Complex systems (Uber, Netflix, E-commerce)
Week 4: Mock interviews with peers, timed practice

After each practice:

  1. Did I clarify requirements?
  2. Did I do estimations?
  3. Did I explain my reasoning?
  4. Did I discuss trade-offs?
  5. What would I do differently?

Remember: There’s no one “correct” answer. Interviewers assess:

  • Problem-solving approach
  • Communication skills
  • Technical knowledge
  • Trade-off analysis
  • Ability to handle ambiguity

Good luck! 🚀


Last Updated: 2026-04-08