Chapter 14: Design YouTube (Video Streaming Service)


Status: 🟩 Interview ready - Very common question!
Difficulty: Hard
Time to complete: 45 min read + practice


Overview

YouTube is one of the most visited websites in the world. Designing a video streaming service means solving hard problems: uploading large files reliably, transcoding video into many formats efficiently, and delivering it to millions of concurrent viewers with low latency.

Why this matters:

  • One of the most common hard system design questions
  • Covers CDN, blob storage, message queues, transcoding pipelines
  • Real-world: YouTube, Netflix, TikTok, Twitch, Hulu

Problem Statement

Design a video streaming service that:

  • Allows users to upload videos
  • Streams videos to users (mobile, web, smart TV)
  • Supports search and recommendations
  • Handles multi-resolution playback (360p to 4K)
  • Available internationally with low latency

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

  • Upload and stream only, or also search/recommendations? → Yes to all
  • What devices? → Mobile, web, smart TV
  • Video resolution support? → Multiple: 360p, 480p, 720p, 1080p, 4K
  • International users? → Yes, globally available
  • Max upload size? → 1 GB max per video
  • Live streaming? → No, focus on pre-recorded (on-demand)

Scope:

  • Upload videos with progress tracking
  • Transcode videos into multiple resolutions/formats
  • Stream videos globally with low buffering
  • Search video library
  • View count, likes, comments (basic metadata)

Non-Functional Requirements

  • Availability: 99.99% uptime target (playback tolerates brief degradation; it is not a mission-critical path)
  • Reliability: No data loss for uploaded videos
  • Scalability: 5M DAU (peak concurrent viewers are a subset of DAU, so concurrency is bounded by this figure)
  • Low latency streaming: Fast startup, smooth playback
  • Durability: Multiple redundant copies of video storage

Scale Estimation

Users:
  5M DAU
  10% upload videos: 500K uploaders/day
  Upload rate: 300 hours of video per minute

Storage (uploads):
  300 hours/min × 60 min/hr × 24 hr/day = 432,000 hours/day
  1 hour video at 1080p ≈ 4 GB compressed
  432,000 hours × 4 GB = ~1.7 PB/day raw storage

Transcoding:
  Each video → 5 formats (360p, 480p, 720p, 1080p, 4K) ≈ 5× storage (rough estimate; actual size varies by resolution)
  1.7 PB × 5 = ~8.5 PB/day total transcoded storage

Streaming bandwidth:
  5M DAU × 5 min avg watch = 25M min/day
  = 25M × 60 sec × 4 Mbps (avg 720p) / 8 bits per byte = 750 TB/day CDN traffic

CDN bandwidth:
  750 TB / 86,400 sec ≈ 70 Gbps average egress
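
The arithmetic above can be re-checked with a few lines of Python. This is a minimal sketch: the constant and function names are illustrative, the inputs are the assumptions stated in this section, and decimal units (1 PB = 1,000,000 GB) are assumed.

```python
# Back-of-envelope estimates; every input is an assumption from the text above.
UPLOAD_HOURS_PER_MIN = 300
GB_PER_HOUR_1080P = 4
DAU = 5_000_000
WATCH_MIN_PER_USER = 5
AVG_BITRATE_MBPS = 4


def daily_upload_pb() -> float:
    """Raw upload volume per day in petabytes (decimal units)."""
    hours_per_day = UPLOAD_HOURS_PER_MIN * 60 * 24       # 432,000 hours
    return hours_per_day * GB_PER_HOUR_1080P / 1_000_000  # GB -> PB


def daily_egress_tb() -> float:
    """Streaming egress per day in terabytes."""
    watch_seconds = DAU * WATCH_MIN_PER_USER * 60
    megabytes = watch_seconds * AVG_BITRATE_MBPS / 8      # megabits -> megabytes
    return megabytes / 1_000_000                          # MB -> TB


def average_egress_gbps() -> float:
    """Average egress rate in gigabits per second."""
    bits_per_day = daily_egress_tb() * 1e12 * 8
    return bits_per_day / 86_400 / 1e9


print(round(daily_upload_pb(), 2), "PB/day uploaded")
print(round(daily_egress_tb()), "TB/day egress")
print(round(average_egress_gbps(), 1), "Gbps average")
```

Keeping the inputs as named constants makes it easy to redo the estimate when the interviewer changes an assumption, such as average watch time.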

Step 2: High-Level Design (10 min)

Two Core Flows

Flow 1: Video Upload

Client → Load Balancer → API Servers → Metadata DB (MySQL)
                              ↓
                     Original Storage (S3)
                              ↓
                     Transcoding Pipeline
                              ↓
                    Transcoded Storage (S3)
                              ↓
                       CDN Distribution

Flow 2: Video Streaming

Client → CDN Edge Server (cache hit) → Video Content
Client → CDN Edge Server (cache miss) → Origin (S3) → CDN → Client

Component Overview

Component                | Purpose
-------------------------|-----------------------------------------------
Load Balancer            | Distribute incoming traffic
API Servers              | Handle upload requests, metadata CRUD
Original Storage (S3)    | Store raw uploaded videos
Transcoding Service      | Convert video to multiple formats (CPU-heavy)
Transcoded Storage (S3)  | Store output for each resolution
CDN                      | Deliver video content globally, low latency
Metadata DB (MySQL)      | Video info, user info, view counts, likes
Message Queue            | Decouple upload from transcoding
Completion Queue         | Signal when transcoding done, trigger CDN push

High-Level Architecture Diagram

┌─────────────┐     ┌──────────┐     ┌─────────────────┐
│   Uploaders │────→│  Load    │────→│   API Servers   │
└─────────────┘     │ Balancer │     └────────┬────────┘
                    └──────────┘              │
                                    ┌─────────┼──────────┐
                                    ▼         ▼          ▼
                             ┌──────────┐ ┌───────┐ ┌────────┐
                             │Metadata  │ │  S3   │ │  MQ    │
                             │  DB      │ │(raw)  │ │(jobs)  │
                             └──────────┘ └───────┘ └───┬────┘
                                                        │
                                              ┌─────────▼───────┐
                                              │  Transcoding    │
                                              │  Workers (DAG)  │
                                              └─────────┬───────┘
                                                        │
                                              ┌─────────▼───────┐
                                              │  S3 Transcoded  │
                                              └─────────┬───────┘
                                                        │
                                              ┌─────────▼───────┐
                                              │      CDN        │
                                              │  (Edge Servers) │
                                              └─────────┬───────┘
                                                        │
┌─────────────┐     ┌──────────┐                        │
│   Viewers   │────→│  CDN     │◄───────────────────────┘
└─────────────┘     │  Edge    │
                    └──────────┘

Step 3: Deep Dive (20 min)

Video Transcoding Pipeline

Why Transcode?

Problem: A video uploaded at 4K, H.264, MOV format:
  - Won't play on older mobile devices (no 4K support)
  - Kills mobile data (4K = ~25 Mbps vs 360p = ~0.5 Mbps)
  - Browser support for MOV is unreliable (MP4 or WebM is the safe choice)
  - Smart TVs may need HLS format specifically

Solution: Transcode once into ALL needed formats
  1080p MP4 (H.264) → for web/desktop
  720p  MP4 (H.264) → for HD mobile
  480p  MP4 (H.264) → for standard mobile
  360p  MP4 (H.264) → for low-bandwidth mobile
  4K    MP4 (H.265) → for premium devices
  HLS/DASH manifests → for adaptive streaming

DAG Model for Transcoding Pipeline

Instead of one sequential pipeline, use a Directed Acyclic Graph (DAG) to parallelize independent steps:

                    ┌──────────────┐
                    │ Original S3  │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Preprocessor │  (split video into segments, validate)
                    └──────┬───────┘
           ┌───────────────┼───────────────────┐
           ▼               ▼                   ▼
    ┌─────────────┐  ┌───────────┐    ┌─────────────────┐
    │ Video       │  │  Audio    │    │  Thumbnail      │
    │ Encoder     │  │ Encoder   │    │  Generator      │
    └──────┬──────┘  └─────┬─────┘    └────────┬────────┘
           │               │                   │
    ┌──────▼──────┐  ┌─────▼─────┐    ┌────────▼────────┐
    │ 360p/480p/  │  │  AAC/MP3  │    │  Thumbnail S3   │
    │ 720p/1080p  │  │  output   │    └─────────────────┘
    └──────┬──────┘  └─────┬─────┘
           │               │
           └───────┬───────┘
                   ▼
           ┌───────────────┐
           │  Watermarker  │  (add logo/DRM)
           └───────┬───────┘
                   ▼
           ┌───────────────┐
           │  Manifest     │  (generate HLS/DASH .m3u8 / .mpd)
           │  Generator    │
           └───────┬───────┘
                   ▼
           ┌───────────────┐
           │ Transcoded S3 │
           └───────────────┘

Key insight: Video encoding, audio encoding, and thumbnail generation are independent of each other, so they can run in parallel for maximum throughput.
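
One way to see which stages can overlap is to feed the dependency graph to a topological sorter and read off the "waves" of ready stages. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The stage names are hypothetical labels that mirror the diagram, not identifiers from any real pipeline.

```python
from graphlib import TopologicalSorter

# stage -> set of stages it depends on (hypothetical names mirroring the diagram)
PIPELINE = {
    "preprocess": set(),
    "video_encode": {"preprocess"},
    "audio_encode": {"preprocess"},
    "thumbnails": {"preprocess"},
    "watermark": {"video_encode", "audio_encode"},
    "manifest": {"watermark"},
}


def parallel_waves(graph):
    """Group stages into waves; all stages within one wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # raises CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every stage whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(parallel_waves(PIPELINE), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The second wave contains the three independent stages, which is exactly the parallelism the diagram shows. A real scheduler would dispatch each ready stage to a worker instead of collecting names in a list.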

Pipeline Stages Explained

Stage               | What it does                                                                         | Tool
--------------------|--------------------------------------------------------------------------------------|--------------------------
Preprocessor        | Split video into 2-minute segments (GOP-aligned), validate format, extract metadata | FFmpeg
Video Encoder       | Re-encode at each target resolution/bitrate                                          | FFmpeg (libx264, libx265)
Audio Encoder       | Normalize audio, convert to AAC/MP3                                                  | FFmpeg
Thumbnail Generator | Extract frames at key timestamps for preview                                         | FFmpeg
Watermarker         | Overlay platform logo, embed DRM info                                                | FFmpeg + DRM SDK
Manifest Generator  | Create HLS .m3u8 or MPEG-DASH .mpd files                                             | Custom script

FFmpeg command example:

# Transcode to 720p H.264 MP4
ffmpeg -i input.mov \
  -c:v libx264 -vf scale=1280:720 \
  -b:v 2500k \
  -c:a aac -b:a 128k \
  output_720p.mp4
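
To cover several resolutions, the same command can be generated per rendition. The sketch below only builds the argument lists and does not run FFmpeg. The `LADDER` table is a hypothetical bitrate ladder: only the 720p row comes from the command above, and the other bitrates are illustrative assumptions.

```python
# Hypothetical bitrate ladder; only the 720p row comes from the command above.
LADDER = {
    "360p": (640, 360, "800k"),
    "720p": (1280, 720, "2500k"),
    "1080p": (1920, 1080, "5000k"),
}


def ffmpeg_args(src: str, label: str) -> list[str]:
    """Build the FFmpeg argument list for one rendition (nothing is executed)."""
    width, height, video_bitrate = LADDER[label]
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-vf", f"scale={width}:{height}",
        "-b:v", video_bitrate,
        "-c:a", "aac", "-b:a", "128k",
        f"output_{label}.mp4",
    ]


for label in LADDER:
    print(" ".join(ffmpeg_args("input.mov", label)))
```

Passing each list to `subprocess.run(args, check=True)` would run the encode. Argument lists avoid shell quoting problems with unusual file names.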

Adaptive Bitrate Streaming (HLS / DASH)

The problem: Network speed changes mid-stream (Wi-Fi → cellular → congested).

Solution: Adaptive Bitrate Streaming (ABR)

How HLS works:
1. Video split into small segments (2-10 sec each)
2. Same content encoded at multiple bitrates
3. Master playlist (.m3u8) lists all quality levels
4. Player starts at medium quality, monitors download speed
5. If download fast → switch UP (higher quality)
6. If buffering → switch DOWN (lower quality)
7. Seamless quality switching with no interruption
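
Steps 4 to 6 amount to a throughput-based selection rule. The sketch below is a simplified illustration, not the algorithm of any particular player: the rendition bitrates and the 0.8 safety factor are assumptions, and real players also take buffer level into account.

```python
# (bandwidth in bits/s, name) in ascending order; the values are illustrative assumptions.
RENDITIONS = [
    (400_000, "360p"),
    (1_500_000, "720p"),
    (6_000_000, "1080p"),
]


def pick_rendition(measured_bps: float, safety: float = 0.8) -> str:
    """Return the highest rendition whose bitrate fits within a safety margin."""
    budget = measured_bps * safety
    chosen = RENDITIONS[0][1]             # always fall back to the lowest tier
    for bandwidth, name in RENDITIONS:    # ascending, so the last fit is the best fit
        if bandwidth <= budget:
            chosen = name
    return chosen


print(pick_rendition(10_000_000))  # fast connection
print(pick_rendition(2_000_000))   # mid-range connection
print(pick_rendition(300_000))     # congested connection
```

The safety factor leaves headroom so that a small dip in throughput does not immediately stall playback.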

HLS Manifest example:
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=640x360
360p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=1280x720
720p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
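
A master playlist like the one above is plain text, so a manifest generator can be very small. This is a minimal sketch that emits only the two tags shown in the example. The function name is illustrative, and production manifests usually carry more attributes, such as `CODECS`.

```python
def master_playlist(variants):
    """Build an HLS master playlist.

    variants: iterable of (bandwidth_bps, width, height, uri) tuples.
    """
    lines = ["#EXTM3U"]
    for bandwidth, width, height, uri in variants:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(uri)                 # the URI line must follow its tag directly
    return "\n".join(lines) + "\n"


print(master_playlist([
    (400_000, 640, 360, "360p/playlist.m3u8"),
    (1_500_000, 1280, 720, "720p/playlist.m3u8"),
    (6_000_000, 1920, 1080, "1080p/playlist.m3u8"),
]))
```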

HLS vs DASH comparison:

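Aspect            | HLS                           | MPEG-DASH
------------------|-------------------------------|----------------------------------------------------
Developed by      | Apple                         | MPEG group (open standard)
Apple devices     | Native support                | No native support; needs a JavaScript (MSE) player
Android           | Supported                     | Supported (e.g. via ExoPlayer)
Segment format    | MPEG-TS or fMP4               | fMP4
Codec flexibility | Lower                         | Higher (codec-agnostic)
Typical use       | Apple ecosystem (iOS, Safari) | Web, Android, smart TVs
Recommendation    | iOS/Safari                    | Cross-platform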

In practice: Support both. Detect device, serve appropriate manifest.


Video Storage Architecture

Upload: Pre-signed URLs

Problem: Videos are up to 1 GB. API servers should not be the transfer bottleneck.

Solution: Pre-signed URLs. The client uploads directly to S3.

1. Client sends POST /upload/init to API server
2. API server generates pre-signed S3 URL (valid 1 hour)
3. API server returns URL to client
4. Client uploads directly to S3 using pre-signed URL
5. S3 notifies completion event → triggers transcoding queue

Client → API Server: "I want to upload video.mp4"
API Server → S3: generate_presigned_url(bucket, key, ttl=3600)
API Server → Client: { upload_url: "https://s3.amazonaws.com/...?sig=xxx" }
Client → S3: PUT video.mp4 (direct, large transfer bypasses API servers)
S3 → SNS/SQS: { event: "ObjectCreated", key: "raw/video123.mp4" }
SQS → Transcoding Workers: pick up job
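
The idea behind a signed URL can be illustrated with a plain HMAC. The sketch below is a simplified stand-in and not AWS Signature Version 4: the secret, the host `storage.example.com`, and the query parameter names are all hypothetical. In a real system the URL would come from the AWS SDK rather than hand-written signing code.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"server-side-secret"  # hypothetical; held by the server, never sent to clients


def _signature(key: str, expires: int) -> str:
    message = f"PUT\n{key}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_upload_url(key: str, ttl_sec: int, now: float) -> str:
    """Return a URL that authorizes one PUT to `key` until now + ttl_sec."""
    expires = int(now) + ttl_sec
    query = urlencode({"expires": expires, "sig": _signature(key, expires)})
    return f"https://storage.example.com/{key}?{query}"


def verify_upload_url(url: str, now: float) -> bool:
    """Storage-side check: signature matches and the URL has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    key = parsed.path.lstrip("/")
    expires = int(params["expires"][0])
    expected = _signature(key, expires)
    return now < expires and hmac.compare_digest(expected, params["sig"][0])


url = sign_upload_url("raw/video123.mp4", ttl_sec=3600, now=1000.0)
print(verify_upload_url(url, now=1500.0))   # within the hour
print(verify_upload_url(url, now=5000.0))   # after expiry
```

Because the object key and the expiry are both part of the signed message, changing either one invalidates the signature. That is what scopes the URL to one object and one time window.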

Why pre-signed URLs?

  • API servers stay out of the data path for large file transfers
  • S3 supports multipart upload, so parts can be sent in parallel
  • Client tracks progress from the bytes it has sent to S3
  • More secure: URL is scoped to one object, expires quickly

Resumable Chunked Upload

Problem: 1 GB upload on mobile can fail halfway. Should not restart from scratch.

Solution: Multipart upload. Split the file into 5 MB chunks.

Upload flow:
1. Client initiates multipart upload → gets upload_id
2. Client splits file into chunks (e.g., 5 MB each)
3. Client uploads chunks IN PARALLEL (e.g., 4 at a time)
4. Each chunk returns an ETag
5. On failure: only re-upload failed chunks (not whole file)
6. Client sends CompleteMultipartUpload with all ETags

Resuming:
- Client stores chunk completion status locally
- On resume: ListParts → see which ETags received
- Only upload missing chunks

File: video.mp4 (200 MB)
Chunks: Part1(5MB) Part2(5MB) ... Part40(5MB)

Thread 1: Part1 ✅ Part5 ✅ Part9 ✅ ...
Thread 2: Part2 ✅ Part6 ❌ retry Part6 ✅ ...
Thread 3: Part3 ✅ Part7 ✅ Part11 ✅ ...
Thread 4: Part4 ✅ Part8 ✅ Part12 ✅ ...

CompleteMultipartUpload([ETag1, ETag2, ..., ETag40])
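
The client-side bookkeeping for chunking and resuming is small. The sketch below makes no network calls and only plans part boundaries and works out what remains after an interruption. The function names are illustrative, and the 5 MB chunk is taken as 5 MiB.

```python
import math

PART_SIZE = 5 * 1024 * 1024  # 5 MiB, the chunk size used in the text


def plan_parts(file_size: int, part_size: int = PART_SIZE):
    """Return [(part_number, offset, length)], with part numbers starting at 1."""
    count = math.ceil(file_size / part_size)
    return [
        (n + 1, n * part_size, min(part_size, file_size - n * part_size))
        for n in range(count)
    ]


def missing_parts(plan, uploaded_part_numbers):
    """Parts still to send on resume, given the part numbers the server reports."""
    done = set(uploaded_part_numbers)
    return [part for part in plan if part[0] not in done]


plan = plan_parts(200 * 1024 * 1024)                    # the 200 MB example above
already_uploaded = [n for n in range(1, 41) if n != 6]  # Part6 failed
print(len(plan), "parts planned")
print("to resend:", missing_parts(plan, already_uploaded))
```

On resume, the list of uploaded part numbers would come from the storage service (the ListParts step above), so the client does not have to trust its own local state.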

CDN Strategy

How CDN works for video:

Edge Server (CDN PoP - e.g., Singapore):
  - Popular videos cached here (80/20 rule: 20% videos = 80% traffic)
  - Client → CDN edge → instant serve from cache

Cache miss flow:
  - CDN edge → Origin (S3) → fetch + cache → serve client
  - First viewer in region = slow, all others = fast

Geographic distribution:
  - AWS CloudFront: 400+ PoPs worldwide
  - Popular video in India: cached at Mumbai, Chennai, Hyderabad PoPs
  - No need to fetch from US-East S3 on every request
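
The hit and miss behaviour of a single edge location can be modelled with a small LRU cache. This is a teaching model only: real CDNs use more elaborate admission and eviction policies, and the class and parameter names here are hypothetical.

```python
from collections import OrderedDict


class EdgeCache:
    """Tiny LRU cache standing in for one CDN PoP (a teaching model, not a real CDN)."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, segment_key):
        if segment_key in self.store:
            self.store.move_to_end(segment_key)     # mark as most recently used
            self.hits += 1
            return self.store[segment_key]
        self.misses += 1
        data = self.fetch_from_origin(segment_key)  # slow path: origin round trip
        self.store[segment_key] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)          # evict the least recently used
        return data


origin_requests = []


def origin(key):
    origin_requests.append(key)
    return f"bytes-of-{key}"


pop = EdgeCache(capacity=2, fetch_from_origin=origin)
for _ in range(3):
    pop.get("video123/720p/seg1.ts")
print(pop.misses, "miss,", pop.hits, "hits,", len(origin_requests), "origin request")
```

Only the first viewer pays the origin round trip. Every later request for the same segment is served from the edge until it is evicted.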

CDN Cost Optimization (critical topic in interviews):

Problem: CDN egress is VERY expensive (~$0.08/GB)
750 TB/day × $0.08/GB = $60,000/day just for CDN!

Strategy 1: Only cache popular videos on CDN
  - Top 20% of videos = 80% of traffic (long-tail distribution)
  - Long-tail (old/niche videos) → serve from S3 directly
  - CDN holds only ~20% of the catalog while still serving ~80% of traffic

Strategy 2: Move unpopular videos to cold storage
  - Videos not watched in 30 days → S3 Glacier Instant Retrieval
  - Cost: $0.004/GB/month vs $0.023/GB/month (standard S3)
  - ~83% storage cost reduction for archived videos

Strategy 3: Regional CDN caching
  - Pre-warm CDN with trending videos before viral spike
  - ML model predicts which videos will trend
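
The cost figures in this section can be reproduced with a short calculation. The prices are the approximate values quoted above and should be treated as assumptions; the function names are illustrative.

```python
EGRESS_USD_PER_GB = 0.08             # approximate CDN egress price quoted above
S3_STANDARD_USD_GB_MONTH = 0.023
S3_GLACIER_IR_USD_GB_MONTH = 0.004


def daily_cdn_cost(egress_tb_per_day: float) -> float:
    """CDN egress cost in USD per day (1 TB = 1,000 GB)."""
    return egress_tb_per_day * 1_000 * EGRESS_USD_PER_GB


def cold_storage_saving_pct() -> float:
    """Percentage saved per GB-month by moving from standard to cold storage."""
    return (1 - S3_GLACIER_IR_USD_GB_MONTH / S3_STANDARD_USD_GB_MONTH) * 100


print(f"${daily_cdn_cost(750):,.0f} per day in egress")
print(f"{cold_storage_saving_pct():.1f}% cheaper per GB-month in cold storage")
```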

Metadata DB Design

Storage by concern (a relational store, plus specialized stores for analytics and search):

Concern                                               | Database                | Why
------------------------------------------------------|-------------------------|--------------------------------------------
Video metadata (title, description, tags, owner)      | MySQL (relational)      | Structured, joins with users table
User data (profile, subscriptions, playlists)         | MySQL                   | ACID transactions
Video analytics (views, likes, trending scores)       | Cassandra (wide-column) | High write throughput, time-series friendly
Search index (full-text search on title/description)  | Elasticsearch           | Full-text search with ranking

Core tables (MySQL):

-- Video metadata
CREATE TABLE videos (
    video_id     VARCHAR(36) PRIMARY KEY,
    user_id      VARCHAR(36) NOT NULL,
    title        VARCHAR(500) NOT NULL,
    description  TEXT,
    status       ENUM('processing', 'ready', 'failed'),
    duration_sec INT,
    size_bytes   BIGINT,
    created_at   TIMESTAMP,
    INDEX idx_user_id (user_id),
    INDEX idx_created_at (created_at)
);
 
-- Transcoded output per format
CREATE TABLE video_formats (
    format_id    INT AUTO_INCREMENT PRIMARY KEY,
    video_id     VARCHAR(36) NOT NULL,
    resolution   VARCHAR(10),  -- '360p', '720p', '1080p'
    format       VARCHAR(10),  -- 'mp4', 'hls', 'dash'
    s3_key       VARCHAR(500),
    size_bytes   BIGINT,
    FOREIGN KEY (video_id) REFERENCES videos(video_id)
);

Analytics with Cassandra:

-- View counts (high write throughput)
CREATE TABLE video_views (
    video_id    UUID,
    view_date   DATE,
    view_count  COUNTER,
    PRIMARY KEY (video_id, view_date)
);
 
-- Query: views last 30 days
SELECT SUM(view_count) FROM video_views
WHERE video_id = ? AND view_date >= '2026-03-13';

Error Handling & Reliability

Message Queue for Resilience

Without MQ (fragile):
  API Server → Transcoding Worker (direct call)
  If transcoding worker crashes → job lost ❌

With MQ (resilient):
  API Server → SQS/Kafka → Transcoding Workers
  If worker crashes → job remains in queue ✅
  Worker restarts → picks up job again ✅
  Dead Letter Queue (DLQ) → jobs that fail 3+ times
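
The "job remains in queue" behaviour comes from a visibility timeout: a received job is hidden for a while instead of being deleted. The sketch below is an in-memory model of that idea, not the SQS API. The class, its method names, and the limit of three receives are assumptions for illustration.

```python
class VisibilityQueue:
    """In-memory model of a queue with a visibility timeout (not the SQS API)."""

    def __init__(self, visibility_timeout, max_receives=3):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.jobs = {}            # job_id -> [invisible_until, receive_count]
        self.dead_letter = []

    def send(self, job_id):
        self.jobs[job_id] = [0.0, 0]

    def receive(self, now):
        """Hand out one visible job and hide it for visibility_timeout seconds."""
        for job_id, state in list(self.jobs.items()):
            invisible_until, receives = state
            if now < invisible_until:
                continue                        # another worker still holds this job
            if receives >= self.max_receives:
                del self.jobs[job_id]           # too many failed attempts
                self.dead_letter.append(job_id)
                continue
            state[0] = now + self.visibility_timeout
            state[1] = receives + 1
            return job_id
        return None

    def delete(self, job_id):
        """Worker acknowledges success; the job is gone for good."""
        self.jobs.pop(job_id, None)


queue = VisibilityQueue(visibility_timeout=30)
queue.send("transcode:video123")
print(queue.receive(now=0))    # worker A takes the job, then crashes
print(queue.receive(now=10))   # still hidden from worker B
print(queue.receive(now=31))   # timeout expired, worker B gets the job
```

A worker that finishes calls `delete`. A worker that crashes simply never does, and the job reappears once the timeout passes.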

Retry strategy:

Attempt 1: Immediate retry (worker crash, not user error)
Attempt 2: Retry after 30 seconds
Attempt 3: Retry after 5 minutes
Failed: Move to Dead Letter Queue
Alert: Notify engineering team via PagerDuty
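
The schedule above can be expressed as a small lookup. The sketch assumes one reading of the list: the first failure triggers the immediate retry, the second a 30-second wait, the third a 5-minute wait, and any further failure sends the job to the DLQ. The function name is illustrative.

```python
RETRY_DELAYS_SEC = [0, 30, 300]   # immediate, 30 seconds, 5 minutes (from the schedule above)


def next_step(failures: int):
    """Decide what to do after the given number of failures of one job.

    Returns ("retry", delay_sec) or ("dead_letter", None).
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    if failures <= len(RETRY_DELAYS_SEC):
        return ("retry", RETRY_DELAYS_SEC[failures - 1])
    return ("dead_letter", None)


for failures in range(1, 5):
    print(failures, next_step(failures))
```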

Video upload error handling:

Upload errors:
  Network timeout → resumable upload, retry from last chunk
  File format invalid → return 400, prompt user to convert
  File too large (>1GB) → return 413 Request Entity Too Large
  Storage full → return 507, trigger auto-scaling alert

Transcoding errors:
  Corrupt input file → mark video as FAILED, notify uploader
  Worker OOM → restart worker, retry job
  All retries exhausted → DLQ, human review

Safety: DRM and Watermarking

Digital Rights Management (DRM):

Problem: Premium content (movies, shows) must not be downloadable/copied

DRM solutions:
  FairPlay (Apple devices)   → AES-128 encrypted segments
  Widevine (Google/Android)  → Encrypted key exchange
  PlayReady (Microsoft)      → Windows/Xbox DRM

Flow:
  1. Encrypted video stored in CDN (no plaintext)
  2. Player sends license request to DRM License Server
  3. License server validates user subscription
  4. License server returns decryption key (short-lived)
  5. Player decrypts and plays segment-by-segment
  6. Key expires after session

Watermarking:

Visible watermark: Platform logo overlay (top-right corner)
Invisible watermark: Embed user ID in video (forensic)
  - If video leaks, can trace which account downloaded it
  - Commonly applied to pre-release screener copies
  - Imperceptible to human eye, detectable by algorithm

Design Summary

Final Architecture

┌───────────────────────────────────────────────────────────────────┐
│                         UPLOAD PATH                               │
│                                                                   │
│ Client ──→ LB ──→ API Server ──→ MySQL (metadata)                 │
│                       │                                           │
│                       ├──→ Cassandra (view counts)                │
│                       │                                           │
│                       └──→ S3 (pre-signed URL) ──→ raw video      │
│                                 │                                 │
│                                 └──→ SQS (transcode job)          │
│                                            │                      │
│                                   ┌────────▼────────┐             │
│                                   │  Transcoding    │             │
│                                   │  Workers (EC2   │             │
│                                   │  GPU instances) │             │
│                                   └────────┬────────┘             │
│                                            │                      │
│                                   ┌────────▼────────┐             │
│                                   │  S3 Transcoded  │             │
│                                   │  (per format)   │             │
│                                   └────────┬────────┘             │
└────────────────────────────────────────────┼──────────────────────┘
                                             │
                                  ┌──────────▼──────────┐
                                  │   CDN (CloudFront)  │
                                  │   400+ PoPs global  │
                                  └──────────┬──────────┘
                                             │
┌────────────────────────────────────────────▼──────────────────────┐
│                         STREAM PATH                               │
│                                                                   │
│  Viewer ──→ DNS (GeoDNS) ──→ Nearest CDN Edge                     │
│             Route 53           │                                  │
│                           Cache HIT? ──→ Stream video             │
│                           Cache MISS? ──→ S3 → Cache → Stream     │
│                                                                   │
│  Player: HLS/DASH adaptive bitrate streaming                      │
│    - Monitor download speed every 2 sec                           │
│    - Switch quality seamlessly                                    │
└───────────────────────────────────────────────────────────────────┘

Key Decisions Summary

Decision           | Choice                         | Reasoning
-------------------|--------------------------------|----------------------------------------------------
Video upload       | Pre-signed S3 URLs + multipart | API servers not bottleneck, resumable
Transcoding        | DAG pipeline + message queue   | Parallel stages, resilient to worker crashes
Transcoding format | HLS + DASH + MP4               | Cover all devices and adaptive bitrate
Video storage      | S3 blob storage                | Cost-effective, durable, CDN-compatible
CDN                | CloudFront / Akamai            | Global PoPs, low latency streaming
Metadata DB        | MySQL + Cassandra              | Relational for structure, wide-column for analytics
Error handling     | SQS + DLQ + retry              | No job loss, automatic recovery
Cost control       | Only popular content on CDN    | ~20% of videos drive ~80% of traffic (long tail)

Interview Questions & Answers

Q: Why use a DAG model for the transcoding pipeline?
A: DAG allows parallel execution of independent stages. Video encoding at different resolutions, audio encoding, and thumbnail generation are all independent, so they can run simultaneously on separate workers. A sequential pipeline would be 3-5× slower. DAG also makes it easy to add new stages (e.g., content moderation) without rewriting the whole pipeline.

Q: What is adaptive bitrate streaming and why is it important?
A: ABR (HLS/DASH) splits video into small segments (2-10s) encoded at multiple bitrates. The player monitors download speed and switches quality tier up or down seamlessly. This is critical for mobile users whose bandwidth changes constantly (Wi-Fi → LTE → congested network). Without ABR, a single fixed quality forces a bad trade-off: users either buffer constantly (if it is set too high) or get needlessly poor video (if it is set too low).

Q: How do you handle the CDN cost problem?
A: CDN egress is expensive (~$0.08/GB). The key insight is that video traffic follows a power-law distribution: 20% of videos drive 80% of views. Strategy: Only cache popular/trending videos on CDN (determined by view count threshold), serve long-tail content directly from S3. Also move videos not watched in 30+ days to S3 Glacier to reduce storage cost. The actual saving depends on how much traffic the long tail carries and on origin egress pricing, so quote it as an estimate rather than a fixed percentage.

Q: How do pre-signed URLs work for video upload?
A: The API server generates a time-limited (e.g., 1 hour) signed URL that allows the client to PUT one specific file directly to S3. The signature is computed using AWS credentials the client does not have. This means: (1) large file transfer bypasses API servers entirely, (2) the URL expires and can only be used for that one object, (3) S3 handles parallel multipart upload natively. After upload, S3 triggers an event (SNS/SQS) to kick off the transcoding pipeline.

Q: How would you handle a video that goes viral (sudden 1000× traffic spike)?
A: CDN handles most of the spike automatically, because a popular video is already cached at edge servers. If it's a new video not yet popular: the first few requests per PoP miss the cache and go to S3, after which the PoP serves it from cache. For upload spikes: auto-scaling transcoding workers (EC2 spot instances). For API layer: horizontal scaling behind load balancer. Pre-warm CDN for anticipated events (sports finals, product launches) by pre-fetching segments to edge PoPs.

Q: How do you ensure no video is lost if a transcoding worker crashes?
A: Message queue (SQS) provides durability. The transcoding job sits in SQS with a visibility timeout. When a worker picks up the job, SQS makes it invisible to other workers. If the worker crashes before sending acknowledgment, the visibility timeout expires, and the job reappears in the queue for another worker to pick up. After N failed attempts, the job moves to a Dead Letter Queue (DLQ) for human inspection.


Key Takeaways

  1. Separate upload and stream paths: different scaling characteristics, different optimizations
  2. DAG transcoding pipeline enables parallel processing, which is critical for throughput at scale
  3. Pre-signed URLs for large file upload: keep API servers out of the data path
  4. Chunked/multipart upload for resumability: never restart a 1 GB upload from scratch
  5. Adaptive bitrate streaming (HLS/DASH) is standard for all video delivery today
  6. CDN is the core of video delivery: 80-90% of video requests should be cache hits
  7. CDN cost optimization with long-tail strategy: only cache popular videos on CDN
  8. Message queue + DLQ for transcoding reliability: never lose a job to a worker crash


Practice this design! Very common hard interview question. Be ready to:

  1. Draw the full upload and stream path separately
  2. Explain the transcoding DAG and why it's parallel
  3. Discuss adaptive bitrate streaming (HLS/DASH)
  4. Talk through CDN cost optimization and long-tail strategy
  5. Handle failure scenarios (worker crash, upload failure, viral spike)

Last Updated: 2026-04-13