Chapter 14 Flashcards - Design YouTube

flashcards volume1 youtube video-streaming cdn storage

Why do we need to transcode video after upload?
?
Different devices, browsers, and network speeds need different formats and resolutions. A 4K H.264 MOV won’t play on older mobile browsers (needs MP4/WebM), burns through mobile data, and won’t work on all smart TVs. Transcoding each upload once into multiple resolutions (360p, 480p, 720p, 1080p, 4K) plus adaptive-streaming manifests (HLS/DASH) lets any device on any connection play smoothly.

What is a DAG (Directed Acyclic Graph) model for transcoding? Why use it?
?
DAG organizes transcoding as a graph of parallel + sequential stages. Independent stages (video encoding, audio encoding, thumbnail generation) run in PARALLEL simultaneously, then sequential stages (watermarking, manifest generation) run after. Result: 3-5× faster than a sequential pipeline. Also makes adding/removing stages easy. Example: video encoder (360p/720p/1080p), audio encoder (AAC), and thumbnail generator all run at the same time.

What are the stages of the video transcoding DAG pipeline?
?

1. Preprocessor: split into GOP-aligned 2-min segments, validate format, extract metadata.
2. Video Encoder: FFmpeg re-encode at each target resolution/bitrate (parallel).
3. Audio Encoder: normalize, convert to AAC/MP3 (parallel).
4. Thumbnail Generator: extract frames at key timestamps (parallel).
5. Watermarker: overlay logo, embed DRM info (after encoding).
6. Manifest Generator: create HLS .m3u8 and MPEG-DASH .mpd files.
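The stage ordering above can be sketched as a small dependency graph. This is a minimal illustration, not a production scheduler: the stage names, the `STAGES` mapping, and the `run_stage` placeholder are all assumptions made for the example, and the exact dependencies (for instance, that the manifest waits on both the watermarker and the audio encoder) are one plausible reading of the card.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
STAGES = {
    "preprocess": set(),
    "video_encode": {"preprocess"},
    "audio_encode": {"preprocess"},
    "thumbnails": {"preprocess"},
    "watermark": {"video_encode"},
    "manifest": {"watermark", "audio_encode"},
}


def run_stage(name: str) -> str:
    # Placeholder for real work (e.g. invoking an encoder process).
    return name


def run_dag(stages: dict[str, set[str]]) -> list[list[str]]:
    """Run stages in dependency order; return the batches that ran together."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()
    batches = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            # Every stage returned here has all of its dependencies finished,
            # so the whole batch can run in parallel.
            ready = list(sorter.get_ready())
            finished = list(pool.map(run_stage, ready))
            for name in finished:
                sorter.done(name)
            batches.append(sorted(finished))
    return batches
```

Calling `run_dag(STAGES)` yields four batches: the preprocessor alone, then the three independent encoders together, then the watermarker, then the manifest generator. A real scheduler would release each stage as soon as its own dependencies finish rather than waiting for the whole batch.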

What is adaptive bitrate streaming (ABR)? How does HLS work?
?
ABR splits video into small segments (2-10 sec) encoded at multiple bitrates. Player monitors download speed and switches quality up or down seamlessly. HLS: Master playlist (.m3u8) lists all quality levels. Player starts at medium quality, measures download speed every segment. Too slow: switches down (360p). Fast enough: switches up (1080p). No rebuffering — quality changes between segments. Critical for mobile users whose bandwidth fluctuates constantly.
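A player's quality decision can be sketched as a lookup against a bitrate ladder. The ladder values, the `safety` margin, and the function name below are illustrative assumptions; real players use more elaborate heuristics (buffer level, throughput smoothing).

```python
# Hypothetical bitrate ladder: (label, required kbps), lowest tier first.
LADDER = [("360p", 800), ("480p", 1400), ("720p", 2800), ("1080p", 5000)]


def pick_rendition(measured_kbps: float, safety: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within a safety margin."""
    budget = measured_kbps * safety
    choice = LADDER[0][0]  # always fall back to the lowest tier
    for label, kbps in LADDER:
        if kbps <= budget:
            choice = label
    return choice
```

The player would call this after each segment download with its latest throughput estimate and request the next segment from the chosen rendition's playlist.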

HLS vs MPEG-DASH: key differences and when to use each.
?
HLS (Apple): native on iOS/Safari, MPEG-TS or fMP4 segments. MPEG-DASH (open standard): fMP4 segments, codec-agnostic, played on Android (e.g., ExoPlayer) and in Chrome and other browsers through Media Source Extensions rather than natively; not supported natively on iOS Safari. In practice: support BOTH. Detect device → serve appropriate manifest. YouTube uses DASH for Android/Chrome, HLS for iOS/Safari. Neither is universally “better”: both are adaptive streaming protocols, and the choice comes down to device compatibility.

How do pre-signed URLs work for video upload? Why not upload through API servers?
?
API server generates a time-limited signed S3 URL (valid 1 hour) scoped to one specific object. Client uploads directly to S3 using that URL — bypassing API servers entirely. Why: Large files (up to 1 GB) would make API servers the bottleneck. S3 handles parallel multipart upload natively. URL expires quickly so it’s secure. After upload, S3 fires an event (SNS → SQS) to trigger the transcoding pipeline. API servers only handle metadata, not data transfer.
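The idea behind a pre-signed URL can be illustrated with a plain HMAC signature. This is a conceptual sketch only: it is not the AWS Signature Version 4 scheme that S3 actually uses, and `SECRET`, the `uploads.example.com` host, and all function names are assumptions for the example.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key known only to the server side


def _signature(method: str, key: str, expires_at: int) -> str:
    # The signature binds the HTTP method, the object key, and the expiry time.
    message = f"{method}\n{key}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(method: str, key: str, now: int, ttl: int = 3600) -> str:
    """Issue a URL valid for `ttl` seconds and scoped to one object key."""
    expires_at = now + ttl
    query = urlencode({"expires": expires_at, "sig": _signature(method, key, expires_at)})
    return f"https://uploads.example.com/{key}?{query}"


def verify_url(method: str, url: str, now: int) -> bool:
    """Storage-side check: reject expired URLs or any change to method or key."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    key = parts.path.lstrip("/")
    expires_at = int(params["expires"][0])
    if now > expires_at:
        return False
    expected = _signature(method, key, expires_at)
    return hmac.compare_digest(params["sig"][0], expected)
```

Because the signature covers the method, key, and expiry, a client cannot reuse the URL for another object, another operation, or after the deadline, which is why the API server can hand it out without proxying the bytes.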

How does chunked (multipart) upload enable resumable uploads?
?
Split file into 5 MB chunks (the S3 minimum part size, except for the last part). Upload chunks in parallel (e.g., 4 at a time). Each chunk returns an ETag. If network fails: only re-upload failed chunks (not whole file). Client stores chunk completion state locally. On resume: ListParts shows which ETags arrived → only upload missing chunks. S3 CompleteMultipartUpload assembles all parts. For a 1 GB video: 200 chunks of 5 MB. Crash at chunk 150: only re-upload chunks 151-200 (250 MB, not 1 GB again).
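The resume arithmetic can be checked with a short sketch. It uses decimal megabytes to keep the numbers round (S3's real minimum part size is 5 MiB), and the function names are illustrative, not part of any SDK.

```python
import math

PART_SIZE = 5 * 1000 * 1000  # 5 MB parts, decimal units for round numbers


def plan_parts(file_size: int, part_size: int = PART_SIZE) -> list[int]:
    """Return the 1-based part numbers needed to cover the file."""
    return list(range(1, math.ceil(file_size / part_size) + 1))


def parts_to_resume(file_size: int, uploaded: set[int],
                    part_size: int = PART_SIZE) -> list[int]:
    """Return the parts still missing, given the part numbers already stored."""
    return [n for n in plan_parts(file_size, part_size) if n not in uploaded]
```

For a 1 GB file with parts 1 through 150 already uploaded, `parts_to_resume` returns parts 151 through 200: 50 parts × 5 MB = 250 MB left to send instead of the full 1 GB.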

What is the CDN strategy for video delivery? How does it work at a high level?
?
CDN (Content Delivery Network) has 400+ Points of Presence (PoPs) worldwide. Popular videos are cached at the nearest edge server to the viewer. First viewer in a region: CDN fetches from S3 (cache miss, slower). All subsequent viewers: served from CDN cache (fast). GeoDNS routes viewer to nearest PoP automatically. ~80-90% of video requests should be CDN cache hits. This makes YouTube’s video latency low globally without routing all traffic to a central datacenter.
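The cache-miss-then-hit behavior at a single PoP can be sketched with a toy in-memory cache. The `EdgeCache` class and the dictionary standing in for the S3 origin are assumptions for illustration; a real CDN adds TTLs, eviction, and tiered caches.

```python
class EdgeCache:
    """Toy model of one CDN PoP in front of an origin store."""

    def __init__(self, origin: dict[str, bytes]):
        self.origin = origin
        self.cache: dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        if path in self.cache:
            self.hits += 1          # served from the edge, no origin traffic
            return self.cache[path]
        self.misses += 1            # first viewer in the region pays the origin fetch
        data = self.origin[path]
        self.cache[path] = data
        return data

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Ten requests for the same segment produce one miss and nine hits, a 90% hit ratio, which is the regime the card describes for popular videos.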

How do you optimize CDN costs for video streaming? (Critical interview topic!)
?
CDN egress is very expensive (~$0.08/GB). Key insight: views follow a power-law distribution, so roughly 20% of videos drive 80% of views and the remaining 80% form the long tail. Strategy: (1) Only cache popular/trending videos on CDN; serve long-tail content directly from origin S3. (2) Move videos not watched in 30+ days to S3 Glacier (83% cost reduction; this saves on storage rather than egress). (3) Pre-warm CDN for anticipated viral content. (4) Set short TTLs for less-popular content to free cache space. Result: 70-80% CDN cost reduction while serving popular content fast.

What blob storage is used for video? How is it organized?
?
S3 (or S3-compatible) blob storage for both raw and transcoded video. Two buckets/paths: raw/ (original uploaded videos, trigger for transcoding), transcoded/ (output per format: transcoded/{video_id}/360p.mp4, transcoded/{video_id}/720p.mp4, etc. + HLS/DASH manifests). S3 provides 11 nines of durability (99.999999999%), scales infinitely, integrates directly with CloudFront CDN. Raw videos can be moved to Glacier after transcoding completes.

What database(s) are used for video metadata? Why?
?
Two-database approach: (1) MySQL for structured metadata (video title, description, tags, owner, format list) — supports joins with user table, ACID transactions. (2) Cassandra for analytics (view counts, likes, trending scores) — handles very high write throughput, time-series queries (“views last 30 days”), scales horizontally. Also Elasticsearch for full-text search on titles/descriptions. Don’t put high-write analytics in MySQL (bottleneck). Don’t do full-text search in MySQL without Elasticsearch.

How does the message queue enable reliable transcoding?
?
Without MQ: API server calls transcoding worker directly. If worker crashes, job is lost. With MQ (SQS): transcoding job published to queue. Worker picks up job with visibility timeout (job hidden from others while processing). If worker crashes before deleting the message (the SQS equivalent of an ACK): visibility timeout expires, job reappears in queue for another worker. Job retried up to 3 times. After 3 failures: moved to Dead Letter Queue (DLQ). DLQ triggers alert to engineering team. No job ever silently lost.

What is the retry strategy for transcoding failures?
?
Attempt 1: Immediate retry (likely transient error — worker OOM, network blip). Attempt 2: Retry after 30 seconds. Attempt 3: Retry after 5 minutes. After 3 failures: move to Dead Letter Queue (DLQ), alert engineering via PagerDuty. Different error types: Corrupt video file → mark FAILED immediately (don’t retry). Worker crash → retry (transient). Storage unavailable → retry with backoff. Always log which attempt failed and why for debugging.

What is DRM (Digital Rights Management) and how does it work?
?
DRM protects premium content from unauthorized copying/redistribution. Three main DRM systems: FairPlay (Apple/iOS), Widevine (Google/Android/Chrome), PlayReady (Microsoft/Windows). Flow: Video stored encrypted on the CDN. Player requests license from DRM License Server. License server checks user subscription. If valid: returns short-lived decryption key. Player decrypts and plays segments. Key expires after session. Client never holds decryption key permanently. Common in Netflix, Disney+; protected playback typically blocks software screen recording, although the level of protection varies by device and DRM security level.

What are the two types of video watermarking?
?
Visible watermark: Platform logo overlay (e.g., top-right corner). Clearly visible to users; used for branding. Invisible/forensic watermark: Embed unique user ID or session ID in video signal (pixel-level changes imperceptible to human eye). If a copy leaks, the watermark identifies exactly which account downloaded it. Used by Netflix for screener copies sent to awards voters. Detectable by algorithm even after re-encoding or cropping.

What is the upload API flow at a high level? (End-to-end)
?

1. Client sends POST /upload/init → API server creates video record (status=PROCESSING), returns pre-signed S3 URL.
2. Client uploads video directly to S3 (multipart, chunked).
3. S3 fires ObjectCreated event → SQS transcoding queue.
4. Transcoding worker picks up job → runs DAG pipeline (encode all resolutions, thumbnails, watermark, manifest).
5. Worker uploads all outputs to transcoded S3.
6. Worker sends completion message → API server updates video status to READY in MySQL.
7. CDN distribution: popular videos pre-warmed to edge PoPs.

What is the streaming API flow? (End-to-end)
?

1. Client requests video page → API server returns video metadata from MySQL.
2. Client fetches HLS/DASH manifest URL (points to CDN path).
3. Client GETs the manifest from CDN → returns quality-level playlist.
4. Player selects starting quality based on initial bandwidth estimate.
5. Player requests video segments from CDN (2-10 sec each).
6. CDN: cache hit → serve immediately. CDN: cache miss → fetch from S3, cache, serve.
7. Player monitors download speed each segment, switches quality tier up/down seamlessly.
8. Buffering avoided by downloading 2-3 segments ahead.

How do you handle a video going viral (sudden 1000× traffic spike)?
?
CDN handles most of the spike: video already cached at edge PoPs. If new viral video: first few requests per PoP hit S3 (cache miss), then cached, so the cache warms itself. For transcoding spike: auto-scale transcoding workers (EC2 spot instances, scaled on the queue-depth metric). For API layer: horizontal auto-scaling behind load balancer. For anticipated events (sports finals, product launches): pre-warm CDN by proactively pushing video segments to relevant regional PoPs before the event starts.

Why use GPU instances for transcoding workers?
?
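Video encoding (especially H.264/H.265) parallelizes well, but not frame by frame: P- and B-frames depend on other frames. The parallelism comes from encoding GOP-aligned segments independently and from processing blocks within a frame in parallel. GPUs have thousands of small cores optimized for this kind of parallel workload (vs CPUs with fewer but more powerful cores), and NVIDIA GPUs also include NVENC, a dedicated hardware encoder block. GPU transcoding (NVENC on NVIDIA) is often cited as 10-50× faster than CPU-only FFmpeg, though the actual speedup depends on codec, preset, and quality target. AWS EC2 GPU instances (g4dn, p3): higher cost per hour but much lower cost per video-minute transcoded. Use spot instances (up to 90% discount) since transcoding is fault-tolerant (can resume on interruption).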

What is GOP-aligned segmentation in the preprocessor stage?
?
GOP = Group of Pictures. In video encoding, frames are interdependent: I-frames (full frames), P-frames (delta from previous), B-frames (delta from both directions). A segment must begin at an I-frame (keyframe) — not mid-GOP — otherwise the decoder can’t reconstruct the video. GOP-aligned segmentation: preprocessor finds keyframe boundaries and cuts segments there. Typical GOP size: 2 seconds. This is why HLS/DASH segment duration is usually 2-6 seconds — aligned to GOP boundaries.
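Choosing cut points on keyframe boundaries can be sketched as follows. The function assumes the keyframe timestamps have already been extracted (for example by a probing tool); the name `gop_aligned_cuts` and the "at least `target` seconds" rule are illustrative choices, not a standard algorithm.

```python
def gop_aligned_cuts(keyframes: list[float], target: float) -> list[float]:
    """Pick cut points from keyframe timestamps so that every cut lands on a
    keyframe and each segment lasts at least `target` seconds."""
    cuts = [keyframes[0]]
    for ts in keyframes[1:]:
        if ts - cuts[-1] >= target:
            cuts.append(ts)  # cut only at a keyframe, never mid-GOP
    return cuts
```

With a keyframe every 2 seconds and a 6-second target, the cuts fall at 0, 6, 12, and 18 seconds, so every segment starts on a keyframe and can be decoded on its own.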

What scale estimates should you cite for YouTube in an interview?
?
5M DAU. 300 hours of video uploaded per minute. 10M concurrent peak viewers. Storage: ~1.7 PB/day raw uploads (300 hrs/min × 60 × 24 × ~4 GB/hr). With 5 transcoded formats: ~8.5 PB/day total (upper bound, assuming each format is roughly raw-sized). CDN bandwidth: ~70 Gbps average egress (5M DAU × 5 min/day × 60 s/min × 4 Mbps avg / 86,400 s). Upload size limit: 1 GB max per video. These numbers signal scale awareness and are easy to derive; always show the math.
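The arithmetic behind these estimates can be written out explicitly. The sketch below uses the card's inputs and decimal units (1 PB = 1,000,000 GB); the function names are illustrative, and the 5 minutes of daily viewing is converted to seconds so the units cancel.

```python
def raw_upload_pb_per_day(hours_per_min: float = 300, gb_per_hour: float = 4) -> float:
    """Raw upload volume in PB/day (decimal units)."""
    hours_per_day = hours_per_min * 60 * 24      # 432,000 hours of video per day
    return hours_per_day * gb_per_hour / 1e6     # GB -> PB


def avg_egress_gbps(dau: int = 5_000_000, watch_min: float = 5,
                    mbps: float = 4) -> float:
    """Average CDN egress in Gbps, spreading daily viewing over 86,400 s."""
    megabits_per_day = dau * watch_min * 60 * mbps
    return megabits_per_day / 86_400 / 1000      # Mbps -> Gbps
```

With the defaults, raw uploads come to 1.728 PB/day (about 8.6 PB/day across five raw-sized formats) and average egress is roughly 69.4 Gbps, which rounds to the ~70 Gbps quoted above.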

Why is CDN the core of the streaming architecture (not S3)?
?
S3 is in one region (or few regions with replication). A viewer in Singapore fetching from US-East-1 S3 = 200-300ms RTT per segment × many segments = terrible buffering. CDN has 400+ PoPs globally: Singapore viewer → Singapore CDN PoP = 5-20ms RTT. Also, S3 has request rate limits per prefix; CDN distributes read load automatically. CDN transforms “stream from one origin” into “stream from closest edge,” which is the fundamental requirement for global video delivery at low latency.

How would you design the video search feature?
?
Use Elasticsearch for full-text search. Index: video title, description, tags, captions/transcript. On upload: transcoding pipeline extracts audio transcript (speech-to-text, e.g., AWS Transcribe), indexes text in Elasticsearch. Search query: Elasticsearch returns ranked results by relevance + view count (popularity boost). For recommendations: separate ML service using watch history + collaborative filtering (beyond scope of basic design). Elasticsearch decoupled from MySQL — async indexing pipeline updates search index on video metadata changes.

What are the two main sub-systems in YouTube design and why keep them separate?
?
Upload path: Handles large file ingestion, transcoding (CPU/GPU intensive, async). Scales with storage throughput and CPU/GPU compute. Upload is infrequent per user but very resource-intensive. Streaming path: Handles millions of concurrent read requests, low-latency delivery via CDN. Scales with CDN capacity and network bandwidth. Almost no compute needed (just CDN cache serving). Keeping them separate: different scaling needs, different SLAs (upload can be async, streaming must be real-time), different cost profiles (compute vs bandwidth).

What happens if a pre-signed URL expires before the upload finishes?
?
Pre-signed URL typically valid for 1 hour. For a 1 GB file on a slow connection (5 Mbps ≈ 27 min upload), expiry is unlikely but possible. Handling: Client checks remaining TTL before starting each chunk. If URL will expire mid-upload: request a new pre-signed URL from API server for the same S3 key. A pre-signed URL’s expiry cannot be extended once issued, so the API server signs a fresh URL without creating a new file record. Alternatively: set pre-signed URL TTL generously (e.g., 6 hours for video uploads) to avoid mid-upload expiry. The S3 multipart upload ID is independent of the URL’s TTL: it stays valid until the upload is completed or aborted, so already-uploaded parts are kept.
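The per-chunk TTL check can be sketched as a simple time estimate. The function name, the 30-second safety margin, and the use of a measured upload rate are assumptions for illustration.

```python
def needs_refresh(expires_at: float, now: float, part_bytes: int,
                  upload_bps: float, margin: float = 30.0) -> bool:
    """Return True if the next part is unlikely to finish before the URL
    expires, in which case the client should request a fresh URL first."""
    estimated_seconds = part_bytes * 8 / upload_bps
    return now + estimated_seconds + margin > expires_at
```

A 5 MB part at 5 Mbps takes about 8 seconds, so with a 30-second margin the client would ask for a new URL once fewer than roughly 38 seconds remain.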


Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH - Very common hard interview question!
Last Updated: 2026-04-13