Chapter 15 Flashcards - Design Google Drive

flashcards volume1 google-drive cloud-storage sync file-system

Why split files into blocks (4 MB chunks) instead of uploading whole files?
?
Delta sync: when a user edits a large file (e.g., one paragraph in a 100 MB doc), only 1-2 blocks change. Instead of re-uploading 100 MB, only the changed 4 MB blocks are uploaded: up to 25× less bandwidth when a single block changes (100 MB ÷ 4 MB). Also enables deduplication (same block content across users stored once). Block size of 4 MB is a balance: too small and per-block overhead (hashes, metadata rows, requests) dominates; too large and a small edit forces re-upload of a large block. Fixed-size blocks keep the implementation simple vs variable-size (content-defined chunking).

What is content-addressable storage? How is it used in Google Drive block storage?
?
Content-addressable storage (CAS): each block is identified by the hash of its content (SHA256), not by a filename or location. SHA256(“hello world”) always = the same hash. Properties: (1) Immutable — if content changes, hash changes → it’s a new block. (2) Deduplication built-in — same content = same hash = same storage location. (3) Self-verifying — download block, re-hash, compare → detects corruption. Google Drive: block_hash is the primary key in the blocks table and the S3 object key. No two blocks with the same hash need to be stored twice.
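A minimal sketch of the three CAS properties above, using an in-memory dict as a stand-in for the S3 bucket. The `BlockStore` class and its method names are hypothetical, chosen only for illustration.

```python
import hashlib


class BlockStore:
    """In-memory stand-in for the block bucket, keyed by SHA-256 hex digest."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        block_hash = hashlib.sha256(data).hexdigest()
        # Same content always maps to the same key, so a repeat put stores nothing new.
        self._objects.setdefault(block_hash, data)
        return block_hash

    def get(self, block_hash: str) -> bytes:
        data = self._objects[block_hash]
        # Self-verifying read: re-hash and compare to detect corruption.
        if hashlib.sha256(data).hexdigest() != block_hash:
            raise ValueError("block content does not match its hash")
        return data

    def __len__(self):
        return len(self._objects)


store = BlockStore()
key_a = store.put(b"hello world")
key_b = store.put(b"hello world")   # duplicate content, same key
key_c = store.put(b"hello world!")  # changed content, new block
```

Here `key_a` and `key_b` are identical and only two objects end up stored, which is the deduplication property in miniature.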

How does delta sync work step-by-step?
?

1. Watcher detects a file_changed event.
2. Chunker re-splits the file into 4 MB blocks.
3. Indexer computes SHA256 for each block.
4. Sync Engine compares new block hashes vs last-synced block hashes.
5. Only CHANGED blocks (hash differs from last version) are uploaded.
6. Block Server stores new blocks in S3.
7. Metadata DB updated: new file_versions row with new block list.
8. Notification service tells other devices to sync.
Example: 16 MB file, 1 block changed → upload 4 MB, not 16 MB. 75% bandwidth saved.
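The chunk, hash, and compare steps can be sketched as follows. This is an illustration under simple assumptions (whole file in memory, position-by-position comparison); the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size blocks, as in the card


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and hash each one."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Indexes in `new` whose hash differs from the same position in `old`."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


# Demo with 4-byte blocks so it runs instantly.
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBxBCCCCDDDD"  # one byte overwritten inside the second block
to_upload = changed_blocks(block_hashes(v1, 4), block_hashes(v2, 4))
```

Only block index 1 needs uploading here. Note that with fixed-size blocks an insertion that shifts bytes changes every later block as well, which is the trade-off against content-defined chunking.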

How does block deduplication work? What are the storage savings?
?
Before uploading a block, client sends block_hash to API server. Server checks blocks table: does this hash exist? Yes → skip upload (block already in S3), increment ref_count. No → upload block, create blocks row. Example: two users both have the same 5 MB photo → stored once, referenced twice. Deduplication across versions: edit report.docx, only 1 of 25 blocks changes → 24 blocks reused from v1. Typical deduplication ratios: 30-40% storage reduction for a general user base (higher for teams sharing files).

What is the four-component sync client architecture?
?

1. Watcher: uses OS file-system events (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows) to detect file create/modify/delete/rename in the watched folder.
2. Chunker: splits the changed file into 4 MB fixed-size blocks.
3. Indexer: computes the SHA256 hash for each block, maintains a local SQLite DB of block hashes per file version.
4. Sync Engine: diffs old vs new block list → determines which blocks to upload → schedules uploads → handles retries and conflicts.
These four work as a pipeline on every file change event.

What Metadata DB tables are needed for Google Drive?
?
Core tables: (1) users (user_id, email, storage_used, storage_limit). (2) files (file_id, owner_id, parent_id for folder hierarchy, name, is_folder, latest_version, trashed). (3) file_versions (file_id, version, size_bytes, checksum, created_by, created_at — revision history). (4) file_blocks (file_id, version, block_seq, block_hash — which blocks make up each version). (5) blocks (block_hash PK, s3_key, size_bytes, ref_count — the actual block inventory). (6) file_shares (file_id, shared_with, permission enum view/edit/owner).
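One way to write these six tables down, sketched with Python's built-in `sqlite3` as a stand-in for MySQL. Column types, the primary keys on the join tables, and the `CHECK` constraint used in place of an enum are assumptions, not part of the card.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY, email TEXT NOT NULL,
    storage_used INTEGER NOT NULL DEFAULT 0, storage_limit INTEGER NOT NULL
);
CREATE TABLE files (
    file_id INTEGER PRIMARY KEY,
    owner_id INTEGER NOT NULL REFERENCES users(user_id),
    parent_id INTEGER REFERENCES files(file_id),  -- NULL for a root folder
    name TEXT NOT NULL, is_folder INTEGER NOT NULL DEFAULT 0,
    latest_version INTEGER, trashed INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE blocks (
    block_hash TEXT PRIMARY KEY, s3_key TEXT NOT NULL,
    size_bytes INTEGER NOT NULL, ref_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE file_versions (
    file_id INTEGER NOT NULL REFERENCES files(file_id),
    version INTEGER NOT NULL, size_bytes INTEGER, checksum TEXT,
    created_by INTEGER REFERENCES users(user_id), created_at TEXT,
    PRIMARY KEY (file_id, version)
);
CREATE TABLE file_blocks (
    file_id INTEGER NOT NULL, version INTEGER NOT NULL,
    block_seq INTEGER NOT NULL,
    block_hash TEXT NOT NULL REFERENCES blocks(block_hash),
    PRIMARY KEY (file_id, version, block_seq),
    FOREIGN KEY (file_id, version) REFERENCES file_versions(file_id, version)
);
CREATE TABLE file_shares (
    file_id INTEGER NOT NULL REFERENCES files(file_id),
    shared_with INTEGER NOT NULL REFERENCES users(user_id),
    permission TEXT NOT NULL CHECK (permission IN ('view', 'edit', 'owner')),
    PRIMARY KEY (file_id, shared_with)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(SCHEMA)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

With foreign keys enabled, inserting a `file_blocks` row that points at an unknown `block_hash` is rejected, which is the referential-integrity argument for a relational store.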

How does file versioning work? How is storage kept efficient across versions?
?
Every save creates a new file_versions row. file_blocks maps each version to its block list. Blocks are immutable and shared across versions. Example: v1=[blockA, blockB, blockC], v2=[blockA, blockB’, blockC] → only blockB’ is new storage; blockA and blockC are referenced by both versions. Storage = only unique blocks, not N full file copies. Lifecycle: keep 30 daily versions after 30 days; 12 monthly snapshots after 1 year. Move old version blocks to S3 Glacier Instant Retrieval (83% cheaper than S3 Standard).
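The v1/v2 example can be checked with a few lines of arithmetic. The sketch assumes every block is a full 4 MB and uses placeholder strings instead of real hashes.

```python
BLOCK_MB = 4

# Block lists per version; "blockB2" stands for the edited copy of blockB.
versions = {
    1: ["blockA", "blockB", "blockC"],
    2: ["blockA", "blockB2", "blockC"],
}

unique_blocks = {h for block_list in versions.values() for h in block_list}
stored_mb = len(unique_blocks) * BLOCK_MB  # what block sharing actually stores
naive_mb = sum(len(block_list) for block_list in versions.values()) * BLOCK_MB  # full copy per version
```

Four unique blocks (16 MB) are stored instead of two full copies (24 MB), and the gap widens with every additional version.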

Why is long polling used for the notification service instead of WebSocket?
?
Drive sync is primarily server-to-client (server tells clients “a change happened, go sync”). WebSocket is bidirectional — overkill when clients only need to receive. Long polling: client sends GET /changes?since=timestamp, server holds request up to 30 sec, returns immediately when a change occurs, client re-polls. Advantages: works through all HTTP proxies/load balancers without config changes, lower per-connection overhead for infrequent notifications, simpler to implement. WebSocket better for: real-time co-editing (Google Docs), chat — truly bidirectional high-frequency traffic.
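A minimal in-process sketch of the hold-then-return behavior, using a `threading.Condition` in place of a real HTTP server. The `ChangeFeed` class is hypothetical, and it uses a sequence number as the `since` value rather than a timestamp.

```python
import threading


class ChangeFeed:
    """Holds a poll open until an event newer than `since` arrives or the timeout expires."""

    def __init__(self):
        self._events = []  # list of (seq, payload); seq starts at 1
        self._cond = threading.Condition()

    def publish(self, payload):
        with self._cond:
            self._events.append((len(self._events) + 1, payload))
            self._cond.notify_all()  # wake every waiting poll

    def poll(self, since: int, timeout: float = 30.0):
        with self._cond:
            # Returns immediately if newer events already exist, else waits.
            self._cond.wait_for(lambda: len(self._events) > since, timeout=timeout)
            return list(self._events[since:])


feed = ChangeFeed()
# Simulate a change arriving 50 ms after the client starts polling.
threading.Timer(0.05, feed.publish, args=("file 42 changed",)).start()
events = feed.poll(since=0, timeout=2.0)
```

The poll returns as soon as the event is published rather than after the full timeout; an empty list means the client should simply poll again.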

When should you use WebSocket vs Long Polling vs SSE?
?
Long Polling: server-to-client notifications, infrequent updates, need proxy compatibility. Use for: Drive sync, GitHub PR status, order tracking. WebSocket: high-frequency bidirectional, real-time collaboration. Use for: Google Docs co-edit, Slack chat, live dashboards with fast updates, multiplayer games. Server-Sent Events (SSE): server-to-client stream, high-frequency, one-directional. Use for: live sports scores, stock tickers, log streaming. Key rule: if client also pushes at high frequency → WebSocket. If mostly server pushes → Long Polling or SSE.

What consistency model is used for metadata vs file content? Why?
?
Metadata (MySQL, strong consistency): users must immediately see their own writes — if you upload a file and immediately list your folder, it must appear. Use MySQL with synchronous replication, read from primary. Eventual consistency would cause confusing UX (file “disappears” then “reappears”). File content (S3, eventual consistency OK): blocks are immutable — once written, content never changes (hash-addressed). An eventually consistent read either returns the block or “not found” (retry). No risk of returning stale/wrong content since blocks are write-once. This split is a key design insight: match consistency level to the actual need.

How do pre-signed URLs work in the Google Drive upload flow?
?

1. Client sends POST /upload/init with filename and block hashes.
2. API server checks which block hashes are MISSING from the blocks table.
3. For each missing block: generates a pre-signed S3 URL (valid for a short time).
4. Returns a list of {block_hash, presigned_url} pairs to the client.
5. Client uploads only MISSING blocks directly to S3 (parallel, up to 4).
6. Client sends POST /upload/complete with the full block list.
7. API server verifies all blocks exist → creates file_version → notifies other devices.
Key optimization: blocks already stored (same content, or unchanged blocks) are NOT re-uploaded at all.

What is storage tiering and why is it important at 500 PB scale?
?
Tiered storage maps data access frequency to cost-optimized storage: (1) S3 Standard: current versions, recently accessed, $0.023/GB/month. (2) S3 Standard-IA: not accessed in 30 days, $0.0125/GB/month (46% cheaper). (3) S3 Glacier Instant Retrieval: old versions, 90+ days, $0.004/GB/month (83% cheaper). (4) S3 Glacier Deep Archive: $0.00099/GB/month. Lifecycle policies automate transitions. At 500 PB: if even 50% of data moves to Glacier Instant Retrieval, that saves ~$4.75M/month vs keeping everything in Standard (~$11.5M/month). Storage tiering is NOT optional at this scale.
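The savings arithmetic can be worked through directly. This sketch assumes decimal units (1 PB = 1,000,000 GB) and the per-GB prices quoted in the card; the tier keys are made-up labels.

```python
GB_PER_PB = 1_000_000  # decimal units (assumption)

# $/GB/month, as quoted in the card.
PRICE = {"standard": 0.023, "standard_ia": 0.0125, "glacier_ir": 0.004}


def monthly_cost(total_pb, share_by_tier):
    """Monthly storage bill in dollars for a given split of data across tiers."""
    return sum(
        total_pb * GB_PER_PB * share * PRICE[tier]
        for tier, share in share_by_tier.items()
    )


all_standard = monthly_cost(500, {"standard": 1.0})
half_glacier = monthly_cost(500, {"standard": 0.5, "glacier_ir": 0.5})
savings = all_standard - half_glacier
```

Under these assumptions, everything in Standard costs about $11.5M/month, and moving half of it to Glacier Instant Retrieval saves about $4.75M/month; moving all of it would save about $9.5M/month.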

How do you handle file conflicts between two offline devices?
?
Conflict detected: two devices each create a new version on top of the same last-synced version. Strategy (no auto-merge for binary files): both versions are SAVED. The second device to sync has its version stored as a “conflicted copy” (e.g., “report (John’s conflicted copy 2026-04-13).docx”) in the same folder. User is notified and must manually merge. For text files: optionally apply a three-way merge (like git) and merge automatically if the two edits touch different regions of the file. Key principle: NEVER silently overwrite. Always preserve both versions when in doubt. Data loss is worse than a conflict notification.
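A small sketch of the detection rule and the conflicted-copy naming. Both function names are hypothetical, versions are compared as opaque identifiers, and the file name uses a plain ASCII apostrophe.

```python
from datetime import date


def classify(base_version, local_version, server_version):
    """Compare both sides against the last-synced (base) version."""
    local_changed = local_version != base_version
    server_changed = server_version != base_version
    if local_changed and server_changed:
        return "conflict"  # keep both, never overwrite
    if local_changed:
        return "upload"
    if server_changed:
        return "download"
    return "in_sync"


def conflicted_copy_name(filename, device_owner, on):
    """Insert the conflict suffix before the file extension, if there is one."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = filename, ""
    return f"{stem} ({device_owner}'s conflicted copy {on.isoformat()}){dot}{ext}"


name = conflicted_copy_name("report.docx", "John", date(2026, 4, 13))
```

The resulting name keeps the original extension so the copy still opens in the same application.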

How does the upload flow handle block deduplication to avoid unnecessary uploads?
?
Client sends POST /upload/init with ALL block hashes for the new file version. API server does bulk lookup: SELECT block_hash FROM blocks WHERE block_hash IN (…). Server returns only the MISSING hashes (blocks not yet in S3). Client uploads ONLY those missing blocks. This means: editing 1 block of a 25-block file → client sends 25 hashes to server, server says “24 already exist, upload just these 1.” Also: if another user already uploaded the same content, client sends hashes, server finds them all, client uploads ZERO blocks. Deduplication happens before any network transfer.
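The server-side check amounts to a set difference. In this sketch a Python set stands in for the result of the `blocks` table lookup, and the function name `upload_init` is hypothetical.

```python
def upload_init(new_version_hashes, stored_hashes):
    """Return the hashes the client still has to upload, in order, without repeats."""
    missing, seen = [], set()
    for block_hash in new_version_hashes:
        if block_hash not in stored_hashes and block_hash not in seen:
            missing.append(block_hash)
            seen.add(block_hash)
    return missing


# A 25-block file already stored as v1; the edit changes only block 7.
stored = {f"h{i}" for i in range(25)}
v2_hashes = [f"h{i}" for i in range(25)]
v2_hashes[7] = "h7-edited"

missing = upload_init(v2_hashes, stored)
```

The client sends 25 hashes and is asked for exactly one block; if every hash were already stored, the list would be empty and nothing would be uploaded.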

What are the scale estimates for Google Drive?
?
50M total users, 10M DAU. Each user: 10 GB free → 500 PB total storage capacity. Read/write ratio: ~1:1 (users actively upload AND download). Upload rate: 10M DAU × 2 uploads/day = 20M uploads/day = ~231 uploads/sec. Metadata ops: ~10× upload ops = 2,300 ops/sec (listings, searches, version history). Storage with 40% dedup: 500 PB × 0.6 = 300 PB actual. With 3× replication: 900 PB physical. These estimates signal you understand the scale difference between a startup and a Google-scale product.
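The same back-of-envelope numbers, written out so each step can be checked. Decimal units are assumed (1 PB = 1,000,000 GB).

```python
SECONDS_PER_DAY = 86_400

total_users = 50_000_000
dau = 10_000_000
free_gb_per_user = 10
uploads_per_user_per_day = 2

raw_pb = total_users * free_gb_per_user / 1_000_000           # capacity if every quota is full
uploads_per_sec = dau * uploads_per_user_per_day / SECONDS_PER_DAY
metadata_ops_per_sec = 10 * uploads_per_sec                   # ~10x upload ops
logical_pb = raw_pb * (1 - 0.40)                              # after 40% dedup
physical_pb = logical_pb * 3                                  # 3x replication
```

The upload rate comes out at roughly 231 per second and metadata traffic at roughly 2,300 operations per second, matching the rounded figures in the card.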

How do you handle very large files (e.g., a 50 GB video in Google Drive)?
?
Block-level chunking handles this naturally: 50 GB ÷ 4 MB = ~12,800 blocks. Client uploads blocks in parallel (e.g., 4 concurrent). Pre-signed URLs per block, each valid for the upload window. If interrupted: resume from last completed block (local indexer tracks which blocks uploaded). For the metadata API: upload init returns 12,800 {hash, presigned_url} pairs — might need pagination. S3 handles objects up to 5 TB with multipart upload. Block size could be increased for very large files (e.g., 16 MB blocks for 50 GB) to reduce round trips and overhead.
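A short sketch of the block-count arithmetic and the resume step. It assumes binary units (1 GB = 1024 MB), which is what yields 12,800; the helper names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def block_count(file_bytes, block_bytes):
    """Number of blocks needed, rounding the last partial block up."""
    return -(-file_bytes // block_bytes)  # ceiling division on integers


def blocks_to_resume(all_hashes, uploaded):
    """Blocks still to send after an interruption, in original order."""
    return [h for h in all_hashes if h not in uploaded]


blocks_4mb = block_count(50 * GIB, 4 * MIB)
blocks_16mb = block_count(50 * GIB, 16 * MIB)
remaining = blocks_to_resume(["b0", "b1", "b2", "b3"], uploaded={"b0", "b1"})
```

Moving from 4 MB to 16 MB blocks cuts the block count for a 50 GB file from 12,800 to 3,200, at the cost of coarser delta sync.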

What OS-level mechanisms does the Watcher component use?
?
Linux: inotify API — kernel notifies process of file events (IN_CREATE, IN_MODIFY, IN_DELETE, IN_MOVED_FROM, IN_MOVED_TO) without polling. Efficient: no CPU overhead when nothing changes. macOS: FSEvents API (or kqueue) — similar kernel-level file system event notifications. Windows: ReadDirectoryChangesW Win32 API — watches a directory tree for changes. All are event-driven (no polling loop needed). Challenge: events can be noisy (editors like VS Code trigger many rapid events on save). Debounce: wait 500ms after last event before triggering sync to batch related changes.
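The debounce step can be sketched independently of any OS API. This hypothetical `Debouncer` takes timestamps in integer milliseconds as arguments so the behavior is deterministic; a real watcher would feed it events from inotify, FSEvents, or ReadDirectoryChangesW and read a clock.

```python
class Debouncer:
    """Reports a path once it has been quiet for `quiet_ms` milliseconds."""

    def __init__(self, quiet_ms=500):
        self.quiet_ms = quiet_ms
        self._last_event = {}  # path -> timestamp (ms) of most recent event

    def on_event(self, path, now_ms):
        # Each new event restarts the quiet period for that path.
        self._last_event[path] = now_ms

    def due(self, now_ms):
        """Return and forget the paths whose quiet period has elapsed."""
        ready = sorted(
            path for path, ts in self._last_event.items()
            if now_ms - ts >= self.quiet_ms
        )
        for path in ready:
            del self._last_event[path]
        return ready


debouncer = Debouncer(quiet_ms=500)
for ts in (0, 100, 200):            # an editor firing three rapid events on save
    debouncer.on_event("notes.txt", ts)

too_early = debouncer.due(500)      # only 300 ms since the last event
ready = debouncer.due(700)          # 500 ms of quiet has passed
```

Three raw events collapse into a single sync trigger, which is the point of waiting for the burst to end.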

How do you handle deleted files and the trash/recycle bin?
?
Soft delete: mark files.trashed = TRUE, do NOT delete S3 blocks immediately. Trash retains files for 30 days. User can restore from trash within 30 days: set trashed=FALSE, restore to original path. After 30 days: hard delete — decrement ref_count on all blocks. If ref_count drops to 0 for any block: schedule S3 block deletion (async cleanup job). This prevents data loss from accidental deletion. Also handles multi-version cleanup: deleting a file decrements ref_count across ALL versions’ blocks. Only truly orphaned blocks (ref_count=0) are eligible for S3 deletion.
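A sketch of the hard-delete step after the trash window expires. It assumes `ref_count` counts one reference per (version, block) row, which the card does not spell out, and `hard_delete` is a hypothetical name.

```python
def hard_delete(file_versions, ref_count):
    """Drop one reference per block per version; return blocks that became orphaned."""
    orphaned = []
    for block_list in file_versions.values():
        for block_hash in block_list:
            ref_count[block_hash] -= 1
            if ref_count[block_hash] == 0:
                orphaned.append(block_hash)  # eligible for async S3 deletion
    return sorted(orphaned)


# Block A is also referenced once by some other file, so its count starts at 3.
ref_count = {"A": 3, "B": 1, "B2": 1, "C": 2}
report_versions = {1: ["A", "B", "C"], 2: ["A", "B2", "C"]}

to_delete = hard_delete(report_versions, ref_count)
```

Blocks B, B2, and C become orphaned and can be cleaned up, while A survives because another file still references it.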

How do you scale the Metadata DB to support 50 million users?
?
Shard MySQL by user_id: hash(user_id) % N_shards. Each shard owns a subset of users and all their files/blocks metadata. Cross-shard queries (e.g., “files shared with me by others”) handled by: (1) fan-out to all shards at read time, or (2) maintain a secondary index table “shared_with_me” per user (shard by recipient user_id). Read replicas per shard for read-heavy workloads (file listings, search). Redis cache in front of MySQL for hot data (e.g., a user’s root folder listing). At 50M users with good sharding: ~10M users/shard for 5 shards is manageable.
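A minimal sketch of the `hash(user_id) % N_shards` routing. It uses SHA-256 rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would route the same user differently after a restart. The function name and user id format are assumptions.

```python
import hashlib

N_SHARDS = 5


def shard_for(user_id: str, n_shards: int = N_SHARDS) -> int:
    """Stable shard index for a user, identical across processes and machines."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Rough check that users spread evenly across shards.
counts = [0] * N_SHARDS
for i in range(10_000):
    counts[shard_for(f"user-{i}")] += 1
```

Modulo routing is simple, but changing `N_SHARDS` remaps most users, so resharding needs a migration plan.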

What database should you use for file metadata and why not NoSQL?
?
MySQL (relational) for metadata. Why: (1) Strong consistency — critical for sync (users must see own writes immediately). (2) Transactions — creating a file_version + updating files.latest_version atomically. (3) Foreign keys — enforce referential integrity (file_blocks.block_hash must exist in blocks table). (4) JOINS — folder listings need file + share + user queries. NoSQL tradeoffs: Cassandra gives high write throughput but eventual consistency (wrong for metadata). DynamoDB: single-table design possible but complex for relational queries. Verdict: metadata is structured, relational, requires ACID — SQL wins.

How does file sharing work? What are the permission levels?
?
file_shares table: (file_id, shared_with user_id, permission enum). Permissions: VIEW (read-only download), EDIT (can upload new versions), OWNER (can share with others, delete). API check on every operation: verify caller has required permission for target file_id. Sharing a folder: all contents inherit parent permissions (check recursively or denormalize permissions to each file). Public link sharing: special “anonymous” user record, or signed URL with embedded permissions. Revoke sharing: DELETE from file_shares → propagate to notification service → other users’ clients see file disappear from Shared folder.
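The per-operation check can be sketched as a lookup plus a comparison. This assumes the three levels are strictly ordered (OWNER includes EDIT, EDIT includes VIEW); the action names and the dict standing in for the `file_shares` table are hypothetical.

```python
from enum import IntEnum


class Permission(IntEnum):
    VIEW = 1
    EDIT = 2
    OWNER = 3


# Minimum permission each action requires (illustrative action names).
REQUIRED = {
    "download": Permission.VIEW,
    "upload_version": Permission.EDIT,
    "share": Permission.OWNER,
    "delete": Permission.OWNER,
}


def is_allowed(file_shares, user_id, file_id, action):
    """True if the user's grant on this file meets the action's minimum level."""
    granted = file_shares.get((file_id, user_id))
    return granted is not None and granted >= REQUIRED[action]


file_shares = {(42, "alice"): Permission.OWNER, (42, "bob"): Permission.VIEW}
```

A user with no row for the file is denied everything, which is also how a revoked share behaves once its row is deleted.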

How does the notification service scale to handle 10M concurrent users?
?
Long polling at 10M concurrent connections: (1) Use event-driven servers (Node.js, Netty) that hold connections without blocking threads; 1 server can handle 100K+ open connections. (2) Route user connections to a specific notification server using consistent hashing (same user always goes to same server, simpler fan-out). (3) When a change occurs: publish to Kafka (topic partitioned by user_id). Notification consumers read from Kafka and push to the connected client. (4) Offline users: store events in Kafka (or DB) until the client reconnects and polls for missed changes. (5) At 100K connections per server, 10M concurrent connections need ~100 notification servers.
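A compact sketch of the consistent-hashing step, using a sorted ring with virtual nodes. The `HashRing` class, server names, and the choice of 100 virtual nodes per server are assumptions for illustration.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Position of a key on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_point(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's position, wrapping at the end.
        idx = bisect.bisect_right(self._points, _point(user_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["notify-1", "notify-2", "notify-3", "notify-4"])
smaller_ring = HashRing(["notify-1", "notify-2", "notify-3"])  # notify-4 removed
users = [f"user-{i}" for i in range(1000)]
```

When `notify-4` is removed, only the users it was serving are reassigned; everyone else keeps the same server, which is the property that makes this preferable to plain modulo routing for connection affinity.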

What is the difference between the Upload path and the Sync path?
?
Upload path (Client A → server): user explicitly adds a file. Client Watcher detects new file, Chunker splits it, Indexer hashes blocks, Sync Engine uploads delta blocks → Block Server → S3. API server creates/updates metadata in MySQL. Sync path (server → Client B): notifies other devices of changes. Notification service (long polling) pushes change event to Client B. Client B fetches updated file metadata from API server. Client B downloads only the NEW/CHANGED blocks from S3 (delta download). Reassembles blocks into file on local disk. Both paths use block-level delta to minimize data transfer.

How do you handle a client that has been offline for a long time (e.g., 2 weeks)?
?
On reconnect, client sends last_sync_timestamp to GET /changes?since=. Server returns all change events since that timestamp. Client applies changes in order: updates, creates, deletes. Problem: too many changes to return at once (2 weeks of activity). Solution: paginate changes (cursor-based: return 100 events, provide cursor for next page). Client processes pages one at a time. Another issue: conflict detection — changes on client during offline period vs changes on server. Apply standard conflict resolution (preserve both, notify user). Limit event retention: if offline > 90 days, full re-sync from current state instead.
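The cursor-based catch-up loop can be sketched as below. An in-memory list stands in for the server's change log, the cursor is simply an offset into it, and both function names are hypothetical.

```python
def get_changes(events, cursor=0, page_size=100):
    """Server side: return one page plus the cursor for the next page (None when done)."""
    page = events[cursor:cursor + page_size]
    next_cursor = cursor + len(page)
    return page, (next_cursor if next_cursor < len(events) else None)


def catch_up(events, page_size=100):
    """Client side: fetch and apply pages in order until the server reports no more."""
    applied, cursor, pages = [], 0, 0
    while cursor is not None:
        page, cursor = get_changes(events, cursor, page_size)
        applied.extend(page)  # a real client would apply each change to local state here
        pages += 1
    return applied, pages


backlog = [f"event-{i}" for i in range(250)]  # two weeks of missed changes
applied, pages = catch_up(backlog)
```

A backlog of 250 events is delivered in three pages (100, 100, 50), and the client can persist the cursor after each page so an interrupted catch-up resumes where it stopped.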

What are the key reliability guarantees Google Drive must provide?
?
(1) No data loss: S3 provides 99.999999999% (11 nines) durability via multi-AZ replication. Blocks are immutable and content-verified (SHA256). (2) Strong metadata consistency: MySQL synchronous replication, transactions. (3) Versioning: all previous versions retained → accidental overwrites recoverable. (4) Deduplication correctness: SHA256 collision resistance means dedup is safe. (5) Atomic version creation: file_version record only created after all blocks confirmed in S3. (6) Conflict preservation: never silently overwrite, always keep both conflicting versions. These guarantees together mean: if you upload it, it’s there and correct.

Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH - Common hard interview question!
Last Updated: 2026-04-13

Study Notes by Niladri & AI
