Chapter 8 Flashcards - Distributed Email Service

flashcards volume2 email smtp imap distributed-systems

What does SMTP do and when is it used?
?
SMTP (Simple Mail Transfer Protocol) is used for SENDING email. It handles transport: User’s client → User’s SMTP server → Recipient’s SMTP server. Ports: 587 (client submission with auth, upgraded via STARTTLS), 465 (client submission over implicit TLS), 25 (server-to-server relay). Text-based command/response protocol. SMTP is NOT used for reading email. When you click “Send”, SMTP carries the email from your client to your provider’s server, then from server to server across the internet.

What does IMAP do and how is it different from POP3?
?
IMAP (Internet Message Access Protocol) is used to ACCESS/RECEIVE email from a server. Emails stay on the server and are synced to all devices. Port 993 (implicit TLS). Supports folders, read/unread state, multi-device sync. POP3 downloads emails and, by default, deletes them from the server: no sync, effectively single device. IMAP = modern (Gmail and Outlook support it). POP3 = legacy. Key difference: IMAP keeps mailbox state on the server and syncs it; POP3 is a one-time download.

Compare SMTP, IMAP, and POP3 in one table.
?
| Protocol | Purpose | Server storage | Multi-device | Status |
|---|---|---|---|---|
| SMTP | Sending | No server storage | N/A | Used today for sending |
| IMAP | Receiving/Access | Stores on server | Yes | Used today for receiving |
| POP3 | Receiving (download) | No server storage | No | Legacy, largely replaced |
Memory trick: SMTP = Send Mail To People. IMAP = I’ll Always Mail People. POP3 = Pull One Per device.

Why is a relational database not suitable for email storage at Gmail scale?
?
Three problems: (1) Write throughput: 10B emails/day ≈ 115K writes/sec. A single MySQL node handles on the order of ~10K TPS, so this needs massive sharding. (2) Full-text search: SQL LIKE ‘%keyword%’ cannot use a normal index because of the leading wildcard, so it is an O(n) table scan that doesn’t scale to billions of rows. (3) Storage cost: storing email bodies as BLOBs in an RDBMS is expensive and inefficient compared with S3 object storage. Solution: Cassandra for metadata, S3 for bodies, Elasticsearch for search.

What is the recommended storage architecture for email at scale?
?
Four-tier split: (1) Cassandra — email metadata (subject, sender, labels, read status, S3 key). Partition by user_id, cluster by message_id (time-based). High write throughput, fast per-user queries. (2) S3 object store — email bodies (HTML/text) and attachments. Immutable blobs, cheap, tiered storage (Standard → IA → Glacier). (3) Elasticsearch — inverted index for full-text search. Async updated. (4) Redis — unread counts, session state, rate limiting. Each layer does one thing well.

Why use Cassandra for email metadata and what is the data model?
?
Cassandra excels at: high write throughput (115K/sec), partitioning by user_id (all of a user’s emails live in one partition, so a read touches a single replica set), clustering by time-based UUID (natural chronological sort), no single point of failure. Schema: PRIMARY KEY (user_id, message_id) WITH CLUSTERING ORDER BY (message_id DESC), where message_id is a TIMEUUID. This makes “get latest 50 emails for user X” a single-partition scan. Labels are stored as a SET column. A separate email_by_label table supports efficient label queries.

How does email body storage in S3 work and why separate it from metadata?
?
Body stored at S3 key: “emails/{year}/{month}/{day}/{message_id}.eml”. Metadata (Cassandra) stores the S3 key, not the body itself. Why separate: (1) Inbox listing needs subject/sender/labels, NOT body — so inbox queries are fast Cassandra reads with no large blob transfer. (2) 80% of emails are never re-opened after first read — body stays in cold storage. (3) S3 lifecycle policy: Standard (0-30 days) → Infrequent Access (30-365 days) → Glacier (365+ days). Cost reduction: ~60% cheaper than RDBMS BLOB storage.

How does attachment deduplication with content hash work?
?
Compute SHA-256 of attachment bytes → use hash as S3 key: “attachments/{sha256_hash}”. Before uploading: check if the S3 key exists. If it exists: skip the upload and just reference the hash. If not: upload once. Store content_hash in the attachment_metadata table, not the file itself. Example: the same 5MB PDF attached to 1M emails is stored exactly once and referenced 1M times, so ~5TB of naive storage shrinks to 5MB. Deduplication is transparent to users. At scale, this can cut storage for popular attachments by orders of magnitude.
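
The check-then-upload flow above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict as a stand-in for the S3 bucket; `AttachmentStore` and its fields are hypothetical names, not part of any real SDK.

```python
import hashlib


class AttachmentStore:
    """In-memory stand-in for an object store keyed by content hash."""

    def __init__(self):
        self.blobs = {}    # key -> bytes (hypothetical bucket contents)
        self.uploads = 0   # number of real uploads performed

    def put(self, data: bytes) -> str:
        """Store data once per unique content and return its key."""
        key = "attachments/" + hashlib.sha256(data).hexdigest()
        if key not in self.blobs:  # existence check before upload
            self.blobs[key] = data
            self.uploads += 1
        return key


store = AttachmentStore()
pdf = b"%PDF-1.7 identical attachment bytes"
keys = [store.put(pdf) for _ in range(1000)]  # same file on 1000 emails
print(store.uploads, len(set(keys)))
```

Every email keeps only the returned key, so the bytes are uploaded once no matter how many messages reference them.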

How does email threading work (conversation view)?
?
RFC 5322 headers (RFC 5322 obsoletes RFC 2822): Message-ID (unique ID of this email), In-Reply-To (Message-ID of the email being replied to), References (full chain of ancestor message IDs). On receipt: check In-Reply-To. If the parent email is found in storage → inherit the parent’s conversation_id. If there is no parent → generate a new conversation_id. All emails in a thread share one conversation_id. To load a thread: SELECT ... WHERE conversation_id = X. Gmail’s “conversation view” is built on this. The conversation_id is assigned at mail processing time, not at query time.
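
The inherit-or-create rule can be sketched as follows. It assumes a plain dict (`known`) standing in for the storage lookup of parent messages; the function name and the header dict shape are illustrative only.

```python
import uuid


def assign_conversation_id(headers: dict, known: dict) -> str:
    """Return the conversation_id for a message and record it in `known`.

    `known` maps Message-ID -> conversation_id (stand-in for a storage lookup).
    """
    parent = headers.get("In-Reply-To")
    # Inherit the parent's thread if the parent is known, else start a new one.
    conversation_id = known.get(parent) or str(uuid.uuid4())
    known[headers["Message-ID"]] = conversation_id
    return conversation_id


known = {}
first = assign_conversation_id({"Message-ID": "<1@example.com>"}, known)
reply = assign_conversation_id(
    {"Message-ID": "<2@example.com>", "In-Reply-To": "<1@example.com>"}, known
)
other = assign_conversation_id({"Message-ID": "<3@example.com>"}, known)
print(first == reply, first == other)
```

Because the ID is fixed when the message is processed, loading a thread later is a plain lookup by conversation_id.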

Explain at-least-once delivery with deduplication for email.
?
Exactly-once delivery cannot be guaranteed over an unreliable network (see the two generals problem), so the practical target is at-least-once delivery with idempotent processing: (1) Kafka queue with replication factor 3 and 7-day retention. The worker commits the offset AFTER a successful DB write. (2) Each email has a message_id (from the RFC 2822 headers or assigned by the SMTP server). (3) Before storing: check if the message_id exists in Cassandra. If yes: skip (idempotent). If no: store. (4) Worker crash → re-read from Kafka from the last committed offset → the duplicate is detected by the message_id check. Result: no email lost, no duplicates stored.
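
A small simulation makes the crash-and-replay case concrete. It is a sketch under stated assumptions: a Python list stands in for a Kafka partition, a dict stands in for Cassandra, and `crash_after_write_at` is a hypothetical hook that simulates a worker dying after the write but before the offset commit.

```python
class Consumer:
    def __init__(self, log):
        self.log = log        # list of (message_id, payload): stand-in for a partition
        self.committed = 0    # last committed offset
        self.store = {}       # message_id -> payload: stand-in for Cassandra
        self.writes = 0       # number of real writes performed

    def run(self, crash_after_write_at=None) -> bool:
        """Process from the committed offset; return False on a simulated crash."""
        offset = self.committed
        while offset < len(self.log):
            message_id, payload = self.log[offset]
            if message_id not in self.store:  # idempotency check
                self.store[message_id] = payload
                self.writes += 1
            if offset == crash_after_write_at:
                return False                  # write done, offset NOT committed
            offset += 1
            self.committed = offset           # commit only after the write
        return True


consumer = Consumer([("m1", "a"), ("m2", "b"), ("m3", "c")])
consumer.run(crash_after_write_at=1)  # crashes after storing m2
consumer.run()                        # restart: m2 is redelivered and skipped
print(consumer.writes, consumer.committed)
```

The second run sees m2 again because its offset was never committed, and the message_id check turns that redelivery into a no-op.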

What is a Dead Letter Queue and when do emails land there?
?
DLQ is a separate Kafka topic or queue where messages go after exhausting retries (typically 3 attempts). Emails land in DLQ when: corrupt email format (can’t parse RFC 2822), virus detected in attachment (quarantine), user over storage quota, persistent storage failure. Operations team monitors DLQ: manual review, retry, or discard. NEVER silently drop failed emails — they go to DLQ for accountability. DLQ is also used to detect systematic failures (sudden surge of DLQ entries = processing bug or downstream outage).
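
The retry-then-park behavior can be sketched like this. The handler, the message shape, and the in-memory `dead_letters` list are assumptions for illustration; a real system would publish to a separate topic or queue.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times; park failures in a DLQ list."""
    delivered, dead_letters = [], []
    for message in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(message)
                delivered.append(message)
                break
            except Exception as error:  # keep the reason for the operators
                last_error = error
        else:
            # Loop finished without a break: retries are exhausted.
            dead_letters.append((message, repr(last_error)))
    return delivered, dead_letters


def parse(message):
    """Hypothetical handler that rejects unparseable input."""
    if message.startswith("corrupt"):
        raise ValueError("cannot parse message")


delivered, dead_letters = process_with_dlq(["ok-1", "corrupt-2", "ok-3"], parse)
print(delivered, dead_letters)
```

Nothing is dropped: every input ends up either in `delivered` or in `dead_letters` with the last error attached.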

How does spam filtering work in a layered email system?
?
Layer 1 = IP reputation (before the SMTP transaction is accepted): check the sender IP against blocklists (e.g., Spamhaus). Cheapest check; blocks at the network edge. Layer 2 = Email authentication (during the SMTP session): SPF (is the sender’s IP authorized for their domain?), DKIM (is the email signature valid? verifiable once the message data has arrived), DMARC (what to do on SPF/DKIM failure). Layer 3 = Content filtering (during processing): rule-based (suspicious phrases, known spam links) and ML-based (trained on billions of labeled spam/ham examples). Layer 4 = Collaborative filtering: many users mark the same sender as spam → auto-block. Layer 5 = User feedback: “Report spam” trains the model; “Not spam” whitelists the sender.

What do SPF, DKIM, and DMARC do?
?
SPF (Sender Policy Framework): DNS record listing which IPs are authorized to send email for a domain. Helps prevent spoofing of the envelope sender domain. DKIM (DomainKeys Identified Mail): cryptographic signature added to the email headers by the sender’s mail server. The recipient verifies the signature using the sender’s public key from DNS. Proves the signed headers and body weren’t modified in transit. DMARC (Domain-based Message Authentication, Reporting & Conformance): policy built on SPF + DKIM that tells receiving servers what to do if checks fail (reject, quarantine, or report only). All three are published in DNS.
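
To make the SPF idea concrete, here is a deliberately simplified check. It assumes the SPF TXT record has already been fetched and only understands `ip4:` mechanisms; real SPF evaluation also handles `include`, `a`, `mx`, qualifiers, and DNS lookup limits, none of which are modeled here. The function name is hypothetical.

```python
import ipaddress


def spf_allows(record: str, sender_ip: str) -> bool:
    """Return True if sender_ip falls inside any ip4: range in the record."""
    ip = ipaddress.ip_address(sender_ip)
    for mechanism in record.split():
        if mechanism.startswith("ip4:"):
            network = ipaddress.ip_network(mechanism[4:], strict=False)
            if ip in network:
                return True
    return False  # no ip4 range matched


record = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.25 -all"
print(spf_allows(record, "192.0.2.10"))
print(spf_allows(record, "203.0.113.9"))
```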

What are the scale estimates for a distributed email system?
?
Users: 1B. Emails/day: 10B. Emails/sec: 10B / 86,400 = ~115,740/sec. Peak: ~3x = 350,000/sec. Storage: metadata = 10B × 1KB = 10TB/day. Bodies = 10B × 50KB = 500TB/day. Attachments (20% of emails × 500KB avg) = 1PB/day. Annual body storage: 182PB (before compression + dedup). With lifecycle tiering and compression: ~60-70% reduction. Message queue: Kafka handles 1M+ messages/sec easily with proper partition count.
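
The arithmetic behind these estimates can be checked directly. The sketch uses decimal units (1KB = 1,000 bytes), which is an assumption consistent with the round numbers on the card.

```python
EMAILS_PER_DAY = 10_000_000_000
SECONDS_PER_DAY = 86_400
KB, TB, PB = 10**3, 10**12, 10**15

avg_per_sec = EMAILS_PER_DAY / SECONDS_PER_DAY           # ~115,740
peak_per_sec = 3 * avg_per_sec                           # ~347,000, rounded to 350K

metadata_per_day = EMAILS_PER_DAY * 1 * KB               # 1KB per email
body_per_day = EMAILS_PER_DAY * 50 * KB                  # 50KB per email
emails_with_attachments = EMAILS_PER_DAY * 20 // 100     # 20% of emails
attachments_per_day = emails_with_attachments * 500 * KB # 500KB average
body_per_year = body_per_day * 365

print(round(avg_per_sec), round(peak_per_sec))
print(metadata_per_day // TB, "TB/day metadata")
print(body_per_day // TB, "TB/day bodies")
print(attachments_per_day // PB, "PB/day attachments")
print(body_per_year / PB, "PB/year bodies")
```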

How does Gmail-style label system differ from traditional IMAP folders?
?
IMAP folders: email belongs to exactly ONE folder. Moving email = physical relocation. Archiving = moving to Archive folder. Gmail labels: email can have MULTIPLE labels simultaneously (e.g., INBOX + STARRED + WORK). Labels are metadata, not physical locations. Archive = remove INBOX label (email still exists with All Mail label). Benefits: more flexible organization, “move to multiple categories”, simpler storage (no physical moves). Storage: labels as SET in Cassandra + separate email_by_label table for efficient label-based queries.
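
The label model can be sketched with two in-memory maps that mirror the labels SET and the email_by_label table. The class and the `ALL_MAIL` label name are illustrative assumptions.

```python
class Mailbox:
    def __init__(self):
        self.labels_by_message = {}  # message_id -> set of labels
        self.messages_by_label = {}  # label -> set of message_ids

    def add_label(self, message_id, label):
        self.labels_by_message.setdefault(message_id, set()).add(label)
        self.messages_by_label.setdefault(label, set()).add(message_id)

    def remove_label(self, message_id, label):
        self.labels_by_message.get(message_id, set()).discard(label)
        self.messages_by_label.get(label, set()).discard(message_id)

    def archive(self, message_id):
        """Archiving only drops the INBOX label; the message itself stays."""
        self.remove_label(message_id, "INBOX")


box = Mailbox()
for label in ("ALL_MAIL", "INBOX", "WORK", "STARRED"):
    box.add_label("m1", label)
box.archive("m1")
print(sorted(box.labels_by_message["m1"]))
```

No data moves on archive; only two small metadata sets change.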

How do you implement full-text email search with Elasticsearch?
?
After storing the email in Cassandra + S3, an async worker indexes it in Elasticsearch: {user_id, message_id, subject (text, analyzed), sender (keyword, exact), body_snippet (text, ~200 chars), labels (keyword), date}. Every search query MUST include a user_id filter for security isolation. ES returns a list of message_ids → fetch full metadata from Cassandra. Why not store the full body in ES? Storing it is too expensive, and the ~200-char snippet is enough to render search results. For full-text matching, the body text is still indexed (searchable) but not stored in ES; the body file itself lives in S3.
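
The two ideas on this card, an inverted index and a mandatory user_id filter, can be illustrated with a toy index. This is not how Elasticsearch is called; it is an in-memory sketch with hypothetical names, whitespace tokenization, and AND semantics across query terms.

```python
from collections import defaultdict


class SearchIndex:
    def __init__(self):
        # (user_id, term) -> message_ids; scoping by user_id enforces isolation
        self.postings = defaultdict(set)

    def index(self, user_id, message_id, text):
        for term in text.lower().split():
            self.postings[(user_id, term)].add(message_id)

    def search(self, user_id, query):
        """Return message_ids of this user that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        matches = [self.postings.get((user_id, term), set()) for term in terms]
        return set.intersection(*matches)


idx = SearchIndex()
idx.index("alice", "m1", "Quarterly budget review")
idx.index("alice", "m2", "Budget approved")
idx.index("bob", "m9", "Budget review for Bob")
print(idx.search("alice", "budget review"))
print(idx.search("bob", "budget"))
```

The returned IDs would then be used to fetch full metadata from the primary store.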

Why is Elasticsearch kept separate from Cassandra in the email system?
?
Cassandra = source of truth for email data. ES = search index (derived, eventually consistent). Reasons to separate: (1) Different scaling needs — ES is read-heavy (search), Cassandra is write-heavy (receive). (2) ES index can be rebuilt from Cassandra if lost. (3) ES writes are async — search availability slightly lags storage (acceptable). (4) Different failure modes — ES failure means search is degraded but email receive/read still works. Design principle: don’t couple your search index to your operational database.

How does mobile email sync work efficiently?
?
Delta sync with sequence numbers: each user has a global monotonic sequence counter. Every change (new email, read, delete, label change) increments it and records what changed. Client stores last_sync_sequence locally. On reconnect: GET /sync?since=12345. Server returns all changes with sequence > 12345. Client applies delta. Push model: new email arrives → push notification via APNs/FCM → app wakes up → pulls delta sync. Avoids downloading full mailbox on reconnect. Background silent push notifications keep app synced even when not in foreground.
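
A per-user change log with a monotonic counter is enough to sketch the server side of delta sync. `SyncLog` and the change dict shape are assumptions for illustration; `since()` plays the role of the `GET /sync?since=...` handler.

```python
class SyncLog:
    def __init__(self):
        self.sequence = 0
        self.changes = []  # (sequence, change), in increasing sequence order

    def record(self, change) -> int:
        """Append a change and return the sequence number assigned to it."""
        self.sequence += 1
        self.changes.append((self.sequence, change))
        return self.sequence

    def since(self, last_seen: int):
        """Return every change the client has not applied yet."""
        return [(seq, change) for seq, change in self.changes if seq > last_seen]


log = SyncLog()
log.record({"type": "new_email", "message_id": "m1"})
log.record({"type": "read", "message_id": "m1"})
log.record({"type": "label_added", "message_id": "m1", "label": "WORK"})

last_sync_sequence = 1                # what the client stored locally
delta = log.since(last_sync_sequence)
last_sync_sequence = delta[-1][0]     # client advances after applying the delta
print(delta, last_sync_sequence)
```

The client transfers two small change records instead of re-downloading the mailbox.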

What is the email sending pipeline from user to recipient?
?
1. User submits email via HTTPS to the API server or to the SMTP submission server (port 587).
2. Email is pushed to Kafka (async, don’t block the user).
3. Outgoing SMTP worker: checks that the sender is allowed to send for the domain, signs the message with DKIM, runs the outbound spam check, does a DNS MX lookup for the recipient domain, and opens an SMTP connection to the recipient’s mail server.
4. If the recipient is an internal user: route to our own receiving pipeline.
5. If delivery fails: retry with exponential backoff, then bounce (NDR) to the sender.
6. Kafka ensures no email is lost if an SMTP worker crashes mid-delivery.
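
The retry schedule in step 5 can be sketched as a capped exponential series. The base delay, factor, attempt count, and cap below are illustrative assumptions; the card does not specify them.

```python
def backoff_schedule(base_seconds=60, factor=2, max_attempts=5, cap_seconds=3600):
    """Return the wait before each retry: base * factor**n, capped."""
    return [
        min(base_seconds * factor**attempt, cap_seconds)
        for attempt in range(max_attempts)
    ]


print(backoff_schedule())
print(backoff_schedule(max_attempts=8))
```

Once the schedule is exhausted, the worker would stop retrying and generate the bounce (NDR).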

How would you handle user storage quotas for email?
?
Track quota in Redis (fast counter per user). On email receive: check quota before the Cassandra write. If over quota: reject at SMTP level. 452 is the temporary “insufficient storage” response, so the sending server retries and bounces only after its retries expire; 552 (“exceeded storage allocation”) is the permanent rejection that produces an immediate bounce. Warning notifications at 80%/90%/95% via push and email. Quota updates: Redis for the real-time fast check; Cassandra/DB for the persistent quota record. Periodic reconciliation: a background job recomputes actual storage from Cassandra/S3 to catch drift between the Redis counter and the actual data.
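
The accept/reject decision and the warning thresholds can be sketched as a pure function. In a real deployment `used` would come from the Redis counter; here it is just an argument, and the function name and return shape are assumptions.

```python
WARN_PERCENTS = (95, 90, 80)  # checked from highest to lowest


def check_quota(used: int, incoming: int, limit: int):
    """Return (accepted, warning_percent or None) for an incoming email."""
    new_total = used + incoming
    if new_total > limit:
        return False, None          # caller rejects at the SMTP level
    percent = new_total * 100 // limit
    for threshold in WARN_PERCENTS:
        if percent >= threshold:
            return True, threshold  # accept, but notify the user
    return True, None


print(check_quota(used=10, incoming=5, limit=100))
print(check_quota(used=80, incoming=5, limit=100))
print(check_quota(used=99, incoming=5, limit=100))
```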

What is the incoming SMTP server responsible for?
?
Incoming SMTP server (port 25) handles the first phase of receiving: (1) Accept TCP connections from external mail servers. (2) Rate limit by sender IP to prevent DDoS. (3) Check SPF against the connecting IP. (4) Accept the SMTP DATA command (receive the raw email), then verify DKIM and evaluate DMARC, which need the message headers and body. (5) Assign a unique message_id if one is not already present. (6) Push the email bytes to Kafka topic “incoming_emails”. (7) Return SMTP 250 OK to the sender (acknowledging receipt). Crucially: the 250 OK is sent once Kafka has accepted the message but BEFORE the email is fully processed; Kafka durability guarantees we don’t lose it after that point.

How is email body content addressed in S3?
?
S3 key: “emails/{year}/{month}/{day}/{message_id}.eml”. The message_id makes the key unique. The date prefix allows lifecycle policy application per cohort and efficient object listing. Content: full RFC 2822 formatted email (with headers). Access: via pre-signed S3 URLs (expire in 1 hour) — served directly from S3 to client without going through application servers. ACL: private (never public). The Cassandra metadata record contains the body_s3_key field linking metadata to the body blob.
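
Building the key is a one-liner once the date convention is fixed. The sketch assumes UTC dates and zero-padded month and day, neither of which the card specifies; `body_key` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def body_key(message_id: str, received_at: datetime) -> str:
    """Build the object key for an email body from its receive time (UTC)."""
    day = received_at.astimezone(timezone.utc)
    return f"emails/{day:%Y}/{day:%m}/{day:%d}/{message_id}.eml"


key = body_key("abc123", datetime(2026, 4, 13, 9, 30, tzinfo=timezone.utc))
print(key)
```

This string is what the metadata record would hold in its body_s3_key field.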

What are the phases of the mail processing worker?
?
Mail processing worker consumes from Kafka and does: (1) Idempotency check: message_id in Cassandra? Skip if yes. (2) Spam filtering: ML + rules + reputation. (3) Virus scan: attachment scanning. (4) Attachment extraction: extract, hash, store in S3 with dedup. (5) Threading: check In-Reply-To → assign conversation_id. (6) Label assignment: INBOX (ham) or SPAM based on spam score. (7) Cassandra write: metadata record + label index record. (8) Async Elasticsearch indexing: push to ES indexer queue. (9) Push notification: notify recipient’s devices. Commit Kafka offset LAST (after all steps succeed).

What distinguishes the email metadata table from the email body and why does this split matter?
?
Metadata (Cassandra): subject, sender, recipients, labels, read status, date, conversation_id, body_s3_key; ~1KB per email. Body (S3): full HTML/text content; ~50KB per email. The split matters because: (1) Inbox listing fetches metadata ONLY (50x less data transferred per email listed). (2) The body is fetched on demand when the user opens the email. (3) About 80% of emails are rarely re-opened, so cold bodies can age into cheaper S3 tiers (eventually Glacier). (4) Metadata queries (sort by date, filter by label) don’t need to touch the body blob. (5) Separate failure domains: an S3 outage doesn’t affect inbox listing.

If Elasticsearch fails, what is the impact and how do you recover?
?
Impact: Email search is degraded or unavailable. Email receive, read, send, and label operations continue normally (ES is not in the critical path for those operations). Recovery: (1) Fix the ES cluster. (2) Replay email indexing from the source of truth: scan the email_metadata table in Cassandra (reading body text from S3 where the index needs it) and re-index all documents. Cassandra is the source of truth, so the ES index is always rebuildable. This is why writes go to Cassandra first and to ES asynchronously afterwards. Monitoring: alert when ES indexing lag exceeds a threshold (e.g., more than 60 seconds behind) to detect processing pipeline issues early.


Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH — Gmail-scale storage design, protocol knowledge, and delivery guarantees tested at Google, Microsoft, Yahoo
Last Updated: 2026-04-13