Chapter 10: Design a Notification System

Tags: volume1, notifications, push, ios, android, email

Status: 🟩 Interview ready
Difficulty: Medium
Time to complete: 45 min read + practice


Overview

A notification system delivers timely messages to users across multiple channels: mobile push, SMS, and email. It is a critical infrastructure component in apps like Facebook, Twitter, Uber, and Netflix.

Why this matters:

  • Extremely common interview question (appears at Meta, Uber, Airbnb, etc.)
  • Covers async event-driven architecture, third-party integrations, reliability
  • Teaches message queues, fan-out patterns, and user preference management

Problem Statement

Design a notification system that:

  • Sends iOS push, Android push, SMS, and email notifications
  • Operates at massive scale (millions of notifications per day)
  • Supports user opt-out and do-not-disturb preferences
  • Is soft real-time (slight delay is acceptable)
  • Does not lose notifications (reliable delivery)

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

  • What types of notifications? → iOS push, Android push, SMS, email
  • Is it real-time? → Soft real-time; a slight delay is acceptable
  • What triggers a notification? → External services / scheduled jobs
  • Can users opt out? → Yes, user preference settings supported
  • What devices? → iOS, Android, Laptop/Desktop (email)

Scale:

10M   mobile push notifications / day
1M    SMS notifications / day
5M    email notifications / day
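
To see what these daily volumes mean for capacity, it helps to convert them to per-second rates. The sketch below assumes traffic is spread evenly over the day; the 5x peak factor is an illustrative assumption for sizing headroom, not a figure from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes from the scale estimate above
DAILY_VOLUME = {"push": 10_000_000, "sms": 1_000_000, "email": 5_000_000}

# Assumed burst multiplier (hypothetical value, tune for your traffic shape)
PEAK_FACTOR = 5


def average_per_second(daily_count: int) -> float:
    """Average send rate if traffic were spread evenly across the day."""
    return daily_count / SECONDS_PER_DAY


def estimate_rates(volumes: dict, peak_factor: int = PEAK_FACTOR) -> dict:
    """Return {channel: (average_rate, assumed_peak_rate)} in notifications/second."""
    return {
        channel: (average_per_second(count), average_per_second(count) * peak_factor)
        for channel, count in volumes.items()
    }


if __name__ == "__main__":
    for channel, (avg, peak) in estimate_rates(DAILY_VOLUME).items():
        print(f"{channel:<6} avg ~{avg:.1f}/s   assumed peak ~{peak:.1f}/s")
```

Even with the assumed peak factor, every channel stays in the hundreds of requests per second, so the hard parts of this design are reliability and provider integration rather than raw throughput.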

Scope:

  • Build the notification dispatch infrastructure
  • Support all four channels with third-party providers
  • User preference and opt-out management
  • Reliable delivery (no data loss)

Non-Functional Requirements

  • Reliability: No notification loss (at-least-once delivery)
  • Scalability: Handle the ~16M notifications/day estimated above (10M push + 1M SMS + 5M email) with room to grow
  • Extensibility: Easy to add new notification types
  • Low latency: Soft real-time delivery (seconds, not hours)
  • Security: Authenticated API β€” only trusted services can send notifications

Step 2: High-Level Design (10 min)

Notification Types and Providers

Different channels require different third-party providers:

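Notification Type   Provider                                  Protocol
-----------------   ---------------------------------------   ---------------------------------------
iOS Push            Apple Push Notification Service (APNs)    HTTP/2 (legacy binary protocol retired)
Android Push        Firebase Cloud Messaging (FCM)            HTTPS (FCM HTTP v1 API)
SMS                 Twilio, Nexmo, AWS SNS                    REST API
Email               SendGrid, Mailchimp, AWS SES              SMTP / REST API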

Provider Overview

APNs (Apple Push Notification Service):

App → Register with APNs → APNs returns Device Token
Notification System → APNs → Device (iOS)
  • Requires an Apple developer credential (TLS certificate or token-based signing key)
  • Device token is unique per device per app
  • Token can be invalidated (app uninstall, OS update)

FCM (Firebase Cloud Messaging):

App → Register with FCM → FCM returns Registration Token
Notification System → FCM → Device (Android)
  • Replaces legacy Google Cloud Messaging (GCM)
  • Registration token required for each device
  • Supports both notification and data messages

SMS (Twilio / Nexmo):

Notification System → Twilio REST API → Carrier Network → User Phone
  • Phone number required
  • Higher cost than push notifications
  • High deliverability (even on feature phones)

Email (SendGrid / Mailchimp):

Notification System → SendGrid API → SMTP → User Inbox
  • Best for long-form, rich content
  • Supports templates and personalization
  • Track opens and clicks

Basic Flow (Before Scaling)

Single Notification Server (naive approach):

[Service 1]  ─┐
[Service 2]  ─┼──→ [Notification Server] ──→ [APNs] ──→ iOS Device
[Service 3]  ─┘          │                ──→ [FCM]  ──→ Android Device
                          │                ──→ [Twilio] → Phone
                          │                ──→ [SendGrid]→ Email
                          ↓
                    [User DB]
                    [Device Token DB]

Problems with single server:

  • Single point of failure
  • Hard to scale
  • Performance bottleneck
  • If third-party provider is slow, everything slows down

Gathering Device Tokens

iOS device token flow:

1. User installs app
2. App requests push permission from OS
3. iOS registers with APNs
4. APNs returns device token to app
5. App sends device token to our backend
6. Backend stores token in DB: { user_id, device_token, platform: "ios" }

Android registration token flow:

1. User installs app
2. App registers with FCM
3. FCM returns registration token to app
4. App sends token to our backend
5. Backend stores: { user_id, device_token, platform: "android" }
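
The last step of both flows is the same: the backend records the token it receives from the app. A minimal sketch of that step follows, using an in-memory dictionary in place of the database. `DeviceTokenStore` and its methods are hypothetical names for illustration; the choice to move a token to the most recent user (for example on a shared device) is an assumption, not something the flows above specify.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DeviceToken:
    user_id: int
    device_token: str
    platform: str  # "ios" or "android"
    last_seen: datetime


class DeviceTokenStore:
    """In-memory stand-in for the device token table (hypothetical helper)."""

    def __init__(self) -> None:
        # Keyed by (platform, token) so re-registering the same token is an upsert
        self._by_token = {}

    def register(self, user_id: int, device_token: str, platform: str) -> DeviceToken:
        """Insert the token, or refresh its owner and last_seen if it already exists."""
        if platform not in ("ios", "android"):
            raise ValueError(f"unsupported platform: {platform}")
        record = DeviceToken(user_id, device_token, platform, datetime.now(timezone.utc))
        self._by_token[(platform, device_token)] = record
        return record

    def tokens_for_user(self, user_id: int) -> list:
        """All tokens currently registered to a user (one per device)."""
        return [t for t in self._by_token.values() if t.user_id == user_id]
```

Treating registration as an upsert keeps the app-side logic simple: the app can send its token on every launch, and the backend only refreshes `last_seen`.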

Contact Information Storage

User Info DB:

CREATE TABLE users (
    user_id     BIGINT PRIMARY KEY,
    email       VARCHAR(255),
    phone       VARCHAR(20),
    country     VARCHAR(10),
    created_at  TIMESTAMP
);
 
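CREATE TABLE device_tokens (
    token_id     BIGINT PRIMARY KEY,
    user_id      BIGINT NOT NULL,
    device_token VARCHAR(500) NOT NULL,
    platform     ENUM('ios', 'android'),
    status       ENUM('active', 'stale', 'invalid') DEFAULT 'active',  -- set by token lifecycle handling
    last_seen    TIMESTAMP,
    INDEX idx_device_tokens_user (user_id)  -- find all devices for a user
);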

Step 3: Deep Dive (20 min)

Improved Architecture with Message Queues

Why queues?

  • Decouple notification sender from notification dispatcher
  • Handle bursts: Queue absorbs traffic spikes
  • Reliability: If provider is down, message stays in queue
  • Independent scaling per channel

┌─────────────────────────────────────────────────────────────────┐
│                     Notification Service                        │
│                                                                 │
│  [Service A] ─┐                                                 │
│  [Service B] ─┼──→ [Notification API Server]                    │
│  [Scheduler] ─┘            │                                    │
│                            ↓                                    │
│                  [Notification Router]                          │
│                (validate, enrich, route)                        │
│                            │                                    │
│          ┌─────────────────┼──────────────────┐                 │
│          ↓                 ↓                  ↓                 │
│     [iOS Queue]     [Android Queue]      [SMS Queue]            │
│     [Email Queue]                                               │
│          │                 │                  │                 │
│          ↓                 ↓                  ↓                 │
│    [iOS Workers]   [Android Workers]    [SMS Workers]           │
│    [Email Workers]                                              │
│          │                 │                  │                 │
│          ↓                 ↓                  ↓                 │
│       [APNs]             [FCM]             [Twilio]             │
│                                           [SendGrid]            │
└─────────────────────────────────────────────────────────────────┘
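
The routing step in the diagram above can be sketched as follows. This is a minimal illustration using the standard library's `queue.Queue` in place of Kafka or SQS; `Notification` and `NotificationRouter` are hypothetical names, and the enrichment step is omitted.

```python
import queue
from dataclasses import dataclass, field

CHANNELS = ("ios", "android", "sms", "email")


@dataclass
class Notification:
    user_id: int
    channel: str
    payload: dict = field(default_factory=dict)


class NotificationRouter:
    """Validates a notification and places it on its channel's queue."""

    def __init__(self) -> None:
        # One independent queue per channel, so a slow provider
        # only backs up its own queue and workers scale per channel
        self.queues = {channel: queue.Queue() for channel in CHANNELS}

    def route(self, notification: Notification) -> None:
        if notification.channel not in self.queues:
            raise ValueError(f"unknown channel: {notification.channel}")
        self.queues[notification.channel].put(notification)
```

Because each channel has its own queue, an SMS provider outage leaves the iOS, Android and email queues draining normally.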

Full Architecture Diagram

External Services / Schedulers
         │
         ▼
┌─────────────────┐     ┌──────────────────┐
│  Notification   │────▶│   User DB        │
│  API Servers    │     │  (email, phone)  │
│  (stateless)    │────▶│  Device Token DB │
└────────┬────────┘     └──────────────────┘
         │
         │ Validate & Route
         ▼
┌──────────────────────────────────────────┐
│        Message Queues (Kafka/SQS)        │
│  [iOS Queue] [Android Queue]             │
│  [SMS Queue] [Email Queue]               │
└─────┬─────────┬─────────┬─────────┬──────┘
      │         │         │         │
      ▼         ▼         ▼         ▼
     iOS     Android     SMS      Email
   Workers   Workers   Workers   Workers
      │         │         │         │
      ▼         ▼         ▼         ▼
   [APNs]     [FCM]   [Twilio] [SendGrid]
      │         │         │         │
      ▼         ▼         ▼         ▼
     iOS     Android    Phone     Inbox
   Device    Device

Supporting components:

┌─────────────────┐    ┌─────────────────────┐
│   Notification  │    │   Notification      │
│   Template DB   │    │   Settings DB       │
│  (reusable msg) │    │  (user prefs,       │
│                 │    │   opt-out, DND)     │
└─────────────────┘    └─────────────────────┘

┌─────────────────┐    ┌─────────────────────┐
│  Analytics DB   │    │   Cache (Redis)     │
│  (sent/delivered│    │  (user prefs,       │
│   /read status) │    │   templates)        │
└─────────────────┘    └─────────────────────┘

Reliability: Preventing Data Loss

Problem: Notification workers can crash. If a message is dequeued but the worker dies before sending it, the message is lost.

Solution: At-least-once delivery with retry

Worker dequeues message
         │
         ▼
  Send to provider
         │
     ┌───┴───────┐
   Success     Failure
     │           │
     ▼           ▼
 ACK queue   Put back in queue
             (with delay)
                 │
            Retry counter++
                 │
           If retry > MAX
                 │
                 ▼
         Dead Letter Queue
         (alert on-call)

Retry with exponential backoff:

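import time

class ProviderError(Exception):
    """Raised by a provider client when a send fails."""

def send_with_retry(notification, max_retries=3):
    # `provider` and `dead_letter_queue` are assumed to be clients configured elsewhere
    for attempt in range(max_retries):
        try:
            provider.send(notification)
            return  # Success
        except ProviderError:
            if attempt == max_retries - 1:
                # Retries exhausted: hand off to the DLQ and surface the error
                dead_letter_queue.publish(notification)
                raise
            delay_seconds = 2 ** attempt  # 1s, then 2s (with max_retries=3)
            time.sleep(delay_seconds)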

Dead Letter Queue (DLQ):

  • Messages that fail after all retries go here
  • Alert engineers for manual investigation
  • Prevents infinite retry loops
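
The ACK / requeue / DLQ decision from the flow above can be sketched with plain in-memory structures. This is only an illustration: `Message`, `process_one` and `MAX_RETRIES` are hypothetical names, the requeue happens immediately rather than after a delay, and a real broker (for example SQS visibility timeouts or Kafka offset commits) is what actually protects a message if the worker crashes mid-send.

```python
from collections import deque
from dataclasses import dataclass

MAX_RETRIES = 3  # hypothetical limit


@dataclass
class Message:
    body: str
    retry_count: int = 0


def process_one(main_queue: deque, dead_letters: list, send) -> str:
    """Handle one message; returns "acked", "requeued", or "dead_lettered".

    `send` is any callable that raises an exception when the provider call fails.
    """
    message = main_queue.popleft()
    try:
        send(message.body)
    except Exception:
        message.retry_count += 1
        if message.retry_count > MAX_RETRIES:
            dead_letters.append(message)  # give up; this is where on-call is alerted
            return "dead_lettered"
        main_queue.append(message)  # a real broker would redeliver after a delay
        return "requeued"
    return "acked"  # only a successful send counts as done
```

With `MAX_RETRIES = 3`, a message that always fails is requeued three times and dead-lettered on the fourth failure, so it can never loop forever.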

Notification Templates

Templates prevent duplication and ensure consistency:

{
  "template_id": "order_shipped",
  "channels": {
    "push": {
      "title": "Your order is on its way!",
      "body": "Order #{{order_id}} has shipped. ETA: {{eta}}."
    },
    "email": {
      "subject": "Order #{{order_id}} Shipped",
      "html_body": "<h1>Your order is on its way!</h1>..."
    },
    "sms": {
      "body": "Your order #{{order_id}} has shipped. Track it: {{tracking_url}}"
    }
  }
}

Benefits:

  • Consistent messaging across channels
  • Easy A/B testing
  • Non-engineers can update copy without deploys
  • Supports localization (i18n)
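
Filling the `{{placeholder}}` fields in a template like the one above can be sketched in a few lines. This is a minimal illustration, not a full template engine; `render` is a hypothetical helper, and failing loudly on a missing variable is a design assumption (the alternative is sending a message with a blank in it).

```python
import re

# Matches {{name}} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Replace each {{name}} with data[name]; raise KeyError if a value is missing."""

    def substitute(match):
        key = match.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    body = "Order #{{order_id}} has shipped. ETA: {{eta}}."
    print(render(body, {"order_id": "ORD-789", "eta": "Tomorrow by 5pm"}))
```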

User Notification Settings (Opt-out / Do-Not-Disturb)

Settings schema:

CREATE TABLE notification_settings (
    user_id         BIGINT,
    channel         ENUM('push', 'sms', 'email'),
    notification_type VARCHAR(100),  -- 'marketing', 'alerts', 'updates'
    opted_in        BOOLEAN DEFAULT true,
    dnd_start       TIME,           -- e.g., 22:00 (10pm)
    dnd_end         TIME,           -- e.g., 08:00 (8am)
    timezone        VARCHAR(50),
    PRIMARY KEY (user_id, channel, notification_type)
);

Checking preferences before send:

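def should_send(user_id, channel, notification_type):
    settings = db.get_settings(user_id, channel, notification_type)

    # Assumes get_settings returns None when no row exists:
    # fall back to the schema default (opted in, no DND window)
    if settings is None:
        return True

    # Check opt-out
    if not settings.opted_in:
        return False

    # Check Do-Not-Disturb window (skipped when no window is configured)
    if settings.dnd_start is not None and settings.dnd_end is not None:
        user_time = get_user_local_time(user_id, settings.timezone)
        if is_in_dnd_window(user_time, settings.dnd_start, settings.dnd_end):
            return False  # Or queue for later delivery

    return True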
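
The DND check has one subtle case: a window such as 22:00 to 08:00 crosses midnight, so a plain `start <= now < end` comparison never matches. One possible implementation of `is_in_dnd_window` is sketched below. Treating the end time as exclusive and an equal start/end as "DND disabled" are assumptions; `local_time_in` is a hypothetical helper that relies on the standard `zoneinfo` module and an IANA zone name stored in the settings row.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo


def is_in_dnd_window(local_time: time, dnd_start: time, dnd_end: time) -> bool:
    """True if local_time is in [dnd_start, dnd_end), including overnight windows."""
    if dnd_start == dnd_end:
        return False  # empty window: treat as DND disabled
    if dnd_start < dnd_end:
        # Same-day window, e.g. 13:00-15:00
        return dnd_start <= local_time < dnd_end
    # Overnight window, e.g. 22:00-08:00: late evening OR early morning
    return local_time >= dnd_start or local_time < dnd_end


def local_time_in(tz_name: str, now_utc: datetime) -> time:
    """Convert an aware UTC datetime to wall-clock time in the user's time zone."""
    return now_utc.astimezone(ZoneInfo(tz_name)).time()
```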

Cache user settings in Redis to avoid DB lookup on every notification:

Key:   notification:settings:{user_id}:{channel}:{notification_type}
Value: { opted_in: true, dnd_start: "22:00", dnd_end: "08:00" }
TTL:   300 seconds (5 minutes)
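
The read path implied here is a cache-aside lookup: serve from the cache while the entry is fresh, otherwise load from the database and re-cache. The sketch below uses an in-memory dictionary in place of Redis so it runs on its own; `SettingsCache`, `loader` and the injectable `clock` are hypothetical names, and explicit invalidation on preference changes is an assumed addition.

```python
import time


class SettingsCache:
    """Tiny TTL cache standing in for Redis (hypothetical helper)."""

    def __init__(self, loader, ttl_seconds: float = 300, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. a DB query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key: str) -> dict:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]        # fresh hit: skip the database
        value = self._loader(key)  # miss or expired: reload and re-cache
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call when a user edits preferences so the change applies immediately."""
        self._entries.pop(key, None)
```

Without invalidation, an opt-out could take up to the full TTL (5 minutes) to take effect, which may or may not be acceptable depending on the product.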

Rate Limiting Notifications

Problem: Don't overwhelm users. Sending too many notifications causes app uninstalls.

Rate limit per user per channel:

Push notifications: max 10/day per user
SMS:               max 5/day per user
Email:             max 3/day per user
Marketing emails:  max 1/week per user

Implementation:

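import datetime

# Daily caps per channel; the weekly marketing cap needs its own counter
DAILY_LIMITS = {"push": 10, "sms": 5, "email": 3}

def check_rate_limit(user_id, channel):
    # `redis` is assumed to be a connected client created elsewhere
    today = datetime.date.today().isoformat()
    key = f"notif_rate:{user_id}:{channel}:{today}"
    count = redis.incr(key)   # atomic; a missing key starts at 0
    redis.expire(key, 86400)  # 1 day TTL; refreshing is harmless since the date is in the key

    return count <= DAILY_LIMITS[channel]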

Notification Tracking (Sent / Delivered / Read)

Tracking table:

CREATE TABLE notification_log (
    notification_id  BIGINT PRIMARY KEY,
    user_id          BIGINT,
    channel          VARCHAR(20),
    template_id      VARCHAR(100),
    status           ENUM('pending', 'sent', 'delivered', 'failed', 'read'),
    sent_at          TIMESTAMP,
    delivered_at     TIMESTAMP,
    read_at          TIMESTAMP,
    provider_msg_id  VARCHAR(200),  -- ID from APNs/FCM/Twilio
    retry_count      INT DEFAULT 0
);

How tracking works:

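1. SENT:      Worker records the notification as sent after the provider API call succeeds
2. DELIVERED: Delivery is confirmed by a provider receipt (e.g., SMS/email status webhooks) or, for push, by the client app acknowledging receipt, since APNs/FCM responses only confirm acceptance
3. READ:      Client app calls /notification/read endpoint when user opens it
4. FAILED:    Worker records failure after retries exhausted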

Analytics pipeline:

Notification events → Kafka → Stream processor → Analytics DB
                                                → Dashboard

Security: Authenticating Callers

Only trusted services should trigger notifications:

Service A wants to send notification:
   1. Service A has API key / JWT token
   2. Notification API validates token
   3. Notification API checks: Is Service A allowed to send this type?
   4. If yes → enqueue notification
   5. If no  → 403 Forbidden

API request format:
POST /v1/notify
Authorization: Bearer <service_jwt>

{
  "user_id": "123456",
  "template_id": "order_shipped",
  "channel": ["push", "email"],
  "data": {
    "order_id": "ORD-789",
    "eta": "Tomorrow by 5pm"
  }
}
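
The authorization decision (steps 2 to 5) can be sketched as a lookup against a per-service permission table. This is a simplified illustration: it uses plain API keys instead of verifying a JWT, `SERVICE_PERMISSIONS` and its entries are made-up example data, and returning 401 for an unknown caller and 202 for an accepted request are assumptions beyond the 403 case named above.

```python
from http import HTTPStatus
from typing import Optional

# Hypothetical registry: API key -> (service name, template IDs it may send)
SERVICE_PERMISSIONS = {
    "key-orders": ("order-service", {"order_shipped", "order_delivered"}),
    "key-marketing": ("marketing-service", {"weekly_digest"}),
}


def authorize(api_key: Optional[str], template_id: str) -> HTTPStatus:
    """Return the HTTP status the Notification API would answer with."""
    if api_key is None or api_key not in SERVICE_PERMISSIONS:
        return HTTPStatus.UNAUTHORIZED   # 401: caller not recognized
    _service, allowed = SERVICE_PERMISSIONS[api_key]
    if template_id not in allowed:
        return HTTPStatus.FORBIDDEN      # 403: known caller, type not allowed
    return HTTPStatus.ACCEPTED           # 202: validated, safe to enqueue
```

Scoping each caller to specific notification types limits the blast radius if one service's credentials leak.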

Device Token Lifecycle Management

Problem: Device tokens can become stale (app uninstalled, device reset).

Token invalidation handling:

1. Worker sends notification to APNs/FCM
2. Provider returns error: "BadDeviceToken" or "NotRegistered"
3. Worker marks token as invalid in DB
4. Future notifications skip invalid tokens
5. When user re-installs app, new token is registered

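APNs HTTP status codes:
- 410 (Unregistered): Device token is no longer active → Delete token
- 400 (BadDeviceToken): Bad device token → Mark invalid

FCM error codes:
- UNREGISTERED (NotRegistered in the legacy API) → Delete token
- INVALID_ARGUMENT caused by a malformed token (InvalidRegistration in the legacy API) → Mark invalid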
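
A worker can turn these provider responses into a token action with a small lookup table. The sketch below covers only the codes listed above; `TokenAction` and `action_for_error` are hypothetical names, and treating every other error as transient (keep the token and let the retry logic handle it) is an assumption.

```python
from enum import Enum


class TokenAction(Enum):
    DELETE = "delete"
    MARK_INVALID = "mark_invalid"
    KEEP = "keep"


# (provider, error identifier) -> what to do with the stored token
TOKEN_ERROR_ACTIONS = {
    ("apns", "410"): TokenAction.DELETE,
    ("apns", "400"): TokenAction.MARK_INVALID,
    ("fcm", "NotRegistered"): TokenAction.DELETE,
    ("fcm", "UNREGISTERED"): TokenAction.DELETE,  # FCM HTTP v1 name for the same condition
    ("fcm", "InvalidRegistration"): TokenAction.MARK_INVALID,
}


def action_for_error(provider: str, error_code: str) -> TokenAction:
    """Decide what to do with a device token after a provider error."""
    # Unknown errors (timeouts, 5xx) are assumed transient: keep the token
    return TOKEN_ERROR_ACTIONS.get((provider, error_code), TokenAction.KEEP)
```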

Token refresh strategy:

-- Mark tokens that haven't been seen in 30 days as stale
UPDATE device_tokens
SET status = 'stale'
WHERE last_seen < NOW() - INTERVAL 30 DAY;

Monitoring and Alerting

Key metrics to monitor:

Metric                     Alert Threshold    Tool
------------------------   ----------------   --------------------
Notification queue depth   > 100K messages    CloudWatch / Datadog
Delivery success rate      < 95%              PagerDuty
p99 end-to-end latency     > 10 seconds       Grafana
DLQ message count          > 0                PagerDuty
Provider error rate        > 1%               Grafana

Health dashboard:

┌─────────────────────────────────────────┐
│  Notifications Dashboard                │
│                                         │
│  Last 1 hour:                           │
│  Push sent:    1.2M  ✅ Delivery: 98.5% │
│  SMS sent:     50K   ✅ Delivery: 99.1% │
│  Email sent:   200K  ✅ Delivery: 97.8% │
│                                         │
│  Queue depths:                          │
│  iOS queue:     12K  🟡 (normal)        │
│  Android queue: 8K   ✅ (low)           │
│  SMS queue:     2K   ✅ (low)           │
│  Email queue:   5K   ✅ (low)           │
└─────────────────────────────────────────┘

Handling Third-Party Provider Failures

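Problem: Any third-party provider can go down. What do we do?

Strategy: Multiple providers per channel where a substitute exists (email, SMS). APNs and FCM have no drop-in replacement, so push messages wait in the queue until the provider recovers. Email example: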

Primary:  SendGrid (email)
Fallback: Amazon SES (email)

If SendGrid fails:
1. Worker catches error
2. Increments SendGrid failure counter
3. If failure rate > 5% → Switch to Amazon SES
4. Alert on-call engineer
5. Retry failed messages via SES

Circuit breaker pattern:

Normal → [Worker] → [SendGrid] ← sends OK
Degraded → [Worker] → [SendGrid fails] → [SES] ← fallback
Open circuit → [Worker] → [SES] (skip SendGrid temporarily)
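
The three states above can be sketched as a small circuit breaker wrapped around the primary provider. Note one simplification: this version opens after a number of consecutive failures rather than tracking a failure rate, and it retries the primary after a fixed cooldown. `CircuitBreaker`, `send_email` and the default thresholds are hypothetical; `primary` and `fallback` are any callables that raise on failure.

```python
import time


class CircuitBreaker:
    """Skips the primary provider after repeated failures, then retries it after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed (normal)

    def allow_primary(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: allow a trial call only once the cooldown has elapsed
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the cooldown


def send_email(message, primary, fallback, breaker: CircuitBreaker) -> str:
    """Send via the primary provider when allowed, otherwise via the fallback."""
    if breaker.allow_primary():
        try:
            primary(message)
            breaker.record_success()
            return "primary"
        except Exception:
            breaker.record_failure()
    fallback(message)
    return "fallback"
```

While the circuit is open, workers stop paying the latency cost of calls that are likely to fail, and the primary gets a chance to recover before traffic returns.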

Design Summary

Final Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Notification System                           │
│                                                                  │
│  Callers (Services / Schedulers)                                 │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐  ┌──────────────┐  ┌──────────────────┐     │
│  │ Notification API│  │  User DB     │  │  Device Token DB │     │
│  │ Servers         │──│  (email,     │  │  (APNs token,    │     │
│  │ (Validate,      │  │   phone)     │  │   FCM token)     │     │
│  │  Enrich, Route) │  └──────────────┘  └──────────────────┘     │
│  └────────┬────────┘                                             │
│           │         ┌─────────────────────────────────────┐      │
│           │         │  User Settings Cache (Redis)        │      │
│           │         │  Notification Rate Limit (Redis)    │      │
│           │         └─────────────────────────────────────┘      │
│           ▼                                                      │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │               Message Queues (Kafka / SQS)                │   │
│  │  [iOS Queue]  [Android Queue]  [SMS Queue]  [Email Queue] │   │
│  └──────┬───────────────┬──────────────┬────────────┬────────┘   │
│         ▼               ▼              ▼            ▼            │
│  [iOS Workers]  [Android Workers] [SMS Workers] [Email Workers]  │
│         │               │              │            │            │
│         ▼               ▼              ▼            ▼            │
│      [APNs]           [FCM]        [Twilio]    [SendGrid]        │
│                                    [Nexmo]     [AWS SES]         │
│         │               │              │            │            │
│         ▼               ▼              ▼            ▼            │
│     iOS Device    Android Device    Phone         Inbox          │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │         Notification Log DB (Analytics)                   │   │
│  │         Template DB  |  Settings DB  |  DLQ               │   │
│  └───────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘

Key Design Decisions

Decision             Choice                       Reasoning
------------------   --------------------------   ---------------------------------------
Notification queue   Kafka or SQS                 Decouple sender/dispatcher, reliability
iOS push             APNs                         Apple requirement for iOS push
Android push         FCM                          Google requirement for Android push
SMS                  Twilio/Nexmo                 Mature APIs, global reach
Email                SendGrid/SES                 High deliverability, analytics
User settings        MySQL + Redis cache          Persistent + fast lookup
Retry logic          Exponential backoff + DLQ    Resilient, no infinite loops
Token management     DB + invalidation on error   Handle stale tokens gracefully

Interview Questions & Answers

Q: How do you ensure no notification is lost?
A: Use durable message queues (Kafka with replication factor 3, or SQS). Messages are only ACKed after successful delivery to the provider. On failure, messages are retried with exponential backoff. After max retries, messages go to a Dead Letter Queue for manual investigation. This gives at-least-once delivery semantics.

Q: How do you handle a user who uninstalls the app?
A: The device token becomes invalid. On the next send attempt, APNs returns error code 410 (Unregistered) or FCM returns NotRegistered. The worker catches this, marks the token as invalid in the device token DB, and skips it for future notifications. The token is re-registered if the user reinstalls the app.

Q: How do you respect user Do-Not-Disturb settings?
A: Before routing a notification to the queue, the Notification API checks user settings from Redis cache (backed by MySQL). If the user's current local time falls within their DND window, the notification can either be dropped (for non-urgent) or queued with a delayed delivery timestamp for after the DND window ends.

Q: How would you scale the email notification system to handle 5M emails/day?
A: 5M/day ≈ 58 emails/second on average (5,000,000 / 86,400). That is a modest rate for a bulk email provider such as SendGrid, so a few email worker instances suffice. For higher scale: partition the email queue by user_id, run multiple worker pools, batch recipients per API request where the provider supports it, and stagger sending to stay under per-second provider limits. Monitor delivery rates and bounce rates continuously.

Q: What happens if FCM goes down?
A: Android push notifications back up in the Android queue. Kafka retains messages durably. When FCM recovers, workers process the backlog. For critical alerts, implement a fallback to SMS. Monitor FCM health with a circuit breaker: if the error rate exceeds a threshold, alert on-call and consider holding non-urgent notifications until FCM recovers.


Key Takeaways

  1. Four notification channels each use a dedicated third-party provider: APNs (iOS), FCM (Android), Twilio/Nexmo (SMS), SendGrid/Mailchimp (email)
  2. Message queues are critical for reliability: they decouple sending from dispatching and absorb traffic spikes
  3. At-least-once delivery via queue retry + DLQ, so a notification is never silently lost
  4. Device tokens are ephemeral: always handle invalid token errors from APNs and FCM gracefully
  5. User preferences (opt-out, DND, channel settings) must be checked before every notification, cached in Redis for speed
  6. Rate limit notifications per user to avoid overwhelming users and causing app uninstalls
  7. Notification tracking (sent/delivered/read) is essential for analytics, debugging, and product iteration


Last Updated: 2026-04-13
Status: Interview ready. Know APNs vs FCM, queue-based reliability, and the DLQ pattern!