Chapter 10: Design a Notification System
volume1 notifications push ios android email
Status: 🟩 Interview ready
Difficulty: Medium
Time to complete: 45 min read + practice
Overview
A notification system delivers timely messages to users across multiple channels: mobile push, SMS, and email. It is a critical infrastructure component in apps like Facebook, Twitter, Uber, and Netflix.
Why this matters:
- Extremely common interview question (appears at Meta, Uber, Airbnb, etc.)
- Covers async event-driven architecture, third-party integrations, reliability
- Teaches message queues, fan-out patterns, and user preference management
Problem Statement
Design a notification system that:
- Sends iOS push, Android push, SMS, and email notifications
- Operates at massive scale (millions of notifications per day)
- Supports user opt-out and do-not-disturb preferences
- Is soft real-time (slight delay is acceptable)
- Does not lose notifications (reliable delivery)
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- What types of notifications? → iOS push, Android push, SMS, email
- Is it real-time? → Soft real-time; a slight delay is acceptable
- What triggers a notification? → External services / scheduled jobs
- Can users opt out? → Yes, user preference settings are supported
- What devices? → iOS, Android, laptop/desktop (email)
Scale:
10M mobile push notifications / day
1M SMS notifications / day
5M email notifications / day
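To size queues and workers, it helps to turn the daily volumes above into per-second rates. The sketch below assumes traffic is spread evenly over 24 hours; real traffic is burstier, so `ASSUMED_PEAK_FACTOR` is an illustrative placeholder, not a figure from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes from the scale estimate above.
DAILY_VOLUME = {
    "push": 10_000_000,
    "sms": 1_000_000,
    "email": 5_000_000,
}

# Hypothetical peak-to-average ratio; tune it to observed traffic.
ASSUMED_PEAK_FACTOR = 5


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def capacity_table(volumes: dict, peak_factor: int) -> dict:
    """Map channel -> (average per second, assumed peak per second)."""
    return {
        channel: (average_rate(count), average_rate(count) * peak_factor)
        for channel, count in volumes.items()
    }


if __name__ == "__main__":
    for channel, (avg, peak) in capacity_table(DAILY_VOLUME, ASSUMED_PEAK_FACTOR).items():
        print(f"{channel:>5}: avg {avg:6.1f}/s, assumed peak {peak:6.1f}/s")
```

On average this works out to roughly 116 push, 12 SMS, and 58 email notifications per second, which is modest; the design challenge is reliability and burst handling rather than raw throughput.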
Scope:
- Build the notification dispatch infrastructure
- Support all four channels with third-party providers
- User preference and opt-out management
- Reliable delivery (no data loss)
Non-Functional Requirements
- Reliability: No notification loss (at-least-once delivery)
- Scalability: Handle 10M+ notifications/day with room to grow
- Extensibility: Easy to add new notification types
- Low latency: Soft real-time delivery (seconds, not hours)
- Security: Authenticated API; only trusted services can send notifications
Step 2: High-Level Design (10 min)
Notification Types and Providers
Different channels require different third-party providers:
| Notification Type | Provider | Protocol |
|---|---|---|
| iOS Push | Apple Push Notification Service (APNs) | HTTP/2 (legacy binary interface retired) |
| Android Push | Firebase Cloud Messaging (FCM) | HTTP |
| SMS | Twilio, Nexmo, AWS SNS | REST API |
| Email | SendGrid, Mailchimp, AWS SES | SMTP / REST API |
Provider Overview
APNs (Apple Push Notification Service):
App → Register with APNs → APNs returns Device Token
Notification System → APNs → Device (iOS)
- Requires Apple developer certificate
- Device token is unique per device per app
- Token can be invalidated (app uninstall, OS update)
FCM (Firebase Cloud Messaging):
App → Register with FCM → FCM returns Registration Token
Notification System → FCM → Device (Android)
- Replaces legacy Google Cloud Messaging (GCM)
- Registration token required for each device
- Supports both notification and data messages
SMS (Twilio / Nexmo):
Notification System → Twilio REST API → Carrier Network → User Phone
- Phone number required
- Higher cost than push notifications
- High deliverability (even on feature phones)
Email (SendGrid / Mailchimp):
Notification System → SendGrid API → SMTP → User Inbox
- Best for long-form, rich content
- Supports templates and personalization
- Track opens and clicks
Basic Flow (Before Scaling)
Single Notification Server (naive approach):
[Service 1] ─┐
[Service 2] ─┼──→ [Notification Server] ──→ [APNs]     ──→ iOS Device
[Service 3] ─┘              │           ──→ [FCM]      ──→ Android Device
                            │           ──→ [Twilio]   ──→ Phone
                            │           ──→ [SendGrid] ──→ Email
                            ▼
                        [User DB]
                    [Device Token DB]
Problems with single server:
- Single point of failure
- Hard to scale
- Performance bottleneck
- If third-party provider is slow, everything slows down
Gathering Device Tokens
iOS device token flow:
1. User installs app
2. App requests push permission from OS
3. iOS registers with APNs
4. APNs returns device token to app
5. App sends device token to our backend
6. Backend stores token in DB: { user_id, device_token, platform: "iOS" }
Android registration token flow:
1. User installs app
2. App registers with FCM
3. FCM returns registration token to app
4. App sends token to our backend
5. Backend stores: { user_id, device_token, platform: "Android" }
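The final step of both flows (the backend storing the token) can be sketched as an upsert keyed by platform and token. This is a minimal in-memory illustration; `DeviceTokenRegistry` and its methods are hypothetical names, and a real backend would write to the device token database instead of a dict.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DeviceToken:
    user_id: int
    device_token: str
    platform: str        # "ios" or "android"
    last_seen: datetime


class DeviceTokenRegistry:
    """In-memory stand-in for the device token table (illustrative only)."""

    def __init__(self) -> None:
        self._by_token = {}

    def register(self, user_id: int, device_token: str, platform: str) -> DeviceToken:
        platform = platform.lower()
        if platform not in ("ios", "android"):
            raise ValueError(f"unsupported platform: {platform}")
        record = DeviceToken(user_id, device_token, platform, datetime.now(timezone.utc))
        # Upsert: re-registering a token reassigns its owner and refreshes last_seen.
        self._by_token[(platform, device_token)] = record
        return record

    def tokens_for_user(self, user_id: int) -> list:
        return [t for t in self._by_token.values() if t.user_id == user_id]
```

Keying on the token rather than the user means a device handed to a new account does not keep receiving the previous owner's notifications.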
Contact Information Storage
User Info DB:
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    email VARCHAR(255),
    phone VARCHAR(20),
    country VARCHAR(10),
    created_at TIMESTAMP
);
CREATE TABLE device_tokens (
    token_id BIGINT PRIMARY KEY,
    user_id BIGINT,
    device_token VARCHAR(500),
    platform ENUM('ios', 'android'),
    status ENUM('active', 'stale', 'invalid') DEFAULT 'active', -- used by the token lifecycle steps later in this chapter
    last_seen TIMESTAMP
);
Step 3: Deep Dive (20 min)
Improved Architecture with Message Queues
Why queues?
- Decouple notification sender from notification dispatcher
- Handle bursts: Queue absorbs traffic spikes
- Reliability: If provider is down, message stays in queue
- Independent scaling per channel
Notification Service
[Service A] ─┐
[Service B] ─┼──→ [Notification API Server]
[Scheduler] ─┘                │
                              ▼
                    [Notification Router]
                  (validate, enrich, route)
                              │
        ┌───────────────┬─────┴─────────┬──────────────┐
        ▼               ▼               ▼              ▼
   [iOS Queue]   [Android Queue]   [SMS Queue]   [Email Queue]
        │               │               │              │
        ▼               ▼               ▼              ▼
  [iOS Workers] [Android Workers] [SMS Workers] [Email Workers]
        │               │               │              │
        ▼               ▼               ▼              ▼
     [APNs]           [FCM]         [Twilio]      [SendGrid]
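The router step in the diagram above (validate, then fan out to one queue per channel) can be sketched as follows. Python's `queue.Queue` stands in for Kafka topics or SQS queues here, and `Notification`, `make_queues`, and `route` are hypothetical names used only for illustration.

```python
import queue
from dataclasses import dataclass, field

CHANNELS = ("ios", "android", "sms", "email")


@dataclass
class Notification:
    user_id: int
    channels: list
    payload: dict = field(default_factory=dict)


def make_queues() -> dict:
    """One FIFO queue per channel; a stand-in for Kafka topics or SQS queues."""
    return {channel: queue.Queue() for channel in CHANNELS}


def route(notification: Notification, queues: dict) -> int:
    """Validate the request, then enqueue one message per requested channel.

    Returns the number of messages enqueued.
    """
    unknown = [c for c in notification.channels if c not in queues]
    if unknown:
        raise ValueError(f"unknown channels: {unknown}")
    for channel in notification.channels:
        queues[channel].put(
            {"user_id": notification.user_id, "channel": channel, "payload": notification.payload}
        )
    return len(notification.channels)
```

Because each channel has its own queue, a slow SMS provider only grows the SMS backlog; push and email workers keep draining their queues independently.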
Full Architecture Diagram
External Services / Schedulers
         │
         ▼
┌─────────────────┐     ┌─────────────────┐
│  Notification   │────▶│ User DB         │
│  API Servers    │     │ (email, phone)  │
│  (stateless)    │────▶│ Device Token DB │
└────────┬────────┘     └─────────────────┘
         │
         │ Validate & Route
         ▼
┌──────────────────────────────────────────────────────────────┐
│                  Message Queues (Kafka/SQS)                  │
│  [iOS Queue]   [Android Queue]   [SMS Queue]   [Email Queue] │
└───────┬───────────────┬───────────────┬──────────────┬───────┘
        │               │               │              │
        ▼               ▼               ▼              ▼
  [iOS Workers] [Android Workers] [SMS Workers] [Email Workers]
        │               │               │              │
        ▼               ▼               ▼              ▼
     [APNs]           [FCM]         [Twilio]      [SendGrid]
        │               │               │              │
        ▼               ▼               ▼              ▼
   iOS Device    Android Device       Phone          Inbox
Supporting components:
┌─────────────────┐     ┌────────────────────┐
│  Notification   │     │  Notification      │
│  Template DB    │     │  Settings DB       │
│  (reusable msg) │     │  (user prefs,      │
│                 │     │   opt-out, DND)    │
└─────────────────┘     └────────────────────┘
┌─────────────────┐     ┌────────────────────┐
│  Analytics DB   │     │  Cache (Redis)     │
│  (sent/delivered│     │  (user prefs,      │
│  /read status)  │     │   templates)       │
└─────────────────┘     └────────────────────┘
Reliability: Preventing Data Loss
Problem: Notification workers can crash. If a message is removed from the queue before the send succeeds, a crash at that point loses it.
Solution: At-least-once delivery with retry
Worker dequeues message
         │
         ▼
  Send to provider
         │
     ┌───┴───┐
  Success Failure
     │       │
     ▼       ▼
 ACK queue  Put back in queue
             (with delay)
                   │
            Retry counter++
                   │
            If retry > MAX
                   │
                   ▼
           Dead Letter Queue
            (alert on-call)
Retry with exponential backoff:
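import time

def send_with_retry(notification, max_retries=3):
    """Try the provider up to max_retries times, doubling the wait between attempts."""
    for attempt in range(max_retries):
        try:
            provider.send(notification)
            return  # Success
        except ProviderError:
            if attempt == max_retries - 1:
                # Retries exhausted: park the message for manual investigation
                dead_letter_queue.publish(notification)
                raise
            delay_seconds = 2 ** attempt  # 1s, then 2s with the default max_retries=3
            time.sleep(delay_seconds)
Dead Letter Queue (DLQ):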
- Messages that fail after all retries go here
- Alert engineers for manual investigation
- Prevents infinite retry loops
Notification Templates
Templates prevent duplication and ensure consistency:
{
  "template_id": "order_shipped",
  "channels": {
    "push": {
      "title": "Your order is on its way!",
      "body": "Order #{{order_id}} has shipped. ETA: {{eta}}."
    },
    "email": {
      "subject": "Order #{{order_id}} Shipped",
      "html_body": "<h1>Your order is on its way!</h1>..."
    },
    "sms": {
      "body": "Your order #{{order_id}} has shipped. Track it: {{tracking_url}}"
    }
  }
}
Benefits:
- Consistent messaging across channels
- Easy A/B testing
- Non-engineers can update copy without deploys
- Supports localization (i18n)
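Filling the `{{placeholder}}` slots in a template might look like the sketch below. The `render` and `render_channel` helpers are hypothetical; a production system would more likely use an existing template engine, and would need HTML escaping for the email body.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template_text: str, data: dict) -> str:
    """Replace each {{key}} with data[key]; fail loudly on a missing key."""

    def substitute(match):
        key = match.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])

    return PLACEHOLDER.sub(substitute, template_text)


def render_channel(template: dict, channel: str, data: dict) -> dict:
    """Render every text field of one channel of a template document."""
    return {
        name: render(text, data)
        for name, text in template["channels"][channel].items()
    }
```

Raising on a missing variable is a deliberate choice in this sketch: it is usually better for a notification to fail and retry than to reach a user with a literal `{{eta}}` in it.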
User Notification Settings (Opt-out / Do-Not-Disturb)
Settings schema:
CREATE TABLE notification_settings (
    user_id BIGINT,
    channel ENUM('push', 'sms', 'email'),
    notification_type VARCHAR(100), -- 'marketing', 'alerts', 'updates'
    opted_in BOOLEAN DEFAULT true,
    dnd_start TIME, -- e.g., 22:00 (10pm)
    dnd_end TIME, -- e.g., 08:00 (8am)
    timezone VARCHAR(50),
    PRIMARY KEY (user_id, channel, notification_type)
);
Checking preferences before send:
def should_send(user_id, channel, notification_type):
    settings = db.get_settings(user_id, channel, notification_type)
    # Check opt-out
    if not settings.opted_in:
        return False
    # Check Do-Not-Disturb window
    user_time = get_user_local_time(user_id, settings.timezone)
    if is_in_dnd_window(user_time, settings.dnd_start, settings.dnd_end):
        return False  # Or queue for later delivery
    return True
Cache user settings in Redis to avoid a DB lookup on every notification:
Key: notification:settings:{user_id}:{channel}:{notification_type}
Value: { opted_in: true, dnd_start: "22:00", dnd_end: "08:00", timezone: "America/New_York" }
TTL: 300 seconds (5 minutes)
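The Do-Not-Disturb check has one subtlety: a window such as 22:00 to 08:00 crosses midnight, so a plain `start <= now < end` comparison never matches. Below is one possible version of the `is_in_dnd_window` helper used in the preference check, plus a local-time helper whose signature (taking the current UTC time rather than a `user_id`) is an assumption made to keep the sketch testable.

```python
from datetime import datetime, time
from typing import Optional
from zoneinfo import ZoneInfo


def is_in_dnd_window(local_time: time,
                     dnd_start: Optional[time],
                     dnd_end: Optional[time]) -> bool:
    """True if local_time falls inside [dnd_start, dnd_end)."""
    if dnd_start is None or dnd_end is None:
        return False  # No DND window configured
    if dnd_start <= dnd_end:
        # Same-day window, e.g. 13:00 to 14:00
        return dnd_start <= local_time < dnd_end
    # Window wraps past midnight, e.g. 22:00 to 08:00
    return local_time >= dnd_start or local_time < dnd_end


def get_user_local_time(now_utc: datetime, tz_name: str) -> time:
    """Convert an aware UTC datetime to wall-clock time in the user's zone."""
    return now_utc.astimezone(ZoneInfo(tz_name)).time()
```

With a 22:00 to 08:00 window, 23:30 and 07:59 are inside the window while 08:00 and noon are not.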
Rate Limiting Notifications
Problem: Don't overwhelm users. Sending too many notifications causes app uninstalls.
Rate limit per user per channel:
Push notifications: max 10/day per user
SMS: max 5/day per user
Email: max 3/day per user
Marketing emails: max 1/week per user
Implementation:
def check_rate_limit(user_id, channel):
    key = f"notif_rate:{user_id}:{channel}:{today()}"
    count = redis.incr(key)
    redis.expire(key, 86400)  # 1 day TTL
    limits = {"push": 10, "sms": 5, "email": 3}
    return count <= limits[channel]
Notification Tracking (Sent / Delivered / Read)
Tracking table:
CREATE TABLE notification_log (
    notification_id BIGINT PRIMARY KEY,
    user_id BIGINT,
    channel VARCHAR(20),
    template_id VARCHAR(100),
    status ENUM('pending', 'sent', 'delivered', 'failed', 'read'),
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    provider_msg_id VARCHAR(200), -- ID from APNs/FCM/Twilio
    retry_count INT DEFAULT 0
);
How tracking works:
1. SENT: Worker records notification as sent after API call succeeds
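2. DELIVERED: Provider reports delivery where it supports receipts (for example, Twilio status callbacks or SendGrid event webhooks); APNs and FCM mainly confirm that a message was accepted, so push delivery is often confirmed by the client app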
3. READ: Client app calls /notification/read endpoint when user opens it
4. FAILED: Worker records failure after retries exhausted
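Because delivery receipts and read events arrive asynchronously, they can show up late or out of order. One way to keep the `status` column consistent is to allow only forward transitions, as in the sketch below. The transition table is an assumption for illustration (for instance, it lets `sent` jump straight to `read` when no delivery receipt ever arrives); it is not prescribed by the schema.

```python
# Assumed forward-only transitions between the status values in notification_log.
ALLOWED_TRANSITIONS = {
    "pending": {"sent", "failed"},
    "sent": {"delivered", "read", "failed"},
    "delivered": {"read"},
    "read": set(),
    "failed": set(),
}


def next_status(current: str, event: str) -> str:
    """Return the new status, ignoring events that would move it backwards."""
    if event in ALLOWED_TRANSITIONS[current]:
        return event
    return current
```

With this rule, a delivery receipt that arrives after the user has already opened the notification leaves the status at `read`.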
Analytics pipeline:
Notification events → Kafka → Stream processor → Analytics DB
                                               → Dashboard
Security: Authenticating Callers
Only trusted services should trigger notifications:
Service A wants to send notification:
1. Service A has API key / JWT token
2. Notification API validates token
3. Notification API checks: Is Service A allowed to send this type?
4. If yes → enqueue notification
5. If no → 403 Forbidden
API request format:
POST /v1/notify
Authorization: Bearer <service_jwt>
{
"user_id": "123456",
"template_id": "order_shipped",
"channel": ["push", "email"],
"data": {
"order_id": "ORD-789",
"eta": "Tomorrow by 5pm"
}
}
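The authenticate-then-authorize steps above can be sketched as a small request handler. For brevity this uses static API keys rather than JWTs; the key store, the permission table, and the service names are all made up for illustration.

```python
import hmac

# Hypothetical credentials and permissions; a real system would load these
# from a secrets manager or verify a signed JWT instead.
SERVICE_API_KEYS = {"order-service": "example-order-key"}
ALLOWED_TEMPLATES = {"order-service": {"order_shipped", "order_delivered"}}


def authenticate(service: str, presented_key: str) -> bool:
    """Check the caller's key using a constant-time comparison."""
    expected = SERVICE_API_KEYS.get(service)
    return expected is not None and hmac.compare_digest(expected, presented_key)


def handle_notify(service: str, presented_key: str, request: dict) -> int:
    """Return the HTTP status the API would send back."""
    if not authenticate(service, presented_key):
        return 401  # Unknown caller or wrong key
    if request.get("template_id") not in ALLOWED_TEMPLATES.get(service, set()):
        return 403  # Known caller, but not allowed to send this type
    # A real handler would enqueue the notification here.
    return 202  # Accepted for asynchronous delivery
```

Returning 202 rather than 200 signals that the request was queued, not that the notification has already been delivered.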
Device Token Lifecycle Management
Problem: Device tokens can become stale (app uninstalled, device reset).
Token invalidation handling:
1. Worker sends notification to APNs/FCM
2. Provider returns error: "BadDeviceToken" or "NotRegistered"
3. Worker marks token as invalid in DB
4. Future notifications skip invalid tokens
5. When user re-installs app, new token is registered
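APNs response codes:
- 410 (reason "Unregistered"): Device unregistered → Delete token
- 400 (reason "BadDeviceToken"): Bad device token → Mark invalid
FCM error codes (legacy HTTP API names; the HTTP v1 API reports UNREGISTERED and INVALID_ARGUMENT instead):
- NotRegistered → Delete token
- InvalidRegistration → Mark invalid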
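A worker can turn these provider errors into token actions with a simple lookup, as sketched below. The error identifiers are the ones named in this section; `TokenAction`, `action_for_error`, and `apply_action` are hypothetical names, and the dict stands in for the device token table.

```python
from enum import Enum


class TokenAction(Enum):
    KEEP = "keep"
    MARK_INVALID = "mark_invalid"
    DELETE = "delete"


# (provider, error identifier) -> what to do with the token.
# Any error not listed here (timeouts, throttling, ...) leaves the token alone.
ERROR_ACTIONS = {
    ("apns", "Unregistered"): TokenAction.DELETE,          # HTTP 410
    ("apns", "BadDeviceToken"): TokenAction.MARK_INVALID,  # HTTP 400
    ("fcm", "NotRegistered"): TokenAction.DELETE,
    ("fcm", "InvalidRegistration"): TokenAction.MARK_INVALID,
}


def action_for_error(provider: str, error: str) -> TokenAction:
    return ERROR_ACTIONS.get((provider, error), TokenAction.KEEP)


def apply_action(token_status: dict, device_token: str, action: TokenAction) -> None:
    """Update an in-memory {device_token: status} map (a stand-in for the DB)."""
    if action is TokenAction.DELETE:
        token_status.pop(device_token, None)
    elif action is TokenAction.MARK_INVALID:
        token_status[device_token] = "invalid"
```

Defaulting to `KEEP` matters: transient provider errors should go through the retry path rather than cause a valid token to be discarded.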
Token refresh strategy:
-- Mark tokens that haven't been seen in 30 days as stale
-- (assumes device_tokens has a status column)
UPDATE device_tokens
SET status = 'stale'
WHERE last_seen < NOW() - INTERVAL 30 DAY;
Monitoring and Alerting
Key metrics to monitor:
| Metric | Alert Threshold | Tool |
|---|---|---|
| Notification queue depth | > 100K messages | CloudWatch / Datadog |
| Delivery success rate | < 95% | PagerDuty |
| p99 end-to-end latency | > 10 seconds | Grafana |
| DLQ message count | > 0 | PagerDuty |
| Provider error rate | > 1% | Grafana |
Health dashboard:
┌─────────────────────────────────────────┐
│  Notifications Dashboard                │
│                                         │
│  Last 1 hour:                           │
│  Push sent:  1.2M ✅  Delivery: 98.5%   │
│  SMS sent:   50K  ✅  Delivery: 99.1%   │
│  Email sent: 200K ✅  Delivery: 97.8%   │
│                                         │
│  Queue depths:                          │
│  iOS queue:     12K 🟡 (normal)         │
│  Android queue: 8K  ✅ (low)            │
│  SMS queue:     2K  ✅ (low)            │
│  Email queue:   5K  ✅ (low)            │
└─────────────────────────────────────────┘
Handling Third-Party Provider Failures
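Problem: A third-party provider can go down. What do we do?
Strategy: Multiple providers per channel where alternatives exist (SMS, email). APNs and FCM have no drop-in substitute, so those messages wait in the queue until the provider recovers: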
Primary: SendGrid (email)
Fallback: Amazon SES (email)
If SendGrid fails:
1. Worker catches error
2. Increments SendGrid failure counter
3. If failure rate > 5% → Switch to Amazon SES
4. Alert on-call engineer
5. Retry failed messages via SES
Circuit breaker pattern:
Normal       → [Worker] → [SendGrid] → sends OK
Degraded     → [Worker] → [SendGrid fails] → [SES] → fallback
Open circuit → [Worker] → [SES] (skip SendGrid temporarily)
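The three states above can be sketched with a small failure-rate circuit breaker. This is a simplified illustration, not a production implementation: the class and parameter names are hypothetical, counters are never windowed, and after the cooldown the circuit simply closes again instead of going through a separate half-open probe state.

```python
import time


class CircuitBreaker:
    """Opens when the observed failure rate exceeds a threshold."""

    def __init__(self, failure_threshold=0.05, min_calls=20,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._calls = 0
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """True if the primary provider may be tried."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the circuit and start counting afresh.
            self._opened_at = None
            self._calls = 0
            self._failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        self._calls += 1
        if not success:
            self._failures += 1
        rate = self._failures / self._calls
        if self._calls >= self.min_calls and rate > self.failure_threshold:
            self._opened_at = self._clock()


def send_with_fallback(message, primary, fallback, breaker) -> str:
    """Send via primary unless the circuit is open; otherwise use fallback."""
    if breaker.allow():
        try:
            primary(message)
            breaker.record(success=True)
            return "primary"
        except Exception:
            breaker.record(success=False)
    fallback(message)
    return "fallback"
```

Here `primary` and `fallback` would be thin wrappers around the SendGrid and SES clients. The `min_calls` guard keeps a single early failure from tripping the breaker before there is enough traffic to compute a meaningful rate.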
Design Summary
Final Architecture
Notification System
Callers (Services / Schedulers)
         │
         ▼
┌──────────────────┐    ┌────────────────┐  ┌──────────────────┐
│ Notification API │    │ User DB        │  │ Device Token DB  │
│ Servers          │───▶│ (email, phone) │  │ (APNs token,     │
│ (Validate,       │    │                │  │  FCM token)      │
│  Enrich, Route)  │    └────────────────┘  └──────────────────┘
└────────┬─────────┘
         │          ┌──────────────────────────────────────────┐
         │          │ User Settings Cache (Redis)              │
         │          │ Notification Rate Limit (Redis)          │
         │          └──────────────────────────────────────────┘
         ▼
┌──────────────────────────────────────────────────────────────┐
│                 Message Queues (Kafka / SQS)                 │
│  [iOS Queue]   [Android Queue]   [SMS Queue]   [Email Queue] │
└───────┬───────────────┬───────────────┬──────────────┬───────┘
        ▼               ▼               ▼              ▼
  [iOS Workers] [Android Workers] [SMS Workers] [Email Workers]
        │               │               │              │
        ▼               ▼               ▼              ▼
     [APNs]           [FCM]         [Twilio]      [SendGrid]
                                     [Nexmo]       [AWS SES]
        │               │               │              │
        ▼               ▼               ▼              ▼
   iOS Device    Android Device       Phone          Inbox
┌──────────────────────────────────────────────────────────────┐
│               Notification Log DB (Analytics)                │
│               Template DB | Settings DB | DLQ                │
└──────────────────────────────────────────────────────────────┘
Key Design Decisions
| Decision | Choice | Reasoning |
|---|---|---|
| Notification queue | Kafka or SQS | Decouple sender/dispatcher, reliability |
| iOS push | APNs | Apple requirement for iOS push |
| Android push | FCM | Standard push service on Android devices with Google Play services |
| SMS | Twilio/Nexmo | Mature APIs, global reach |
| Email | SendGrid/SES | High deliverability, analytics |
| User settings | MySQL + Redis cache | Persistent + fast lookup |
| Retry logic | Exponential backoff + DLQ | Resilient, no infinite loops |
| Token management | DB + invalidation on error | Handle stale tokens gracefully |
Interview Questions & Answers
Q: How do you ensure no notification is lost?
A: Use durable message queues (Kafka with replication factor 3, or SQS). Messages are only ACKed after successful delivery to the provider. On failure, messages are retried with exponential backoff. After max retries, messages go to a Dead Letter Queue for manual investigation. This gives at-least-once delivery semantics.
Q: How do you handle a user who uninstalls the app?
A: The device token becomes invalid. On the next send attempt, APNs returns error code 410 (Unregistered) or FCM returns NotRegistered. The worker catches this, marks the token as invalid in the device token DB, and skips it for future notifications. The token is re-registered if the user reinstalls the app.
Q: How do you respect user Do-Not-Disturb settings?
A: Before routing a notification to the queue, the Notification API checks user settings from the Redis cache (backed by MySQL). If the user's current local time falls within their DND window, the notification can either be dropped (if non-urgent) or queued with a delayed delivery timestamp for after the DND window ends.
Q: How would you scale the email notification system to handle 5M emails/day?
A: 5M/day = ~58 emails/second on average. That is well within what a provider such as SendGrid can accept (exact limits depend on the account plan), so a few email worker instances suffice. For higher scale: partition the email queue by user_id, run multiple worker pools, use SendGrid's batch send API, and stagger sending to avoid hitting per-second provider limits. Monitor delivery rates and bounce rates continuously.
Q: What happens if FCM goes down?
A: Android push notifications back up in the Android queue. Kafka retains messages durably. When FCM recovers, workers process the backlog. For critical alerts, implement a fallback to SMS. Monitor FCM health with a circuit breaker: if the error rate exceeds a threshold, alert on-call and consider holding non-urgent notifications until FCM recovers.
Key Takeaways
- Four notification channels each use a dedicated third-party provider: APNs (iOS), FCM (Android), Twilio/Nexmo (SMS), SendGrid/Mailchimp (email)
- Message queues are critical for reliability: they decouple sending from dispatching and absorb traffic spikes
- At-least-once delivery via queue retry + DLQ, so notifications are not silently lost
- Device tokens are ephemeral: always handle invalid token errors from APNs and FCM gracefully
- User preferences (opt-out, DND, channel settings) must be checked before every notification, cached in Redis for speed
- Rate limit notifications per user to avoid overwhelming users and causing app uninstalls
- Notification tracking (sent/delivered/read) is essential for analytics, debugging, and product iteration
Related Resources
- distributed-system-components - Message queues, event-driven architecture
- key-patterns - Retry with exponential backoff, circuit breaker
- ch04-rate-limiter - Rate limiting pattern applied to notifications
- ch06-key-value-store - Redis for caching user settings
Last Updated: 2026-04-13
Status: Interview ready. Know APNs vs FCM, queue-based reliability, and the DLQ pattern!