Chapter 10 Flashcards - Notification System

flashcards volume1 notifications push ios android email

What are the four types of notifications and which third-party provider handles each?
?
iOS Push: Apple Push Notification Service (APNs). Android Push: Firebase Cloud Messaging (FCM). SMS: Twilio or Nexmo (REST API). Email: SendGrid, Mailchimp, or AWS SES (SMTP/REST). Each channel has its own queue, worker pool, and provider integration.

What is APNs and how does iOS push notification delivery work end to end?
?
APNs = Apple Push Notification Service, the only way to send push notifications to iOS devices. Flow: 1) App registers with APNs, 2) APNs returns a device token (unique per device per app), 3) App sends token to our backend, 4) We store token in device_tokens DB, 5) To send notification: our worker → APNs → iOS device. Requires an APNs credential from an Apple Developer account: either a TLS certificate or a token-signing key (.p8).

What is FCM and how does Android push notification delivery work?
?
FCM = Firebase Cloud Messaging (replaced legacy Google Cloud Messaging). Flow: 1) App registers with FCM, 2) FCM returns a registration token, 3) App sends token to our backend, 4) We store token in DB, 5) To send: our worker → FCM HTTP API → Android device. FCM supports both notification messages (display) and data messages (background processing).

What is a device token and why must it be managed carefully?
?
A device token (APNs) or registration token (FCM) is a unique identifier for a specific app on a specific device, assigned by the platform (Apple/Google). Must manage carefully because tokens can change or become invalid when the app is uninstalled or reinstalled, the OS is reinstalled, or the user resets or restores the device. On send failure, APNs returns 410 (Unregistered) and FCM returns NotRegistered (legacy API) or UNREGISTERED (HTTP v1 API). The worker must mark the token invalid and skip it going forward.

Why are message queues essential in a notification system?
?
Queues (Kafka/SQS) solve four problems: 1) Decoupling — sender services don’t know about providers, 2) Reliability — if provider is down, messages stay in queue (no data loss), 3) Burst absorption — queue buffers traffic spikes so workers process at steady rate, 4) Independent scaling — each channel (iOS/Android/SMS/email) has its own queue and worker pool that scales independently.

What is the at-least-once delivery guarantee and how is it implemented?
?
At-least-once means every notification is delivered at least one time (may duplicate, never drop). Implementation: 1) Worker dequeues message, 2) Sends to provider, 3) Only ACKs queue on success, 4) On failure: requeue with backoff, 5) After max retries: send to Dead Letter Queue (DLQ). The DLQ acts as a safety net: it alerts engineers for manual investigation. Idempotency keys let workers detect a message they have already delivered, so a retry does not show the user the same notification twice.
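
Sketch of the ACK-on-success loop above, using an in-memory deque as a stand-in for the real queue. Everything here is an assumption for illustration: the `Message` fields, the `drain` function, the `MAX_RETRIES` value, and the idea that `send` raises on provider failure. The backoff wait between retries is left out to keep it short.

```python
from collections import deque
from dataclasses import dataclass

MAX_RETRIES = 3  # assumed retry budget


@dataclass
class Message:
    idempotency_key: str  # hypothetical field used to detect duplicates
    payload: str
    attempts: int = 0


def drain(queue, dlq, send, delivered_keys):
    """Process every queued message, acknowledging only after a successful send."""
    while queue:
        msg = queue.popleft()
        if msg.idempotency_key in delivered_keys:
            continue  # already delivered once: drop the duplicate
        try:
            send(msg)  # provider call; assumed to raise on failure
            delivered_keys.add(msg.idempotency_key)  # success: message is ACKed
        except Exception:
            msg.attempts += 1
            if msg.attempts >= MAX_RETRIES:
                dlq.append(msg)  # retries exhausted: park for investigation
            else:
                queue.append(msg)  # requeue; a real worker would also back off
```

With a real broker, "ACK" would be an offset commit (Kafka) or a message delete (SQS) rather than simply not requeueing.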

Explain the retry with exponential backoff pattern for notification delivery.
?
On provider failure, don’t retry immediately: wait longer each time to avoid overwhelming a struggling provider. Example backoff: attempt 1 → wait 1s, attempt 2 → wait 2s, attempt 3 → wait 4s. Formula: delay = 2^(attempt − 1) seconds. After max_retries (e.g., 3), move message to Dead Letter Queue (DLQ) and alert on-call. Combined with random jitter, this helps prevent a thundering herd when the provider recovers.
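
Sketch of a delay function that reproduces the 1s, 2s, 4s schedule in the example. The `cap` and `jitter` parameters are additions not stated on the card; the function name is hypothetical.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = False) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    delay = min(cap, base * 2 ** (attempt - 1))  # 1, 2, 4, 8, ... capped
    if jitter:
        # Assumed "full jitter" variant: pick a random wait up to the delay
        # so many workers do not retry at the same instant.
        delay = random.uniform(0, delay)
    return delay
```

Usage: `[backoff_delay(n) for n in (1, 2, 3)]` gives `[1.0, 2.0, 4.0]`.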

What is a Dead Letter Queue (DLQ) and why is it important?
?
DLQ is a special queue where messages go after exhausting all retry attempts. Purpose: prevents messages from looping forever, captures failures for investigation, and enables replay once the root cause is fixed. A DLQ with any messages should trigger an alert to on-call engineers. Contents: original message + error details + retry count.

How do notification templates work and why use them?
?
Templates are reusable message blueprints stored in DB, with placeholders for dynamic data. Example: title=“Your order {{order_id}} has shipped”, body=“ETA: {{eta}}”. Benefits: 1) Consistent messaging across channels, 2) Non-engineers update copy without code deploys, 3) A/B test message variants, 4) Support localization (i18n) with locale-specific templates, 5) One template can have push/SMS/email variants. Stored in Template DB, cached in Redis.
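
Sketch of placeholder substitution for the `{{name}}` syntax in the example. The `render` function and the choice to fail on a missing variable are assumptions; a production system would more likely use an existing template engine.

```python
import re

# Matches {{ name }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Fill every {{placeholder}} in `template` from `data`."""
    def substitute(match):
        key = match.group(1)
        if key not in data:
            # Fail loudly rather than send a message with a hole in it.
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])
    return PLACEHOLDER.sub(substitute, template)
```

Usage: `render("Your order {{order_id}} has shipped", {"order_id": "A123"})` returns `"Your order A123 has shipped"`.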

How do you implement user opt-out from notifications?
?
Store opt-in status in notification_settings table: (user_id, channel, notification_type, opted_in). Before routing to queue, check: if opted_in == false, drop the notification silently. Cache settings in Redis (TTL 5 min) to avoid DB lookup on every notification. Support granular opt-out: user can opt out of marketing emails but keep transactional emails and push alerts.

What is a Do-Not-Disturb (DND) window and how is it enforced?
?
DND is a time range during which a user does not want notifications (e.g., 10pm–8am). Stored in notification_settings: (dnd_start, dnd_end, timezone). Enforcement: convert current UTC time to user’s local time, check if within DND window. Options: 1) Drop notification (non-urgent), 2) Queue with delayed delivery timestamp for after DND ends (urgent), 3) Always deliver system alerts (security, fraud) regardless of DND.
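
Sketch of the DND check, including the case where the window wraps past midnight (10pm–8am). The function name is hypothetical and `now_utc` is assumed to be a timezone-aware datetime. Any `tzinfo` works; in practice an IANA zone such as `zoneinfo.ZoneInfo("America/New_York")` would be passed so DST is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_dnd(now_utc: datetime, dnd_start: time, dnd_end: time, tz: tzinfo) -> bool:
    """True if the user's local time falls inside [dnd_start, dnd_end)."""
    local = now_utc.astimezone(tz).time()
    if dnd_start <= dnd_end:
        # Window within one day, e.g. 13:00-15:00.
        return dnd_start <= local < dnd_end
    # Window wraps midnight, e.g. 22:00-08:00.
    return local >= dnd_start or local < dnd_end
```

Usage: for a user at UTC-5 with DND 22:00–08:00, 04:30 UTC is 23:30 local, so `in_dnd` returns `True`; 17:00 UTC is 12:00 local, so it returns `False`.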

How do you rate limit notifications per user?
?
Use Redis counters with daily TTL. Key: notif_rate:{user_id}:{channel}:{date}. On each notification: INCR key; if the result is 1 (the key was just created), set TTL=86400 (1 day), so later increments do not keep extending the window. If count > limit, drop or delay the notification. Typical limits: push 10/day, SMS 5/day, email 3/day, marketing email 1/week (a weekly limit needs its own key and TTL, e.g., keyed by week instead of date). This prevents notification fatigue and reduces app uninstalls. Limits are configurable per channel and notification type.
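
Sketch of the daily counter. `InMemoryCounterStore` is a hypothetical stand-in so the example runs without a server; it exposes only `incr` and `expire`, which are also the names of the corresponding redis-py client methods. The `allow` function and the `DAILY_LIMITS` dict are illustrative.

```python
from datetime import date

DAILY_LIMITS = {"push": 10, "sms": 5, "email": 3}  # per-day limits from the card


class InMemoryCounterStore:
    """Stand-in for Redis: supports only incr() and expire()."""

    def __init__(self):
        self.counts = {}
        self.ttls = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds  # recorded only; nothing actually expires here


def allow(store, user_id, channel, today: date) -> bool:
    """Count this notification and report whether it is within the daily limit."""
    key = f"notif_rate:{user_id}:{channel}:{today.isoformat()}"
    count = store.incr(key)
    if count == 1:
        store.expire(key, 86400)  # set the TTL once, when the key is created
    return count <= DAILY_LIMITS[channel]
```

Against real Redis the INCR and EXPIRE are two separate commands, so a crash between them could leave a key without a TTL; a pipeline or Lua script is one way to close that gap.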

What notification tracking states should you track and how?
?
Track five states in notification_log table: PENDING (enqueued), SENT (worker called provider API successfully), DELIVERED (delivery confirmed, either by a provider delivery report where the channel offers one, such as SMS and email webhooks, or by an acknowledgement from the client app; for push, a successful APNs/FCM response only means the provider accepted the message), READ (client app calls /notification/read endpoint when user taps notification), and FAILED (after retries exhausted). Store provider_msg_id for debugging. Feed events to analytics pipeline (Kafka → analytics DB).

What security measures protect the notification API?
?
1) Authentication: Only trusted internal services can call the Notification API; require service JWTs or API keys. 2) Authorization: Verify service is allowed to send specific notification types (order service can’t send auth alerts). 3) Input validation: Sanitize template data to prevent injection. 4) APNs certificate: Store securely, rotate periodically. 5) FCM server key: Keep secret, use Firebase service account credentials. 6) Audit logging: Log all notification API calls with caller identity.
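
Sketch of the authorization step as an allowlist lookup. The service names, notification type names, and the `authorize` function are all hypothetical; the point is only that an unknown caller or an unlisted type is rejected by default.

```python
# Hypothetical allowlist: which notification types each caller may send.
ALLOWED_TYPES = {
    "order-service": {"order_update"},
    "auth-service": {"security_alert", "password_reset"},
}


def authorize(caller: str, notification_type: str) -> bool:
    """Deny by default: unknown callers and unlisted types are rejected."""
    return notification_type in ALLOWED_TYPES.get(caller, set())
```

Usage: `authorize("order-service", "security_alert")` returns `False`, which matches the "order service can’t send auth alerts" rule.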

How does the notification system architecture handle scale for 10M push/day?
?
10M push/day = ~116 push/second (average), with peaks potentially 10x higher (~1,160/second). APNs and FCM support thousands of requests/second each. Solution: 1) Multiple iOS/Android worker instances in parallel, 2) Kafka partitioned by user_id (workers process their partition), 3) Horizontal scaling of workers based on queue depth, 4) Batch API calls where providers support it, 5) Monitor queue depth and auto-scale workers with CloudWatch/Datadog.
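
The arithmetic above, worked through. The per-worker throughput of 200 sends/second is an assumed figure for illustration, not a measured one; function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average notifications per second for a daily volume."""
    return per_day / SECONDS_PER_DAY


def workers_needed(peak_rate: float, per_worker_rate: float) -> int:
    """Smallest whole number of workers that covers the peak rate."""
    return math.ceil(peak_rate / per_worker_rate)


avg = average_rate(10_000_000)  # about 115.7 per second
peak = 10 * avg                 # about 1,157 per second at a 10x peak
print(round(avg, 1), round(peak), workers_needed(peak, per_worker_rate=200))
# prints: 115.7 1157 6
```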

What data is stored in the device_tokens table and why?
?
Columns: token_id, user_id, device_token (APNs/FCM token string), platform (ios/android), last_seen (timestamp), status (active/invalid/stale). Why: user_id maps person to their devices, device_token is needed to address the notification, platform determines which provider to use, last_seen helps detect stale tokens (not seen in 30+ days → mark stale), status avoids sending to known-invalid tokens (saves provider API calls).

How do you handle multiple devices per user?
?
A user can have multiple devices (iPhone, iPad, Android tablet). Store all tokens in device_tokens table with user_id as foreign key. When sending a notification: query all active tokens for user_id, send to each device in parallel using appropriate provider (APNs for ios, FCM for android). Handle partial failures independently (one device token might be invalid; others still get the notification). Track delivery per token in notification_log.
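
Sketch of the per-user fan-out with independent handling of partial failures. The `DeviceToken` fields follow the device_tokens columns on these cards; `InvalidToken`, `fan_out`, and the `senders` mapping are hypothetical. Sends run sequentially here for brevity, whereas the card describes sending in parallel.

```python
from dataclasses import dataclass


@dataclass
class DeviceToken:
    user_id: int
    device_token: str
    platform: str          # "ios" or "android"
    status: str = "active"  # active / invalid / stale


class InvalidToken(Exception):
    """Raised by a sender when the provider reports the token as unregistered."""


def fan_out(tokens, user_id, senders, payload):
    """Send to every active token of one user; return a per-token result."""
    results = {}
    for t in tokens:
        if t.user_id != user_id or t.status != "active":
            continue  # other users' devices and known-bad tokens are skipped
        try:
            senders[t.platform](t.device_token, payload)  # APNs for ios, FCM for android
            results[t.device_token] = "SENT"
        except InvalidToken:
            t.status = "invalid"  # do not send to this token again
            results[t.device_token] = "FAILED"
    return results
```

One bad token only affects its own entry in the result; the other devices still receive the notification.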

What happens when the email provider (SendGrid) goes down?
?
Use a fallback provider strategy: 1) Primary provider SendGrid fails, 2) Worker catches error, 3) Track SendGrid failure rate in Redis counter, 4) If failure rate > threshold (e.g., 5%), activate circuit breaker — route new requests to fallback (AWS SES), 5) Alert on-call engineers, 6) Messages already in email queue are retried — after retries exhaust, go to DLQ. When SendGrid recovers, close circuit breaker and gradually restore traffic. Always have at least two email providers.
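
Sketch of the failure-rate circuit breaker. Class and function names are hypothetical, counters live in memory instead of Redis, and `min_calls` is an added assumption so a single early failure does not trip the breaker. There is no half-open state; `reset` stands in for "close the breaker when the primary recovers".

```python
class CircuitBreaker:
    def __init__(self, threshold: float = 0.05, min_calls: int = 20):
        self.threshold = threshold  # e.g. 5% failure rate
        self.min_calls = min_calls  # assumed minimum sample before tripping
        self.calls = 0
        self.failures = 0
        self.open = False

    def record(self, success: bool) -> None:
        self.calls += 1
        if not success:
            self.failures += 1
        if self.calls >= self.min_calls and self.failures / self.calls > self.threshold:
            self.open = True  # route new requests to the fallback

    def reset(self) -> None:
        self.calls = self.failures = 0
        self.open = False


def send_email(breaker, primary, fallback, message):
    """Use the primary provider unless the breaker is open."""
    if breaker.open:
        return fallback(message)
    try:
        result = primary(message)
    except Exception:
        breaker.record(False)
        raise  # let the queue's retry/DLQ logic handle this message
    breaker.record(True)
    return result
```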

What is the difference between a notification message and a data message in FCM?
?
Notification message (display message): FCM shows the notification in system tray automatically when app is in background, and the app gets a callback only when user taps. When the app is in foreground, the message is passed to onMessageReceived instead and is not shown automatically. Simple to use. Data message: FCM delivers a data payload to the app’s onMessageReceived handler and the app processes it programmatically. Works when app is foreground or background. Use data messages when you need custom handling, analytics tracking, or to update app state silently without showing a notification.
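
Sketch of the two payload shapes, assuming the FCM HTTP v1 request body layout (a top-level `message` object with `token` plus `notification` and/or `data`). The helper names are hypothetical. Data values are converted to strings because FCM data payloads are string key/value pairs.

```python
def notification_message(token: str, title: str, body: str) -> dict:
    """Display message: FCM renders it in the system tray when the app is in background."""
    return {"message": {"token": token,
                        "notification": {"title": title, "body": body}}}


def data_message(token: str, data: dict) -> dict:
    """Data message: handed to the app's own handler for custom processing."""
    return {"message": {"token": token,
                        "data": {key: str(value) for key, value in data.items()}}}
```

Usage: `data_message("tok", {"order_id": 42})` yields `{"message": {"token": "tok", "data": {"order_id": "42"}}}`.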

How do you implement notification analytics to measure open rate and click rate?
?
1) Sent: Log to notification_log when worker calls provider API. 2) Delivered: Record delivery confirmations where they are available: provider delivery reports (e.g., SMS and email webhooks) or an acknowledgement sent by the client app on receipt. For push, a successful APNs/FCM response only confirms the provider accepted the message. 3) Opened/Read: Client app calls POST /notification/{id}/read when user taps. 4) Email clicks: Use tracked redirect URLs (click → tracking server → destination URL). 5) Feed all events to Kafka → stream processor → analytics DB. 6) Build dashboards: open rate = reads/delivered, click rate = clicks/delivered, per template, per channel, per user segment.
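
Sketch of the two dashboard formulas (open rate = reads/delivered, click rate = clicks/delivered) over a list of events. The event shape, a dict with a `type` key, and the function name are assumptions.

```python
from collections import Counter


def engagement_rates(events) -> dict:
    """Compute open and click rates relative to delivered notifications."""
    counts = Counter(event["type"] for event in events)
    delivered = counts["delivered"]
    if delivered == 0:
        return {"open_rate": 0.0, "click_rate": 0.0}  # avoid dividing by zero
    return {
        "open_rate": counts["read"] / delivered,
        "click_rate": counts["click"] / delivered,
    }
```

Usage: 4 delivered, 2 read, and 1 click events give an open rate of 0.5 and a click rate of 0.25. Grouping the events by template, channel, or segment before calling the function gives the per-dimension breakdowns.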

What metrics should you monitor for a notification system?
?
Queue health: Queue depth per channel (alert if > 100K), consumer lag. Delivery: Success rate per provider (alert if < 95%), p99 end-to-end latency (alert if > 10s). Errors: DLQ message count (alert on any), provider error rate (alert if > 1%), invalid token rate. Business: Notification open rate, opt-out rate (rising opt-outs = too many notifications), uninstall rate. Infrastructure: Worker CPU/memory, Redis hit rate for settings cache.

How do you handle timezone-aware notification delivery (e.g., send at 9am local time)?
?
Store user timezone in notification_settings table. For scheduled notifications: 1) Service requests notification for “9am user local time”, 2) Notification API converts to UTC based on user timezone, 3) Enqueue message with delivery_at = UTC timestamp, 4) Workers check delivery_at before sending: if it is in the future, skip the message, or use a delayed queue instead (SQS message delay covers up to 15 minutes; Kafka has no built-in delayed delivery, so longer delays need a scheduler or a store of pending notifications), 5) Cron job sweeps pending notifications whose delivery_at has passed and moves them to active queue.
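
Sketch of steps 2 and 4: converting a local wall-clock time to a UTC delivery_at, and checking whether a message is due. Function names are hypothetical. Any `tzinfo` works; passing an IANA zone such as `zoneinfo.ZoneInfo("Asia/Kolkata")` is the usual way to get DST-aware conversion.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def delivery_at_utc(local_day: date, local_time: time, tz: tzinfo) -> datetime:
    """UTC instant at which it is `local_time` on `local_day` in the user's zone."""
    local_dt = datetime.combine(local_day, local_time, tzinfo=tz)
    return local_dt.astimezone(timezone.utc)


def is_due(delivery_at: datetime, now_utc: datetime) -> bool:
    """Workers send only when the scheduled instant has passed."""
    return delivery_at <= now_utc
```

Usage: for a user at UTC+5:30, 9am local on 2026-04-13 converts to 03:30 UTC on the same day.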

What is the difference between transactional and marketing notifications?
?
Transactional: Triggered by user action (order confirmation, password reset, bank alert). Time-sensitive, high priority, users always want these, bypass DND (or minimal DND respect), never rate-limited. Marketing: Promotional messages (sale alerts, re-engagement campaigns). Lower priority, respect all user preferences (DND, opt-out, rate limits), sent in batches via campaign scheduler. Store notification_type on each notification and in the notification_settings table so the appropriate rules can be applied to each type.

How do you prevent notification storms when a major event triggers millions of notifications at once?
?
Problem: Black Friday sale triggers 10M notifications simultaneously → overwhelms queues and providers. Solutions: 1) Rate limit outbound: Throttle workers to send at max N notifications/second to each provider, 2) Batch with staggering: Split recipients into batches and send over a window (e.g., 10M over 1 hour = 2,800/second), 3) Priority queues: High-priority notifications go first (transactional before marketing), 4) Queue capacity planning: Pre-scale workers before known events (Black Friday, product launches).
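
Sketch of batch staggering (solution 2): split recipients into fixed-size batches and give each batch a send offset. For the example above, 10,000,000 / 3,600 s is about 2,778 per second, so batches of 2,778 at 1-second intervals would spread the campaign over roughly an hour. The `stagger` function and its parameters are hypothetical.

```python
def stagger(recipients, batch_size: int, interval_seconds: float):
    """Yield (send_offset_seconds, batch) pairs covering all recipients in order."""
    for start in range(0, len(recipients), batch_size):
        batch_index = start // batch_size
        yield batch_index * interval_seconds, recipients[start:start + batch_size]
```

Usage: `list(stagger(list(range(10)), 4, 1.0))` gives three batches at offsets 0.0, 1.0, and 2.0 seconds, the last one holding the remaining two recipients.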

What are the key differences between Kafka and SQS for notification queues, and which do you choose?
?
Kafka: Durable log, replay messages, ordered within partition, high throughput, consumer groups, requires more ops. SQS: Managed AWS service, simpler ops, auto-scales, visibility timeout for at-least-once, FIFO queues available, limited retention (14 days). Choose Kafka if: need message replay (audit, reprocess), very high throughput, stream processing (aggregate events). Choose SQS if: on AWS, want low operational overhead, standard queue semantics are enough. Both provide at-least-once delivery.

Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH - Common interview question!
Last Updated: 2026-04-13

Study Notes by Niladri & AI
