Chapter 12: Design a Chat System

Status: 🟩 Interview ready - Very common question!
Difficulty: Hard
Time to complete: 50 min read + practice


Overview

A chat system enables real-time text communication between individuals and groups. Modern examples include WhatsApp (2B+ users), Facebook Messenger, Slack, and WeChat. This design covers the full spectrum: 1-on-1 messaging, group chats, online presence, and offline delivery.

Why this matters:

  • Very common interview question at top tech companies
  • Combines WebSocket, message queues, NoSQL storage, and presence systems
  • Teaches real-time bidirectional communication at scale
  • Hard difficulty due to many moving parts (ordering, delivery, presence)

Problem Statement

Design a chat system that:

  • Supports real-time 1-on-1 and group messaging
  • Delivers messages to offline users when they come back online
  • Shows online/offline presence indicators
  • Works on both mobile and web (multi-device)
  • Sends push notifications when user is offline
  • Stores chat history persistently

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

  • 1-on-1 only or also group chat? → Both, max 100 members per group
  • Scale? → 50M DAU
  • Mobile or web? → Both (iOS, Android, browser)
  • Message volume? → Assume avg 40 messages/day per user
  • Message history? → Persist messages (users can scroll back)
  • Delivery receipts? → Optional (sent/delivered/read)
  • Online presence? → Yes, show who is online
  • Push notifications? → Yes, for offline users
  • End-to-end encryption? → Out of scope for core design, mention as extension

Scope:

  • Real-time 1-on-1 and group messaging (up to 100/group)
  • Online presence indicators
  • Push notifications for offline users
  • Multi-device sync (phone + desktop)
  • Persistent message storage

Non-Functional Requirements

  • Low latency: Messages delivered in < 100ms (real-time feel)
  • Consistency: Message ordering must be preserved within a conversation
  • Availability: 99.99% uptime (users expect chat to always work)
  • Durability: Messages must not be lost even if server crashes
  • Scale: 50M DAU, handle ~2B messages/day (40 msg × 50M users)

Capacity Estimation

DAU: 50M
Messages per user per day: ~40
Total messages/day: 50M × 40 = 2B messages/day
Messages per second: 2B / 86,400 ≈ 23,000 msg/sec on average (peak at 2-3x ≈ 46K-70K msg/sec)

Storage per day (avg 100 bytes of text per message): 2B × 100 bytes = 200 GB/day
10-year retention: 200 GB × 365 × 10 ≈ 730 TB

WebSocket connections: up to 50M concurrent (worst case: every daily active user connected at once)
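
The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the constant names are illustrative, and the 2x and 3x peak factors are the assumed range from the estimate, not measured values.

```python
# Back-of-the-envelope capacity numbers from the estimates above.
DAU = 50_000_000                  # daily active users
MSGS_PER_USER_PER_DAY = 40        # assumed average
BYTES_PER_MSG = 100               # assumed average text payload
SECONDS_PER_DAY = 86_400
RETENTION_YEARS = 10

messages_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_low, peak_high = 2 * avg_msgs_per_sec, 3 * avg_msgs_per_sec

storage_per_day_gb = messages_per_day * BYTES_PER_MSG / 1e9
storage_total_tb = storage_per_day_gb * 365 * RETENTION_YEARS / 1e3

print(f"messages/day:  {messages_per_day:,}")
print(f"avg msg/sec:   {avg_msgs_per_sec:,.0f}")
print(f"peak msg/sec:  {peak_low:,.0f} to {peak_high:,.0f}")
print(f"storage/day:   {storage_per_day_gb:,.0f} GB")
print(f"10-year total: {storage_total_tb:,.0f} TB")
```

Note that these figures cover raw text only; replication and metadata are not part of the estimate.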

Step 2: High-Level Design (15 min)

Client-Server Communication Options

The core challenge: How do clients receive messages in real-time?

Option 1: HTTP Polling ❌ (Bad!)

Client → HTTP GET /messages  (every N seconds)
Server → Return new messages (or empty response)
Client → HTTP GET /messages  (N seconds later)
...repeat forever

Problems:

  • Client wastes resources asking when there’s nothing new
  • High server load (millions of empty responses)
  • Not truly real-time (delay = polling interval)
  • Wastes bandwidth (headers on every request)

Option 2: Long Polling 🟡 (Better, but still flawed)

Client → HTTP GET /messages  (opens connection, waits)
Server → (holds connection open until new message arrives)
Server → Returns message to client
Client → Immediately opens new long-poll connection
...repeat

Better because: Server only responds when there’s data.

Still has problems:

  • No true server push: the client must issue a new request after each response
  • Each message requires a new HTTP request (header and round-trip overhead)
  • Hard to tell if client is still connected
  • Doesn’t work well with multiple servers (message on server 1, client on server 2)

Option 3: WebSocket ✅✅ (Best for Chat!)

Client ──── HTTP Upgrade (handshake) ────→ Server
Client ←──────── Persistent Bidirectional ────── Server
Client ←── Server pushes message instantly ────  Server

Why WebSocket wins:

  • Bidirectional: Both client and server can send messages
  • Persistent connection: No reconnect overhead
  • Low latency: No HTTP headers on every message
  • Real push: Server pushes to client (not client polling)
  • Widely adopted: Used by Slack, Discord, and WhatsApp Web
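
To make the "HTTP Upgrade (handshake)" step concrete, the sketch below computes the `Sec-WebSocket-Accept` header a server returns during the upgrade, as defined in RFC 6455. The function name `websocket_accept` is an illustrative choice; a real server would normally rely on a WebSocket library rather than hand-rolling this.

```python
# Sketch of the server side of the WebSocket opening handshake (RFC 6455).
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing the accept value.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Derive Sec-WebSocket-Accept from the client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Sample key taken from the RFC's own example.
client_key = "dGhlIHNhbXBsZSBub25jZQ=="
response_headers = {
    "Upgrade": "websocket",
    "Connection": "Upgrade",
    "Sec-WebSocket-Accept": websocket_accept(client_key),
}
print(response_headers)
```

Once the client verifies this header, the same TCP connection carries WebSocket frames in both directions with no further HTTP headers.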

WebSocket vs HTTP:

Aspect      | HTTP                                 | WebSocket
------------+--------------------------------------+--------------------------------
Direction   | Client-initiated request/response    | Bidirectional
Connection  | New request per exchange             | Persistent (long-lived)
Overhead    | Full HTTP headers each time          | Minimal framing after handshake
Server push | Not native (needs polling or SSE)    | Native support
Use case    | Login, REST APIs, file upload        | Real-time chat, gaming, feeds

Design decision: Use WebSocket for messaging, HTTP for everything else (login, signup, media upload, profile updates).

Stateless (HTTP/REST):           Stateful (WebSocket):
- Login / signup                 - Send/receive messages
- User profile                   - Online presence
- Group management               - Typing indicators
- Media upload (CDN)             - Message delivery receipts
- Push notification config

Three Core Service Types

┌────────────────────────────────────────────────────────────────┐
│                         Clients                                │
│         (iOS / Android / Browser)                              │
└────────────┬───────────────────────────────────────────────────┘
             │  WebSocket (persistent, for messages)
             │  HTTP (stateless, for everything else)
             ▼
┌────────────────────────────────────────────────────────────────┐
│                       Load Balancer                            │
└────────┬─────────────────────┬──────────────────┬──────────────┘
         ▼                     ▼                  ▼
┌─────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  Chat Servers   │  │ Presence Servers │  │   API Servers    │
│  (WebSocket)    │  │  (heartbeat,     │  │  (HTTP/REST)     │
│  - Send/receive │  │   online status) │  │  - Login/signup  │
│  - Message sync │  │                  │  │  - Group mgmt    │
│  - Route msgs   │  │                  │  │  - User profile  │
└─────────┬───────┘  └────────┬─────────┘  └──────────────────┘
          │                   │
          ▼                   ▼
┌─────────────────────────────────────────┐
│            Message Queue                │
│         (Kafka / RabbitMQ)              │
│    - Decouple sender from receiver      │
│    - Buffer messages for offline users  │
└─────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────┐
│           Data Storage Layer            │
│  - Key-value store: chat history        │
│    (HBase / Cassandra / DynamoDB)       │
│  - Relational DB: users, groups         │
│    (MySQL / PostgreSQL)                 │
│  - Cache: recent messages, sessions     │
│    (Redis)                              │
└─────────────────────────────────────────┘

Storage Choices

Why two different storage systems?

Relational DB (MySQL/PostgreSQL) for:

  • User profiles (user_id, name, phone, email)
  • Friend/contact lists (user_id, friend_id)
  • Group metadata (group_id, name, members, admin)
  • These are structured, low-volume, need joins and ACID

Key-Value Store (HBase/Cassandra) for:

  • Chat message history
  • Why NOT relational DB for messages?
    • Indexes degrade as table grows to billions of rows
    • Read/write patterns are append-heavy (not update-heavy)
    • Need to scale horizontally across many nodes
    • Access is by (conversation_id, time range), a pattern that key-value and wide-column stores handle well
  • HBase supports: fast writes, ordered range scans (by time), horizontal scaling
  • Cassandra: similar benefits with a masterless design; Discord used it for message storage before moving to ScyllaDB

Redis Cache for:

  • Active WebSocket sessions (user_id → server_id mapping)
  • Recent messages (reduce DB load for latest N messages)
  • Online presence state

Message queue (Kafka) for:

  • Decoupling chat servers from message delivery
  • Buffering messages for offline users
  • Fan-out to multiple recipients (group chat)
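  • Each user has a dedicated inbox queue; with Kafka this is usually a partition keyed by user_id rather than a topic per user, since tens of millions of topics is impractical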
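
One way to read the "queue partition" option is to hash each user ID onto a fixed set of partitions, so that all of a user's inbox messages land on the same partition and keep their relative order. The sketch below is illustrative only: `NUM_PARTITIONS` and `inbox_partition` are assumed names, and CRC32 stands in for whatever hash the queue client actually uses.

```python
# Sketch: map each user's inbox to a stable partition by hashing the user ID.
import zlib

NUM_PARTITIONS = 64   # assumed partition count for the inbox topic

def inbox_partition(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Return the partition that holds this user's inbox messages."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

users = ["user_a", "user_b", "user_c"]
assignments = {user: inbox_partition(user) for user in users}
print(assignments)
```

Because the mapping depends only on the user ID, every chat server computes the same partition without coordination. Changing the partition count remaps users, so it has to be planned for.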

Step 3: Deep Dive (20 min)

Service Discovery with ZooKeeper

Problem: With many chat servers, how do we route each user to the best server?

User connects → which chat server should handle them?

Solution: ZooKeeper (or etcd) as service discovery + assignment:

User A logs in
    │
    ▼
┌──────────────────────────┐
│  ZooKeeper / Service     │
│  Registry                │
│  - Chat Server 1: 12K    │
│    connections           │
│  - Chat Server 2: 9K     │
│    connections           │
│  - Chat Server 3: 6K     │  ← Least loaded
│    connections           │
└───────────────┬──────────┘
                │ Assigns Chat Server 3 to User A
                ▼
        User A → Chat Server 3 (WebSocket)

ZooKeeper responsibilities:

  • Track all chat servers and their load
  • Assign best server to each user on login
  • Update assignment if server goes down (reconnect to new server)
  • Store mapping: user_id → chat_server_id (in Redis too for fast lookup)
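
The assignment logic can be sketched in a few lines. This is a simplified in-memory model, not a ZooKeeper client: the `registry` dict stands in for load data kept in the service registry, `sessions` stands in for the Redis mapping, and both function names are hypothetical.

```python
# Sketch of least-loaded chat server assignment and failover reassignment.
from typing import Dict

def assign_chat_server(registry: Dict[str, int]) -> str:
    """Pick the server with the fewest open connections and count the new one."""
    if not registry:
        raise RuntimeError("no chat servers registered")
    server_id = min(registry, key=registry.get)
    registry[server_id] += 1
    return server_id

def reassign_after_failure(registry: Dict[str, int],
                           sessions: Dict[str, str],
                           failed_server: str) -> None:
    """Remove a failed server and move its users to the remaining servers."""
    registry.pop(failed_server, None)
    for user_id, server_id in list(sessions.items()):
        if server_id == failed_server:
            sessions[user_id] = assign_chat_server(registry)

registry = {"chat-1": 12_000, "chat-2": 9_000, "chat-3": 6_000}
sessions = {"user_a": assign_chat_server(registry)}   # user_id -> chat_server_id
first_assignment = sessions["user_a"]

reassign_after_failure(registry, sessions, "chat-3")
print(first_assignment, "->", sessions["user_a"])
```

In a real deployment the client would reconnect on its own after the failure; the loop here only models the resulting state of the mapping.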

Message Flow: 1-on-1 Chat (Detailed)

User A sends "Hello" to User B:

1. User A → Chat Server 1 (WebSocket message)
2. Chat Server 1 → Gets message_id (from ID generator)
3. Chat Server 1 → Stores message in Key-Value store (HBase)
4. Chat Server 1 → Sends ACK back to User A (message received)

5a. Is User B ONLINE?
    → Chat Server 1 → Find User B's chat server (via Redis: user_b → Chat Server 2)
    → Chat Server 1 → Sends to Chat Server 2 via message queue
    → Chat Server 2 → Pushes message to User B via WebSocket

5b. Is User B OFFLINE?
    → Chat Server 1 → Routes to User B's inbox queue (Kafka)
    → Push Notification Service → Sends push notification to User B's device
    → When User B comes online → Pulls pending messages from queue / DB

Detailed flow diagram:

┌──────┐         ┌─────────┐     ┌──────────┐     ┌──────────┐     ┌──────┐
│UserA │         │Chat Srv1│     │  Redis   │     │Chat Srv2 │     │UserB │
└──┬───┘         └────┬────┘     └────┬─────┘     └────┬─────┘     └──┬───┘
   │  "Hello" (WS)    │               │                │              │
   │──────────────────►               │                │              │
   │                  │ get_msg_id    │                │              │
   │                  │──────────────►│                │              │
   │                  │  msg_id=42    │                │              │
   │                  │◄──────────────│                │              │
   │                  │ store(42,msg) │                │              │
   │                  │──────────────►│  (HBase)       │              │
   │   ACK (msg_id=42)│               │                │              │
   │◄──────────────────               │                │              │
   │                  │ lookup(UserB) │                │              │
   │                  │──────────────►│                │              │
   │                  │  Chat Srv 2   │                │              │
   │                  │◄──────────────│                │              │
   │                  │  route(msg)   │                │              │
   │                  │───────────────────────────────►│              │
   │                  │               │                │ push(msg) WS │
   │                  │               │                │─────────────►│
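
The same flow can be modelled with plain in-memory structures. This is a sketch under stated assumptions: the dictionaries stand in for the key-value store, Redis session map, message queue, and push service, and `send_message` is a hypothetical handler name.

```python
# In-memory sketch of the 1-on-1 flow: persist, then route or queue, then ACK.
from collections import defaultdict

message_store = {}                   # (conversation_id, message_id) -> message
last_seq = defaultdict(int)          # conversation_id -> last sequence number
session_map = {}                     # user_id -> chat server id (online users only)
server_outbox = defaultdict(list)    # chat server id -> messages to push over WebSocket
offline_inbox = defaultdict(list)    # user_id -> messages waiting for reconnect
push_requests = []                   # user_ids that should get a push notification

def send_message(conversation_id, sender_id, recipient_id, content):
    # Steps 2-3: allocate a per-conversation ID and persist before routing.
    last_seq[conversation_id] += 1
    message = {
        "message_id": last_seq[conversation_id],
        "conversation_id": conversation_id,
        "sender_id": sender_id,
        "content": content,
    }
    message_store[(conversation_id, message["message_id"])] = message

    # Steps 5a/5b: route to the recipient's chat server, or queue and notify.
    server_id = session_map.get(recipient_id)
    if server_id is not None:
        server_outbox[server_id].append(message)
    else:
        offline_inbox[recipient_id].append(message)
        push_requests.append(recipient_id)

    # Step 4: the ACK returned to the sender carries the assigned ID.
    return {"ack": True, "message_id": message["message_id"]}

session_map["user_b"] = "chat-2"                          # User B is online
ack_online = send_message(101, "user_a", "user_b", "Hello")
del session_map["user_b"]                                 # User B disconnects
ack_offline = send_message(101, "user_a", "user_b", "Still there?")
print(ack_online, ack_offline)
```

Persisting before routing is the key ordering: if delivery fails at any later step, the message can still be recovered from the store.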

Message ID Design

Critical requirement: Messages within a conversation must be orderable.

Option 1: Global unique ID (Snowflake):

  • 64-bit ID: timestamp + datacenter + worker + sequence
  • ✅ Globally unique
  • ✅ Sortable by creation time globally
  • ❌ Overkill for per-conversation ordering
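
For reference, a Snowflake-style generator can be sketched as follows. The 41/5/5/12 bit split is the commonly cited layout; `EPOCH_MS` is an arbitrary assumed custom epoch, and borrowing the next millisecond when the sequence overflows is a simplification (many implementations wait for the clock instead).

```python
# Sketch of a Snowflake-style 64-bit ID:
# 41-bit timestamp | 5-bit datacenter | 5-bit worker | 12-bit sequence.
import threading
import time

EPOCH_MS = 1_600_000_000_000          # assumed custom epoch (milliseconds)
DATACENTER_BITS, WORKER_BITS, SEQUENCE_BITS = 5, 5, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class SnowflakeGenerator:
    def __init__(self, datacenter_id: int, worker_id: int):
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id out of range")
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms < self.last_ms:
                # Clock is behind our logical time; reuse it to stay monotonic.
                now_ms = self.last_ms
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond; move to the next.
                    now_ms = self.last_ms + 1
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return (
                ((now_ms - EPOCH_MS) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (self.datacenter_id << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )

gen = SnowflakeGenerator(datacenter_id=1, worker_id=7)
ids = [gen.next_id() for _ in range(5000)]
print(ids[0], ids[-1])
```

IDs from a single generator are strictly increasing; IDs from different workers are only roughly time-ordered, which is one reason a per-conversation sequence is simpler for chat ordering.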

Option 2: Local sequence number per conversation (Recommended!):

  • Each conversation has its own auto-incrementing counter
  • message_id = sequence number within that conversation
  • Implementation: Redis INCR conversation:{id}:seq

# Generate message ID for a conversation
# (assumes the redis-py client and a reachable Redis server)
import redis

redis_client = redis.Redis()

def generate_message_id(conversation_id):
    seq_key = f"conversation:{conversation_id}:seq"
    return redis_client.incr(seq_key)  # INCR is atomic, so IDs never repeat

# Message structure:
message = {
    "message_id": 42,            # Sequence within conversation
    "global_id": "uuid",         # UUID for dedup (optional)
    "conversation_id": 101,
    "sender_id": 1,
    "content": "Hello!",
    "timestamp": 1640000000000,  # For display, not ordering
    "type": "text",
}

Why local sequence number:

  • Easier to display in order (just sort by message_id)
  • Less contention than global ID generator
  • Natural conversation threading

Why NOT use timestamp alone:

  • Two messages can have the same millisecond timestamp
  • Clock skew across servers can cause out-of-order IDs
  • Always use a sequence number for definitive ordering

Group Chat Architecture

Small groups (≤ 100 members, our spec):

The fanout on write (push) model works well at this scale:

User A sends to Group (100 members):

Chat Server → Message Queue
               ├── User B's inbox queue
               ├── User C's inbox queue
               ├── User D's inbox queue
               └── ... × 100 members

Each online member gets message via their chat server.
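Each offline member gets a push notification.

# Group message fanout (push model).
# `db`, `store_message`, and `producer` are placeholders for the group
# membership store, the message store, and a queue client.
def handle_group_message(sender_id, group_id, message, db, store_message, producer):
    # Look up all members of the group
    members = db.get_group_members(group_id)

    # Store the message once; recipients receive only a reference to it
    msg_id = store_message(group_id, sender_id, message)

    # Fan out to every member except the sender. The reference includes the
    # conversation, because msg_id is only unique within one conversation.
    for member_id in members:
        if member_id == sender_id:
            continue
        inbox_queue = f"inbox:{member_id}"
        producer.publish(inbox_queue, {"conversation_id": group_id, "message_id": msg_id})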

Large groups (> 100 members, e.g., Discord) need a different strategy:

  • Fanout on read (pull model): don’t push to everyone
  • Store message once, each client fetches on demand
  • Trade-off: Higher read latency, simpler write path
  • Our design only requires ≤ 100, so fanout on write is fine

Small vs Large Group Chat:

Aspect           | Small Group (≤100)          | Large Group (>100)
-----------------+-----------------------------+----------------------------
Strategy         | Fanout on write (push)      | Fanout on read (pull)
Write path       | Send to each member’s queue | Write once to channel
Read path        | Real-time push via WS       | Client polls/subscribes
Offline delivery | Push notification per user  | Client fetches on reconnect
Example          | WhatsApp groups             | Discord, Slack channels
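
The difference in write cost between the two strategies can be sketched with plain lists. The function names and the index-based cursor are illustrative assumptions, not part of any specific product's design.

```python
# Sketch comparing writes per message under fanout on write vs fanout on read.
from collections import defaultdict

def fanout_on_write(inboxes, members, sender_id, message):
    """Push model: copy the message into every other member's inbox."""
    writes = 0
    for member_id in members:
        if member_id != sender_id:
            inboxes[member_id].append(message)
            writes += 1
    return writes

def fanout_on_read(channel_log, message):
    """Pull model: append once to the shared channel log."""
    channel_log.append(message)
    return 1

def read_channel(channel_log, cursor):
    """Each reader pulls everything after its own cursor (an index into the log)."""
    return channel_log[cursor:], len(channel_log)

small_group = [f"user_{i}" for i in range(100)]
inboxes = defaultdict(list)
channel_log = []

small_writes = fanout_on_write(inboxes, small_group, "user_0", "hi")
large_writes = fanout_on_read(channel_log, "announcement")
new_messages, new_cursor = read_channel(channel_log, cursor=0)
print(small_writes, large_writes, new_messages, new_cursor)
```

With the pull model the write cost stays at one regardless of group size, and the work moves to each reader's fetch.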

Online Presence System

How to track who is online?

Naive approach: User sends “I’m online” on connect, “I’m offline” on disconnect.

Problem: Connection drops (network flicker, app background) would cause false “offline” status, frustrating users.

Solution: Heartbeat mechanism

Client → PresenceServer: heartbeat { user_id: 123 }  (every 5 seconds)

PresenceServer logic:
- On heartbeat: update Redis → user:123:lastSeen = now()
- Status = ONLINE if lastSeen < 30 seconds ago
- Status = OFFLINE if lastSeen ≥ 30 seconds ago

No heartbeat for 30 seconds → mark user as offline

Timeline:
t=0   User A connects
t=5   Heartbeat → ONLINE
t=10  Heartbeat → ONLINE
t=15  Heartbeat → ONLINE (last heartbeat before the app is backgrounded)
t=20  App backgrounded (no heartbeat)
t=45  No heartbeat for 30s (since t=15) → mark OFFLINE
t=50  User re-opens app → Heartbeat → ONLINE again
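
The heartbeat rule is small enough to sketch directly. The class below is an in-memory stand-in for the Redis-backed presence server, with the clock passed in explicitly so the behaviour is easy to trace; `PresenceTracker` and its method names are hypothetical.

```python
# Sketch of heartbeat-based presence with an explicit clock (seconds).
HEARTBEAT_INTERVAL_S = 5
OFFLINE_TIMEOUT_S = 30

class PresenceTracker:
    def __init__(self):
        self.last_seen = {}            # user_id -> time of last heartbeat

    def heartbeat(self, user_id, now):
        self.last_seen[user_id] = now

    def status(self, user_id, now):
        last = self.last_seen.get(user_id)
        if last is not None and now - last < OFFLINE_TIMEOUT_S:
            return "ONLINE"
        return "OFFLINE"

tracker = PresenceTracker()
for t in range(0, 20, HEARTBEAT_INTERVAL_S):      # heartbeats at t=0, 5, 10, 15
    tracker.heartbeat("user_a", now=t)

# The app is backgrounded after t=15, so no more heartbeats arrive.
timeline = {t: tracker.status("user_a", now=t) for t in (20, 44, 45)}
tracker.heartbeat("user_a", now=50)               # user re-opens the app
timeline[50] = tracker.status("user_a", now=50)
print(timeline)
```

With the last heartbeat at t=15, the user stays ONLINE through t=44 and flips to OFFLINE at t=45, exactly 30 seconds after the last heartbeat.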

Presence propagation:

User B's status changes (OFFLINE → ONLINE):
1. Presence Server detects change
2. Publish status change event to Kafka topic: "presence-updates"
3. All of User B's friends subscribe to presence-updates for User B
4. Their clients get notified via their chat server connection

For efficiency (fan-out problem with many friends):
- Only propagate to online friends (no point notifying offline users)
- Rate limit status updates (don't broadcast every flicker)
- Use pub/sub: User B's friends subscribe to user_b_presence channel
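
The first two efficiency rules can be sketched as a small filter in front of the broadcast. This is a simplified model: `MIN_BROADCAST_GAP_S` is an assumed debounce window, and a suppressed change is simply dropped here, whereas a real system would re-check the status after the window so friends do not keep a stale value.

```python
# Sketch: notify only online friends, and suppress rapid status flapping.
MIN_BROADCAST_GAP_S = 10   # assumed debounce window in seconds

class PresencePublisher:
    def __init__(self, friends, online_users):
        self.friends = friends              # user_id -> set of friend ids
        self.online_users = online_users    # set of currently online user ids
        self.last_broadcast = {}            # user_id -> (time, status)

    def publish(self, user_id, status, now):
        """Return the friends to notify, or an empty list if suppressed."""
        previous = self.last_broadcast.get(user_id)
        if previous is not None:
            last_time, last_status = previous
            if status == last_status or now - last_time < MIN_BROADCAST_GAP_S:
                return []
        self.last_broadcast[user_id] = (now, status)
        return sorted(self.friends.get(user_id, set()) & self.online_users)

friends = {"user_b": {"user_a", "user_c", "user_d"}}
online_users = {"user_a", "user_d"}                 # user_c is offline
publisher = PresencePublisher(friends, online_users)

first = publisher.publish("user_b", "online", now=100)
flicker = publisher.publish("user_b", "offline", now=103)   # inside the window
later = publisher.publish("user_b", "offline", now=115)
print(first, flicker, later)
```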

Redis data model for presence:

Key: presence:{user_id}
Value: { "status": "online", "last_heartbeat": 1640000000, "active_devices": ["phone", "web"] }
TTL: 35 seconds (slightly longer than the 30-second timeout; the key auto-expires if no heartbeat arrives)

Cross-Device Synchronization

Problem: A user has a phone AND a laptop. A message received on the phone must also appear on the laptop.

Solution: Each device has its own message sync queue / cursor.

User A has 2 devices: Phone and Laptop

Message arrives in User A's inbox:

         ┌─────────────────────────────┐
         │    User A's Inbox           │
         │    Message 1, 2, 3, 4, 5    │
         └───────┬──────────────┬──────┘
                 │              │
         ┌───────▼────┐  ┌──────▼──────┐
         │  Phone     │  │   Laptop    │
         │  cursor: 3 │  │  cursor: 5  │
         │  (offline, │  │  (online,   │
         │  needs 4,5)│  │  up-to-date)│
         └────────────┘  └─────────────┘

Implementation:

  • Each device stores its own last_synced_message_id
  • On reconnect, device sends: “Give me all messages after message_id=3”
  • Server returns messages 4, 5 to phone
  • This is cursor-based pagination of the message log

Message sync API:

GET /messages/sync?conversation_id=101&after_message_id=3&limit=50

Response:
{
  "messages": [
    { "id": 4, "content": "Hi!", "sender": "UserB", ... },
    { "id": 5, "content": "How are you?", "sender": "UserB", ... }
  ],
  "has_more": false
}
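
The server side of this sync call can be sketched as a filter over the conversation log. The in-memory list stands in for a range scan on the key-value store, and `sync_messages` is a hypothetical handler name.

```python
# Sketch of cursor-based sync over an in-memory conversation log.
def sync_messages(log, after_message_id, limit=50):
    """Return up to `limit` messages with id > after_message_id, plus has_more."""
    newer = sorted(
        (m for m in log if m["id"] > after_message_id),
        key=lambda m: m["id"],
    )
    return {"messages": newer[:limit], "has_more": len(newer) > limit}

conversation_log = [{"id": i, "content": f"message {i}"} for i in range(1, 6)]

phone = sync_messages(conversation_log, after_message_id=3)    # cursor: 3
laptop = sync_messages(conversation_log, after_message_id=5)   # cursor: 5
paged = sync_messages(conversation_log, after_message_id=0, limit=2)
print([m["id"] for m in phone["messages"]], laptop, paged["has_more"])
```

When `has_more` is true, the device repeats the call with the last ID it received until it catches up.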

Push Notifications for Offline Users

User B is OFFLINE (app closed):

1. Message arrives in User B's inbox queue
2. Chat Server detects User B is offline (no active WebSocket)
3. Message forwarded to Push Notification Provider
4. Push provider sends push notification to User B's device

┌────────────────┐      ┌──────────────────────────┐
│  Chat Server   │      │  Push Notification       │
│                │─────►│  Service                 │
│ "User B is     │      │                          │
│  offline, send │      │  ┌──────────────────┐    │
│  push"         │      │  │  APNs (iOS)      │──► │ User B's iPhone
│                │      │  │  FCM (Android)   │──► │ User B's Android
│                │      │  │  Web Push        │──► │ User B's Browser
│                │      │  └──────────────────┘    │
└────────────────┘      └──────────────────────────┘

Push notification content:

  • Don’t include full message text (privacy, payload limits)
  • Include: sender name, preview (“You have a new message”), conversation_id
  • On tap: open app → fetch full messages from server

Delivery guarantee:

  • Message stored in DB before push notification sent
  • Even if push notification is lost, user sees message when they open app
  • Message queue ensures at-least-once delivery
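
The two rules above (store before notifying, keep the message body out of the payload) can be sketched as follows. The payload shape is a generic assumption, not the APNs or FCM wire format, and `store` and `send_push` are injected placeholders for the database write and the push provider call.

```python
# Sketch of a provider-agnostic push payload that omits the message body.
import json

def build_push_payload(sender_name, conversation_id, message_id):
    return {
        "title": sender_name,
        "body": "You have a new message",
        "data": {"conversation_id": conversation_id, "message_id": message_id},
    }

def notify_if_offline(message, recipient_online, store, send_push):
    """Persist first, then push only when the recipient has no active session."""
    store(message)   # a lost push can never lose the message itself
    if recipient_online:
        return None
    payload = build_push_payload(
        message["sender_name"], message["conversation_id"], message["message_id"]
    )
    send_push(payload)
    return payload

stored, sent = [], []
message = {"message_id": 7, "conversation_id": 101,
           "sender_name": "User A", "content": "my secret plans"}

payload = notify_if_offline(message, recipient_online=False,
                            store=stored.append, send_push=sent.append)
skipped = notify_if_offline(message, recipient_online=True,
                            store=stored.append, send_push=sent.append)
print(json.dumps(payload))
```

On tap, the app would use `conversation_id` from the payload to fetch the full messages from the server.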

Media Messages (Images, Videos)

Problem: Binary files are too large to route through chat servers, so don’t send them over the WebSocket connection.

Solution: Separate media upload flow via CDN:

Sending an image:

1. Client uploads image → Object Storage (S3) via API Server
2. Object Storage generates CDN URL
3. Client sends chat message with CDN URL (not the image itself)
4. Receiving client downloads image from CDN URL

[Sender] ──► [S3/Object Storage] ──► CDN
                                      │
[Sender] ──► [Chat Server] ──────────►│
             "image_url": "cdn.../img"│
                                      ▼
                               [Receiver opens URL]

Optimization:

  • Compress images before upload (client-side)
  • Generate thumbnails server-side (store separate low-res version)
  • Lazy load images (only fetch when scrolled into view)

End-to-End Encryption (E2E)

Concept (mention in interview, don’t over-engineer):

WhatsApp uses Signal Protocol for E2E encryption:

1. Each user has a public/private key pair
2. Public keys stored on server, private keys NEVER leave device
3. Message encrypted on sender's device with keys derived from the recipient's public key (simplified; Signal derives a fresh key per message)
4. Server sees only encrypted bytes (can't read content)
5. Decrypted only on recipient's device using private key

Architecture impact:
- Server can't index message content (no search)
- Server stores encrypted blob, not plaintext
- Public keys are exchanged via the server, but private keys never reach it

Design Summary

Complete Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    CLIENTS (iOS / Android / Web)                 │
│           WebSocket (messages) + HTTP (everything else)          │
└──────┬───────────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────────────┐
│    Load Balancer     │
│  (L4 for WebSocket,  │
│   L7 for HTTP)       │
└──┬───────────────────┘
   │
   ├──────────────────────────────────────────────────────────────┐
   │                                                              │
   ▼                                                              ▼
┌──────────────────────────────────┐            ┌────────────────────────────┐
│         CHAT SERVERS             │            │        API SERVERS         │
│  (Stateful, WebSocket)           │            │  (Stateless, HTTP)         │
│  - Maintain WS connections       │            │  - Login / auth            │
│  - Route messages to recipients  │            │  - User management         │
│  - Handle group fanout           │            │  - Group management        │
│  - Detect online/offline         │            │  - Media upload            │
└────────────┬─────────────────────┘            └────────────────────────────┘
             │
             ▼
┌───────────────────────────────────────────────┐
│             MESSAGE QUEUE (Kafka)             │
│  - Per-user inbox (keyed by user_id)          │
│  - Buffer for offline users                   │
│  - Decouple senders from receivers            │
│  - Fan-out for group messages                 │
└──────────────────────┬────────────────────────┘
                       │
         ┌─────────────┼─────────────────┐
         ▼             ▼                 ▼
┌─────────────┐  ┌───────────┐ ┌─────────────────────┐
│  HBase /    │  │  Redis    │ │  Push Notification  │
│ Cassandra   │  │  Cache    │ │  Service            │
│             │  │           │ │  (APNs / FCM)       │
│ - Messages  │  │ - Active  │ │                     │
│ - Chat log  │  │   WS      │ │  For offline users  │
│ - 10yr ret. │  │   sessions│ │                     │
│             │  │ - Presence│ │                     │
│             │  │ - Seq IDs │ │                     │
└─────────────┘  └───────────┘ └─────────────────────┘
         ▲
         │
┌─────────────────────┐
│  PRESENCE SERVERS   │
│  - Heartbeat (5s)   │
│  - Online status    │
│  - Status fanout    │
│  - Redis TTL for    │
│    auto-offline     │
└─────────────────────┘
         ▲
         │
┌─────────────────────┐
│  SERVICE DISCOVERY  │
│  (ZooKeeper)        │
│  - Assign chat srv  │
│  - Track server load│
│  - Failover routing │
└─────────────────────┘

Key Design Decisions Summary

Decision               | Choice                             | Reasoning
-----------------------+------------------------------------+------------------------------------------------------
Client-server protocol | WebSocket (messages) + HTTP (REST) | Bidirectional push; HTTP stateless for non-real-time
Message queue          | Kafka                              | Decouple sender/receiver, buffer offline msgs, fan-out
Chat history storage   | HBase / Cassandra                  | Write-heavy, horizontal scale, ordered scan by time
User/group data        | MySQL / PostgreSQL                 | Structured, needs joins, low volume
Cache                  | Redis                              | Sessions, presence, sequence IDs: all in-memory
Message ID             | Local sequence per conversation    | Simple ordering, less contention than global ID
Group fanout           | Push to each member’s queue        | Works well for ≤ 100 members
Online presence        | Heartbeat every 5s, 30s timeout    | Handles flaky networks gracefully
Service discovery      | ZooKeeper                          | Dynamic server assignment and load balancing
Offline delivery       | Push notifications (APNs/FCM)      | Native mobile delivery when app is closed
Interview Questions & Answers

Q: Why WebSocket over HTTP long-polling for a chat system?
A: WebSocket provides persistent bidirectional communication: the server can push messages to the client without waiting for a client request. Long polling still requires the client to issue a new request after every message, adding latency and overhead. For a chat system with frequent messages, that constant re-request cost is hard to justify. WebSocket also lets the server push typing indicators and presence updates over the same connection at little extra cost.

Q: How do you ensure message ordering in a group chat?
A: Each conversation (1-on-1 or group) maintains a local sequence number using a Redis atomic INCR. Every message sent to that conversation gets the next sequence number. Clients display messages sorted by sequence number, not timestamp, since clock skew across servers can produce misleading timestamps. The sequence number is the single source of truth for message order within a conversation.

Q: How does the system handle a user who is offline for a week?
A: Messages are stored in two places: the persistent key-value store (HBase/Cassandra) and the user’s Kafka inbox queue. When the user reconnects: (1) they request messages after their last-seen message_id, (2) the server returns all newer messages via the sync API, (3) the Kafka consumer group drains the inbox queue and delivers pending messages. The KV store is the source of truth; Kafka is the real-time delivery mechanism.

Q: How do you scale to 50M concurrent WebSocket connections?
A: WebSocket connections are stateful and memory-intensive (~50KB per connection). At 50M connections, that’s ~2.5TB RAM total. We scale horizontally: add more chat servers, each handling ~50K-100K connections, which works out to roughly 500 to 1,000 servers. ZooKeeper tracks server load and routes new users to least-loaded servers. Redis stores the user_id → chat_server_id mapping so any server can route messages correctly. When a server fails, clients reconnect and ZooKeeper assigns a new server.
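
The sizing in this answer is simple arithmetic, sketched below. The 50 KB per-connection figure and the per-server connection limits are the assumptions stated in the answer, not measurements.

```python
# Sketch: RAM and server count for 50M concurrent WebSocket connections.
import math

CONNECTIONS = 50_000_000
BYTES_PER_CONNECTION = 50_000          # ~50 KB, assumed

def servers_needed(connections: int, per_server: int) -> int:
    """Number of chat servers needed at a given connection density."""
    return math.ceil(connections / per_server)

total_ram_tb = CONNECTIONS * BYTES_PER_CONNECTION / 1e12
servers_at_50k = servers_needed(CONNECTIONS, 50_000)
servers_at_100k = servers_needed(CONNECTIONS, 100_000)
ram_per_server_gb = 100_000 * BYTES_PER_CONNECTION / 1e9

print(f"total RAM: {total_ram_tb} TB")
print(f"servers:   {servers_at_100k} to {servers_at_50k}")
print(f"RAM per server at 100K connections: {ram_per_server_gb} GB")
```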

Q: How do you handle the fanout problem when a group has 1,000+ members?
A: For groups up to 100 members (our spec), fanout on write is acceptable: at most 100 queue writes per message. If we needed to support larger groups (Discord-style channels with thousands of members), we would switch to fanout on read: store the message once and have clients subscribe to the channel and pull messages. The trade-off is increased read latency vs. avoiding massive write amplification on sends.

Q: How do you guarantee message delivery (exactly-once vs at-least-once)?
A: We implement at-least-once delivery with deduplication for exactly-once behavior: (1) Sender gets an ACK from the chat server after the message is stored in DB; if no ACK arrives, the client retries with the same client-side generated UUID. (2) The server deduplicates by checking if the UUID already exists before storing. (3) Kafka ensures at-least-once delivery to the inbox queue. (4) The client tracks the last seen message_id and ignores duplicates. This gives effectively exactly-once visible delivery.
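
Steps (1), (2), and (4) can be sketched with two small classes. This is an in-memory illustration with hypothetical names; the receiving side assumes messages within a conversation arrive in sequence order, which a production client would need to verify.

```python
# Sketch of idempotent storage keyed by a client-generated UUID,
# plus duplicate suppression on the receiving client.
import uuid

class MessageStore:
    def __init__(self):
        self.by_client_uuid = {}   # client UUID -> stored message
        self.last_seq = {}         # conversation_id -> last assigned sequence number

    def store(self, conversation_id, client_uuid, content):
        """Store a message once; a retry with the same UUID returns the original."""
        existing = self.by_client_uuid.get(client_uuid)
        if existing is not None:
            return existing
        seq = self.last_seq.get(conversation_id, 0) + 1
        self.last_seq[conversation_id] = seq
        message = {"message_id": seq, "conversation_id": conversation_id,
                   "global_id": client_uuid, "content": content}
        self.by_client_uuid[client_uuid] = message
        return message

class ReceivingClient:
    def __init__(self):
        self.last_seen = {}        # conversation_id -> highest message_id displayed
        self.displayed = []

    def deliver(self, message):
        """Display a message unless it was already seen."""
        conv = message["conversation_id"]
        if message["message_id"] <= self.last_seen.get(conv, 0):
            return False
        self.last_seen[conv] = message["message_id"]
        self.displayed.append(message["content"])
        return True

store = MessageStore()
client_uuid = str(uuid.uuid4())
first = store.store(101, client_uuid, "Hello!")
retry = store.store(101, client_uuid, "Hello!")     # sender never saw the ACK

receiver = ReceivingClient()
shown = [receiver.deliver(first), receiver.deliver(retry)]   # queue redelivers
print(first["message_id"], shown)
```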


Key Takeaways

  1. WebSocket is the right choice for real-time bidirectional messaging; avoid polling for chat
  2. HTTP stays stateless (login, group management, media), only messaging goes over WebSocket
  3. Key-value stores (HBase/Cassandra) are ideal for chat history: write-heavy, append-only, horizontally scalable
  4. Kafka message queues decouple senders from receivers and buffer messages for offline users
  5. Heartbeat mechanism (every 5s, 30s timeout) is more reliable than connect/disconnect events for presence
  6. Local sequence numbers per conversation solve message ordering; don’t rely on timestamps alone
  7. Cross-device sync via cursor (last_synced_message_id): each device independently tracks its position
  8. Service discovery (ZooKeeper) assigns users to chat servers dynamically, enabling horizontal scaling


Practice this design! One of the hardest SDI questions. Be ready to:

  1. Justify WebSocket over long polling clearly
  2. Draw the complete architecture with all components
  3. Explain message ordering with local sequence numbers
  4. Walk through both online AND offline message delivery flows
  5. Discuss group chat fanout and scaling limits

Last Updated: 2026-04-13