Chapter 2 Flashcards — Nearby Friends (Vol. 2)
flashcards volume2 nearby-friends real-time websocket location
What is the Nearby Friends feature and how does it differ from a proximity service?
?
Nearby Friends (e.g., Facebook) shows which of your friends are physically close right now, updating in real-time as people move. Key differences from a proximity service: Proximity service (Yelp) is query-based — user asks, system responds with static data. Nearby Friends is event-driven — system proactively pushes updates when friends move. Businesses are static; friends are mobile. Proximity service uses geohash index + HTTP; Nearby Friends uses WebSocket + Pub/Sub fan-out. The fan-out problem is what makes Nearby Friends architecturally much harder.
What are the scale numbers for Nearby Friends and why is the fan-out the hard part?
?
1 billion total users, 10% use Nearby Friends = 100M active users. Location update frequency: every 30 seconds. Write QPS: 100M / 30 ≈ 3.3M location updates/sec. Average friends: 400. Raw fan-out: 3.3M × 400 = ~1.3 billion fan-out events/sec. Even after filtering to nearby-only friends (assume 10% are nearby): ~130M push events/sec. The fan-out is 400× amplification of the write QPS. This is why the system design focuses on efficient fan-out, not just storage.
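The arithmetic on this card can be reproduced in a few lines. This is a back-of-envelope sketch only: every input is an assumption taken from the card, and the constant names are illustrative. Without intermediate rounding the filtered figure comes out near 133M/sec, which the card rounds to ~130M.

```python
# Back-of-envelope scale estimate; all inputs are the card's stated assumptions.
TOTAL_USERS = 1_000_000_000
ADOPTION_PERCENT = 10          # share of users with Nearby Friends enabled
UPDATE_INTERVAL_SEC = 30       # one location update per user per interval
AVG_FRIENDS = 400
NEARBY_PERCENT = 10            # assumed share of friends actually within range

active_users = TOTAL_USERS * ADOPTION_PERCENT // 100
write_qps = active_users / UPDATE_INTERVAL_SEC
raw_fanout = write_qps * AVG_FRIENDS
filtered_fanout = raw_fanout * NEARBY_PERCENT / 100

print(f"active users:      {active_users:,}")
print(f"write QPS:         {write_qps / 1e6:.1f}M/sec")
print(f"raw fan-out:       {raw_fanout / 1e9:.1f}B events/sec")
print(f"filtered fan-out:  {filtered_fanout / 1e6:.0f}M pushes/sec")
```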
Why is WebSocket the correct protocol for Nearby Friends instead of HTTP polling?
?
HTTP polling at 5-second intervals for 100M users = 100M / 5 = 20M req/sec, most returning empty responses — wasted resources and money. Long polling holds open HTTP connections but cannot efficiently broadcast to multiple users. WebSocket provides a single persistent bidirectional TCP connection per user. The server can push location updates instantly without the client asking. The client sends location updates (client → server) AND receives friend updates (server → client) over the same connection. Result: no wasted requests, near-zero additional latency, and lower per-connection overhead.
Describe the end-to-end flow when User A moves and User B (a friend) receives the update.
?
1. User A’s phone sends location via WebSocket: {type:"location_update", lat:37.77, lng:-122.42, ts:1713000000}. 2. A’s WebSocket server stores in Redis: SET user:A:location "37.77,-122.42,ts" EX 60. 3. A’s WS server publishes: PUBLISH user:A:location "37.77,-122.42,ts". 4. Redis delivers the message to all subscribers of that channel, including B’s WS server (which subscribed because B is friends with A). 5. B’s WS server checks: is A within B’s 5-mile radius? Yes → push {type:"friend_location", friend_id:"A", distance_miles:1.3} to B’s WebSocket. No → discard. Total end-to-end latency: ~1-2 seconds.
How does Redis Pub/Sub enable the fan-out in Nearby Friends?
?
Each user gets a dedicated channel: "user:{user_id}:location". When a user’s friend connects to a WebSocket server, that WS server subscribes to the user’s channel on behalf of the friend. Example: A has friends B (on WS-1) and C (on WS-2). WS-1 subscribes to user:A:location. WS-2 subscribes to user:A:location. When A moves, A’s WS server does PUBLISH user:A:location "37.77,-122.42,ts". Redis delivers to both WS-1 and WS-2 simultaneously. Each WS server then decides (via distance check) whether to push to its connected friends. One PUBLISH → N deliveries.
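The "one PUBLISH → N deliveries" behaviour can be illustrated without a Redis instance. The sketch below is an in-memory stand-in, not the Redis client API: `InMemoryPubSub`, `make_handler`, and the `received` dictionary are hypothetical names invented for illustration. As with Redis PUBLISH, `publish` returns the number of subscribers that received the message.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Toy stand-in for Redis Pub/Sub: channel name -> list of callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver to every current subscriber and return the delivery count."""
        receivers = list(self._subscribers.get(channel, []))
        for callback in receivers:
            callback(channel, message)
        return len(receivers)


broker = InMemoryPubSub()
received = {"WS-1": [], "WS-2": []}


def make_handler(server_name):
    # Each WS server records what it receives; a real server would run the
    # distance check here before pushing to its connected users.
    def handler(channel, message):
        received[server_name].append((channel, message))
    return handler


# WS-1 hosts friend B and WS-2 hosts friend C; both subscribe to A's channel.
broker.subscribe("user:A:location", make_handler("WS-1"))
broker.subscribe("user:A:location", make_handler("WS-2"))

# A moves: one publish reaches both servers.
delivered = broker.publish("user:A:location", "37.77,-122.42,1713000000")
print(delivered, received)
```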
Why does each WebSocket server subscribe to friend channels on behalf of connected users, rather than having users subscribe directly?
?
WebSocket connections live at the server layer — the server is the process actually holding the TCP connection to the client. Redis Pub/Sub subscriptions are held by Redis client processes (the servers), not end-user devices. When User A connects to WS-1, WS-1 subscribes to all of A’s friends’ channels. WS-1 is the Redis subscriber, not A’s phone. When a friend update arrives, WS-1 decides whether to forward it to A over A’s open WebSocket. This architecture means Redis only needs to talk to WS servers (a manageable N), not 100M mobile clients.
Why is Redis Pub/Sub preferred over Kafka for location updates in Nearby Friends?
?
Location updates are ephemeral: if a subscriber misses one update, they get the next one in 30 seconds. No persistence or replay is needed. Redis Pub/Sub: in-memory, sub-millisecond delivery, at-most-once (acceptable, since a missed update costs a 30 sec wait), no disk I/O, built into Redis (same cluster used for location cache). Kafka: durable, replayable, at-least-once, higher latency due to disk persistence, more operational overhead. Use Kafka if you need location history, analytics, or audit trail. Use Redis Pub/Sub if you just need real-time delivery with no persistence requirement.
How is user location stored in Redis and why does TTL matter?
?
Key: user:{user_id}:location. Type: String or Hash. Value: “lat,lng,timestamp” (e.g., “37.7749,-122.4194,1713000000”). TTL: 60 seconds. TTL is critical because: if a user’s phone dies, app is backgrounded, or network drops, they stop sending updates. Without TTL, their last location would persist indefinitely — friends would see a 2-hour-old location as if current. With 60s TTL, after two missed updates the key expires automatically. The user’s WS server can send friend_offline to subscribed friends, and the Redis key disappears. Staleness = 60 seconds max.
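The effect of the 60-second TTL can be sketched with a small in-memory store. This is an illustration of the expiry behaviour only, not how Redis implements it: `TTLStore` and `FakeClock` are hypothetical names, and the injected clock exists purely so the example runs without waiting.

```python
import time


class TTLStore:
    """Toy key-value store with per-key expiry, mimicking SET ... EX."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_sec):
        self._data[key] = (value, self._clock() + ttl_sec)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop the expired key on access
            return None
        return value


class FakeClock:
    """Manually advanced clock so the demo is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
store = TTLStore(clock)
key = "user:A:location"

store.set(key, "37.7749,-122.4194,1713000000", ttl_sec=60)
clock.advance(30)
fresh = store.get(key)            # update arrived on time, key still live

store.set(key, "37.7750,-122.4195,1713000030", ttl_sec=60)  # TTL refreshed
clock.advance(59)
still_fresh = store.get(key)      # one missed update, key not yet expired

clock.advance(1)
expired = store.get(key)          # two missed updates, key is gone
print(fresh, still_fresh, expired)
```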
What is the fan-out problem and what are three approaches to mitigate it?
?
Fan-out problem: A user with 400 friends sends one location update → must deliver to up to 400 subscribers. At 3.3M updates/sec × 400 = 1.3B fan-out events/sec raw. Three mitigations: (1) Hard cap: limit Nearby Friends to users with < 1,000 friends (or disable for accounts with massive friend counts). (2) Server-side geographic pre-filtering: before publishing or before forwarding, check the friend’s last known location — if they are clearly in another city, skip. Reduces actual deliveries to ~10% of raw fan-out. (3) Rate limiting: for users with many nearby friends, throttle push frequency (e.g., once per 60 sec max instead of every 30 sec).
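Mitigation (3) could look roughly like the throttle below. It is a minimal sketch under assumptions the card does not state: throttling is tracked per (recipient, friend) pair, and `PushThrottle` and `allow` are hypothetical names. The caller supplies the current time, which keeps the example deterministic.

```python
class PushThrottle:
    """Allow at most one push per (recipient, friend) pair per interval."""

    def __init__(self, min_interval_sec):
        self.min_interval_sec = min_interval_sec
        self._last_push = {}  # (recipient_id, friend_id) -> time of last push

    def allow(self, recipient_id, friend_id, now):
        key = (recipient_id, friend_id)
        last = self._last_push.get(key)
        if last is not None and now - last < self.min_interval_sec:
            return False  # too soon since the previous push; drop this update
        self._last_push[key] = now
        return True


# A publishes every 30 sec, but B is throttled to one push per 60 sec.
throttle = PushThrottle(min_interval_sec=60)
decisions = [throttle.allow("B", "A", now=t) for t in (0, 30, 60, 90)]
print(decisions)
```

Dropping a throttled update is safe here for the same reason at-most-once delivery is acceptable: the next update supersedes it.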
What does a WebSocket server maintain in memory for each connected user?
?
Two data structures per WS server: (1) Active connections map: {user_id → websocket_handle}, which maps each connected user to their WebSocket object so the server can push messages to them. (2) Channel subscriptions map: {user_id → [channel_1, channel_2, …]}, which maps each user to the list of Redis Pub/Sub channels the server has subscribed to on that user’s behalf (one per friend). On connect: populate both maps (fetch friend list, subscribe channels). On disconnect: unsubscribe all channels, remove from connections map, optionally send friend_offline events.
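The two maps and their connect/disconnect handling can be sketched as follows. The class and method names are hypothetical. The per-channel reference count is an added assumption, not something the card specifies: it is one way to avoid unsubscribing from a channel that another locally connected user still needs.

```python
class WebSocketServerState:
    """In-memory bookkeeping for one WS server (illustrative only)."""

    def __init__(self):
        self.connections = {}       # user_id -> websocket handle
        self.subscriptions = {}     # user_id -> list of friend channels
        self.channel_refcount = {}  # channel -> local users needing it (assumption)

    def on_connect(self, user_id, handle, friend_ids):
        """Register a user; return channels that need a new SUBSCRIBE."""
        channels = [f"user:{fid}:location" for fid in friend_ids]
        self.connections[user_id] = handle
        self.subscriptions[user_id] = channels
        to_subscribe = []
        for channel in channels:
            count = self.channel_refcount.get(channel, 0)
            if count == 0:
                to_subscribe.append(channel)
            self.channel_refcount[channel] = count + 1
        return to_subscribe

    def on_disconnect(self, user_id):
        """Remove a user; return channels that can now be UNSUBSCRIBEd."""
        self.connections.pop(user_id, None)
        to_unsubscribe = []
        for channel in self.subscriptions.pop(user_id, []):
            self.channel_refcount[channel] -= 1
            if self.channel_refcount[channel] == 0:
                del self.channel_refcount[channel]
                to_unsubscribe.append(channel)
        return to_unsubscribe


state = WebSocketServerState()
first = state.on_connect("A", "sock-A", ["B", "C"])   # subscribes to B and C
second = state.on_connect("D", "sock-D", ["C"])       # C is already subscribed
after_a = state.on_disconnect("A")                    # C is still needed by D
after_d = state.on_disconnect("D")
print(first, second, after_a, after_d)
```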
How do you scale WebSocket servers as user count grows?
?
WebSocket servers are stateful (hold open connections) but independently scalable. Each server handles ~50K connections. 100M users / 50K per server = 2,000 WS servers. Use consistent hashing (hash user_id → WS server index) so a reconnecting user goes back to the same server, avoiding repeated subscription churn. Load balancer uses sticky sessions based on user_id hash. On server failure: load balancer health-checks detect it, redirects affected clients to healthy servers, clients re-subscribe on reconnect (fetching friend list fresh). Redis Pub/Sub cluster scales independently — shard by channel name hash.
How do you handle a WebSocket server crash with 50,000 active connections?
?
1. Load balancer health check detects unhealthy WS server (TCP probe fails or health endpoint returns error). 2. Load balancer stops routing new connections to it. 3. All 50,000 clients detect TCP disconnect and initiate reconnect with exponential backoff. 4. Clients reconnect to healthy WS servers (consistent hash may redistribute them or a replacement server can be spun up with the same hash ring slot). 5. Each newly connected user causes their WS server to re-fetch their friend list and re-subscribe to friend channels. 6. Gap in updates: at most 30 seconds (next location update cycle). Location data in Redis is safe (separate tier).
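Step 3 (reconnect with exponential backoff) could be sketched as below. The random jitter is an added assumption rather than something the card states: it is a common way to keep 50,000 clients from retrying in lockstep. The function name, base delay, and cap are illustrative values, not recommendations.

```python
import random


def backoff_delays(attempts, base_sec=1.0, cap_sec=60.0, rng=None):
    """Return one randomized delay per reconnect attempt.

    The ceiling doubles each attempt (base, 2x, 4x, ...) up to cap_sec, and
    the actual delay is drawn uniformly from [0, ceiling] ("full jitter").
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_sec, base_sec * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Seeded generator only so repeated runs of this demo behave the same way.
delays = backoff_delays(8, rng=random.Random(42))
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait up to {min(60.0, 2.0 ** attempt):.0f}s, chose {delay:.2f}s")
```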
What are the privacy controls for Nearby Friends and how are they implemented?
?
Four privacy modes: (1) Off (default): no location key written to Redis, no channel subscriptions created — friends receive nothing. (2) Friends only: standard flow — all friends get updates. (3) Close friends only: filter friend list to a tagged subset before subscribing to channels. (4) Approximate mode: server rounds lat/lng to 0.1° precision (~11km) before publishing — friends see “about 5 miles away” not exact location. Opt-out flow: immediately DELETE user:{id}:location from Redis, UNSUBSCRIBE all friend channel subscriptions, push friend_offline to currently connected friends.
How do you handle a user adding or removing a friend in real-time?
?
Add friend: (1) Update friend graph in User DB. (2) User A’s WS server subscribes to user:B:location channel. (3) User B’s WS server subscribes to user:A:location channel. Both start receiving each other’s updates immediately — no reconnect needed, just add new Redis subscriptions. Remove friend: (1) Update friend graph in User DB. (2) User A’s WS server unsubscribes from user:B:location channel. (3) User B’s WS server unsubscribes from user:A:location channel. (4) Send friend_offline event to both to clear the UI. Effect is immediate — no staleness window.
How do you efficiently calculate distance at scale between millions of user pairs?
?
Two-step approach: (1) Bounding box pre-filter: compute lat/lng deltas. One degree of latitude is about 69 miles, so if |lat1 - lat2| > (radius_miles / 69) × 1.5 (a 1.5× safety buffer), reject immediately with no great-circle math. The same test applies to longitude, with the threshold widened by 1/cos(latitude) because longitude degrees shrink away from the equator. This eliminates ~90%+ of pairs since most friends are in different cities. (2) Haversine formula for remaining candidates: accurate great-circle distance using sin/cos. For 100M users × 400 friends = 40B pairs in theory, the bounding box pre-filter reduces actual Haversine calls to a tiny fraction. At the WS server level, each server only computes distances for its connected users’ friends, so the work is naturally distributed.
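The two-step check can be sketched like this. It is a minimal illustration: the function names are hypothetical, the Earth radius and 69 miles-per-degree constants are standard approximations, and the longitude pre-filter (with its cos(latitude) scaling and antimeridian wrap) is one reasonable reading of the card's "lat/lng deltas", not a confirmed detail. The pre-filter is only approximate very close to the poles.

```python
import math

EARTH_RADIUS_MILES = 3958.8
MILES_PER_DEGREE_LAT = 69.0  # approximate


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(lat1, lng1, lat2, lng2, radius_miles, buffer=1.5):
    """Cheap bounding-box rejection first, Haversine only for survivors."""
    max_dlat = (radius_miles / MILES_PER_DEGREE_LAT) * buffer
    if abs(lat1 - lat2) > max_dlat:
        return False

    # Longitude degrees shrink by cos(latitude); use the larger |latitude|
    # so the box errs on the side of keeping candidates.
    dlng = abs(lng1 - lng2)
    dlng = min(dlng, 360 - dlng)  # handle wrap at the antimeridian
    cos_lat = math.cos(math.radians(max(abs(lat1), abs(lat2))))
    if cos_lat > 1e-6 and dlng > max_dlat / cos_lat:
        return False

    return haversine_miles(lat1, lng1, lat2, lng2) <= radius_miles


SF = (37.7749, -122.4194)
OAKLAND = (37.8044, -122.2712)
NYC = (40.7128, -74.0060)

print(round(haversine_miles(*SF, *OAKLAND), 1))
print(within_radius(*SF, *OAKLAND, radius_miles=10))
print(within_radius(*SF, *OAKLAND, radius_miles=5))
print(within_radius(*SF, *NYC, radius_miles=5))
```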
What is the difference between the location update write path and the friend display read path?
?
Write path (high frequency, 3.3M/sec): Mobile client → WebSocket server → Redis SET (location cache) + Redis PUBLISH (fan-out). Synchronous, hot path, must complete in < 100ms. Read path (initial load): Client opens app → HTTP REST GET /v1/friends/nearby → API server fetches friend list from User DB → for each friend, GET user:{id}:location from Redis → compute distances → return sorted list. WebSocket then keeps this list fresh via pushed updates. The REST read path is only for initial page load; ongoing updates come via WebSocket push.
What is graduated location sharing and why is it important?
?
Graduated location sharing provides different levels of precision for different groups of friends. Example tiers: (1) Close friends: exact GPS coordinates (±10 meters). (2) Regular friends: rounded to 0.01° (≈ 1.1 km, roughly 0.7 miles). (3) Acquaintances: approximate ("in your area", same city only). Implementation: the server applies a different rounding for each tier before publishing to that tier’s channel segment. Importance: sensitive data like live location should not be shared at full precision with everyone. Coarser precision reduces the risk of stalking or unwanted contact. It also reduces anxiety about privacy, increasing feature adoption.
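Per-tier rounding could be sketched as follows. The tier names and the `round_location` helper are hypothetical. The 0.1° value for acquaintances is an assumed stand-in for the card's "same city only" tier, which a real system might implement with a city lookup instead of rounding.

```python
# Decimal places kept per tier; None means coordinates pass through unchanged.
PRECISION_DECIMALS = {
    "close": None,       # exact GPS coordinates
    "regular": 2,        # 0.01 degrees, about 1.1 km
    "acquaintance": 1,   # 0.1 degrees, about 11 km (assumed stand-in for "same city")
}


def round_location(lat, lng, tier):
    """Return the coordinates at the precision allowed for this tier."""
    decimals = PRECISION_DECIMALS[tier]
    if decimals is None:
        return (lat, lng)
    return (round(lat, decimals), round(lng, decimals))


exact = round_location(37.7749, -122.4194, "close")
regular = round_location(37.7749, -122.4194, "regular")
coarse = round_location(37.7749, -122.4194, "acquaintance")
print(exact, regular, coarse)
```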
Why is the nearby friends problem architecturally harder than the proximity service problem?
?
Proximity service (Yelp): static data (businesses don’t move), query-on-demand, one-way data flow (user asks, server responds), geohash index pre-built offline, ~6K QPS to handle. Nearby Friends: dynamic data (everyone moves constantly), proactive push (server must track who needs updates), bidirectional data flow (client sends location AND receives updates), fan-out multiplier (400 friends per update), privacy graph (must check friendship before every delivery), 3.3M writes/sec × 400 fan-out = 1.3B events/sec at peak. Proximity service requires geospatial indexing. Nearby Friends requires streaming architecture.
How does consistent hashing help with WebSocket server scaling?
?
Without consistent hashing: users reconnecting after a server restart could land on any of 2,000 servers, causing mass re-subscription of Redis channels (all friends’ channels re-subscribed from scratch, stampede effect). With consistent hashing: hash(user_id) deterministically maps to a server slot. When a server is replaced, only the users that hashed to that slot need to reconnect and re-subscribe. Users on other servers are unaffected. Also useful for planned scaling — adding a new server only migrates a 1/N fraction of users, not all of them. This dramatically reduces subscription churn during deploys and failures.
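The "only the affected slot moves" property can be checked with a small hash ring. This is a sketch under assumptions: `HashRing`, the virtual-node count, and the use of SHA-256 are illustrative choices, not details from the card.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) points
        for server in servers:
            self.add_server(server)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_server(self, server):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def server_for(self, user_id):
        if not self._ring:
            raise ValueError("ring has no servers")
        # First ring point at or after the user's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(user_id),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["ws-1", "ws-2", "ws-3", "ws-4"])
users = [f"user-{i}" for i in range(1000)]

before = {u: ring.server_for(u) for u in users}
ring.remove_server("ws-3")  # simulate a crashed server
after = {u: ring.server_for(u) for u in users}

moved = [u for u in users if before[u] != after[u]]
print(f"{len(moved)} of {len(users)} users moved")
```

Every user that moved was previously on the removed server; users on the other three servers keep their assignment and their existing subscriptions.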
What is the subscription math for a WebSocket server and is it feasible?
?
Per WS server: ~50,000 active connections. Each user has ~400 friends → 400 Redis channel subscriptions. Per WS server: 50,000 × 400 = 20 million channel subscriptions. Total across all servers: 2,000 WS servers × 20M = 40 billion subscriptions (globally). But each subscription is just a Redis channel listener: a small amount of bookkeeping state in the Redis server. Redis Pub/Sub supports millions of channels and subscribers. The subscription cost is paid once at connection setup, not on every message. This is feasible; the bottleneck is actually message delivery throughput, not subscription count.
What happens when a user’s location key expires in Redis (TTL hits 60 seconds)?
?
When the TTL expires: (1) Redis automatically deletes the key. (2) The expiry is detected, either by the user’s WS server noticing that updates have stopped or by a listener on Redis keyspace notifications. (3) The detecting server publishes a "user_offline" sentinel to user:{id}:location. (4) All subscribing WS servers (friends’ servers) receive the notification and push a {type:"friend_offline", friend_id:"…"} message to their connected users. (5) Clients remove the friend from their Nearby Friends list. This handles app backgrounding, phone death, and network loss gracefully, with no manual cleanup needed.
What are the three communication protocol options and why WebSocket wins?
?
Option 1 - HTTP Polling: Client asks every N seconds. Problem: 20M req/sec for 100M users at 5s interval — mostly empty responses. High latency (up to N sec). Option 2 - Long Polling: Server holds request open until update available. Better than polling but: cannot broadcast to multiple users, high server memory for 100M held connections, HTTP overhead per cycle. Option 3 - WebSocket: Persistent bidirectional TCP connection. Server pushes instantly when data is available. One connection handles both directions (client sends location, server sends friend updates). No wasted requests. O(1) overhead per push. Only correct choice for this problem at this scale.
How do the data flows compare between Nearby Friends and Proximity Service?
?
Proximity Service (Yelp): User → HTTP GET request → LBS reads geohash index → returns list of businesses → done. One-time read. Data is static. Pull model. Nearby Friends: Bidirectional continuous stream. User → WebSocket → WS server → Redis SET (store location) + Redis PUBLISH (fan-out to friends). Friends’ WS servers → distance check → WebSocket push to connected friends. Continuous, both directions, push model. Key difference: proximity service is a snapshot query; Nearby Friends is an ongoing subscription with continuous state updates. This requires streaming infrastructure (WebSocket, Pub/Sub) rather than simple request/response.
How would you add location history to Nearby Friends without impacting the real-time path?
?
Keep the real-time path (WebSocket → Redis) unchanged for latency reasons. Add an async write side-channel: WebSocket server publishes location updates to a Kafka topic (“location_events”) in addition to Redis Pub/Sub. Kafka consumers write to: (1) TimescaleDB or InfluxDB (time-series DB for efficient time-based queries), or (2) Amazon S3 / data lake (for bulk analytics). User-facing history UI queries TimescaleDB directly (separate read path). This keeps the real-time publish path < 5ms while giving you full location history asynchronously. Redis location cache remains ephemeral (60s TTL) regardless.
What should your Nearby Friends architecture diagram show in a system design interview?
?
Must show: (1) Client connected via WebSocket to a WS server pool (labeled “stateful, ~50K conns each”). (2) Load balancer with sticky sessions in front of WS servers. (3) Redis Pub/Sub Cluster — labeled “one channel per user.” (4) Redis Location Cache — labeled “user:{id}:location, TTL 60s.” (5) User DB (PostgreSQL) — friend graph and privacy settings. (6) Arrows: client → WS server (location update), WS server → Redis SET + PUBLISH, Redis PUBLISH → all subscriber WS servers, WS server → client (friend_location push). Label “~3.3M writes/sec” on the write path and “fan-out: 400x, filtered to nearby only” on the Pub/Sub path.
Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH — Real-time + WebSocket + Pub/Sub fan-out is a hard interview category. Know the flow cold.
Last Updated: 2026-04-13