Chapter 1 Flashcards — Proximity Service (Vol. 2)

flashcards volume2 proximity geospatial geohash quadtree

What is a proximity service and name two real-world examples?
?
A proximity service returns points of interest (businesses, venues) within a given radius of the user’s current location. Real-world examples: Yelp (find restaurants nearby), Google Places (find businesses near a map pin), Uber (find nearby drivers), Pokémon GO (find nearby Pokémon stops). Core inputs: (lat, lng, radius). Core output: ranked list of matching places.

What are the scale requirements for a proximity service at 100M DAU?
?
100M DAU × 5 searches/day / 86,400 seconds = ~5,800 QPS average. Peak QPS ≈ 3× average = ~17,400 QPS. 200M businesses × ~1 KB/record ≈ 200 GB business metadata. Read-to-write ratio is ~30:1 (heavily read-skewed — businesses rarely move, users search constantly). These numbers justify separate read/write paths and a dedicated geospatial index.

Why can’t you just put a B-tree index on lat and lng columns for proximity search?
?
A B-tree index on (lat, lng) only helps with one dimension at a time. A bounding-box query like “lat BETWEEN a AND b AND lng BETWEEN c AND d” requires merging two range scans (index merge), which is expensive and still scans a large fraction of the table in dense areas. The root problem: 2D spatial data cannot be sorted along one 1D B-tree axis and preserve both dimensions. You need a spatial index: geohash, QuadTree, or PostGIS.

What is geohash and what is its core property that makes it useful for proximity search?
?
Geohash encodes any (lat, lng) coordinate as a short alphanumeric string by interleaving latitude bits and longitude bits, then encoding in Base32. Core property: geographic proximity maps to string prefix proximity. Two nearby locations share a common prefix (e.g., both in San Francisco start with “9q8y”). This converts a 2D range query into a 1D string prefix scan — something databases and Redis handle extremely efficiently.

What are the geohash precision levels and corresponding cell sizes you must know?
?
Level 4 (4 chars): ~39 km × 20 km — district/county level. Level 5 (5 chars): ~4.9 km × 4.9 km — city neighborhood (~5km radius queries). Level 6 (6 chars): ~1.2 km × 0.6 km — ~1km radius queries. Level 7 (7 chars): ~152 m × 152 m — street level. Rule of thumb: pick the precision whose cell is ~1–2× the search radius. For 2km radius → level 5 (~4.9km cell). For 500m radius → level 6 (~1.2km cell).

How do you map a search radius to the correct geohash precision level?
?
Pick the level where the cell size is just slightly larger than your search radius, so one cell likely contains all relevant results (plus neighbors for safety). Mapping: 500m → level 6 (cell ~600m), 1km → level 6, 2km → level 5 (cell ~4.9km), 5km → level 5, 10km → level 4 (cell ~20km), 20km → level 4. Always verify by querying 9 cells (center + 8 neighbors) regardless of precision chosen.

What is the geohash boundary problem and how do you solve it?
?
The boundary problem: A user standing at the edge of geohash cell X may be very close to a business in the adjacent cell Y. A query on X alone misses that business even though it is within the search radius. Solution: Always query 9 cells — the center cell (user’s geohash) plus all 8 immediate neighbors (N, NE, E, SE, S, SW, W, NW). Collect all candidate business IDs from all 9 cells, then compute exact Haversine distance and filter to the requested radius.

Draw the 9-cell neighbor query pattern for geohash.
?

┌─────────┬─────────┬─────────┐
│ NW nbr  │  N nbr  │ NE nbr  │
├─────────┼─────────┼─────────┤
│  W nbr  │  CENTER │  E nbr  │  ← Always query all 9 cells
├─────────┼─────────┼─────────┤
│ SW nbr  │  S nbr  │ SE nbr  │
└─────────┴─────────┴─────────┘

Center = user’s geohash at chosen precision. After collecting IDs from all 9 cells, filter candidates by exact distance <= radius, then sort and return top N results.

How do you store the geospatial index in Redis using geohash?
?
Use a Redis Sorted Set with all scores set to 0, making it sorted lexicographically by member. Member format: “geohash:business_id” (e.g., “9q8yy5:biz_abc123”). Insert: ZADD geohash_index 0 “9q8yy5:biz_abc123”. Query prefix: ZRANGEBYLEX geohash_index “[9q8yy5” “[9q8yy5\xff” — returns all members starting with “9q8yy5”. This is an O(log n + k) lexicographic range scan. Repeat for all 9 geohash cells.

What is Redis GEORADIUS and how does it relate to geohash?
?
GEORADIUS (now GEOSEARCH in Redis 6.2+) is a built-in Redis geospatial command. Usage: GEORADIUS locations 37.77 -122.42 2 km ASC COUNT 20. Internally Redis stores the geohash as the Sorted Set score (encodes lat/lng into a 52-bit integer using geohash). GEORADIUS is convenient but gives less fine-grained control than manual geohash prefix queries. Use GEORADIUS for simple cases; use manual ZRANGEBYLEX for custom precision or multi-cell logic.

Describe the two-database design for a proximity service.
?
Business DB (PostgreSQL): Stores full business metadata (name, address, hours, phone, category, rating). Write-heavy — business owners update listings. ACID guarantees for correctness. Add read replicas for read scaling. Geospatial Index (Redis Sorted Set): Maps geohash → list of business_ids. Read-heavy — every search query hits this. In-memory for sub-millisecond lookups. Updated asynchronously when business metadata changes. The LBS reads from Redis to get IDs, then fetches details from PostgreSQL read replicas.

What does LBS (Location-Based Service) do and why is it stateless?
?
LBS is the read-only service that handles proximity queries. It: (1) Converts user (lat, lng, radius) to the appropriate geohash precision, (2) Computes the 9 neighboring geohash cells, (3) Queries Redis for all business_ids in those cells, (4) Fetches business details from PostgreSQL read replicas, (5) Filters by exact distance and returns sorted results. It is stateless because it holds no session data or counters — all state lives in Redis and PostgreSQL. Stateless = trivially horizontally scalable (add pods behind a load balancer).

Why use two separate services (LBS for reads, Business Service for writes) instead of one monolith?
?
Read and write workloads have completely different characteristics. Reads: 5,800 QPS, latency-sensitive, can tolerate eventual consistency, stateless → need horizontal scale. Writes: ~200 writes/sec, need ACID consistency, require geospatial index invalidation → need careful orchestration. Separating them lets each scale independently, use different infrastructure, and be deployed/updated without affecting the other. This is the standard read/write separation pattern in high-traffic systems.

What is a QuadTree and how does it handle non-uniform data distribution?
?
A QuadTree recursively subdivides a 2D area into 4 quadrants (NW, NE, SW, SE). A node is split when it contains more businesses than a threshold (e.g., 100). Dense areas (Manhattan) get many small cells; sparse areas (Montana) stay as large cells. This adapts naturally to data distribution. Build time: O(n log n). Query time: O(log n + k). The entire tree for 200M businesses fits in ~1-2 GB of memory. Downside: hard to update incrementally when businesses are added/moved.

QuadTree vs Geohash — which do you recommend in a system design interview and why?
?
Recommend Geohash for interviews. Geohash is simpler to explain, maps directly to Redis strings, and the 9-cell neighbor query is easy to draw on a whiteboard. It is battle-tested in production (Foursquare, Yelp, many others use it). QuadTree is better for adaptive density handling and custom implementations but is complex to implement correctly and hard to update incrementally. Mention QuadTree as a trade-off you know, but default to geohash unless the interviewer specifically probes density adaptivity.

What is Google S2 and when would you mention it in an interview?
?
S2 is Google’s geospatial library that projects Earth’s sphere onto a cube, then applies a Hilbert space-filling curve to assign a 1D integer to each 2D cell. Hilbert curve property: geographically adjacent cells have numerically close integers, enabling efficient 1D range queries for 2D proximity. S2 has better area uniformity than geohash (cells are more equal in size). Used by Google Maps, Foursquare, Pokémon GO. Mention it when asked about alternatives to geohash or when discussing Google’s approach, but do not claim you would implement it from scratch.

How does the caching strategy work for proximity queries?
?
Three cache layers: (1) Geospatial index: Redis Sorted Set (always current, updated on write — no TTL). (2) Business detail cache: Redis hash keyed by business_id, TTL 24 hours. (3) Search result cache: Redis list keyed by (geohash_cell + category), TTL 10 minutes. On a business update: remove old geohash entry from sorted set, add new one, DEL detail cache, DEL affected search caches. User location-to-geohash: cache with 5-minute TTL to avoid recomputing for rapid consecutive searches.

What is the write path when a business owner updates their business location?
?

Business Service receives PUT /v1/businesses/{id} with new lat/lng. 2. Write new lat/lng to PostgreSQL primary. 3. Compute old geohash (from previous record) and new geohash at all precision levels. 4. Atomically in Redis: ZREM geohash_index “old_hash:biz_id” and ZADD geohash_index 0 “new_hash:biz_id”. 5. Invalidate business detail cache: DEL biz:{business_id}. 6. Invalidate search caches for affected cells. 7. Replication lag to read replicas (<1 sec) is acceptable — near real-time is fine for business location changes.

How would you handle a food truck that updates its location every 5 minutes?
?
Treat it like any business update but at higher frequency. Key considerations: (1) Debounce rapid updates — do not write to DB for every GPS ping; batch or rate-limit to one DB write per 5 minutes. (2) Update Redis geospatial index on each meaningful location change. (3) Cache TTL for search results in affected cells should be short (< 5 minutes) so users see current location quickly. (4) If the truck moves frequently, consider a separate “live locations” Redis hash with shorter TTL rather than modifying the main geospatial index on every update.

What QPS does the proximity service read path need to handle and how do you scale it?
?
Average QPS: ~5,800. Peak QPS: ~17,400 (3× average). Scaling approach: (1) LBS is stateless → scale horizontally with more pods, auto-scaled based on CPU/latency. (2) Redis cluster sharded by geohash prefix → distributes index lookups. (3) PostgreSQL read replicas (3-5) → distribute detail fetches. (4) CDN / edge caching for repeated searches in popular areas. (5) Search result caches (10 min TTL per geohash cell) drastically cut Redis + DB hits for popular areas (e.g., Times Square searches hit cache almost always).

How do you shard the Redis geospatial index?
?
Shard by the first character of the geohash string. Base32 geohash uses characters: 0-9 and b,c,d,e,f,g,h,j,k,m,n,p,q,r,s,t,u,v,w,x,y,z (32 chars). Divide them across N shards. Example for 4 shards: Shard 0: “0”-“7”, Shard 1: “8”-“f”, Shard 2: “g”-“q”, Shard 3: “r”-“z”. A search for 9 cells always hashes to the same shard(s), making multi-cell queries predictable. This gives even geographic distribution since geohash prefixes are globally distributed.

What is the difference between a proximity service and nearby friends (Facebook)?
?
Proximity service (Yelp): Query-based, pull model. User actively searches “show me restaurants near me.” Businesses are static (update rarely). Latency requirement ~200ms. Geospatial index is pre-built. Nearby Friends (Facebook): Event-driven, push model. Friends’ locations update continuously (every 30 sec). System must fan-out location updates to all of a user’s friends in real-time. Requires WebSocket for live updates and Pub/Sub for fan-out. Much harder fan-out problem (400 friends × 3.3M updates/sec = 14M fan-out ops/sec).

How does PostGIS compare to Redis for geospatial indexing?
?
PostGIS (PostgreSQL extension): Full spatial SQL queries (ST_DWithin, ST_Distance), ACID transactions, supports complex geometry (polygons, lines), excellent for write-heavy or complex geo queries. Redis Geospatial: In-memory (microsecond latency), horizontal scaling is easier, ideal for read-heavy high-QPS proximity lookups. In the proximity service, Redis is preferred for the geospatial index (hot read path) while PostgreSQL handles business metadata. Use PostGIS when you need transactional correctness for geo updates or complex polygon queries.

What are the 5 approaches to proximity search in order of sophistication?
?

Full table scan: O(n) distance computation on all businesses — unusable at scale. 2. 2D B-tree index on lat + lng: index merge is expensive, still O(n) effective. 3. Fixed grid: assign grid cell IDs — simple but uneven distribution in real world. 4. Geohash: 1D string encoding of 2D space, prefix search in Redis — interview default. 5. QuadTree: adaptive recursive subdivision, adapts to density — custom infra. Bonus: Google S2 (Hilbert curve on sphere projection) — used by Google, most sophisticated.

What should your proximity service architecture diagram show in an interview?
?
Show these components and data flows: Client → Load Balancer → LBS (stateless, multiple pods). LBS → Redis Cluster (geospatial index, ZRANGEBYLEX on 9 geohash cells). LBS → PostgreSQL Read Replicas (fetch business details by ID). Client → Load Balancer → Business Service (CRUD). Business Service → PostgreSQL Primary (write business metadata). Business Service → Redis (update geospatial index on create/update/delete). Optionally show: CDN for static assets, a search result cache layer, monitoring/alerts. Label read vs write paths clearly.

Total Cards: 25
Review Time: 20-25 minutes
Priority: HIGH — Geospatial indexing is asked at top-tier companies. Know geohash cold.
Last Updated: 2026-04-13

Study Notes by Niladri & AI

Explorer

vol2-ch01-proximity-service

Chapter 1 Flashcards — Proximity Service (Vol. 2)

Graph View