Chapter 1: Design a Proximity Service
volume2 proximity geospatial geohash quadtree
Status: π© Interview ready
Difficulty: Medium-Hard
Time to complete: 50 min read + practice
Overview
A proximity service finds nearby points of interest (businesses, restaurants, venues) given a userβs current location and a search radius. Think Yelp, Google Places, or the βrestaurants near meβ feature inside any map app.
Why this matters:
- Very common interview question (Uber, Yelp, Google, Meta frequently ask this)
- Combines database design, geospatial indexing, caching, and API design
- Teaches real-world geospatial techniques used in production systems
Problem Statement
Design a proximity service that:
- Accepts a userβs latitude/longitude and a search radius
- Returns a ranked list of businesses within that radius
- Handles 100M daily active users and 200M businesses
- Keeps search latency low (read-heavy, write-light workload)
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- What radii do we support? β Fixed options: 500m, 1km, 2km, 5km, 10km, 20km
- Is location data real-time? β Near real-time is fine; business moves rarely
- Do businesses update their own location? β Yes, but changes propagate with slight delay (acceptable)
- Return limit per query? β Default 20 results per page, support pagination
- What data is returned? β business_id, name, address, lat/lng, rating, thumbnail
Scope:
- Search businesses by location + radius
- CRUD for business owners to manage listings
- Read path and write path are separate concerns
Non-Functional Requirements
- Low latency: Search results in < 200ms
- High availability: Downtime loses searches β revenue impact
- High read throughput: Read-to-write ratio ~100:1
- Scalability: 100M DAU, 200M businesses
Capacity Estimation
Daily Active Users: 100M
Searches per user/day: ~5
Total daily searches: 500M
QPS (read): 500M / 86,400 sec β 5,800 QPS
Peak QPS (read): ~3x average β 17,400 QPS
Business writes (updates, new listings):
200M businesses / day (spread evenly) β very low write QPS (~200 writes/sec)
Read-to-write ratio: ~30:1 (heavily read-skewed)
Storage estimate:
200M businesses Γ ~1 KB per record β 200 GB (business metadata)
Geospatial index (geohash strings + IDs): additional ~50 GB
Total: ~250 GB β fits in a modest PostgreSQL cluster with replicas
Step 2: High-Level Design (10 min)
API Design
Search nearby businesses:
GET /v1/search?lat={lat}&lon={lon}&radius={radius}&business_type={type}&page={page}
Parameters:
lat float required User latitude (e.g., 37.7749)
lon float required User longitude (e.g., -122.4194)
radius int optional Meters, default 2000. Options: 500|1000|2000|5000|10000|20000
business_type string optional Filter: "restaurant"|"gas_station"|"hotel"
page int optional Pagination cursor, default 1
Response 200 OK:
{
"total": 142,
"businesses": [
{
"business_id": "biz_abc123",
"name": "Tartine Bakery",
"address": "600 Guerrero St, San Francisco, CA",
"lat": 37.7616,
"lon": -122.4241,
"rating": 4.7,
"distance_meters": 380,
"thumbnail_url": "https://cdn.example.com/biz_abc123.jpg"
},
...
],
"next_page": 2
}
Business CRUD (for business owners):
POST /v1/businesses Create a business listing
GET /v1/businesses/{business_id} Get business details
PUT /v1/businesses/{business_id} Update business (name, address, hours)
DELETE /v1/businesses/{business_id} Remove listing
Component Overview
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Clients β
β (Mobile / Web / Partners) β
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
ββββββββββΌβββββββββ
β Load Balancer β
ββββββββββ¬βββββββββ
βββββββββββββββ΄βββββββββββββ
β β
ββββββββββββΌβββββββββββ ββββββββββββΌβββββββββββ
β Location-Based β β Business Service β
β Service (LBS) β β (CRUD) β
β (read-only, β β (writes go here) β
β stateless) β ββββββββββββ¬βββββββββββ
ββββββββββββ¬βββββββββββ β
β β
ββββββββββββΌβββββββββββ ββββββββββββΌβββββββββββ
β Geospatial Index β β Business DB β
β (Redis / PostGIS) β β (PostgreSQL) β
β Read-heavy β β Write + Read β
βββββββββββββββββββββββ ββββββββββββ¬βββββββββββ
β
ββββββββββββΌβββββββββββ
β Read Replicas β
β (scale reads) β
βββββββββββββββββββββββ
Two-database approach:
| Database | Purpose | Characteristics |
|---|---|---|
| Business DB (PostgreSQL) | Business metadata: name, address, hours, phone | Write-heavy; normalized; ACID |
| Geospatial Index (Redis/PostGIS) | Spatial lookup: geohash β list of business_ids | Read-heavy; can tolerate eventual consistency |
Location-Based Service (LBS):
- Stateless β horizontally scalable
- No business logic, pure query layer
- Translates (lat, lng, radius) β geohash prefix β business_ids β fetch details
Step 3: Deep Dive (20 min)
Approach 1: Naive β Full Table Scan (Too Slow!)
SELECT business_id, name, lat, lng,
distance(lat, lng, :user_lat, :user_lng) AS dist
FROM businesses
WHERE distance(lat, lng, :user_lat, :user_lng) <= :radius
ORDER BY dist
LIMIT 20;Problem:
- Computes distance for every row β O(n)
- 200M rows β unusable even with parallelism
- No index on a computed column
Approach 2: Two-Dimensional Index (Better but Still Slow)
Add separate indexes on lat and lng:
CREATE INDEX idx_lat ON businesses(lat);
CREATE INDEX idx_lng ON businesses(lng);
-- Bounding box query (approximate)
SELECT * FROM businesses
WHERE lat BETWEEN :min_lat AND :max_lat
AND lng BETWEEN :min_lng AND :max_lng;Problem: Database can efficiently use ONE index at a time. Combining two range indexes (βindex mergeβ) is expensive. Still scans too many rows for dense areas.
Approach 3: Evenly Divided Grid
Divide the world into a fixed NΓN grid. Store each business with its grid cell ID. Query a cell + adjacent cells.
βββββββ¬ββββββ¬ββββββ¬ββββββ
β 0,3 β 1,3 β 2,3 β 3,3 β
βββββββΌββββββΌββββββΌββββββ€
β 0,2 β 1,2 β 2,2 β 3,2 β
βββββββΌββββββΌββββββΌββββββ€
β 0,1 β 1,1 β 2,1 β 3,1 β
βββββββΌββββββΌββββββΌββββββ€
β 0,0 β 1,0 β 2,0 β 3,0 β
βββββββ΄ββββββ΄ββββββ΄ββββββ
Problem: Uneven distribution. Manhattan has millions of businesses; Sahara desert has zero. Fixed grid is wasteful and inefficient for dense areas.
Approach 4: Geohash (Recommended for Interviews!)
Core idea: Encode the entire globe into a hierarchy of grid cells using a Base32 string. Longer string = smaller, more precise cell.
How geohash works:
- Interleave latitude bits and longitude bits alternately
- Encode the resulting bit string in Base32
- Result is a short alphanumeric string where common prefix = geographic proximity
Geohash precision table:
| Level | Length | Cell Size (approx) | Use Case |
|---|---|---|---|
| 1 | 1 char | 5,000 km Γ 5,000 km | Country level |
| 2 | 2 chars | 1,250 km Γ 625 km | Region |
| 3 | 3 chars | 156 km Γ 156 km | City |
| 4 | 4 chars | 39 km Γ 20 km | District |
| 5 | 5 chars | 4.9 km Γ 4.9 km | ~5km radius |
| 6 | 6 chars | 1.2 km Γ 0.6 km | ~1km radius |
| 7 | 7 chars | 152 m Γ 152 m | Street level |
Radius β geohash level mapping:
Radius β Geohash level
500m β level 6 (0.6km cell)
1km β level 6
2km β level 5 (4.9km cell)
5km β level 5
10km β level 4 (20km cell)
20km β level 4
Geohash example:
San Francisco (37.7749, -122.4194):
Level 6 geohash: "9q8yy"
Level 5 geohash: "9q8yy" β prefix "9q8y"
Level 4 geohash: prefix "9q8"
Nearby restaurant (37.7616, -122.4241):
Level 6 geohash: "9q8yw"
β Shares prefix "9q8y" at level 5 (same area!)
Query algorithm:
1. Convert user (lat, lng) to geohash at precision P
e.g., "9q8yy" at precision 6
2. Get 8 neighboring geohashes:
["9q8yz", "9q8yu", "9q8yv", "9q8yg",
"9q8ys", "9q8yt", "9q8yw", "9q8yx"]
3. Query Redis for all businesses in 9 cells (center + 8 neighbors):
ZRANGEBYLEX geohash_index "[9q8yy" "[9q8yy\xff"
(repeat for all 9 cells)
4. For each candidate business_id, fetch metadata from Business DB
5. Compute exact distance, filter to radius, sort by distance
6. Return top 20
Why 9 cells? Boundary problem: A user standing at the edge of geohash cell X might miss businesses in the adjacent cell Y even though they are very close. Querying all 8 neighbors guarantees complete coverage.
βββββββββββ¬ββββββββββ¬ββββββββββ
β NW nbr β N nbr β NE nbr β
β 9q8yg β 9q8yu β 9q8yv β
βββββββββββΌββββββββββΌββββββββββ€
β W nbr β CENTER β E nbr β
β 9q8ys β 9q8yy β 9q8yz β β Always query all 9!
βββββββββββΌββββββββββΌββββββββββ€
β SW nbr β S nbr β SE nbr β
β 9q8yt β 9q8yw β 9q8yx β
βββββββββββ΄ββββββββββ΄ββββββββββ
Redis data structure for geohash index:
Sorted Set: geohash_index
Key: geohash string (lexicographic)
Score: 0 (all same score β sorted lexicographically by member)
Member: "9q8yy:biz_abc123" (geohash:business_id)
# Insert a business:
ZADD geohash_index 0 "9q8yy:biz_abc123"
# Query all businesses with prefix "9q8yy":
ZRANGEBYLEX geohash_index "[9q8yy" "[9q8yy\xff"
β ["9q8yy:biz_abc123", "9q8yy:biz_def456", ...]
Alternative: Redis GEOADD/GEODIST/GEORADIUS (built-in geospatial):
GEOADD locations 37.7616 -122.4241 "biz_abc123"
GEORADIUS locations 37.7749 -122.4194 2 km ASC COUNT 20
- Redis internally uses geohash for storage
- GEORADIUS is convenient but has less fine-grained control
Geohash edge case β boundary split:
Problem: Two adjacent cells share a border. A business 10m across
the border from a user appears in a different geohash. Without
neighbor queries, it would be missed.
Solution: Always query 9 cells (center + 8 neighbors).
This adds a small overhead but ensures correctness.
Approach 5: QuadTree (Good for Custom Implementations)
Core idea: Recursively subdivide the world into quadrants. Split a cell when it contains more than a threshold number of businesses (e.g., 100).
World
βββ NW Quadrant (sparse) β leaf at level 2
βββ NE Quadrant
β βββ NW sub-quadrant (medium density) β leaf
β βββ NE sub-quadrant
β β βββ NW (dense, e.g., Manhattan) β leaf
β β βββ NE β leaf
β β βββ SW β leaf
β β βββ SE β leaf
β βββ SW β leaf
β βββ SE β leaf
βββ SW Quadrant β leaf
βββ SE Quadrant β ...
QuadTree properties:
| Property | Value |
|---|---|
| Max businesses per leaf | 100 (configurable threshold) |
| Build time | O(n log n) β each business inserted once per level |
| Space | O(n) |
| Query time | O(log n + k) where k = results returned |
| Update | Hard β must re-balance subtree |
Build the tree:
function buildQuadTree(businesses, bounds):
if len(businesses) <= MAX_LEAF_SIZE:
return LeafNode(businesses, bounds)
NW, NE, SW, SE = splitIntoBounds(bounds)
return InternalNode(
NW = buildQuadTree(filter(businesses, NW), NW),
NE = buildQuadTree(filter(businesses, NE), NE),
SW = buildQuadTree(filter(businesses, SW), SW),
SE = buildQuadTree(filter(businesses, SE), SE)
)
Query:
function search(node, point, radius):
if not overlaps(node.bounds, circle(point, radius)):
return []
if isLeaf(node):
return filter(node.businesses, distance <= radius)
return (search(node.NW, ...) + search(node.NE, ...) +
search(node.SW, ...) + search(node.SE, ...))
QuadTree pros and cons:
| Pros | Cons |
|---|---|
| Adapts to data density (more cells in NYC, fewer in Montana) | Hard to update incrementally (business moves = tree rebuild) |
| Efficient for irregular distributions | Must be rebuilt when data changes significantly |
| Entire tree fits in memory (~1-2 GB for 200M businesses) | Complex to implement correctly in an interview |
| No precision loss β exact geometry | Not as cache-friendly as geohash |
When to use QuadTree: When you control the full stack and have heavily non-uniform data distribution. Geohash is simpler for interviews.
Approach 6: Google S2 (Advanced, Know the Concept)
Core idea: Map the sphere (Earth) onto a cube, then use a Hilbert space-filling curve to assign a 1D integer to each 2D cell. The Hilbert curve has excellent locality β nearby cells have nearby integer IDs.
Sphere β Cube face projection β Divide face into grid β Hilbert curve ordering
Hilbert curve property:
Points that are geographically close β integers that are numerically close
This makes range queries on a 1D integer work for 2D geography!
S2 key properties:
- Cells at 30 levels of granularity (1 cmΒ² to entire Earth)
- Better area distortion than geohash (more uniform cell sizes)
- Used by Google Maps, Foursquare, PokΓ©mon GO
- More complex to implement from scratch β use as knowledge, not interview code
Algorithm Comparison
| Approach | Complexity | Pros | Cons | Use |
|---|---|---|---|---|
| Full scan | O(n) | Simple | Way too slow | Never |
| 2D index | O(n) effective | Easy | Index merge expensive | Never |
| Fixed grid | O(1) lookup | Simple | Uneven distribution | No |
| Geohash | O(log n) | Simple strings, Redis-native, easy prefix search | Boundary problem (solved by 9-cell query), cells not equal size | Yes β interview default |
| QuadTree | O(log n + k) | Adaptive density | Hard to update, complex to code | Yes β if custom infra |
| S2 | O(log n) | Best coverage, uniform cells | Very complex | Know concept only |
Business Data: Read/Write Separation
Write path (low frequency):
Business owner β Business Service β Primary DB (PostgreSQL)
β Invalidate / update geospatial index
Read path (high frequency):
User search β LBS β Redis Geospatial Index β business_ids
β Business Service β Read Replica β full business details
Database schema:
-- Business metadata table
CREATE TABLE businesses (
business_id VARCHAR(36) PRIMARY KEY,
owner_id VARCHAR(36) NOT NULL,
name VARCHAR(255) NOT NULL,
address TEXT,
lat DECIMAL(10, 8) NOT NULL,
lng DECIMAL(11, 8) NOT NULL,
category VARCHAR(50),
rating DECIMAL(3, 1),
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_businesses_category ON businesses(category);
CREATE INDEX idx_businesses_lat_lng ON businesses(lat, lng); -- for bounding box fallbackCaching Strategy
What to cache:
| Cache Layer | Key | Value | TTL |
|---|---|---|---|
| Geospatial index | Redis Sorted Set (geohash_index) | geohash:business_id pairs | No TTL β updated on write |
| Business details | biz:{business_id} | Full business JSON | 24 hours |
| Search results | search:{geohash6}:{category} | List of business_ids | 10 minutes |
| User location | user:{user_id}:geohash | Current geohash of user | 5 minutes |
Cache invalidation:
When a business updates location:
1. Update primary DB
2. Remove old geohash entry from Redis: ZREM geohash_index "old_hash:biz_id"
3. Add new geohash entry: ZADD geohash_index 0 "new_hash:biz_id"
4. Invalidate business detail cache: DEL biz:{business_id}
5. Invalidate affected search caches: DEL search:{old_geohash6}:*
DEL search:{new_geohash6}:*
Handling Scale
LBS horizontal scaling:
LBS is stateless β add more instances behind load balancer
βββββββββββββββββββββββββββββββββββββββββ
β Load Balancer β
ββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββββ
β β β
ββββΌβββ ββββΌβββ ββββΌβββ
βLBS 1β βLBS 2β βLBS 3β ... (add more on demand)
ββββ¬βββ ββββ¬βββ ββββ¬βββ
ββββββββββββ΄βββββββββββ
β
ββββββββββΌβββββββββ
β Redis Cluster β (sharded by geohash prefix)
βββββββββββββββββββ
Redis sharding for geospatial index:
Shard by first character of geohash (16 possible first chars in Base32):
Shard 0: geohash prefix "0" - "3"
Shard 1: geohash prefix "4" - "7"
Shard 2: geohash prefix "8" - "b"
Shard 3: geohash prefix "c" - "z"
This distributes global queries evenly across shards.
Read replica strategy:
Primary DB (writes) β replicate β Read Replica 1
β Read Replica 2
β Read Replica 3
LBS always reads from replicas.
Business Service writes to primary only.
Replication lag < 1 second (acceptable for near-real-time updates).
Design Summary
Final Architecture
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Client (Mobile/Web) β
ββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββ
β GET /v1/search?lat=&lon=&radius=
ββββββββββΌβββββββββ
β Load Balancer β
βββββββββ¬ββββββββββ
ββββββββββββββββββ΄ββββββββββββββββββ
β Search Path β Write Path
ββββββββββββΌβββββββββββ βββββββββββββββΌβββββββββββββββ
β Location-Based β β Business Service β
β Service (LBS) β β (CRUD, Owner API) β
β [Stateless, N pods]β βββββββββββββββ¬βββββββββββββββ
ββββββββββββ¬βββββββββββ β
β β read/write
ββββββββββββΌβββββββββββ βββββββββββββββΌβββββββββββββββ
β Redis Cluster β β PostgreSQL Primary β
β Geospatial Index ββββββββββββ (business metadata) β
β (geohash β biz_id) β sync βββββββββββββββ¬βββββββββββββββ
β β β replicate
β Also: detail cache β βββββββββββββββΌβββββββββββββββ
βββββββββββββββββββββββ β Read Replicas (Γ3) β
β (LBS reads details here) β
ββββββββββββββββββββββββββββββ
Key Decisions Summary
| Decision | Choice | Reasoning |
|---|---|---|
| Geospatial algorithm | Geohash | Simple strings, Redis ZRANGEBYLEX, easy neighbor lookup |
| Geospatial storage | Redis Sorted Set | Fast prefix scans, in-memory, horizontally scalable |
| Business metadata | PostgreSQL + Read Replicas | ACID for writes, scale reads horizontally |
| LBS design | Stateless microservice | Horizontal scaling for high read QPS (5,800 peak) |
| Cache | Redis (detail + search result cache) | Sub-millisecond reads, TTL auto-expiry |
| Update strategy | Write-through + cache invalidation | Eventual consistency acceptable (near real-time) |
| Neighbor search | 9-cell query (center + 8 neighbors) | Solves geohash boundary problem |
Interview Questions & Answers
Q: Why geohash over a simple (lat, lng) index in the database?
A: A composite (lat, lng) B-tree index only helps with one dimension at a time. For a range query spanning both dimensions, the DB must merge two range scans, which is expensive. Geohash encodes 2D space into a 1D string where proximity is preserved β nearby locations share a common prefix. This turns a 2D spatial query into a simple string prefix scan, which Redis handles very efficiently with ZRANGEBYLEX.
Q: What is the geohash boundary problem and how do you solve it?
A: A user standing at the edge of geohash cell β9q8yyβ may be very close to a business in the adjacent cell β9q8yzβ, but a naive query on only β9q8yyβ would miss that business. The fix is to always query 9 cells: the userβs own cell plus all 8 immediate neighbors. After collecting all candidate business IDs, compute exact distance and filter to the requested radius. This adds a constant overhead (9Γ the work) but guarantees no nearby business is missed.
Q: Geohash vs QuadTree β when would you choose each?
A: Geohash is simpler to implement, maps naturally to Redis strings, and is the right choice for most interviews and most production systems. QuadTree adapts better to highly non-uniform data distribution (e.g., ultra-dense urban areas vs sparse rural regions) and can be more memory-efficient in those cases. However QuadTree is harder to update incrementally β moving a business requires a subtree rebuild. Use geohash unless you have a strong reason to need the density adaptivity of a QuadTree.
Q: How do you handle a business that moves (e.g., a food truck)?
A: The update pipeline must be atomic in practice: (1) Write new lat/lng to the primary DB. (2) Remove the old geohash entry from the Redis sorted set. (3) Add the new geohash entry. (4) Invalidate any search-result caches that covered the old and new geohash cells. Replication lag to read replicas is acceptable (< 1 second). Users might get stale results for a few seconds, which is fine for a proximity service.
Q: How would you scale this to 10Γ more users?
A: The LBS is already stateless β horizontally scale it by adding more pods. Redis can be clustered and sharded by geohash prefix. PostgreSQL read replicas handle the read volume. The only global bottleneck is the Redis geospatial index β if needed, partition it into regional clusters (US, EU, Asia) since users typically search nearby businesses and cross-region queries are rare.
Key Takeaways
- Geohash encodes 2D location as a 1D string β geographic proximity maps to string prefix proximity. This is the key insight enabling fast prefix-range searches.
- Always query 9 cells (center + 8 neighbors) to avoid missing businesses near cell boundaries.
- Separate read and write paths: LBS (reads) β Redis + Read Replicas; Business Service (writes) β Primary DB β sync geospatial index.
- LBS is stateless β trivially horizontally scalable to handle peak QPS (~17,400).
- Redis ZRANGEBYLEX is the natural fit for geohash prefix queries in production; Redis GEORADIUS is the higher-level convenience API.
- QuadTree is adaptive to density but harder to update; geohash is fixed-grid but simpler and battle-tested.
- Cache aggressively: business details (24h TTL), search results per geohash cell (10 min TTL), invalidate on any business update.
Related Resources
- ch02-nearby-friends β Real-time location tracking (contrast: query-based vs push-based)
- ch05-consistent-hashing β Used for Redis cluster sharding
- key-patterns > Geospatial Indexing β Geohash and QuadTree pattern reference
- distributed-system-components β Redis, PostgreSQL, Load Balancers
Last Updated: 2026-04-13
Status: π© Interview ready β Geohash is the go-to answer; know the boundary problem cold!