Chapter 3: Design Google Maps
volume2 google-maps routing graph navigation
Status: π© Interview ready
Difficulty: Very Hard
Time to complete: 60 min read + practice
Overview
Google Maps is a real-time mapping and navigation platform that serves 1 billion users daily. It provides turn-by-turn navigation, real-time traffic updates, map tile rendering, ETA estimation, and location search across 5 million km of roads worldwide.
Why this matters:
- Rare but high-signal interview question at Google, Uber, Lyft, Meta
- Tests knowledge of graph algorithms, distributed storage, CDN strategy, and real-time data pipelines
- Combines multiple hard sub-problems: routing, tile serving, traffic aggregation
Problem Statement
Design a system that:
- Navigates users from point A to point B with accurate turn-by-turn directions
- Incorporates real-time traffic data into route and ETA calculations
- Renders map tiles at any zoom level, worldwide
- Supports 1 billion DAU with low latency
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- Navigation only, or also search/discovery? β Navigation + location search
- Mobile and web? β Yes, both
- Real-time traffic? β Yes, critical feature
- Offline support? β Out of scope for interview
- Public transit, walking, cycling? β Focus on driving navigation
- How fresh is traffic data? β Near real-time (seconds to minutes)
Scope:
- Turn-by-turn navigation (shortest + fastest path)
- Real-time traffic integration
- Map rendering at multiple zoom levels
- ETA estimation
- Location/address search
Non-Functional Requirements
- Scale: 1 billion DAU
- Road data: 5 million km of roads globally
- Map storage: 50 PB of raw map data; each tile ~100 KB
- Accuracy: Routes must be correct; ETAs within acceptable margin
- Low latency: Map tiles render quickly; route calculation < 2 sec
- High availability: 99.99% uptime (people depend on navigation)
- Traffic updates: Near real-time (aggregate and apply within ~1 min)
Scale Estimates
DAU: 1 billion users
Average active navigation: 35 min/session
Location updates (GPS): Every 15 sec while navigating
Location events/sec: ~1B users Γ 10% navigating Γ (1/15 sec)
β 6.7 million GPS events/sec
Map tiles:
Zoom levels: 0β22 (level 0 = whole world, level 22 = building)
Tiles at zoom 21: 2^21 Γ 2^21 = ~4 trillion tiles
But only ~10% of tiles are land (usable area)
Storage: 50 PB raw image data
Road graph:
5M km roads β ~100M nodes (intersections)
~200M edges (road segments)
Step 2: High-Level Design (10 min)
Core Components
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Client (Mobile/Web) β
βββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ¬ββββββββββββββ
β Route Request β Location Updates
βΌ βΌ
ββββββββββββββββββββββ βββββββββββββββββββββββββββ
β Navigation Service β β Location Service β
β (Route Calculation)β β (GPS ingestion) β
ββββββββββββ¬ββββββββββ ββββββββββββ¬βββββββββββββββ
β β
β Graph query β Kafka stream
βΌ βΌ
ββββββββββββββββββββββ βββββββββββββββββββββββββββ
β Road Graph Store β β Traffic Processor β
β (Object Storage / β β (Speed aggregation β
β Graph DB) βββββββββ per road segment) β
ββββββββββββββββββββββ βββββββββββββββββββββββββββ
ββββββββββββββββββββββ
β Map Tile Service βββββ CDN ββββ Client
β (Tile generation) β
ββββββββββββββββββββββ
Three Main Services
| Service | Responsibility | Key Data |
|---|---|---|
| Location Service | Ingest GPS updates from users | Kafka topic per region |
| Navigation Service | Compute routes, ETAs | Road graph, traffic weights |
| Map Rendering Service | Serve pre-rendered or vector tiles | Tile storage + CDN |
Data Stores
Road Graph: Object storage (S3 / GCS) β adjacency lists by region
Map Tiles: Object storage + CDN (static, rarely updated)
Traffic Data: Time-series DB or Redis β speed per road segment
User Location: Kafka (stream) β traffic processor (ephemeral)
ETA History: Data warehouse for ML model training
Step 3: Deep Dive (30 min)
3.1 Map Tile System
World divided into tiles using a quadtree-like grid:
Zoom Level 0: 1 tile (entire world 256Γ256 px)
Zoom Level 1: 4 tiles (2Γ2 grid)
Zoom Level 2: 16 tiles (4Γ4 grid)
...
Zoom Level z: 4^z tiles
Each tile = 256Γ256 pixels
Each tile URL: /tiles/{z}/{x}/{y}.png
Tile URL structure:
https://maps.cdn.example.com/tiles/15/16384/10900.png
^z ^x ^y
z = zoom level (15 = neighborhood level)
x = tile column
y = tile row
Zoom level reference:
Level 0: World (1 tile)
Level 5: Continent
Level 10: City
Level 14: Neighborhood
Level 17: Street level
Level 20: Building detail
Level 22: Room level
Tile caching strategy:
Client Request for tile z/x/y
β
βΌ
CDN Edge Node ββββ HIT βββ Return tile (< 10 ms)
β MISS
βΌ
CDN Origin (Object Storage: S3/GCS)
β
βΌ
Pre-rendered tile file (100 KB)
Cache-Control: max-age=604800 (7 days)
Why CDN works perfectly for tiles:
- Tiles are static (road geometry rarely changes)
- Same tile served to millions of users at same zoom/location
- Immutable URLs (tile content at z/x/y is always the same version)
- Cache-hit ratio > 99% for popular areas
Vector Tiles vs Raster Tiles:
| Aspect | Raster Tiles (PNG/JPG) | Vector Tiles (PBF/JSON) |
|---|---|---|
| Rendering | Server-rendered image | Client-rendered geometry |
| File size | ~100 KB per tile | ~10β50 KB per tile |
| Zoom smoothness | Pixelates between levels | Smooth (infinite zoom) |
| Customization | Fixed style | Style at runtime |
| Client CPU | Low | Higher (WebGL/Canvas) |
| Offline | Easy to cache | Easy to cache + re-style |
| Used by | Legacy systems | Google Maps, Mapbox |
Google Maps uses vector tiles on modern clients for smooth rendering and smaller bandwidth.
3.2 Road Graph for Routing
Graph representation:
Nodes = Road intersections (lat/lng coordinates)
Edges = Road segments between intersections
Weight = Travel time in seconds (NOT distance)
β Updated dynamically with traffic data
Example graph:
A ββ(5s)ββ B ββ(3s)ββ C
β β
(8s) (4s)
β β
D ββ(2s)ββ E ββ(6s)ββ F
Nodes: A, B, C, D, E, F
Edges with weights (travel time):
AβB: 5s BβC: 3s AβD: 8s
BβE: 4s DβE: 2s EβF: 6s
Storage format β adjacency list:
{
"node_A": {
"lat": 37.7749, "lng": -122.4194,
"edges": [
{"to": "node_B", "base_weight_s": 5, "road_type": "primary"},
{"to": "node_D", "base_weight_s": 8, "road_type": "secondary"}
]
},
"node_B": {
"lat": 37.7755, "lng": -122.4180,
"edges": [
{"to": "node_C", "base_weight_s": 3, "road_type": "primary"},
{"to": "node_E", "base_weight_s": 4, "road_type": "primary"}
]
}
}Stored in object storage (S3/GCS):
- Partitioned by geographic region (tiles map to graph tiles)
- Graph tile at zoom ~6 covers one city region
- Only load relevant regions during route calculation
- ~200 GB per major metro areaβs compressed graph data
3.3 Routing Algorithms
Naive Dijkstraβs β Too Slow for Global Scale
Dijkstra's time complexity: O((V + E) log V)
Global road network:
V = 100 million nodes (intersections)
E = 200 million edges (road segments)
Computation: ~300M Γ log(100M) β 300M Γ 27 β 8 billion operations
Time: Several seconds to minutes per query β Unacceptable!
Why itβs too slow:
New York β Los Angeles route:
Dijkstra explores ALL nodes closer than destination
Must explore millions of nodes across entire continent
Cannot complete in < 2 second requirement
Algorithm 1: A* (A-Star) β Better with Heuristic
Key idea: Guide search with a heuristic (straight-line distance to goal)
f(n) = g(n) + h(n)
g(n) = actual cost from start to node n
h(n) = heuristic estimate from n to goal (straight-line distance)
Priority queue orders by f(n), not just g(n)
β Explores nodes "toward" the destination first
Visual comparison:
Dijkstra: A*:
Explores in all directions Explores mostly toward goal
S S
βββββ βββ
βββββββββ βββββ
βββββββGββ ββββG
βββββββββ β
βββββ
β = explored nodes Much fewer explored nodes!
Speedup: 2β10x faster than Dijkstra on road networks
Limitation: Still explores too many nodes for continent-wide routing
Algorithm 2: Contraction Hierarchies (CH) β Production Standard
Key idea: Pre-process graph by βcontractingβ (removing) unimportant nodes and adding shortcut edges. Query uses bidirectional search on the contracted graph.
Pre-processing phase:
Step 1: Rank all nodes by "importance"
Importance = how many shortcut edges would be needed if this node removed
Low-importance nodes: Dead-end streets, local roads
High-importance nodes: Highway junctions, major intersections
Step 2: Contract nodes from least to most important
When contracting node v:
For each pair of neighbors (u, w) of v:
If shortest path uβw goes through v:
Add shortcut edge uβw with weight dist(u,v) + dist(v,w)
Step 3: Result: Hierarchical graph
Level 0: Local streets
Level 1: Primary roads
Level 2: Highways
Level 3: Major interstates / motorways
Query phase (bidirectional):
Forward search from source S: Only upward in hierarchy
Backward search from goal G: Only upward in hierarchy
[Motorway nodes - top level]
/ \
[Highway nodes] [Highway nodes]
/ \
[S] β β β β β β βmeeting pointβ β β β [G]
Each direction only explores ~1000 nodes vs millions!
Speedup: 1000β10,000x faster than naive Dijkstra
Used by: OSRM, Graphhopper, HERE Maps, most production GPS systems
Pre-computation time: Hours to days for global graph (done offline, periodically)
Algorithm 3: Bidirectional Dijkstra
Standard Dijkstra: Bidirectional Dijkstra:
S βββββββββββββββΊ G S βββββββΊ βββββββ G
Explores radius r Each explores radius r/2
Total: 2 Γ Ο(r/2)Β² << Ο rΒ²
Area: Ο rΒ² ~4x fewer nodes explored
Simple and practical, often used as first-pass filter with CH.
Algorithm Comparison
| Algorithm | Time Complexity | Pre-computation | Speedup vs Dijkstra | Used For |
|---|---|---|---|---|
| Dijkstra | O((V+E) log V) | None | 1x | Small graphs |
| A* | O((V+E) log V) | None | 2β10x | Medium graphs |
| Bidirectional Dijkstra | O((V+E) log V) | None | ~4x | Simple speedup |
| Contraction Hierarchies | O(V^0.5) query | Hours (offline) | 1000β10000x | Production GPS |
Recommendation for interview: Mention Dijkstra first, explain why itβs too slow, then propose Contraction Hierarchies as the production solution. Show you know A* as an intermediate.
3.4 Real-Time Traffic Integration
How Google Maps collects traffic data:
Passive probe model (anonymized):
[User navigating] ββGPS update every 15sβββΊ Location Service
β
βΌ
Anonymize
(strip user ID)
β
βΌ
Kafka Topic
(location events)
β
βΌ
Traffic Processor
(aggregate speed per segment)
β
βΌ
Road segment speed table
segment_id β avg_speed_kmh
β
βΌ
Update graph edge weights
weight_s = segment_length_m / speed_m_s
Traffic data pipeline:
GPS Events (6.7M/sec)
β
βΌ
Kafka (partitioned by road segment ID)
β
βΌ
Stream Processor (Flink/Spark Streaming)
- Group events by road segment
- Calculate median speed over 1-minute window
- Filter outliers (GPS errors, stationary users)
β
βΌ
Traffic DB (Redis / time-series DB)
Key: segment_id:timestamp_bucket
Val: avg_speed_kmh, sample_count, confidence
β
βΌ
Graph weight updater (runs every 1β5 min)
Reads updated speeds β Recalculates edge weights β Writes back to graph
Speed β Edge weight conversion:
def compute_edge_weight(segment_length_m, observed_speed_kmh, free_flow_speed_kmh):
# Use observed speed if traffic data available (high confidence)
# Fall back to free flow speed (posted speed limit) otherwise
effective_speed = observed_speed_kmh if confidence > threshold else free_flow_speed_kmh
speed_m_s = effective_speed * 1000 / 3600
travel_time_s = segment_length_m / speed_m_s
return travel_time_sTraffic data freshness:
Data latency: GPS β edge weight update β 1β3 minutes
Acceptable because: Traffic conditions change slowly (minutes, not seconds)
Exception: Accidents/incidents β pushed via separate incident feed
3.5 ETA Calculation
Simple ETA:
ETA = Sum of edge weights along route (in seconds)
= Sum(segment_length / current_speed) for all segments
Better ETA with ML:
Features:
- Current traffic speed per segment (real-time)
- Historical speed at same time/day/segment
- Day of week, time of day, holidays
- Weather (rain slows traffic ~10β20%)
- Incidents (accidents, road closures)
- Route type (highway vs city streets)
Model:
- Gradient boosted trees or neural network
- Trained on historical route + actual travel time pairs
- Recalculated every few minutes as route progresses
Accuracy target: Β± 10% of actual travel time for 90% of routes
3.6 Map Partitioning for Scalability
Challenge: 5M km of roads, 100M nodes β canβt load entire graph in memory
Solution: Partition by geographic tiles:
World map divided into routing tiles (zoom level ~6)
Each routing tile: ~100 km Γ 100 km area
Stored as: graph_tile_{z}_{x}_{y}.bin in object storage
Route from NYC to LA:
1. Find all tiles along approximate path (bounding box)
2. Load only those tiles (~20 tiles for cross-country)
3. Run CH query on loaded subgraph
4. If path crosses tile boundary β load neighboring tile
Total tiles for global map: ~10,000 tiles at routing zoom level
Average tile size: ~20 MB compressed
Total: ~200 GB for global routing graph (fits in cluster memory)
Tile boundary connections:
Tile A βββββββββββββ Tile B
[node_99] ββββ [node_100]
boundary edge stored in both tiles
3.7 Location Update Service Architecture
Mobile Client (navigating)
β GPS update (lat, lng, speed, heading, accuracy)
β WebSocket (persistent connection for 2-way comms)
βΌ
Load Balancer
β
βΌ
Location Service (stateless, horizontally scalable)
β Validates, anonymizes, rate-limits
βΌ
Kafka (partitioned by geohash of location)
β
ββββΊ Traffic Processor (aggregate speed per segment)
β
ββββΊ Incident Detector (sudden stops = possible accident)
Why WebSocket for location updates:
- Persistent connection avoids TCP handshake overhead per update
- Server can push re-routing suggestions back to client
- More efficient than polling (REST) for 15-sec update interval
Design Summary
Complete Architecture
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Client Apps β
β (iOS / Android / Web) β
ββββββββ¬ββββββββββββββββ¬βββββββββββββββββ¬βββββββββββββββββ¬ββββββββββ
β β β β
Route Request Location Update Tile Request Address Search
β β β β
βΌ βΌ βΌ βΌ
ββββββββββββββββ βββββββββββββ βββββββββββββββ βββββββββββββββββ
β Navigation β β Location β β CDN Edge β β Search β
β Service β β Service β β (Tiles) β β Service β
β (Routing + β β (WebSocketβ ββββββββ¬βββββββ β (Geocoding) β
β ETA) β β /REST) β β MISS βββββββββββββββββ
ββββββββ¬ββββββββ βββββββ¬ββββββ ββββββββΌβββββββ
β β β Object Store β
β βΌ β (Tile files) β
β ββββββββββββ βββββββββββββββ
β β Kafka β
β ββββββ¬ββββββ
β β
β ββββββΌβββββββββββ
β β Traffic β
β β Processor β
β ββββββ¬βββββββββββ
β β Weight updates
βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββ
β Road Graph Store β
β (Object Storage, partitioned tiles) β
β Edge weight = travel_time_s β
β Updated every 1β5 min from traffic β
ββββββββββββββββββββββββββββββββββββββββ
Key Decisions Summary
| Decision | Choice | Reasoning |
|---|---|---|
| Map tiles | Vector tiles + CDN | Smaller files, smooth zoom, high cache hit ratio |
| Tile storage | Object storage (S3) + CDN | Static content, global distribution |
| Routing algorithm | Contraction Hierarchies | 1000x faster than Dijkstra, production standard |
| Graph storage | Object storage, partitioned by tile | Too large for single DB, load on demand |
| Traffic collection | Passive GPS probes | Anonymized, scales with user base automatically |
| Traffic pipeline | Kafka + stream processor | High throughput, decoupled from routing |
| Location protocol | WebSocket | Persistent, low overhead, bidirectional |
| ETA | ML model (historical + live traffic) | Better accuracy than pure graph weights |
Interview Questions & Answers
Q: Why is naive Dijkstraβs algorithm too slow for global maps?
A: With 100 million nodes and 200 million edges globally, Dijkstraβs O((V+E) log V) would require billions of operations per query. A cross-continent route forces exploration of millions of nodes, taking seconds to minutes β far exceeding the < 2s requirement. Production systems use Contraction Hierarchies, which pre-compute shortcuts offline to reduce query time to milliseconds.
Q: How do you handle real-time traffic without re-running pre-computation?
A: Traffic updates only change edge weights (travel time per segment), not the graph topology. CH shortcuts are built on top of the base graph structure, but the actual query can incorporate live weights. In practice, CH is periodically rebuilt (e.g., every few hours) with current traffic patterns baked in, while very recent changes are handled by adjusting weights at query time.
Q: How do you store 50 PB of map tile data efficiently?
A: Tiles are static files stored in object storage (like S3). The key insight is CDN caching β popular tiles (major cities, common zoom levels) have cache-hit rates above 99%, so the origin only serves a tiny fraction of requests. Tiles are served as vector data (10β50 KB vs 100 KB for raster), reducing storage and bandwidth. Less-accessed areas (rural, remote) are served on cache miss with higher latency, which is acceptable.
Q: How do you collect traffic data without tracking users?
A: Use the passive probe model. Location updates are anonymized immediately upon receipt β the user ID is stripped and the device is identified only by a temporary session token. The system aggregates speed data per road segment across many users, so no individualβs route is stored. Google Mapsβ privacy model publishes minimum thresholds (e.g., need at least N users on a segment before showing traffic) to prevent re-identification.
Q: What happens when thereβs a sudden traffic event (accident)?
A: Two mechanisms: (1) Automatic detection β if many GPS probes show sudden deceleration on a segment, the traffic processor flags an incident within 1β3 minutes. (2) Explicit reporting β users can manually report incidents in the app. Both feed a separate incident service that immediately pushes high-priority weight updates (set segment weight to very high value) to the routing service, bypassing the normal 1β5 minute batch update.
Key Takeaways
- Map tiles are served via CDN with URLs structured as
z/x/yβ tiles are static and cacheable, enabling > 99% CDN hit rate - Vector tiles (client-rendered) beat raster tiles for file size, smooth zoom, and runtime styling β the modern standard
- Road graph uses nodes (intersections) + edges (segments) with travel time (not distance) as weight, updated dynamically
- Contraction Hierarchies is the production routing algorithm β offline pre-computation enables millisecond queries on 100M+ node graphs
- Real-time traffic is sourced passively from usersβ GPS updates β 6.7M events/sec flow through Kafka β stream processor β edge weights
- Graph partitioning by geographic tile makes the 200 GB global graph manageable β load only the tiles along the route
- ETA accuracy requires ML models combining live traffic, historical patterns, day/time, and weather β pure graph weight summation is not enough
Related Resources
- distributed-system-components - CDN, Object Storage, Kafka
- key-patterns - Graph algorithms, stream processing, geospatial indexing
- ch04-distributed-message-queue - Kafka internals (used in traffic pipeline)
- ch03-framework-for-system-design - Apply framework to this problem
Practice this design! Key areas to master:
- Draw the tile URL structure and explain zoom levels
- Walk through why Dijkstra fails and how CH solves it
- Describe the end-to-end traffic data pipeline
- Explain vector vs raster tile trade-offs
- Discuss how to handle map updates (batch for roads, real-time for traffic)
Last Updated: 2026-04-13
Status: Very Hard β rehearse the routing algorithm progression (Dijkstra β A* β CH)