Chapter 3: Design Google Maps

volume2 google-maps routing graph navigation

Status: 🟩 Interview ready
Difficulty: Very Hard
Time to complete: 60 min read + practice

Overview

Google Maps is a real-time mapping and navigation platform that serves 1 billion users daily. It provides turn-by-turn navigation, real-time traffic updates, map tile rendering, ETA estimation, and location search across 5 million km of roads worldwide.

Why this matters:

Rare but high-signal interview question at Google, Uber, Lyft, Meta
Tests knowledge of graph algorithms, distributed storage, CDN strategy, and real-time data pipelines
Combines multiple hard sub-problems: routing, tile serving, traffic aggregation

Problem Statement

Design a system that:

Navigates users from point A to point B with accurate turn-by-turn directions
Incorporates real-time traffic data into route and ETA calculations
Renders map tiles at any zoom level, worldwide
Supports 1 billion DAU with low latency

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

Navigation only, or also search/discovery? → Navigation + location search
Mobile and web? → Yes, both
Real-time traffic? → Yes, critical feature
Offline support? → Out of scope for interview
Public transit, walking, cycling? → Focus on driving navigation
How fresh is traffic data? → Near real-time (seconds to minutes)

Scope:

Turn-by-turn navigation (shortest + fastest path)
Real-time traffic integration
Map rendering at multiple zoom levels
ETA estimation
Location/address search

Non-Functional Requirements

Scale: 1 billion DAU
Road data: 5 million km of roads globally
Map storage: 50 PB of raw map data; each tile ~100 KB
Accuracy: Routes must be correct; ETAs within acceptable margin
Low latency: Map tiles render quickly; route calculation < 2 sec
High availability: 99.99% uptime (people depend on navigation)
Traffic updates: Near real-time (aggregate and apply within ~1 min)

Scale Estimates

DAU:                        1 billion users
Average active navigation:  35 min/session
Location updates (GPS):     Every 15 sec while navigating
Location events/sec:        ~1B users × 10% navigating × (1/15 sec)
                           ≈ 6.7 million GPS events/sec

Map tiles:
  Zoom levels: 0–22 (level 0 = whole world, level 22 = building)
  Tiles at zoom 21: 2^21 × 2^21 = ~4 trillion tiles
  But only ~10% of tiles are land (usable area)
  Storage: 50 PB raw image data

Road graph:
  5M km roads → ~100M nodes (intersections)
  ~200M edges (road segments)

Step 2: High-Level Design (10 min)

Core Components

┌──────────────────────────────────────────────────────────────────┐
│                        Client (Mobile/Web)                       │
└─────────────────────┬─────────────────────────────┬─────────────┘
                      │ Route Request               │ Location Updates
                      ▼                             ▼
         ┌────────────────────┐       ┌─────────────────────────┐
         │  Navigation Service │       │   Location Service      │
         │  (Route Calculation)│       │   (GPS ingestion)       │
         └──────────┬─────────┘       └──────────┬──────────────┘
                    │                             │
                    │ Graph query                 │ Kafka stream
                    ▼                             ▼
         ┌────────────────────┐       ┌─────────────────────────┐
         │  Road Graph Store  │       │  Traffic Processor      │
         │  (Object Storage / │       │  (Speed aggregation     │
         │   Graph DB)        │◄──────│   per road segment)     │
         └────────────────────┘       └─────────────────────────┘

         ┌────────────────────┐
         │  Map Tile Service  │──── CDN ──── Client
         │  (Tile generation) │
         └────────────────────┘

Three Main Services

Service	Responsibility	Key Data
Location Service	Ingest GPS updates from users	Kafka topic per region
Navigation Service	Compute routes, ETAs	Road graph, traffic weights
Map Rendering Service	Serve pre-rendered or vector tiles	Tile storage + CDN

Data Stores

Road Graph:         Object storage (S3 / GCS) — adjacency lists by region
Map Tiles:          Object storage + CDN (static, rarely updated)
Traffic Data:       Time-series DB or Redis — speed per road segment
User Location:      Kafka (stream) → traffic processor (ephemeral)
ETA History:        Data warehouse for ML model training

Step 3: Deep Dive (30 min)

3.1 Map Tile System

World divided into tiles using a quadtree-like grid:

Zoom Level 0:    1 tile  (entire world 256×256 px)
Zoom Level 1:    4 tiles (2×2 grid)
Zoom Level 2:   16 tiles (4×4 grid)
...
Zoom Level z:  4^z tiles

Each tile = 256×256 pixels
Each tile URL: /tiles/{z}/{x}/{y}.png

Tile URL structure:

https://maps.cdn.example.com/tiles/15/16384/10900.png
                                    ^z  ^x     ^y
z = zoom level (15 = neighborhood level)
x = tile column
y = tile row

Zoom level reference:

Level  0: World (1 tile)
Level  5: Continent
Level 10: City
Level 14: Neighborhood
Level 17: Street level
Level 20: Building detail
Level 22: Room level

Tile caching strategy:

Client Request for tile z/x/y
         │
         ▼
    CDN Edge Node ──── HIT ──→ Return tile (< 10 ms)
         │ MISS
         ▼
    CDN Origin (Object Storage: S3/GCS)
         │
         ▼
    Pre-rendered tile file (100 KB)
    Cache-Control: max-age=604800 (7 days)

Why CDN works perfectly for tiles:

Tiles are static (road geometry rarely changes)
Same tile served to millions of users at same zoom/location
Immutable URLs (tile content at z/x/y is always the same version)
Cache-hit ratio > 99% for popular areas

Vector Tiles vs Raster Tiles:

Aspect	Raster Tiles (PNG/JPG)	Vector Tiles (PBF/JSON)
Rendering	Server-rendered image	Client-rendered geometry
File size	~100 KB per tile	~10–50 KB per tile
Zoom smoothness	Pixelates between levels	Smooth (infinite zoom)
Customization	Fixed style	Style at runtime
Client CPU	Low	Higher (WebGL/Canvas)
Offline	Easy to cache	Easy to cache + re-style
Used by	Legacy systems	Google Maps, Mapbox

Google Maps uses vector tiles on modern clients for smooth rendering and smaller bandwidth.

3.2 Road Graph for Routing

Graph representation:

Nodes  = Road intersections (lat/lng coordinates)
Edges  = Road segments between intersections
Weight = Travel time in seconds (NOT distance)
         → Updated dynamically with traffic data

Example graph:

     A ──(5s)── B ──(3s)── C
     │           │
   (8s)        (4s)
     │           │
     D ──(2s)── E ──(6s)── F

Nodes: A, B, C, D, E, F
Edges with weights (travel time):
  A→B: 5s    B→C: 3s    A→D: 8s
  B→E: 4s    D→E: 2s    E→F: 6s

Storage format — adjacency list:

{
  "node_A": {
    "lat": 37.7749, "lng": -122.4194,
    "edges": [
      {"to": "node_B", "base_weight_s": 5, "road_type": "primary"},
      {"to": "node_D", "base_weight_s": 8, "road_type": "secondary"}
    ]
  },
  "node_B": {
    "lat": 37.7755, "lng": -122.4180,
    "edges": [
      {"to": "node_C", "base_weight_s": 3, "road_type": "primary"},
      {"to": "node_E", "base_weight_s": 4, "road_type": "primary"}
    ]
  }
}

Stored in object storage (S3/GCS):

Partitioned by geographic region (tiles map to graph tiles)
Graph tile at zoom ~6 covers one city region
Only load relevant regions during route calculation
~200 GB per major metro area’s compressed graph data

3.3 Routing Algorithms

Naive Dijkstra’s — Too Slow for Global Scale

Dijkstra's time complexity: O((V + E) log V)

Global road network:
  V = 100 million nodes (intersections)
  E = 200 million edges (road segments)

Computation: ~300M × log(100M) ≈ 300M × 27 ≈ 8 billion operations
Time:        Several seconds to minutes per query → Unacceptable!

Why it’s too slow:

New York → Los Angeles route:
  Dijkstra explores ALL nodes closer than destination
  Must explore millions of nodes across entire continent
  Cannot complete in < 2 second requirement

Algorithm 1: A* (A-Star) — Better with Heuristic

Key idea: Guide search with a heuristic (straight-line distance to goal)

f(n) = g(n) + h(n)

g(n) = actual cost from start to node n
h(n) = heuristic estimate from n to goal (straight-line distance)

Priority queue orders by f(n), not just g(n)
→ Explores nodes "toward" the destination first

Visual comparison:

Dijkstra:                    A*:
Explores in all directions   Explores mostly toward goal

     S                           S
   ○○○○○                       ○○○
 ○○○○○○○○○                   ○○○○○
○○○○○○○G○○                  ○○○○G
 ○○○○○○○○○                   ○
   ○○○○○

○ = explored nodes           Much fewer explored nodes!

Speedup: 2–10x faster than Dijkstra on road networks
Limitation: Still explores too many nodes for continent-wide routing

Algorithm 2: Contraction Hierarchies (CH) — Production Standard

Key idea: Pre-process graph by “contracting” (removing) unimportant nodes and adding shortcut edges. Query uses bidirectional search on the contracted graph.

Pre-processing phase:

Step 1: Rank all nodes by "importance"
  Importance = how many shortcut edges would be needed if this node removed
  Low-importance nodes: Dead-end streets, local roads
  High-importance nodes: Highway junctions, major intersections

Step 2: Contract nodes from least to most important
  When contracting node v:
    For each pair of neighbors (u, w) of v:
      If shortest path u→w goes through v:
        Add shortcut edge u→w with weight dist(u,v) + dist(v,w)

Step 3: Result: Hierarchical graph
  Level 0: Local streets
  Level 1: Primary roads
  Level 2: Highways
  Level 3: Major interstates / motorways

Query phase (bidirectional):

Forward search from source S:  Only upward in hierarchy
Backward search from goal G:   Only upward in hierarchy

       [Motorway nodes - top level]
      /                          \
[Highway nodes]              [Highway nodes]
    /                                    \
[S] → → → → → → ←meeting point← ← ← ← [G]

Each direction only explores ~1000 nodes vs millions!

Speedup: 1000–10,000x faster than naive Dijkstra
Used by: OSRM, Graphhopper, HERE Maps, most production GPS systems

Pre-computation time: Hours to days for global graph (done offline, periodically)

Algorithm 3: Bidirectional Dijkstra

Standard Dijkstra:          Bidirectional Dijkstra:

  S ──────────────► G         S ──────► ◄────── G
  Explores radius r           Each explores radius r/2
                              Total: 2 × π(r/2)² << π r²
  Area: π r²                  ~4x fewer nodes explored

Simple and practical, often used as first-pass filter with CH.

Algorithm Comparison

Algorithm	Time Complexity	Pre-computation	Speedup vs Dijkstra	Used For
Dijkstra	O((V+E) log V)	None	1x	Small graphs
A*	O((V+E) log V)	None	2–10x	Medium graphs
Bidirectional Dijkstra	O((V+E) log V)	None	~4x	Simple speedup
Contraction Hierarchies	O(V^0.5) query	Hours (offline)	1000–10000x	Production GPS

Recommendation for interview: Mention Dijkstra first, explain why it’s too slow, then propose Contraction Hierarchies as the production solution. Show you know A* as an intermediate.

3.4 Real-Time Traffic Integration

How Google Maps collects traffic data:

Passive probe model (anonymized):

[User navigating] ──GPS update every 15s──► Location Service
                                                    │
                                                    ▼
                                              Anonymize
                                           (strip user ID)
                                                    │
                                                    ▼
                                              Kafka Topic
                                            (location events)
                                                    │
                                                    ▼
                                           Traffic Processor
                                       (aggregate speed per segment)
                                                    │
                                                    ▼
                                          Road segment speed table
                                      segment_id → avg_speed_kmh
                                                    │
                                                    ▼
                                         Update graph edge weights
                               weight_s = segment_length_m / speed_m_s

Traffic data pipeline:

GPS Events (6.7M/sec)
    │
    ▼
Kafka (partitioned by road segment ID)
    │
    ▼
Stream Processor (Flink/Spark Streaming)
  - Group events by road segment
  - Calculate median speed over 1-minute window
  - Filter outliers (GPS errors, stationary users)
    │
    ▼
Traffic DB (Redis / time-series DB)
  Key: segment_id:timestamp_bucket
  Val: avg_speed_kmh, sample_count, confidence
    │
    ▼
Graph weight updater (runs every 1–5 min)
  Reads updated speeds → Recalculates edge weights → Writes back to graph

Speed → Edge weight conversion:

def compute_edge_weight(segment_length_m, observed_speed_kmh, free_flow_speed_kmh):
    # Use observed speed if traffic data available (high confidence)
    # Fall back to free flow speed (posted speed limit) otherwise
    effective_speed = observed_speed_kmh if confidence > threshold else free_flow_speed_kmh
    speed_m_s = effective_speed * 1000 / 3600
    travel_time_s = segment_length_m / speed_m_s
    return travel_time_s

Traffic data freshness:

Data latency: GPS → edge weight update ≈ 1–3 minutes
Acceptable because: Traffic conditions change slowly (minutes, not seconds)
Exception: Accidents/incidents → pushed via separate incident feed

3.5 ETA Calculation

Simple ETA:

ETA = Sum of edge weights along route (in seconds)
    = Sum(segment_length / current_speed) for all segments

Better ETA with ML:

Features:
  - Current traffic speed per segment (real-time)
  - Historical speed at same time/day/segment
  - Day of week, time of day, holidays
  - Weather (rain slows traffic ~10–20%)
  - Incidents (accidents, road closures)
  - Route type (highway vs city streets)

Model:
  - Gradient boosted trees or neural network
  - Trained on historical route + actual travel time pairs
  - Recalculated every few minutes as route progresses

Accuracy target: ± 10% of actual travel time for 90% of routes

3.6 Map Partitioning for Scalability

Challenge: 5M km of roads, 100M nodes — can’t load entire graph in memory

Solution: Partition by geographic tiles:

World map divided into routing tiles (zoom level ~6)
Each routing tile: ~100 km × 100 km area
Stored as: graph_tile_{z}_{x}_{y}.bin in object storage

Route from NYC to LA:
  1. Find all tiles along approximate path (bounding box)
  2. Load only those tiles (~20 tiles for cross-country)
  3. Run CH query on loaded subgraph
  4. If path crosses tile boundary → load neighboring tile

Total tiles for global map: ~10,000 tiles at routing zoom level
Average tile size: ~20 MB compressed
Total: ~200 GB for global routing graph (fits in cluster memory)

Tile boundary connections:

Tile A ──────────│── Tile B
      [node_99] ──── [node_100]
               boundary edge stored in both tiles

3.7 Location Update Service Architecture

Mobile Client (navigating)
  │ GPS update (lat, lng, speed, heading, accuracy)
  │ WebSocket (persistent connection for 2-way comms)
  ▼
Load Balancer
  │
  ▼
Location Service (stateless, horizontally scalable)
  │ Validates, anonymizes, rate-limits
  ▼
Kafka (partitioned by geohash of location)
  │
  ├──► Traffic Processor (aggregate speed per segment)
  │
  └──► Incident Detector (sudden stops = possible accident)

Why WebSocket for location updates:

Persistent connection avoids TCP handshake overhead per update
Server can push re-routing suggestions back to client
More efficient than polling (REST) for 15-sec update interval

Design Summary

Complete Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          Client Apps                             │
│              (iOS / Android / Web)                               │
└──────┬───────────────┬────────────────┬────────────────┬─────────┘
       │               │                │                │
  Route Request  Location Update   Tile Request    Address Search
       │               │                │                │
       ▼               ▼                ▼                ▼
┌──────────────┐ ┌───────────┐  ┌─────────────┐  ┌───────────────┐
│  Navigation  │ │ Location  │  │  CDN Edge   │  │  Search       │
│  Service     │ │ Service   │  │  (Tiles)    │  │  Service      │
│  (Routing +  │ │ (WebSocket│  └──────┬──────┘  │  (Geocoding)  │
│   ETA)       │ │  /REST)   │         │ MISS    └───────────────┘
└──────┬───────┘ └─────┬─────┘  ┌──────▼──────┐
       │               │        │ Object Store │
       │               ▼        │ (Tile files) │
       │          ┌──────────┐  └─────────────┘
       │          │  Kafka   │
       │          └────┬─────┘
       │               │
       │          ┌────▼──────────┐
       │          │ Traffic       │
       │          │ Processor     │
       │          └────┬──────────┘
       │               │ Weight updates
       ▼               ▼
┌──────────────────────────────────────┐
│         Road Graph Store             │
│  (Object Storage, partitioned tiles) │
│  Edge weight = travel_time_s         │
│  Updated every 1–5 min from traffic  │
└──────────────────────────────────────┘

Key Decisions Summary

Decision	Choice	Reasoning
Map tiles	Vector tiles + CDN	Smaller files, smooth zoom, high cache hit ratio
Tile storage	Object storage (S3) + CDN	Static content, global distribution
Routing algorithm	Contraction Hierarchies	1000x faster than Dijkstra, production standard
Graph storage	Object storage, partitioned by tile	Too large for single DB, load on demand
Traffic collection	Passive GPS probes	Anonymized, scales with user base automatically
Traffic pipeline	Kafka + stream processor	High throughput, decoupled from routing
Location protocol	WebSocket	Persistent, low overhead, bidirectional
ETA	ML model (historical + live traffic)	Better accuracy than pure graph weights

Interview Questions & Answers

Q: Why is naive Dijkstra’s algorithm too slow for global maps?
A: With 100 million nodes and 200 million edges globally, Dijkstra’s O((V+E) log V) would require billions of operations per query. A cross-continent route forces exploration of millions of nodes, taking seconds to minutes — far exceeding the < 2s requirement. Production systems use Contraction Hierarchies, which pre-compute shortcuts offline to reduce query time to milliseconds.

Q: How do you handle real-time traffic without re-running pre-computation?
A: Traffic updates only change edge weights (travel time per segment), not the graph topology. CH shortcuts are built on top of the base graph structure, but the actual query can incorporate live weights. In practice, CH is periodically rebuilt (e.g., every few hours) with current traffic patterns baked in, while very recent changes are handled by adjusting weights at query time.

Q: How do you store 50 PB of map tile data efficiently?
A: Tiles are static files stored in object storage (like S3). The key insight is CDN caching — popular tiles (major cities, common zoom levels) have cache-hit rates above 99%, so the origin only serves a tiny fraction of requests. Tiles are served as vector data (10–50 KB vs 100 KB for raster), reducing storage and bandwidth. Less-accessed areas (rural, remote) are served on cache miss with higher latency, which is acceptable.

Q: How do you collect traffic data without tracking users?
A: Use the passive probe model. Location updates are anonymized immediately upon receipt — the user ID is stripped and the device is identified only by a temporary session token. The system aggregates speed data per road segment across many users, so no individual’s route is stored. Google Maps’ privacy model publishes minimum thresholds (e.g., need at least N users on a segment before showing traffic) to prevent re-identification.

Q: What happens when there’s a sudden traffic event (accident)?
A: Two mechanisms: (1) Automatic detection — if many GPS probes show sudden deceleration on a segment, the traffic processor flags an incident within 1–3 minutes. (2) Explicit reporting — users can manually report incidents in the app. Both feed a separate incident service that immediately pushes high-priority weight updates (set segment weight to very high value) to the routing service, bypassing the normal 1–5 minute batch update.

Key Takeaways

Map tiles are served via CDN with URLs structured as z/x/y — tiles are static and cacheable, enabling > 99% CDN hit rate
Vector tiles (client-rendered) beat raster tiles for file size, smooth zoom, and runtime styling — the modern standard
Road graph uses nodes (intersections) + edges (segments) with travel time (not distance) as weight, updated dynamically
Contraction Hierarchies is the production routing algorithm — offline pre-computation enables millisecond queries on 100M+ node graphs
Real-time traffic is sourced passively from users’ GPS updates — 6.7M events/sec flow through Kafka → stream processor → edge weights
Graph partitioning by geographic tile makes the 200 GB global graph manageable — load only the tiles along the route
ETA accuracy requires ML models combining live traffic, historical patterns, day/time, and weather — pure graph weight summation is not enough

distributed-system-components - CDN, Object Storage, Kafka
key-patterns - Graph algorithms, stream processing, geospatial indexing
ch04-distributed-message-queue - Kafka internals (used in traffic pipeline)
ch03-framework-for-system-design - Apply framework to this problem

Practice this design! Key areas to master:

Draw the tile URL structure and explain zoom levels
Walk through why Dijkstra fails and how CH solves it
Describe the end-to-end traffic data pipeline
Explain vector vs raster tile trade-offs
Discuss how to handle map updates (batch for roads, real-time for traffic)

Last Updated: 2026-04-13
Status: Very Hard — rehearse the routing algorithm progression (Dijkstra → A* → CH)

Study Notes by Niladri & AI

Explorer

ch03-google-maps

Chapter 3: Design Google Maps

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Scale Estimates

Step 2: High-Level Design (10 min)

Core Components

Three Main Services

Data Stores

Step 3: Deep Dive (30 min)

3.1 Map Tile System

3.2 Road Graph for Routing

3.3 Routing Algorithms

Naive Dijkstra’s — Too Slow for Global Scale

Algorithm 1: A* (A-Star) — Better with Heuristic

Algorithm 2: Contraction Hierarchies (CH) — Production Standard

Algorithm 3: Bidirectional Dijkstra

Algorithm Comparison

3.4 Real-Time Traffic Integration

3.5 ETA Calculation

3.6 Map Partitioning for Scalability

3.7 Location Update Service Architecture

Design Summary

Complete Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Graph View

Table of Contents

Backlinks

Study Notes by Niladri & AI

Explorer

ch03-google-maps

Chapter 3: Design Google Maps

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Scale Estimates

Step 2: High-Level Design (10 min)

Core Components

Three Main Services

Data Stores

Step 3: Deep Dive (30 min)

3.1 Map Tile System

3.2 Road Graph for Routing

3.3 Routing Algorithms

Naive Dijkstra’s — Too Slow for Global Scale

Algorithm 1: A* (A-Star) — Better with Heuristic

Algorithm 2: Contraction Hierarchies (CH) — Production Standard

Algorithm 3: Bidirectional Dijkstra

Algorithm Comparison

3.4 Real-Time Traffic Integration

3.5 ETA Calculation

3.6 Map Partitioning for Scalability

3.7 Location Update Service Architecture

Design Summary

Complete Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Related Resources

Graph View

Table of Contents

Backlinks