Chapter 4: Design a Rate Limiter
volume1 rate-limiter system-design throttling
Status: π© Interview ready - Common question!
Difficulty: Medium
Time to complete: 45 min read + practice
Overview
A rate limiter controls the rate of traffic sent by a client or service. It limits the number of requests a client can make within a time window (e.g., 100 requests per minute).
Why this matters:
- Very common interview question (appears at Google, Amazon, Microsoft, etc.)
- Fundamental building block in distributed systems
- Teaches throttling, distributed systems, and trade-offs
Problem Statement
Design a rate limiter that:
- Limits requests per user/IP within a time window
- Returns HTTP 429 (Too Many Requests) when limit exceeded
- Works in distributed systems (multiple servers)
- Low latency overhead (< 1ms)
- Accurate (no false positives/negatives)
Step 1: Requirements & Scope (5 min)
Functional Requirements
Clarifying questions:
- Client-side or server-side rate limiter? β Server-side (client-side can be bypassed)
- What do we rate limit on? β User ID, IP address, API key, or combination
- Scale? β Handle large number of requests (millions per second)
- Distributed environment? β Yes, multiple servers
- Response when throttled? β HTTP 429 with retry-after header
- Do users get notified? β Return headers showing rate limit status
Scope:
- Limit requests per user or IP
- Support different rate limit rules (per endpoint, per user tier)
- Inform users about their rate limit status
- Works across distributed servers
Non-Functional Requirements
- Low latency: < 1ms overhead (shouldnβt slow down API)
- Accurate: No false positives (legitimate requests blocked)
- Scalability: Handle millions of users
- Fault tolerance: Failures shouldnβt break system
- Memory efficient: Donβt store too much state
Step 2: High-Level Design (10 min)
Where to Implement Rate Limiting
Option 1: Client-side β
- Can be bypassed by malicious users
- Only use as courtesy, not security
Option 2: Server-side (Application code) β
- Full control
- Con: Adds logic to every service
Option 3: Middleware/API Gateway β β (Best!)
- Centralized rate limiting
- Donβt repeat logic in every service
- Common in production (AWS API Gateway, Kong, NGINX)
Recommendation: Use API Gateway for rate limiting
Basic Architecture
[Client] β [API Gateway with Rate Limiter] β [Backend Services]
β
[Redis/Memcached]
(Store counters/state)
Why Redis/Memcached?
- In-memory β Fast (< 1ms)
- Support atomic operations (INCR)
- Support expiration (TTL)
- Shared across servers (distributed)
API Design
Request with rate limiting:
GET /api/v1/user/123
Response Headers:
X-RateLimit-Limit: 100 # Max requests per window
X-RateLimit-Remaining: 42 # Requests left
X-RateLimit-Reset: 1640000000 # Unix timestamp when limit resets
When rate limited:
HTTP 429 Too Many Requests
Response Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000
Retry-After: 60 # Seconds until can retry
Body:
{
"error": "Rate limit exceeded",
"message": "Too many requests. Try again in 60 seconds."
}
Rate Limiting Rules
Example rules:
Rule 1: 10 requests per second per user
Rule 2: 100 requests per minute per IP address
Rule 3: Free tier: 1000 requests/day, Premium: 10000 requests/day
Rule 4: /login endpoint: 5 requests per minute (prevent brute force)
Rule 5: /api/* endpoints: 100 requests per minute
Rules stored in configuration (database or config file):
{
"userId:123": {
"limit": 100,
"window": "1m"
},
"ip:192.168.1.1": {
"limit": 1000,
"window": "1h"
},
"endpoint:/api/search": {
"limit": 10,
"window": "1s"
}
}Step 3: Deep Dive - Rate Limiting Algorithms (20 min)
Algorithm 1: Token Bucket πͺ£ (Most Popular!)
How it works:
- Bucket has a capacity (max tokens)
- Tokens refill at a constant rate
- Each request consumes 1 token
- If no tokens β request rejected
Analogy: Water bucket with a hole. Water drips in at constant rate. Request = take water out.
Parameters:
- Bucket capacity: Max tokens (e.g., 100)
- Refill rate: Tokens per second (e.g., 10/sec)
Example:
Bucket capacity: 4 tokens
Refill rate: 1 token/second
t=0: [πͺπͺπͺπͺ] (4 tokens) β Request β [πͺπͺπͺ] (3 left)
t=1: [πͺπͺπͺπͺ] (refilled to 4) β Request β [πͺπͺπͺ]
t=1: [πͺπͺπͺ] β Request β [πͺπͺ]
t=1: [πͺπͺ] β Request β [πͺ]
t=1: [πͺ] β Request β [] (empty!)
t=1: [] β Request β β REJECTED (no tokens)
t=2: [πͺ] (refilled) β Request β [] (OK)
Pros:
- β Allows burst traffic (use all tokens at once)
- β Memory efficient (only store bucket state)
- β Simple to understand and implement
- β Used by AWS, Stripe (industry standard)
Cons:
- β Tuning parameters tricky (capacity vs refill rate)
- β May allow more requests than limit (burst)
When to use: Most cases! Default choice for rate limiting.
Redis implementation:
def allow_request(user_id):
key = f"rate_limit:{user_id}"
bucket = redis.get(key)
if bucket is None:
# Initialize bucket
bucket = {"tokens": 100, "last_refill": time.now()}
redis.set(key, bucket, ex=3600) # Expire in 1 hour
# Refill tokens based on time elapsed
now = time.now()
elapsed = now - bucket["last_refill"]
refill = elapsed * 10 # 10 tokens per second
bucket["tokens"] = min(100, bucket["tokens"] + refill)
bucket["last_refill"] = now
# Try to consume 1 token
if bucket["tokens"] >= 1:
bucket["tokens"] -= 1
redis.set(key, bucket, ex=3600)
return True # Allowed
else:
return False # Rate limitedAlgorithm 2: Leaky Bucket π§
How it works:
- Requests enter queue (bucket)
- Process requests at constant rate (leak)
- If queue full β request rejected
Analogy: Bucket with hole at bottom. Water leaks out at constant rate.
Parameters:
- Queue size: Max queued requests (e.g., 100)
- Outflow rate: Requests per second (e.g., 10/sec)
Example:
Queue size: 3
Outflow: 1 request/second
t=0: [___] (empty) β 3 requests arrive β [RRR] (queue full)
t=1: [RRR] β 1 processed (leaked) β [RR_] β 1 new arrives β [RRR]
t=2: [RRR] β 1 leaked β [RR_]
t=2: [RR_] β 2 new arrive β [RRR] β 1 more arrives β β REJECTED
Pros:
- β Smooth traffic (constant outflow rate)
- β Good for preventing downstream overload
- β Memory efficient
Cons:
- β Burst traffic gets queued (higher latency)
- β Old requests in queue may be stale
- β Queue can fill up during spikes
When to use: Need strict constant rate (avoid spikes downstream)
Leaky Bucket vs Token Bucket:
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst traffic | Allowed | Queued/rejected |
| Outflow | Variable | Constant |
| Complexity | Simpler | More complex (queue management) |
| Use case | API rate limiting | Traffic shaping |
Most systems use Token Bucket because itβs simpler and allows bursts.
Algorithm 3: Fixed Window Counter
How it works:
- Divide time into fixed windows (e.g., each minute)
- Count requests in current window
- Reset counter at window boundary
Parameters:
- Window size: Duration (e.g., 1 minute)
- Request limit: Max per window (e.g., 100)
Example:
Limit: 3 requests per minute
Window: 00:00:00 - 00:00:59
00:00:10 β Request β Counter: 1 β
00:00:20 β Request β Counter: 2 β
00:00:30 β Request β Counter: 3 β
00:00:40 β Request β Counter: 3 β β REJECTED
00:00:50 β Request β Counter: 3 β β REJECTED
00:01:00 β Counter resets to 0
00:01:05 β Request β Counter: 1 β
Pros:
- β Very simple to implement
- β Memory efficient (one counter per window)
- β Easy to understand
Cons:
- β Burst at window boundary (major issue!)
- At 00:00:59 β 100 requests
- At 00:01:00 β 100 requests
- Total: 200 requests in 2 seconds! (Double the limit)
Burst problem visualization:
Window 1: [______________________|********************]
00:00 00:00:59 (100 req in last 1 sec)
Window 2: [********************|______________________]
00:01:00 00:01:59 (100 req in first 1 sec)
Total: 200 requests in 2 seconds! Should be 100 max.
When to use: Quick prototype, not production (use sliding window instead)
Redis implementation (simple):
def allow_request(user_id):
key = f"rate_limit:{user_id}:{current_minute()}"
count = redis.incr(key)
redis.expire(key, 60) # Expire after 1 minute
if count <= 100:
return True
else:
return False
def current_minute():
return int(time.now() / 60) # Window IDAlgorithm 4: Sliding Window Log π
How it works:
- Store timestamp of each request
- Count requests in sliding window (last N seconds)
- Remove old timestamps outside window
Parameters:
- Window size: Duration (e.g., 1 minute)
- Request limit: Max per window (e.g., 100)
Example:
Limit: 3 requests per 60 seconds
Current time: 00:01:30
Request log:
00:00:35 β Remove (too old, > 60 sec ago)
00:01:00 β Keep (within 60 sec)
00:01:15 β Keep
00:01:20 β Keep
Count: 3 requests in last 60 seconds
00:01:30 β New request β Count: 4 β β REJECTED
00:02:00 β New request β Count: 2 (00:01:00 expired) β β
ALLOWED
Pros:
- β Very accurate (no boundary burst problem)
- β True sliding window
Cons:
- β Memory expensive (store all timestamps)
- 1M users Γ 100 requests/min Γ 8 bytes/timestamp = 800 MB
- β Slow (need to iterate all timestamps to count)
When to use: Need high accuracy, low request volume
Redis implementation:
def allow_request(user_id):
key = f"rate_limit:{user_id}"
now = time.now()
window_start = now - 60 # 60 second window
# Remove old timestamps
redis.zremrangebyscore(key, 0, window_start)
# Count requests in window
count = redis.zcard(key)
if count < 100:
# Add current timestamp
redis.zadd(key, {now: now})
redis.expire(key, 60)
return True
else:
return FalseAlgorithm 5: Sliding Window Counter π― (Best Balance!)
How it works:
- Hybrid of fixed window + sliding window
- Approximate sliding window using two fixed windows
- Weighted average based on time in current window
Formula:
Requests in sliding window =
Requests in current window +
Requests in previous window Γ Overlap percentage
Example:
Limit: 100 requests per minute
Current time: 00:01:30 (30 seconds into current window)
Previous window (00:00 - 00:01): 80 requests
Current window (00:01 - 00:02): 40 requests
Overlap: 30 seconds into current minute
Overlap percentage: 30/60 = 50%
Estimated requests in sliding window:
= 40 (current) + 80 (previous) Γ 50%
= 40 + 40
= 80 requests
New request β 81 β€ 100 β β
ALLOWED
Visual:
Previous window: [ββββββββ80 reqββββββ]
Current window: [ββββ40 reqββββ|________]
00:01:00 00:01:30
Sliding window: [ββββββ 60 sec βββββ]
Uses 50% of prev + 100% of current
Pros:
- β Good accuracy (better than fixed window)
- β Memory efficient (only 2 counters)
- β No burst problem
- β Used by Cloudflare, Azure
Cons:
- β Approximation (not perfectly accurate)
- β Assumes even distribution in previous window
When to use: Production systems! Best balance of accuracy and efficiency.
Redis implementation:
def allow_request(user_id):
now = time.now()
current_window = int(now / 60)
previous_window = current_window - 1
elapsed_time_in_current_window = now % 60
current_key = f"rate_limit:{user_id}:{current_window}"
previous_key = f"rate_limit:{user_id}:{previous_window}"
current_count = redis.get(current_key) or 0
previous_count = redis.get(previous_key) or 0
# Calculate weighted count
overlap_percentage = (60 - elapsed_time_in_current_window) / 60
estimated_count = current_count + previous_count * overlap_percentage
if estimated_count < 100:
redis.incr(current_key)
redis.expire(current_key, 120) # Keep for 2 windows
return True
else:
return FalseAlgorithm Comparison
| Algorithm | Pros | Cons | Memory | Accuracy | Use Case |
|---|---|---|---|---|---|
| Token Bucket | Simple, allows burst, industry standard | Tuning tricky | Low | Good | Default choice |
| Leaky Bucket | Smooth traffic | Queues requests, complex | Low | Good | Traffic shaping |
| Fixed Window | Very simple | Burst at boundary | Very Low | Poor | Prototypes only |
| Sliding Log | Very accurate | Memory expensive, slow | High | Excellent | Low volume, need accuracy |
| Sliding Window | Good accuracy, efficient | Approximation | Low | Very Good | Production systems |
Recommendations:
- Start with: Token Bucket (simple, industry standard)
- For production: Sliding Window Counter (best balance)
- Avoid: Fixed Window (burst problem)
- For low volume: Sliding Log (if need perfect accuracy)
Step 4: Deep Dive - Distributed Rate Limiting
Challenge: Race Conditions
Problem: Multiple servers incrementing counter simultaneously
Server 1 reads count: 99
Server 2 reads count: 99
Server 1 increments: 100 β
Server 2 increments: 100 β
Actual count: 101 (should be rate limited!)
Solutions:
1. Redis INCR (Atomic):
count = redis.incr(key) # Atomic operation
if count <= 100:
return True- β Atomic, no race condition
- β Fast
2. Lua Script in Redis (for complex logic):
-- Lua script runs atomically in Redis
local current = redis.call('get', KEYS[1])
if current and tonumber(current) >= 100 then
return 0 -- Rate limited
else
redis.call('incr', KEYS[1])
redis.call('expire', KEYS[1], 60)
return 1 -- Allowed
end3. Locks (avoid if possible):
with redis.lock(f"lock:{user_id}"):
count = redis.get(key)
if count < 100:
redis.incr(key)
return True- β Slow (lock contention)
- β Deadlock risk
Recommendation: Use Redis atomic operations (INCR, Lua scripts)
Challenge: Synchronization Across Regions
Problem: Multi-region deployment, users routed to different regions
US Region: User has 50 requests
EU Region: User has 50 requests
Total: 100 requests (limit reached)
But each region thinks user has 50 (under limit)
Solutions:
1. Centralized Redis Cluster (Strict consistency):
- All regions talk to same Redis cluster
- β Higher latency (cross-region)
- β Accurate
2. Eventual Consistency (Relaxed):
- Each region has own Redis
- Periodically sync counts
- β Low latency
- β Less accurate (OK for most cases)
3. Sticky Sessions (Route same user to same region):
- GeoDNS routes user to nearest region
- β Simple
- β Imbalanced load
Recommendation:
- For strict enforcement: Centralized Redis (financial APIs)
- For general APIs: Eventual consistency (good enough)
Challenge: Performance & Scalability
Bottleneck: Redis can become bottleneck at very high scale
Solutions:
1. Redis Cluster (Sharding):
- Partition users across Redis shards
- Hash(user_id) β Redis shard
- β Scales horizontally
2. Local Rate Limiting + Global Rate Limiting:
# Fast local check (in-memory)
if local_rate_limiter.allow(user_id):
# Slower global check (Redis)
if global_rate_limiter.allow(user_id):
return True- β Very fast (local cache)
- β οΈ Less accurate (OK for high throughput)
3. Distributed Rate Limiting Libraries:
- Use existing solutions (DoorKeeper, Resilience4j)
- Handle complexities for you
Rate Limiting Rules & Configuration
Rule Storage
Option 1: Configuration File (Simple):
rules:
- user_id: "123"
limit: 1000
window: "1d"
- tier: "free"
limit: 100
window: "1h"
- tier: "premium"
limit: 10000
window: "1h"
- endpoint: "/api/search"
limit: 10
window: "1s"Option 2: Database (Dynamic):
CREATE TABLE rate_limit_rules (
rule_id INT PRIMARY KEY,
entity_type VARCHAR(50), -- 'user', 'ip', 'api_key'
entity_id VARCHAR(255),
endpoint VARCHAR(255),
limit INT,
window_seconds INT,
created_at TIMESTAMP
);Option 3: Rules Engine (Complex):
- If user.tier == βpremiumβ β 10000 req/hour
- If endpoint == β/loginβ β 5 req/min
- If ip.country == βsuspiciousβ β 10 req/hour
Rule Priority
Multiple rules can apply to same request:
User ID rule: 1000 req/hour
IP rule: 100 req/hour
Endpoint rule: 10 req/sec
Which to apply? β Most restrictive (10 req/sec)
Combining rules:
- AND: Must satisfy all rules
- OR: Must satisfy at least one
- Most restrictive: Apply strictest limit
Handling Rate Limit Exceeded
Response Format
HTTP 429 Response:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000
Retry-After: 60
{
"error": "rate_limit_exceeded",
"message": "You have exceeded the rate limit. Please try again in 60 seconds."
}
Client Behavior
Best practices:
- Exponential backoff: Wait 1s, 2s, 4s, 8sβ¦
- Respect Retry-After header: Wait specified time
- Monitor X-RateLimit-Remaining: Stop before hitting limit
- Queue non-urgent requests: Donβt spam API
Advanced Topics
1. Distributed Rate Limiting with Redis Cluster
Architecture:
API Gateway 1 βββ
API Gateway 2 βββΌβββ Redis Cluster (Sharded)
API Gateway 3 βββ - Shard 1: Users A-H
- Shard 2: Users I-P
- Shard 3: Users Q-Z
Consistent hashing: Route same user to same shard
2. Rate Limiting by Multiple Dimensions
Example: Rate limit by (user + endpoint):
User 123 β /api/search: 10 req/sec
User 123 β /api/users: 100 req/min
Redis key design:
rate_limit:{user_id}:{endpoint}:{window}
rate_limit:123:/api/search:1640000000
3. Dynamic Rate Limiting
Adjust limits based on:
- Time of day (higher during business hours)
- User behavior (increase limit for good actors)
- System load (decrease when overloaded)
Example:
def get_rate_limit(user_id, system_load):
base_limit = 100
if system_load > 0.9: # 90% capacity
return base_limit * 0.5 # Reduce by 50%
return base_limit4. Exemptions & Allowlists
Bypass rate limiting for:
- Internal services
- Health checks (monitoring)
- Admin users
- Trusted partners
if user.role == "admin":
return True # Bypass rate limitingDesign Summary
Final Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β API Gateway β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β Rate Limiter Middleware β β
β β β β
β β 1. Extract identifier (user/IP/API key) β β
β β 2. Check Redis for current count β β
β β 3. Apply rate limit algorithm β β
β β 4. If allowed β Forward to backend β β
β β 5. If blocked β Return HTTP 429 β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
βββββββββββββββββββββββββΌββββββββββββββββββββββββββββββ
β
ββββββββββββββββββββ
β Redis Cluster β
β (Rate Counters) β
β β
β Sliding Window β
β or Token Bucket β
ββββββββββββββββββββ
Key Decisions Summary
| Decision | Choice | Reasoning |
|---|---|---|
| Algorithm | Token Bucket or Sliding Window Counter | Best balance of accuracy and efficiency |
| Storage | Redis | Fast, atomic operations, TTL support |
| Where | API Gateway | Centralized, donβt repeat in services |
| Distributed | Redis Cluster + Atomic operations | Handle race conditions |
| Rules | Database or Config file | Flexible, can update without deploy |
Interview Questions & Answers
Q: How would you handle rate limiting for anonymous users?
A: Rate limit by IP address. But IP can be shared (NAT, VPN), so may need to be more lenient. Alternative: Use fingerprinting (browser, device) or require login for API access.
Q: What if Redis goes down?
A:
- Fail open: Allow all requests (risk of abuse, but system stays up)
- Fail closed: Reject all requests (secure, but bad user experience)
- Fallback: Use local cache with eventual consistency
- Best: Redis cluster with replication (high availability)
Q: How to prevent users from circumventing rate limits?
A:
- Rate limit by multiple dimensions (user + IP + device)
- Detect suspicious patterns (many IPs from one user)
- Require authentication (canβt create infinite accounts)
- Use CAPTCHA for anonymous endpoints
- Monitor for abuse patterns
Q: Token Bucket vs Leaky Bucket?
A: Token Bucket allows burst traffic (use all tokens at once), Leaky Bucket enforces constant rate (smooth traffic). Token Bucket is more common because APIs naturally have bursty traffic.
Q: How to test rate limiter?
A:
- Unit tests: Test algorithm logic
- Integration tests: Test with Redis
- Load tests: Simulate high traffic (millions of requests)
- Chaos tests: Kill Redis, network partitions
- Monitor: Track false positives/negatives in production
Key Takeaways
- Token Bucket is the most popular algorithm (AWS, Stripe use it)
- Sliding Window Counter gives best balance of accuracy and efficiency
- Implement at API Gateway for centralized rate limiting
- Use Redis for fast, atomic operations
- Handle race conditions with atomic operations (INCR, Lua scripts)
- Always return HTTP 429 with Retry-After header
- Monitor false positives and adjust limits
Related Resources
- distributed-system-components > 4. API Gateway - Where rate limiting is implemented
- key-patterns > 1. Rate Limiting - Patterns and algorithms
- ch03-framework-for-system-design - Apply framework to this problem
Practice this design! Very common interview question. Be ready to:
- Explain different algorithms and trade-offs
- Draw architecture diagram
- Handle distributed system challenges (race conditions, multi-region)
- Discuss Redis data structures and operations
Last Updated: 2026-04-09
Status: Common interview question - Must know!