Chapter 4: Design a Rate Limiter

volume1 rate-limiter system-design throttling

Status: 🟩 Interview ready - Common question!
Difficulty: Medium
Time to complete: 45 min read + practice

Overview

A rate limiter controls the rate of traffic sent by a client or service. It limits the number of requests a client can make within a time window (e.g., 100 requests per minute).

Why this matters:

Very common interview question (appears at Google, Amazon, Microsoft, etc.)
Fundamental building block in distributed systems
Teaches throttling, distributed systems, and trade-offs

Problem Statement

Design a rate limiter that:

Limits requests per user/IP within a time window
Returns HTTP 429 (Too Many Requests) when limit exceeded
Works in distributed systems (multiple servers)
Low latency overhead (< 1ms)
Accurate (no false positives/negatives)

Step 1: Requirements & Scope (5 min)

Functional Requirements

Clarifying questions:

Client-side or server-side rate limiter? → Server-side (client-side can be bypassed)
What do we rate limit on? → User ID, IP address, API key, or combination
Scale? → Handle large number of requests (millions per second)
Distributed environment? → Yes, multiple servers
Response when throttled? → HTTP 429 with retry-after header
Do users get notified? → Return headers showing rate limit status

Scope:

Limit requests per user or IP
Support different rate limit rules (per endpoint, per user tier)
Inform users about their rate limit status
Works across distributed servers

Non-Functional Requirements

Low latency: < 1ms overhead (shouldn’t slow down API)
Accurate: No false positives (legitimate requests blocked)
Scalability: Handle millions of users
Fault tolerance: Failures shouldn’t break system
Memory efficient: Don’t store too much state

Step 2: High-Level Design (10 min)

Where to Implement Rate Limiting

Option 1: Client-side ❌

Can be bypassed by malicious users
Only use as courtesy, not security

Option 2: Server-side (Application code) ✅

Full control
Con: Adds logic to every service

Option 3: Middleware/API Gateway ✅✅ (Best!)

Centralized rate limiting
Don’t repeat logic in every service
Common in production (AWS API Gateway, Kong, NGINX)

Recommendation: Use API Gateway for rate limiting

Basic Architecture

[Client] → [API Gateway with Rate Limiter] → [Backend Services]
                     ↓
              [Redis/Memcached]
           (Store counters/state)

Why Redis/Memcached?

In-memory → Fast (< 1ms)
Support atomic operations (INCR)
Support expiration (TTL)
Shared across servers (distributed)

API Design

Request with rate limiting:

GET /api/v1/user/123

Response Headers:
X-RateLimit-Limit: 100        # Max requests per window
X-RateLimit-Remaining: 42     # Requests left
X-RateLimit-Reset: 1640000000 # Unix timestamp when limit resets

When rate limited:

HTTP 429 Too Many Requests

Response Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000
Retry-After: 60               # Seconds until can retry

Body:
{
  "error": "Rate limit exceeded",
  "message": "Too many requests. Try again in 60 seconds."
}

Rate Limiting Rules

Example rules:

Rule 1: 10 requests per second per user
Rule 2: 100 requests per minute per IP address
Rule 3: Free tier: 1000 requests/day, Premium: 10000 requests/day
Rule 4: /login endpoint: 5 requests per minute (prevent brute force)
Rule 5: /api/* endpoints: 100 requests per minute

Rules stored in configuration (database or config file):

{
  "userId:123": {
    "limit": 100,
    "window": "1m"
  },
  "ip:192.168.1.1": {
    "limit": 1000,
    "window": "1h"
  },
  "endpoint:/api/search": {
    "limit": 10,
    "window": "1s"
  }
}

Step 3: Deep Dive - Rate Limiting Algorithms (20 min)

Algorithm 1: Token Bucket 🪣 (Most Popular!)

How it works:

Bucket has a capacity (max tokens)
Tokens refill at a constant rate
Each request consumes 1 token
If no tokens → request rejected

Analogy: Water bucket with a hole. Water drips in at constant rate. Request = take water out.

Parameters:

Bucket capacity: Max tokens (e.g., 100)
Refill rate: Tokens per second (e.g., 10/sec)

Example:

Bucket capacity: 4 tokens
Refill rate: 1 token/second

t=0:  [🪙🪙🪙🪙] (4 tokens) → Request → [🪙🪙🪙] (3 left)
t=1:  [🪙🪙🪙🪙] (refilled to 4) → Request → [🪙🪙🪙]
t=1:  [🪙🪙🪙] → Request → [🪙🪙]
t=1:  [🪙🪙] → Request → [🪙]
t=1:  [🪙] → Request → [] (empty!)
t=1:  [] → Request → ❌ REJECTED (no tokens)
t=2:  [🪙] (refilled) → Request → [] (OK)

Pros:

✅ Allows burst traffic (use all tokens at once)
✅ Memory efficient (only store bucket state)
✅ Simple to understand and implement
✅ Used by AWS, Stripe (industry standard)

Cons:

❌ Tuning parameters tricky (capacity vs refill rate)
❌ May allow more requests than limit (burst)

When to use: Most cases! Default choice for rate limiting.

Redis implementation:

def allow_request(user_id):
    key = f"rate_limit:{user_id}"
    bucket = redis.get(key)
 
    if bucket is None:
        # Initialize bucket
        bucket = {"tokens": 100, "last_refill": time.now()}
        redis.set(key, bucket, ex=3600)  # Expire in 1 hour
 
    # Refill tokens based on time elapsed
    now = time.now()
    elapsed = now - bucket["last_refill"]
    refill = elapsed * 10  # 10 tokens per second
    bucket["tokens"] = min(100, bucket["tokens"] + refill)
    bucket["last_refill"] = now
 
    # Try to consume 1 token
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        redis.set(key, bucket, ex=3600)
        return True  # Allowed
    else:
        return False  # Rate limited

Algorithm 2: Leaky Bucket 💧

How it works:

Requests enter queue (bucket)
Process requests at constant rate (leak)
If queue full → request rejected

Analogy: Bucket with hole at bottom. Water leaks out at constant rate.

Parameters:

Queue size: Max queued requests (e.g., 100)
Outflow rate: Requests per second (e.g., 10/sec)

Example:

Queue size: 3
Outflow: 1 request/second

t=0:  [___] (empty) → 3 requests arrive → [RRR] (queue full)
t=1:  [RRR] → 1 processed (leaked) → [RR_] → 1 new arrives → [RRR]
t=2:  [RRR] → 1 leaked → [RR_]
t=2:  [RR_] → 2 new arrive → [RRR] → 1 more arrives → ❌ REJECTED

Pros:

✅ Smooth traffic (constant outflow rate)
✅ Good for preventing downstream overload
✅ Memory efficient

Cons:

❌ Burst traffic gets queued (higher latency)
❌ Old requests in queue may be stale
❌ Queue can fill up during spikes

When to use: Need strict constant rate (avoid spikes downstream)

Leaky Bucket vs Token Bucket:

Aspect	Token Bucket	Leaky Bucket
Burst traffic	Allowed	Queued/rejected
Outflow	Variable	Constant
Complexity	Simpler	More complex (queue management)
Use case	API rate limiting	Traffic shaping

Most systems use Token Bucket because it’s simpler and allows bursts.

Algorithm 3: Fixed Window Counter

How it works:

Divide time into fixed windows (e.g., each minute)
Count requests in current window
Reset counter at window boundary

Parameters:

Window size: Duration (e.g., 1 minute)
Request limit: Max per window (e.g., 100)

Example:

Limit: 3 requests per minute
Window: 00:00:00 - 00:00:59

00:00:10 → Request → Counter: 1 ✅
00:00:20 → Request → Counter: 2 ✅
00:00:30 → Request → Counter: 3 ✅
00:00:40 → Request → Counter: 3 → ❌ REJECTED
00:00:50 → Request → Counter: 3 → ❌ REJECTED
00:01:00 → Counter resets to 0
00:01:05 → Request → Counter: 1 ✅

Pros:

✅ Very simple to implement
✅ Memory efficient (one counter per window)
✅ Easy to understand

Cons:

❌ Burst at window boundary (major issue!)
- At 00:00:59 → 100 requests
- At 00:01:00 → 100 requests
- Total: 200 requests in 2 seconds! (Double the limit)

Burst problem visualization:

Window 1: [______________________|********************]
          00:00                  00:00:59  (100 req in last 1 sec)

Window 2: [********************|______________________]
          00:01:00              00:01:59  (100 req in first 1 sec)

Total: 200 requests in 2 seconds! Should be 100 max.

When to use: Quick prototype, not production (use sliding window instead)

Redis implementation (simple):

def allow_request(user_id):
    key = f"rate_limit:{user_id}:{current_minute()}"
    count = redis.incr(key)
    redis.expire(key, 60)  # Expire after 1 minute
 
    if count <= 100:
        return True
    else:
        return False
 
def current_minute():
    return int(time.now() / 60)  # Window ID

Algorithm 4: Sliding Window Log 📊

How it works:

Store timestamp of each request
Count requests in sliding window (last N seconds)
Remove old timestamps outside window

Parameters:

Window size: Duration (e.g., 1 minute)
Request limit: Max per window (e.g., 100)

Example:

Limit: 3 requests per 60 seconds
Current time: 00:01:30

Request log:
00:00:35 → Remove (too old, > 60 sec ago)
00:01:00 → Keep (within 60 sec)
00:01:15 → Keep
00:01:20 → Keep
Count: 3 requests in last 60 seconds

00:01:30 → New request → Count: 4 → ❌ REJECTED
00:02:00 → New request → Count: 2 (00:01:00 expired) → ✅ ALLOWED

Pros:

✅ Very accurate (no boundary burst problem)
✅ True sliding window

Cons:

❌ Memory expensive (store all timestamps)
- 1M users × 100 requests/min × 8 bytes/timestamp = 800 MB
❌ Slow (need to iterate all timestamps to count)

When to use: Need high accuracy, low request volume

Redis implementation:

def allow_request(user_id):
    key = f"rate_limit:{user_id}"
    now = time.now()
    window_start = now - 60  # 60 second window
 
    # Remove old timestamps
    redis.zremrangebyscore(key, 0, window_start)
 
    # Count requests in window
    count = redis.zcard(key)
 
    if count < 100:
        # Add current timestamp
        redis.zadd(key, {now: now})
        redis.expire(key, 60)
        return True
    else:
        return False

Algorithm 5: Sliding Window Counter 🎯 (Best Balance!)

How it works:

Hybrid of fixed window + sliding window
Approximate sliding window using two fixed windows
Weighted average based on time in current window

Formula:

Requests in sliding window =
  Requests in current window +
  Requests in previous window × Overlap percentage

Example:

Limit: 100 requests per minute
Current time: 00:01:30 (30 seconds into current window)

Previous window (00:00 - 00:01): 80 requests
Current window (00:01 - 00:02): 40 requests

Overlap: 30 seconds into current minute
Overlap percentage: 30/60 = 50%

Estimated requests in sliding window:
= 40 (current) + 80 (previous) × 50%
= 40 + 40
= 80 requests

New request → 81 ≤ 100 → ✅ ALLOWED

Visual:

Previous window:  [████████80 req██████]
Current window:               [████40 req████|________]
                              00:01:00        00:01:30

Sliding window:               [←————— 60 sec ————→]
                              Uses 50% of prev + 100% of current

Pros:

✅ Good accuracy (better than fixed window)
✅ Memory efficient (only 2 counters)
✅ No burst problem
✅ Used by Cloudflare, Azure

Cons:

❌ Approximation (not perfectly accurate)
❌ Assumes even distribution in previous window

When to use: Production systems! Best balance of accuracy and efficiency.

Redis implementation:

def allow_request(user_id):
    now = time.now()
    current_window = int(now / 60)
    previous_window = current_window - 1
    elapsed_time_in_current_window = now % 60
 
    current_key = f"rate_limit:{user_id}:{current_window}"
    previous_key = f"rate_limit:{user_id}:{previous_window}"
 
    current_count = redis.get(current_key) or 0
    previous_count = redis.get(previous_key) or 0
 
    # Calculate weighted count
    overlap_percentage = (60 - elapsed_time_in_current_window) / 60
    estimated_count = current_count + previous_count * overlap_percentage
 
    if estimated_count < 100:
        redis.incr(current_key)
        redis.expire(current_key, 120)  # Keep for 2 windows
        return True
    else:
        return False

Algorithm Comparison

Algorithm	Pros	Cons	Memory	Accuracy	Use Case
Token Bucket	Simple, allows burst, industry standard	Tuning tricky	Low	Good	Default choice
Leaky Bucket	Smooth traffic	Queues requests, complex	Low	Good	Traffic shaping
Fixed Window	Very simple	Burst at boundary	Very Low	Poor	Prototypes only
Sliding Log	Very accurate	Memory expensive, slow	High	Excellent	Low volume, need accuracy
Sliding Window	Good accuracy, efficient	Approximation	Low	Very Good	Production systems

Recommendations:

Start with: Token Bucket (simple, industry standard)
For production: Sliding Window Counter (best balance)
Avoid: Fixed Window (burst problem)
For low volume: Sliding Log (if need perfect accuracy)

Step 4: Deep Dive - Distributed Rate Limiting

Challenge: Race Conditions

Problem: Multiple servers incrementing counter simultaneously

Server 1 reads count: 99
Server 2 reads count: 99
Server 1 increments: 100 ✅
Server 2 increments: 100 ✅
Actual count: 101 (should be rate limited!)

Solutions:

1. Redis INCR (Atomic):

count = redis.incr(key)  # Atomic operation
if count <= 100:
    return True

✅ Atomic, no race condition
✅ Fast

2. Lua Script in Redis (for complex logic):

-- Lua script runs atomically in Redis
local current = redis.call('get', KEYS[1])
if current and tonumber(current) >= 100 then
    return 0  -- Rate limited
else
    redis.call('incr', KEYS[1])
    redis.call('expire', KEYS[1], 60)
    return 1  -- Allowed
end

3. Locks (avoid if possible):

with redis.lock(f"lock:{user_id}"):
    count = redis.get(key)
    if count < 100:
        redis.incr(key)
        return True

❌ Slow (lock contention)
❌ Deadlock risk

Recommendation: Use Redis atomic operations (INCR, Lua scripts)

Challenge: Synchronization Across Regions

Problem: Multi-region deployment, users routed to different regions

US Region: User has 50 requests
EU Region: User has 50 requests
Total: 100 requests (limit reached)
But each region thinks user has 50 (under limit)

Solutions:

1. Centralized Redis Cluster (Strict consistency):

All regions talk to same Redis cluster
❌ Higher latency (cross-region)
✅ Accurate

2. Eventual Consistency (Relaxed):

Each region has own Redis
Periodically sync counts
✅ Low latency
❌ Less accurate (OK for most cases)

3. Sticky Sessions (Route same user to same region):

GeoDNS routes user to nearest region
✅ Simple
❌ Imbalanced load

Recommendation:

For strict enforcement: Centralized Redis (financial APIs)
For general APIs: Eventual consistency (good enough)

Challenge: Performance & Scalability

Bottleneck: Redis can become bottleneck at very high scale

Solutions:

1. Redis Cluster (Sharding):

Partition users across Redis shards
Hash(user_id) → Redis shard
✅ Scales horizontally

2. Local Rate Limiting + Global Rate Limiting:

# Fast local check (in-memory)
if local_rate_limiter.allow(user_id):
    # Slower global check (Redis)
    if global_rate_limiter.allow(user_id):
        return True

✅ Very fast (local cache)
⚠️ Less accurate (OK for high throughput)

3. Distributed Rate Limiting Libraries:

Use existing solutions (DoorKeeper, Resilience4j)
Handle complexities for you

Rate Limiting Rules & Configuration

Rule Storage

Option 1: Configuration File (Simple):

rules:
  - user_id: "123"
    limit: 1000
    window: "1d"
  - tier: "free"
    limit: 100
    window: "1h"
  - tier: "premium"
    limit: 10000
    window: "1h"
  - endpoint: "/api/search"
    limit: 10
    window: "1s"

Option 2: Database (Dynamic):

CREATE TABLE rate_limit_rules (
    rule_id INT PRIMARY KEY,
    entity_type VARCHAR(50),  -- 'user', 'ip', 'api_key'
    entity_id VARCHAR(255),
    endpoint VARCHAR(255),
    limit INT,
    window_seconds INT,
    created_at TIMESTAMP
);

Option 3: Rules Engine (Complex):

If user.tier == “premium” → 10000 req/hour
If endpoint == “/login” → 5 req/min
If ip.country == “suspicious” → 10 req/hour

Rule Priority

Multiple rules can apply to same request:

User ID rule: 1000 req/hour
IP rule: 100 req/hour
Endpoint rule: 10 req/sec

Which to apply? → Most restrictive (10 req/sec)

Combining rules:

AND: Must satisfy all rules
OR: Must satisfy at least one
Most restrictive: Apply strictest limit

Handling Rate Limit Exceeded

Response Format

HTTP 429 Response:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000
Retry-After: 60

{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded the rate limit. Please try again in 60 seconds."
}

Client Behavior

Best practices:

Exponential backoff: Wait 1s, 2s, 4s, 8s…
Respect Retry-After header: Wait specified time
Monitor X-RateLimit-Remaining: Stop before hitting limit
Queue non-urgent requests: Don’t spam API

Advanced Topics

1. Distributed Rate Limiting with Redis Cluster

Architecture:

API Gateway 1 ──┐
API Gateway 2 ──┼──→ Redis Cluster (Sharded)
API Gateway 3 ──┘     - Shard 1: Users A-H
                       - Shard 2: Users I-P
                       - Shard 3: Users Q-Z

Consistent hashing: Route same user to same shard

2. Rate Limiting by Multiple Dimensions

Example: Rate limit by (user + endpoint):

User 123 → /api/search: 10 req/sec
User 123 → /api/users: 100 req/min

Redis key design:

rate_limit:{user_id}:{endpoint}:{window}
rate_limit:123:/api/search:1640000000

3. Dynamic Rate Limiting

Adjust limits based on:

Time of day (higher during business hours)
User behavior (increase limit for good actors)
System load (decrease when overloaded)

Example:

def get_rate_limit(user_id, system_load):
    base_limit = 100
    if system_load > 0.9:  # 90% capacity
        return base_limit * 0.5  # Reduce by 50%
    return base_limit

4. Exemptions & Allowlists

Bypass rate limiting for:

Internal services
Health checks (monitoring)
Admin users
Trusted partners

if user.role == "admin":
    return True  # Bypass rate limiting

Design Summary

Final Architecture

┌─────────────────────────────────────────────────────┐
│                   API Gateway                       │
│                                                     │
│  ┌─────────────────────────────────────────────┐    │
│  │         Rate Limiter Middleware             │    │
│  │                                             │    │
│  │  1. Extract identifier (user/IP/API key)    │    │
│  │  2. Check Redis for current count           │    │
│  │  3. Apply rate limit algorithm              │    │
│  │  4. If allowed → Forward to backend         │    │
│  │  5. If blocked → Return HTTP 429            │    │
│  └─────────────────────────────────────────────┘    │
│                       ↓                             │
└───────────────────────┼─────────────────────────────┘
                        ↓
              ┌──────────────────┐
              │  Redis Cluster   │
              │  (Rate Counters) │
              │                  │
              │  Sliding Window  │
              │  or Token Bucket │
              └──────────────────┘

Key Decisions Summary

Decision	Choice	Reasoning
Algorithm	Token Bucket or Sliding Window Counter	Best balance of accuracy and efficiency
Storage	Redis	Fast, atomic operations, TTL support
Where	API Gateway	Centralized, don’t repeat in services
Distributed	Redis Cluster + Atomic operations	Handle race conditions
Rules	Database or Config file	Flexible, can update without deploy

Interview Questions & Answers

Q: How would you handle rate limiting for anonymous users?
A: Rate limit by IP address. But IP can be shared (NAT, VPN), so may need to be more lenient. Alternative: Use fingerprinting (browser, device) or require login for API access.

Q: What if Redis goes down?
A:

Fail open: Allow all requests (risk of abuse, but system stays up)
Fail closed: Reject all requests (secure, but bad user experience)
Fallback: Use local cache with eventual consistency
Best: Redis cluster with replication (high availability)

Q: How to prevent users from circumventing rate limits?
A:

Rate limit by multiple dimensions (user + IP + device)
Detect suspicious patterns (many IPs from one user)
Require authentication (can’t create infinite accounts)
Use CAPTCHA for anonymous endpoints
Monitor for abuse patterns

Q: Token Bucket vs Leaky Bucket?
A: Token Bucket allows burst traffic (use all tokens at once), Leaky Bucket enforces constant rate (smooth traffic). Token Bucket is more common because APIs naturally have bursty traffic.

Q: How to test rate limiter?
A:

Unit tests: Test algorithm logic
Integration tests: Test with Redis
Load tests: Simulate high traffic (millions of requests)
Chaos tests: Kill Redis, network partitions
Monitor: Track false positives/negatives in production

Key Takeaways

Token Bucket is the most popular algorithm (AWS, Stripe use it)
Sliding Window Counter gives best balance of accuracy and efficiency
Implement at API Gateway for centralized rate limiting
Use Redis for fast, atomic operations
Handle race conditions with atomic operations (INCR, Lua scripts)
Always return HTTP 429 with Retry-After header
Monitor false positives and adjust limits

distributed-system-components > 4. API Gateway - Where rate limiting is implemented
key-patterns > 1. Rate Limiting - Patterns and algorithms
ch03-framework-for-system-design - Apply framework to this problem

Practice this design! Very common interview question. Be ready to:

Explain different algorithms and trade-offs
Draw architecture diagram
Handle distributed system challenges (race conditions, multi-region)
Discuss Redis data structures and operations

Last Updated: 2026-04-09
Status: Common interview question - Must know!

Study Notes by Niladri & AI

Explorer

ch04-rate-limiter

Chapter 4: Design a Rate Limiter

Overview

Problem Statement

Step 1: Requirements & Scope (5 min)

Functional Requirements

Non-Functional Requirements

Step 2: High-Level Design (10 min)

Where to Implement Rate Limiting

Basic Architecture

API Design

Rate Limiting Rules

Step 3: Deep Dive - Rate Limiting Algorithms (20 min)

Algorithm 1: Token Bucket 🪣 (Most Popular!)

Algorithm 2: Leaky Bucket 💧

Algorithm 3: Fixed Window Counter

Algorithm 4: Sliding Window Log 📊

Algorithm 5: Sliding Window Counter 🎯 (Best Balance!)

Algorithm Comparison

Step 4: Deep Dive - Distributed Rate Limiting

Challenge: Race Conditions

Challenge: Synchronization Across Regions

Challenge: Performance & Scalability

Rate Limiting Rules & Configuration

Rule Storage

Rule Priority

Handling Rate Limit Exceeded

Response Format

Client Behavior

Advanced Topics

1. Distributed Rate Limiting with Redis Cluster

2. Rate Limiting by Multiple Dimensions

3. Dynamic Rate Limiting

4. Exemptions & Allowlists

Design Summary

Final Architecture

Key Decisions Summary

Interview Questions & Answers

Key Takeaways

Related Resources

Graph View

Table of Contents

Backlinks