What is Rate Limiting?
Rate limiting controls how many requests a client can make to a service within a time window, protecting against abuse, ensuring fair resource allocation, and preventing system overload.
Think of it like a nightclub with a bouncer: "You can enter 100 times per hour, but not all at once."
Why Rate Limit?
1. Prevent DDoS Attacks
Attacker sends 1,000,000 requests/second Without rate limiting: Server crashes With rate limiting: Block after 1,000 requests/second
2. Fair Resource Allocation
User A: 10,000 requests/min (excessive) User B: 10 requests/min (normal) Without limits: User A consumes all resources, User B gets errors With limits: Both get fair share
3. Cost Control
Cloud auto-scaling without limits: Request spike → 100 new servers → $10,000 bill With rate limiting: Request spike → Reject excess → 10 servers → $1,000 bill
4. Prevent Brute Force Attacks
Login API without limits: Attacker tries 1M passwords in 10 minutes Login API with limits: Attacker limited to 5 attempts/minute → 83 hours per 1M passwords
Rate Limiting Algorithms
1. Token Bucket
Tokens refill at constant rate. Each request consumes a token. Allows bursts.
graph LR
Refill[Token Refill<br/>5 tokens/sec] -->|add tokens| Bucket[Token Bucket<br/>Capacity: 100]
Request[API Request] -->|consume 1 token| Bucket
Bucket -->|has tokens| Accept[✅ Accept Request]
Bucket -->|empty| Reject[❌ 429 Too Many Requests]
Implementation:
import time
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.refill_rate = refill_rate # tokens per second
self.tokens = capacity
self.last_refill = time.time()
def allow_request(self):
# Refill tokens based on elapsed time
now = time.time()
elapsed = now - self.last_refill
tokens_to_add = elapsed * self.refill_rate
self.tokens = min(self.capacity, self.tokens + tokens_to_add)
self.last_refill = now
# Try to consume a token
if self.tokens >= 1:
self.tokens -= 1
return True
return False
# Usage
limiter = TokenBucket(capacity=100, refill_rate=10) # 10 req/sec, burst of 100
for _ in range(150):
if limiter.allow_request():
process_request()
else:
return_429_error()
Pros:
- ✅ Allows traffic bursts (good UX)
- ✅ Smooth refill rate
- ✅ Memory efficient
Cons:
- ❌ Bursts can still overwhelm downstream services
Best for: User-facing APIs (good UX for occasional bursts)
2. Leaky Bucket
Requests enter a queue, processed at constant rate regardless of input.
graph TB
Requests[Incoming Requests<br/>Variable Rate] -->|enqueue| Queue[Queue/Bucket]
Queue -->|fixed rate| Process[Process at<br/>Constant Rate]
Queue -->|overflow| Drop[Drop Request<br/>429 Error]
Implementation:
from collections import deque
import time
class LeakyBucket:
def __init__(self, capacity, leak_rate):
self.capacity = capacity
self.leak_rate = leak_rate # requests per second
self.queue = deque()
self.last_leak = time.time()
def allow_request(self):
# Leak (process) requests
now = time.time()
elapsed = now - self.last_leak
leaks = int(elapsed * self.leak_rate)
for _ in range(min(leaks, len(self.queue))):
self.queue.popleft()
self.last_leak = now
# Try to add new request
if len(self.queue) < self.capacity:
self.queue.append(time.time())
return True
return False
# Usage
limiter = LeakyBucket(capacity=100, leak_rate=10)
Pros:
- ✅ Smooth, constant output rate
- ✅ Protects downstream services
Cons:
- ❌ Poor UX (bursts queued, adding latency)
- ❌ Queue can fill up and drop requests
Best for: Protecting backend systems from spikes
3. Fixed Window Counter
Count requests in fixed time windows (e.g., 100 requests per minute).
import time
from collections import defaultdict
class FixedWindowCounter:
def __init__(self, max_requests, window_size):
self.max_requests = max_requests
self.window_size = window_size # in seconds
self.counters = defaultdict(int)
def allow_request(self, user_id):
now = time.time()
window = int(now // self.window_size)
key = f"{user_id}:{window}"
self.counters[key] += 1
if self.counters[key] <= self.max_requests:
return True
return False
# Usage
limiter = FixedWindowCounter(max_requests=100, window_size=60) # 100/min
Problem: Boundary Issue
Window 1: 09:00:00 - 09:00:59 Window 2: 09:01:00 - 09:01:59 User sends: - 100 requests at 09:00:59 ✅ (allowed, window 1) - 100 requests at 09:01:00 ✅ (allowed, window 2) Result: 200 requests in 1 second! (boundary spike)
Pros:
- ✅ Very simple
- ✅ Memory efficient
Cons:
- ❌ Boundary spike problem
- ❌ Uneven traffic distribution
Best for: Simple use cases with low traffic
4. Sliding Window Log
Keep timestamp of each request, count requests in sliding window.
import time
from collections import deque
class SlidingWindowLog:
def __init__(self, max_requests, window_size):
self.max_requests = max_requests
self.window_size = window_size
self.requests = defaultdict(deque)
def allow_request(self, user_id):
now = time.time()
cutoff = now - self.window_size
# Remove old requests
while self.requests[user_id] and self.requests[user_id][0] < cutoff:
self.requests[user_id].popleft()
# Check if under limit
if len(self.requests[user_id]) < self.max_requests:
self.requests[user_id].append(now)
return True
return False
# Usage
limiter = SlidingWindowLog(max_requests=100, window_size=60)
Pros:
- ✅ No boundary issue
- ✅ Accurate
Cons:
- ❌ Memory intensive (stores all timestamps)
- ❌ O(N) complexity for cleanup
Best for: When accuracy is critical and traffic is moderate
5. Sliding Window Counter
Hybrid: combines fixed window efficiency with sliding window accuracy.
import time
class SlidingWindowCounter:
def __init__(self, max_requests, window_size):
self.max_requests = max_requests
self.window_size = window_size
self.counters = {}
def allow_request(self, user_id):
now = time.time()
current_window = int(now // self.window_size)
previous_window = current_window - 1
# Weight of previous window
elapsed_in_current = now % self.window_size
previous_weight = 1 - (elapsed_in_current / self.window_size)
# Calculate weighted count
current_count = self.counters.get(f"{user_id}:{current_window}", 0)
previous_count = self.counters.get(f"{user_id}:{previous_window}", 0)
estimated_count = previous_count * previous_weight + current_count
if estimated_count < self.max_requests:
key = f"{user_id}:{current_window}"
self.counters[key] = current_count + 1
return True
return False
Pros:
- ✅ Memory efficient
- ✅ Smooths boundary spikes
- ✅ Good accuracy
Cons:
- ❌ Approximate (not exact)
Best for: Production systems (best balance)
Comparison Table
| Algorithm | Memory | Accuracy | Boundary Issue | Bursts Allowed | Best For |
|---|---|---|---|---|---|
| Token Bucket | Low | Good | No | ✅ Yes | User-facing APIs |
| Leaky Bucket | Medium | Perfect | No | ❌ No | Backend protection |
| Fixed Window | Very Low | Poor | ✅ Yes | Partial | Simple cases |
| Sliding Log | High | Perfect | No | Controlled | Critical systems |
| Sliding Counter | Low | Good | No | Partial | Production (recommended) |
Distributed Rate Limiting
Challenge
Request 1 → Server A (count = 1) Request 2 → Server B (count = 1) Result: Each server thinks usage is low, both allow Actual: User exceeded limit!
Solution 1: Centralized Counter (Redis)
import redis
import time
class RedisRateLimiter:
def __init__(self, redis_client, max_requests, window_size):
self.redis = redis_client
self.max_requests = max_requests
self.window_size = window_size
def allow_request(self, user_id):
key = f"rate_limit:{user_id}"
now = time.time()
# Use Redis sorted set for sliding window
pipe = self.redis.pipeline()
# Remove old entries
pipe.zremrangebyscore(key, 0, now - self.window_size)
# Count requests in window
pipe.zcard(key)
# Add current request
pipe.zadd(key, {str(now): now})
# Set expiry
pipe.expire(key, self.window_size)
_, count, _, _ = pipe.execute()
return count < self.max_requests
# Usage
redis_client = redis.Redis(host='localhost', port=6379)
limiter = RedisRateLimiter(redis_client, max_requests=100, window_size=60)
Solution 2: Redis Lua Script (Atomic)
-- rate_limit.lua
local key = KEYS[1]
local max_requests = tonumber(ARGV[1])
local window_size = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window_size)
-- Get current count
local count = redis.call('ZCARD', key)
if count < max_requests then
redis.call('ZADD', key, now, now)
redis.call('EXPIRE', key, window_size)
return 1 -- Allow
else
return 0 -- Deny
end
# Python usage
rate_limit_script = redis_client.register_script(lua_script)
def allow_request(user_id):
result = rate_limit_script(
keys=[f"rate_limit:{user_id}"],
args=[100, 60, time.time()]
)
return result == 1
Real-World Examples
GitHub API
X-RateLimit-Limit: 5000 X-RateLimit-Remaining: 4999 X-RateLimit-Reset: 1372700873 X-RateLimit-Used: 1
Strategy: Sliding window, different limits per authentication level
- Unauthenticated: 60 requests/hour
- Authenticated: 5,000 requests/hour
- Enterprise: Custom limits
Stripe API
# Stripe uses token bucket # Response headers Stripe-RateLimit-Limit: 100 Stripe-RateLimit-Remaining: 99 Stripe-RateLimit-Reset: 1609459200
Retry logic:
import time
def call_stripe_api():
while True:
response = stripe.Charge.list(limit=100)
if response.status == 429: # Too Many Requests
retry_after = int(response.headers.get('Retry-After', 60))
time.sleep(retry_after)
continue
return response
Twitter API
Multiple tier limits:
App level: 450 requests / 15 minutes User level: 15 requests / 15 minutes Endpoint specific: Varies
Response Headers
Standard headers to communicate rate limiting:
@app.route('/api/resource')
def api_endpoint():
user_id = get_user_id()
if not limiter.allow_request(user_id):
return jsonify({'error': 'Rate limit exceeded'}), 429, {
'X-RateLimit-Limit': '100',
'X-RateLimit-Remaining': '0',
'X-RateLimit-Reset': str(int(time.time() + 60)),
'Retry-After': '60'
}
# Process request
return jsonify({'data': 'success'}), 200, {
'X-RateLimit-Limit': '100',
'X-RateLimit-Remaining': '99',
'X-RateLimit-Reset': str(int(time.time() + 60))
}
Interview Tips 💡
When discussing rate limiting in system design interviews:
- Ask about requirements: "What's the expected traffic pattern? Do we need to allow bursts?"
- Choose appropriate algorithm: "For user-facing API, Token Bucket allows bursts. For backend protection, Leaky Bucket..."
- Discuss distribution: "With multiple servers, we'd use Redis as centralized rate limit store..."
- Mention atomicity: "We'd use Lua scripts to ensure atomic increment and check..."
- User experience: "Return 429 with Retry-After header so clients know when to retry..."
- Multiple layers: "Rate limit at API gateway (global), service level (per-resource), and user level (fairness)..."
Related Concepts
- Token Bucket vs Leaky Bucket — Detailed algorithm comparison
- Redis Internals — Using Redis for distributed rate limiting
- API Gateway — Gateway-level rate limiting
- Backpressure — Handling overload at application level
- Load Balancing — Distributing rate-limited traffic
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
API Gateway Pattern
The single entry point for microservices. Implementing rate limiting, authentication, and protocol translation/aggregation.
Circuit Breaker Pattern
A mechanism to prevent an application from repeatedly trying to execute an operation that's likely to fail.
System Design: Notification System
How to send millions of SMS, Email, and Push notifications reliably. Message Queues, Rate Limiting, and Retry policies.