Back to All Concepts
SecurityReliabilityAPI DesignIntermediate

Rate Limiting

Traffic control mechanism that limits the number of requests a user can make to prevent abuse, ensure fairness, and protect system resources from overload.

What is Rate Limiting?

Rate limiting controls how many requests a client can make to a service within a time window, protecting against abuse, ensuring fair resource allocation, and preventing system overload.

Think of it like a nightclub with a bouncer: "You can enter 100 times per hour, but not all at once."

Why Rate Limit?

1. Prevent DDoS Attacks

Attacker sends 1,000,000 requests/second
Without rate limiting: Server crashes
With rate limiting: Block after 1,000 requests/second
Click to expand code...

2. Fair Resource Allocation

User A: 10,000 requests/min (excessive)
User B: 10 requests/min (normal)

Without limits: User A consumes all resources, User B gets errors
With limits: Both get fair share
Click to expand code...

3. Cost Control

Cloud auto-scaling without limits:
Request spike → 100 new servers → $10,000 bill

With rate limiting:
Request spike → Reject excess → 10 servers → $1,000 bill
Click to expand code...

4. Prevent Brute Force Attacks

Login API without limits:
Attacker tries 1M passwords in 10 minutes

Login API with limits:
Attacker limited to 5 attempts/minute → 83 hours per 1M passwords
Click to expand code...

Rate Limiting Algorithms

1. Token Bucket

Tokens refill at constant rate. Each request consumes a token. Allows bursts.

mermaid
graph LR
    Refill[Token Refill<br/>5 tokens/sec] -->|add tokens| Bucket[Token Bucket<br/>Capacity: 100]
    Request[API Request] -->|consume 1 token| Bucket
    Bucket -->|has tokens| Accept[✅ Accept Request]
    Bucket -->|empty| Reject[❌ 429 Too Many Requests]
Click to expand code...

Implementation:

python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
    
    def allow_request(self):
        # Refill tokens based on elapsed time
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
        
        # Try to consume a token
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage
limiter = TokenBucket(capacity=100, refill_rate=10)  # 10 req/sec, burst of 100

for _ in range(150):
    if limiter.allow_request():
        process_request()
    else:
        return_429_error()
Click to expand code...

Pros:

  • ✅ Allows traffic bursts (good UX)
  • ✅ Smooth refill rate
  • ✅ Memory efficient

Cons:

  • ❌ Bursts can still overwhelm downstream services

Best for: User-facing APIs (good UX for occasional bursts)


2. Leaky Bucket

Requests enter a queue, processed at constant rate regardless of input.

mermaid
graph TB
    Requests[Incoming Requests<br/>Variable Rate] -->|enqueue| Queue[Queue/Bucket]
    Queue -->|fixed rate| Process[Process at<br/>Constant Rate]
    Queue -->|overflow| Drop[Drop Request<br/>429 Error]
Click to expand code...

Implementation:

python
from collections import deque
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.last_leak = time.time()
    
    def allow_request(self):
        # Leak (process) requests
        now = time.time()
        elapsed = now - self.last_leak
        leaks = int(elapsed * self.leak_rate)
        
        for _ in range(min(leaks, len(self.queue))):
            self.queue.popleft()
        
        self.last_leak = now
        
        # Try to add new request
        if len(self.queue) < self.capacity:
            self.queue.append(time.time())
            return True
        return False

# Usage
limiter = LeakyBucket(capacity=100, leak_rate=10)
Click to expand code...

Pros:

  • ✅ Smooth, constant output rate
  • ✅ Protects downstream services

Cons:

  • ❌ Poor UX (bursts queued, adding latency)
  • ❌ Queue can fill up and drop requests

Best for: Protecting backend systems from spikes


3. Fixed Window Counter

Count requests in fixed time windows (e.g., 100 requests per minute).

python
import time
from collections import defaultdict

class FixedWindowCounter:
    def __init__(self, max_requests, window_size):
        self.max_requests = max_requests
        self.window_size = window_size  # in seconds
        self.counters = defaultdict(int)
    
    def allow_request(self, user_id):
        now = time.time()
        window = int(now // self.window_size)
        key = f"{user_id}:{window}"
        
        self.counters[key] += 1
        
        if self.counters[key] <= self.max_requests:
            return True
        return False

# Usage
limiter = FixedWindowCounter(max_requests=100, window_size=60)  # 100/min
Click to expand code...

Problem: Boundary Issue

Window 1: 09:00:00 - 09:00:59
Window 2: 09:01:00 - 09:01:59

User sends:
- 100 requests at 09:00:59 ✅ (allowed, window 1)
- 100 requests at 09:01:00 ✅ (allowed, window 2)

Result: 200 requests in 1 second! (boundary spike)
Click to expand code...

Pros:

  • ✅ Very simple
  • ✅ Memory efficient

Cons:

  • ❌ Boundary spike problem
  • ❌ Uneven traffic distribution

Best for: Simple use cases with low traffic


4. Sliding Window Log

Keep timestamp of each request, count requests in sliding window.

python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, max_requests, window_size):
        self.max_requests = max_requests
        self.window_size = window_size
        self.requests = defaultdict(deque)
    
    def allow_request(self, user_id):
        now = time.time()
        cutoff = now - self.window_size
        
        # Remove old requests
        while self.requests[user_id] and self.requests[user_id][0] < cutoff:
            self.requests[user_id].popleft()
        
        # Check if under limit
        if len(self.requests[user_id]) < self.max_requests:
            self.requests[user_id].append(now)
            return True
        return False

# Usage
limiter = SlidingWindowLog(max_requests=100, window_size=60)
Click to expand code...

Pros:

  • ✅ No boundary issue
  • ✅ Accurate

Cons:

  • ❌ Memory intensive (stores all timestamps)
  • ❌ O(N) complexity for cleanup

Best for: When accuracy is critical and traffic is moderate


5. Sliding Window Counter

Hybrid: combines fixed window efficiency with sliding window accuracy.

python
import time

class SlidingWindowCounter:
    def __init__(self, max_requests, window_size):
        self.max_requests = max_requests
        self.window_size = window_size
        self.counters = {}
    
    def allow_request(self, user_id):
        now = time.time()
        current_window = int(now // self.window_size)
        previous_window = current_window - 1
        
        # Weight of previous window
        elapsed_in_current = now % self.window_size
        previous_weight = 1 - (elapsed_in_current / self.window_size)
        
        # Calculate weighted count
        current_count = self.counters.get(f"{user_id}:{current_window}", 0)
        previous_count = self.counters.get(f"{user_id}:{previous_window}", 0)
        
        estimated_count = previous_count * previous_weight + current_count
        
        if estimated_count < self.max_requests:
            key = f"{user_id}:{current_window}"
            self.counters[key] = current_count + 1
            return True
        return False
Click to expand code...

Pros:

  • ✅ Memory efficient
  • ✅ Smooths boundary spikes
  • ✅ Good accuracy

Cons:

  • ❌ Approximate (not exact)

Best for: Production systems (best balance)


Comparison Table

AlgorithmMemoryAccuracyBoundary IssueBursts AllowedBest For
Token BucketLowGoodNo✅ YesUser-facing APIs
Leaky BucketMediumPerfectNo❌ NoBackend protection
Fixed WindowVery LowPoor✅ YesPartialSimple cases
Sliding LogHighPerfectNoControlledCritical systems
Sliding CounterLowGoodNoPartialProduction (recommended)

Distributed Rate Limiting

Challenge

Request 1 → Server A (count = 1)
Request 2 → Server B (count = 1)
Result: Each server thinks usage is low, both allow
Actual: User exceeded limit!
Click to expand code...

Solution 1: Centralized Counter (Redis)

python
import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_client, max_requests, window_size):
        self.redis = redis_client
        self.max_requests = max_requests
        self.window_size = window_size
    
    def allow_request(self, user_id):
        key = f"rate_limit:{user_id}"
        now = time.time()
        
        # Use Redis sorted set for sliding window
        pipe = self.redis.pipeline()
        
        # Remove old entries
        pipe.zremrangebyscore(key, 0, now - self.window_size)
        
        # Count requests in window
        pipe.zcard(key)
        
        # Add current request
        pipe.zadd(key, {str(now): now})
        
        # Set expiry
        pipe.expire(key, self.window_size)
        
        _, count, _, _ = pipe.execute()
        
        return count < self.max_requests

# Usage
redis_client = redis.Redis(host='localhost', port=6379)
limiter = RedisRateLimiter(redis_client, max_requests=100, window_size=60)
Click to expand code...

Solution 2: Redis Lua Script (Atomic)

lua
-- rate_limit.lua
local key = KEYS[1]
local max_requests = tonumber(ARGV[1])
local window_size = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window_size)

-- Get current count
local count = redis.call('ZCARD', key)

if count < max_requests then
    redis.call('ZADD', key, now, now)
    redis.call('EXPIRE', key, window_size)
    return 1  -- Allow
else
    return 0  -- Deny
end
Click to expand code...
python
# Python usage
rate_limit_script = redis_client.register_script(lua_script)

def allow_request(user_id):
    result = rate_limit_script(
        keys=[f"rate_limit:{user_id}"],
        args=[100, 60, time.time()]
    )
    return result == 1
Click to expand code...

Real-World Examples

GitHub API

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
X-RateLimit-Reset: 1372700873
X-RateLimit-Used: 1
Click to expand code...

Strategy: Sliding window, different limits per authentication level

  • Unauthenticated: 60 requests/hour
  • Authenticated: 5,000 requests/hour
  • Enterprise: Custom limits

Stripe API

python
# Stripe uses token bucket
# Response headers
Stripe-RateLimit-Limit: 100
Stripe-RateLimit-Remaining: 99
Stripe-RateLimit-Reset: 1609459200
Click to expand code...

Retry logic:

python
import time

def call_stripe_api():
    while True:
        response = stripe.Charge.list(limit=100)
        
        if response.status == 429:  # Too Many Requests
            retry_after = int(response.headers.get('Retry-After', 60))
            time.sleep(retry_after)
            continue
        
        return response
Click to expand code...

Twitter API

Multiple tier limits:

App level: 450 requests / 15 minutes
User level: 15 requests / 15 minutes
Endpoint specific: Varies
Click to expand code...

Response Headers

Standard headers to communicate rate limiting:

python
@app.route('/api/resource')
def api_endpoint():
    user_id = get_user_id()
    
    if not limiter.allow_request(user_id):
        return jsonify({'error': 'Rate limit exceeded'}), 429, {
            'X-RateLimit-Limit': '100',
            'X-RateLimit-Remaining': '0',
            'X-RateLimit-Reset': str(int(time.time() + 60)),
            'Retry-After': '60'
        }
    
    # Process request
    return jsonify({'data': 'success'}), 200, {
        'X-RateLimit-Limit': '100',
        'X-RateLimit-Remaining': '99',
        'X-RateLimit-Reset': str(int(time.time() + 60))
    }
Click to expand code...

Interview Tips 💡

When discussing rate limiting in system design interviews:

  1. Ask about requirements: "What's the expected traffic pattern? Do we need to allow bursts?"
  2. Choose appropriate algorithm: "For user-facing API, Token Bucket allows bursts. For backend protection, Leaky Bucket..."
  3. Discuss distribution: "With multiple servers, we'd use Redis as centralized rate limit store..."
  4. Mention atomicity: "We'd use Lua scripts to ensure atomic increment and check..."
  5. User experience: "Return 429 with Retry-After header so clients know when to retry..."
  6. Multiple layers: "Rate limit at API gateway (global), service level (per-resource), and user level (fairness)..."

Related Concepts

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Related Articles