What is Rate Limiting?

Rate limiting controls how many requests a client can make to a service within a time window, protecting against abuse, ensuring fair resource allocation, and preventing system overload.

Think of it like a nightclub with a bouncer: "You can enter 100 times per hour, but not all at once."

Why Rate Limit?

1. Prevent DDoS Attacks

Attacker sends 1,000,000 requests/second
Without rate limiting: Server crashes
With rate limiting: Block after 1,000 requests/second

Click to expand code...

2. Fair Resource Allocation

User A: 10,000 requests/min (excessive)
User B: 10 requests/min (normal)

Without limits: User A consumes all resources, User B gets errors
With limits: Both get fair share

Click to expand code...

3. Cost Control

Cloud auto-scaling without limits:
Request spike → 100 new servers → $10,000 bill

With rate limiting:
Request spike → Reject excess → 10 servers → $1,000 bill

Click to expand code...

4. Prevent Brute Force Attacks

Login API without limits:
Attacker tries 1M passwords in 10 minutes

Login API with limits:
Attacker limited to 5 attempts/minute → 83 hours per 1M passwords

Click to expand code...

Rate Limiting Algorithms

1. Token Bucket

Tokens refill at constant rate. Each request consumes a token. Allows bursts.

mermaid

graph LR
    Refill[Token Refill<br/>5 tokens/sec] -->|add tokens| Bucket[Token Bucket<br/>Capacity: 100]
    Request[API Request] -->|consume 1 token| Bucket
    Bucket -->|has tokens| Accept[✅ Accept Request]
    Bucket -->|empty| Reject[❌ 429 Too Many Requests]

Click to expand code...

Implementation:

python

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
    
    def allow_request(self):
        # Refill tokens based on elapsed time
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
        
        # Try to consume a token
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage
limiter = TokenBucket(capacity=100, refill_rate=10)  # 10 req/sec, burst of 100

for _ in range(150):
    if limiter.allow_request():
        process_request()
    else:
        return_429_error()

Click to expand code...

Pros:

✅ Allows traffic bursts (good UX)
✅ Smooth refill rate
✅ Memory efficient

Cons:

❌ Bursts can still overwhelm downstream services

Best for: User-facing APIs (good UX for occasional bursts)

2. Leaky Bucket

Requests enter a queue, processed at constant rate regardless of input.

mermaid

graph TB
    Requests[Incoming Requests<br/>Variable Rate] -->|enqueue| Queue[Queue/Bucket]
    Queue -->|fixed rate| Process[Process at<br/>Constant Rate]
    Queue -->|overflow| Drop[Drop Request<br/>429 Error]

Click to expand code...

Implementation:

python

from collections import deque
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.last_leak = time.time()
    
    def allow_request(self):
        # Leak (process) requests
        now = time.time()
        elapsed = now - self.last_leak
        leaks = int(elapsed * self.leak_rate)
        
        for _ in range(min(leaks, len(self.queue))):
            self.queue.popleft()
        
        self.last_leak = now
        
        # Try to add new request
        if len(self.queue) < self.capacity:
            self.queue.append(time.time())
            return True
        return False

# Usage
limiter = LeakyBucket(capacity=100, leak_rate=10)

Click to expand code...

Pros:

✅ Smooth, constant output rate
✅ Protects downstream services

Cons:

❌ Poor UX (bursts queued, adding latency)
❌ Queue can fill up and drop requests

Best for: Protecting backend systems from spikes

3. Fixed Window Counter

Count requests in fixed time windows (e.g., 100 requests per minute).

python

import time
from collections import defaultdict

class FixedWindowCounter:
    def __init__(self, max_requests, window_size):
        self.max_requests = max_requests
        self.window_size = window_size  # in seconds
        self.counters = defaultdict(int)
    
    def allow_request(self, user_id):
        now = time.time()
        window = int(now // self.window_size)
        key = f"{user_id}:{window}"
        
        self.counters[key] += 1
        
        if self.counters[key] <= self.max_requests:
            return True
        return False

# Usage
limiter = FixedWindowCounter(max_requests=100, window_size=60)  # 100/min

Click to expand code...

Problem: Boundary Issue

Window 1: 09:00:00 - 09:00:59
Window 2: 09:01:00 - 09:01:59

User sends:
- 100 requests at 09:00:59 ✅ (allowed, window 1)
- 100 requests at 09:01:00 ✅ (allowed, window 2)

Result: 200 requests in 1 second! (boundary spike)

Click to expand code...

Pros:

✅ Very simple
✅ Memory efficient

Cons:

❌ Boundary spike problem
❌ Uneven traffic distribution

Best for: Simple use cases with low traffic

4. Sliding Window Log

Keep timestamp of each request, count requests in sliding window.

python

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, max_requests, window_size):
        self.max_requests = max_requests
        self.window_size = window_size
        self.requests = defaultdict(deque)
    
    def allow_request(self, user_id):
        now = time.time()
        cutoff = now - self.window_size
        
        # Remove old requests
        while self.requests[user_id] and self.requests[user_id][0] < cutoff:
            self.requests[user_id].popleft()
        
        # Check if under limit
        if len(self.requests[user_id]) < self.max_requests:
            self.requests[user_id].append(now)
            return True
        return False

# Usage
limiter = SlidingWindowLog(max_requests=100, window_size=60)

Click to expand code...

Pros:

✅ No boundary issue
✅ Accurate

Cons:

❌ Memory intensive (stores all timestamps)
❌ O(N) complexity for cleanup

Best for: When accuracy is critical and traffic is moderate

5. Sliding Window Counter

Hybrid: combines fixed window efficiency with sliding window accuracy.

python

import time

class SlidingWindowCounter:
    def __init__(self, max_requests, window_size):
        self.max_requests = max_requests
        self.window_size = window_size
        self.counters = {}
    
    def allow_request(self, user_id):
        now = time.time()
        current_window = int(now // self.window_size)
        previous_window = current_window - 1
        
        # Weight of previous window
        elapsed_in_current = now % self.window_size
        previous_weight = 1 - (elapsed_in_current / self.window_size)
        
        # Calculate weighted count
        current_count = self.counters.get(f"{user_id}:{current_window}", 0)
        previous_count = self.counters.get(f"{user_id}:{previous_window}", 0)
        
        estimated_count = previous_count * previous_weight + current_count
        
        if estimated_count < self.max_requests:
            key = f"{user_id}:{current_window}"
            self.counters[key] = current_count + 1
            return True
        return False

Click to expand code...

Pros:

✅ Memory efficient
✅ Smooths boundary spikes
✅ Good accuracy

Cons:

❌ Approximate (not exact)

Best for: Production systems (best balance)

Comparison Table

Algorithm	Memory	Accuracy	Boundary Issue	Bursts Allowed	Best For
Token Bucket	Low	Good	No	✅ Yes	User-facing APIs
Leaky Bucket	Medium	Perfect	No	❌ No	Backend protection
Fixed Window	Very Low	Poor	✅ Yes	Partial	Simple cases
Sliding Log	High	Perfect	No	Controlled	Critical systems
Sliding Counter	Low	Good	No	Partial	Production (recommended)

Distributed Rate Limiting

Challenge

Request 1 → Server A (count = 1)
Request 2 → Server B (count = 1)
Result: Each server thinks usage is low, both allow
Actual: User exceeded limit!

Click to expand code...

Solution 1: Centralized Counter (Redis)

python

import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_client, max_requests, window_size):
        self.redis = redis_client
        self.max_requests = max_requests
        self.window_size = window_size
    
    def allow_request(self, user_id):
        key = f"rate_limit:{user_id}"
        now = time.time()
        
        # Use Redis sorted set for sliding window
        pipe = self.redis.pipeline()
        
        # Remove old entries
        pipe.zremrangebyscore(key, 0, now - self.window_size)
        
        # Count requests in window
        pipe.zcard(key)
        
        # Add current request
        pipe.zadd(key, {str(now): now})
        
        # Set expiry
        pipe.expire(key, self.window_size)
        
        _, count, _, _ = pipe.execute()
        
        return count < self.max_requests

# Usage
redis_client = redis.Redis(host='localhost', port=6379)
limiter = RedisRateLimiter(redis_client, max_requests=100, window_size=60)

Click to expand code...

Solution 2: Redis Lua Script (Atomic)

lua

-- rate_limit.lua
local key = KEYS[1]
local max_requests = tonumber(ARGV[1])
local window_size = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window_size)

-- Get current count
local count = redis.call('ZCARD', key)

if count < max_requests then
    redis.call('ZADD', key, now, now)
    redis.call('EXPIRE', key, window_size)
    return 1  -- Allow
else
    return 0  -- Deny
end

Click to expand code...

python

# Python usage
rate_limit_script = redis_client.register_script(lua_script)

def allow_request(user_id):
    result = rate_limit_script(
        keys=[f"rate_limit:{user_id}"],
        args=[100, 60, time.time()]
    )
    return result == 1

Click to expand code...

Real-World Examples

GitHub API

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
X-RateLimit-Reset: 1372700873
X-RateLimit-Used: 1

Click to expand code...

Strategy: Sliding window, different limits per authentication level

Unauthenticated: 60 requests/hour
Authenticated: 5,000 requests/hour
Enterprise: Custom limits

Stripe API

python

# Stripe uses token bucket
# Response headers
Stripe-RateLimit-Limit: 100
Stripe-RateLimit-Remaining: 99
Stripe-RateLimit-Reset: 1609459200

Click to expand code...

Retry logic:

python

import time

def call_stripe_api():
    while True:
        response = stripe.Charge.list(limit=100)
        
        if response.status == 429:  # Too Many Requests
            retry_after = int(response.headers.get('Retry-After', 60))
            time.sleep(retry_after)
            continue
        
        return response

Click to expand code...

Twitter API

Multiple tier limits:

App level: 450 requests / 15 minutes
User level: 15 requests / 15 minutes
Endpoint specific: Varies

Click to expand code...

Response Headers

Standard headers to communicate rate limiting:

python

@app.route('/api/resource')
def api_endpoint():
    user_id = get_user_id()
    
    if not limiter.allow_request(user_id):
        return jsonify({'error': 'Rate limit exceeded'}), 429, {
            'X-RateLimit-Limit': '100',
            'X-RateLimit-Remaining': '0',
            'X-RateLimit-Reset': str(int(time.time() + 60)),
            'Retry-After': '60'
        }
    
    # Process request
    return jsonify({'data': 'success'}), 200, {
        'X-RateLimit-Limit': '100',
        'X-RateLimit-Remaining': '99',
        'X-RateLimit-Reset': str(int(time.time() + 60))
    }

Click to expand code...

Interview Tips 💡

When discussing rate limiting in system design interviews:

Ask about requirements: "What's the expected traffic pattern? Do we need to allow bursts?"
Choose appropriate algorithm: "For user-facing API, Token Bucket allows bursts. For backend protection, Leaky Bucket..."
Discuss distribution: "With multiple servers, we'd use Redis as centralized rate limit store..."
Mention atomicity: "We'd use Lua scripts to ensure atomic increment and check..."
User experience: "Return 429 with Retry-After header so clients know when to retry..."
Multiple layers: "Rate limit at API gateway (global), service level (per-resource), and user level (fairness)..."

Related Concepts

Token Bucket vs Leaky Bucket — Detailed algorithm comparison
Redis Internals — Using Redis for distributed rate limiting
API Gateway — Gateway-level rate limiting
Backpressure — Handling overload at application level
Load Balancing — Distributing rate-limited traffic

About ScaleWiki

ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.

Read more about our Editorial Guidelines & Authorship.

Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.

Beginner