Token Bucket vs Leaky Bucket
Token Bucket and Leaky Bucket are two fundamental algorithms for rate limiting and traffic shaping. Despite their similar names, they serve different purposes and behave very differently.
Quick comparison:
- Token Bucket: Allows bursts, limits average rate
- Leaky Bucket: Smooths traffic, enforces constant rate
Token Bucket Algorithm
How It Works
Imagine a bucket that holds tokens (permissions to make requests):
graph LR
Refill["+R tokens/sec"] --> Bucket[Bucket: C capacity]
Request[Request arrives] -->|"consume 1 token"| Bucket
Bucket -->|"has tokens"| Allow[✓ Process immediately]
Bucket -->|"empty"| Reject[✗ Reject 429]
Mechanics:
- Bucket has capacity
Ctokens (initially full) - Tokens added at rate
Rper second - Each request consumes 1 token
- If bucket empty → reject request
- If tokens available → process immediately
Visual Example
Capacity: 10 tokens Refill: 1 token/second Time | Tokens | Event | Result -----|--------|--------------------|----------- 0s | 10 | User idle (10 sec) | Bucket full 0s | 10 | 5 requests arrive | ✓ All accepted (5 left) 1s | 6 | +1 refill | 6 tokens 1s | 6 | 8 requests arrive | ✓ First 6, ✗ Last 2 rejected 2s | 1 | +1 refill | 1 token 2s | 1 | 1 request arrives | ✓ Accepted
Key insight: Bursts allowed up to capacity, then limited to refill rate.
Implementation (Python)
import time
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.tokens = capacity
self.refill_rate = refill_rate
self.last_refill = time.time()
def consume(self, tokens=1):
self._refill()
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
tokens_to_add = elapsed * self.refill_rate
self.tokens = min(self.capacity, self.tokens + tokens_to_add)
self.last_refill = now
# Usage
bucket = TokenBucket(capacity=10, refill_rate=1)
if bucket.consume():
process_request()
else:
return_429_error()
Use Cases
✅ API Rate Limiting
Free tier: 100 req/min average, burst of 20 bucket = TokenBucket(capacity=20, refill_rate=100/60)
✅ User-Facing Services
- Allows natural user behavior (refresh button = 3 quick requests)
- Good UX: occasional bursts don't get penalized
✅ AWS API Gateway Throttling
throttling: burstLimit: 5000 # Token bucket capacity rateLimit: 10000 # Refill rate (requests/sec)
Leaky Bucket Algorithm
How It Works
Imagine a bucket with a hole in the bottom:
graph TD
Input[Requests arrive] -->|"add to queue"| Queue[Queue: Max size Q]
Queue -->|"leak at constant rate R"| Process[Process Request]
Queue -->|"queue full"| Drop[✗ Drop request]
Mechanics:
- Requests enter a queue (bucket)
- Requests processed (leak) at constant rate
R - If queue full → drop incoming requests
- Output rate is always
R, regardless of input
Visual Example
Queue Capacity: 5 requests Leak Rate: 1 request/second Time | Queue | Event | Result -----|-------|--------------------|------------- 0s | [] | 3 requests arrive | Queue: [R1, R2, R3] 1s | [R2,R3] | Process R1 | R1 processed 1s | [R2,R3,R4,R5,R6] | 3 more arrive | R6 dropped (queue full) 2s | [R3,R4,R5] | Process R2 | R2 processed 3s | [R4,R5] | Process R3 | R3 processed
Key insight: Output is smooth and constant, bursts are queued or dropped.
Implementation (Python)
import time
from collections import deque
class LeakyBucket:
def __init__(self, capacity, leak_rate):
self.capacity = capacity
self.queue = deque()
self.leak_rate = leak_rate
self.last_leak = time.time()
def add_request(self, request):
self._leak()
if len(self.queue) < self.capacity:
self.queue.append(request)
return True
return False # Queue full, drop request
def _leak(self):
"""Process requests at constant rate"""
now = time.time()
elapsed = now - self.last_leak
# Process requests that should have leaked
leaks = int(elapsed * self.leak_rate)
for _ in range(min(leaks, len(self.queue))):
request = self.queue.popleft()
process_request(request)
self.last_leak = now
# Usage
bucket = LeakyBucket(capacity=100, leak_rate=10)
if bucket.add_request(incoming_request):
print("Request queued")
else:
print("Request dropped (429)")
Use Cases
✅ Network Traffic Shaping
- ISPs use leaky bucket to smooth bursty traffic
- Prevents network congestion
✅ Video Streaming
- Output frames at constant 30 FPS
- Buffer handles temporary encoding delays
✅ Background Job Processing
- Process tasks at sustainable rate
- Prevents worker overload
Side-by-Side Comparison
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst Handling | ✅ Allows controlled bursts | ❌ Smooths bursts, no acceleration |
| Output Rate | Variable (instant if tokens available) | Constant (always leak rate) |
| Queueing | No queue, immediate decision | Queue requests for processing |
| Memory | O(1) — just track tokens | O(n) — store queued requests |
| Latency | Low (immediate response) | Higher (queuing delay) |
| User Experience | Better (responsive to bursts) | Worse (artificial delays) |
| Server Protection | Moderate | Excellent (guaranteed max rate) |
| Implementation | Simple | Moderate (need queue management) |
Traffic Pattern Comparison
Input: ▂▂▂███████▂▂▂▂▂▂▂ (burst in middle)
Token Bucket Output:
▂▂▂██████▂▂▂▂▂▂▂▂ (burst passed through, some rejected)
Leaky Bucket Output:
▂▂▂▃▃▃▃▃▃▃▃▃▃▃▃▂ (smoothed to constant rate)
When to Use Each
Choose Token Bucket When:
✅ User-facing APIs
- Users expect quick responses to bursts (page refresh)
- Good UX is priority
✅ Variable workload patterns
- Traffic naturally bursty
- Average rate matters more than instantaneous rate
✅ Multi-tier rate limiting
free_user = TokenBucket(capacity=10, refill_rate=1) # 1 req/sec, burst 10 paid_user = TokenBucket(capacity=1000, refill_rate=100) # 100 req/sec, burst 1000
Choose Leaky Bucket When:
✅ Protecting downstream services
- Database can only handle 100 QPS consistently
- Bursts would cause crashes
✅ Regulatory compliance
- Must guarantee max rate (e.g., rate-limited APIs)
✅ Fair scheduling
- Process items in order received
- No priority to bursty users
Hybrid Approaches
Leaky Bucket with Token Bucket
Combine both for flexibility:
class HybridRateLimiter:
def __init__(self):
# Token bucket for burst allowance
self.token_bucket = TokenBucket(capacity=100, refill_rate=10)
# Leaky bucket for smoothing output
self.leaky_bucket = LeakyBucket(capacity=50, leak_rate=10)
def handle_request(self, request):
# First check: Can accept burst?
if not self.token_bucket.consume():
return False, "Rate limit exceeded (burst)"
# Second check: Queue for processing
if not self.leaky_bucket.add_request(request):
return False, "Server busy (queue full)"
return True, "Request queued"
Benefits:
- Token bucket prevents abuse
- Leaky bucket protects server
Real-World Examples
Nginx Rate Limiting (Leaky Bucket)
http {
# Define rate limiting zone (leaky bucket)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
server {
location /api/ {
limit_req zone=api burst=20 nodelay;
# Leak rate: 10/sec
# Burst capacity: 20 requests
}
}
}
AWS API Gateway (Token Bucket)
# CloudFormation configuration
UsagePlan:
Type: AWS::ApiGateway::UsagePlan
Properties:
Throttle:
BurstLimit: 5000 # Token bucket capacity
RateLimit: 10000 # Refill rate (req/sec)
Google Cloud Armor (Hybrid)
Combines token bucket (per-user limits) with leaky bucket (server protection):
rateLimitOptions:
conformAction: allow
exceedAction: deny
rateLimitThreshold:
count: 100 # Token bucket capacity
intervalSec: 60 # Refill period
banThreshold:
count: 10000 # Leaky bucket queue size
intervalSec: 600
Distributed Implementation Challenges
Problem: Multiple Servers
Server A: User uses 50 tokens Server B: User uses 50 tokens Total: 100 tokens consumed (2x the limit!)
Solution: Centralized Redis
-- token_bucket.lua (Redis Lua script)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + (elapsed * refill_rate))
-- Consume token
local allowed = 0
if tokens >= 1 then
tokens = tokens - 1
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return {allowed, tokens}
Interview Tips 💡
When discussing rate limiting in system design interviews:
- Clarify requirements: "Do we need to allow bursts for better UX, or enforce strict constant rates?"
- Compare algorithms: "Token bucket allows occasional bursts, while leaky bucket smooths traffic..."
- Justify choice: "For this user-facing API, token bucket is better because users might refresh pages..."
- Discuss distribution: "We'd use Redis with Lua scripts to share rate limit state across servers..."
- Mention alternatives: "We could also use fixed window counters for simpler cases..."
- Real examples: "AWS API Gateway uses token bucket with separate burst and rate limits..."
Related Concepts
- Rate Limiting — Overview of rate limiting strategies
- Circuit Breaker — Protecting services from cascading failures
- API Gateway — Where rate limiting is often implemented
- Backpressure — Related flow control in streaming systems
- Redis Internals — Distributed rate limiting with Redis
About ScaleWiki
ScaleWiki is an interactive educational platform dedicated to demystifying distributed systems, software architecture, and system design. Our mission is to provide high-quality, technically accurate resources for software engineers preparing for interviews or solving complex scaling challenges in production.
Read more about our Editorial Guidelines & Authorship.
Educational Disclaimer: The architectural patterns and system designs discussed in this article are based on common industry practices, technical whitepapers, and public engineering blogs. Actual implementations in enterprise environments may vary significantly based on specific product requirements, legacy constraints, and evolving technologies.
Related Articles
Geohashing (Location Encoding)
A geocoding system that encodes latitude/longitude coordinates into short alphanumeric strings for efficient proximity searches and spatial indexing.
Raft Consensus Algorithm
A comprehensive guide to Raft, the consensus algorithm powering Etcd, Consul, and Kubernetes. Leader election, log replication, safety guarantees, and production deployment patterns.
Top-K Heavy Hitters (Streaming)
How to find 'Trending Topics' in massive real-time streams. MapReduce, Count-Min Sketches, and Sliding Windows.