Preventing Cascading Failures
If Service A calls Service B, and Service B is down, Service A shouldn't keep waiting and timing out. It should "trip the circuit" and fail fast.
The 3 States (Finite State Machine)
The Circuit Breaker is a state machine with three possible states:
stateDiagram-v2
[*] --> Closed
state "Closed (Success)" as Closed
state "Open (Failure)" as Open
state "Half-Open (Test)" as HalfOpen
Closed --> Open: Failure Count > Threshold
Open --> HalfOpen: Timeout Expired
HalfOpen --> Closed: Success
HalfOpen --> Open: Failure
1. Closed (Normal Operation)
- Behavior: Requests flow through to the service normally.
- Counting: We count failures (e.g., 500 errors, timeouts).
- Tripping: If failures > threshold within a time window, trip to Open.
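One way to count failures "within a time window" is to keep the timestamps of recent failures and discard the ones that have aged out. The sketch below is only an illustration: the `FailureWindow` class, its parameter names, and the injectable `clock` are assumptions made for this example, not part of the implementation shown later.

```python
import time
from collections import deque


class FailureWindow:
    """Tracks failure timestamps inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=5, window_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the example can run without sleeping
        self._failures = deque()

    def record_failure(self):
        """Record one failure and return True if the breaker should trip."""
        now = self.clock()
        self._failures.append(now)
        self._evict(now)
        # Matches the rule in the text: trip when failures > threshold.
        return len(self._failures) > self.threshold

    def _evict(self, now):
        # Drop failures that happened before the start of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()


# Usage with a fake clock: the third failure exceeds a threshold of 2.
fake_now = [0.0]
window = FailureWindow(threshold=2, window_seconds=10.0, clock=lambda: fake_now[0])
tripped = [window.record_failure() for _ in range(3)]
```

Old failures stop counting once they fall outside the window, so a service that fails occasionally over a long period does not trip the breaker.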
2. Open (Tripped)
- Behavior: Requests are blocked immediately. The breaker throws a CircuitOpenException.
- Why: This prevents the "Thundering Herd" problem, where thousands of requests hammer a struggling database and keep it from recovering.
- Reset: After reset_timeout seconds, transition to Half-Open.
3. Half-Open (The Canary)
- Behavior: We allow 1 request to pass through to test the waters.
- Success: The external service is back! Reset counts and go to Closed.
- Failure: It's still broken. Go back to Open and double the wait time (Exponential Backoff).
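The "double the wait time" rule can be sketched as a small helper that tracks the current Open duration. This is a minimal illustration under assumptions: the `BackoffTimer` name, the method names, and the `max_timeout` cap are hypothetical (the text above only says to double the wait).

```python
class BackoffTimer:
    """Computes how long the breaker stays Open (hypothetical helper)."""

    def __init__(self, base_timeout=10.0, max_timeout=300.0):
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout  # assumption: cap so the wait cannot grow forever
        self.current_timeout = base_timeout

    def on_trial_failure(self):
        """Half-Open trial failed: double the wait, up to the cap."""
        self.current_timeout = min(self.current_timeout * 2, self.max_timeout)
        return self.current_timeout

    def on_trial_success(self):
        """Half-Open trial succeeded: go back to the base wait."""
        self.current_timeout = self.base_timeout
        return self.current_timeout


timer = BackoffTimer(base_timeout=10.0, max_timeout=60.0)
waits = [timer.on_trial_failure() for _ in range(4)]  # 20.0, 40.0, 60.0, 60.0
```

A breaker would read `current_timeout` instead of a fixed value when deciding whether to move from Open to Half-Open.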
Implementation (Python)
Here is a basic Circuit Breaker that guards its state with a lock. It is deliberately simplified: Half-Open does not restrict itself to a single trial request, and the wait time is not doubled after a failed trial.
import threading
import time


class CircuitOpenException(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=10):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay OPEN
        self.state = "CLOSED"
        self.failures = 0
        self.last_failure_time = 0.0
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == "OPEN":
                # Check if it's time to try again (Half-Open)
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = "HALF-OPEN"
                else:
                    raise CircuitOpenException("Circuit is OPEN")

        # Run the protected call outside the lock so a slow call
        # does not block other threads.
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self.lock:
                self.failures += 1
                self.last_failure_time = time.monotonic()
                # If we fail in Half-Open, go back to Open immediately
                if self.failures >= self.failure_threshold or self.state == "HALF-OPEN":
                    self.state = "OPEN"
            raise  # re-raise the original error with its traceback

        # Success! Reset everything.
        with self.lock:
            self.state = "CLOSED"
            self.failures = 0
        return result
Advanced Logic
1. Thundering Herd Problem
When a system recovers, if 10,000 users retry at the exact same millisecond, the system crashes again. Solution: Add Jitter (randomness) to the retry interval.
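A minimal sketch of jitter, assuming exponential backoff with the "full jitter" variant (each client picks a random delay between zero and the current ceiling). The function name `backoff_with_jitter` and its default values are hypothetical choices for this example.

```python
import random


def backoff_with_jitter(attempt, base=0.5, cap=30.0, rng=random):
    """Return a randomized delay in seconds for a 0-based retry attempt."""
    # Exponential ceiling: base, 2*base, 4*base, ... limited by cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients spread out.
    return rng.uniform(0.0, ceiling)


# Delays for five consecutive retries; values differ on every run.
delays = [backoff_with_jitter(attempt) for attempt in range(5)]
```

Because every client draws its own random delay, retries that would have landed in the same millisecond are spread across the whole interval.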
2. Bulkhead Pattern
Often used with Circuit Breakers. Isolate different parts of the system so failures don't cascade.
- Example: Connection pool for Service A is separate from Service B. If A is slow, it consumes its own pool but doesn't starve B.
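The separate-pool idea can be approximated in-process with one semaphore per dependency. The sketch below is an assumption-laden illustration, not a production bulkhead: the `Bulkhead` class and the choice to reject (rather than queue) when full are hypothetical.

```python
import threading


class Bulkhead:
    """Limits concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("Bulkhead full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate compartments: exhausting A's slots leaves B untouched.
bulkhead_a = Bulkhead(max_concurrent=1)
bulkhead_b = Bulkhead(max_concurrent=1)


def call_b_while_a_is_busy():
    # While this function runs, A's only slot is held; B still accepts calls.
    return bulkhead_b.call(lambda: "B ok")


result = bulkhead_a.call(call_b_while_a_is_busy)
```

Each dependency gets its own `Bulkhead`, so a slow Service A can only exhaust the slots reserved for A.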
Why use it?
- Resource Protection: Don't waste threads/connections waiting for a dead service.
- User Experience: Return a fallback (cached data or "Service Unavailable") instantly instead of a 30s spinner.
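The fallback idea can be sketched as a thin wrapper around the protected call. Everything named here is hypothetical: `CircuitOpenError` stands in for whatever error the breaker raises when open, and `fetch_profile_when_open` simulates a rejected request.

```python
class CircuitOpenError(Exception):
    """Stand-in for the breaker's 'circuit is open' error (hypothetical name)."""


def call_with_fallback(protected_call, fallback):
    """Run protected_call; if the circuit is open, return the fallback at once."""
    try:
        return protected_call()
    except CircuitOpenError:
        return fallback()


def fetch_profile_when_open():
    # Simulates a breaker in the Open state rejecting the request.
    raise CircuitOpenError("Circuit is OPEN")


cached_profile = {"name": "cached user", "stale": True}
response = call_with_fallback(fetch_profile_when_open, lambda: cached_profile)
```

Only the "circuit is open" error triggers the fallback here; other exceptions still propagate so real bugs are not hidden behind cached data.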