Reliability, Microservices, Design Patterns, Intermediate

Circuit Breaker Pattern

A mechanism to prevent an application from repeatedly trying to execute an operation that's likely to fail.

Preventing Cascading Failures

If Service A calls Service B, and Service B is down, Service A shouldn't keep waiting and timing out. It should "trip the circuit" and fail fast.

The 3 States (Finite State Machine)

The Circuit Breaker is a state machine with three possible states:

mermaid
stateDiagram-v2
    [*] --> Closed
    
    state "Closed (Success)" as Closed
    state "Open (Failure)" as Open
    state "Half-Open (Test)" as HalfOpen

    Closed --> Open: Failure Count >= Threshold
    Open --> HalfOpen: Timeout Expired
    HalfOpen --> Closed: Success
    HalfOpen --> Open: Failure

1. Closed (Normal Operation)

  • Behavior: Requests flow through to the service normally.
  • Counting: We count failures (e.g., 500 errors, timeouts).
  • Tripping: If the failure count reaches the threshold within a time window, trip to Open.
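
The "within a time window" rule above can be sketched with a deque of failure timestamps. This is a minimal illustration, not part of the article's implementation: the class name `SlidingWindowFailureCounter`, its parameters, and the injectable `clock` are all assumptions made for the example, and it is not thread-safe on its own (a breaker would call it while holding its lock).

```python
import time
from collections import deque


class SlidingWindowFailureCounter:
    """Counts failures that occurred within the last `window_seconds` (hypothetical helper)."""

    def __init__(self, threshold=5, window_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._failure_times = deque()

    def record_failure(self):
        """Record one failure and return True if the breaker should trip."""
        now = self.clock()
        self._failure_times.append(now)
        self._evict_older_than(now - self.window_seconds)
        return len(self._failure_times) >= self.threshold

    def _evict_older_than(self, cutoff):
        # Timestamps are appended in order, so the oldest is always on the left.
        while self._failure_times and self._failure_times[0] <= cutoff:
            self._failure_times.popleft()
```

Old failures age out of the window, so a slow trickle of errors does not trip the breaker, while a burst of errors does.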

2. Open (Tripped)

  • Behavior: Requests are blocked immediately. The breaker throws a CircuitOpenException.
  • Why: This prevents the "Thundering Herd" problem, where thousands of requests hammer a struggling database and prevent it from recovering.
  • Reset: After recovery_timeout seconds, transition to Half-Open.

3. Half-Open (The Canary)

  • Behavior: We allow 1 request to pass through to test the waters.
  • Success: The external service is back! Reset counts and go to Closed.
  • Failure: It's still broken. Go back to Open and, optionally, double the wait time (Exponential Backoff). The basic implementation below keeps a fixed timeout.
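
The doubling rule for failed Half-Open trials could look roughly like this. It is a sketch under stated assumptions: the function name `next_open_timeout`, the `failed_probes` counter, and the `max_timeout` cap are hypothetical and do not come from the article.

```python
def next_open_timeout(base_timeout, failed_probes, max_timeout=300.0):
    """Return how long to stay Open after `failed_probes` consecutive failed trials.

    0 failed probes -> base_timeout, 1 -> 2x, 2 -> 4x, ... capped at max_timeout.
    The cap is an assumption; without one the wait would grow without bound.
    """
    return min(base_timeout * (2 ** failed_probes), max_timeout)


# Hypothetical usage: the wait grows after each failed trial request.
timeouts = [next_open_timeout(10.0, n) for n in range(6)]
print(timeouts)  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0]
```

A breaker using this would reset `failed_probes` to zero once a trial request succeeds and the circuit closes.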

Implementation (Python)

Here is a thread-safe implementation of a basic Circuit Breaker.

python
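import threading
import time


class CircuitOpenException(Exception):
    """Raised when a call is rejected because the circuit is not accepting requests."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=10):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay OPEN
        self.state = "CLOSED"
        self.failures = 0
        self.last_failure_time = 0.0
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == "OPEN":
                # Check if it's time to try again (Half-Open).
                # monotonic() is immune to wall-clock adjustments.
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = "HALF-OPEN"
                else:
                    raise CircuitOpenException("Circuit is OPEN")
            elif self.state == "HALF-OPEN":
                # A trial request is already in flight; reject the rest.
                raise CircuitOpenException("Circuit is HALF-OPEN")

        # Run the protected call outside the lock so a slow call doesn't block other threads.
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self.lock:
                self.failures += 1
                self.last_failure_time = time.monotonic()
                # If we fail in Half-Open, go back to Open immediately
                if self.failures >= self.failure_threshold or self.state == "HALF-OPEN":
                    self.state = "OPEN"
            raise  # re-raise with the original traceback

        # Success! Reset everything.
        with self.lock:
            self.state = "CLOSED"
            self.failures = 0

        return result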

Advanced Logic

1. Thundering Herd Problem

If 10,000 clients all retry at the same instant when a system recovers, the sudden spike can knock it over again. Solution: Add Jitter (randomness) to the retry interval so that retries spread out over time.
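
One common way to add jitter is to pick each retry delay uniformly at random below an exponentially growing ceiling, a variant often called "full jitter". The sketch below assumes that variant; the function name `jittered_delay` and its parameters are illustrative and not part of the article.

```python
import random


def jittered_delay(base_delay, attempt, max_delay=60.0, rng=random):
    """Pick a retry delay uniformly in [0, min(max_delay, base_delay * 2**attempt)].

    `rng` is injectable (any object with a `uniform` method) so results can be seeded.
    """
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Hypothetical usage: each client computes its own delay, so retries are spread out
# instead of arriving in the same millisecond.
delays = [jittered_delay(0.5, attempt=3) for _ in range(5)]
```

Because every client draws a different delay, the recovering service sees a smoothed trickle of retries rather than a single spike.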

2. Bulkhead Pattern

Often used alongside Circuit Breakers, the Bulkhead pattern isolates different parts of the system so failures don't cascade.

  • Example: The connection pool for Service A is separate from the one for Service B. If A is slow, it exhausts its own pool but doesn't starve B.
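
The separate-pools idea can be approximated in a few lines with one semaphore per dependency. This is a minimal sketch, assuming a thread-based client; `Bulkhead`, `BulkheadFullError`, and the slot counts are hypothetical names and values chosen for illustration, and a real connection pool would manage actual connections rather than just counting slots.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots (hypothetical error type)."""


class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service, so a slow Service A cannot use Service B's slots.
service_a_bulkhead = Bulkhead(max_concurrent=2)
service_b_bulkhead = Bulkhead(max_concurrent=2)
```

If every slot for Service A is held by a slow call, further calls to A are rejected right away, while calls through `service_b_bulkhead` proceed as usual.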

Why use it?

  • Resource Protection: Don't waste threads/connections waiting for a dead service.
  • User Experience: Return a fallback (cached data or "Service Unavailable") instantly instead of a 30s spinner.
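
Returning a fallback when the circuit is open might be sketched as follows. The `CircuitOpenException` class here is a stand-in for whatever error the breaker raises while Open, and `call_with_fallback` and `fetch_profile_while_open` are hypothetical names used only for this example.

```python
class CircuitOpenException(Exception):
    """Stand-in for the error a breaker raises while Open."""


def call_with_fallback(protected_call, fallback_value):
    """Return the live result, or `fallback_value` if the circuit is open."""
    try:
        return protected_call()
    except CircuitOpenException:
        # Fail fast: no network wait, just serve the cached or default response.
        return fallback_value


def fetch_profile_while_open():
    # Hypothetical call that simulates a breaker rejecting the request.
    raise CircuitOpenException("Circuit is OPEN")


print(call_with_fallback(fetch_profile_while_open, {"name": "cached user"}))
# {'name': 'cached user'}
```

Only the breaker's rejection is caught here; other exceptions still propagate so that genuine bugs are not hidden behind the fallback.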
