Reliability: critical

Circuit Breaker

A circuit breaker stops calls to a failing downstream service after a failure threshold is exceeded, returning an error immediately (or a cached fallback) until the service recovers, preventing cascading failures.

Memory anchor

Circuit breaker = your home's electrical breaker panel. When a downstream appliance (service) short-circuits, the breaker TRIPS to protect the whole house. Half-open = cautiously flipping the switch back to test if the toaster stopped smoking.

Expected depth

Three states: Closed (normal operation, calls pass through), Open (calls fail fast, downstream not called), Half-Open (allow a test request; if it succeeds, close the circuit; if it fails, remain open). Failure threshold triggers Open state: e.g., 50% error rate over 10 requests in 60 seconds. The breaker periodically transitions to Half-Open to test recovery. Libraries: Hystrix (Netflix), Resilience4j, Polly (.NET).
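The three-state machine above can be sketched as a small count-based wrapper. This is a minimal illustration, not any particular library's API; names like `failure_threshold` and `reset_timeout` are assumptions chosen for clarity.

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation, calls pass through
    OPEN = "open"            # fail fast, downstream not called
    HALF_OPEN = "half_open"  # allow one probe request

class CircuitBreaker:
    """Count-based sketch: opens after `failure_threshold` consecutive
    failures, probes again after `reset_timeout` seconds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = State.HALF_OPEN  # cautiously test recovery
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A successful call (including the half-open probe) closes the circuit.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        # A failed half-open probe, or too many consecutive failures, opens it.
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

The injectable clock makes the state transitions easy to test without sleeping; real libraries add thresholds over rates and windows rather than consecutive counts.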

Deep — senior internals

Circuit breakers prevent cascading failures: if Service A calls Service B, and B is slow (not failing fast), A's threads accumulate waiting for B. Eventually A's thread pool is exhausted and A fails for all callers, propagating the failure upstream. The breaker cuts this cascade by failing fast: A gets an immediate error (or cached response) instead of waiting, freeing its threads.

Count-based vs time-based windows: a count-based breaker opens after N consecutive failures (simple, but can open on a burst of unrelated errors). A time-based sliding window (Resilience4j's default) tracks failures over the last N seconds, which is more accurate for noisy systems.

Fallback strategies: return cached data (acceptable for feeds and product catalogs), return a degraded response (search returns fewer results), or return a clear error (better than a silent timeout).

The bulkhead pattern complements circuit breakers: isolate thread pools per downstream service so a slow dependency consumes only its allocated threads, not the entire service thread pool.
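The time-based sliding window mentioned above can be approximated with a deque of timestamped outcomes. This is a sketch, not Resilience4j's implementation; the `min_calls` guard is a hypothetical parameter mirroring the idea that a tiny sample should not trip the breaker.

```python
import time
from collections import deque

class SlidingWindowErrorRate:
    """Time-based sliding window: keeps call outcomes from the last
    `window_seconds` and reports the failure rate over that window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable for testing
        self.outcomes = deque()  # (timestamp, succeeded: bool)

    def record(self, succeeded):
        self.outcomes.append((self.clock(), succeeded))

    def failure_rate(self, min_calls=10):
        # Evict outcomes older than the window.
        cutoff = self.clock() - self.window_seconds
        while self.outcomes and self.outcomes[0][0] < cutoff:
            self.outcomes.popleft()
        if len(self.outcomes) < min_calls:
            return 0.0  # too few calls to judge: don't trip on a blip
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes)
```

A breaker would compare `failure_rate()` against its threshold (e.g., open when it exceeds 0.5) instead of counting consecutive failures.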

🎤 Interview-ready answer

I'd implement circuit breakers at every external service call boundary. I'd configure failure thresholds conservatively (80% error rate over 20 calls) to avoid false trips on bursty traffic. The fallback is critical: for product recommendations, I'd return a static 'trending items' list; for search, I'd return the last cached results. I'd pair circuit breakers with bulkheads: separate Hystrix thread pools (or semaphore limits) per downstream service so one failing dependency can't exhaust all threads.
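The bulkhead-plus-fallback pairing from the answer above can be sketched with a semaphore capping concurrency per downstream; the `Bulkhead` class and its parameters are illustrative, not from any library.

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one downstream with a semaphore; a rejected
    call gets the fallback immediately instead of queueing up a thread."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback()  # bulkhead full: degrade, don't wait
        try:
            return fn()
        except Exception:
            return fallback()  # downstream error: degrade
        finally:
            self._slots.release()
```

One `Bulkhead` instance per downstream service means a slow dependency can saturate only its own slots, never the whole thread pool.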

Common trap

Opening the circuit breaker too aggressively. If the threshold is 10% error rate over 5 requests, a temporary network blip opens the circuit and takes down a dependency unnecessarily. Tune thresholds based on baseline error rates from production metrics, not theoretical minimums.