Circuit Breaker
A circuit breaker stops calls to a failing downstream service after a failure threshold is exceeded, returning an error immediately (or a cached fallback) until the service recovers, preventing cascading failures.
Circuit breaker = your home's electrical breaker panel. When a downstream appliance (service) short-circuits, the breaker TRIPS to protect the whole house. Half-open = cautiously flipping the switch back to test if the toaster stopped smoking.
Three states: Closed (normal operation, calls pass through), Open (calls fail fast, downstream not called), Half-Open (allow a test request; if it succeeds, close the circuit; if it fails, remain open). Failure threshold triggers Open state: e.g., 50% error rate over 10 requests in 60 seconds. The breaker periodically transitions to Half-Open to test recovery. Libraries: Hystrix (Netflix), Resilience4j, Polly (.NET).
Circuit breakers prevent cascading failures: if Service A calls Service B, and B is slow (not failing fast), A's threads accumulate waiting for B. Eventually A's thread pool is exhausted and A fails for all callers, propagating the failure upstream. The breaker cuts this cascade by failing fast — A gets an immediate error (or cached response) instead of waiting, freeing its threads. Count-based vs time-based windows: count-based opens after N consecutive failures (simple but can open on a burst of unrelated errors). Time-based sliding window (Resilience4j default) tracks failures in the last N seconds — more accurate for noisy systems. Fallback strategies: return cached data (acceptable for feeds, product catalogs), return degraded response (search returns fewer results), or return a clear error (better than a silent timeout). Bulkhead pattern complements circuit breakers: isolate thread pools per downstream service so a slow dependency consumes only its allocated threads, not the entire service thread pool.
I'd implement circuit breakers at every external service call boundary. I'd configure failure thresholds conservatively (80% error rate over 20 calls) to avoid false trips on bursty traffic. The fallback is critical — for product recommendations, I'd return a static 'trending items' list; for search, return the previous result. I'd pair circuit breakers with bulkheads: separate Hystrix thread pools (or semaphore limits) per downstream service so one failing dependency can't exhaust all threads.
Opening the circuit breaker too aggressively. If the threshold is 10% error rate over 5 requests, a temporary network blip opens the circuit and takes down a dependency unnecessarily. Tune thresholds based on baseline error rates from production metrics, not theoretical minimums.