What is Active-Active vs Active-Passive in High-Level Design?

Infrastructure & Deployments | high

Active-Active vs Active-Passive

Active-active: all instances/regions handle live traffic simultaneously. Active-passive: one instance/region handles traffic; the passive standby takes over only on failover.

Memory anchor

Active-active = two goalkeepers both guarding the net simultaneously (tricky coordination). Active-passive = one keeper plays while the backup sits on the bench, ready to sub in if the starter gets injured.

Expected depth

Active-passive is simpler: the passive instance is a warm standby that can typically be promoted in tens of seconds (30–60s is a common target). No conflict resolution is needed because only one writer exists at a time. The downsides: you pay for standby capacity that serves no traffic, and failover involves a state transition (detect the failure, promote the standby, redirect traffic) that can cause brief unavailability. Active-active runs all instances under load, so it uses the full provisioned capacity and can keep serving through a regional failure with little or no downtime, provided the surviving regions have enough headroom to absorb the failed region's traffic. The critical challenge is write conflicts: if two regions accept writes to the same record simultaneously, you need a conflict resolution strategy (last-write-wins with timestamps, CRDTs, or a consensus protocol).
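To make the CRDT option concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class name, region names, and method names are illustrative assumptions, not taken from any particular library. Each region increments only its own slot, and replicas merge by taking the per-region maximum, so the result is the same regardless of merge order.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        # Each region only ever bumps its own slot.
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repeated merges.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter(
            {r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in regions}
        )

    def value(self) -> int:
        return sum(self.counts.values())


# Two regions accept increments concurrently (hypothetical region names).
us = GCounter()
eu = GCounter()
us.increment("us-east")
us.increment("us-east")
eu.increment("eu-west")

merged_a = us.merge(eu)
merged_b = eu.merge(us)
print(merged_a.value(), merged_b.value())  # prints "3 3"
```

No increment is lost and no coordination is needed, but the trick only works because a counter has a natural merge function. Arbitrary entity updates (for example, overwriting an address) do not.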

Deep — senior internals

Last-write-wins (LWW) with timestamps is dangerous: if clocks are not perfectly synchronized (and they are never perfectly synchronized in distributed systems), LWW silently drops valid writes. CRDTs (Conflict-Free Replicated Data Types) provide mathematically guaranteed convergence, but only for specific data structures (counters, sets, maps). For general-purpose entity writes, the safest active-active strategy is to route all writes for a given entity (user, order) to a deterministic home region chosen by a hash or explicit affinity. Because each entity then has a single writer at any moment, entity-level write conflicts disappear while the availability benefits of active-active are retained. This is 'regional write affinity' and is the approach used by Shopify for their global infrastructure.
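The LWW failure mode is easy to reproduce. The sketch below is a toy model, not a real replication protocol: the `Write` type, the 200 ms skew, and the example values are all hypothetical. The EU write happens later in real time, but the EU region's clock runs slow, so LWW keeps the older US write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int  # wall-clock time stamped by the accepting region


def lww_merge(a: Write, b: Write) -> Write:
    # Keep whichever write carries the larger timestamp.
    return a if a.timestamp_ms >= b.timestamp_ms else b


# Hypothetical scenario: the EU region's clock runs 200 ms slow.
EU_CLOCK_SKEW_MS = -200

true_time_us = 1_000  # the US write happens first in real time
true_time_eu = 1_100  # the EU write happens 100 ms later in real time

us_write = Write("address=Old Street", timestamp_ms=true_time_us)
eu_write = Write("address=New Street", timestamp_ms=true_time_eu + EU_CLOCK_SKEW_MS)

winner = lww_merge(us_write, eu_write)
print(winner.value)  # prints "address=Old Street": the newer write was dropped
```

Nothing errors and nothing is logged; the user's most recent update simply vanishes, which is why the drop is described as silent.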

🎤 Interview-ready answer

For most systems I recommend active-passive with fast failover as the starting point (for example, Route 53 health checks plus an Aurora Global Database promotion, which typically completes in about a minute or less), because it is dramatically simpler than active-active. I move to active-active only when the RTO requirement is under 30 seconds or when cross-region write latency is causing user-facing impact. When I do go active-active, I use regional write affinity: each user is mapped to a home region for writes via consistent hashing, and reads are served locally, tolerating cross-region replication lag.
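A minimal sketch of the write-affinity routing, assuming rendezvous (highest-random-weight) hashing as the consistent hashing scheme. The region list and the `home_region` and `route` functions are hypothetical names for illustration. Each user gets a stable home region for writes, and reads stay in the caller's local region.

```python
import hashlib

# Hypothetical region list for illustration.
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]


def _score(user_id: str, region: str) -> int:
    # Stable hash of the (region, user) pair; independent of process or machine.
    digest = hashlib.sha256(f"{region}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def home_region(user_id: str, regions: list[str]) -> str:
    """Pick the region with the highest hash score for this user."""
    return max(regions, key=lambda region: _score(user_id, region))


def route(user_id: str, operation: str, local_region: str, regions: list[str]) -> str:
    # Writes go to the user's home region; reads are served locally
    # and may observe replication lag.
    if operation == "write":
        return home_region(user_id, regions)
    return local_region


print(route("user-42", "write", "eu-west-1", REGIONS))
print(route("user-42", "read", "eu-west-1", REGIONS))
```

With rendezvous hashing, removing a region from the list remaps only the users whose home was that region; everyone else keeps their home. That matters during a regional failure, because you do not want a failover to reshuffle the single-writer assignment for unaffected users.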

⚠ Common trap

Saying 'we're active-active' when the database is active-passive. End-to-end availability is capped by the weakest component in the request path, so an active-active application tier in front of an active-passive database delivers, at best, the database's (lower) availability guarantee.
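A quick back-of-the-envelope calculation shows why. The availability figures below are made-up inputs, and the model assumes failures are independent, which real regions only approximate. Redundant components combine in parallel; components that every request must traverse combine in series.

```python
def parallel(*availabilities: float) -> float:
    # Redundant replicas: the tier is down only if every replica is down
    # (assumes independent failures).
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


def serial(*availabilities: float) -> float:
    # Every component in the chain must be up for the request to succeed.
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures: two active-active app regions at 99.9% each,
# in front of a single-writer (active-passive) database at 99.95%.
app_tier = parallel(0.999, 0.999)
database = 0.9995
end_to_end = serial(app_tier, database)

print(f"app tier:   {app_tier:.6f}")
print(f"database:   {database:.6f}")
print(f"end to end: {end_to_end:.6f}")
```

Under these assumptions the app tier alone reaches roughly six nines, but the end-to-end figure lands just below the database's 99.95%. The extra app-tier redundancy cannot lift the system above its weakest serial dependency.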

Related concepts

High-Level Design

Multi-Region Architecture