What is Blue-Green & Canary Deployments in High-Level Design?

Infrastructure & Deploymentshigh

Blue-Green & Canary Deployments

Blue-green: maintain two identical production environments (blue=current, green=new); switch traffic atomically on deploy. Canary: gradually shift a small percentage of traffic to the new version before full rollout.

Memory anchor

Blue-green = having two identical stages — flip the curtain to reveal the new set instantly. Canary = sending one coal miner's canary into the new tunnel first; if it comes back alive, send everyone else.

Expected depth

Blue-green provides instant rollback (flip traffic back to blue) and zero-downtime deployments, but doubles infrastructure cost during the deployment window and requires the new version to handle all traffic at once — no gradual validation. Canary releases expose the new version to a small percentage (1–5%) of real traffic, enabling validation against real usage before full rollout. If error rates spike on the canary, roll back only the canary segment. Canary is superior for validating performance and correctness at scale but requires more sophisticated traffic management infrastructure (Kubernetes traffic splitting, Istio weight-based routing, feature flags).

Deep — senior internals

Database schema migrations are the hardest part of blue-green deployments. Both blue and green versions must be able to run simultaneously against the same database during the deployment window. This requires expand/contract migrations: (1) expand: add the new column/table in a backward-compatible way; (2) deploy the new version; (3) contract: remove the old column/table in a subsequent release. This requires at least 3 deployments per schema change. Canary deployments in Kubernetes can be implemented with Argo Rollouts or Flagger, which automate traffic weight adjustment based on observed metrics (error rate, latency) using Prometheus queries as rollout criteria.

🎤Interview-ready answer

I use canary deployments as the default deployment strategy with automated rollback criteria. Argo Rollouts manages traffic weight: 5% canary for 15 minutes → check error rate < 0.1% and p99 latency < 200ms → 25% → check again → 100%. If any metric check fails, Argo automatically rolls back. For database migrations, I enforce the expand/contract pattern: every migration PR is reviewed for backward compatibility with the previous application version before merging.

⚠Common trap

Running blue-green deployments with a shared database that has a schema migration applied as part of the deployment. If the migration is not backward compatible with the blue version, the instant traffic switch breaks all blue instances. Schema and code deployments must be decoupled.

Related concepts

High-Level Design

Feature Flags