Service Mesh (Istio / Linkerd)
A service mesh is an infrastructure layer that handles service-to-service communication — traffic routing, retries, circuit breaking, mTLS, and observability — using sidecar proxies deployed alongside each service.
Service mesh = invisible bodyguards walking beside every employee. They handle security checks and directions so workers focus on their actual job, not navigating the building.
The sidecar proxy (Envoy in Istio, linkerd-proxy in Linkerd) intercepts all inbound and outbound traffic for its pod. This moves reliability and security concerns out of application code and into the infrastructure: developers no longer implement retry logic or TLS handshakes in each service. The control plane (Istiod in Istio; Linkerd ships its own) distributes policy (traffic rules, mTLS certificates, telemetry configuration) to all sidecars. Benefits: consistent mTLS across all services, fine-grained traffic splitting for canary deployments, proxy-generated metrics and trace spans for every hop, and circuit breaking without library changes. One caveat on tracing: the proxies emit spans, but services must still forward trace context headers from inbound to outbound requests for those spans to join into a single trace.
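To make the traffic-splitting idea concrete, here is a minimal sketch of weighted backend selection, the kind of decision a proxy makes per request during a canary rollout. It is a toy model in plain Python, not Envoy's or Linkerd's actual algorithm; the backend names and the 90/10 weights are made up for illustration.

```python
import random
from collections import Counter


def pick_backend(weights, rng):
    """Pick a backend name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical 90/10 canary split between two versions of one service.
split = {"reviews-v1": 90, "reviews-v2": 10}

rng = random.Random(42)  # seeded so repeated runs give the same counts
counts = Counter(pick_backend(split, rng) for _ in range(10_000))

# Roughly 90% of requests should land on v1 and 10% on v2.
print(counts)
```

In a real mesh the weights live in routing configuration pushed by the control plane, so shifting from 90/10 to 50/50 requires no redeploy of the services themselves.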
The sidecar adds latency on every hop (roughly 1 ms is a common rule of thumb) and a per-pod memory cost (on the order of 50 MB for Envoy, growing with the amount of configuration it holds). At very high QPS or on resource-constrained nodes this is measurable. Istio's control plane is operationally complex, and a misconfigured authorization policy can block legitimate traffic across the mesh and cause a production outage. Sidecarless designs reduce this overhead: Cilium handles L3/L4 in the kernel with eBPF and runs a shared per-node Envoy for L7, while Istio ambient mesh uses a per-node ztunnel proxy for mTLS and L4 plus optional waypoint proxies for L7. For Kubernetes clusters, the service mesh also solves certificate rotation: the mesh issues each workload a short-lived certificate tied to its identity (SPIFFE-format in Istio, where SPIRE can also be plugged in as the issuer) and rotates it automatically, eliminating the manual cert management problem.
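A back-of-the-envelope calculation shows when the sidecar cost starts to matter. The sketch below simply multiplies out the rough figures quoted above (about 1 ms per hop, about 50 MB per proxy); the hop count and pod count are hypothetical inputs, and real numbers should come from measuring your own cluster.

```python
def mesh_overhead(hops, pods, latency_ms_per_hop=1.0, proxy_mem_mb=50.0):
    """Return (added latency in ms per request, total proxy memory in MB).

    The defaults are rough rule-of-thumb figures, not measurements.
    """
    return hops * latency_ms_per_hop, pods * proxy_mem_mb


# Hypothetical cluster: a request crosses 5 service hops, 400 pods carry a sidecar.
added_ms, total_mb = mesh_overhead(hops=5, pods=400)
print(f"{added_ms:.0f} ms added per request, {total_mb / 1000:.1f} GB of proxy memory")
```

Under these assumptions the mesh costs about 5 ms on a deep call chain and about 20 GB of memory cluster-wide, which is the kind of number that motivates sidecarless designs.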
I adopt a service mesh once there are enough teams and services that hand-rolled retries, circuit breakers, and mTLS become inconsistent from one service to the next. The immediate wins are: (1) mTLS for zero-trust networking without code changes; (2) traffic shifting for canary deployments at the infrastructure level; (3) uniform metrics and trace spans from every proxy without adding a tracing SDK to every service, provided each service forwards the incoming trace headers. I start with Linkerd for its simplicity and move to Istio only if I need its advanced traffic management features.
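On the tracing point, the one piece of work that usually stays in the application is copying trace context headers from the inbound request onto outbound calls, since a proxy cannot tell which outbound call belongs to which inbound request. The sketch below shows that step with plain dictionaries. The header list covers the W3C Trace Context and B3 names plus x-request-id; the function name and the sample request are hypothetical, and which headers matter depends on the tracing backend you configure.

```python
# Trace-context headers commonly forwarded (W3C Trace Context, B3, request id).
TRACE_HEADERS = (
    "x-request-id", "traceparent", "tracestate",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "b3",
)


def trace_headers_to_forward(incoming):
    """Return the trace-context headers to copy onto outbound requests."""
    lowered = {k.lower(): v for k, v in incoming.items()}  # header names are case-insensitive
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}


# Hypothetical inbound request headers.
incoming = {
    "Host": "orders.internal",
    "X-Request-Id": "abc-123",
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}
outbound_headers = trace_headers_to_forward(incoming)
print(outbound_headers)
```

Only the trace-related headers survive; unrelated ones such as Host are left out. The result would be merged into the headers of every downstream call made while handling this request.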
Thinking a service mesh eliminates the need to design for failure in application code. The mesh can retry transient failures, but your service must still be idempotent for retries to be safe. Retries also amplify load on a struggling dependency (three retries at each of three calling layers can turn one request into up to 64 attempts at the bottom), so you still need circuit breakers, timeouts, and bounded retries to stop cascading failures.
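One way to make a handler safe under retries is to deduplicate on an idempotency key supplied by the caller. The sketch below is a toy in-memory version of that idea: the class, method, and key names are hypothetical, and a real service would keep the key store in a shared database with an expiry rather than in a dict.

```python
class PaymentService:
    """Toy handler that deduplicates requests by an idempotency key."""

    def __init__(self):
        self.balance = 100
        self._seen = {}  # idempotency key -> response already returned

    def charge(self, key, amount):
        if key in self._seen:
            # A retry of a request we already handled: replay the stored
            # response instead of charging a second time.
            return self._seen[key]
        self.balance -= amount
        response = {"charged": amount, "balance": self.balance}
        self._seen[key] = response
        return response


svc = PaymentService()
first = svc.charge("req-1", 30)
retry = svc.charge("req-1", 30)  # e.g. the proxy retried after a timeout

# The retry gets the same response and the balance is only reduced once.
print(first == retry, svc.balance)
```

Without the key check, the retried request would charge twice, which is exactly the failure a mesh-level retry can introduce when the first attempt succeeded but its response was lost.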