Liveness & Readiness Probes
Liveness probe: is the container alive? If it fails, kill and restart the container. Readiness probe: is the container ready to serve traffic? If it fails, remove from Service endpoints.
Liveness = poking someone to check if they're ALIVE ('hey, are you breathing?'). If not, call 911 (restart). Readiness = asking 'are you READY to work?' If not, don't send them customers, but don't kill them either. Startup probe = 'still getting dressed? I won't poke you yet.'
Probe types: httpGet (GET request to a path; any status from 200 to 399 counts as success), tcpSocket (TCP connection opens), exec (run a command inside the container; exit 0 = success), grpc (gRPC health check). Parameters: initialDelaySeconds (wait before the first probe), periodSeconds (interval between probes), failureThreshold (consecutive failures before action), successThreshold (consecutive successes to recover; must be 1 for liveness and startup probes). Startup probe: for slow-starting containers. Liveness and readiness probes are held off until it succeeds, which prevents premature restarts.
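To make the threshold parameters concrete, here is a minimal Python sketch of how consecutive probe results could flip a container between healthy and unhealthy. It is a simplified model, not the kubelet's actual code: the ProbeState class, its field names, and the startup_budget_seconds helper are hypothetical names chosen for illustration, and the model assumes the container starts out healthy.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Simplified model of failureThreshold / successThreshold (hypothetical, not kubelet code)."""
    failure_threshold: int = 3  # consecutive failures before the probe counts as failed
    success_threshold: int = 1  # consecutive successes before it counts as passing again
    healthy: bool = True        # assumption: the container starts out healthy
    _streak: int = 0            # run of consecutive results that disagree with `healthy`

    def record(self, ok: bool) -> bool:
        """Feed one probe result and return the resulting state."""
        if ok == self.healthy:
            # The result agrees with the current state, so any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy


def startup_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Rough time a startup probe allows for boot, ignoring initialDelaySeconds and timeouts."""
    return period_seconds * failure_threshold
```

With the defaults above, two failures followed by a success leave the state healthy, while three failures in a row flip it to unhealthy. A startup probe with periodSeconds 10 and failureThreshold 30 gives the container roughly 300 seconds to start.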
Liveness failure → container restart (not pod eviction) → restart count increments → repeated failures end in CrashLoopBackOff, where restarts are delayed with exponential backoff. Readiness failure → pod removed from Service endpoints → no traffic (the pod keeps Running).
Common pattern: /healthz returns 200 as soon as the process is up (liveness); /ready returns 200 only when dependencies (DB, cache) are reachable (readiness).
PodDisruptionBudget (PDB): minAvailable or maxUnavailable keeps a minimum number of ready pods during voluntary disruptions that go through the Eviction API (node drain, node upgrades). Without a PDB, draining a node can take all replicas offline simultaneously. A Deployment's own rolling update is not limited by a PDB; its pace is set by the rollout strategy (maxUnavailable, maxSurge).
Readiness gate: additional pod conditions beyond container readiness, used by service meshes and load balancers to signal that the pod is ready for traffic from their perspective.
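The /healthz and /ready split can be sketched with Python's standard library. This is an illustration under stated assumptions, not a production server: the dependency addresses in DEPENDENCIES, port 8080, and the function names are all hypothetical, and "reachable" is approximated by a plain TCP connect.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency addresses; replace with the app's real DB and cache.
DEPENDENCIES = [("db.internal", 5432), ("cache.internal", 6379)]


def dependency_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Approximate 'reachable' as: a TCP connection opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_status(path: str, check=dependency_reachable) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: the process can answer at all; no dependency checks here.
        return 200
    if path == "/ready":
        # Readiness: only ready when every dependency is reachable.
        return 200 if all(check(h, p) for h, p in DEPENDENCIES) else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocks forever serving the two probe endpoints."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoints; an httpGet liveness probe would then target /healthz and a readiness probe /ready. A 503 from /ready fails the readiness probe without triggering a restart.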
Liveness and readiness serve different purposes. Liveness restarts containers that are stuck: infinite loop, deadlock, corrupted internal state. Readiness controls traffic: it removes the pod from load balancing while it is initializing, warming up caches, or temporarily degraded. Always define both. For readiness, check actual dependencies: can the app reach its database? A startup probe prevents liveness from killing a slow-starting container during initial boot. A PodDisruptionBudget ensures node drains and other eviction-based disruptions don't take all your pods offline at once; a Deployment's rolling update is paced separately by its maxUnavailable and maxSurge settings.
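The PodDisruptionBudget arithmetic can be sketched as follows. This is a simplified model that handles integer values only (percentages are left out), and the allowed_disruptions function and its parameter names are hypothetical. It computes how many more pods could be evicted right now without breaking the budget.

```python
from typing import Optional


def allowed_disruptions(
    healthy_pods: int,
    expected_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Evictions a budget would still permit (integer-only sketch, not the controller's code)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        # maxUnavailable is counted against the number of pods the workload expects.
        desired_healthy = expected_pods - max_unavailable
    # Never negative: if already below the budget, no eviction is allowed.
    return max(0, healthy_pods - desired_healthy)
```

With 3 healthy replicas and min_available=2, one pod may be evicted; once only 2 are healthy, a drain has to wait until a replacement becomes ready.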
Pointing liveness at the same dependency checks as readiness turns a dependency outage into a restart loop: the pod is removed from endpoints, restarted, removed again, restarted again, and restarting never fixes the dependency. This amplifies an outage. Keep liveness checks lightweight and local to the process (is it alive and responsive?). Put dependency and business-logic checks in readiness probes only.