Consistency (high)

Consensus: Raft and Paxos

Consensus algorithms allow a group of distributed nodes to agree on a single value even if some nodes fail, forming the foundation for distributed coordination, leader election, and replicated state machines.

Memory anchor

Raft = electing a class president. One leader, majority vote to pass anything. If the president is absent, hold a new election. Simple democracy -- everyone understands the rules, unlike Paxos (parliamentary procedure nobody can follow).

Expected depth

Paxos was the first practical consensus algorithm (Lamport, written ~1990, published 1998) but is notoriously hard to understand and implement. Raft was designed explicitly for understandability and is now the standard: nodes elect a leader; all writes go through the leader which replicates to followers; commits happen when a majority (quorum) acknowledges. Raft is used in etcd (Kubernetes), CockroachDB, TiKV, and Consul.
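The majority-commit rule above is easy to sketch. A minimal illustration (function names like `quorum_size` and `is_committed` are mine for this example, not from etcd or any real Raft library):

```python
def quorum_size(n_nodes: int) -> int:
    """Smallest majority of an n-node cluster: floor(n/2) + 1."""
    return n_nodes // 2 + 1

def is_committed(acks: int, n_nodes: int) -> bool:
    """A write commits once a majority of nodes (the leader counts
    itself) has acknowledged the log entry."""
    return acks >= quorum_size(n_nodes)

# A 5-node cluster commits with 3 acks, so it tolerates 2 failures.
print(quorum_size(5))       # 3
print(is_committed(3, 5))   # True
print(is_committed(2, 5))   # False
```

This is also why clusters use odd sizes: a 4-node cluster needs 3 acks and still tolerates only 1 failure, so the fourth node buys nothing.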

Deep — senior internals

Raft phases:

Leader election — nodes start as followers; if no heartbeat arrives within the election timeout, a follower becomes a candidate and requests votes. A candidate wins by collecting votes from a majority; voters grant their vote only to a candidate whose log is at least as up-to-date as their own (compared by the last entry's term, then its index).

Log replication — the leader receives client writes, appends them to its log, sends AppendEntries RPCs to followers, commits once a majority acknowledges, then responds to the client.

Log safety — a leader may only directly commit entries from its current term (earlier-term entries commit indirectly once a current-term entry commits); this prevents the 'ghost entry' problem.

Raft's key insight: a leader-based design with strong leader invariants makes reasoning tractable. Multi-Paxos optimizes classic Paxos for repeated consensus by maintaining a stable leader. In practice, consensus latency is 1 RTT (leader to quorum) — the unavoidable cost of strong consistency. Geo-distributed consensus must cross regional network RTTs (50–200 ms), making it unsuitable for low-latency write paths.
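The "at least as up-to-date" voting rule can be sketched directly (the function name is illustrative; the comparison itself follows the Raft paper's definition):

```python
def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Raft's voting check: the candidate's log is at least as
    up-to-date as ours if its last entry has a higher term, or the
    same term and an index no smaller than ours."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# A candidate with a last entry from a newer term wins the comparison
# even if its log is shorter overall.
print(log_is_up_to_date(3, 5, 2, 9))   # True: higher term beats longer log
print(log_is_up_to_date(2, 5, 2, 9))   # False: same term, shorter log
```

A follower grants its vote only when this check passes and it has not already voted in the current term — that second condition is what guarantees at most one leader per term.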

🎤 Interview-ready answer

For distributed coordination (leader election, distributed locks, configuration management), I'd use etcd — it's battle-tested Raft and already in every Kubernetes cluster. For data stores requiring strong consistency, I'd use CockroachDB or Spanner rather than implementing Raft myself. The key design question is: can I afford the latency cost of consensus on my write path? For a payment service where consistency is non-negotiable, yes. For a user-facing write path where latency matters, I'd push consistency off the critical path using async reconciliation.

Common trap

Thinking consensus is free in terms of latency. Every consensus round requires responses from a quorum of N/2+1 nodes (the leader counts itself) — that's at minimum one network round trip to the slowest member of the fastest-responding quorum. In a 3-AZ setup, that means roughly 1 ms of intra-region latency on every write; for cross-region consensus, it's 50–200 ms. Design your hot path to avoid consensus, or use pre-leasing techniques such as leader leases.
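The quorum-latency arithmetic is worth internalizing: commit latency is bounded by the slowest member of the fastest quorum, not the slowest node. A back-of-envelope sketch (the function name and RTT figures are illustrative assumptions):

```python
def commit_latency_ms(follower_rtts_ms: list[float]) -> float:
    """Lower-bound commit latency for a Raft-style leader.
    The leader acks its own entry immediately, so commit waits for
    the fastest (quorum - 1) follower responses; latency is the RTT
    of the slowest follower in that fastest subset."""
    n = len(follower_rtts_ms) + 1        # cluster size including the leader
    acks_needed = n // 2 + 1 - 1         # follower acks beyond the leader's own
    return sorted(follower_rtts_ms)[acks_needed - 1]

# 3-node, 3-AZ cluster with ~1 ms intra-region RTTs:
print(commit_latency_ms([1.0, 1.2]))          # 1.0 -- one follower ack suffices
# 5-node geo-distributed cluster:
print(commit_latency_ms([60, 90, 140, 200]))  # 90 -- two follower acks needed
```

Note the upside: the slowest replicas (140 ms, 200 ms) never gate commits, which is why placing a quorum of nodes close to the leader keeps writes fast even in a geo-distributed cluster.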