Consensus: Raft and Paxos
Consensus algorithms allow a group of distributed nodes to agree on a single value even if some nodes fail, forming the foundation for distributed coordination, leader election, and replicated state machines.
Raft = electing a class president. One leader, majority vote to pass anything. If the president is absent, hold a new election. Simple democracy -- everyone understands the rules, unlike Paxos (parliamentary procedure nobody can follow).
Paxos was the first practical consensus algorithm (Lamport, written ~1990, published 1998) but is notoriously hard to understand and implement. Raft was designed explicitly for understandability and is now the standard: nodes elect a leader; all writes go through the leader which replicates to followers; commits happen when a majority (quorum) acknowledges. Raft is used in etcd (Kubernetes), CockroachDB, TiKV, and Consul.
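The majority-quorum commit rule is simple arithmetic. A minimal sketch (function names are illustrative, not from etcd or any real implementation):

```python
# Sketch of Raft's commit rule: an entry is committed once a majority
# of the cluster (leader included) acknowledges it.

def quorum_size(cluster_size: int) -> int:
    """Majority quorum: strictly more than half the nodes."""
    return cluster_size // 2 + 1

def is_committed(ack_count: int, cluster_size: int) -> bool:
    """ack_count counts the leader itself plus acknowledging followers."""
    return ack_count >= quorum_size(cluster_size)

# A 5-node cluster has a quorum of 3 and tolerates 2 failures.
print(quorum_size(5))        # 3
print(is_committed(3, 5))    # True
print(is_committed(2, 5))    # False
```

This is why clusters are sized 3 or 5, not 4 or 6: an even cluster raises the quorum without raising the number of tolerated failures.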
Raft phases:
Leader election: nodes start as followers; if a follower receives no heartbeat within its (randomized) election timeout, it becomes a candidate, increments its term, and requests votes. A voter grants its vote only if the candidate's log is at least as up-to-date as its own (compare last log term first, then last log index); the first candidate to collect a majority of votes becomes leader.
Log replication: the leader receives client writes, appends them to its log, sends AppendEntries RPCs to followers, commits once a majority acknowledges, then responds to the client.
Log safety: a leader only counts replication toward commitment for entries from its current term; earlier-term entries are committed indirectly once a current-term entry commits. This prevents a stale entry from a previous term being committed and then overwritten (the 'ghost entry' problem).
Raft's key insight: a single strong leader with strict invariants makes the protocol tractable to reason about. Multi-Paxos optimizes classic Paxos for repeated consensus by maintaining a stable leader, converging on a similar design. In practice, consensus latency is 1 RTT (leader to quorum); this is the unavoidable cost of strong consistency. Geo-distributed consensus must cross regional network RTTs (50–200ms), making it unsuitable for low-latency write paths.
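The vote-granting check and the leader's commit rule can both be sketched in a few lines. This is an illustrative sketch assuming a log of (term, command) tuples; the names are hypothetical, not from any real Raft library:

```python
# Rule 1: a voter grants its vote only if the candidate's log is at
# least as up-to-date as its own (last term first, then last index).
def candidate_log_ok(cand_last_term, cand_last_index,
                     my_last_term, my_last_index):
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# Rule 2 (leader side): advance commit_index to the highest N such that
# a majority of match_index >= N AND log[N] is from the current term.
def advance_commit_index(match_index, commit_index, log, current_term):
    n_nodes = len(match_index)          # includes the leader's own entry
    quorum = n_nodes // 2 + 1
    for n in range(len(log) - 1, commit_index, -1):
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= quorum and log[n][0] == current_term:
            return n                    # highest safely committable index
    return commit_index                 # nothing new to commit
```

The `log[n][0] == current_term` guard is the log-safety rule in code: a leader never commits an earlier-term entry directly, only indirectly by committing one of its own entries on top of it.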
For distributed coordination (leader election, distributed locks, configuration management), I'd use etcd — it's battle-tested Raft and already in every Kubernetes cluster. For data stores requiring strong consistency, I'd use CockroachDB or Spanner rather than implementing Raft myself. The key design question is: can I afford the latency cost of consensus on my write path? For a payment service where consistency is non-negotiable, yes. For a user-facing write path where latency matters, I'd push consistency off the critical path using async reconciliation.
Thinking consensus is free in terms of latency. Every consensus round requires responses from a quorum of N/2+1 nodes: at minimum one network round trip, gated by the slowest member of the fastest-responding quorum (roughly the median node). In a 3-AZ setup, that's at minimum ~1ms of intra-region latency on every write. For cross-region consensus, latency is 50–200ms. Design your hot path to avoid consensus, or use leader leases so reads can be served locally without a quorum round.
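The "fastest quorum" point is worth making concrete: since the leader counts toward the quorum, it only needs acks from quorum−1 followers, so commit latency is the RTT of the (quorum−1)-th fastest follower, not the slowest node. A sketch with made-up RTT numbers:

```python
# Sketch: commit latency is gated by the fastest quorum. The leader
# counts toward the quorum, so it waits for acks from (quorum - 1)
# followers; latency is the RTT of that many fastest followers.
def commit_latency_ms(follower_rtts_ms, cluster_size):
    quorum = cluster_size // 2 + 1
    acks_needed = quorum - 1                 # leader already counts itself
    return sorted(follower_rtts_ms)[acks_needed - 1]

# 5-node cluster: leader + 4 followers, two nearby and two remote.
rtts = [1.0, 2.0, 80.0, 150.0]               # ms, illustrative
print(commit_latency_ms(rtts, 5))            # 2.0 -- the two nearby
                                             # followers form the quorum
```

This is also why placing a quorum of replicas in one region keeps writes fast even when other replicas are far away, at the cost of losing that region taking down the quorum.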