Replication Strategies
Replication copies data from a primary database to one or more replicas. It serves two purposes: high availability (if the primary fails, a replica takes over) and read scaling (distribute read queries across replicas). Synchronous replication waits for replicas to confirm before committing — safer but slower. Asynchronous replication commits immediately and sends changes to replicas in the background — faster but replicas may lag.
Replication is like a teacher writing on a whiteboard (the primary) while TAs copy it onto whiteboards in other rooms (the replicas). With synchronous replication, the teacher waits until at least one TA finishes copying before continuing. With asynchronous replication, the teacher keeps writing and the TAs catch up at their own pace, so students in other rooms may be looking at notes that are a few seconds out of date.
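The difference between the two modes can be sketched with a toy in-memory model. This is an illustration only, not how any real database is implemented: the Primary and Replica classes and the ship() method are hypothetical names, and a single replica stands in for "at least one".

```python
class Replica:
    """Toy replica: holds the entries it has received so far."""

    def __init__(self):
        self.log = []


class Primary:
    """Toy primary that forwards every committed entry to one replica."""

    def __init__(self, replica, synchronous):
        self.replica = replica
        self.synchronous = synchronous
        self.log = []
        self.backlog = []  # entries committed locally but not yet shipped

    def commit(self, entry):
        self.log.append(entry)
        self.backlog.append(entry)
        if self.synchronous:
            # Synchronous: do not acknowledge until the replica has the entry.
            self.ship()
        # Asynchronous: acknowledge now; ship() runs later in the background.
        return "committed"

    def ship(self):
        """Deliver all queued entries to the replica."""
        self.replica.log.extend(self.backlog)
        self.backlog.clear()


sync_replica = Replica()
Primary(sync_replica, synchronous=True).commit("row-1")

async_replica = Replica()
async_primary = Primary(async_replica, synchronous=False)
async_primary.commit("row-1")
lagging_view = list(async_replica.log)  # still empty: the replica is behind
async_primary.ship()                    # background shipping catches it up
```

In the asynchronous case the commit returns while the replica is still empty, which is exactly the window in which a failover would lose the write and a replica read would miss it.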
PostgreSQL streaming replication ships WAL records to replicas continuously as they are generated. Synchronous replication (configured via synchronous_standby_names) makes each commit wait for a synchronous standby to confirm it, so an acknowledged transaction is not lost if the primary fails (with the default synchronous_commit = on, the standby must have flushed the WAL to disk). The cost is write latency: roughly one network round-trip to the standby being waited on. Async replication is the default and is sufficient for most read-scaling use cases, accepting that replicas may be milliseconds to seconds behind. Logical replication (PostgreSQL 10+) replicates specific tables and tolerates schema differences between publisher and subscriber (for example, extra columns on the subscriber), which makes it useful for zero-downtime migrations and selective replication. MySQL replication: async is the default; semi-sync waits for at least one replica to acknowledge receipt of the binlog events (but not that it has applied them); Group Replication uses Paxos-based consensus and supports automatic failover.
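As a small sketch of the PostgreSQL synchronous setup described above, the helper below renders a value for synchronous_standby_names using the FIRST (priority) and ANY (quorum) forms. The function name and the standby names replica_a and replica_b are hypothetical, the names are assumed to be plain SQL identifiers that need no quoting, and the range check on num_sync is a choice made for this sketch rather than a PostgreSQL rule.

```python
def synchronous_standby_names(standbys, num_sync=1, method="FIRST"):
    """Render a value for PostgreSQL's synchronous_standby_names setting.

    method="FIRST": the first num_sync listed standbys that are connected
                    act as the synchronous ones (priority order).
    method="ANY":   a commit waits for any num_sync of the listed standbys
                    (quorum).
    """
    if method not in ("FIRST", "ANY"):
        raise ValueError("method must be 'FIRST' or 'ANY'")
    if not standbys:
        raise ValueError("at least one standby name is required")
    if not 1 <= num_sync <= len(standbys):
        # Sketch-level guard: waiting for more standbys than are listed
        # would leave commits waiting indefinitely.
        raise ValueError("num_sync must be between 1 and len(standbys)")
    return f"{method} {num_sync} ({', '.join(standbys)})"


priority_value = synchronous_standby_names(["replica_a", "replica_b"])
quorum_value = synchronous_standby_names(["replica_a", "replica_b"], method="ANY")

# Line as it might appear in postgresql.conf (standby names are placeholders).
conf_line = f"synchronous_standby_names = '{priority_value}'"
```

With ANY, the commit waits only for whichever listed standby answers first, which is the case where write latency tracks the fastest synchronous replica.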
Replication lag is the main operational challenge. Causes: replica hardware slower than the primary, long-running queries on the replica blocking WAL apply, and heavy write load on the primary. Monitoring: pg_stat_replication (queried on the primary) shows replay_lag per replica. Mitigations: dedicated replica hardware, hot_standby_feedback (stops the primary from vacuuming rows still needed by replica queries, so WAL apply is not held up by query conflicts, at the cost of possible table bloat on the primary), and read-your-writes consistency (route reads to the primary immediately after a write, then to replicas after a delay). Multi-region replication adds network latency: cross-region async replication is standard, while cross-region sync replication is impractical for most workloads. CockroachDB and Spanner provide multi-region strong consistency, but at a significant latency cost for writes. For most applications, async replication with monitoring and read-your-writes routing is the practical choice.
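The monitoring step can be sketched as a query against pg_stat_replication plus a small check over its rows. The query uses the view's real application_name, state, and replay_lag columns; the lagging_replicas function, the 5-second threshold, and the sample replica names are assumptions for illustration, and running the query through a database driver is left out.

```python
# Run on the primary. replay_lag is an interval, converted here to seconds.
LAG_QUERY = """
SELECT application_name,
       state,
       EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication
"""


def lagging_replicas(rows, threshold_seconds=5.0):
    """Return (name, reason) pairs for replicas that need attention.

    rows: (application_name, state, replay_lag_seconds) tuples shaped like
    the result of LAG_QUERY. replay_lag can be NULL (None) when the replica
    is caught up and the system is idle, so None is treated as "no lag".
    """
    alerts = []
    for name, state, lag in rows:
        if state != "streaming":
            alerts.append((name, f"not streaming (state={state})"))
        elif lag is not None and float(lag) > threshold_seconds:
            alerts.append((name, f"replay lag {float(lag):.1f}s"))
    return alerts


# Hypothetical sample rows standing in for a real query result.
sample_rows = [
    ("replica_a", "streaming", 0.2),
    ("replica_b", "streaming", 12.0),
    ("replica_c", "catchup", None),
    ("replica_d", "streaming", None),
]
alerts = lagging_replicas(sample_rows)
```

A replica that has disconnected entirely does not appear in pg_stat_replication at all, so a real check would probably also compare the returned names against the list of replicas that are expected to exist.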
I use asynchronous streaming replication as the default for PostgreSQL: it provides read scaling and high availability with minimal impact on write latency. For critical data (financial transactions), I enable synchronous replication to at least one replica so that no acknowledged commit is lost if the primary fails. The main operational concern is replication lag: I monitor pg_stat_replication, set up alerts when lag exceeds a threshold, and implement read-your-writes consistency in the application (route reads to the primary for a short window after a write). For schema migrations that must avoid downtime, logical replication is invaluable: I can replicate into a new table with the updated schema and switch over with only a brief cutover.
A common mistake is assuming replicas are always up to date. Because of replication lag, a read from a replica immediately after a write to the primary may not see the change. This causes confusing bugs: a user creates a record and then cannot see it. Always implement read-your-writes consistency by routing reads to the primary for a short period after writes.
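A minimal sketch of read-your-writes routing, assuming a time-based pin per user: after a user writes, that user's reads go to the primary for a short window, and everyone else keeps reading from replicas. The ReadYourWritesRouter class, its method names, and the 2-second window are hypothetical; the clock is injectable so the behaviour can be tested without sleeping.

```python
import time


class ReadYourWritesRouter:
    """Route a user's reads to the primary for a short window after a write."""

    def __init__(self, pin_seconds=2.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # user_id -> clock reading at the last write

    def record_write(self, user_id):
        """Call after every successful write made on behalf of user_id."""
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        """Return 'primary' while the user is pinned, otherwise 'replica'."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"


# Usage with a fake clock so the window can be stepped through by hand.
now = [100.0]
router = ReadYourWritesRouter(pin_seconds=2.0, clock=lambda: now[0])

before_write = router.target_for_read("user-1")
router.record_write("user-1")
right_after = router.target_for_read("user-1")
other_user = router.target_for_read("user-2")
now[0] += 2.5
after_window = router.target_for_read("user-1")
```

A fixed window is a heuristic: if replication lag exceeds pin_seconds, the user can still read stale data. A stricter variant would record the primary's WAL position at write time (pg_current_wal_lsn()) and only use a replica once its pg_last_wal_replay_lsn() has passed it.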