Database Observability
Database observability means monitoring the health, performance, and behavior of your database in production. Key metrics: query latency (p50, p95, p99), connection count, CPU and memory utilization, disk I/O, replication lag, cache hit ratio, dead tuple count, and slow query logs. Without observability, you are flying blind — performance issues are discovered by users, not engineers.
Database observability is the dashboard on your car: pg_stat_statements is the speedometer (which queries are burning the most fuel), cache hit ratio is the temperature gauge (should be 99%+ or the engine is overheating), and replication lag is the 'check engine' light. By the time passengers (users) complain, the engine has been knocking for a while.
PostgreSQL observability stack: pg_stat_statements (query performance — total time, calls, mean time, rows), pg_stat_user_tables (seq scans, index scans, dead tuples, last autovacuum), pg_stat_user_indexes (index usage), pg_stat_activity (active queries, blocked queries, idle-in-transaction connections), pg_stat_replication (replication lag). Buffer cache hit ratio should be >99% (SELECT sum(heap_blks_hit) / sum(heap_blks_hit + heap_blks_read) FROM pg_statio_user_tables). Alert on: replication lag > 10 seconds, connection count approaching max_connections, cache hit ratio dropping, autovacuum not completing (dead tuples growing). For MySQL: SHOW ENGINE INNODB STATUS, performance_schema, slow query log. For DynamoDB: CloudWatch metrics for consumed capacity, throttled requests, system errors.
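The two workhorse checks above can be sketched as queries. A minimal sketch, assuming PostgreSQL with pg_stat_statements loaded via shared_preload_libraries; note the column names changed in PostgreSQL 13 (total_exec_time/mean_exec_time, formerly total_time/mean_time):

```sql
-- Top queries by total execution time (PostgreSQL 13+ column names;
-- on older versions use total_time / mean_time instead).
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Buffer cache hit ratio; target > 99% for typical OLTP workloads.
-- nullif guards against division by zero on an idle database.
SELECT round(sum(heap_blks_hit) * 100.0
             / nullif(sum(heap_blks_hit + heap_blks_read), 0), 2) AS hit_pct
FROM pg_statio_user_tables;
```

Sorting by total time rather than mean time surfaces the fast-but-frequent queries that dominate overall load, which a slow-query log alone would miss.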
Advanced observability: auto_explain (PostgreSQL extension) logs execution plans for slow queries without manual EXPLAIN. pg_stat_kcache tracks actual disk I/O and CPU time per query. For lock analysis, pg_stat_activity combined with pg_locks shows blocked and blocking queries. Long-running idle-in-transaction connections are silent killers — they hold locks and prevent autovacuum. Set idle_in_transaction_session_timeout to auto-terminate them. Wait event analysis (pg_stat_activity.wait_event_type) reveals whether queries are waiting on I/O, locks, or CPU. Connection pool monitoring (PgBouncer SHOW STATS) tracks pool utilization, wait time, and query routing. Dashboards should show: top queries by total time (not just slow queries), index hit ratio trend, replication lag over time, autovacuum activity, and connection pool utilization.
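The lock analysis described above can be expressed directly: a sketch using pg_blocking_pids (available since PostgreSQL 9.6) to pair each waiting backend with the sessions holding the locks it needs, plus the timeout setting for idle-in-transaction sessions:

```sql
-- Who is blocking whom: join pg_stat_activity to itself via
-- pg_blocking_pids, which returns the PIDs blocking a given backend.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query,
       blocking.state AS blocking_state
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

-- Auto-terminate sessions that sit idle inside a transaction
-- (the 5-minute value is an illustrative choice, not a recommendation).
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
```

The self-join often reveals that one idle-in-transaction session is at the root of a whole chain of blocked queries.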
Database observability starts with pg_stat_statements — it tells me which queries consume the most total resources (calls × mean_time). I sort by total_time (total_exec_time on PostgreSQL 13+) to find the queries worth optimizing. Beyond that, I monitor: buffer cache hit ratio (should be over 99%), replication lag, connection count vs max_connections, dead tuple growth (a sign autovacuum is not keeping up), and lock waits (pg_stat_activity with wait_event). I set alerts for: replication lag exceeding 10 seconds, connection utilization above 80%, cache hit ratio dropping below 99%, and idle-in-transaction connections exceeding a timeout. The key insight is proactive monitoring: by the time users report slowness, the database has been struggling for a while. Dashboards showing query performance trends catch degradation early.
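The alert conditions listed above map onto queries a monitoring agent can poll. A sketch, assuming the primary runs PostgreSQL 10+ (replay_lag was added in that release):

```sql
-- Replication lag per standby, in seconds, as seen from the primary.
SELECT application_name,
       extract(epoch FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;

-- Connection utilization vs max_connections (alert above 80%).
SELECT count(*) * 100.0
       / current_setting('max_connections')::int AS pct_used
FROM pg_stat_activity;

-- Tables where dead tuples are accumulating, with the last
-- autovacuum timestamp to show whether it is keeping up.
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```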
A common pitfall: monitoring only average query latency instead of percentiles. An average of 5ms can hide a p99 of 500ms, meaning 1 in 100 queries is 100x slower. Always monitor p50, p95, and p99 latency. The p99 often reveals index problems or lock contention that the average conceals.
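The gap between average and tail latency is easy to compute in PostgreSQL itself with ordered-set aggregates. A sketch over a hypothetical query_log(latency_ms) table (the table name and column are assumptions for illustration):

```sql
-- Average vs tail latency over a hypothetical query_log table:
-- an unremarkable average can coexist with a pathological p99.
SELECT round(avg(latency_ms)::numeric, 1)                        AS avg_ms,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY latency_ms) AS p50_ms,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_ms,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_ms
FROM query_log;
```

In practice these percentiles usually come from an APM tool or histogram metrics rather than a raw log table, but the principle is the same: alert on the tail, not the mean.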