Database Observability
Database observability means monitoring the health, performance, and behavior of your database in production. Key metrics: query latency (p50, p95, p99), connection count, CPU and memory utilization, disk I/O, replication lag, cache hit ratio, dead tuple count, and slow query logs. Without observability, you are flying blind — performance issues are discovered by users, not engineers.
Database observability is the dashboard on your car: pg_stat_statements is the speedometer (which queries are burning the most fuel), cache hit ratio is the temperature gauge (should be 99%+ or the engine is overheating), and replication lag is the 'check engine' light. By the time passengers (users) complain, the engine has been knocking for a while.
PostgreSQL observability stack: pg_stat_statements (query performance — total time, calls, mean time, rows), pg_stat_user_tables (seq scans, index scans, dead tuples, last autovacuum), pg_stat_user_indexes (index usage), pg_stat_activity (active queries, blocked queries, idle-in-transaction connections), pg_stat_replication (replication lag). Buffer cache hit ratio should be >99% (SELECT sum(heap_blks_hit) / sum(heap_blks_hit + heap_blks_read) FROM pg_statio_user_tables). Alert on: replication lag > 10 seconds, connection count approaching max_connections, cache hit ratio dropping, autovacuum not completing (dead tuples growing). For MySQL: SHOW ENGINE INNODB STATUS, performance_schema, slow query log. For DynamoDB: CloudWatch metrics for consumed capacity, throttled requests, system errors.
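The two workhorse checks above can be sketched as queries. A minimal sketch, assuming PostgreSQL with pg_stat_statements loaded via shared_preload_libraries; note the column names changed in PostgreSQL 13 (total_exec_time/mean_exec_time, formerly total_time/mean_time):

```sql
-- Top queries by total execution time (PostgreSQL 13+ column names;
-- on older versions use total_time / mean_time instead).
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Buffer cache hit ratio; target > 99% for typical OLTP workloads.
-- nullif guards against division by zero on an idle database.
SELECT round(sum(heap_blks_hit) * 100.0
             / nullif(sum(heap_blks_hit + heap_blks_read), 0), 2) AS hit_pct
FROM pg_statio_user_tables;
```

Sorting by total time rather than mean time surfaces the fast-but-frequent queries that dominate overall load, which a slow-query log alone would miss.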
Advanced observability: auto_explain (PostgreSQL extension) logs execution plans for slow queries without manual EXPLAIN. pg_stat_kcache tracks actual disk I/O and CPU time per query. For lock analysis, pg_stat_activity combined with pg_locks shows blocked and blocking queries. Long-running idle-in-transaction connections are silent killers — they hold locks and prevent autovacuum. Set idle_in_transaction_session_timeout to auto-terminate them. Wait event analysis (pg_stat_activity.wait_event_type) reveals whether queries are waiting on I/O, locks, or CPU. Connection pool monitoring (PgBouncer SHOW STATS) tracks pool utilization, wait time, and query routing. Dashboards should show: top queries by total time (not just slow queries), index hit ratio trend, replication lag over time, autovacuum activity, and connection pool utilization.
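The lock analysis described above can be expressed directly: a sketch using pg_blocking_pids (available since PostgreSQL 9.6) to pair each waiting backend with the sessions holding the locks it needs, plus the timeout setting for idle-in-transaction sessions:

```sql
-- Who is blocking whom: join pg_stat_activity to itself via
-- pg_blocking_pids, which returns the PIDs blocking a given backend.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query,
       blocking.state AS blocking_state
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

-- Auto-terminate sessions that sit idle inside a transaction
-- (the 5-minute value is an illustrative choice, not a recommendation).
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
```

The self-join often reveals that one idle-in-transaction session is at the root of a whole chain of blocked queries.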
Database observability starts with pg_stat_statements — it tells me which queries consume the most total resources (calls × mean_time). I sort by total_time (total_exec_time on PostgreSQL 13+) to find the queries worth optimizing. Beyond that, I monitor: buffer cache hit ratio (should be over 99%), replication lag, connection count vs max_connections, dead tuple growth (a sign autovacuum is not keeping up), and lock waits (pg_stat_activity with wait_event). I set alerts for: replication lag exceeding 10 seconds, connection utilization above 80%, cache hit ratio dropping below 99%, and idle-in-transaction connections exceeding a timeout. The key insight is proactive monitoring: by the time users report slowness, the database has been struggling for a while. Dashboards showing query performance trends catch degradation early.
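The alert conditions listed above map onto queries a monitoring agent can poll. A sketch, assuming the primary runs PostgreSQL 10+ (replay_lag was added in that release):

```sql
-- Replication lag per standby, in seconds, as seen from the primary.
SELECT application_name,
       extract(epoch FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;

-- Connection utilization vs max_connections (alert above 80%).
SELECT count(*) * 100.0
       / current_setting('max_connections')::int AS pct_used
FROM pg_stat_activity;

-- Tables where dead tuples are accumulating, with the last
-- autovacuum timestamp to show whether it is keeping up.
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```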
A common pitfall: monitoring only average query latency instead of percentiles. An average of 5ms can hide a p99 of 500ms, meaning 1 in 100 queries is 100x slower. Always monitor p50, p95, and p99 latency. The p99 often reveals index problems or lock contention that the average conceals.
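The gap between average and tail latency is easy to compute in PostgreSQL itself with ordered-set aggregates. A sketch over a hypothetical query_log(latency_ms) table (the table name and column are assumptions for illustration):

```sql
-- Average vs tail latency over a hypothetical query_log table:
-- an unremarkable average can coexist with a pathological p99.
SELECT round(avg(latency_ms)::numeric, 1)                        AS avg_ms,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY latency_ms) AS p50_ms,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_ms,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_ms
FROM query_log;
```

In practice these percentiles usually come from an APM tool or histogram metrics rather than a raw log table, but the principle is the same: alert on the tail, not the mean.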