Observability

RED Metrics (Rate / Errors / Duration)

RED is a framework for instrumenting services: Rate (requests per second), Errors (error rate or count), Duration (request latency histogram). Together they provide a complete picture of service health.

Memory anchor

RED metrics = a doctor's vital signs check. Rate = heart rate (how fast is it pumping?). Errors = blood pressure spikes (something's wrong). Duration = reaction time (how sluggish is the patient?).

Expected depth

For every service endpoint, capture: Rate (how much traffic is it handling?), Errors (what proportion of requests are failing?), Duration (at what latency percentile — p50, p95, p99, p999?). This is sufficient to answer: 'is this service healthy?', 'has this deploy degraded performance?', and 'is this service meeting its SLO?'. USE (Utilization, Saturation, Errors) is a complementary framework for infrastructure resources (CPU, memory, disk, network): Utilization (how busy is the resource?), Saturation (is it queueing?), Errors (is it failing?). RED is for services; USE is for resources.
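The per-endpoint bookkeeping above can be sketched as a tiny in-process tracker. This is an illustrative toy (the class name and method names are invented for this sketch, and a real service would use a metrics library such as prometheus_client instead), but it shows exactly what RED asks you to record per endpoint:

```python
from collections import defaultdict

class RedTracker:
    """Toy in-process RED tracker: Rate, Errors, Duration per endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request count per endpoint
        self.errors = defaultdict(int)      # Errors: failure count per endpoint
        self.durations = defaultdict(list)  # Duration: latency samples per endpoint

    def observe(self, endpoint, status_code, duration_s):
        """Record one completed request against all three RED metrics."""
        self.requests[endpoint] += 1
        if status_code >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_rate(self, endpoint):
        """Proportion of failing requests; 0.0 if the endpoint saw no traffic."""
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0
```

In production the duration samples would go into histogram buckets rather than an unbounded list, so percentiles can be computed and aggregated cheaply.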

Deep — senior internals

Latency percentiles matter more than averages. The average is misleading: p50=10ms, p99=2000ms means 1% of users experience 2-second latency, which is invisible in the average. Always instrument with histograms (prefer a Prometheus histogram over a summary: histogram buckets can be aggregated across instances, while precomputed summary quantiles cannot) and report p50, p95, p99, p999. The p999 (99.9th percentile) is the 'long tail' — at 1000 RPS, roughly 1 request per second experiences worse-than-p999 latency. For SLO measurement, use a ratio metric: error_rate = sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])). This is burned against the error budget.
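A quick worked example of why the average hides the tail. The distribution below is synthetic (hypothetical numbers chosen to mirror the fast-majority/slow-tail shape described above: 98% of requests at 10ms, 2% at 2000ms), using a simple nearest-rank percentile:

```python
import statistics

# Synthetic latency samples: 98% fast requests, 2% slow tail (hypothetical).
latencies_ms = [10] * 980 + [2000] * 20

def percentile(samples, p):
    """Nearest-rank percentile: the sample at rank ceil(p/100 * n)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

mean = statistics.mean(latencies_ms)   # ~50ms: looks healthy
p50 = percentile(latencies_ms, 50)     # 10ms
p99 = percentile(latencies_ms, 99)     # 2000ms: the 2-second tail the mean hides
```

An average-based alert set at, say, 100ms would never fire here, even though 1 in 50 users waits two seconds.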

🎤 Interview-ready answer

I instrument every service with Prometheus histograms on three metrics: http_requests_total (labeled by endpoint, method, status_code), http_request_duration_seconds (histogram), and http_errors_total. Grafana dashboards show RED metrics per endpoint per deploy. SLO alerts fire when the 1-hour error burn rate exceeds 2x the allowed budget — fast enough to catch incidents before they exhaust the monthly budget, with enough signal to avoid false positives.
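The burn-rate arithmetic behind that alert threshold, sketched under an assumed 99.9% monthly SLO (the SLO target and observed error rate here are hypothetical, chosen to illustrate the 2x condition):

```python
# Assumed monthly SLO of 99.9% (hypothetical for this sketch).
slo = 0.999
error_budget = 1 - slo             # 0.1% of requests may fail per month

# The alert condition: 1-hour error rate exceeds 2x the allowed budget.
observed_1h_error_rate = 0.002     # hypothetical: 0.2% of requests failing
burn_rate = observed_1h_error_rate / error_budget  # 2.0x

# At a constant burn rate, the monthly budget exhausts in (month / burn_rate).
hours_in_month = 30 * 24
hours_to_exhaustion = hours_in_month / burn_rate   # 360 hours ≈ 15 days
```

A sustained 2x burn would still leave roughly two weeks of budget, which is why a 1-hour window at this multiplier catches real incidents without paging on every transient blip.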

Common trap

Setting alert thresholds on averages rather than percentiles. An average latency alert will miss the p99 degradation that your most demanding users experience.

Related concepts