Observability

RED Metrics (Rate / Errors / Duration)

RED is a framework for instrumenting services: Rate (requests per second), Errors (error rate or count), Duration (request latency histogram). Together they provide a complete picture of service health.

Memory anchor

RED metrics = a doctor's vital signs check. Rate = heart rate (how fast is it pumping?). Errors = blood pressure spikes (something's wrong). Duration = reaction time (how sluggish is the patient?).

Expected depth

For every service endpoint, capture: Rate (how much traffic is it handling?), Errors (what proportion of requests are failing?), Duration (at what latency percentile — p50, p95, p99, p999?). This is sufficient to answer: 'is this service healthy?', 'has this deploy degraded performance?', and 'is this service meeting its SLO?'. USE (Utilization, Saturation, Errors) is a complementary framework for infrastructure resources (CPU, memory, disk, network): Utilization (how busy is the resource?), Saturation (is it queueing?), Errors (is it failing?). RED is for services; USE is for resources.
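The per-endpoint bookkeeping above can be sketched as a tiny in-process tracker. This is an illustrative toy (the class name and method names are invented for this sketch, and a real service would use a metrics library such as prometheus_client instead), but it shows exactly what RED asks you to record per endpoint:

```python
from collections import defaultdict

class RedTracker:
    """Toy in-process RED tracker: Rate, Errors, Duration per endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request count per endpoint
        self.errors = defaultdict(int)      # Errors: failure count per endpoint
        self.durations = defaultdict(list)  # Duration: latency samples per endpoint

    def observe(self, endpoint, status_code, duration_s):
        """Record one completed request against all three RED metrics."""
        self.requests[endpoint] += 1
        if status_code >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_rate(self, endpoint):
        """Proportion of failing requests; 0.0 if the endpoint saw no traffic."""
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0
```

In production the duration samples would go into histogram buckets rather than an unbounded list, so percentiles can be computed and aggregated cheaply.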

Deep — senior internals

Latency percentiles matter more than averages. The average is misleading: p50=10ms, p99=2000ms means 1% of users experience 2-second latency, which is invisible in the average. Always instrument with histograms (prefer a Prometheus histogram over a summary: histogram buckets can be aggregated across instances, while precomputed summary quantiles cannot) and report p50, p95, p99, p999. The p999 (99.9th percentile) is the 'long tail' — at 1000 RPS, roughly 1 request per second experiences worse-than-p999 latency. For SLO measurement, use a ratio metric: error_rate = sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])). This is burned against the error budget.
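A quick worked example of why the average hides the tail. The distribution below is synthetic (hypothetical numbers chosen to mirror the fast-majority/slow-tail shape described above: 98% of requests at 10ms, 2% at 2000ms), using a simple nearest-rank percentile:

```python
import statistics

# Synthetic latency samples: 98% fast requests, 2% slow tail (hypothetical).
latencies_ms = [10] * 980 + [2000] * 20

def percentile(samples, p):
    """Nearest-rank percentile: the sample at rank ceil(p/100 * n)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

mean = statistics.mean(latencies_ms)   # ~50ms: looks healthy
p50 = percentile(latencies_ms, 50)     # 10ms
p99 = percentile(latencies_ms, 99)     # 2000ms: the 2-second tail the mean hides
```

An average-based alert set at, say, 100ms would never fire here, even though 1 in 50 users waits two seconds.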

🎤 Interview-ready answer

I instrument every service with Prometheus histograms on three metrics: http_requests_total (labeled by endpoint, method, status_code), http_request_duration_seconds (histogram), and http_errors_total. Grafana dashboards show RED metrics per endpoint per deploy. SLO alerts fire when the 1-hour error burn rate exceeds 2x the allowed budget — fast enough to catch incidents before they exhaust the monthly budget, with enough signal to avoid false positives.
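The burn-rate arithmetic behind that alert threshold, sketched under an assumed 99.9% monthly SLO (the SLO target and observed error rate here are hypothetical, chosen to illustrate the 2x condition):

```python
# Assumed monthly SLO of 99.9% (hypothetical for this sketch).
slo = 0.999
error_budget = 1 - slo             # 0.1% of requests may fail per month

# The alert condition: 1-hour error rate exceeds 2x the allowed budget.
observed_1h_error_rate = 0.002     # hypothetical: 0.2% of requests failing
burn_rate = observed_1h_error_rate / error_budget  # 2.0x

# At a constant burn rate, the monthly budget exhausts in (month / burn_rate).
hours_in_month = 30 * 24
hours_to_exhaustion = hours_in_month / burn_rate   # 360 hours ≈ 15 days
```

A sustained 2x burn would still leave roughly two weeks of budget, which is why a 1-hour window at this multiplier catches real incidents without paging on every transient blip.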

Common trap

Setting alert thresholds on averages rather than percentiles. An average latency alert will miss the p99 degradation that your most demanding users experience.

Related concepts