What is SLO / SLI / SLA / Error Budgets in High-Level Design?

Observability · high

SLO / SLI / SLA / Error Budgets

SLI (Service Level Indicator): a measured metric of service behavior (e.g., request success rate). SLO (Service Level Objective): a target for the SLI over a time window (e.g., 99.9% success rate over 30 days). SLA (Service Level Agreement): a contract with the customer that attaches consequences, typically financial penalties or service credits, to missing the agreed service level. Error budget: the allowed failure headroom (100% - SLO target).
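To make the definitions concrete, here is a minimal Python sketch that computes a success-rate SLI and turns an SLO target into an error budget. The function names and the request counts are hypothetical, chosen only for illustration.

```python
def sli_success_rate(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded (treated as 1.0 with no traffic)."""
    return good / total if total else 1.0


def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 1 - SLO target (0.999 -> about 0.001)."""
    return 1.0 - slo_target


def allowed_failures(slo_target: float, total: int) -> float:
    """Error budget expressed as the number of requests that may fail."""
    return error_budget(slo_target) * total


def meets_slo(good: int, total: int, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_success_rate(good, total) >= slo_target


if __name__ == "__main__":
    total, good = 1_000_000, 999_400  # hypothetical request counts for one window
    print(f"SLI: {sli_success_rate(good, total):.4%}")
    print(f"Allowed failures at 99.9%: {allowed_failures(0.999, total):.0f}")
    print(f"Actual failures: {total - good}")
    print(f"SLO met: {meets_slo(good, total, 0.999)}")
```

With these numbers the service fails 600 requests against a budget of roughly 1,000, so the SLO is met with budget to spare.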

Memory anchor

Error budget = a jar of 'oops tokens.' Each outage spends a token. When the jar is empty, you stop shipping features and fix things. SLO is your personal speed limit; SLA is the speed that gets you a ticket.

Expected depth

A 99.9% availability SLO means 8.76 hours of allowed downtime per year (0.1% of 8760 hours), or about 43 minutes in a 30-day month. The error budget is the 0.1%. Teams track error budget burn rate: if you burn 50% of your monthly error budget in one week, you are on pace to exhaust it well before the month ends, so you need to slow down feature work and invest in reliability. Error budgets create alignment between engineering and product: when the budget is healthy, ship features; when it's exhausted, freeze new launches and focus on reliability. SLOs should be set based on user happiness (what latency or error rate causes users to notice?), not on what is technically achievable.
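The arithmetic above can be checked with a short Python sketch. The helper names are hypothetical, and the burn-rate example assumes a 30-day month.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(slo_target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime the error budget permits over the period: (1 - SLO) * period."""
    return (1.0 - slo_target) * period_hours


def burn_rate(budget_fraction_spent: float, window_fraction_elapsed: float) -> float:
    """Budget spent relative to time elapsed.

    1.0 means the budget runs out exactly at the end of the window;
    anything above 1.0 means it runs out early.
    """
    return budget_fraction_spent / window_fraction_elapsed


if __name__ == "__main__":
    print(f"99.9% per year: {allowed_downtime_hours(0.999):.2f} hours")
    print(f"99.9% per 30 days: {allowed_downtime_hours(0.999, 30 * 24) * 60:.1f} minutes")
    # 50% of the monthly budget gone after 7 of 30 days (assumed month length).
    print(f"Burn rate: {burn_rate(0.5, 7 / 30):.2f}x")
```

A burn rate a little above 2x means the monthly budget would be gone in about two weeks if nothing changes.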

Deep — senior internals

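Multi-window, multi-burn-rate alerting is the SLO alerting strategy recommended in the Google SRE Workbook. Burn rate is how fast the budget is being consumed relative to the rate that would spend exactly 100% of it over the SLO window, so a burn rate of 1x exhausts the budget exactly at the end of the window. Alert when: (1) burn rate > 14.4x in a 1-hour window AND burn rate > 14.4x in a 5-minute window (fast burn, page immediately; 14.4x for 1 hour spends 2% of a 30-day budget); (2) burn rate > 6x in a 6-hour window AND burn rate > 6x in a 30-minute window (slower burn, 5% of a 30-day budget in 6 hours; the Workbook also pages at this tier, though some teams route it to a ticket or Slack). The Workbook adds a third tier for slow degradation that opens a ticket: burn rate > 1x in a 3-day window AND in a 6-hour window. The long window shows that a significant share of the budget is actually being spent; the short window confirms the problem is still happening, so the alert resets soon after recovery. Together the tiers cover both sudden spikes and slow degradation, and they reduce alert noise compared with a single static threshold. Error budget policies should be documented: what happens when the budget is exhausted (engineering freeze, SLO review, postmortem)? Without a policy, the error budget is just a metric, not a cultural tool.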
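The two-tier rule can be sketched in plain Python as follows. This is an illustration of the logic, not a Prometheus rule: the `BurnRule` type, the `RULES` table, and the routing labels are hypothetical, and the thresholds (14.4x and 6x) are the values given in the Google SRE Workbook. In a real setup the per-window error ratios would come from your metrics backend.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BurnRule:
    name: str
    threshold: float   # burn-rate multiple that must be exceeded
    long_window: str   # shows that meaningful budget is being spent
    short_window: str  # confirms the problem is still ongoing
    action: str        # illustrative routing label; adjust to your policy


RULES = (
    BurnRule("fast", 14.4, "1h", "5m", "page"),
    BurnRule("slow", 6.0, "6h", "30m", "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    return error_ratio / (1.0 - slo_target)


def firing_alerts(error_ratio_by_window: Dict[str, float], slo_target: float) -> List[str]:
    """Return 'name:action' for every rule whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratio_by_window[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratio_by_window[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(f"{rule.name}:{rule.action}")
    return fired


if __name__ == "__main__":
    # Hypothetical error ratios per lookback window for a 99.9% SLO.
    ongoing_spike = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004}
    recovered_spike = {"5m": 0.0, "30m": 0.001, "1h": 0.02, "6h": 0.004}
    print(firing_alerts(ongoing_spike, 0.999))
    print(firing_alerts(recovered_spike, 0.999))
```

Requiring both windows is the design point: in the second scenario the 1-hour window is still elevated, but the 5-minute window has recovered, so nothing fires.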

🎤 Interview-ready answer

I set SLOs by starting with user research: what response time causes users to abandon? What error rate generates support tickets? I target SLOs 20% stricter than SLAs to provide a buffer. Error budget burn is tracked with multi-window alerting in Prometheus/Alertmanager. The error budget policy is: at 50% monthly burn, the team allocates 20% of sprint capacity to reliability work; at 100% burn, all feature work stops until the budget is restored.
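The policy in this answer is simple enough to encode directly, which keeps it unambiguous when it is written down. The sketch below assumes the budget is measured in failed requests over the month; the function names and returned strings are hypothetical.

```python
def budget_spent_fraction(failed: int, total: int, slo_target: float) -> float:
    """Share of the error budget consumed so far (can exceed 1.0 when overspent)."""
    allowed = (1.0 - slo_target) * total
    return failed / allowed if allowed else float("inf")


def policy_action(spent_fraction: float) -> str:
    """Map budget consumption to the policy tiers described in the answer above."""
    if spent_fraction >= 1.0:
        return "stop feature work until the budget is restored"
    if spent_fraction >= 0.5:
        return "allocate 20% of sprint capacity to reliability work"
    return "ship features as normal"


if __name__ == "__main__":
    # Hypothetical month so far: 600 failures out of 1,000,000 requests at a 99.9% SLO.
    spent = budget_spent_fraction(failed=600, total=1_000_000, slo_target=0.999)
    print(f"{spent:.0%} of budget spent -> {policy_action(spent)}")
```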

⚠ Common trap

Conflating SLO with SLA. If your SLO equals your SLA, any SLO violation is immediately an SLA breach and a customer-impacting business event. SLOs must be stricter than SLAs to provide headroom for operational response.
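A quick way to see the headroom is to compare what each target allows over the same window. The sketch below uses hypothetical targets (a 99.9% SLA and a 99.95% internal SLO) and assumes a 30-day window; the function names are illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_bad_minutes(target: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of unavailability the target permits over the window."""
    return (1.0 - target) * window_minutes


def response_headroom_minutes(slo_target: float, sla_target: float) -> float:
    """Extra bad minutes available after the SLO is missed but before the SLA is breached."""
    if slo_target <= sla_target:
        return 0.0  # SLO equal to (or looser than) the SLA leaves no buffer
    return allowed_bad_minutes(sla_target) - allowed_bad_minutes(slo_target)


if __name__ == "__main__":
    print(f"{response_headroom_minutes(0.9995, 0.999):.1f} minutes of headroom")
    print(f"{response_headroom_minutes(0.999, 0.999):.1f} minutes when SLO == SLA")
```

With these assumed targets the team has about 21.6 minutes per 30 days between missing its own objective and breaching the contract; with equal targets it has none.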

Related concepts

High-Level Design

RED Metrics (Rate / Errors / Duration)