SLA, SLO, and SLI
An SLI (Service Level Indicator) is a measured metric (e.g., p99 latency). An SLO (Service Level Objective) is the target for that metric (e.g., p99 latency < 200 ms, or 99.9% of requests succeed). An SLA (Service Level Agreement) is a contractual commitment to customers, with penalties for violations.
SLI = the speedometer reading. SLO = the speed limit you set for yourself (55 mph). SLA = the speed limit on the sign with a fine if you break it. Always drive under YOUR limit so you never pay the ticket.
SLIs are the raw measurements: latency percentiles, error rate, availability (successful requests / total requests). SLOs are the internal targets teams operate to. They should be tighter than SLAs, so that missing an SLO gives a warning before the SLA is breached. The error budget is the amount of failure the SLO permits (1 minus the target): 99.9% availability allows ~43.8 minutes of downtime per month (using an average month of about 30.4 days). Error budgets drive decisions: if you're burning budget fast, halt feature releases and focus on reliability. If the budget is healthy, you can afford riskier changes.
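The error-budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function names are hypothetical, and the month length is assumed to be an average month of 365.25 / 12 days (a 30-day month would give 43.2 minutes at 99.9% instead of 43.8).

```python
# Assumption: an "average" month of 365.25 / 12 days, i.e. 43,830 minutes.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def allowed_downtime_minutes(slo: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Error budget, as minutes of full downtime, for an availability SLO in (0, 1)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * period_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    return 1.0 - downtime_minutes / allowed_downtime_minutes(slo, period_minutes)


if __name__ == "__main__":
    # Print the downtime allowance for three, four, and five nines.
    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%} -> {allowed_downtime_minutes(target):.2f} min/month")
    print(f"{budget_remaining(0.999, 30.0):.0%} of the budget left after 30 min of downtime")
```

Expressing the budget as a remaining fraction rather than raw minutes makes the same check reusable across services with different targets.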
Availability math: 99.9% = 43.8 min/month of downtime, 99.99% = 4.38 min/month, 99.999% = 26 sec/month. Each additional nine cuts the allowed downtime by 10x and costs significantly more to achieve.
The error budget framework (Google SRE): teams own their error budget. If a system violates its SLO, the team must prioritize reliability over features until the budget recovers. This creates a business-driven incentive for reliability without top-down mandates.
Measuring availability correctly: don't just track uptime (is the server running?). Measure request success rate at the client level: a server that returns 500s for all requests is 'up' but has 0% availability. Use synthetic monitoring (probes from multiple regions) plus real user monitoring (RUM) for comprehensive coverage.
Latency SLOs should target percentiles, not averages: if p99 is 2 seconds, 1% of requests take 2 seconds or longer, which is 10,000 slow requests per million. Design for the tail.
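To make the request-level view concrete, here is a small Python sketch that derives an availability SLI and a latency percentile from a batch of request records. The `Request` record, the "status below 500 counts as success" rule, and the nearest-rank percentile method are all assumptions chosen for illustration; real monitoring systems typically use histograms or streaming estimates instead.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    status: int        # HTTP status code as observed by the client
    latency_ms: float  # end-to-end latency as observed by the client


def availability(requests: list[Request]) -> float:
    """Share of requests that succeeded (assumption: any status < 500 is a success)."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need values and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # 98 fast successes and 2 slow server errors.
    sample = [Request(200, 100.0)] * 98 + [Request(500, 2000.0)] * 2
    latencies = [r.latency_ms for r in sample]
    print("availability:", availability(sample))
    print("mean latency:", mean(latencies))
    print("p50 latency: ", percentile(latencies, 50))
    print("p99 latency: ", percentile(latencies, 99))
```

In this sample the mean and the median both look healthy while the p99 exposes the slow tail, which is the reason to set latency targets on percentiles.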
I'd define SLOs before building: what does 'reliable' mean for this system? Typically: availability (99.9% or 99.99%), latency (p50/p95/p99 targets), and error rate (< 0.1%). I'd instrument every service to emit these SLIs and set up error budget alerts. If 50% of the budget is consumed before the month is half over, I'd trigger a reliability review. I'd choose SLOs based on what users actually need: a batch processing job has very different requirements than a payment API.
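One way to express the error budget alert is as a burn-rate check. The sketch below is illustrative only: the function names and the 50% threshold parameter are assumptions, and production alerting usually evaluates burn rate over several time windows rather than a single snapshot.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means it runs out exactly at period end."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return error_rate / (1.0 - slo)


def budget_consumed(error_rate: float, slo: float, elapsed_fraction: float) -> float:
    """Fraction of the budget spent so far, assuming a steady error rate and traffic."""
    return burn_rate(error_rate, slo) * elapsed_fraction


def needs_reliability_review(consumed_fraction: float, elapsed_fraction: float,
                             threshold: float = 0.5) -> bool:
    """True when `threshold` of the budget is gone before that share of the period has passed."""
    return consumed_fraction >= threshold and elapsed_fraction < threshold


if __name__ == "__main__":
    # 0.2% errors against a 99.9% SLO, 40% of the way through the month.
    consumed = budget_consumed(error_rate=0.002, slo=0.999, elapsed_fraction=0.4)
    print("burn rate:", burn_rate(0.002, 0.999))
    print("budget consumed:", consumed)
    print("review needed:", needs_reliability_review(consumed, 0.4))
```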
Conflating SLA and SLO. The SLA is the external promise (often 99.9%, backed by service credits). The SLO is the internal target (e.g., 99.95%), always tighter so that problems are caught before they breach the SLA. Never commit to an SLA that is as strict as your internal SLO.