DevOps Architecture

[DevOps Architecture] SLO and SLI - Don't Pursue 100%, Buy Speed With Error Budget

About this article

As the eleventh installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains SLO, SLI, and SLA.

100% uptime is not a goal worth pursuing - the essence of SLO is buying development speed with an error budget. This article covers the relationship among SLI (actuals), SLO (targets), and SLA (contracts), error-budget operations, choosing user-perspective SLIs, and per-service-type numerical gates.

What are SLO and SLI in the first place

Think of on-time train performance. Japanese railways set a numerical target such as an "on-time rate of 99.x%", drawing a clear line on how many minutes of delay are acceptable and what triggers improvement work. Aiming for 100% would mean over-investment; setting a realistic target and operating against it is the railways' wisdom.

SLO (Service Level Objective) is the web-service version of an on-time target. It numerically defines service quality goals like “99.9% uptime” or “response time under 200ms.” SLI (Service Level Indicator) is the actual measurement, and SLA (Service Level Agreement) is the contractual value with customers.

Without an SLO, "make it more stable" and "make it faster" never end, leading either to over-investment in quality or to neglecting quality until a major incident.

Why SLO and SLI are needed

Defining “good enough”

Without an SLO, requests to "stabilize it more" and "make it faster" never end, and development freezes under quality over-investment. A numerical agreement puts investment decisions on a rational footing.

A common language for business and tech

Showing "99.9% uptime = 43 minutes of monthly downtime" as a number lets the business side, too, understand the technical trade-offs. Debates conducted in words alone never converge.

Improvement priorities become clear

Areas with frequent SLO violations get investment priority; areas meeting their SLOs shift to new-feature work. Resource allocation becomes logical.

Differences among SLI / SLO / SLA

```mermaid
flowchart LR
    USER([User experience]) --> SLI[SLI<br/>actual<br/>e.g.: this month uptime 99.95%]
    SLI -->|compare with target| SLO[SLO<br/>internal target<br/>e.g.: don't drop below 99.9%]
    SLO -->|margin| SLA[SLA<br/>contract<br/>e.g.: penalty under 99.5%]
    SLO -.|error budget| EB[remaining 0.1%<br/>= 43 min monthly budget]
    EB -->|budget left| RELEASE[new feature release]
    EB -->|exhausted| FREEZE[feature freeze / reliability investment]
    classDef user fill:#fef3c7,stroke:#d97706;
    classDef sli fill:#dbeafe,stroke:#2563eb;
    classDef slo fill:#dcfce7,stroke:#16a34a,stroke-width:2px;
    classDef sla fill:#fae8ff,stroke:#a21caf;
    classDef budget fill:#f0f9ff,stroke:#0369a1;
    classDef bad fill:#fee2e2,stroke:#dc2626;
    class USER user;
    class SLI sli;
    class SLO slo;
    class SLA sla;
    class EB,RELEASE budget;
    class FREEZE bad;
```

| | Meaning | Usage |
|---|---|---|
| SLI (Service Level Indicator) | Actual measurement | Current uptime, latency |
| SLO (Service Level Objective) | Internal target | 99.9%, under 200ms |
| SLA (Service Level Agreement) | Contractual promise | Contracts with external customers; penalty on violation |

The iron rule is to set the SLO stricter than the SLA (for example, an internal SLO of 99.9% backing a contractual SLA of 99.5%). Without an internal target stricter than the contract value, you end up breaking the promise made to customers. The SLO is an internal guardrail with margin against the SLA.

Typical SLIs

SLIs are metrics that measure user experience, and the following patterns are standard. "CPU usage" and "memory consumption" are not SLIs: they are only indirect proxies for user impact.

| Type | Definition |
|---|---|
| Availability | Successful requests / total requests |
| Latency | e.g., within 200ms at the 95th percentile |
| Error rate | 5xx errors / total requests |
| Throughput | Requests processed per second |
| Accuracy | Correct results / all results |
| Freshness | Time elapsed since the last data update |

The right answer is to choose axes on which users directly feel "broken," "slow," or "wrong."
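
As a minimal sketch (not from the original article), here is how the availability and latency SLIs above might be computed from raw request records; the record fields status and latency_ms are hypothetical:

```python
# Sketch: computing availability and latency SLIs from request records.
# The record format (status, latency_ms) is a hypothetical example.

def availability_sli(requests: list[dict]) -> float:
    """Successful requests / total requests (5xx counted as failures)."""
    if not requests:
        return 1.0
    ok = sum(1 for r in requests if r["status"] < 500)
    return ok / len(requests)

def latency_sli(requests: list[dict], threshold_ms: float = 200.0) -> float:
    """Share of requests answered within the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r["latency_ms"] <= threshold_ms)
    return fast / len(requests)

requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 340},
    {"status": 503, "latency_ms": 90},
]
print(availability_sli(requests))  # 0.666...
print(latency_sli(requests))       # 0.666...
```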

Availability guideline table

"99.9%" is a hard number to get an intuition for, but converting it into downtime makes the judgment easier. Choose the appropriate level for each business requirement.

| Availability | Allowed downtime/month | Allowed downtime/year | Suited for |
|---|---|---|---|
| 99% | About 7 hours | About 3.6 days | Internal tools |
| 99.9% | About 43 min | About 8.7 hours | General B2C services |
| 99.95% | About 22 min | About 4.4 hours | B2B SaaS |
| 99.99% | About 4.3 min | About 52 min | Finance, payments |
| 99.999% | About 26 sec | About 5.2 min | Telecom, power |

"99.999%" is the extremely strict level that allows only about 26 seconds of monthly downtime. It is excessive for most services.
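
The table follows from simple arithmetic. A quick sketch, assuming a 30-day month:

```python
# Sketch: convert an availability target into allowed downtime,
# assuming a 30-day month (43,200 minutes).

MONTH_MINUTES = 30 * 24 * 60  # 43,200

def allowed_downtime_minutes(availability: float) -> float:
    """Monthly downtime budget implied by an availability target."""
    return MONTH_MINUTES * (1 - availability)

for slo in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%}: {allowed_downtime_minutes(slo):.1f} min/month")
# 99.000%: 432.0 min/month (about 7.2 h)
# 99.900%: 43.2 min/month
# 99.950%: 21.6 min/month
# 99.990%: 4.3 min/month
# 99.999%: 0.4 min/month (about 26 s)
```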

Error Budget

The "allowable amount of failure" under an SLO. Aiming for 99.9% means 0.1% may fail, and that 0.1% is the error budget. The operational rule: release aggressively while budget remains, and freeze releases when it is exhausted.

```
SLO: 99.9% availability (43 min monthly downtime allowed)
|- Beginning of month: 43 min budget
   |- 10 min down from a release → 33 min remaining
   |- 30 min down from an incident → 3 min remaining
   |- Budget exhausted → freeze releases, prioritize stabilization
```

While error budget remains, accelerate development; when it is exhausted, invest in reliability improvement. This is the mechanism that balances development and operations.
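
A minimal sketch of that budget arithmetic (the downtime figures are the hypothetical ones from the diagram above):

```python
# Sketch: tracking the remaining error budget through a month.
MONTH_MINUTES = 30 * 24 * 60

def remaining_budget(slo: float, downtime_events_min: list[float]) -> float:
    budget = MONTH_MINUTES * (1 - slo)  # 43.2 min for a 99.9% SLO
    return budget - sum(downtime_events_min)

print(remaining_budget(0.999, [10, 30]))  # ~3.2 min left -> close to a freeze
```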

SLO selection process

SLOs are not decided arbitrarily; they are agreed with the business. A tech-centric SLO that ignores user impact is meaningless, so agreement among business, sales, and engineering is required.

| Step | Content |
|---|---|
| 1. Identify the critical path | The features every user passes through |
| 2. Pick user-experience axes | Define what counts as "broken" |
| 3. Measure existing values | Know the current actuals |
| 4. Propose targets | Set realistic targets |
| 5. Get stakeholder agreement | Align with business and management |
| 6. Operate and review | Review quarterly |

The safe approach is to start loose and tighten gradually. A 99.99% promised from day one is rarely kept.

SLO-based operation

Once an SLO is set, operational decisions become numerical. "Should we release?" and "Should this alert?" are decided by numbers, not by feel. This is the essence of SRE operations.

| Scenario | Decision criterion |
|---|---|
| Error budget remaining > 50% | Accelerate new-feature releases |
| Error budget 10-50% | Normal operation, proceed carefully |
| Error budget < 10% | Freeze releases, stabilize |
| Error budget exhausted | Stop releases, investigate root causes |

When the error budget hits zero, the SRE rule is to stop new features and improve reliability. Ignore it and you lose both reliability and speed.
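
As a sketch, the decision table above can be encoded directly as a release gate; the thresholds mirror the table, and the function name is illustrative:

```python
# Sketch: release-gate decision from the remaining error-budget fraction,
# mirroring the thresholds in the decision table above.

def release_decision(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "stop releases, investigate root causes"
    if remaining_fraction < 0.10:
        return "freeze releases, stabilize"
    if remaining_fraction < 0.50:
        return "normal operation, proceed carefully"
    return "accelerate new-feature releases"

print(release_decision(0.74))  # accelerate new-feature releases
print(release_decision(0.07))  # freeze releases, stabilize
```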

A standard anecdote of SRE adoption is that the moment an availability SLO is articulated, the development team's atmosphere changes. At organizations that agreed on 99.9% as a number, the vague tension of "always be cautious about releases" gave way to conversations like "we have 32 minutes left this month, it's OK if this fails," and new-feature deploy speed reportedly doubled. The SLO is presented not as a number that constrains, but as a number that lets you step forward with confidence.

SLO alerts (burn rate)

Alerts that detect SLO violations early use the burn rate: the speed of budget consumption. Consuming 10% of the monthly budget in one hour is a warning, 50% is critical; you judge by the slope of budget consumption.

| Burn rate | Meaning | Response |
|---|---|---|
| 1x | Normal consumption | None |
| 5x | Early warning | Investigate |
| 10x | Rapid consumption | Emergency response |
| 50x | Critical | Immediate rollback |

SLO burn-rate alerts are more meaningful than threshold alerts ("CPU above 80%," etc.), because burn rate reflects actual user impact.
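
Burn rate itself is simple arithmetic: the observed error rate divided by the error rate the SLO allows. A minimal sketch:

```python
# Sketch: burn rate = observed error rate / error rate allowed by the SLO.
# At burn rate 1x, the monthly budget lasts exactly one month.

def burn_rate(observed_error_rate: float, slo: float) -> float:
    allowed = 1 - slo  # 0.001 for a 99.9% SLO
    return observed_error_rate / allowed

print(burn_rate(0.0144, 0.999))  # ~14.4x -> critical-level burn
```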

Multi-window multi-burn-rate

The technique of monitoring both rapid consumption in short time windows and persistent consumption in long windows. It detects both "budget being devoured within an hour" and "budget leaking away gradually over 24 hours."

| Window | Detection target |
|---|---|
| Short (1 hour, 6 hours) | Rapid incidents |
| Mid (1 day, 3 days) | Persistent issues |
| Long (7 days, 30 days) | Chronic quality decline |

This is detailed in the Google SRE Workbook and is the standard alert design of modern SRE.
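
A hedged sketch of the pattern: each rule pairs a long window with a short verification window and fires only when both exceed the threshold, so alerts stop firing once the problem stops. The burn_rate_over callback stands in for a metrics-backend query and is purely hypothetical; the thresholds are the Workbook-style values cited later in this article:

```python
# Sketch: multi-window, multi-burn-rate alerting (SRE Workbook pattern).
# burn_rate_over(window_hours) is a hypothetical metrics-backend query.

RULES = [
    # (long window h, short window h, burn-rate threshold, severity)
    (1,  1 / 12, 14.4, "critical"),  # 2% of a 30-day budget in 1 h
    (6,  0.5,    6.0,  "high"),      # 5% in 6 h
    (72, 6,      1.0,  "warning"),   # 10% in 3 days
]

def evaluate(burn_rate_over) -> list[str]:
    alerts = []
    for long_w, short_w, threshold, severity in RULES:
        # Both windows must burn fast: the long window proves real impact,
        # the short window proves the problem is still ongoing.
        if burn_rate_over(long_w) >= threshold and burn_rate_over(short_w) >= threshold:
            alerts.append(severity)
    return alerts

print(evaluate(lambda w: 15.0))  # ['critical', 'high', 'warning']
```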

Decision criterion 1: service nature

SLO strictness is determined by the nature of the service. Stricter for services lives or money depend on, looser for experimental ones; that is the realistic stance.

| Service type | Recommended SLO |
|---|---|
| Internal tools | 99% (7 hours of monthly downtime allowed) |
| General B2C services | 99.9% (43 min) |
| Important B2C / payments | 99.95%-99.99% |
| Finance, medical | 99.99%+ |
| Beta / experimental | 95% is enough |

Decision criterion 2: org maturity

It is realistic to introduce SLOs only after the organization has reached some maturity. Setting an SLO you cannot measure is meaningless; a metrics-collection foundation comes first.

| Maturity | State |
|---|---|
| Lv1: No monitoring | Build a metrics foundation first |
| Lv2: Monitoring in place | Choose SLI candidates from actual data |
| Lv3: SLIs selected | Set SLOs and start operating |
| Lv4: SLOs operating | Operational decisions driven by error budget |
| Lv5: SRE mature | Automated judgment, autonomous operation |

How to choose by case

Personal dev / internal tools

Availability 99% plus latency only. Error-budget operations are unnecessary; just set up a metrics foundation. Seven hours of monthly downtime is realistic and avoids over-investment.

Startup / general B2C SaaS

Three pillars: availability 99.9%, P95 latency (95th percentile: response time excluding the slowest 5%), and error rate. Start with the SLO features of Datadog or Grafana Cloud. Review the actuals quarterly to adjust targets, and operate the rule that releases freeze when the error budget is exhausted.

Finance / payments / medical

Availability 99.99% plus multi-burn-rate alerts. SLA violations translate directly into penalties, so margin design with the SLO stricter than the SLA is mandatory. Build a regime in which incident notifications reach management immediately.

AI agents / LLM services

A 4-axis SLO of accuracy, response delay, cost, and safety. A legacy availability SLO alone is insufficient. Measure the accuracy SLI with tools such as DeepEval or Ragas, and continuously monitor the hallucination rate.
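
As an illustration only (the target values below are assumptions, not recommendations), a 4-axis SLO for an LLM service might be encoded like this:

```python
# Sketch: a 4-axis SLO definition for an LLM service.
# All target values are illustrative; tune them per product and risk profile.

from dataclasses import dataclass

@dataclass
class LlmSlo:
    accuracy_target: float           # share of eval cases judged correct
    p95_latency_s: float             # response-delay target at P95
    cost_per_1k_requests_usd: float  # budget ceiling per 1k requests (assumed axis value)
    hallucination_rate_max: float    # safety: max share of hallucinated answers

slo = LlmSlo(
    accuracy_target=0.95,
    p95_latency_s=3.0,
    cost_per_1k_requests_usd=5.0,
    hallucination_rate_max=0.05,
)

def meets_slo(accuracy, p95_s, cost_1k, halluc_rate, s=slo) -> bool:
    return (accuracy >= s.accuracy_target
            and p95_s <= s.p95_latency_s
            and cost_1k <= s.cost_per_1k_requests_usd
            and halluc_rate <= s.hallucination_rate_max)

print(meets_slo(0.96, 2.4, 3.2, 0.03))  # True
```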

Numerical gates by SLO level × service type

Note: industry baseline values as of April 2026. They will become outdated as technology and industry practice shift, so they require periodic updates.

An SLO that says just "99.9%" is vague; the practice is to set numbers on multiple axes per service type.

| Service type | Availability SLO | Latency (P95) | Error rate | Error budget/month |
|---|---|---|---|---|
| Internal tools | 99% | 1,000ms | 1% | 7 hours |
| General B2C web | 99.9% | 300ms | 0.5% | 43 min |
| B2B SaaS | 99.95% | 200ms | 0.3% | 22 min |
| Finance / payments | 99.99% | 100ms | 0.1% | 4.3 min |
| Telecom, power | 99.999% | 50ms | 0.01% | 26 sec |
| AI agents (LLM) | Accuracy 95% | Response delay 3s | Hallucination rate < 5% | Custom design |

Burn-rate alert gates: critical on consuming 2% of the budget in 1 hour (burn rate > 14.4x), high on 5% in 6 hours (6x), warning on 10% in 3 days (1x). These are the standard firing criteria of the Google SRE Workbook.

Set separate SLO numbers per service type. One standard applied to all services ends up excessive for some and insufficient for others.
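
One way to keep such per-type gates actionable is to encode them as shared configuration rather than a wiki table; a sketch using the baseline values above:

```python
# Sketch: per-service-type SLO gates as shared configuration,
# using the baseline values from the table above.

SLO_GATES = {
    "internal_tool": {"availability": 0.99,    "p95_ms": 1000, "error_rate": 0.01},
    "b2c_web":       {"availability": 0.999,   "p95_ms": 300,  "error_rate": 0.005},
    "b2b_saas":      {"availability": 0.9995,  "p95_ms": 200,  "error_rate": 0.003},
    "finance":       {"availability": 0.9999,  "p95_ms": 100,  "error_rate": 0.001},
    "telecom_power": {"availability": 0.99999, "p95_ms": 50,   "error_rate": 0.0001},
}

gate = SLO_GATES["b2c_web"]
print(gate["availability"], gate["p95_ms"], gate["error_rate"])
```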

SLO-operation pitfalls and forbidden moves

Typical accident patterns in SLO operation. All of them end in a state where the numbers no longer drive decisions.

| Forbidden move | Why it's bad |
|---|---|
| Setting 100% uptime as the target | Infinite cost, development stops; operating with an error budget of 0 breaks down |
| Setting SLO = SLA | No internal guardrail; any violation is a contract violation with penalties |
| Making CPU usage an SLI | Not directly tied to user impact; error rate and latency are the right choice |
| Measuring SLIs with averages | The slowest 1% of users become invisible; measure with P95/P99 |
| Fixing the SLO once decided | Both business and tech change; review quarterly |
| Continuing releases after error-budget exhaustion | Reliability collapses, customers churn; freeze releases on exhaustion |
| Suppressing releases with budget remaining | Excessive stabilization, lost development speed |
| SRE alone sets the SLO without stakeholder agreement | Produces numbers that ignore business impact |
| Operating only legacy threshold alerts without burn rate | Anomaly detection is delayed; the modern standard is burn rate |
| Applying only an availability SLO to AI systems | Need the 4 axes of accuracy, response delay, cost, and safety |

The reverse pattern in finance (SLA set equal to SLO; an incident exceeded the SLA and triggered large penalties) is told as the typical case in which the absence of an internal guardrail translates directly into contract violation.

The SLO is not a number that constrains you; it is a number that lets you step on the gas. Buy speed with the error budget.

AI decision axes

| AI-era favorable | AI-era unfavorable |
|---|---|
| AI-specific 4-axis SLO (accuracy, delay, cost, safety) | Availability SLO only |
| SLO stricter than SLA, with margin | SLO = SLA, no internal guardrail |
| Release decisions by error budget | Error budget not operationalized |
| Burn-rate alerts | Static threshold alerts |
  1. Choose SLIs by user impact: not CPU usage, but error rate, latency, accuracy
  2. Design margin with the SLO stricter than the SLA: the internal guardrail tighter than the contract value
  3. Decide releases by error budget: numerically choose acceleration or freeze from the remaining budget
  4. The AI era needs a 4-axis SLO: add accuracy, cost, and safety to availability

Author’s note - cases of “dev stopping from pursuing 100%”

Stories of pursuing perfection only to see release speed grind to a halt are perennial SRE talking points.

One often-heard story involves a mid-size SaaS that, instead of setting an SLO, set "zero incidents" as its goal; as a result, no new features shipped for three months and competitors took its customers. It is the typical case of development resources being sucked into endless tasks ("investigate every warning," "chase down every latency regression") until the business stalled. After introducing a 99.9% SLO and switching to operations that tolerate incidents within budget, release speed reportedly more than doubled; many such patterns are reported.

As the reverse pattern, a financial-systems company set its SLA (customer contract) at 99.9% and its internal SLO at the same 99.9%; an incident then exceeded the SLA and triggered large contract penalties. It is told as the typical lesson that the SLO must be set stricter than the SLA, with margin designed in.

Both share the same root cause, the absence of a numerical agreement, and they drive home that an SLO is not a constraining number but a dial for engineering the balance between speed and reliability.

What to decide - what is your project’s answer?

For each item below, try to articulate your project's answer in one or two sentences. Starting work while these remain vague invites the later question, "why did we decide it this way again?"

  • Critical path (SLO-target features)
  • SLI (what to measure)
  • SLO target (99.9%, 99.95%, etc.)
  • Difference from the SLA (SLO stricter than SLA)
  • Error-budget operational rules (per-remaining decision criteria)
  • Burn-rate alerts (short / mid / long term)
  • Review frequency (quarterly / semi-annually)

Summary

This article covered SLO and SLI, including SLI/SLO/SLA differences, typical SLIs, availability guidelines, error budgets, burn-rate alerts, per-service-type numerical gates, and AI-era 4-axis SLO.

Choose SLIs by user impact, design margin with the SLO stricter than the SLA, decide releases by error budget, and in the AI era guarantee quality with a 4-axis SLO. That is the practical answer for SLO/SLI design in 2026.

Next time we’ll cover incident response (on-call, postmortem).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.