DevOps Architecture

[DevOps Architecture] SLO and SLI - Don't Pursue 100%, Buy Speed With Error Budget

About this article

As the eleventh installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains SLO, SLI, and SLA.

100% uptime is not a goal worth pursuing - the essence of SLO is buying development speed with an error budget. This article covers the relationship among SLI (actuals), SLO (targets), and SLA (contracts), error-budget operations, choosing user-perspective SLIs, and per-service-type numerical gates.

What are SLO and SLI in the first place

Think of on-time train performance. Japanese railways set a numerical target such as an "on-time rate of 99.x%", drawing a clear line on how many minutes of delay are acceptable and what triggers improvement work. Aiming for 100% would mean over-investment; setting a realistic target and operating against it is the railways' wisdom.

SLO (Service Level Objective) is the web-service version of an on-time target. It numerically defines service quality goals like “99.9% uptime” or “response time under 200ms.” SLI (Service Level Indicator) is the actual measurement, and SLA (Service Level Agreement) is the contractual value with customers.

Without an SLO, "make it more stable" and "make it faster" never end, leading either to over-investment in quality or to neglecting quality until a major incident.

Why SLO and SLI are needed

Defining “good enough”

Without an SLO, requests to "stabilize it more" and "make it faster" never end, and development freezes under quality over-investment. A numerical agreement puts investment decisions on a rational footing.

A common language for business and tech

Showing "99.9% uptime = 43 minutes of monthly downtime" as a number lets the business side, too, understand the technical trade-offs. Debates conducted in words alone never converge.

Improvement priorities become clear

Areas with frequent SLO violations get investment priority; areas meeting their SLOs shift to new-feature work. Resource allocation becomes logical.

Differences among SLI / SLO / SLA

```mermaid
flowchart LR
    USER([User experience]) --> SLI[SLI<br/>actual<br/>e.g.: this month uptime 99.95%]
    SLI -->|compare with target| SLO[SLO<br/>internal target<br/>e.g.: don't drop below 99.9%]
    SLO -->|margin| SLA[SLA<br/>contract<br/>e.g.: penalty under 99.5%]
    SLO -.|error budget| EB[remaining 0.1%<br/>= 43 min monthly budget]
    EB -->|budget left| RELEASE[new feature release]
    EB -->|exhausted| FREEZE[feature freeze / reliability investment]
    classDef user fill:#fef3c7,stroke:#d97706;
    classDef sli fill:#dbeafe,stroke:#2563eb;
    classDef slo fill:#dcfce7,stroke:#16a34a,stroke-width:2px;
    classDef sla fill:#fae8ff,stroke:#a21caf;
    classDef budget fill:#f0f9ff,stroke:#0369a1;
    classDef bad fill:#fee2e2,stroke:#dc2626;
    class USER user;
    class SLI sli;
    class SLO slo;
    class SLA sla;
    class EB,RELEASE budget;
    class FREEZE bad;
```

| | Meaning | Usage |
|---|---|---|
| SLI (Service Level Indicator) | Actual measurement | Current uptime, latency |
| SLO (Service Level Objective) | Internal target | 99.9%, under 200ms |
| SLA (Service Level Agreement) | Contractual promise | Contracts with external customers; penalty on violation |

The iron rule is to set the SLO stricter than the SLA (for example, an internal SLO of 99.9% backing a contractual SLA of 99.5%). Without an internal target stricter than the contract value, you end up breaking the promise made to customers. The SLO is an internal guardrail with margin against the SLA.

Typical SLIs

SLIs are metrics that measure user experience, and the following patterns are standard. "CPU usage" and "memory consumption" are not SLIs: they are only indirect proxies for user impact.

| Type | Definition |
|---|---|
| Availability | Successful requests / total requests |
| Latency | e.g., within 200ms at the 95th percentile |
| Error rate | 5xx errors / total requests |
| Throughput | Requests processed per second |
| Accuracy | Correct results / all results |
| Freshness | Time elapsed since the last data update |

The right answer is to choose axes on which users directly feel "broken," "slow," or "wrong."
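
As a minimal sketch (not from the original article), here is how the availability and latency SLIs above might be computed from raw request records; the record fields status and latency_ms are hypothetical:

```python
# Sketch: computing availability and latency SLIs from request records.
# The record format (status, latency_ms) is a hypothetical example.

def availability_sli(requests: list[dict]) -> float:
    """Successful requests / total requests (5xx counted as failures)."""
    if not requests:
        return 1.0
    ok = sum(1 for r in requests if r["status"] < 500)
    return ok / len(requests)

def latency_sli(requests: list[dict], threshold_ms: float = 200.0) -> float:
    """Share of requests answered within the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r["latency_ms"] <= threshold_ms)
    return fast / len(requests)

requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 340},
    {"status": 503, "latency_ms": 90},
]
print(availability_sli(requests))  # 0.666...
print(latency_sli(requests))       # 0.666...
```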

Availability guideline table

"99.9%" is a hard number to get an intuition for, but converting it into downtime makes the judgment easier. Choose the appropriate level for each business requirement.

| Availability | Allowed downtime/month | Allowed downtime/year | Suited for |
|---|---|---|---|
| 99% | About 7 hours | About 3.6 days | Internal tools |
| 99.9% | About 43 min | About 8.7 hours | General B2C services |
| 99.95% | About 22 min | About 4.4 hours | B2B SaaS |
| 99.99% | About 4.3 min | About 52 min | Finance, payments |
| 99.999% | About 26 sec | About 5.2 min | Telecom, power |

"99.999%" is the extremely strict level that allows only about 26 seconds of monthly downtime. It is excessive for most services.
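
The table follows from simple arithmetic. A quick sketch, assuming a 30-day month:

```python
# Sketch: convert an availability target into allowed downtime,
# assuming a 30-day month (43,200 minutes).

MONTH_MINUTES = 30 * 24 * 60  # 43,200

def allowed_downtime_minutes(availability: float) -> float:
    """Monthly downtime budget implied by an availability target."""
    return MONTH_MINUTES * (1 - availability)

for slo in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%}: {allowed_downtime_minutes(slo):.1f} min/month")
# 99.000%: 432.0 min/month (about 7.2 h)
# 99.900%: 43.2 min/month
# 99.950%: 21.6 min/month
# 99.990%: 4.3 min/month
# 99.999%: 0.4 min/month (about 26 s)
```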

Error Budget

The "allowable amount of failure" under an SLO. Aiming for 99.9% means 0.1% may fail, and that 0.1% is the error budget. The operational rule: release aggressively while budget remains, and freeze releases when it is exhausted.

```
SLO: 99.9% availability (43 min monthly downtime allowed)
|- Beginning of month: 43 min budget
   |- 10 min down from a release → 33 min remaining
   |- 30 min down from an incident → 3 min remaining
   |- Budget exhausted → freeze releases, prioritize stabilization
```

While error budget remains, accelerate development; when it is exhausted, invest in reliability improvement. This is the mechanism that balances development and operations.
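
A minimal sketch of that budget arithmetic (the downtime figures are the hypothetical ones from the diagram above):

```python
# Sketch: tracking the remaining error budget through a month.
MONTH_MINUTES = 30 * 24 * 60

def remaining_budget(slo: float, downtime_events_min: list[float]) -> float:
    budget = MONTH_MINUTES * (1 - slo)  # 43.2 min for a 99.9% SLO
    return budget - sum(downtime_events_min)

print(remaining_budget(0.999, [10, 30]))  # ~3.2 min left -> close to a freeze
```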

SLO selection process

SLOs are not decided arbitrarily; they are agreed with the business. A tech-centric SLO that ignores user impact is meaningless, so agreement among business, sales, and engineering is required.

| Step | Content |
|---|---|
| 1. Identify the critical path | The features every user passes through |
| 2. Pick user-experience axes | Define what counts as "broken" |
| 3. Measure existing values | Know the current actuals |
| 4. Propose targets | Set realistic targets |
| 5. Get stakeholder agreement | Align with business and management |
| 6. Operate and review | Review quarterly |

The safe approach is to start loose and tighten gradually. A 99.99% promised from day one is rarely kept.

SLO-based operation

Once an SLO is set, operational decisions become numerical. "Should we release?" and "Should this alert?" are decided by numbers, not by feel. This is the essence of SRE operations.

| Scenario | Decision criterion |
|---|---|
| Error budget remaining > 50% | Accelerate new-feature releases |
| Error budget 10-50% | Normal operation, proceed carefully |
| Error budget < 10% | Freeze releases, stabilize |
| Error budget exhausted | Stop releases, investigate root causes |

When the error budget hits zero, the SRE rule is to stop new features and improve reliability. Ignore it and you lose both reliability and speed.
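
As a sketch, the decision table above can be encoded directly as a release gate; the thresholds mirror the table, and the function name is illustrative:

```python
# Sketch: release-gate decision from the remaining error-budget fraction,
# mirroring the thresholds in the decision table above.

def release_decision(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "stop releases, investigate root causes"
    if remaining_fraction < 0.10:
        return "freeze releases, stabilize"
    if remaining_fraction < 0.50:
        return "normal operation, proceed carefully"
    return "accelerate new-feature releases"

print(release_decision(0.74))  # accelerate new-feature releases
print(release_decision(0.07))  # freeze releases, stabilize
```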

A standard anecdote of SRE adoption is that the moment an availability SLO is articulated, the development team's atmosphere changes. At organizations that agreed on 99.9% as a number, the vague tension of "always be cautious about releases" gave way to conversations like "we have 32 minutes left this month, it's OK if this fails," and new-feature deploy speed reportedly doubled. The SLO is presented not as a number that constrains, but as a number that lets you step forward with confidence.

SLO alerts (burn rate)

Alerts that detect SLO violations early use the burn rate: the speed of budget consumption. Consuming 10% of the monthly budget in one hour is a warning, 50% is critical; you judge by the slope of budget consumption.

| Burn rate | Meaning | Response |
|---|---|---|
| 1x | Normal consumption | None |
| 5x | Early warning | Investigate |
| 10x | Rapid consumption | Emergency response |
| 50x | Critical | Immediate rollback |

SLO burn-rate alerts are more meaningful than threshold alerts ("CPU above 80%," etc.), because burn rate reflects actual user impact.
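
Burn rate itself is simple arithmetic: the observed error rate divided by the error rate the SLO allows. A minimal sketch:

```python
# Sketch: burn rate = observed error rate / error rate allowed by the SLO.
# At burn rate 1x, the monthly budget lasts exactly one month.

def burn_rate(observed_error_rate: float, slo: float) -> float:
    allowed = 1 - slo  # 0.001 for a 99.9% SLO
    return observed_error_rate / allowed

print(burn_rate(0.0144, 0.999))  # ~14.4x -> critical-level burn
```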

Multi-window multi-burn-rate

The technique of monitoring both rapid consumption in short time windows and persistent consumption in long windows. It detects both "budget being devoured within an hour" and "budget leaking away gradually over 24 hours."

| Window | Detection target |
|---|---|
| Short (1 hour, 6 hours) | Rapid incidents |
| Mid (1 day, 3 days) | Persistent issues |
| Long (7 days, 30 days) | Chronic quality decline |

This is detailed in the Google SRE Workbook and is the standard alert design of modern SRE.
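
A hedged sketch of the pattern: each rule pairs a long window with a short verification window and fires only when both exceed the threshold, so alerts stop firing once the problem stops. The burn_rate_over callback stands in for a metrics-backend query and is purely hypothetical; the thresholds are the Workbook-style values cited later in this article:

```python
# Sketch: multi-window, multi-burn-rate alerting (SRE Workbook pattern).
# burn_rate_over(window_hours) is a hypothetical metrics-backend query.

RULES = [
    # (long window h, short window h, burn-rate threshold, severity)
    (1,  1 / 12, 14.4, "critical"),  # 2% of a 30-day budget in 1 h
    (6,  0.5,    6.0,  "high"),      # 5% in 6 h
    (72, 6,      1.0,  "warning"),   # 10% in 3 days
]

def evaluate(burn_rate_over) -> list[str]:
    alerts = []
    for long_w, short_w, threshold, severity in RULES:
        # Both windows must burn fast: the long window proves real impact,
        # the short window proves the problem is still ongoing.
        if burn_rate_over(long_w) >= threshold and burn_rate_over(short_w) >= threshold:
            alerts.append(severity)
    return alerts

print(evaluate(lambda w: 15.0))  # ['critical', 'high', 'warning']
```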

Decision criterion 1: service nature

SLO strictness is determined by the nature of the service. Stricter for services lives or money depend on, looser for experimental ones; that is the realistic stance.

| Service type | Recommended SLO |
|---|---|
| Internal tools | 99% (7 hours of monthly downtime allowed) |
| General B2C services | 99.9% (43 min) |
| Important B2C / payments | 99.95%-99.99% |
| Finance, medical | 99.99%+ |
| Beta / experimental | 95% is enough |

Decision criterion 2: org maturity

It is realistic to introduce SLOs only after the organization has reached some maturity. Setting an SLO you cannot measure is meaningless; a metrics-collection foundation comes first.

| Maturity | State |
|---|---|
| Lv1: No monitoring | Build a metrics foundation first |
| Lv2: Monitoring in place | Choose SLI candidates from actual data |
| Lv3: SLIs selected | Set SLOs and start operating |
| Lv4: SLOs operating | Operational decisions driven by error budget |
| Lv5: SRE mature | Automated judgment, autonomous operation |

How to choose by case

Personal dev / internal tools

Availability 99% plus latency only. Error-budget operations are unnecessary; just set up a metrics foundation. Seven hours of monthly downtime is realistic and avoids over-investment.

Startup / general B2C SaaS

Three pillars: availability 99.9%, P95 latency (95th percentile: response time excluding the slowest 5%), and error rate. Start with the SLO features of Datadog or Grafana Cloud. Review the actuals quarterly to adjust targets, and operate the rule that releases freeze when the error budget is exhausted.

Finance / payments / medical

Availability 99.99% plus multi-burn-rate alerts. SLA violations translate directly into penalties, so margin design with the SLO stricter than the SLA is mandatory. Build a regime in which incident notifications reach management immediately.

AI agents / LLM services

A 4-axis SLO of accuracy, response delay, cost, and safety. A legacy availability SLO alone is insufficient. Measure the accuracy SLI with tools such as DeepEval or Ragas, and continuously monitor the hallucination rate.
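
As an illustration only (the target values below are assumptions, not recommendations), a 4-axis SLO for an LLM service might be encoded like this:

```python
# Sketch: a 4-axis SLO definition for an LLM service.
# All target values are illustrative; tune them per product and risk profile.

from dataclasses import dataclass

@dataclass
class LlmSlo:
    accuracy_target: float           # share of eval cases judged correct
    p95_latency_s: float             # response-delay target at P95
    cost_per_1k_requests_usd: float  # budget ceiling per 1k requests (assumed axis value)
    hallucination_rate_max: float    # safety: max share of hallucinated answers

slo = LlmSlo(
    accuracy_target=0.95,
    p95_latency_s=3.0,
    cost_per_1k_requests_usd=5.0,
    hallucination_rate_max=0.05,
)

def meets_slo(accuracy, p95_s, cost_1k, halluc_rate, s=slo) -> bool:
    return (accuracy >= s.accuracy_target
            and p95_s <= s.p95_latency_s
            and cost_1k <= s.cost_per_1k_requests_usd
            and halluc_rate <= s.hallucination_rate_max)

print(meets_slo(0.96, 2.4, 3.2, 0.03))  # True
```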

Numerical gates by SLO level × service type

Note: industry baseline values as of April 2026. They will become outdated as technology and industry practice shift, so they require periodic updates.

An SLO that says just "99.9%" is vague; the practice is to set numbers on multiple axes per service type.

| Service type | Availability SLO | Latency (P95) | Error rate | Error budget/month |
|---|---|---|---|---|
| Internal tools | 99% | 1,000ms | 1% | 7 hours |
| General B2C web | 99.9% | 300ms | 0.5% | 43 min |
| B2B SaaS | 99.95% | 200ms | 0.3% | 22 min |
| Finance / payments | 99.99% | 100ms | 0.1% | 4.3 min |
| Telecom, power | 99.999% | 50ms | 0.01% | 26 sec |
| AI agents (LLM) | Accuracy 95% | Response delay 3s | Hallucination rate < 5% | Custom design |

Burn-rate alert gates: critical on consuming 2% of the budget in 1 hour (burn rate > 14.4x), high on 5% in 6 hours (6x), warning on 10% in 3 days (1x). These are the standard firing criteria of the Google SRE Workbook.

Set separate SLO numbers per service type. One standard applied to all services ends up excessive for some and insufficient for others.
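
One way to keep such per-type gates actionable is to encode them as shared configuration rather than a wiki table; a sketch using the baseline values above:

```python
# Sketch: per-service-type SLO gates as shared configuration,
# using the baseline values from the table above.

SLO_GATES = {
    "internal_tool": {"availability": 0.99,    "p95_ms": 1000, "error_rate": 0.01},
    "b2c_web":       {"availability": 0.999,   "p95_ms": 300,  "error_rate": 0.005},
    "b2b_saas":      {"availability": 0.9995,  "p95_ms": 200,  "error_rate": 0.003},
    "finance":       {"availability": 0.9999,  "p95_ms": 100,  "error_rate": 0.001},
    "telecom_power": {"availability": 0.99999, "p95_ms": 50,   "error_rate": 0.0001},
}

gate = SLO_GATES["b2c_web"]
print(gate["availability"], gate["p95_ms"], gate["error_rate"])
```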

SLO-operation pitfalls and forbidden moves

Typical accident patterns in SLO operation. All of them end in a state where the numbers no longer drive decisions.

| Forbidden move | Why it's bad |
|---|---|
| Setting 100% uptime as the target | Infinite cost, development stops; operating with an error budget of 0 breaks down |
| Setting SLO = SLA | No internal guardrail; any violation is a contract violation with penalties |
| Making CPU usage an SLI | Not directly tied to user impact; error rate and latency are the right choice |
| Measuring SLIs with averages | The slowest 1% of users become invisible; measure with P95/P99 |
| Fixing the SLO once decided | Both business and tech change; review quarterly |
| Continuing releases after error-budget exhaustion | Reliability collapses, customers churn; freeze releases on exhaustion |
| Suppressing releases with budget remaining | Excessive stabilization, lost development speed |
| SRE alone sets the SLO without stakeholder agreement | Produces numbers that ignore business impact |
| Operating only legacy threshold alerts without burn rate | Anomaly detection is delayed; the modern standard is burn rate |
| Applying only an availability SLO to AI systems | Need the 4 axes of accuracy, response delay, cost, and safety |

The reverse pattern in finance (SLA set equal to SLO; an incident exceeded the SLA and triggered large penalties) is told as the typical case in which the absence of an internal guardrail translates directly into contract violation.

The SLO is not a number that constrains you; it is a number that lets you step on the gas. Buy speed with the error budget.

AI decision axes

| AI-era favorable | AI-era unfavorable |
|---|---|
| AI-specific 4-axis SLO (accuracy, delay, cost, safety) | Availability SLO only |
| SLO stricter than SLA, with margin | SLO = SLA, no internal guardrail |
| Release decisions by error budget | Error budget not operationalized |
| Burn-rate alerts | Static threshold alerts |
  1. Choose SLIs by user impact: not CPU usage, but error rate, latency, accuracy
  2. Design margin with the SLO stricter than the SLA: the internal guardrail tighter than the contract value
  3. Decide releases by error budget: numerically choose acceleration or freeze from the remaining budget
  4. The AI era needs a 4-axis SLO: add accuracy, cost, and safety to availability

Author’s note - cases of “dev stopping from pursuing 100%”

Stories of pursuing perfection only to see release speed grind to a halt are perennial SRE talking points.

One often-heard story involves a mid-size SaaS that, instead of setting an SLO, set "zero incidents" as its goal; as a result, no new features shipped for three months and competitors took its customers. It is the typical case of development resources being sucked into endless tasks ("investigate every warning," "chase down every latency regression") until the business stalled. After introducing a 99.9% SLO and switching to operations that tolerate incidents within budget, release speed reportedly more than doubled; many such patterns are reported.

As the reverse pattern, a financial-systems company set its SLA (customer contract) at 99.9% and its internal SLO at the same 99.9%; an incident then exceeded the SLA and triggered large contract penalties. It is told as the typical lesson that the SLO must be set stricter than the SLA, with margin designed in.

Both share the same root cause, the absence of a numerical agreement, and they drive home that an SLO is not a constraining number but a dial for engineering the balance between speed and reliability.

What to decide - what is your project’s answer?

For each item below, try to articulate your project's answer in one or two sentences. Starting work while these remain vague invites the later question, "why did we decide it this way again?"

  • Critical path (SLO-target features)
  • SLI (what to measure)
  • SLO target (99.9%, 99.95%, etc.)
  • Difference from the SLA (SLO stricter than SLA)
  • Error-budget operational rules (per-remaining decision criteria)
  • Burn-rate alerts (short / mid / long term)
  • Review frequency (quarterly / semi-annually)

Summary

This article covered SLO and SLI, including SLI/SLO/SLA differences, typical SLIs, availability guidelines, error budgets, burn-rate alerts, per-service-type numerical gates, and AI-era 4-axis SLO.

Choose SLIs by user impact, design margin with the SLO stricter than the SLA, decide releases by error budget, and in the AI era guarantee quality with a 4-axis SLO. That is the practical answer for SLO/SLI design in 2026.

Next time we’ll cover incident response (on-call, postmortem).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.