About this article
As the eleventh installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains SLO, SLI, and SLA.
100% uptime is not a goal to pursue: the essence of SLO is buying development speed with an error budget. This article covers the relationship among SLI (actuals), SLO (targets), and SLA (contracts), error-budget operations, choosing user-perspective SLIs, and numerical gates by service type and target level.
What are SLO and SLI in the first place
Think of on-time train performance. Japanese railways hold a numerical target such as an "on-time rate of 99.x%," drawing a clear line on how many minutes of delay are acceptable and what triggers improvement. Aiming for 100% leads to over-investment, so setting a realistic target and operating against it is the railroad's wisdom.
SLO (Service Level Objective) is the web-service version of an on-time target. It numerically defines service-quality goals like "99.9% uptime" or "response time under 200ms." SLI (Service Level Indicator) is the actual measurement, and SLA (Service Level Agreement) is the value contractually agreed with customers.
Without an SLO, requests to "make it more stable" and "make it faster" never end, leading either to over-investment in quality or to neglecting quality until a major incident.
Why SLO and SLI are needed
Defining "good enough"
Without an SLO, voices saying "stabilize it more" and "make it faster" never stop, and development freezes under over-quality. Agreeing on numbers rationalizes investment decisions.
A common language for business and tech
Showing that "99.9% uptime = about 43 minutes of monthly downtime" lets the business side understand technical trade-offs in numbers. Debates conducted only in words never converge.
Improvement priorities become clear
Areas with frequent SLO violations get priority for reliability investment, while areas that comfortably meet their SLOs can prioritize new features; resource allocation becomes logical.
Differences among SLI / SLO / SLA
```mermaid
flowchart LR
USER([User experience]) --> SLI[SLI<br/>actual<br/>e.g.: this month's uptime 99.95%]
SLI -->|compare with target| SLO[SLO<br/>internal target<br/>e.g.: don't drop below 99.9%]
SLO -->|margin| SLA[SLA<br/>contract<br/>e.g.: penalty under 99.5%]
SLO -.->|error budget| EB[remaining 0.1%<br/>= 43 min monthly budget]
EB -->|budget left| RELEASE[new feature release]
EB -->|exhausted| FREEZE[feature freeze / reliability investment]
classDef user fill:#fef3c7,stroke:#d97706;
classDef sli fill:#dbeafe,stroke:#2563eb;
classDef slo fill:#dcfce7,stroke:#16a34a,stroke-width:2px;
classDef sla fill:#fae8ff,stroke:#a21caf;
classDef budget fill:#f0f9ff,stroke:#0369a1;
classDef bad fill:#fee2e2,stroke:#dc2626;
class USER user;
class SLI sli;
class SLO slo;
class SLA sla;
class EB,RELEASE budget;
class FREEZE bad;
```
| Term | Meaning | Example |
|---|---|---|
| SLI (Service Level Indicator) | Actual | Current uptime, latency |
| SLO (Service Level Objective) | Internal target | 99.9%, under 200ms |
| SLA (Service Level Agreement) | Contractual promise | Contracts with external customers, penalty on violation |
The iron rule is that the SLO must be stricter than the SLA (e.g., an internal SLO of 99.9% against a contractual SLA of 99.5%). Without an internal target stricter than the contract value, you end up breaking the promise made to customers. The SLO is an internal guardrail that keeps a margin from the SLA.
Typical SLIs
SLIs are metrics that measure user experience; the standard patterns are below. "CPU usage" and "memory consumption" are not SLIs: they are only indirect proxies for user impact.
| Type | Definition |
|---|---|
| Availability | Successful requests / total requests |
| Latency | Within 200ms at 95th percentile |
| Error rate | 5xx errors / total requests |
| Throughput | Per-second processing count |
| Accuracy | Correct results / all results |
| Freshness | Time elapsed since data update |
The right answer is to choose axes along which users actually perceive the service as broken, slow, or wrong.
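To make "good events / total events" concrete, here is a minimal sketch (not tied to any particular monitoring tool; the `Request` type and sample data are hypothetical) that computes an availability SLI and a latency SLI from raw requests.

```python
# Minimal sketch: user-facing SLIs as "good events / total events" ratios.
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed response time

def availability_sli(requests: list[Request]) -> float:
    """Successful requests / total requests (5xx counted as failures)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Share of requests served within the latency threshold."""
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)

sample = [Request(200, 120), Request(200, 310), Request(503, 95),
          Request(200, 180), Request(200, 250)]
print(f"availability SLI: {availability_sli(sample):.0%}")        # 80%
print(f"latency SLI (<=200ms): {latency_sli(sample):.0%}")         # 60%
```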
Availability guideline table
"99.9%" is hard to get a feel for, but converting it to downtime makes judgment easier. Choose the appropriate level for each business requirement.
| Availability | Allowed down/month | Allowed down/year | Suited for |
|---|---|---|---|
| 99% | About 7 hours | About 3.6 days | Internal tools |
| 99.9% | About 43 min | About 8.7 hours | General B2C services |
| 99.95% | About 22 min | About 4.4 hours | B2B SaaS |
| 99.99% | About 4.3 min | About 52 min | Finance, payments |
| 99.999% | About 26 sec | About 5.2 min | Telecom, power |
"99.999%" is the extremely strict level that allows only about 26 seconds of monthly downtime. It is excessive for most services.
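As a sanity check on the table, a few lines of Python (assuming a 30-day month and a 365-day year) reproduce the downtime figures:

```python
# Convert an availability target into allowed downtime per month and per year.
def allowed_downtime(availability_pct: float) -> tuple[float, float]:
    """Return (minutes per 30-day month, minutes per 365-day year)."""
    error_fraction = 1 - availability_pct / 100
    return error_fraction * 30 * 24 * 60, error_fraction * 365 * 24 * 60

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    month_min, year_min = allowed_downtime(target)
    print(f"{target}% -> {month_min:.1f} min/month, {year_min:.1f} min/year")
# 99.9%   ->  43.2 min/month, 525.6 min/year (~8.8 h)
# 99.999% ->   0.4 min/month (about 26 seconds), 5.3 min/year
```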
Error Budget
The error budget is the amount of failure the SLO allows. "Aim for 99.9%" means "0.1% is allowed to fail," and this 0.1% is the error budget. It is used as an operational rule: release aggressively while budget remains, freeze releases once it is exhausted.
SLO: 99.9% availability (43 min of monthly downtime allowed)
|- Beginning of month: 43 min of budget
|- 10 min of downtime from a release → 33 min remaining
|- 30 min of downtime from an incident → 3 min remaining
|- If the budget runs out → freeze releases, prioritize stabilization
While error budget remains, accelerate development; when it is exhausted, invest in reliability. This is the mechanism that balances development and operations.
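The bookkeeping above can be written down directly. The sketch below assumes a 99.9% SLO over a 30-day month; the incident durations are hypothetical.

```python
# Error-budget bookkeeping for one month under a 99.9% availability SLO.
SLO = 0.999
MONTH_MINUTES = 30 * 24 * 60              # 43,200 min in a 30-day month
budget = (1 - SLO) * MONTH_MINUTES        # 43.2 min of allowed downtime

incidents_min = [10, 30]                  # e.g. a bad release, then an outage
for downtime in incidents_min:
    budget -= downtime
    print(f"-{downtime} min -> {budget:.1f} min of budget remaining")

if budget <= 0:
    print("budget exhausted: freeze releases, prioritize stabilization")
else:
    print("budget remaining: keep releasing")
```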
SLO selection process
SLOs are not decided arbitrarily; they are agreed with the business. A tech-leaning SLO that ignores user impact is meaningless, so agreement among business, sales, and engineering is required.
| Step | Content |
|---|---|
| 1. Identify critical path | Features users always pass through |
| 2. User-experience axis | What constitutes "broken" |
| 3. Measure existing values | Know current actuals |
| 4. Propose targets | Realistic targets |
| 5. Stakeholder agreement | Agree with business / management |
| 6. Operations and review | Quarterly review |
The safe approach is to start loose and tighten gradually. A 99.99% promise made from day one cannot be kept.
SLO-based operation
Once an SLO is set, operational decisions become numerical. "Should we release?" and "Should we alert?" are decided by numbers, not by feel. This is the essence of SRE operations.
| Scenario | Decision criterion |
|---|---|
| Error budget remaining > 50% | Accelerate new-feature releases |
| Error budget 10-50% | Normal operation, careful |
| Error budget < 10% | Freeze releases, stabilize |
| Error budget exhausted | Stop releases, investigate |
When the error budget reaches zero, the SRE rule is to stop new features and improve reliability. Failing to follow this rule loses both reliability and speed.
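The decision table reads naturally as code; the sketch below takes its thresholds from the table above, and the function name is illustrative.

```python
# Map the remaining error-budget fraction to an operational mode.
def release_policy(budget_remaining_fraction: float) -> str:
    if budget_remaining_fraction <= 0:
        return "stop releases, investigate"
    if budget_remaining_fraction < 0.10:
        return "freeze releases, stabilize"
    if budget_remaining_fraction <= 0.50:
        return "normal operation, stay careful"
    return "accelerate new-feature releases"

for remaining in (0.8, 0.3, 0.05, 0.0):
    print(f"{remaining:.0%} remaining -> {release_policy(remaining)}")
```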
A standard anecdote of SRE adoption is that the moment an availability SLO is articulated, the dev team's atmosphere changes. At sites that agreed on 99.9% as a number, the vague tension of "always be cautious about releases" gave way to conversations like "we have 32 minutes left this month, it's OK if this fails," and new-feature deploy speed reportedly doubled in some cases. The SLO turns out to be not a number that constrains you, but a number that lets you move forward with confidence.
SLO alerts (burn rate)
Alerts that detect SLO violations early use the burn rate: the speed at which the error budget is consumed. For example, consuming 10% of the monthly budget in one hour might trigger a warning and 50% a critical alert; the judgment is based on the slope of budget consumption.
| Burn rate | Meaning | Response |
|---|---|---|
| 1x | Normal consumption | None |
| 5x | Early warning | Investigate |
| 10x | Rapid consumption | Emergency response |
| 50x | Critical | Immediate rollback |
An SLO burn rate is more meaningful than threshold-based alerts such as "CPU above 80%," because the burn rate directly reflects user impact.
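Burn rate is simply the observed error rate divided by the error rate the SLO allows. A small sketch, with the severity bands taken from the table above (the observed error rates fed in are hypothetical):

```python
# Burn rate = observed error rate / error rate allowed by the SLO.
def burn_rate(observed_error_rate: float, slo: float) -> float:
    allowed_error_rate = 1 - slo          # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed_error_rate

def severity(rate: float) -> str:
    if rate >= 50: return "critical: immediate rollback"
    if rate >= 10: return "rapid consumption: emergency response"
    if rate >= 5:  return "early warning: investigate"
    return "normal consumption"

slo = 0.999
for err in (0.001, 0.006, 0.02, 0.08):    # observed error rates in a window
    r = burn_rate(err, slo)
    print(f"error rate {err:.1%} -> burn rate {r:.0f}x -> {severity(r)}")
```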
Multi-window multi-burn-rate
The technique monitors both rapid consumption in short time windows and sustained consumption in long windows, so it detects both "the budget is being eaten rapidly within an hour" and "the budget is being eaten gradually over 24 hours."
| Window | Detection target |
|---|---|
| Short (1 hour, 6 hours) | Rapid incidents |
| Mid (1 day, 3 days) | Persistent issues |
| Long (7 days, 30 days) | Chronic quality decline |
This is detailed in the Google SRE Workbook and is the standard alert design of modern SRE.
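A common implementation fires only when both a long window and a much shorter window exceed the same burn-rate threshold. The sketch below uses window pairs and thresholds loosely following the Google SRE Workbook pattern; the measured burn rates are hypothetical.

```python
# Multi-window check: both the long and the short window must exceed the
# threshold, so sustained burn is caught but brief spikes do not page anyone.
def should_alert(burn_long: float, burn_short: float, threshold: float) -> bool:
    return burn_long >= threshold and burn_short >= threshold

# Hypothetical measured burn rates per window:
burn = {"5m": 20.0, "1h": 16.0, "30m": 3.0, "6h": 2.0, "3d": 0.9}

print(should_alert(burn["1h"], burn["5m"], 14.4))   # True  -> page (rapid burn)
print(should_alert(burn["6h"], burn["30m"], 6.0))   # False -> no sustained burn
print(should_alert(burn["3d"], burn["6h"], 1.0))    # False -> no chronic burn
```

The short window also makes the alert stop firing soon after the problem is fixed, because it only stays above the threshold while budget is actively being burned.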
Decision criterion 1: service nature
SLO strictness is decided by the nature of the service. Stricter for services that affect lives or money, looser for experimental ones: that is the realistic stance.
| Service type | Recommended SLO |
|---|---|
| Internal tools | 99% (allow 7 hours monthly) |
| General B2C services | 99.9% (43 min) |
| Important B2C / payments | 99.95%-99.99% |
| Finance, medical | 99.99%+ |
| Beta / experimental | 95% is enough |
Decision criterion 2: org maturity
It is realistic to introduce SLOs once the organization has some maturity. Setting an SLO without measurement is meaningless, so a metric-collection foundation comes first.
| Maturity | State |
|---|---|
| Lv1: no monitoring | Build metric foundation first |
| Lv2: monitoring | Choose SLI candidates seeing actuals |
| Lv3: SLI selected | Set SLO and start operations |
| Lv4: SLO operating | Operational decisions by error budget |
| Lv5: SRE mature | Auto-judgment, autonomous operation |
How to choose by case
Personal dev / internal tools
Availability 99% plus latency only. Error-budget operations are unnecessary; just set up a metric foundation. About 7 hours of monthly downtime is realistic and avoids over-investment.
Startup / general B2C SaaS
Three pillars: availability 99.9%, P95 latency (95th percentile, i.e. response time excluding the slowest 5%), and error rate. Start with the SLO features of Datadog or Grafana Cloud. Review the actuals quarterly to adjust targets, and operate the rule of freezing releases when the error budget is exhausted.
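For reference, a P95 latency SLI can be computed from raw samples with the standard library alone (monitoring tools normally do this for you; the samples and the 300 ms target below are hypothetical):

```python
import statistics

# Hypothetical latency samples for one evaluation window (ms).
latencies_ms = [90, 95, 110, 120, 140, 150, 160, 170, 180, 190,
                200, 210, 230, 250, 270, 290, 320, 380, 450, 900]

# 95th percentile: the response time that 95% of requests stay under.
p95 = statistics.quantiles(latencies_ms, n=100)[94]
print(f"P95 latency: {p95:.0f} ms")

slo_ok = p95 <= 300   # hypothetical latency SLO of 300 ms at P95
print("latency SLO met" if slo_ok else "latency SLO violated")
# With this sample, the slow tail pushes P95 above the target.
```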
Finance / payments / medical
Availability 99.99% plus multi-burn-rate alerts. SLA violations link directly to penalties, so the SLO must be designed stricter than the SLA, with margin. Build a regime where incident notifications reach management immediately.
AI agents / LLM services
A 4-axis SLO of accuracy, response delay, cost, and safety. A legacy availability SLO alone is insufficient. Measure the accuracy SLI with tools such as DeepEval or Ragas, and monitor the hallucination rate continuously.
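As one possible shape, the sketch below writes the 4 axes down as data and checks measured SLIs against them. The field names and numbers are hypothetical, and the DeepEval / Ragas evaluation calls that would produce the accuracy and hallucination-rate SLIs are not shown.

```python
# Hypothetical 4-axis SLO definition for an LLM-backed service.
llm_slo = {
    "accuracy":      {"target": 0.95, "sli": "graded-correct answers / evaluated answers"},
    "latency_p95_s": {"target": 3.0,  "sli": "P95 end-to-end response time (s)"},
    "cost_per_req":  {"target": 0.02, "sli": "LLM + tool spend per request (USD)"},
    "safety":        {"target": 0.05, "sli": "hallucination / policy-violation rate (upper bound)"},
}

def evaluate_llm_slo(measured: dict[str, float]) -> dict[str, bool]:
    """Quality axes must meet or exceed the target; latency, cost, safety must stay under it."""
    return {
        "accuracy": measured["accuracy"] >= llm_slo["accuracy"]["target"],
        "latency":  measured["latency_p95_s"] <= llm_slo["latency_p95_s"]["target"],
        "cost":     measured["cost_per_req"] <= llm_slo["cost_per_req"]["target"],
        "safety":   measured["hallucination_rate"] <= llm_slo["safety"]["target"],
    }

print(evaluate_llm_slo({"accuracy": 0.96, "latency_p95_s": 2.4,
                        "cost_per_req": 0.015, "hallucination_rate": 0.03}))
```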
SLO-level × service-type numerical gates
Note: industry baseline values as of April 2026. They will become outdated as technology and industry standards shift, so they require periodic review.
An SLO of just "99.9%" is vague; the practice is to set multiple axes numerically per service type.
| Service type | Availability SLO | Latency (P95) | Error rate | Error budget/month |
|---|---|---|---|---|
| Internal tools | 99% | 1,000ms | 1% | 7 hours |
| General B2C Web | 99.9% | 300ms | 0.5% | 43 min |
| B2B SaaS | 99.95% | 200ms | 0.3% | 22 min |
| Finance / payments | 99.99% | 100ms | 0.1% | 4.3 min |
| Telecom, power | 99.999% | 50ms | 0.01% | 26 sec |
| AI agents (LLM) | Accuracy 95% | Response delay 3s (P95) | Hallucination rate < 5% | Custom design |
Burn-rate alert numerical gates: critical when 2% of the budget is consumed in 1 hour (burn rate > 14.4x), high when 5% is consumed in 6 hours (6x), warning when 10% is consumed in 3 days (1x). These are the standard firing criteria from the Google SRE Workbook.
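These burn rates follow from a single division, assuming a 30-day SLO window: the budget fraction consumed divided by the window's share of the period.

```python
# Burn rate implied by "X% of the budget consumed within a given window",
# assuming a 30-day (720-hour) SLO period.
SLO_PERIOD_H = 30 * 24

def burn_rate_for(budget_consumed_fraction: float, window_hours: float) -> float:
    return budget_consumed_fraction / (window_hours / SLO_PERIOD_H)

print(f"{burn_rate_for(0.02, 1):.1f}x")    # 2% of budget in 1 hour  -> 14.4x (critical)
print(f"{burn_rate_for(0.05, 6):.1f}x")    # 5% in 6 hours           -> 6.0x (high)
print(f"{burn_rate_for(0.10, 72):.1f}x")   # 10% in 3 days           -> 1.0x (warning)
```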
Set separate SLO numbers per service type; applying the same standard to every service ends up either excessive or insufficient.
SLO-operation pitfalls and forbidden moves
Typical accident patterns in SLO operation. All of them end in a state where the numbers no longer function.
| Forbidden move | Why it's bad |
|---|---|
| Set 100% uptime as target | Infinite cost, dev stops. Operations with error budget 0 break down |
| Set SLO = SLA | No internal guardrail, violation = contract violation with penalties |
| Make CPU usage an SLI | Doesn't link directly to user impact. Error rate / latency are correct |
| Measure SLI with the average | The slowest 1% of users become invisible. Measure with P95/P99 |
| Fix SLO once decided | Both business and tech change. Quarterly review |
| Continue releases even on error-budget exhaustion | Reliability collapses, customer churn. Freeze releases on exhaustion |
| Suppress releases with budget remaining | Excessive stabilization, lost dev speed |
| SRE alone sets SLO without stakeholder agreement | Becomes numbers ignoring business impact |
| Operate only legacy thresholds without burn-rate alerts | Anomaly detection delayed. Modern standard is burn rate |
| Apply only availability SLO to AI systems | Need the 4 axes of accuracy, response delay, cost, and safety |
| "Aim for 100% uptime": pursuing perfection | Infinite cost, dev stops. An SLO below 100% is correct |
| "Set the SLA equal to the SLO": no margin | No internal guardrail; a violation becomes a contract violation with penalties |
The reverse-pattern case in finance (SLA set equal to SLO; an incident exceeded the SLA and led to large penalties) is often cited as the typical case where having no internal guardrail links an SLO miss directly to a contract violation.
An SLO is not a number that constrains you; it is a number that lets you step on the accelerator. Buy speed with the error budget.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| AI-specific 4-axis SLO (accuracy, delay, cost, safety) | Availability SLO only |
| SLO stricter than SLA, with margin | SLO = SLA, no internal guardrail |
| Release decisions by error budget | Error budget not operationalized |
| Burn-rate alerts | Static threshold alerts |
- Choose SLIs by user impact: not CPU usage but error rate, latency, accuracy
- Design the SLO stricter than the SLA: the internal guardrail keeps a margin from the contract value
- Decide releases by error budget: determine acceleration or freeze numerically from the remaining budget
- The AI era needs a 4-axis SLO: add accuracy, cost, and safety to availability
Author's note: cases of "dev stopping from pursuing 100%"
Cases where pursuing perfection brought release speed to a halt are perennial SRE talking points.
A story often heard: a mid-size SaaS set "zero incidents" as its goal instead of an SLO, and as a result shipped no new features for three months while competitors took its customers. It is the typical case of development resources being sucked into endless tasks such as "investigate every warning" and "resolve every latency degradation" until the business stalled. After introducing an SLO of 99.9% and switching to operations that tolerate incidents within the budget, release speed reportedly more than doubled; many such patterns are reported.
As the reverse pattern, a financial-system company had its SLA (customer contract) at 99.9% but its internal SLO also at 99.9%; an incident exceeded the SLA and resulted in large penalties for contract violation. It is told as the typical lesson that the SLO must be set stricter than the SLA, with margin.
Both have the same root cause, the absence of numerical agreement, and they drive home that an SLO is not a constraining number but a dial for engineering the balance between speed and reliability.
What to decide - what is your projectâs answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are still vague invites later questions like "why did we decide it this way again?"
- Critical path (SLO-target features)
- SLI (what to measure)
- SLO target (99.9%, 99.95%, etc.)
- Difference from the SLA (SLO stricter than the SLA)
- Error-budget operational rules (decision criteria per remaining budget)
- Burn-rate alerts (short / mid / long term)
- Review frequency (quarterly / semi-annually)
Summary
This article covered SLO and SLI, including SLI/SLO/SLA differences, typical SLIs, availability guidelines, error budgets, burn-rate alerts, per-service-type numerical gates, and AI-era 4-axis SLO.
Choose SLIs by user impact, design the SLO stricter than the SLA with margin, decide releases by error budget, and in the AI era guarantee quality with a 4-axis SLO. That is the practical answer for SLO/SLI design in 2026.
Next time we'll cover incident response (on-call, postmortems).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (64/89)