About this article
As the thirteenth installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains SRE practices.
SRE is engineering whose goal is to eliminate operational work: reducing manual effort is the main job. This article covers toil reduction, prioritization via SLOs, error-budget operations, chaos engineering, on-call rotation, and team topology (Embedded SRE / Platform SRE), showing how to put the operational engineering that originated at Google into practice.
Why SRE is needed
Cloud / microservices complicate operations
Operating modern systems with hundreds of services through legacy manual procedures is not feasible. The SRE approach of operating with code becomes a requirement.
Balancing dev speed and reliability
Legacy operations teams tended to favor "freeze changes to stay stable," which put them in conflict with development. SRE mediates between the two with a number: the error budget.
Reduce engineers' operational burden
Manual on-call and alert response exhaust engineers. SRE reduces that burden through automation so that engineers can focus on solving the underlying problems.
Main SRE practices
The eight main practices systematized in Google's SRE book. Combine them to build an operational culture.
flowchart TB
SRE([SRE implementation])
subgraph MEAS["Measure"]
SLO[SLO/error budget]
CAP[capacity planning]
end
subgraph REDUCE["Reduce"]
TOIL[Toil reduction<br/>automate repetitive work]
CHAOS[Chaos engineering]
end
subgraph LEARN["Learn"]
PM[Postmortem<br/>Blameless]
PRR[Production<br/>Readiness Review]
end
subgraph RESPOND["Prepare"]
IR[Incident Response]
ON[On-call design]
end
SRE --> MEAS
SRE --> REDUCE
SRE --> LEARN
SRE --> RESPOND
classDef sre fill:#fef3c7,stroke:#d97706,stroke-width:2px;
classDef m fill:#dbeafe,stroke:#2563eb;
classDef r fill:#dcfce7,stroke:#16a34a;
classDef l fill:#fae8ff,stroke:#a21caf;
classDef p fill:#f0f9ff,stroke:#0369a1;
class SRE sre;
class MEAS,SLO,CAP m;
class REDUCE,TOIL,CHAOS r;
class LEARN,PM,PRR l;
class RESPOND,IR,ON p;
| Practice | Content |
|---|---|
| SLO / error budget | Manage reliability numerically |
| Toil reduction | Automate repetitive work |
| Postmortem | Blameless review |
| On-call design | Rotation, load management |
| Capacity planning | Scale prediction and prep |
| Incident Response | Systematize incident handling |
| Chaos engineering | Intentionally break to learn |
| Production Readiness Review | Pre-production audit |
Toil reduction
Google calls repetitive, manual, automatable operational work Toil and sets the explicit goal of keeping it under 50% of SRE time. Beyond 50%, SREs have no time left for engineering work and the org stops gaining value from them.
| Toil example | Path to automation |
|---|---|
| Server restart | Auto-recovery |
| Log investigation | Observability foundation |
| Permission grants | Self-service |
| Migration | CI/CD |
| Alert response | Auto-execution of Runbooks |
Toil is not evil, but it is work that does not help you grow. Writing code to eliminate it is the SRE's main job, and the ideal is to invest 20% or more of each month in automation.
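As a rough illustration of how the 50% ceiling could be tracked, the sketch below computes the Toil share from time-tracking entries and ranks the tasks that consume the most Toil hours. The `WorkEntry` type, the `is_toil` tag, and the function names are assumptions made for this example, not part of any Google tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

TOIL_CEILING = 0.50  # the "under 50% of SRE time" guideline cited above


@dataclass
class WorkEntry:
    """One time-tracking entry. `is_toil` is a hypothetical tag set when logging time."""
    task: str
    hours: float
    is_toil: bool


def toil_ratio(entries):
    """Return the fraction of logged hours spent on Toil (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


def within_guideline(entries):
    """True while Toil stays strictly under the 50% ceiling."""
    return toil_ratio(entries) < TOIL_CEILING


def automation_candidates(entries, limit=3):
    """Sum Toil hours per task and return the largest first as (task, hours) pairs."""
    hours_by_task = defaultdict(float)
    for e in entries:
        if e.is_toil:
            hours_by_task[e.task] += e.hours
    ranked = sorted(hours_by_task.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Ranking by hours is only one possible prioritization; a team might equally weight by frequency or by how error-prone the manual step is.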
Error-budget operations
The core of SRE is disciplining operations with an error budget derived from the SLO. The switch is explicit: accelerate development while budget remains, and redirect effort to reliability once it is exhausted.
Error budget remaining 70% --> Ship new features aggressively
Error budget remaining 30% --> Cautious, careful
Error budget remaining 5% --> Freeze releases, stabilize
Error budget remaining 0% --> Stop new features, focus on quality
This mechanism dissolves the dev-versus-ops conflict. The decision rests on the objective fact "the budget is exhausted" rather than on someone's demand to "stop releases."
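The arithmetic behind those thresholds can be sketched as follows, assuming a request-based SLI (good events divided by total events). The cutoffs in `release_policy` are assumptions chosen so that the four example levels above (70%, 30%, 5%, 0%) map to the stances listed; real policies would agree on their own boundaries.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Return the unspent fraction of the error budget (negative once overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures


def release_policy(remaining):
    """Map remaining budget to a release stance. The cutoffs are assumptions."""
    if remaining > 0.30:
        return "ship new features aggressively"
    if remaining > 0.05:
        return "cautious, careful"
    if remaining > 0:
        return "freeze releases, stabilize"
    return "stop new features, focus on quality"
```

For example, a 99.9% SLO over 1,000,000 requests allows about 1,000 failures; 300 failures leave roughly 70% of the budget, which lands in the first stance.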
Chaos engineering
Chaos engineering is the technique of intentionally causing failures in production to verify a system's fault tolerance. It originated with Netflix's Chaos Monkey: randomly killing instances in production forces designs that tolerate failure at all times.
| Tool | Use case |
|---|---|
| Chaos Monkey | Instance stoppage |
| Gremlin | Commercial, comprehensive |
| Chaos Mesh | OSS for K8s |
| LitmusChaos | K8s, CNCF |
| AWS FIS | AWS-integrated service |
The premise is an architecture designed on the assumption that failures happen. Practicing breakage periodically builds an org that does not panic during real failures.
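To make the idea of random instance termination concrete, here is a small simulation. It is not Chaos Monkey's actual algorithm or API: the `Instance` type, the `chaos_opt_out` flag, and the caller-supplied `terminate` callback are all hypothetical, and the sketch defaults to a dry run.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    group: str
    chaos_opt_out: bool = False  # hypothetical flag for services not yet ready


def pick_victims(instances, rng, per_group=1):
    """Pick random eligible instances per group, always leaving one eligible survivor."""
    by_group = {}
    for inst in instances:
        if not inst.chaos_opt_out:
            by_group.setdefault(inst.group, []).append(inst)
    victims = []
    for group in sorted(by_group):
        candidates = by_group[group]
        if len(candidates) < 2:
            continue  # never remove a group's only eligible replica
        count = min(per_group, len(candidates) - 1)
        victims.extend(rng.sample(candidates, count))
    return victims


def run_experiment(instances, terminate, rng=None, dry_run=True):
    """Select victims and either report them (dry run) or call `terminate` on each."""
    if rng is None:
        rng = random.Random()
    victims = pick_victims(instances, rng)
    for victim in victims:
        if dry_run:
            print(f"[dry-run] would terminate {victim.instance_id} ({victim.group})")
        else:
            terminate(victim)
    return victims
```

Skipping groups with a single eligible replica is a conservative choice for a first drill; a mature setup would instead verify that the service survives losing any instance.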
SLO-culture penetration
Merely defining SLOs does not make them work. They have to settle in as organizational culture, and the following habits matter.
| Habit | Content |
|---|---|
| SLO review | Quarterly target review |
| Weekly SLI check | Trend monitoring of actuals |
| Budget-exhaustion response | Planned freeze / stabilize |
| Stakeholder agreement | Discussion with business |
The core of the cultural transformation is discarding the idea that 100% uptime is the ideal. An SLO symbolizes "don't aim for perfection, aim for sufficient."
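One way to see why 100% is not a useful target is to translate an availability SLO into the downtime it permits. The sketch below assumes a time-based availability SLO and a 30-day window; both are choices made for this example.

```python
from datetime import timedelta


def allowed_downtime(slo_target, window=timedelta(days=30)):
    """Return how much full downtime an availability SLO permits within `window`."""
    if not 0 <= slo_target <= 1:
        raise ValueError("slo_target must be between 0 and 1")
    return window * (1 - slo_target)


# Print the budget for a few common targets, including the "perfect" one.
for target in (0.99, 0.999, 0.9999, 1.0):
    print(f"{target:.2%} -> {allowed_downtime(target)}")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget, while a 100% target leaves none, which means any deploy that carries risk could never be justified.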
Capacity planning
Capacity planning prepares for future traffic growth. Falling behind leads directly to incidents and missed opportunities, so it is an important area that SRE works on continuously.
| Step | Content |
|---|---|
| Demand forecast | Business plan and traffic |
| Current state | Resource utilization, headroom |
| Bottleneck identification | The resource that saturates first |
| Procurement plan | Cloud reservations, contracts |
| Load tests | Verification at predicted values |
| Periodic review | Monthly, quarterly |
For services expecting rapid growth, the safer strategy is to hold more headroom.
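The trade-off between growth rate and margin can be estimated with simple compound-growth arithmetic. The sketch below assumes load grows by a fixed percentage each month and that a fraction of capacity (`headroom`, 30% by default here) is reserved as safety margin; both are simplifying assumptions, not a forecasting method.

```python
import math


def months_until_exhaustion(current_load, capacity, monthly_growth, headroom=0.30):
    """Return the first month in which load reaches usable capacity.

    Usable capacity is capacity * (1 - headroom). Returns 0 if the margin is
    already used up and None if load is not growing.
    """
    if current_load <= 0 or capacity <= 0:
        raise ValueError("current_load and capacity must be positive")
    usable = capacity * (1 - headroom)
    if current_load >= usable:
        return 0
    if monthly_growth <= 0:
        return None
    months = math.log(usable / current_load) / math.log(1 + monthly_growth)
    return math.ceil(months)
```

With 1,000 requests per second today, 4,000 of capacity, and 10% monthly growth, the 30% margin is reached in month 11; reserving 50% instead brings the procurement deadline forward to month 8, which is the point of holding more margin when growth is fast.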
Production Readiness Review
PRR is the mechanism for auditing new services before they go into production. The name comes from Google, where the SRE team evaluates the services that development teams build.
| Viewpoint | Content |
|---|---|
| Observability | Metrics / logs / traces ready |
| SLO definition | Targets and measurement methods |
| Capacity | Tolerance to expected load |
| Deploy strategy | Safe release procedures |
| Disaster countermeasures | Recovery procedures on failure |
| On-call regime | Responders, manuals |
The gate "services that fail PRR do not go to production" is what safeguards quality.
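A gate of this kind can be expressed as a checklist that fails closed. In the sketch below the check identifiers simply mirror the viewpoints in the table above; the names, the `PrrResult` type, and the shape of the `evidence` dictionary are assumptions for illustration, not Google's PRR format.

```python
from dataclasses import dataclass, field

# Check names mirror the viewpoints in the table above; the identifiers are illustrative.
PRR_CHECKS = (
    "observability",
    "slo_definition",
    "capacity",
    "deploy_strategy",
    "disaster_recovery",
    "on_call",
)


@dataclass
class PrrResult:
    service: str
    missing: list = field(default_factory=list)

    @property
    def passed(self):
        """A service passes only when no check is missing."""
        return not self.missing


def review(service, evidence):
    """Fail closed: a check counts only if `evidence` marks it True explicitly."""
    missing = [check for check in PRR_CHECKS if evidence.get(check) is not True]
    return PrrResult(service, missing)
```

Wiring `review(...).passed` into a deployment pipeline would turn the review from a meeting into an enforced gate, though in practice each check needs human judgment behind the boolean.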
Relationship with DevOps
SRE is positioned as the concrete implementation of DevOps. If DevOps is philosophy / culture, SRE is the practical pattern.
| | DevOps | SRE |
|---|---|---|
| Position | Philosophy / culture | Concrete practice |
| Origin | Around 2009, Patrick Debois | 2003, Google |
| Focus | Integration of dev and ops | Solve operations with engineering |
| Metric | DORA 4 metrics | SLO / error budget |
| Role | No clear role | SRE engineer |
It is also said that "DevOps is the ideal, SRE is the implementation pattern."
Platform Engineering
Platform Engineering extends SRE thinking into a dedicated team that improves the in-house developer experience. The team provides an Internal Developer Platform (IDP): an environment in which dev teams can deploy and operate autonomously and safely.
| Provided | Content |
|---|---|
| Self-service portal | Deploy, env creation |
| Golden Path | Standard tech stack |
| Automation tools | CI/CD, IaC |
| Monitoring foundation | Common observability |
Backstage (open-sourced by Spotify) is the representative framework for building an IDP portal and has been adopted by many companies.
Decision criterion 1: org scale
The level of SRE adoption depends on org scale. A company-wide SRE team suits mid-size enterprises and larger; startups realistically start with engineers who take on SRE alongside their other duties.
| Scale | Recommended |
|---|---|
| Startup | Developers double as SREs |
| Mid-size | SRE team (2-5 people) |
| Large enterprise | SRE per org |
| Global | Central SRE + per-business SRE |
Decision criterion 2: org maturity
Introducing SRE in phases is realistic. Starting everything at once causes chaos.
| Phase | Content |
|---|---|
| Phase 1 | Metric / log foundation |
| Phase 2 | SLI measurement, SLO trial operation |
| Phase 3 | Error-budget operation |
| Phase 4 | Toil-automation investment |
| Phase 5 | Chaos engineering |
How to choose by case
Personal dev / small team
Developers double as SREs, Phases 1-2 only. Setting up a metric foundation (CloudWatch etc.) and measuring SLIs is enough. Leave SLO operation, error budgets, and PRR until the org matures. For Toil, finding "manual work that takes over 2 hours a month" and automating it is enough.
Startup / growth-stage SaaS
1-2 engineers doubling as SREs + SLO trial operation + a Toil-reduction culture. Push on to Phase 3: release decisions by error budget and Git-managed Runbooks. Building a self-service portal with Backstage raises dev speed.
Mid-size enterprise / microservices ops
A dedicated SRE team of 3-10 + PRR + chaos drills. Implement everything through Phase 5, run monthly chaos drills with LitmusChaos / AWS FIS, and build an IDP with a Platform Engineering team. Include a Toil-reduction KPI in org goals.
Large enterprise / regulated industries
A two-tier structure of central SRE + business-unit SRE, plus AIOps. The central team provides company-wide standards (Golden Path) and business units run their own SLO operations. Automate routine response with Datadog Bits AI / Resolve AI so that humans focus on strategy and improvement design.
Common misconceptions
SRE means infrastructure operators
Not so. SREs are specialists who solve operational problems with software engineering. An SRE who does not write code delivers little value.
Set an SLO and you're doing SRE
An SLO is one means. SRE is the sum of culture, process, and automation.
Toil should be zero
Realistically impossible. Under 50% is Google's guideline. Some manual work for new features is necessary.
SRE is the new name for the ops team
Changing the sign on the door does nothing. What is needed is transformation starting from authority, culture, and hiring criteria. Many cautionary tales are told, mostly about Japanese companies, in which an existing ops team was simply given the name "SRE": manual midnight on-call continues, there is no time or authority to write code, and the Toil rate is still 95% a year later. How time is spent, not the sign, defines SRE.
Phased SRE-maturity roadmap
SRE is a cultural transformation that does not happen overnight. A phased introduction in line with the Google SRE Workbook is realistic.
| Phase | Period | Implementation | Required SREs |
|---|---|---|---|
| Phase 1: Measurement foundation | ~6 months | Prometheus / Datadog adoption, metric / log setup | 0-1 (concurrent) |
| Phase 2: SLO trial operation | ~1 year | SLI selection, measurement, tentative target | 1-2 |
| Phase 3: Error-budget operation | ~1.5 years | Release-freeze rule on exhaustion, quarterly review | 2-5 |
| Phase 4: Toil-reduction culture | ~2 years | Toil under 50% target, 20% automation investment, Runbook as Code | 3-10 |
| Phase 5: Chaos / IDP | ~3 years | Monthly chaos engineering, build IDP like Backstage | 5+ |
| Phase 6: AIOps | ~5 years | Auto-first-response with PagerDuty AIOps / Resolve AI | Prompt + Systems Engineer |
The Toil target is under 50% of SRE time, which is Google's official guideline. Over 50%, SREs cannot do engineering work and the org gains no value from them. Allocating 20% of each month to automation is the rule of thumb for keeping Toil at a sustainable level.
SRE is defined by how time is spent, not by the sign. A team with Toil over 50% is SRE in name only.
SRE-operation pitfalls and forbidden moves
Typical failure patterns in SRE introduction. All share the same structure: the sign changes but the substance does not.
| Forbidden move | Why it's bad |
|---|---|
| Just change the existing ops team's business cards to SRE | Toil rate still 95% after a year; an SRE who does not write code adds no value |
| Don't give SRE authority / time to write code | Stays an ops worker, generates no essential value |
| Set SLO and abandon | No one looks, quarterly review required |
| Try to make Toil zero | Realistically impossible, under 50% is Google's guideline |
| Start chaos engineering without production experience | Major accident on first try, practice in staging first |
| Put new services into production without PRR | Services with unmaintained monitoring / SLO / Runbook reach production |
| Continue releases even on error-budget exhaustion | Reliability collapses, error-budget operations become meaningless |
| Don't measure on-call load | 5+ late-night calls monthly exhausts SREs / leads to resignation |
| Don't create an independent SRE team | Becomes a subcontractor to dev teams |
| Defer investment in Platform Engineering / IDP | Each team operates independently, no standardization, inefficient |
The sign-only SRE problem at Japanese companies (an existing ops team is simply renamed "SRE" while manual midnight on-call continues, there is no time or authority to write code, and the Toil rate is still 95% after a year) drives home the lesson that how time is spent, not the sign, defines SRE. Netflix has been randomly killing production instances with Chaos Monkey since the 2010s and has kept its service running calmly through partial AWS outages. That gap is the difference between "real SRE" and "SRE in name only."
SRE is engineering that sets out to kill Toil. If you are still running things by hand, that is evidence you have not yet scaled.
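The on-call row in the table above implies that late-night pages have to be counted before they can be managed. The sketch below tallies them per responder and month and flags anyone at or above the "5+ late-night calls monthly" line. The 22:00-06:00 definition of late night and the `(responder, timestamp)` input shape are assumptions for this example.

```python
from collections import Counter
from datetime import datetime

LATE_NIGHT_START = 22  # assumption: late night runs from 22:00 ...
LATE_NIGHT_END = 6     # ... until 06:00 local time
MONTHLY_LIMIT = 5      # the "5+ late-night calls monthly" line from the table


def is_late_night(ts):
    """True if the page arrived inside the assumed late-night window."""
    return ts.hour >= LATE_NIGHT_START or ts.hour < LATE_NIGHT_END


def overloaded_responders(pages, limit=MONTHLY_LIMIT):
    """Count late-night pages per (responder, "YYYY-MM") and keep those at or over `limit`.

    `pages` is an iterable of (responder, datetime) pairs.
    """
    counts = Counter()
    for responder, ts in pages:
        if is_late_night(ts):
            counts[(responder, ts.strftime("%Y-%m"))] += 1
    return {key: n for key, n in counts.items() if n >= limit}
```

A report like this could feed a rotation review, for example by rebalancing shifts or prioritizing automation for the alerts that wake people most often.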
AI-era perspective
When AI-driven development (vibe coding) and AI usage are the premise, SRE evolves into operations that collaborate with AI agents. The ultimate form of Toil reduction is AI making operational decisions autonomously, and products such as Datadog Bits AI, PagerDuty AIOps, and Resolve AI are already on the market.
| Favored in the AI era | Disfavored in the AI era |
|---|---|
| Runbooks as code | Procedures in people's heads |
| Structured logs / metrics | Free-text logs |
| Automated Toil | Manual work remaining |
| Quantitative reliability management | Operations by gut feeling |
The future image of the SRE is a human managing AI agents. AI handles first response while humans concentrate on designing what agents learn and the criteria they judge by. The AI-era SRE is becoming a fusion of Prompt Engineer and Systems Engineer.
AI-era SRE entrusts operational automation to AI and focuses on strategic design.
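The "structured logs" row in the table above can be illustrated with Python's standard `logging` module. The sketch emits one JSON object per log record so that a machine reader, whether an AIOps product or a plain query tool, does not have to parse free text. Passing fields through a `context` key in `extra` is a convention of this sketch, not a built-in feature of `logging`, and the field names are illustrative.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with a flat set of fields."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is this sketch's convention, supplied via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    """Return a logger whose only handler writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("checkout").warning("latency SLI breached", extra={"context": {"service": "checkout", "p99_ms": 840}})` would then produce a line whose `service` and `p99_ms` fields can be filtered on directly.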
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these are still vague always invites questions later, such as "why did we decide this again?"
- SRE-org placement (centralized / distributed)
- SLO management process (setup, review frequency)
- Error-budget operational rules (freeze criteria)
- Toil-reduction target (under 50%, automation investment rate)
- Chaos engineering (frequency, scope)
- PRR process (new-service introduction audit)
- AI tool adoption (AIOps, auto-diagnosis)
Author's note - cases that show the gap between the "SRE sign" and "real SRE"
Whether SRE is a surface-level sign or a real cultural transformation largely decides an org's fate.
After Google launched its SRE team in 2003 and disclosed the practice in the 2016 book "Site Reliability Engineering," companies worldwide followed. But, mostly at Japanese companies, case after case emerged in which an existing ops team's business cards were changed to "SRE" while the reality remained manual night on-call and phone response. With no time or authority to write code, the Toil rate is still 95% a year later and SLOs are defined only on paper with no one watching. Such cautionary tales from the field are still told today.
In contrast, Netflix is known for thoroughly applying SRE thinking. Netflix has been randomly killing production instances with Chaos Monkey since the 2010s, and designing on the premise that things break in production has become standard there. As a result, Netflix has been observed on multiple occasions to keep its service running calmly through partial AWS outages. What this makes visible is that "SRE is not a sign but a matter of how time is spent and of culture."
Calling yourself SRE does not make you SRE. Whether you have the time to attack Toil with code is what decides the essence of SRE.
How to make the final call
The core of SRE is the idea of solving operations with engineering. Legacy manual operations do not scale in modern systems with hundreds of microservices. Google's SRE is the totality of practices that turn operations into code and numbers (SLO, error budget, Toil reduction, postmortem, chaos engineering), and it works through the combination of culture, process, and automation rather than through individual tools. The misconception that "SRE is just a new sign for the ops team" is the most common antipattern; real adoption requires transformation starting from authority, culture, and hiring criteria.
Another decisive axis is the evolution toward SREs who manage AI agents. In an era when Datadog Bits AI / PagerDuty AIOps / Resolve AI handle first response, presentation of root-cause candidates, and automatic Runbook execution, human SREs concentrate on designing what agents learn and how they judge, on strategic reliability investment, and on cultivating organizational culture. The fusion of Prompt Engineer and Systems Engineer is the image of the AI-era SRE.
Selection priorities
- Introduce in phases - go in order from Phase 1 (measurement foundation) to Phase 5 (chaos); don't do everything at once
- Keep Toil under 50% - above that SREs can't do engineering work; invest 20% of each month in automation
- Mediate the dev/reliability balance via error budget - decide objectively whether to accelerate or freeze releases based on what remains
- Delegate routine response to AIOps - humans focus on strategy / improvement design / agent management
"Solve operations with code, entrust to AI, focus on strategy." This is the modern SRE image.
Summary
This article covered SRE practices, including main practices, Toil reduction, error-budget operations, chaos engineering, PRR, Platform Engineering, and AIOps collaboration.
Introduce in phases, keep Toil under 50%, mediate the balance via error budget, and delegate routine work to AIOps. That is the practical answer for SRE practices in 2026.
Next time we'll cover documentation (README, ADR, Runbook).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (66/89)