[DevOps Architecture] SRE Practices

About this article

As the thirteenth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains SRE practices.

SRE is engineering that works to eliminate operational work: reducing manual effort is its main job. This article covers toil reduction, prioritization via SLO, error-budget operations, chaos engineering, on-call rotation, and team topology (Embedded SRE / Platform SRE), showing how to put the operational engineering that originated at Google into day-to-day practice.

What are SRE practices in the first place


Picture factory production management. A well-run factory doesn’t have workers manually checking quality every time — it has automated inspection, early defect detection, line-halt criteria, and improvement cycles built into the system. Maintaining quality through systems rather than individual skill is the essence of production management.

SRE practices are production management for system operations. They are the concrete methods Google systematized — toil reduction, prioritization via SLO, error-budget operations, chaos engineering — maintaining and improving operational quality without depending on specific individuals.

Without SRE practices, operations become an endless repetition of manual work. Engineers get consumed by alert response and can’t invest time in essential improvements.

Why SRE is needed

Cloud / microservices complicate operations

Modern systems with hundreds of services cannot realistically be run with legacy operations in which people follow written procedures by hand. The SRE approach of operating through code becomes a necessity.

Balancing dev speed and reliability

Legacy operations teams tended to favor “stop changes for the sake of stability,” which put them in conflict with development. SRE mediates between the two with a single number: the error budget.

Reduce engineers’ operational burden

Manual on-call and alert response exhaust engineers. SRE reduces that burden through automation so engineers can focus on essential problem-solving.

Main SRE practices

The table below lists eight main practices described in Google’s SRE book. Combining them builds an operational culture.

Practice	Content
SLO / error budget	Manage reliability numerically
Toil reduction	Automate repetitive work
Postmortem	Blameless review
On-call design	Rotation, load management
Capacity planning	Scale prediction and prep
Incident Response	Systematize incident handling
Chaos engineering	Intentionally break to learn
Production Readiness Review	Pre-production audit

Toil reduction

At Google, repetitive manual work and operational work that could be automated is called Toil, and the explicit goal is to keep it under 50% of an SRE’s time. Beyond 50%, SREs have no time left for development work and the org can’t generate value.

Toil example	Path to automation
Server restart	Auto-recovery
Log investigation	Observability foundation
Permission grants	Self-service-ization
Migration	CI/CD
Alert response	Auto-execution of Runbooks

Toil isn’t evil, but it is work that doesn’t help you grow. Writing code to eliminate it is the SRE’s main job, and the ideal is to invest 20% or more of each month in automation.
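As a rough illustration of the 50% line, the following sketch computes the Toil share from a simple work log. The `WorkEntry` schema and the sample entries are assumptions made for this example, not something the article or Google prescribes.

```python
from dataclasses import dataclass

TOIL_CEILING = 0.50  # the "under 50% of SRE time" line cited in the text


@dataclass
class WorkEntry:
    """One logged block of work (hypothetical schema for this sketch)."""
    description: str
    hours: float
    is_toil: bool


def toil_ratio(entries: list[WorkEntry]) -> float:
    """Return the fraction of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def over_toil_ceiling(entries: list[WorkEntry]) -> bool:
    """True when toil is not strictly under the 50% ceiling."""
    return toil_ratio(entries) >= TOIL_CEILING


# Illustrative data only.
week = [
    WorkEntry("restart stuck workers", 6.0, True),
    WorkEntry("grant access requests", 4.0, True),
    WorkEntry("write auto-recovery job", 10.0, False),
]
print(f"toil share: {toil_ratio(week):.0%}, over ceiling: {over_toil_ceiling(week)}")
```

Whatever the tooling, the point is to track the ratio over time from data the team already records, such as ticket labels or time sheets.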

Error-budget operations

The core of SRE is disciplining operations with an error budget derived from the SLO. The switch is explicit: accelerate development while error budget remains, and redirect effort to reliability once it is exhausted.

Error budget remaining 70% --> Ship new features aggressively
Error budget remaining 30% --> Release cautiously
Error budget remaining 5%  --> Freeze releases, stabilize
Error budget remaining 0%  --> Stop new features, focus on quality

This mechanism dissolves the dev-vs-ops conflict. The decision rests on the objective fact that the budget is exhausted, rather than on an operations team’s demand to stop releases.
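The switching rule above can be sketched in a few lines. The function names are hypothetical, and the band edges between the listed points (30%, 5%, 0%) are an assumption of this example, since the text only gives sample values.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the measurement window.

    slo_target is the success-rate objective, e.g. 0.999.
    Returns 1.0 when nothing is spent and 0.0 or below once the budget is exhausted.
    """
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0 if failed else 1.0  # a 100% target leaves no budget at all
    return 1.0 - failed / allowed_failures


def release_policy(remaining: float) -> str:
    """Map remaining budget to a release stance (band edges are assumptions)."""
    if remaining <= 0.0:
        return "stop new features, focus on quality"
    if remaining <= 0.05:
        return "freeze releases, stabilize"
    if remaining <= 0.30:
        return "release cautiously"
    return "ship new features aggressively"


# Illustrative numbers: a 99.9% SLO over one million requests allows about 1,000 failures.
remaining = error_budget_remaining(slo_target=0.999, total=1_000_000, failed=800)
print(f"{remaining:.0%} of budget left -> {release_policy(remaining)}")
```

In practice the thresholds belong in an agreed error-budget policy so that the stance is decided before the budget runs low, not negotiated during an incident.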

Chaos engineering

Chaos engineering is the technique of intentionally causing failures in production to verify a system’s fault tolerance. It originated with Netflix’s Chaos Monkey: randomly killing instances in production forces designs that tolerate failure at all times.

Tool	Use case
Chaos Monkey	Instance stoppage
Gremlin	Commercial, comprehensive
Chaos Mesh	OSS for K8s
LitmusChaos	K8s, CNCF
AWS FIS	AWS-integrated service

The premise is an architecture designed on the assumption that failures will happen. Practicing breakage periodically builds an org that doesn’t panic during real failures.
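The idea can be shown with a toy in-memory model: remove one randomly chosen instance from a pool and check that enough healthy replicas remain. This is only a simulation written for this article. It does not call Chaos Monkey or any of the tools in the table, and the instance names are made up.

```python
import random


def kill_random_instance(instances: set[str], rng: random.Random) -> str:
    """Remove one randomly chosen instance from the pool and return its name."""
    if not instances:
        raise ValueError("instance pool is empty")
    # sorted() gives a stable order, so a seeded rng picks reproducibly
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def survives_single_failure(instances: set[str], min_healthy: int,
                            rng: random.Random) -> bool:
    """Check that losing one randomly chosen instance still leaves min_healthy replicas."""
    pool = set(instances)  # work on a copy so the caller's pool is untouched
    kill_random_instance(pool, rng)
    return len(pool) >= min_healthy


rng = random.Random(42)  # seeded so the drill is repeatable
print(survives_single_failure({"web-1", "web-2", "web-3"}, min_healthy=2, rng=rng))
print(survives_single_failure({"web-1", "web-2"}, min_healthy=2, rng=rng))
```

A real drill replaces the set with actual instances and the length check with a health probe or an SLI measurement taken while the failure is in effect.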

SLO-culture penetration

Merely defining SLOs does not make them work. They have to settle in as organizational culture, and the following habits matter.

Habit	Content
SLO review	Quarterly target review
Weekly SLI check	Trend monitoring of actuals
Budget-exhaustion response	Planned freeze / stabilize
Stakeholder agreement	Discussion with business

The core of the cultural change is discarding the idea that 100% uptime is the ideal. An SLO symbolizes aiming for sufficient rather than perfect.
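To make “sufficient, not perfect” concrete, it helps to convert an availability target into the downtime it permits. The sketch below assumes a 30-day window and a time-based availability SLO. Both are choices made for this example.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits within the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.1f} minutes per 30 days")
```

A 99.9% target over 30 days leaves about 43 minutes of downtime, while 100% leaves none, which is why a perfect target gives the team no room to release anything.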

Capacity planning

Capacity planning prepares for future traffic growth. Falling behind leads directly to incidents and missed opportunities, so it is an important area that SRE works on continuously.

Step	Content
Demand forecast	Business plan and traffic
Current grasp	Resource utilization, margin
Bottleneck identification	The first to clog
Procurement plan	Cloud reservations, contracts
Load tests	Verification at predicted values
Periodic review	Monthly, quarterly

For services expecting rapid growth, the safer strategy is to hold extra margin.
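As a minimal sketch of the forecast and margin steps, the function below estimates how many months remain before load outgrows usable capacity. It assumes compound monthly growth and a fixed headroom fraction. The default of 20% headroom and the sample figures are assumptions of this example.

```python
from typing import Optional


def months_until_exhausted(current_load: float, capacity: float,
                           monthly_growth: float,
                           headroom: float = 0.2) -> Optional[int]:
    """Whole months until load first exceeds usable capacity.

    Usable capacity is capacity minus the headroom fraction kept as margin.
    Returns 0 if the margin is already consumed, None if load never grows.
    """
    if current_load <= 0:
        raise ValueError("current_load must be positive")
    usable = capacity * (1.0 - headroom)
    if current_load > usable:
        return 0
    if monthly_growth <= 0:
        return None
    months = 0
    load = current_load
    while load <= usable:
        load *= 1.0 + monthly_growth  # compound growth, one month per step
        months += 1
    return months


# Illustrative figures: 500 req/s today, 1000 req/s capacity, 10% growth per month.
print(months_until_exhausted(500, 1000, 0.10))
```

Raising `headroom` shortens the answer, which is the trade-off behind holding extra margin: procurement has to start earlier, but a forecast miss is less likely to become an incident.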

Production Readiness Review

A Production Readiness Review is an audit performed before a new service goes into production. Google calls it PRR, and there the SRE team evaluates services built by dev teams.

Viewpoint	Content
Observability	Metrics / logs / traces ready
SLO definition	Targets and measurement methods
Capacity	Tolerance to expected load
Deploy strategy	Safe release procedures
Disaster countermeasures	Recovery procedures on failure
On-call regime	Responders, manuals

The gate of “services that fail PRR do not go into production” protects quality.
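The gate can be expressed as a checklist that passes only when every viewpoint in the table is satisfied. The field names below mirror the table rows but are naming choices made for this sketch, not an official PRR schema.

```python
from dataclasses import dataclass, fields


@dataclass
class PrrChecklist:
    """One boolean per PRR viewpoint (hypothetical field names)."""
    observability: bool
    slo_defined: bool
    capacity_verified: bool
    deploy_strategy: bool
    disaster_recovery: bool
    on_call_regime: bool


def failed_items(checklist: PrrChecklist) -> list[str]:
    """Names of the viewpoints that are not yet satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def passes_prr(checklist: PrrChecklist) -> bool:
    """A service passes only when no viewpoint is missing."""
    return not failed_items(checklist)


review = PrrChecklist(
    observability=True,
    slo_defined=True,
    capacity_verified=True,
    deploy_strategy=True,
    disaster_recovery=False,
    on_call_regime=True,
)
print(passes_prr(review), failed_items(review))
```

Returning the list of failed items rather than a bare yes or no gives the dev team a concrete to-do list for the next review.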

Relationship with DevOps

SRE is positioned as the concrete implementation of DevOps. If DevOps is philosophy / culture, SRE is the practical pattern.

	DevOps	SRE
Position	Philosophy / culture	Concrete practice
Origin	Around 2009, Patrick Debois	2003, Google
Focus	Integration of dev and ops	Solve operations with engineering
Metric	DORA 4 metrics	SLO / error budget
Role	No clear role	SRE engineer

It’s also said that “DevOps is the ideal, SRE is the implementation pattern.”

Platform Engineering

Platform Engineering extends SRE thinking: a dedicated team works on improving the in-house developer experience. It provides an Internal Developer Platform (IDP), an environment in which dev teams can deploy and operate autonomously and safely.

Provided	Content
Self-service portal	Deploy, env creation
Golden Path	Standard tech stack
Automation tools	CI/CD, IaC
Monitoring foundation	Common observability

Backstage (open source, originally developed by Spotify) is a representative framework for building an IDP portal and has been adopted by many companies.

Decision criterion 1: org scale

How far to take SRE adoption depends on org scale. A company-wide SRE team suits mid-size enterprises and up; startups realistically start with developers who also cover the SRE role.

Scale	Recommended
Startup	Devs double as SRE
Mid-size	SRE team (2-5 people)
Large enterprise	SRE per org
Global	Central SRE + per-business SRE

Decision criterion 2: org maturity

Introducing SRE in phases is the realistic approach. Starting everything at once causes chaos.

Phase	Content
Phase 1	Metric / log foundation
Phase 2	SLI measurement, SLO trial operation
Phase 3	Error-budget operation
Phase 4	Toil-automation investment
Phase 5	Chaos engineering

How to choose by case

Personal dev / small team

Devs double as SRE, Phase 1-2 only. Setting up a metric foundation (CloudWatch etc.) and measuring SLIs is enough. Leave SLO operation, error budgets, and PRR until the org matures. For Toil, it is enough to find manual work that takes over 2 hours a month and automate it.

Startup / growth-stage SaaS

One or two engineers who double as SRE, plus trial SLO operation and a Toil-reduction culture. Push on to Phase 3, make release decisions by error budget, and keep Runbooks in Git. Building a self-service portal with Backstage raises dev speed.

Mid-size enterprise / microservices ops

A dedicated SRE team of 3-10, plus PRR and chaos drills. Implement everything through Phase 5, run monthly chaos drills with LitmusChaos / AWS FIS, and build an IDP with a Platform Engineering team. Include a Toil-reduction KPI in org goals.

Large enterprise / regulated industries

A two-tier structure of central SRE and business-unit SRE, plus AIOps. The central team provides company-wide standards (Golden Path), and business units run their own SLO operations. Automate routine response with tools such as Datadog Bits AI / Resolve AI so that humans can focus on strategy and improvement design.

Phased SRE-maturity roadmap

SRE is a cultural transformation that does not happen overnight. A phased introduction in line with the thinking of Google’s SRE Workbook is realistic.

Phase	Period	Implementation	Required SREs
Phase 1: Measurement foundation	~6 months	Prometheus / Datadog adoption, metric / log setup	0-1 (concurrent)
Phase 2: SLO trial operation	~1 year	SLI selection, measurement, tentative target	1-2
Phase 3: Error-budget operation	~1.5 years	Release-freeze rule on exhaustion, quarterly review	2-5
Phase 4: Toil-reduction culture	~2 years	Toil under 50% target, 20% automation investment, Runbook as Code	3-10
Phase 5: Chaos / IDP	~3 years	Monthly chaos engineering, build IDP like Backstage	5+
Phase 6: AIOps	~5 years	Auto-first-response with PagerDuty AIOps / Resolve AI	Prompt + Systems Engineer

The Toil target is under 50% of SRE time, which is the guideline Google publishes. Over 50%, SREs can’t do development work and the org can’t generate value. Allocating 20% of each month to automation is a rule of thumb for keeping Toil at a sustainable level.

SRE is defined not by the sign on the door but by how time is spent. A team whose Toil exceeds 50% is SRE in name only.

SRE-operation pitfalls and forbidden moves

These are typical failure patterns in SRE adoption. All share the same structure: the label changes but the substance does not.

Forbidden move	Why it’s bad
Just change existing ops team’s business cards to SRE	Toil rate 95% even after a year, SRE not writing code is zero value
Don’t give SRE authority / time to write code	Stays as ops worker, generates no essential value
Set SLO and abandon	No one looks, quarterly review required
Try to make Toil zero	Realistically impossible, under 50% is Google’s guideline
Start chaos engineering without production experience	Major accident on first try, practice in staging first
Put new services into production without PRR	Services with unmaintained monitoring / SLO / Runbook reach production
Continue releases even on error-budget exhaustion	Reliability collapses, error-budget operations meaningless
Don’t measure on-call load	5+ late-night calls monthly exhausts SREs / leads to resignation
Don’t create independent SRE team	Falls to subcontracting under dev teams
Defer investment in Platform Engineering / IDP	Each team operates independently, no standardization, inefficient
Conflating the two: “SRE means infrastructure operators”	SRE is the methodology of solving operations with software engineering. SREs not writing code have low value
Pursuing perfection: “Toil should be zero”	Realistically impossible. Under 50% is Google’s guideline. Some manual work for new features is necessary
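Measuring on-call load, one of the items in the table above, can start from nothing more than the paging log. The sketch below counts late-night pages per responder per month and flags anyone at five or more. Treating 22:00 to 06:00 as late night is an assumption of this example, as are the names and timestamps.

```python
from collections import Counter
from datetime import datetime

NIGHT_START_HOUR = 22          # assumed start of "late night"
NIGHT_END_HOUR = 6             # assumed end of "late night"
MONTHLY_NIGHT_PAGE_LIMIT = 5   # the "5+ late-night calls monthly" line in the table


def is_late_night(ts: datetime) -> bool:
    """True for timestamps between 22:00 and 05:59."""
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def overloaded_responders(
    pages: list[tuple[str, datetime]],
) -> dict[tuple[str, str], int]:
    """Return {(responder, 'YYYY-MM'): count} for months at or over the limit."""
    counts: Counter = Counter()
    for responder, ts in pages:
        if is_late_night(ts):
            counts[(responder, ts.strftime("%Y-%m"))] += 1
    return {key: n for key, n in counts.items() if n >= MONTHLY_NIGHT_PAGE_LIMIT}


# Illustrative paging log.
pages = [("alice", datetime(2026, 1, day, 2, 0)) for day in range(1, 6)]
pages += [("bob", datetime(2026, 1, day, 23, 30)) for day in range(1, 5)]
pages += [("bob", datetime(2026, 1, day, 14, 0)) for day in range(1, 4)]
print(overloaded_responders(pages))
```

Reviewing this count alongside the rotation schedule shows whether the load is a staffing problem or an alert-quality problem.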

The SRE-in-name-only problem reported at Japanese companies (attaching the name “SRE” to an existing ops team while midnight manual on-call continues, with no time or authority to write code and a Toil rate of 95% a year later) drives home the lesson that how time is spent, not the sign, defines SRE. Netflix has been randomly killing production instances with Chaos Monkey since the early 2010s and has kept serving customers through partial AWS outages. That gap is the difference between “real SRE” and “name only.”

SRE is engineering that sets out to kill Toil. If you are still running things by hand, that is evidence you have not yet scaled.

AI decision axes

AI-era favorable	AI-era unfavorable
Code-ized Runbooks	Procedures in people’s heads
Structured logs / metrics	Unstructured text logs
Automated Toil	Manual work remaining
Quantitative reliability management	Operations by gut feeling

Introduce in phases: go in order from Phase 1 (measurement foundation) to Phase 5 (chaos), not all at once
Keep Toil under 50%: above that SREs can’t develop, so invest 20% monthly in automation
Mediate the dev/reliability balance via error budget: decide objectively whether to accelerate or freeze releases based on what remains
Delegate routine response to AIOps: humans focus on strategy, improvement design, and agent management

AIOps is being adopted in 3 stages in practice

As of 2026, practical AIOps adoption is progressing in the following 3 stages:

Stage 1: Anomaly-detection automation (AI detects metric baseline deviations → alerts humans)
Stage 2: Root-cause estimation (AI cross-analyzes logs, traces, and metrics for RCA → notifies via Slack)
Stage 3: Auto-recovery (AI executes recovery operations per Runbook → humans confirm after the fact)

Most organizations are at Stage 1-2. Advancing to Stage 3 requires codified, trustworthy Runbooks and a design that strictly limits the AI’s operational permissions via IAM.
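The permission-control idea behind Stage 3 can be illustrated with an application-level allowlist: only explicitly approved Runbook actions run automatically, and everything else waits for a human. This is a sketch of the least-privilege principle, not an IAM policy. The action names and targets are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of actions an AI agent may run without prior approval.
AUTO_ALLOWED = frozenset({"restart_service", "scale_out"})


@dataclass(frozen=True)
class RunbookStep:
    """One codified Runbook step (hypothetical schema)."""
    action: str
    target: str


def plan_execution(
    steps: list[RunbookStep],
) -> tuple[list[RunbookStep], list[RunbookStep]]:
    """Split steps into (auto-executable, needs human approval), keeping order."""
    auto: list[RunbookStep] = []
    manual: list[RunbookStep] = []
    for step in steps:
        if step.action in AUTO_ALLOWED:
            auto.append(step)
        else:
            manual.append(step)  # anything not allowlisted is denied by default
    return auto, manual


steps = [
    RunbookStep("restart_service", "checkout-api"),
    RunbookStep("failover_database", "orders-db"),
]
auto, manual = plan_execution(steps)
print("auto:", [s.action for s in auto])
print("needs approval:", [s.action for s in manual])
```

The same deny-by-default rule should also be enforced in the cloud provider’s IAM, so that the agent’s credentials cannot perform an action even if the application-level check is bypassed.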

Toil-automation ROI has changed with AI

Previously, Toil automation was judged by weighing the development cost of automation scripts against the repeated cost of manual work. In configurations where AI reads Runbooks and executes them, there is no need to write automation scripts from scratch, and cases are increasing where just maintaining Markdown procedure documents lets AI execute them. In such setups the ROI of automation investment can improve substantially.
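The trade-off described above reduces to a break-even calculation: hours invested up front divided by hours saved each month. The function and all figures below are illustrative assumptions, not measurements.

```python
from typing import Optional


def break_even_months(build_hours: float, manual_hours_per_month: float,
                      upkeep_hours_per_month: float = 0.0) -> Optional[float]:
    """Months until the up-front effort is repaid by the monthly hours saved.

    Returns None when upkeep costs as much as (or more than) the manual work,
    because the investment then never pays back.
    """
    saved_per_month = manual_hours_per_month - upkeep_hours_per_month
    if saved_per_month <= 0:
        return None
    return build_hours / saved_per_month


# Hypothetical comparison for a task that takes 4 hours of manual work a month.
script = break_even_months(build_hours=40, manual_hours_per_month=4,
                           upkeep_hours_per_month=0.5)
runbook = break_even_months(build_hours=4, manual_hours_per_month=4,
                            upkeep_hours_per_month=1.0)
print(f"custom script: {script:.1f} months, AI-executed Runbook: {runbook:.1f} months")
```

Under these assumed numbers the Runbook route pays back far sooner, mainly because the up-front cost shrinks. The upkeep term is where the cost of reviewing the AI’s actions should be counted.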

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague tends to invite later questions like “why did we decide this again?”

SRE-org placement (centralized / distributed)
SLO management process (setup, review frequency)
Error-budget operational rules (freeze criteria)
Toil-reduction target (under 50%, automation investment rate)
Chaos engineering (frequency, scope)
PRR process (new-service introduction audit)
AI tool adoption (AIOps, auto-diagnosis)

Author’s note - cases visualizing the gap between “SRE sign” and “real SRE”

Whether SRE is a surface label or a real cultural transformation makes a large difference to the org’s fate.

After Google launched its SRE team in 2003 and disclosed the practice in the 2016 book “Site Reliability Engineering,” companies worldwide followed. But, particularly among Japanese companies, one case after another appeared of existing ops teams simply having their business cards changed to “SRE” while the reality remained manual night on-call and phone response. No time or authority to write code, a Toil rate of 95% after a year, SLOs defined only as a formality with no one watching: stories from the field that are no laughing matter are still told today.

In contrast, Netflix is known for applying SRE thinking thoroughly. It has been randomly killing production instances with Chaos Monkey since the early 2010s, and designing on the premise that things break in production has become its standard. As a result, Netflix has been reported on several occasions to keep serving through partial AWS outages. What this shows is that SRE is not a matter of signs but of culture and of how time is spent.

Calling yourself SRE does not make you SRE. What decides the essence of SRE is whether you have time to attack Toil with code.

https://en.senkohome.com/arch-intro-devops-sre/ https://en.senkohome.com/arch-intro-devops-overview/ https://en.senkohome.com/arch-intro-index-devops/

Summary

This article covered SRE practices, including main practices, Toil reduction, error-budget operations, chaos engineering, PRR, Platform Engineering, and AIOps collaboration.

Introduce in phases, keep Toil under 50%, mediate the balance via error budget, and delegate routine work to AIOps. That is the practical answer for SRE practices in 2026.

Next time we’ll cover documentation (README, ADR, Runbook).


I hope you’ll read the next article as well.