[DevOps Architecture] SRE Practices

About this article

As the thirteenth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains SRE practices.

SRE is engineering that aims to eliminate operational work: reducing manual work is the main job. This article covers toil reduction, SLO-based prioritization, error-budget operations, chaos engineering, on-call rotation, and team topology (Embedded SRE / Platform SRE), in other words how to put the operational engineering that originated at Google into day-to-day practice.

What are SRE practices in the first place

Picture factory production management. A well-run factory doesn’t have workers manually checking quality every time — it has automated inspection, early defect detection, line-halt criteria, and improvement cycles built into the system. Maintaining quality through systems rather than individual skill is the essence of production management.

SRE practices are production management for system operations. They are the concrete methods Google systematized — toil reduction, prioritization via SLO, error-budget operations, chaos engineering — maintaining and improving operational quality without depending on specific individuals.

Without SRE practices, operations become an endless repetition of manual work. Engineers get consumed by alert response and can’t invest time in essential improvements.

Why SRE is needed

Cloud / microservices complicate operations

Modern systems with hundreds of services cannot realistically be managed with legacy “manual-based operations,” where people follow procedure documents by hand. The SRE approach of operating through code becomes a requirement.

Balancing dev speed and reliability

Legacy operations teams tended to favor “stop changes for the sake of stability,” which put them in conflict with development. SRE mediates between the two with a number: the error budget.

Reduce engineers’ operational burden

Manual on-call and alert response exhaust engineers. SRE reduces that burden through automation so that engineers can focus on essential problem-solving.

Main SRE practices

These are the eight main practices systematized in Google’s SRE book. Combine them to build an operational culture.

[Figure: Google SRE's Eight Key Practices. Panels: 1 SLO / Error Budget, 2 Toil Reduction, 3 Post-mortem, 4 On-Call Design, 5 Capacity Planning, 6 Incident Response, 7 Chaos Engineering, 8 PRR (Pre-production Review). Footer: Implementation order: Phase 1 Monitoring Infrastructure → Phase 2 SLI/SLO → Phase 3 Error Budget → Phase 4 Toil Automation → Phase 5 Chaos Exercises.]

Practice | Content
SLO / error budget | Manage reliability numerically
Toil reduction | Automate repetitive work
Postmortem | Blameless review
On-call design | Rotation, load management
Capacity planning | Scale prediction and prep
Incident Response | Systematize incident handling
Chaos engineering | Intentionally break to learn
Production Readiness Review | Pre-production audit

Toil reduction

At Google, repetitive manual work and operational work that could be automated is called Toil, and the explicit goal is to keep it under 50% of an SRE’s time. Beyond 50%, SREs have no time left for development work and the org stops generating new value.

Toil example | Path to automation
Server restart | Auto-recovery
Log investigation | Observability foundation
Permission grants | Self-service-ization
Migration | CI/CD
Alert response | Auto-execution of Runbooks

Toil isn’t evil, but it is work that doesn’t help you grow. Writing code to eliminate it is the SRE’s main job, and the ideal is to invest 20% or more of each month’s time in automation.
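To make the 50% and 20% lines concrete, here is a minimal sketch of a monthly time check. The category names, the `MonthlyHours` type, and the `toil_report` function are hypothetical; only the two thresholds come from the text above, and how hours get classified as Toil is left to each team.

```python
from dataclasses import dataclass

TOIL_CEILING = 0.50      # from the article: keep Toil under 50% of SRE time
AUTOMATION_FLOOR = 0.20  # from the article: invest 20% or more in automation


@dataclass
class MonthlyHours:
    """Hours logged in one month by category (hypothetical breakdown)."""
    toil: float        # repetitive, automatable manual work
    automation: float  # writing code that removes Toil
    other: float       # project work, design, reviews, and so on


def toil_report(hours: MonthlyHours) -> dict:
    """Return the Toil and automation shares and whether each target is met."""
    total = hours.toil + hours.automation + hours.other
    if total <= 0:
        raise ValueError("total hours must be positive")
    toil_share = hours.toil / total
    automation_share = hours.automation / total
    return {
        "toil_share": round(toil_share, 3),
        "automation_share": round(automation_share, 3),
        "toil_ok": toil_share < TOIL_CEILING,
        "automation_ok": automation_share >= AUTOMATION_FLOOR,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 88 of 160 hours on Toil misses both targets.
    print(toil_report(MonthlyHours(toil=88, automation=24, other=48)))
```

Tracking the share rather than raw hours keeps the check comparable across teams of different sizes.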

Error-budget operations

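The core of SRE is disciplining operations with an error budget derived from the SLO (the error budget is the amount of unreliability the SLO permits, that is, 100% minus the SLO target). The switch is explicit: accelerate development while error budget remains, and redirect effort to reliability once it is exhausted.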

Error budget remaining 70% --> Ship new features aggressively
Error budget remaining 30% --> Cautious, careful
Error budget remaining 5%  --> Freeze releases, stabilize
Error budget remaining 0%  --> Stop new features, focus on quality

This mechanism dissolves the dev-vs-ops conflict. The decision rests on the objective fact that the budget is exhausted, rather than on someone’s opinion that releases should stop.
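The switching rule above can be sketched in a few lines. This assumes a request-based SLO (good requests divided by total requests), which is only one common way to define it. The function names are hypothetical, and the band edges between the four listed values (above 30%, 30% down to 5%, 5% down to 0%, exhausted) are an assumption, since the list names points rather than ranges.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    slo is a ratio such as 0.999, so the budget is (1 - slo) of all requests.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining: float) -> str:
    """Map remaining budget to an action; the cutoffs are assumed band edges."""
    if remaining <= 0.0:
        return "stop new features, focus on quality"
    if remaining <= 0.05:
        return "freeze releases, stabilize"
    if remaining <= 0.30:
        return "cautious, careful"
    return "ship new features aggressively"


if __name__ == "__main__":
    # Illustrative month: 1,000,000 requests against a 99.9% SLO.
    for failed in (300, 750, 960, 1200):
        left = error_budget_remaining(0.999, 1_000_000, failed)
        print(f"{failed} failures -> {left:.0%} left -> {release_policy(left)}")
```

Because the policy is a pure function of one number, it can be agreed on with stakeholders in advance and applied without debate when the budget runs low.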

Chaos engineering

Chaos engineering is the technique of intentionally causing failures in production to verify a system’s fault tolerance. It originated with Netflix’s Chaos Monkey: randomly killing instances in production forces designs that tolerate failure at all times.

Tool | Use case
Chaos Monkey | Instance stoppage
Gremlin | Commercial, comprehensive
Chaos Mesh | OSS for K8s
LitmusChaos | K8s, CNCF
AWS FIS | AWS-integrated service

The premise is an architecture designed on the assumption that failures will happen. Practicing breakage periodically builds an org that doesn’t panic during real failures.
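The idea of randomly stopping an instance can be illustrated with a toy, in-memory sketch. This is not how Chaos Monkey or any of the tools in the table is implemented. The `Instance` type, the `min_healthy` guard, and the dry-run default are assumptions added here to show the kind of safety rails a first experiment usually wants, and setting a flag stands in for a real stop call.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A stand-in for a real server or pod (hypothetical model)."""
    instance_id: str
    group: str
    healthy: bool = True


def pick_victim(instances, group, min_healthy, rng):
    """Pick one healthy instance of `group`, or None if stopping one would
    leave fewer than `min_healthy` healthy instances in that group."""
    candidates = [i for i in instances if i.group == group and i.healthy]
    if len(candidates) - 1 < min_healthy:
        return None
    return rng.choice(candidates)


def run_experiment(instances, group, min_healthy=2, dry_run=True, rng=None):
    """Stop one random instance in `group`, respecting the safety guard."""
    rng = rng or random.Random()
    victim = pick_victim(instances, group, min_healthy, rng)
    if victim is None:
        return "skipped: not enough healthy instances"
    if dry_run:
        return f"dry-run: would stop {victim.instance_id}"
    victim.healthy = False  # stand-in for a real stop or terminate call
    return f"stopped {victim.instance_id}"


if __name__ == "__main__":
    fleet = [Instance(f"web-{n}", "web") for n in range(3)]
    print(run_experiment(fleet, "web"))                 # dry run by default
    print(run_experiment(fleet, "web", dry_run=False))  # stops one instance
    print(run_experiment(fleet, "web", dry_run=False))  # guard skips this one
```

Defaulting to dry-run and refusing to go below a healthy floor is one way to limit blast radius while the team is still building confidence.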

SLO-culture penetration

Merely defining SLOs doesn’t make them work. They have to take root as organizational culture, and the following habits matter.

Habit | Content
SLO review | Quarterly target review
Weekly SLI check | Trend monitoring of actuals
Budget-exhaustion response | Planned freeze / stabilize
Stakeholder agreement | Discussion with business

The core of the cultural transformation is discarding the idea that 100% uptime is the ideal. An SLO is a symbol of “don’t aim for perfection, aim for sufficient.”
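One way to make “sufficient” tangible in a stakeholder discussion is to translate an availability target into permitted downtime. The sketch below assumes a time-based availability SLO over a 30-day window. Both the window length and the function name are assumptions, not something the practices above prescribe.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits within the window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 100% target leaves zero minutes, meaning no room for deploys, experiments, or mistakes, which is why the target is negotiated rather than maximized.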

Capacity planning

Capacity planning prepares for future traffic growth. Falling behind leads directly to incidents and missed opportunities, so it’s an important area that SRE works on continuously.

Step | Content
Demand forecast | Business plan and traffic
Current grasp | Resource utilization, margin
Bottleneck identification | The first to clog
Procurement plan | Cloud reservations, contracts
Load tests | Verification at predicted values
Periodic review | Monthly, quarterly

The safer strategy for services expecting rapid growth is holding more margin.
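The demand-forecast and margin steps can be approximated with a very simple projection. The sketch below assumes compound monthly growth and an 80% utilization ceiling. Both are illustrative assumptions, as are the function and parameter names, and a real forecast would draw on the business plan and measured traffic.

```python
def months_until_threshold(current_peak, capacity, monthly_growth,
                           threshold=0.8, horizon=60):
    """Return the first month (0 = now) in which projected peak load exceeds
    threshold * capacity, or None if that does not happen within `horizon`."""
    if capacity <= 0 or current_peak < 0:
        raise ValueError("capacity must be positive and current_peak non-negative")
    limit = capacity * threshold
    load = float(current_peak)
    for month in range(horizon + 1):
        if load > limit:
            return month
        load *= 1.0 + monthly_growth  # compound growth, an assumed model
    return None


if __name__ == "__main__":
    # Illustrative numbers: 500 rps peak today, 1000 rps capacity, 10% growth a month.
    print(months_until_threshold(current_peak=500, capacity=1000, monthly_growth=0.10))
```

The returned month count is the lead time available for procurement and load tests. Lowering `threshold` is the code equivalent of holding more margin.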

Production Readiness Review

A Production Readiness Review is the mechanism of auditing new services before they go into production. Google calls it PRR, and the SRE team evaluates services built by dev teams.

Viewpoint | Content
Observability | Metrics / logs / traces ready
SLO definition | Targets and measurement methods
Capacity | Tolerance to expected load
Deploy strategy | Safe release procedures
Disaster countermeasures | Recovery procedures on failure
On-call regime | Responders, manuals

The gate “services that fail PRR don’t go into production” is what guarantees quality.
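The gate can be expressed as a checklist that must be fully satisfied before release. The sketch below mirrors the six viewpoints in the table. The field names and the all-or-nothing rule are assumptions for illustration, and a real review also involves human judgment that a boolean cannot capture.

```python
from dataclasses import dataclass, fields


@dataclass
class PrrChecklist:
    """One boolean per PRR viewpoint (hypothetical field names)."""
    observability: bool       # metrics / logs / traces ready
    slo_defined: bool         # targets and measurement methods
    capacity_verified: bool   # tolerance to expected load
    deploy_strategy: bool     # safe release procedures
    disaster_recovery: bool   # recovery procedures on failure
    on_call_ready: bool       # responders, manuals


def prr_failures(checklist: PrrChecklist) -> list:
    """Names of the viewpoints that are not yet satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def can_release(checklist: PrrChecklist) -> bool:
    """The gate: release only when every viewpoint passes."""
    return not prr_failures(checklist)


if __name__ == "__main__":
    review = PrrChecklist(True, True, False, True, True, False)
    print(can_release(review), prr_failures(review))
```

Returning the failing items, not just a yes or no, gives the dev team a concrete list to work through before the next review.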

Relationship with DevOps

SRE is positioned as the concrete implementation of DevOps. If DevOps is philosophy / culture, SRE is the practical pattern.

Aspect | DevOps | SRE
Position | Philosophy / culture | Concrete practice
Origin | Around 2009, Patrick Debois | 2003, Google
Focus | Integration of dev and ops | Solve operations with engineering
Metric | DORA 4 metrics | SLO / error budget
Role | No clear role | SRE engineer

It’s also said that “DevOps is the ideal, SRE is the implementation pattern.”

Platform Engineering

Platform Engineering extends SRE thinking: a dedicated team improves the in-house developer experience. It provides an Internal Developer Platform (IDP), an environment in which dev teams can deploy and operate autonomously and safely.

Provided | Content
Self-service portal | Deploy, env creation
Golden Path | Standard tech stack
Automation tools | CI/CD, IaC
Monitoring foundation | Common observability

Backstage (open source, originally from Spotify) is the representative foundation for building an IDP and has been adopted by many companies.

Decision criterion 1: org scale

The level of SRE introduction depends on org scale. A company-wide SRE team makes sense for mid-size enterprises and up; startups realistically start with developers doubling as SREs.

Scale | Recommended
Startup | Devs double as SREs
Mid-size | SRE team (2-5 people)
Large enterprise | SRE per org
Global | Central SRE + per-business SRE

Decision criterion 2: org maturity

It is realistic to introduce SRE in phases. Starting everything at once causes chaos.

Phase | Content
Phase 1 | Metric / log foundation
Phase 2 | SLI measurement, SLO trial operation
Phase 3 | Error-budget operation
Phase 4 | Toil-automation investment
Phase 5 | Chaos engineering

How to choose by case

Personal dev / small team

Developers double as SREs, and Phases 1-2 are enough. Set up a metric foundation (CloudWatch etc.) and measure SLIs. Leave SLO operation, error budgets, and PRR until the org matures. For Toil, it is enough to find manual work that takes more than 2 hours a month and automate it.

Startup / growth-stage SaaS

One or two engineers double as SREs, with trial SLO operation and a Toil-reduction culture. Push through to Phase 3: make release decisions by error budget and keep Runbooks in Git. Building a self-service portal with Backstage raises dev speed.

Mid-size enterprise / microservices ops

A dedicated SRE team of 3-10 people, plus PRR and chaos drills. Implement everything through Phase 5: run monthly chaos drills with LitmusChaos / AWS FIS and build an IDP with a Platform Engineering team. Include a Toil-reduction KPI in the org’s goals.

Large enterprise / regulated industries

A two-tier structure of central SRE and business-unit SRE, plus AIOps. The central team provides company-wide standards (Golden Path), and business units run their own SLO operations. Automate routine response with Datadog Bits AI / Resolve AI so that humans can focus on strategy and improvement design.

Phased SRE-maturity roadmap

SRE is a cultural transformation that doesn’t happen overnight. A phased introduction in line with the Google SRE Workbook is realistic.

Phase | Period | Implementation | Required SREs
Phase 1: Measurement foundation | ~6 months | Prometheus / Datadog adoption, metric / log setup | 0-1 (concurrent)
Phase 2: SLO trial operation | ~1 year | SLI selection, measurement, tentative target | 1-2
Phase 3: Error-budget operation | ~1.5 years | Release-freeze rule on exhaustion, quarterly review | 2-5
Phase 4: Toil-reduction culture | ~2 years | Toil under 50% target, 20% automation investment, Runbook as Code | 3-10
Phase 5: Chaos / IDP | ~3 years | Monthly chaos engineering, build IDP like Backstage | 5+
Phase 6: AIOps | ~5 years | Auto-first-response with PagerDuty AIOps / Resolve AI | Prompt + Systems Engineer

The Toil target is under 50% of SRE time, which is Google’s official guideline. Over 50%, SREs can’t do development work and the org can’t generate value. Allocating 20% of each month to automation is the rule of thumb that keeps Toil at a sustainable level.

SRE is defined not by the nameplate but by how time is spent. A team whose Toil exceeds 50% is SRE in name only, not real SRE.

SRE-operation pitfalls and forbidden moves

These are the typical failure patterns in SRE introduction. All share the same structure: the nameplate changes while the contents stay the same.

Forbidden move | Why it’s bad
Just change existing ops team’s business cards to SRE | Toil rate 95% even after a year, SRE not writing code is zero value
Don’t give SRE authority / time to write code | Stays as ops worker, generates no essential value
Set SLO and abandon | No one looks, quarterly review required
Try to make Toil zero | Realistically impossible, under 50% is Google’s guideline
Start chaos engineering without production experience | Major accident on first try, practice in staging first
Put new services into production without PRR | Services with unmaintained monitoring / SLO / Runbook reach production
Continue releases even on error-budget exhaustion | Reliability collapses, error-budget operations meaningless
Don’t measure on-call load | 5+ late-night calls monthly exhausts SREs / leads to resignation
Don’t create independent SRE team | Falls to subcontracting under dev teams
Defer investment in Platform Engineering / IDP | Each team operates independently, no standardization, inefficient
“SRE means infrastructure operators”: conflating the two | SRE is the methodology of solving operations with software engineering. SREs not writing code have low value
“Toil should be zero”: pursuing perfection | Realistically impossible. Under 50% is Google’s guideline. Some manual work for new features is necessary

The SRE-nameplate problem seen at Japanese companies (attaching the name “SRE” to an existing ops team while midnight manual on-call continues, there is no time or authority to write code, and the Toil rate is still 95% after a year) drives home the lesson that SRE is defined not by the nameplate but by how time is spent. Netflix has been randomly killing production instances with Chaos Monkey since the 2010s and has calmly kept serving even during partial AWS outages. That gap is the difference between “real SRE” and “name only.”

SRE is engineering that sets out to kill Toil. If you are still running things by hand, that is evidence you haven’t scaled yet.
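The on-call row in the table above (“5+ late-night calls monthly”) only helps if the load is actually measured. As a minimal sketch, assuming page records are available as (responder, timestamp) pairs and that “late night” means 22:00 to 06:00 local time, both of which are assumptions along with the function names, the count could look like this:

```python
from collections import Counter
from datetime import datetime

LATE_NIGHT_LIMIT = 5  # from the table: 5+ late-night calls a month


def is_late_night(ts: datetime, start_hour: int = 22, end_hour: int = 6) -> bool:
    """True when the page arrived inside the assumed late-night window."""
    return ts.hour >= start_hour or ts.hour < end_hour


def overloaded_responders(pages, limit: int = LATE_NIGHT_LIMIT) -> dict:
    """Given one month of (responder, datetime) pages, return the responders
    whose late-night page count reached the limit, with their counts."""
    counts = Counter(who for who, ts in pages if is_late_night(ts))
    return {who: n for who, n in counts.items() if n >= limit}


if __name__ == "__main__":
    # Illustrative data only: five 02:00 pages for one person, two pages for another.
    pages = [("aiko", datetime(2026, 1, day, 2, 0)) for day in range(1, 6)]
    pages += [("ben", datetime(2026, 1, 3, 14, 0)), ("ben", datetime(2026, 1, 4, 23, 30))]
    print(overloaded_responders(pages))
```

Reviewing this number alongside the rotation schedule makes overload visible before it turns into resignations.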

AI decision axes

AI-era favorable | AI-era unfavorable
Code-ized Runbooks | Procedures in people’s heads
Structured logs / metrics | String-log-centric
Automated Toil | Manual work remaining
Quantitative reliability management | Sensory operations

  1. Phased introduction: go in order from Phase 1 (measurement foundation) to Phase 5 (chaos); don’t do everything at once
  2. Keep Toil under 50%: above that, SREs can’t do development work; invest 20% of each month in automation
  3. Mediate the dev/reliability balance via error budget: decide objectively whether to accelerate or freeze releases based on what remains
  4. Delegate routine response to AIOps: humans focus on strategy, improvement design, and agent management

AIOps is being adopted in 3 stages in practice

As of 2026, practical AIOps adoption is progressing in the following 3 stages:

  • Stage 1: Anomaly-detection automation (AI detects metric baseline deviations → alerts humans)
  • Stage 2: Root-cause estimation (AI cross-analyzes logs, traces, and metrics for RCA → notifies via Slack)
  • Stage 3: Auto-recovery (AI executes recovery operations per Runbook → humans confirm after the fact)

Most organizations are at Stage 1-2. Advancing to Stage 3 requires codified, trustworthy Runbooks and a design that strictly controls the AI’s operational permissions via IAM.
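To give a feel for what Stage 1 means mechanically, here is a deliberately simple baseline-deviation check using a z-score. Commercial AIOps products use far more elaborate models, so this is a sketch of the idea rather than a description of any of them. The function name and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, z_threshold=3.0):
    """Flag `value` when it deviates from the baseline mean by more than
    `z_threshold` sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold


if __name__ == "__main__":
    # Illustrative latency samples in milliseconds.
    latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
    for sample in (103, 110):
        print(sample, "anomalous" if is_anomalous(latency_ms, sample) else "normal")
```

In a Stage 1 setup the result of such a check would raise an alert for a human; Stages 2 and 3 build on top of that signal.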

Toil-automation ROI has changed with AI

Previously, Toil automation was judged by “development cost of automation scripts vs. repeated cost of manual work.” In configurations where AI reads Runbooks and executes them automatically, there is no need to write automation scripts from scratch, and cases are increasing where simply maintaining Markdown procedure documents is enough for AI to execute them. In those cases the up-front cost drops, so the ROI of automation investment can improve substantially.
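The traditional judgment can be written as a break-even calculation. The sketch below is only an illustration: the function name, the optional maintenance term, and all of the hour figures are hypothetical, chosen to show how a lower up-front cost shortens the payback period.

```python
def break_even_months(dev_hours, manual_hours_per_month,
                      maintenance_hours_per_month=0.0):
    """Months until the hours invested in automation are recovered, or None
    when upkeep costs as much as the manual work it replaces."""
    saved_per_month = manual_hours_per_month - maintenance_hours_per_month
    if saved_per_month <= 0:
        return None
    return dev_hours / saved_per_month


if __name__ == "__main__":
    # Hypothetical figures: a hand-written script versus a maintained Markdown Runbook.
    print(break_even_months(dev_hours=40, manual_hours_per_month=10,
                            maintenance_hours_per_month=2))
    print(break_even_months(dev_hours=8, manual_hours_per_month=10,
                            maintenance_hours_per_month=2))
```

With the same monthly saving, cutting the up-front effort from 40 hours to 8 cuts the payback period proportionally, which is the shift in ROI described above.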

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague always invites later questions like “why did we decide this again?”

  • SRE-org placement (centralized / distributed)
  • SLO management process (setup, review frequency)
  • Error-budget operational rules (freeze criteria)
  • Toil-reduction target (under 50%, automation investment rate)
  • Chaos engineering (frequency, scope)
  • PRR process (new-service introduction audit)
  • AI tool adoption (AIOps, auto-diagnosis)

Author’s note: cases that show the gap between the “SRE nameplate” and “real SRE”

Whether SRE is a surface-level nameplate or a real cultural transformation largely decides the org’s fate.

Google launched its SRE team in 2003, and after the practice was published in the 2016 book “Site Reliability Engineering,” companies worldwide followed. But, especially at Japanese companies, case after case emerged of simply changing the existing ops team’s business cards to “SRE” while the reality remained manual night on-call and phone response. No time or authority to write code, a Toil rate of 95% after a year, SLOs defined only as a formality with no one watching: such stories from the field, which are no laughing matter, are still told today.

In contrast, Netflix is famous for applying SRE thinking thoroughly. It has been randomly killing production instances with Chaos Monkey since the 2010s, and designing on the premise that things break in production has become standard there. As a result, Netflix has repeatedly been observed to keep serving calmly even during partial AWS outages. What this shows is that SRE is not a matter of nameplates but of culture and how time is spent.

Calling yourself SRE doesn’t make you SRE. What decides the essence of SRE is whether you actually have time to attack Toil with code.

Summary

This article covered SRE practices, including main practices, Toil reduction, error-budget operations, chaos engineering, PRR, Platform Engineering, and AIOps collaboration.

Introduce in phases, keep Toil under 50%, mediate the balance via error budget, and delegate routine work to AIOps. That is the practical answer for SRE practices in 2026.

Next time we’ll cover documentation (README, ADR, Runbook).

I hope you’ll read the next article as well.

📚 Series: Architecture Crash Course for the Generative-AI Era (66/89)