[DevOps Architecture] SRE Practices - Toil Reduction and Chaos Drills

About this article

As the thirteenth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains SRE practices.

SRE is engineering aimed at eliminating operational work: reducing manual work is the main job. This article covers toil reduction, SLO-based prioritization, error-budget operations, chaos engineering, on-call rotation, and team topology (Embedded SRE / Platform SRE), that is, how to put Google-originated operational engineering into day-to-day practice.

Why SRE is needed

Cloud / microservices complicate operations

Modern systems with hundreds of services cannot be managed with legacy operations driven by written procedure manuals. The SRE approach of operating through code becomes necessary.

Balancing dev speed and reliability

Legacy operations teams tended to favor “stop changes to keep things stable,” which put them in conflict with development. SRE mediates between the two with a single number: the error budget.

Reduce engineers’ operational burden

Manual on-call and alert response exhaust engineers. SRE reduces that burden through automation so engineers can focus on essential problem-solving.

Main SRE practices

Google’s SRE book systematizes the following eight main practices. Combine them to build an operational culture.

```mermaid
flowchart TB
    SRE([SRE implementation])
    subgraph MEAS["Measure"]
        SLO[SLO/error budget]
        CAP[capacity planning]
    end
    subgraph REDUCE["Reduce"]
        TOIL[Toil reduction<br/>automate repetitive work]
        CHAOS[Chaos engineering]
    end
    subgraph LEARN["Learn"]
        PM[Postmortem<br/>Blameless]
        PRR[Production<br/>Readiness Review]
    end
    subgraph RESPOND["Prepare"]
        IR[Incident Response]
        ON[On-call design]
    end
    SRE --> MEAS
    SRE --> REDUCE
    SRE --> LEARN
    SRE --> RESPOND
    classDef sre fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    classDef m fill:#dbeafe,stroke:#2563eb;
    classDef r fill:#dcfce7,stroke:#16a34a;
    classDef l fill:#fae8ff,stroke:#a21caf;
    classDef p fill:#f0f9ff,stroke:#0369a1;
    class SRE sre;
    class MEAS,SLO,CAP m;
    class REDUCE,TOIL,CHAOS r;
    class LEARN,PM,PRR l;
    class RESPOND,IR,ON p;
```

| Practice | Content |
| --- | --- |
| SLO / error budget | Manage reliability numerically |
| Toil reduction | Automate repetitive work |
| Postmortem | Blameless review |
| On-call design | Rotation, load management |
| Capacity planning | Scale prediction and prep |
| Incident Response | Systematize incident handling |
| Chaos engineering | Intentionally break to learn |
| Production Readiness Review | Pre-production audit |

Toil reduction

Google calls repetitive manual work, and operational work that could be automated, toil, and sets an explicit goal of keeping it under 50% of each SRE’s time. Beyond 50%, SREs have no time left for engineering work and the organization stops generating value.

| Toil example | Path to automation |
| --- | --- |
| Server restart | Auto-recovery |
| Log investigation | Observability foundation |
| Permission grants | Self-service |
| Migration | CI/CD |
| Alert response | Auto-execution of Runbooks |

Toil is not evil, but it is work that does not help you grow. Writing code to eliminate it is the SRE’s main job; ideally, 20% or more of each month goes to automation.
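The 50% guideline is only actionable if toil is measured. The following is a minimal sketch, assuming work is logged as entries tagged toil or non-toil; the `WorkEntry` shape and the sample week are hypothetical and not taken from any particular tracking tool.

```python
from dataclasses import dataclass

# Guideline cited in the text: keep toil under 50% of SRE time.
TOIL_CEILING = 0.50


@dataclass(frozen=True)
class WorkEntry:
    """One logged block of work (hypothetical record shape)."""
    description: str
    hours: float
    is_toil: bool  # True for manual, repetitive, automatable work


def toil_ratio(entries: list[WorkEntry]) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


def over_ceiling(entries: list[WorkEntry], ceiling: float = TOIL_CEILING) -> bool:
    """True when toil has reached or passed the ceiling ("under 50%" is the goal)."""
    return toil_ratio(entries) >= ceiling


# Illustrative week: 10 toil hours out of 40 logged hours.
week = [
    WorkEntry("restart stuck workers by hand", 6.0, True),
    WorkEntry("grant access on request", 4.0, True),
    WorkEntry("write auto-recovery job", 10.0, False),
    WorkEntry("design review", 20.0, False),
]
print(f"toil ratio: {toil_ratio(week):.0%}, over ceiling: {over_ceiling(week)}")
```

Tracking the ratio per person and per week, rather than as a team average, makes it easier to see who is absorbing the manual work.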

Error-budget operations

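The core of SRE is disciplining operations with an error budget derived from the SLO (the budget is the amount of unreliability the SLO permits, that is, 100% minus the target). The switch is explicit: accelerate development while budget remains, and redirect effort to reliability once it is exhausted.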

Error budget remaining 70% --> Ship new features aggressively
Error budget remaining 30% --> Cautious, careful
Error budget remaining 5%  --> Freeze releases, stabilize
Error budget remaining 0%  --> Stop new features, focus on quality

This mechanism dissolves the dev-versus-ops conflict. The decision rests on the objective fact that the budget is exhausted, not on someone insisting that releases stop.
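The tiers above can be sketched as a small function. This is an illustration under assumptions: the budget is counted in failed requests over one SLO window, and the tier boundaries (30%, 5%, 0%) are one reading of the example rows, not an official policy. Both function names are hypothetical.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    slo_target is a ratio such as 0.999. The budget is the number of events
    allowed to fail: (1 - slo_target) * total_events. The result goes
    negative once the budget is overspent.
    """
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


def release_policy(remaining: float) -> str:
    """Map remaining budget to a release stance (boundaries are assumptions)."""
    if remaining <= 0.0:
        return "stop new features, focus on quality"
    if remaining <= 0.05:
        return "freeze releases, stabilize"
    if remaining <= 0.30:
        return "release cautiously"
    return "ship new features aggressively"


# 1,000,000 requests at a 99.9% SLO allow 1,000 failures; 300 have occurred.
remaining = error_budget_remaining(0.999, good_events=999_700, total_events=1_000_000)
print(f"{remaining:.0%} remaining -> {release_policy(remaining)}")
```

Writing the boundaries into code (or into a reviewed policy document) ahead of time is what keeps the freeze decision from being renegotiated at the moment the budget runs out.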

Chaos engineering

Chaos engineering is the technique of intentionally causing failures in production to verify a system’s fault tolerance. It originated with Netflix’s Chaos Monkey: randomly killing instances in production forces designs that tolerate failure at all times.

| Tool | Use case |
| --- | --- |
| Chaos Monkey | Instance stoppage |
| Gremlin | Commercial, comprehensive |
| Chaos Mesh | OSS for K8s |
| LitmusChaos | K8s, CNCF |
| AWS FIS | AWS-integrated service |

The prerequisite is an architecture designed on the assumption that failures will happen. Practicing breakage regularly builds an organization that does not panic during real failures.
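The shape of a chaos drill (confirm steady state, inject one failure, re-check the hypothesis) can be sketched as a pure simulation. Nothing here calls a real chaos tool; the instance names, capacities, and the `run_chaos_drill` function are hypothetical, and the steady-state hypothesis is assumed to be "remaining capacity still covers demand."

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrillResult:
    aborted: bool                    # True if steady state was not met before the drill
    victim: Optional[str]            # instance that was "killed", if any
    hypothesis_held: Optional[bool]  # did the service still cover demand afterwards?


def run_chaos_drill(capacity_rps: dict[str, int], demand_rps: int,
                    rng: random.Random) -> DrillResult:
    """Simulate losing one random instance and re-check the steady-state hypothesis."""
    # Never inject a failure into a system that is already unhealthy.
    if sum(capacity_rps.values()) < demand_rps:
        return DrillResult(aborted=True, victim=None, hypothesis_held=None)
    victim = rng.choice(sorted(capacity_rps))
    remaining = sum(c for name, c in capacity_rps.items() if name != victim)
    return DrillResult(aborted=False, victim=victim, hypothesis_held=remaining >= demand_rps)


fleet = {"web-1": 100, "web-2": 100, "web-3": 100}  # requests/sec each (illustrative)
rng = random.Random(0)  # seeded so a drill can be replayed
print(run_chaos_drill(fleet, demand_rps=150, rng=rng))  # headroom survives one loss
print(run_chaos_drill(fleet, demand_rps=250, rng=rng))  # finding: no room for one loss
```

A failed hypothesis is the useful outcome: it is a finding discovered in a drill instead of during a real outage.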

SLO-culture penetration

Merely defining SLOs does not make them work. They have to take root as organizational culture, and the following habits matter.

| Habit | Content |
| --- | --- |
| SLO review | Quarterly target review |
| Weekly SLI check | Trend monitoring of actuals |
| Budget-exhaustion response | Planned freeze / stabilize |
| Stakeholder agreement | Discussion with business |

The core of the cultural change is discarding the idea that 100% uptime is the ideal. An SLO symbolizes “aim for sufficient, not perfect.”
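A little arithmetic shows what "sufficient" means in practice. As a sketch, assuming a time-based availability SLO over a 30-day window, a 99.9% target permits (1 - 0.999) x 30 x 24 x 60 = 43.2 minutes of downtime. The helper names and the weekly SLI values below are hypothetical.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime a time-based availability SLO permits within the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def weeks_below_target(weekly_sli: list[float], target: float) -> list[int]:
    """Return 1-based week numbers whose measured SLI fell below the target."""
    return [week for week, value in enumerate(weekly_sli, start=1) if value < target]


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_minutes(target):.1f} min per 30 days")

# Weekly SLI check: illustrative measurements against a 99.9% target.
measured = [0.9995, 0.9989, 0.9992, 0.9985]
print("weeks below target:", weeks_below_target(measured, 0.999))
```

Each extra nine cuts the permitted downtime by a factor of ten, which is why the target is worth agreeing on with the business instead of defaulting to the highest number.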

Capacity planning

Capacity planning prepares for future traffic growth. Falling behind leads directly to incidents and missed opportunities, so it is an area SRE works on continuously.

| Step | Content |
| --- | --- |
| Demand forecast | Business plan and traffic |
| Current state | Resource utilization, margin |
| Bottleneck identification | The resource that saturates first |
| Procurement plan | Cloud reservations, contracts |
| Load tests | Verification at predicted values |
| Periodic review | Monthly, quarterly |

For services expecting rapid growth, holding extra headroom is the safer strategy.
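The demand-forecast and headroom steps can be combined into a rough "months of runway" estimate. This is a sketch under simple assumptions: peak traffic compounds at a fixed monthly rate and a fixed share of capacity is reserved as headroom. The function name, the 30% headroom default, and the sample numbers are all hypothetical.

```python
from typing import Optional


def months_until_exhausted(current_peak: float, monthly_growth: float, capacity: float,
                           headroom: float = 0.30, horizon_months: int = 36) -> Optional[int]:
    """Return the first month in which forecast peak exceeds usable capacity.

    Usable capacity is capacity * (1 - headroom). Returns 0 if the limit is
    already exceeded, or None if it is not reached within the horizon.
    """
    usable = capacity * (1.0 - headroom)
    peak = current_peak
    if peak > usable:
        return 0
    for month in range(1, horizon_months + 1):
        peak *= 1.0 + monthly_growth  # compound growth forecast
        if peak > usable:
            return month
    return None


# 1,000 rps peak today, 10% monthly growth, 2,000 rps capacity, 30% headroom.
print(months_until_exhausted(current_peak=1000, monthly_growth=0.10, capacity=2000))
```

The answer tells you how much lead time remains for the procurement and load-test steps; a forecast this simple should be re-run at every periodic review as real traffic data comes in.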

Production Readiness Review

A Production Readiness Review is an audit performed before a new service goes to production. Google calls it PRR; the SRE team evaluates services built by development teams.

| Viewpoint | Content |
| --- | --- |
| Observability | Metrics / logs / traces ready |
| SLO definition | Targets and measurement methods |
| Capacity | Tolerance to expected load |
| Deploy strategy | Safe release procedures |
| Disaster countermeasures | Recovery procedures on failure |
| On-call regime | Responders, manuals |

The gate is simple: a service that fails the PRR does not go to production. That rule is what protects quality.
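The gate can be expressed as a checklist that must be fully satisfied. The following is a minimal sketch in which the check keys mirror the viewpoints in the table; the key names and the `prr_gate` function are hypothetical, and a real review involves human judgment that a boolean cannot capture.

```python
# Check names mirror the PRR viewpoints listed in the table (keys are hypothetical).
PRR_CHECKS = (
    "observability",
    "slo_definition",
    "capacity",
    "deploy_strategy",
    "disaster_recovery",
    "on_call",
)


def prr_gate(review: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (passed, missing): a check that is absent counts as failed."""
    missing = [check for check in PRR_CHECKS if not review.get(check, False)]
    return (not missing, missing)


# Illustrative review: no SLO defined and no on-call entry recorded yet.
review = {
    "observability": True,
    "slo_definition": False,
    "capacity": True,
    "deploy_strategy": True,
    "disaster_recovery": True,
}
passed, missing = prr_gate(review)
print("ready for production" if passed else f"blocked, missing: {missing}")
```

Treating an unanswered item as a failure, rather than skipping it, is the design choice that keeps the gate from passing services by omission.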

Relationship with DevOps

SRE is positioned as the concrete implementation of DevOps. If DevOps is philosophy / culture, SRE is the practical pattern.

| | DevOps | SRE |
| --- | --- | --- |
| Position | Philosophy / culture | Concrete practice |
| Origin | Around 2009, Patrick Debois | 2003, Google |
| Focus | Integration of dev and ops | Solve operations with engineering |
| Metric | DORA 4 metrics | SLO / error budget |
| Role | No clear role | SRE engineer |

It’s also said that “DevOps is the ideal, SRE is the implementation pattern.”

Platform Engineering

Platform Engineering extends SRE thinking: a dedicated team works on improving the in-house developer experience. It provides an Internal Developer Platform (IDP), an environment in which development teams can deploy and operate autonomously and safely.

| Provided | Content |
| --- | --- |
| Self-service portal | Deploy, env creation |
| Golden Path | Standard tech stack |
| Automation tools | CI/CD, IaC |
| Monitoring foundation | Common observability |

Backstage (open source, originally from Spotify) is a representative framework for building an IDP and is adopted by many companies.

Decision criterion 1: org scale

The level of SRE adoption depends on organization scale. A company-wide SRE team makes sense for mid-size enterprises and up; startups realistically begin with developers who double as SREs.

| Scale | Recommended |
| --- | --- |
| Startup | Developers double as SREs |
| Mid-size | SRE team (2-5 people) |
| Large enterprise | SRE per org |
| Global | Central SRE + per-business SRE |

Decision criterion 2: org maturity

Introducing SRE in phases is realistic. Starting everything at once causes chaos.

| Phase | Content |
| --- | --- |
| Phase 1 | Metric / log foundation |
| Phase 2 | SLI measurement, SLO trial operation |
| Phase 3 | Error-budget operation |
| Phase 4 | Toil-automation investment |
| Phase 5 | Chaos engineering |

How to choose by case

Personal dev / small team

Developers double as SREs, and Phases 1-2 are enough. Set up a metric foundation (CloudWatch or similar) and measure SLIs. SLO operation, error budgets, and PRR can wait until the organization matures. For toil, it is enough to find manual work that takes more than 2 hours a month and automate it.

Startup / growth-stage SaaS

One or two engineers doubling as SREs, trial SLO operation, and a toil-reduction culture. Push through to Phase 3: make release decisions by error budget and keep Runbooks in Git. Building a self-service portal with Backstage raises development speed.

Mid-size enterprise / microservices ops

A dedicated SRE team of 3-10 people, plus PRR and chaos drills. Implement everything through Phase 5: run monthly chaos drills with LitmusChaos or AWS FIS, and build an IDP with a Platform Engineering team. Include a toil-reduction KPI in organizational goals.

Large enterprise / regulated industries

A two-tier structure of central SRE and business-unit SRE, plus AIOps. The central team provides company-wide standards (the Golden Path), and business units run their own SLO operations. Automate routine response with tools such as Datadog Bits AI or Resolve AI so that humans focus on strategy and improvement design.

Common misconceptions

SRE means infrastructure operators

Not so. SREs are specialists who solve operational problems with software engineering. An SRE who does not write code delivers little value.

Set SLO and you’re SRE

An SLO is one means among several. SRE is the sum of culture, process, and automation.

Toil should be zero

Realistically impossible. Google’s guideline is under 50%. Some manual work that accompanies new features is necessary.

SRE is the new name for ops team

Changing the nameplate alone does not work. The transformation has to reach authority, culture, and hiring criteria. A frequently told cautionary pattern, reported mostly from Japanese companies, is an existing ops team that simply had the name “SRE” attached: manual midnight on-call continues, there is no time or authority to write code, and the toil rate is still 95% a year later. What defines SRE is not the nameplate but how time is spent.

Phased SRE-maturity roadmap

SRE is a cultural transformation that does not happen overnight. A phased introduction in line with the Google SRE Workbook is realistic.

| Phase | Period | Implementation | Required SREs |
| --- | --- | --- | --- |
| Phase 1: Measurement foundation | ~6 months | Prometheus / Datadog adoption, metric / log setup | 0-1 (concurrent) |
| Phase 2: SLO trial operation | ~1 year | SLI selection, measurement, tentative target | 1-2 |
| Phase 3: Error-budget operation | ~1.5 years | Release-freeze rule on exhaustion, quarterly review | 2-5 |
| Phase 4: Toil-reduction culture | ~2 years | Toil under 50% target, 20% automation investment, Runbook as Code | 3-10 |
| Phase 5: Chaos / IDP | ~3 years | Monthly chaos engineering, build IDP like Backstage | 5+ |
| Phase 6: AIOps | ~5 years | Auto-first-response with PagerDuty AIOps / Resolve AI | Prompt + Systems Engineer |

The toil target is under 50% of SRE time, which is Google’s official guideline. Above 50%, SREs have no time for engineering work and the organization stops generating value. Allocating 20% of each month to automation is a rule of thumb for keeping toil at a sustainable level.

SRE is defined not by the nameplate but by how time is spent. A team with toil above 50% is SRE in name only.

SRE-operation pitfalls and forbidden moves

These are the typical failure patterns in SRE introduction. All share the same structure: the nameplate changes but the substance does not.

| Forbidden move | Why it’s bad |
| --- | --- |
| Just change the existing ops team’s business cards to “SRE” | Toil rate is still 95% a year later; an SRE who does not write code adds no value |
| Give SREs no authority or time to write code | They remain ops workers and generate no essential value |
| Set SLOs and abandon them | No one looks at them; a quarterly review is required |
| Try to make toil zero | Realistically impossible; under 50% is Google’s guideline |
| Start chaos engineering in production without prior practice | Major accident on the first try; practice in staging first |
| Put new services into production without PRR | Services with unmaintained monitoring / SLO / Runbook reach production |
| Continue releases after the error budget is exhausted | Reliability collapses and error-budget operations become meaningless |
| Don’t measure on-call load | 5+ late-night calls a month exhausts SREs and leads to resignations |
| Don’t create an independent SRE team | SRE degrades into a subcontractor for dev teams |
| Defer investment in Platform Engineering / IDP | Each team operates on its own; no standardization, inefficient |
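Measuring on-call load can start from nothing more than a list of page timestamps. The following is a sketch, assuming pages are available as (responder, timestamp) pairs and that "late night" means 22:00 to 05:59 local time; that window, the names, and the sample data are assumptions, while the limit of 5 per month comes from the table above.

```python
from collections import Counter
from datetime import datetime

LATE_NIGHT_START = 22  # assumption: late night runs from 22:00 ...
LATE_NIGHT_END = 6     # ... up to, but not including, 06:00
MONTHLY_LIMIT = 5      # threshold mentioned in the table above


def late_night_counts(pages: list[tuple[str, datetime]]) -> Counter:
    """Count late-night pages per (responder, year, month)."""
    counts: Counter = Counter()
    for responder, paged_at in pages:
        if paged_at.hour >= LATE_NIGHT_START or paged_at.hour < LATE_NIGHT_END:
            counts[(responder, paged_at.year, paged_at.month)] += 1
    return counts


def overloaded_responders(pages: list[tuple[str, datetime]],
                          limit: int = MONTHLY_LIMIT) -> list[str]:
    """Return responders who reached the limit in at least one month."""
    counts = late_night_counts(pages)
    return sorted({who for (who, _year, _month), n in counts.items() if n >= limit})


pages = (
    [("alice", datetime(2026, 1, day, 2, 30)) for day in range(1, 6)]  # five 02:30 pages
    + [("bob", datetime(2026, 1, day, 23, 10)) for day in (3, 9)]      # two late-night pages
    + [("bob", datetime(2026, 1, day, 14, 0)) for day in (4, 5, 6)]    # daytime, not counted
)
print(overloaded_responders(pages))
```

In practice the pairs would come from the paging tool's export, and the result would feed rotation changes or alert cleanup.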

The nameplate problem seen at Japanese companies (attaching the name “SRE” to an existing ops team while manual midnight on-call continues, with no time or authority to write code and a toil rate of 95% after a year) drives home the lesson that how time is spent, not the nameplate, defines SRE. Netflix, by contrast, has been randomly killing production instances with Chaos Monkey since the 2010s and has kept serving through partial AWS outages. That gap is the difference between real SRE and SRE in name only.

SRE is engineering that sets out to kill toil. If you are still running things by hand, that is evidence you have not yet scaled.

AI-era perspective

When AI-driven development (vibe coding) and AI usage are taken as given, SRE evolves into operations done in collaboration with AI agents. The end state of toil reduction is AI making operational decisions autonomously, and products such as Datadog Bits AI, PagerDuty AIOps, and Resolve AI are already on the market.

| Favored in the AI era | Disfavored in the AI era |
| --- | --- |
| Code-ized Runbooks | Procedures in people’s heads |
| Structured logs / metrics | String-log-centric |
| Automated Toil | Manual work remaining |
| Quantitative reliability management | Sensory operations |
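A "code-ized Runbook" can be as small as a list of steps, each pairing a diagnostic check with a remediation. The sketch below runs against a simulated state dictionary; the `Step` type, the disk-full scenario, and both remediation functions are hypothetical stand-ins for real commands.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    applies: Callable[[dict], bool]    # diagnosis: is this step needed right now?
    remediate: Callable[[dict], None]  # action: changes the (simulated) system state


def free_disk(state: dict) -> None:
    state["disk_used_pct"] = 60  # stand-in for rotating or deleting old logs


def restart_service(state: dict) -> None:
    state["service_up"] = True   # stand-in for a real restart command


DISK_FULL_RUNBOOK = [
    Step("free disk space", lambda s: s["disk_used_pct"] > 90, free_disk),
    Step("restart service", lambda s: not s["service_up"], restart_service),
]


def run_runbook(steps: list[Step], state: dict) -> list[str]:
    """Run each step whose check applies; return the names of executed steps."""
    executed = []
    for step in steps:
        if step.applies(state):
            step.remediate(state)
            executed.append(step.name)
    return executed


state = {"disk_used_pct": 95, "service_up": False}
first = run_runbook(DISK_FULL_RUNBOOK, state)   # both steps are needed
second = run_runbook(DISK_FULL_RUNBOOK, state)  # nothing left to do
print(first, second)
```

Because every step checks before it acts, re-running the Runbook is harmless, which is the property that makes it safe to hand to an automated responder, human-written or AI-driven.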

The future picture of SRE is humans managing AI agents. AI handles first response, while humans concentrate on designing what the agents learn and the criteria they judge by. The AI-era SRE is becoming a fusion of Prompt Engineer and Systems Engineer.

The AI-era SRE entrusts operational automation to AI and focuses on strategic design.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague invites later questions like “why did we decide this again?”

  • SRE-org placement (centralized / distributed)
  • SLO management process (setup, review frequency)
  • Error-budget operational rules (freeze criteria)
  • Toil-reduction target (under 50%, automation investment rate)
  • Chaos engineering (frequency, scope)
  • PRR process (new-service introduction audit)
  • AI tool adoption (AIOps, auto-diagnosis)

Author’s note: cases that expose the gap between the “SRE nameplate” and real SRE

Whether SRE is a surface nameplate or a real cultural transformation largely decides the organization’s fate.

After Google launched its SRE team in 2003 and disclosed the practice in the 2016 book “Site Reliability Engineering,” companies worldwide followed. But, mostly at Japanese companies, case after case emerged in which an existing ops team’s business cards were simply changed to “SRE” while the reality remained manual night on-call and phone response. With no time or authority to write code, the toil rate is still 95% a year later, and SLOs are defined only on paper with no one watching. Such sobering stories from the field are still told today.

In contrast, Netflix is known for applying SRE thinking thoroughly. It has been randomly killing production instances with Chaos Monkey since the 2010s, and designing on the premise that things break in production has become standard there. As a result, Netflix has on multiple occasions kept serving through partial AWS outages. What this shows is that SRE is a matter of culture and of how time is spent, not of nameplates.

Calling yourself SRE does not make you SRE. What decides it is whether you have time to attack toil with code.

How to make the final call

The core of SRE is the idea of solving operations with engineering. Legacy manual-driven operations do not scale to modern systems with hundreds of microservices. Google’s SRE is a set of practices that turn operations into code and numbers (SLOs, error budgets, toil reduction, postmortems, chaos engineering), and it works through the combination of culture, process, and automation rather than through individual tools. “SRE just means changing the ops team’s nameplate” is the most common antipattern; real adoption requires changes to authority, culture, and hiring criteria.

Another decisive axis is the evolution toward SRE that manages AI agents. As Datadog Bits AI, PagerDuty AIOps, and Resolve AI take over first response, presentation of root-cause candidates, and automatic Runbook execution, human SREs concentrate on designing what the agents learn and how they judge, on strategic reliability investment, and on cultivating organizational culture. The fusion of Prompt Engineer and Systems Engineer is the AI-era picture of SRE.

Selection priorities

  1. Introduce in phases: go from Phase 1 (measurement foundation) to Phase 5 (chaos) in order, not all at once
  2. Keep toil under 50%: above that, SREs cannot do engineering work; invest 20% of each month in automation
  3. Mediate the dev/reliability balance via error budget: decide objectively whether to accelerate or freeze releases based on what remains
  4. Delegate routine response to AIOps: humans focus on strategy, improvement design, and agent management

“Solve operations with code, entrust to AI, focus on strategy.” This is the modern SRE image.

Summary

This article covered SRE practices, including main practices, Toil reduction, error-budget operations, chaos engineering, PRR, Platform Engineering, and AIOps collaboration.

Introduce in phases, keep toil under 50%, mediate the balance via error budget, and delegate routine response to AIOps. That is the practical answer for SRE practices in 2026.

Next time we’ll cover documentation (README, ADR, Runbook).
