DevOps Architecture

[DevOps Architecture] DevOps and SRE Overview - Speed and Stability Coexist

About this article

As the second installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article gives an overview of DevOps and SRE.

The practical view is that DevOps (a cultural movement since 2009) and SRE (a Google-originated engineering practice since 2003) climbed toward the same goal by different paths. This article covers core components of both - resolving the Wall of Confusion, SLOs, error budgets, toil reduction, postmortems - and organizes the modern consensus that separating dev and ops is now outdated.

What are DevOps and SRE in the first place

Picture condominium management. The construction company (dev) builds the building; a management company (ops) maintains it. Previously these two were completely separate. The construction company’s attitude was “done once built,” while the management company’s was “don’t change anything” - their goals were exact opposites.

DevOps is the paradigm shift of having the same team handle both construction and management. Builders take responsibility through operations, and operational insights feed back into the next design. SRE (Site Reliability Engineering) is Google’s engineering systematization of this philosophy — a methodology of “managing reliability numerically and reducing operational burden through automation.”

Without the DevOps/SRE mindset, the wall between dev and ops remains: every release sparks conflict, and incidents devolve into blame games.

Why DevOps and SRE are needed

Wall of Confusion (the organizational divide between dev and ops, whose goals and evaluation metrics were exact opposites): dev teams were evaluated by “number of features released fast,” ops teams by “system uptime.” With structurally conflicting goals, every release became a tug-of-war between “dev wanting to ship” and “ops wanting to stop,” and incidents devolved into mutual blame - a culture that was once standard across the industry.

To break this wall, the bidirectional crossover - dev teams taking responsibility through deployment and operation, and ops teams working in code - has been the defining movement of the last 15 years. Famous cultural reforms such as Amazon’s API Mandate (2002) and Netflix’s Freedom & Responsibility are known as cases that anticipated this crossover.

| Old world (~2009) | New world (post-DevOps/SRE) |
| --- | --- |
| Dev = speed, ops = stability | The same team sees both |
| Deploy = manual ops work | Deploy executed by pipeline |
| At incidents: “whose fault?” | At incidents: “how to prevent recurrence?” |
| Ops = midnight phone duty | Ops automated by code |

Are DevOps and SRE different things

The two are often confused, but they are easiest to organize as “different origins, different emphases.”

DevOps’s main battlefield is culture and organizational theory: a movement spreading CI/CD, automation, measurement, and sharing with the goal of dev-ops collaboration. SRE is a Google-originated engineering method: a methodology of driving ops decisions by numbers, with SLOs (Service Level Objectives, numerical reliability goals) and error budgets (allowable breakage budgets) at its core.

| Viewpoint | DevOps | SRE |
| --- | --- | --- |
| Origin | 2009 DevOpsDays (Belgium) | 2003 inside Google → 2016 book |
| Focus | Culture, organization, collaboration | Engineering reliability |
| Core weapons | CI/CD, automation, measurement, sharing | SLO, error budget, toil reduction |
| Typical role | DevOps engineer (cross-cutting) | SRE (reliability-dedicated) |
| Main battlefield | Process, pipeline | Production operations, reliability |

Google itself explains SRE as a concrete implementation of DevOps - a subset, not a rival. SRE is one way to implement DevOps.

DORA 4 metrics - the numbers separating strong and weak teams

DORA (DevOps Research and Assessment, the research team led by Dr. Nicole Forsgren et al., whose findings were systematized in the 2018 book “Accelerate”) investigated teams worldwide for nearly 10 years and concluded that the difference between strong and weak teams converges on 4 numbers. These 4 metrics have become the common language for measuring DevOps/SRE improvement results.

```mermaid
quadrantChart
    title DORA's finding "Speed and stability coexist"
    x-axis Low stability --> High stability
    y-axis Slow speed --> Fast speed
    quadrant-1 Elite (ideal)
    quadrant-2 Fast but unstable
    quadrant-3 Low (worst)
    quadrant-4 Cautious
    Elite (top 10%): [0.9, 0.9]
    High: [0.7, 0.7]
    Medium: [0.5, 0.5]
    Low (bottom 30%): [0.2, 0.2]
```
| Metric | Meaning | Elite (top 10%) | Low (bottom 30%) |
| --- | --- | --- | --- |
| Deploy frequency | How often production is shipped | Multiple per day | Less than monthly |
| Lead time for changes | Code commit → production | Less than 1 hour | More than 1 month |
| MTTR (Mean Time To Recovery) | Incident → recovery | Less than 1 hour | More than 1 month |
| Change failure rate | Rate of deploys causing problems | 0-15% | 46-60% |

What deserves attention is DORA’s finding that speed and stability coexist. Teams that release more frequently have lower failure rates and recover faster. The Elite profile is “fast and unbreakable,” not “slow and unbreakable” - the biggest impact of this research.
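The 4 metrics can be computed from a simple deploy log. Below is a minimal sketch; the log schema and sample data are illustrative assumptions, not a standard format, and real DORA measurement uses longer windows and survey-style buckets.

```python
# Illustrative sketch: computing the DORA 4 metrics from a hypothetical deploy log.
from datetime import datetime

# (commit_time, deploy_time, caused_incident, recovery_minutes) - assumed schema
deploys = [
    (datetime(2026, 1, 5, 9),  datetime(2026, 1, 5, 10),     False, 0),
    (datetime(2026, 1, 12, 14), datetime(2026, 1, 12, 15),    True, 45),
    (datetime(2026, 1, 20, 11), datetime(2026, 1, 20, 11, 30), False, 0),
    (datetime(2026, 1, 28, 16), datetime(2026, 1, 28, 17),    False, 0),
]

period_days = 30
deploy_frequency = len(deploys) / period_days                  # deploys per day
lead_times = [(d - c).total_seconds() / 3600 for c, d, _, _ in deploys]
median_lead_time_h = sorted(lead_times)[len(lead_times) // 2]  # hours, crude median
recoveries = [r for _, _, failed, r in deploys if failed]
change_failure_rate = len(recoveries) / len(deploys)           # 0.0 - 1.0
mttr_minutes = sum(recoveries) / len(recoveries) if recoveries else 0.0

print(f"{deploy_frequency:.2f} deploys/day, lead time {median_lead_time_h:.1f} h, "
      f"CFR {change_failure_rate:.0%}, MTTR {mttr_minutes:.0f} min")
```

On this sample data the team deploys roughly weekly with a one-hour lead time - solidly "High" territory, short of Elite only on deploy frequency.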

CALMS - the 5 pillars of DevOps

The standard framework for organizing DevOps as a cultural movement is CALMS. The table below shows which article in this chapter each item corresponds to.

| Pillar | Meaning | This chapter’s correspondence |
| --- | --- | --- |
| Culture | Don’t punish failures; share | Postmortems, blameless |
| Automation | Automate manual work | CI/CD, IaC, GitOps |
| Lean | Flow small and fast | Trunk-Based, Feature Flags |
| Measurement | Measure everything | DORA, SLO, observability |
| Sharing | Share learning | Runbooks, documentation, PRs |

Without Measurement in particular, the other four pillars degrade into “a movement that runs on mood.” Measuring first is DevOps’s first step.

Phased practice - priorities change with team maturity

“Do everything at once” is something no team can execute, so split priorities by maturity. The current practical standard is below.

| Phase | Team size | Top priority | Target (DORA equivalent) |
| --- | --- | --- | --- |
| 1. Startup | ~10 | CI (auto-test), config management, README | Release within 2 h / MTTR within 1 day |
| 2. Growth | 10-50 | CD automation, monitoring, alerts, on-call | 1 deploy/day / MTTR 1 hour |
| 3. Maturity | 50-200 | SLO operation, error budget, Feature Flags | Multiple deploys/day / MTTR 30 min |
| 4. Large-scale | 200+ | Platform Engineering / internal developer platform | Tens of deploys/day / MTTR 15 min |

Discussing SLOs in phase 1 is premature; conversely, still deploying manually in phase 3 is negligence. The practical rule of thumb: investing too early or too late both invite accidents.

90% of teams are in phases 1-2, many Elites in phase 3, and Platform Engineering (4) is still a minority today.

Platform Engineering - the recent main battlefield

Platform Engineering is the next wave of DevOps that spread rapidly from around 2020, with the core idea of “a dedicated team builds an internal developer platform (IDP), and app developers just ride on it.”

In the past, “dev teams do everything” was the DevOps ideal, but in practice cognitive load rose too much and many teams broke down. Platform Engineering is the role-rediscovery movement of “let’s place a platform team dedicated to designing developer experience (DX).”

| Viewpoint | Legacy DevOps | Platform Engineering |
| --- | --- | --- |
| Who | Each dev team | Platform team |
| What | All in-house | Provides shared platform |
| Users | Everyone grasps everything | Developers use via API |
| Representative case | Netflix until ~2019 | Spotify’s Backstage |

Backstage (the IDP framework Spotify open-sourced, released 2020) is the symbol of Platform Engineering.

SLO and error budget - SRE’s core in one sheet

Setting an SLO of “monthly availability 99.9%” automatically derives that about 43 minutes of downtime per month is allowed. This allowed downtime is the error budget, which lets you run the speed-stability tradeoff by numbers: attack while within budget, defend once it is exceeded.

| State | Decision |
| --- | --- |
| Plenty of error budget | Aggressive releases and experiments OK |
| Less than 20% budget left | Suppress new features, invest in stability |
| Budget exhausted | Release freeze, focus on incident-recurrence prevention |

This is SRE’s decisive invention: the fruitless debate of “stability or speed - which to prioritize?” is resolved automatically by numbers. Details are covered in a separate article.

100% availability is not the goal. The moment you aim for it, costs balloon infinitely - the paradox of SRE thinking.
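The error-budget arithmetic (99.9% of a 30-day month leaves about 43 minutes of allowed breakage) and the decision table can be sketched in a few lines. Function names and the exact threshold mapping below are illustrative assumptions:

```python
# Illustrative sketch of error-budget arithmetic and the decision table above.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per period, in minutes, for a given SLO (e.g. 0.999)."""
    return (1 - slo) * days * 24 * 60

def budget_decision(remaining_ratio: float) -> str:
    """Map the remaining-budget ratio (0.0-1.0) to a decision."""
    if remaining_ratio <= 0:
        return "release freeze, focus on recurrence prevention"
    if remaining_ratio < 0.2:
        return "suppress new features, invest in stability"
    return "aggressive release, experiments OK"

budget = error_budget_minutes(0.999)   # 43.2 minutes per 30-day month
downtime_so_far = 10.0                 # minutes consumed this month (example)
remaining = (budget - downtime_so_far) / budget
print(f"budget {budget:.1f} min, remaining {remaining:.0%}: {budget_decision(remaining)}")
```

The same function shows why one more nine is expensive: 99.95% leaves only 21.6 minutes, and 99.99% a mere 4.3 minutes per month.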

Historical incidents - all became DevOps/SRE lessons

Each principle of DevOps/SRE was designed backwards from actual large incidents. Three cases serve as lessons running through the entire chapter (incident details are in the appendix “Critical Incident Cases”).

| Incident | Lesson for DevOps/SRE |
| --- | --- |
| Knight Capital 2012 (45 min, $440M loss, bankruptcy) | Ban Feature-Flag reuse / Canary Releases / Kill Switch required |
| GitLab 2017 (rm -rf on prod DB, 4 of 5 backups broken) | Periodic restore drills / verify backups are restorable, not just taken |
| Slack 2022 (AWS config change on New Year’s Day, hours of outage) | Redundant dependencies and SLO independence / cascade failures in the multi-tenant era |

Modern DevOps/SRE principles are prescriptions derived backwards from past disasters. Copying the principles without reading the cases means you never feel why they are needed.

DevOps pitfalls and forbidden moves

I’ve often seen formalization that calls itself “doing DevOps” while missing the essence. Here are the landmines to avoid:

| Forbidden move | Why it’s bad |
| --- | --- |
| Make a dedicated DevOps team | Just adds a new silo; DevOps is a wall-breaking movement |
| Install tools without changing culture | Putting in Jenkins changes nothing if the organization stays the same |
| Raise release speed without setting SLOs | Nobody notices when things break; a big incident follows months later |
| Hunt for blame in postmortems | Nobody reports honestly afterward, and recurrence prevention fails |
| Reduce DevOps to “automation” | Forgetting Culture/Lean/Sharing means it doesn’t sustain |
| Idealize “everyone does everything” | Cognitive-load exhaustion; move to Platform Engineering instead |
| Continue manual operation under the SRE banner | Toil over 50% disqualifies a team as SRE; without time for automation, it isn’t SRE |
| Gamify DORA metrics | Distortion: deploy frequency rises while the contents stay empty |
| Don’t measure because “we’re special” | Teams that don’t measure never improve |
| Pursue 100% availability | Infinite cost and dev grinds to a halt; agreeing on allowable breakage via SLOs is SRE’s core |

Toil (manual, repetitive operational work that should be automatable) is SRE’s central concept; toil exceeding 50% of the team’s total work hours is the danger zone.

Specific numerical gates - DevOps health-check thresholds

In an area that tends to drift into abstract argument, here are current numerical guidelines. Use them to self-diagnose where your team stands.

| Item | Danger | Tolerable | Target |
| --- | --- | --- | --- |
| Deploy frequency | Less than monthly | Weekly | Multiple/day |
| Lead time (commit → prod) | Over 1 month | 1 week | Less than 1 hour |
| MTTR | Over 1 day | 1 hour | Less than 30 min |
| Change failure rate | Over 30% | 15% | Less than 5% |
| Toil ratio | Over 50% | 30% | Less than 20% |
| CI runtime (PR) | Over 30 min | 10 min | Less than 5 min |
| Production SLO compliance (last 30 days) | SLO not met | Met | 50%+ error budget remaining |

Teams that can’t speak in numbers can’t improve. Measure first, then talk.
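The thresholds can be turned into a small self-diagnosis script. A minimal sketch, covering only the “lower is better” rows (deploy frequency, where higher is better, would need the comparison inverted); the metric names and gate values below mirror the table but are otherwise illustrative:

```python
# Illustrative self-diagnosis against the numerical gates above.
# Each gate is (danger, tolerable): upper bounds for "lower is better" metrics.
GATES = {
    "lead_time_days":     (30, 7),     # danger if > 30 days, tolerable if <= 7
    "mttr_minutes":       (1440, 60),  # danger if > 1 day, tolerable if <= 1 h
    "change_failure_pct": (30, 15),
    "toil_pct":           (50, 30),
    "ci_runtime_minutes": (30, 10),
}

def rate(metric: str, value: float) -> str:
    """Classify a 'lower is better' metric as danger / tolerable / target."""
    danger, tolerable = GATES[metric]
    if value > danger:
        return "danger"
    if value > tolerable:
        return "tolerable"
    return "target"

team = {"lead_time_days": 5, "mttr_minutes": 90, "toil_pct": 55,
        "change_failure_pct": 12, "ci_runtime_minutes": 8}
for metric, value in team.items():
    print(f"{metric}: {value} -> {rate(metric, value)}")
```

In this example the toil ratio is the lone red flag: the hypothetical team ships fast but spends over half its time on manual operations.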

AI decision axes

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| Structured logs + OpenTelemetry | Natural-language logs + custom formats |
| Declarative via IaC/GitOps | Manual SSH + Excel manuals |
| Markdown + Git for docs | Confluence + Word, scattered |
| PRs/Issues as decision history | Discussions flowing away in Slack threads |
  1. Measure DORA 4 metrics first — teams that can’t speak in numbers can’t improve.
  2. Investment matched to phase — SLO too early in startup, manual deploy too late in maturity.
  3. Don’t create a “DevOps team” — don’t make new walls in a wall-breaking movement.
  4. Machine-readable operational data — that’s where the AI-era pipeline starts.
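As a concrete example of the “structured logs” point: emitting one JSON object per log line makes operational data machine-readable. A minimal sketch using Python’s standard logging module; the `JsonFormatter` class and the `ctx` field are illustrative assumptions, not a standard API:

```python
# Illustrative sketch: structured (JSON-lines) logging instead of free-form prose.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, easy for tools to parse."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "ctx", {}),   # structured fields, not prose
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("deploy")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The `extra` dict attaches the structured fields to the LogRecord.
log.info("deploy finished", extra={"ctx": {"service": "api", "duration_s": 42}})
```

A grep-able, queryable line like `{"level": "INFO", ..., "service": "api"}` is exactly the kind of operational data a pipeline (or an AI assistant) can consume without guessing at sentence structure.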

Author’s note - the aftermath of a mid-size SIer making a “DevOps team”

I’ll introduce a typical case repeatedly observed in the industry.

A mid-size SIer flying the flag of “promoting DevOps-ization” launched a dedicated 20-person DevOps promotion department and deployed Jenkins, Terraform, and Kubernetes company-wide - up to that point things progressed energetically. But half a year later, dev teams said “we ask the DevOps department to fix CI and wait 3 days,” while the DevOps department said “dev won’t learn Terraform, so it all flows to us,” and by all accounts the arrangement hardened into a new silo with both sides discontent. Tools were assembled but a new wall was added - a dead-center DevOps antipattern outcome.

In contrast, several companies thoroughly avoided using the word “DevOps” and leaned into a Platform Engineering shape - each dev team takes responsibility through deployment, while a cross-cutting platform team provides the foundation - quietly reaching DORA Elite over 2-3 years. DevOps turns into empty formality when you wave the banner; the front-runners thoroughly execute the principles without waving it.

A repeatedly observed industry rule: DevOps success stories rarely use the word DevOps. Paradoxically, the teams that don’t name it end up practicing DevOps principles more faithfully.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in 1-2 sentences. Starting work while these are still vague invariably invites later questions like “why did we decide this again?”

  • Where does the current team stand on DORA 4 metrics (measure first)
  • Which phase (1-startup to 4-large-scale) are we in now
  • Can SLOs be numerically agreed with business / still too early
  • Is there culture for blameless postmortems
  • Should we step into Platform Engineering (200+ as guideline)
  • Are machine-readable operational data (structured logs, IaC, Markdown runbooks) in place
  • Can we get by without creating a dedicated DevOps team

Summary

This article covered DevOps and SRE overview, including origins, relationship with SRE, DORA 4 metrics, CALMS, SLO and error budgets, Platform Engineering, and historical-incident lessons.

Measure DORA 4 metrics first, invest matched to phase, don’t create a DevOps team, organize machine-readable operational data. That is the practical answer for DevOps/SRE in 2026.

Next time we’ll cover version control (Git, branch strategy, monorepo).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.