About this article
As the second installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article gives an overview of DevOps and SRE.
The practical view is that DevOps (a cultural movement since 2009) and SRE (an engineering practice that originated at Google in 2003) climbed the same mountain by different paths. This article covers the building blocks of both - breaking down the Wall of Confusion, SLOs, error budgets, toil reduction, postmortems - and lays out the modern consensus that "separating dev and ops is now outdated."
What are DevOps and SRE in the first place
Picture condominium management. The construction company (dev) builds the building; a management company (ops) maintains it. Previously these two were completely separated. The construction company's attitude was "done once built," while the management company's was "don't change anything": their goals were exact opposites.
DevOps is the paradigm shift of having the same team handle both construction and management. Builders take responsibility through operations, and operational insights feed back into the next design. SRE (Site Reliability Engineering) is Google's engineering systematization of this philosophy: a methodology of "managing reliability numerically and reducing operational burden through automation."
Without the DevOps/SRE mindset, the wall between dev and ops remains: every release sparks conflict, and incidents devolve into blame games.
Why DevOps and SRE are needed
The Wall of Confusion is the organizational divide between dev and ops, whose goals and evaluation metrics were exact opposites: dev teams were evaluated on "how many features ship, and how fast," ops teams on "system uptime." With structurally conflicting goals, every release became a tug-of-war between "dev wants to ship" and "ops wants to stop," and incidents devolved into mutual blame. This culture was once standard in the industry.
Breaking this wall has been the movement of the last 15 years, a crossing in both directions: "dev teams take responsibility through deployment and operation" and "ops teams work in code." Well-known cultural reforms such as Amazon's API Mandate (2002) and Netflix's Freedom & Responsibility are regarded as early cases that anticipated this crossing.
| Old world (~2009) | New world (post-DevOps/SRE) |
|---|---|
| Dev = speed, ops = stability | Same team sees both |
| Deploy = ops manual work | Deploy executed by pipeline |
| At incidents: "whose fault?" | At incidents: "how to prevent recurrence?" |
| Ops = midnight phone duty | Ops automated by code |
Are DevOps and SRE different things
The two are often confused, but they become easy to tell apart once you pin down that they have different origins and different emphases.
DevOps's main battlefield is culture and organization: it is a movement that spreads CI/CD, automation, measurement, and sharing with the goal of "dev-ops collaboration." SRE is an engineering method that originated at Google: it runs ops decisions by numbers, with the SLO (Service Level Objective, the numerical reliability target) and the error budget (the amount of breakage the SLO allows) at its core.
| Viewpoint | DevOps | SRE |
|---|---|---|
| Origin | 2009 DevOpsDays (Belgium) | 2003 inside Google → 2016 book |
| Focus | Culture, organization, collaboration | Engineering reliability |
| Core weapons | CI/CD, automation, measurement, sharing | SLO, error budget, toil reduction |
| Typical role | DevOps engineer (cross-cutting) | SRE (reliability-dedicated) |
| Main battlefield | Process, pipeline | Production operations, reliability |
Google itself describes SRE as a concrete implementation of DevOps ("class SRE implements DevOps"). The two are not opposed: SRE is one way to implement DevOps.
DORA 4 metrics - the numbers separating strong and weak teams
DORA (DevOps Research and Assessment, the research team led by Dr. Nicole Forsgren et al., systematized in the 2018 book "Accelerate") investigated teams worldwide for nearly 10 years and concluded that the difference between strong and weak teams converges to 4 numbers. These 4 metrics have become the common language for measuring DevOps/SRE improvement results.
quadrantChart
title DORA's finding "Speed and stability coexist"
x-axis Low stability --> High stability
y-axis Slow speed --> Fast speed
quadrant-1 Elite (ideal)
quadrant-2 Fast but unstable
quadrant-3 Low (worst)
quadrant-4 Cautious
Elite (top 10%): [0.9, 0.9]
High: [0.7, 0.7]
Medium: [0.5, 0.5]
Low (bottom 30%): [0.2, 0.2]
| Metric | Meaning | Elite (top 10%) | Low (bottom 30%) |
|---|---|---|---|
| Deploy frequency | How often production is shipped | Multiple per day | Less than monthly |
| Lead time for changes | Code commit → production | Less than 1 hour | More than 1 month |
| MTTR (Mean Time To Recovery) | Incident → recovery | Less than 1 hour | 1 week to 1 month |
| Change failure rate | Rate of deploys causing problems | 0-15% | 46-60% |
What deserves attention is DORA's finding that speed and stability coexist. Teams that release more frequently also have lower failure rates and faster recovery. Elite teams are fast and unbreakable, not "slow and unbreakable": that is the biggest impact of this research.
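To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deploy records. The `Deploy` record shape and the `dora_metrics` function are hypothetical names introduced for illustration; they are not part of DORA's tooling, and a real team would pull this data from its CI/CD system and incident tracker. The sketch also assumes lead time is summarized by the median and recovery time by the mean.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape for this sketch)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production problem?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four metrics over an observation window of `window_days` days."""
    if not deploys:
        raise ValueError("no deploys in the window")
    failures = [d for d in deploys if d.failed]
    # Recovery time is measured from the failing deploy to restoration.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
    }


# Usage: four deploys over two days, one of which failed and was restored.
t0 = datetime(2026, 1, 1, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(minutes=30)),
    Deploy(t0, t0 + timedelta(minutes=50)),
    Deploy(t0, t0 + timedelta(minutes=40), failed=True,
           restored_at=t0 + timedelta(minutes=60)),
    Deploy(t0, t0 + timedelta(minutes=20)),
]
metrics = dora_metrics(deploys, window_days=2)
```

Keeping all four numbers in one function is deliberate: looking at deploy frequency alone invites gaming, while the failure rate and recovery time computed from the same records keep speed and stability side by side.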
CALMS - the 5 pillars of DevOps
The standard framework for organizing DevOps as a "cultural movement" is CALMS. The table below shows which topics of this chapter each pillar corresponds to.
| Pillar | Meaning | Corresponding topics in this chapter |
|---|---|---|
| Culture | Don't punish failures, share | Postmortems, blameless |
| Automation | Automate manual work | CI/CD, IaC, GitOps |
| Lean | Flow small and fast | Trunk Based, Feature Flag |
| Measurement | Measure everything | DORA, SLO, observability |
| Sharing | Share learning | Runbooks, documentation, PRs |
Without Measurement in particular, the other four turn into "a movement that runs on mood." Measuring first is the first step of DevOps.
Phased practice - priorities change with team maturity
"Do everything" is something no one can execute, so split priorities by maturity. The current practical standard is as follows.
| Phase | Team size | Top priority | Target (DORA equivalent) |
|---|---|---|---|
| 1. Startup | ~10 | CI (auto-test), config management, README | Release within 2h / MTTR within 1 day |
| 2. Growth | 10-50 | CD automation, monitoring, alerts, on-call | 1 deploy/day / MTTR 1 hour |
| 3. Maturity | 50-200 | SLO operation, error budget, Feature Flag | Multiple deploys/day / MTTR 30 min |
| 4. Large-scale | 200+ | Platform Engineering / internal developer platform | Tens of deploys/day / MTTR 15 min |
Discussing SLOs in phase 1 is premature; conversely, still deploying manually in phase 3 is negligence. In practice, investing too early and investing too late both lead to accidents.
Roughly 90% of teams are in phases 1-2, many Elite teams are in phase 3, and Platform Engineering (phase 4) is still a minority today.
Platform Engineering - the recent main battlefield
Platform Engineering is the next wave of DevOps that spread rapidly from around 2020, with the core idea that "a dedicated team builds an internal developer platform (IDP), and app developers just ride on it."
In the past, "dev teams do everything" was the DevOps ideal, but in practice cognitive load rose too much and many teams broke down. Platform Engineering is a rediscovery of roles: "let's set up a platform team dedicated to designing the developer experience (DX)."
| Viewpoint | Legacy DevOps | Platform Engineering |
|---|---|---|
| Who | Each dev team | Platform team |
| What | All in-house | Provides shared platform |
| Users | Everyone grasps everything | Developers use via API |
| Representative case | Netflix until ~2019 | Spotify's Backstage |
Backstage (the IDP framework Spotify open-sourced, released 2020) is the symbol of Platform Engineering.
SLO and error budget - SRE's core on one page
Setting an SLO of "99.9% monthly availability" automatically implies that about 43 minutes of downtime per month is allowed (30 days × 24 h × 60 min × 0.1% ≈ 43.2 min). This "allowed downtime" is the error budget, and it lets you run the speed-stability tradeoff by numbers: go on the offensive while budget remains, switch to defense once it is exceeded.
| State | Decision |
|---|---|
| Plenty of error budget | Aggressive release, experiments OK |
| Less than 20% budget left | Suppress new features, invest in stability |
| Budget exhausted | Release freeze, focus on incident-recurrence prevention |
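The arithmetic and the decision table above can be sketched in a few lines. The function names `error_budget` and `release_decision` are hypothetical, and the sketch assumes a 30-day window and a purely time-based availability SLO; request-based SLOs would count failed requests instead of minutes.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Allowed downtime for an availability SLO (e.g. 0.999) over the window."""
    if not 0.0 < slo < 1.0:
        # 100% is rejected on purpose: it would leave no budget at all.
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def release_decision(budget: timedelta, downtime: timedelta) -> str:
    """Map the remaining budget to the three states in the table above."""
    remaining = 1.0 - downtime / budget  # fraction of the budget still unspent
    if remaining <= 0:
        return "freeze releases, focus on recurrence prevention"
    if remaining < 0.20:
        return "suppress new features, invest in stability"
    return "release aggressively, experiments OK"


# Usage: a 99.9% monthly SLO gives a budget of roughly 43 minutes.
budget = error_budget(0.999)
plenty = release_decision(budget, timedelta(minutes=10))
low = release_decision(budget, timedelta(minutes=40))
exhausted = release_decision(budget, timedelta(minutes=45))
```

The point of encoding the policy is that the release decision becomes a function of measured downtime rather than of who argues loudest in the meeting.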
This is SRE's decisive invention: the fruitless debate over "stability or speed, which comes first?" is settled automatically by numbers. Details are covered in another article.
100% availability is not the goal. The moment you aim for it, costs balloon without limit: this is the paradox at the heart of SRE thinking.
Historical incidents - all became DevOps/SRE lessons
Each principle of DevOps/SRE was derived backwards from actual large incidents. Here are just 3 that run through the entire chapter as lessons (incident details are in the appendix "Critical Incident Cases").
| Incident | Lesson for DevOps/SRE |
|---|---|
| Knight Capital 2012 (45 min, $440M loss, near-bankruptcy and a forced rescue) | Never reuse feature flags / Canary Release / Kill Switch required |
| GitLab 2017 (rm -rf on the production DB; none of the 5 backup/replication mechanisms worked reliably) | Periodic restore drills / verify that backups are not just taken but restorable |
| Slack 2021 (network saturation on AWS on January 4, the first workday of the year, hours of outage) | Redundancy of dependencies and SLO independence / cascade failures in multi-tenant era |
Modern DevOps/SRE principles are prescriptions derived backwards from past disasters. If you copy only the principles without reading the cases, you never feel why they are needed.
DevOps pitfalls and forbidden moves
I've often seen empty formalism that calls itself "doing DevOps" while missing the essence. Here are the landmines to avoid:
| Forbidden move | Why it's bad |
|---|---|
| Create a dedicated DevOps team ("DevOps means having a DevOps team") | It just adds a new silo. DevOps is a wall-breaking movement |
| Install tools without changing the culture | Putting in Jenkins changes nothing if the organization stays the same |
| Raise release speed without setting SLOs | Breakage goes unnoticed and surfaces as a major incident months later |
| Hunt for blame in postmortems | No one reports honestly afterwards, so recurrence prevention fails |
| Reduce DevOps to automation | Forgetting Culture/Lean/Sharing means it doesn't last |
| Idealize "everyone does everything" | Cognitive load exhausts the team; move to Platform Engineering instead |
| Continue manual operation under the SRE banner | A team with toil over 50% has no time for automation and does not qualify as SRE |
| Game the DORA metrics | Deploy frequency rises while the deploys themselves are empty |
| Skip measurement because "we're special" | Teams that don't measure never improve |
| Aim for 100% availability | Costs become unbounded and development stops. Agreeing on allowable breakage via SLOs is SRE's core |
Toil (operational work repeated manually that should be automatable) is SRE's central concept; over 50% of the team's total work hours is the danger zone.
Specific numerical gates - DevOps health-check thresholds
In an area that tends to drift into abstract argument, here are current numerical guidelines. Use them for a "where are we?" self-diagnosis.
| Item | Danger | Tolerable | Target |
|---|---|---|---|
| Deploy frequency | Less than monthly | Weekly | Multiple/day |
| Lead time (commit → prod) | Over 1 month | 1 week | Less than 1 hour |
| MTTR | Over 1 day | 1 hour | Less than 30 min |
| Change failure rate | Over 30% | 15% | Less than 5% |
| Toil ratio | Over 50% | 30% | Less than 20% |
| CI runtime (PR) | Over 30 min | 10 min | Less than 5 min |
| Production SLO compliance (recent 30 days) | SLO not met | Met | 50%+ error budget remaining |
Teams that can't speak in numbers can't improve. Measure first, then talk.
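As a rough illustration, the thresholds above can be turned into an automated self-diagnosis. This sketch covers only the "lower is better" rows; the `GATES` table and the `grade` function are hypothetical names, "1 month" is approximated as 30 days, and anything between the Danger and Target columns is treated as tolerable, which is an assumption of this sketch rather than something the table states.

```python
from datetime import timedelta

# (danger_above, target_below) per metric, copied from the table above.
# Deploy frequency and SLO compliance are omitted: they are not "lower is better".
GATES = {
    "lead_time": (timedelta(days=30), timedelta(hours=1)),
    "mttr": (timedelta(days=1), timedelta(minutes=30)),
    "change_failure_rate": (0.30, 0.05),
    "toil_ratio": (0.50, 0.20),
    "ci_runtime": (timedelta(minutes=30), timedelta(minutes=5)),
}


def grade(metric: str, value) -> str:
    """Classify one measurement as 'danger', 'tolerable', or 'target'."""
    danger_above, target_below = GATES[metric]
    if value > danger_above:
        return "danger"
    if value < target_below:
        return "target"
    return "tolerable"  # assumption: everything in between counts as tolerable


# Usage: grade a hypothetical team's current measurements.
measurements = {
    "lead_time": timedelta(days=3),
    "mttr": timedelta(minutes=20),
    "change_failure_rate": 0.35,
    "toil_ratio": 0.30,
    "ci_runtime": timedelta(minutes=12),
}
report = {name: grade(name, value) for name, value in measurements.items()}
```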
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Structured logs + OpenTelemetry | Natural-language logs + custom format |
| Declarative via IaC/GitOps | Manual SSH + Excel manuals |
| Markdown + Git for docs | Confluence + Word, scattered |
| PRs/Issues as decision history | Discussions flowing in Slack threads |
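To illustrate the first row, here is a minimal sketch of structured logging using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `make_logger` helper, the `extra={"fields": ...}` convention, and the field names (`service`, `duration_s`, `status`) are all assumptions made for this example. It does not use OpenTelemetry itself, and it omits timestamps to keep the output deterministic.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}},
        # a convention of this sketch, not a logging-module feature.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate output via the root logger
    return logger


# Usage: emit one event and parse it back, as a machine consumer would.
buf = io.StringIO()
log = make_logger("deploy", buf)
log.info("deploy finished",
         extra={"fields": {"service": "checkout", "duration_s": 42, "status": "ok"}})
event = json.loads(buf.getvalue().strip())
```

The same event written as a free-form sentence would need a regex per log line to extract `duration_s`; as JSON, both a dashboard and an AI agent can query it directly.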
- Measure DORA 4 metrics first: teams that can't speak in numbers can't improve.
- Investment matched to phase: SLO is too early in startup, manual deploy is too late in maturity.
- Don't create a "DevOps team": don't build new walls in a wall-breaking movement.
- Machine-readable operational data: that's where the AI-era pipeline starts.
Author's note - the aftermath of a mid-size systems integrator creating a "DevOps team"
Here is a typical case that is observed repeatedly in the industry.
A mid-size systems integrator, flying the flag of "promoting DevOps," launched a dedicated 20-person DevOps promotion department and rolled out Jenkins, Terraform, and Kubernetes company-wide. Up to this point things moved energetically. But half a year later, dev teams were saying "we ask the DevOps department to fix CI and wait 3 days," while the DevOps department was saying "dev won't learn Terraform, so everything flows to us." Reportedly the department hardened into a new silo, with both sides dissatisfied. The tools were in place but a new wall had been added: a textbook DevOps antipattern.
In contrast, several companies deliberately avoided the word "DevOps" and leaned into the Platform Engineering shape, in which each dev team takes responsibility through deployment while a cross-cutting platform team provides the foundation, and quietly reached DORA Elite over 2-3 years. DevOps turns into an empty formality when you wave the banner; the front-runners execute the principles thoroughly without waving it.
A rule observed repeatedly in the industry: DevOps success stories rarely use the word DevOps. Paradoxically, the teams that don't name it end up practicing its principles more faithfully.
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these are still vague always invites questions later, such as "why did we decide this again?"
- Where does the current team stand on DORA 4 metrics (measure first)
- Which phase (1-startup to 4-large-scale) are we in now
- Can SLOs be numerically agreed with business / still too early
- Is there culture for blameless postmortems
- Should we step into Platform Engineering (200+ as guideline)
- Are machine-readable operational data (structured logs, IaC, Markdown runbooks) in place
- Can we get by without creating a dedicated DevOps team
Summary
This article gave an overview of DevOps and SRE: their origins and relationship, the DORA 4 metrics, CALMS, SLOs and error budgets, Platform Engineering, and lessons from historical incidents.
Measure the DORA 4 metrics first, invest in line with your phase, don't create a DevOps team, and organize machine-readable operational data. That is the practical answer for DevOps/SRE in 2026.
Next time we'll cover version control (Git, branch strategy, monorepo).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (55/89)