DevOps Architecture

DevOps Architecture Overview — One Pipeline for Build, Ship, and Run

About this article

This is the first article in the “DevOps Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of DevOps and operations architecture.

It treats the “build machinery” (VCS, CI/CD, test, review, dev environment) and the “keep-it-running machinery” (monitoring, logs, SLO, incident, SRE) as a single connected lifecycle. With DevOps and SRE adoption, the line between dev and ops has dissolved; designing them as separate jobs is obsolete as of 2026. This article works as the map for all 16 articles in the category.

What is DevOps architecture in the first place

Picture a factory production line. If the product-design department and the line-operations department were completely separated, problems like “can’t build to spec” and “the line is down but design doesn’t know” would happen constantly. Unifying both into a single line is modern factory management.

DevOps architecture is the same idea. It is the discipline of designing the code-writing machinery (VCS, CI/CD, test, review) and the keep-it-running machinery (monitoring, logs, SLO, incident response) as a single connected lifecycle.

If dev and ops are separate, every release triggers a tug-of-war between “dev wants to ship” and “ops wants to stop”, and incidents devolve into blame games.

Why design dev and ops as one

The same code flows from dev to production in a straight line

VCS -> CI -> deploy -> monitoring all sit on a single pipeline. In modern software the stages are so tightly integrated that optimizing only one side achieves little.

We measure on the same metrics now

The four DORA metrics (DORA is DevOps Research and Assessment, Google’s published team study) are deploy frequency, lead time, MTTR, and change-failure rate. They assume that dev speed and ops stability get measured by the same formula. Improving only one side won’t move the numbers.

“Operations as code” is the AI-era assumption

IaC (Infrastructure as Code) and GitOps (Git-driven operations automation) are now the mainstream, and operations runs on the same skill set as development. Operations that depend on SSH-ing in and hand-editing config files are debt in the AI era.
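
To make “declarative” concrete, here is a minimal sketch of the reconcile idea behind GitOps: compare the state declared in Git with the state actually running, and derive the actions that close the gap. The `Service` record, the `plan` function, and the sample services are hypothetical names invented for illustration; real tools do far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical desired-state record, as it might be declared in Git."""
    name: str
    image: str
    replicas: int


def plan(desired: dict[str, Service], actual: dict[str, Service]) -> list[str]:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name)
        if have is None:
            actions.append(f"create {name}")
        elif have != want:
            actions.append(f"update {name}")
    # Anything running that is no longer declared gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions


desired = {
    "api": Service("api", "api:1.4.0", replicas=3),
    "worker": Service("worker", "worker:2.0.1", replicas=2),
}
actual = {
    "api": Service("api", "api:1.3.9", replicas=3),          # drifted from Git
    "legacy": Service("legacy", "legacy:0.9", replicas=1),   # no longer declared
}

for action in plan(desired, actual):
    print(action)
```

The point of the design is that nobody types the fix into a server. The change is a commit, and the diff between declared and actual state is computed, reviewed, and applied by machinery.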

The dev/ops dichotomy is an afterimage of the old org chart. In practice it’s now one river.

The full lifecycle covered in this chapter

```mermaid
flowchart TB
    subgraph DEV["Development"]
        VCS[VCS] --> ENV[Dev environment] --> REV[Code review] --> TEST[Test] --> CI[CI]
    end
    subgraph REL["Release"]
        DEPLOY[Deploy strategy]
    end
    subgraph OPS["Operations"]
        OBS[Monitoring &<br/>Observability] --> LOG[Logs] --> SLO[SLO/SLI] --> INC[Incident response]
    end
    subgraph IMP["Continuous improvement"]
        SRE[SRE practices]
    end
    subgraph CROSS["Cross-cutting"]
        DOC[Documentation]
        TICKET[Tickets]
    end
    DEV --> REL --> OPS --> IMP
    CROSS -.-> DEV
    CROSS -.-> OPS
    classDef dev fill:#dbeafe,stroke:#2563eb;
    classDef rel fill:#fef3c7,stroke:#d97706;
    classDef ops fill:#fae8ff,stroke:#a21caf;
    classDef imp fill:#dcfce7,stroke:#16a34a;
    classDef cross fill:#f0f9ff,stroke:#0369a1;
    class DEV,VCS,ENV,REV,TEST,CI dev;
    class REL,DEPLOY rel;
    class OPS,OBS,LOG,SLO,INC ops;
    class IMP,SRE imp;
    class CROSS,DOC,TICKET cross;
```

Read this chapter in order, following the diagram from top to bottom, and you have the full path from “becomes code” to “is delivered” to “keeps running” for one application. Each article stands alone, but starting with DevOps & SRE: The Big Picture makes the ordering click.

Article ordering

| # | Article | Stage |
| --- | --- | --- |
| 01 | DevOps & SRE: The Big Picture | Map of the chapter |
| 02 | VCS | Git, branching strategy |
| 03 | Dev environment & local execution | Developer experience |
| 04 | Code review | PR operation |
| 05 | Test design | Automated test strategy |
| 06 | CI/CD | Pipeline design |
| 07 | Deploy strategy | Canary, Blue-Green |
| 08 | Monitoring & observability | Metrics, traces |
| 09 | Logging design | Structured logs |
| 10 | SLOs and SLIs | Reliability targets |
| 11 | Incident response | On-call, postmortems |
| 12 | SRE practices | Continuous improvement, toil reduction |
| 13 | Documentation | Cross-cutting, long-lived |
| 14 | Tickets and project management | Cross-cutting, decisions |

What you must decide 1: development process

| Item | Examples |
| --- | --- |
| Git hosting | GitHub / GitLab / Bitbucket |
| Branching | GitHub Flow / Trunk-Based / GitFlow |
| CI/CD | GitHub Actions / GitLab CI / CircleCI |
| Test pyramid | Unit / integration / E2E ratios |
| Review policy | 2-approver / CODEOWNERS / merge queue |
| Dev environment | Docker Compose / Dev Container / cloud IDE |
| Documentation home | In-repo md / Notion / Confluence |

What you must decide 2: operations

| Item | Examples |
| --- | --- |
| Monitoring tool | Prometheus / Datadog / New Relic |
| Logging | CloudWatch Logs / Loki / Splunk |
| Distributed tracing | OpenTelemetry / Jaeger / X-Ray |
| SLO/SLI | 99.9% availability / p99 (response time excluding the slowest 1%) |
| Alerting | Static thresholds / anomaly detection / SLO burn rate |
| Notifications | PagerDuty / Slack / Opsgenie |
| On-call | 24/7 / business hours / weekly rotation |
| Error-budget operation | Freeze releases on overrun? |

What you must decide 3: release & cross-cutting

| Item | Examples |
| --- | --- |
| Deploy strategy | Blue-Green / Canary / Rolling |
| Feature flags | LaunchDarkly / Unleash / DIY |
| Rollback policy | Auto / manual / not possible |
| Backup | Frequency / retention / generations |
| Restore drills | Annual / quarterly / monthly |
| Capacity planning | Auto-scale / manual review |
| Tickets | Jira / Linear / GitHub Projects |

Service-type × maturity ladder

Note: industry rates as of April 2026. Periodic refresh required.

DevOps investment levels vary heavily by service type. Both running finance-grade SRE on an MVP and leaving manual deploys on a payment system are sources of incidents.

| Service type | SLO | Deploy | Monitoring | On-call | Annual ops cost |
| --- | --- | --- | --- | --- | --- |
| Internal tool | 99% | Manual or light CD | CloudWatch standard | Business hours only | ~$1k |
| General B2C web | 99.9% | CD + Canary | Datadog free / Grafana Cloud | 2-3 part-time + PagerDuty | ~$10k |
| B2B SaaS | 99.95% | Multiple/day + feature flags | Datadog / New Relic | 2-3 dedicated SREs | ~$100k |
| Finance / payment | 99.99% | Strict staged release | SIEM + UEBA + APM | 24/7 SRE + SOC | ~$1M+ |
| Telco / utilities | 99.999% | Quarterly / annual | Enterprise integrated | Follow-the-Sun | ~$10M+ |

Building for 99.99% costs several times more than building for 99.9%. Without a numeric agreement with the business, “as high as possible” is the road to bankruptcy. 100% is not a goal. That idea threads through the whole chapter.

An SLO is a numeric agreement with the business. A requirement phrased as “don’t go down” can never be negotiated to a conclusion.
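
To show what each extra “nine” buys, the sketch below converts the SLO levels from the ladder into allowed downtime. The function name and the 30-day window are assumptions made for illustration; a real SLO window may be a calendar month or a rolling period.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# SLO levels taken from the service-type ladder above.
for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.2f} min per 30 days")
```

Each step up the ladder shrinks the allowance sharply: roughly 7 hours at 99%, about 43 minutes at 99.9%, and well under a minute at 99.999%. That shrinking allowance is what the business is paying for.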

The three pillars of operations design

The core of operations is three things: monitoring, logs, and distributed tracing. Treating them as one is called observability (a design philosophy that lets you investigate unknown problems after the fact). Missing any one turns the system into a black box.

| Pillar | Role | Tools |
| --- | --- | --- |
| Monitoring | Visualize state in numbers | Prometheus / Datadog / CloudWatch |
| Logging | Record events as text | Loki / Splunk / CloudWatch Logs |
| Distributed tracing | Trace request paths | Jaeger / Tempo / X-Ray |

The current standard is to send unified data through OpenTelemetry (the standard spec for monitoring data) and view it across tools in Grafana or Datadog. The first decision is not the tool — it’s standardizing instrumentation.

SRE’s core — agree numerically on “how much breakage is okay”

The substance of SRE comes down to the SLO (Service Level Objective, an internally agreed reliability target) and the error budget (the “how much breakage is OK” within the SLO). With a 99.9% monthly availability SLO, the allowed downtime is 0.1% of a 30-day month: 43,200 minutes × 0.001 ≈ 43 minutes. That is the error budget.

Within budget, push releases. Past budget, freeze releases and focus on stabilization. That’s the SRE method of “running speed and reliability on the same metric.”
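
The ship-or-freeze rule can be sketched as a few lines of code. This version counts failed requests rather than downtime minutes, which is an assumption made for illustration; the class and method names are hypothetical, and a real policy would usually add warning levels before a hard freeze.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests seen in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        """How many failures the SLO tolerates in this window."""
        return self.total_requests * (1.0 - self.slo)

    @property
    def remaining_ratio(self) -> float:
        """Share of the budget still unspent; negative means overrun."""
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def release_policy(self) -> str:
        """Within budget, keep shipping; past budget, freeze and stabilize."""
        return "ship" if self.remaining_ratio > 0 else "freeze"


healthy = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
overrun = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=1_200)

print(healthy.release_policy(), round(healthy.remaining_ratio, 2))
print(overrun.release_policy(), round(overrun.remaining_ratio, 2))
```

Dev and ops read the same number here. One side sees how much room is left to ship, the other sees how close the service is to breaking its promise.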

| Concept | Meaning |
| --- | --- |
| SLI (Service Level Indicator) | Measured value (response time, success rate, …) |
| SLO (Service Level Objective) | Internally agreed target |
| SLA (Service Level Agreement) | Customer contract (compensation if missed) |
| Error budget | “How much breakage is OK” within the SLO |

100% availability is impossible. Agree numerically, and trade off speed and reliability — the core of SRE.

DORA four metrics — team health check

Google’s DevOps Research & Assessment program distilled the gap between strong and weak teams into four numbers. That DevOps and SRE are measured with the same formula is the foundation of this chapter’s framing.

| Metric | Elite (top 10%) | Low |
| --- | --- | --- |
| Deploy frequency | Multiple per day | Less than monthly |
| Change lead time | < 1 hour | > 1 month |
| MTTR (Mean Time To Recovery) | < 1 hour | > 1 month |
| Change-failure rate | 0-15% | 46-60% |

Detail and improvement priority are in DevOps & SRE: The Big Picture. Read each chapter article as a piece moving one of the DORA metrics.
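
As a rough illustration of how the four numbers fall out of ordinary deploy records, here is a minimal sketch. The `Deploy` record and its field names are hypothetical, the sample data is invented, and it measures lead time from first commit to production, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # change reached production
    failed: bool = False                    # caused a production failure
    restored_at: Optional[datetime] = None  # service restored, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over a non-empty list of deploys."""
    if not deploys:
        raise ValueError("need at least one deploy")
    hour = timedelta(hours=1)
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at) / hour
        for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(
            (d.deployed_at - d.committed_at) / hour for d in deploys
        ),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else 0.0
        ),
    }


t0 = datetime(2026, 4, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deploy(t0, t0 + 2 * hour),
    Deploy(t0 + day, t0 + day + 4 * hour, failed=True,
           restored_at=t0 + day + 5 * hour),
    Deploy(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deploy(t0 + 3 * day, t0 + 3 * day + 8 * hour),
]

metrics = dora_metrics(deploys, window_days=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Two of the four fields come from the dev side (commit and deploy times) and two from the ops side (failure and restore times), which is the whole argument for treating them as one dataset.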

Architecture-level traps

| Forbidden move | Why |
| --- | --- |
| Monitoring/logs bolted on after | Cannot identify causes during incidents; days of guessing |
| Alerts only on static thresholds (CPU 80%) | False positives under varying load; switch to SLO burn rate |
| Targeting 100% availability | Infinite cost; buy speed with the error budget |
| One veteran handles incidents | Collapses when they leave; Runbook-as-code mandatory |
| Postmortems as blame hunts | Information hidden; Blameless is the rule |
| Big releases without feature flags | The Knight Capital 2012 pattern ($440M loss in 45 minutes) |
| DB change and code deploy together | Use expand/contract; keep rollback possible |
| SRE in name only, manual ops | Toil > 50% is the danger zone; without changing how hands work, it isn’t SRE |
| Dev team ships without knowing ops | First breaks in production; joining the on-call rotation is the fastest school |
| CI runs but isn’t a gate | Decoration if merges go through anyway; blocking is the assumption |
| “Let the field decide the dev process freely” and leave it alone | Adding people increases incidents; inverse correlation |
| “Outages should be zero” (pursuing perfection) | Investing in MTTR reduction yields better reliability and economics |

DevOps is decided before you start building, not after you finish. Bolting on costs 10x.
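
The “switch to SLO burn rate” advice in the table can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0 means the budget lasts exactly the SLO window. The two-window check and the 14.4 threshold follow a commonly cited pattern (at that rate, one hour burns about 2% of a 30-day budget), but the function names and sample numbers here are assumptions for illustration.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How fast the error budget is burning; 1.0 spends it exactly over the SLO window."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo)


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) filters out brief blips; the short
    window (e.g. 5 minutes) confirms the problem is still happening.
    Each window is an (errors, requests) pair.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


# Sustained incident: 2% errors over the hour, about 1.9% in the last 5 minutes.
print(should_page(long_window=(2000, 100_000), short_window=(150, 8_000)))

# Already recovering: the hour looks mild even though the last 5 minutes are noisy.
print(should_page(long_window=(300, 100_000), short_window=(150, 8_000)))
```

Unlike a CPU threshold, this alert fires in proportion to how much of the agreed reliability is actually being lost, regardless of how busy the machines are.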

AI decision axes

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| Structured logs (JSON), OpenTelemetry | Human-targeted prose logs |
| Declarative management via IaC / GitOps | Manual SSH operations |
| Runbooks in Markdown, Git-managed | Confluence and oral tradition |
| Prometheus, standard metrics | Custom monitoring schemas |

1. Design dev process and ops as one: DORA measures both with one formula.
2. Decide monitoring, logs, and SLOs at the most upstream point: bolting on costs 10x.
3. Use SLOs and error budgets to agree on speed × reliability: 100% is not a goal.
4. Produce machine-readable operational data: structured logs, IaC, Markdown runbooks.
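
For the structured-logs row, a minimal sketch using only the Python standard library might look like this. The `JsonFormatter` class, the `build_logger` helper, the `context` convention, and the field names are all assumptions chosen for illustration, not an established schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become record attributes.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("checkout")
log.info("payment failed", extra={"context": {"order_id": "o-123", "latency_ms": 840}})
```

A line like this can be filtered by `order_id` or aggregated by `latency_ms` without any regex, by a human, a log platform, or an AI agent reading the stream.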

Author’s note — both “no monitoring” and “DevOps team” are landmines

Two canonical scenes you hear about:

First, no-monitoring operations. You inherit a production environment with no monitoring or metrics, get paged at midnight, SSH in, and stare at top and tail -f on intuition through three hours of guessing. This is not unusual. A problem a dashboard would catch in 5 minutes takes hours, and that future is locked in the moment ops design is deferred to “later.” The February 2017 AWS S3 outage (us-east-1) is a classic case: a mistyped input to a command run during debugging took down a wide swath of SaaS. It remains the industry’s poster child for the landmines of manual ops.

Second — DevOps team landmine. Orgs that stand up a dedicated “DevOps team” and declare “DevOps adoption” create a new silo within months, almost certainly. Dev says “the DevOps team has it”; the DevOps team says “dev won’t fix the CI”; one more wall. This is widely known as a canonical anti-pattern. DevOps is about tearing down walls, not redistributing roles. Misreading this halts the actual improvements.

Both fail by “relying on a person” or “trying to solve with the org chart.” The answer is design through code and process.

Summary

This article covered the big picture of DevOps and operations architecture — DevOps and SRE as one thing, the DORA four metrics, SLO + error budget, and AI-era machine-readable operational data.

Design dev and ops as one, decide monitoring/logs/SLOs upstream, agree on speed × reliability via the error budget, and produce machine-readable operational data. The realistic answer for 2026.

The next article covers DevOps & SRE: The Big Picture (the DORA four metrics and org strategy).

I hope you’ll read the next article as well.