DevOps Architecture Overview — One Pipeline for Build, Ship, and Run

About this article

This is the first article in the “DevOps Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of DevOps and operations architecture.

This article treats the “build machinery (VCS, CI/CD, test, review, dev environment)” and the “keep-it-running machinery (monitoring, logs, SLO, incident, SRE)” as a single connected lifecycle. As DevOps and SRE have spread, the line between dev and ops has dissolved, and as of 2026 designing them as separate jobs is obsolete. The article serves as the map for all 16 articles in the category.

What is DevOps architecture in the first place

Picture a factory production line. If the product-design department and the line-operations department were completely separated, problems like “can’t build to spec” and “the line is down but design doesn’t know” would happen constantly. Unifying both into a single line is modern factory management.

DevOps architecture is the same idea. It is the discipline of designing the code-writing machinery (VCS, CI/CD, test, review) and the keep-it-running machinery (monitoring, logs, SLO, incident response) as a single connected lifecycle.

If dev and ops are separate, every release triggers a tug-of-war between “dev wants to ship” and “ops wants to stop”, and incidents devolve into blame games.

Why design dev and ops as one

The same code flows from dev to production in a straight line

VCS -> CI -> deploy -> monitoring all sit on a single pipeline. In modern software these stages are so tightly integrated that optimizing only one side is pointless.

We measure on the same metrics now

The four DORA metrics (DORA stands for DevOps Research and Assessment, Google’s research program on software delivery performance) are deploy frequency, lead time, MTTR, and change-failure rate. They measure dev speed and ops stability on the same scale, so improving only one side won’t move the numbers.

“Operations as code” is the AI-era assumption

IaC (Infrastructure as Code) and GitOps (Git-driven operations automation) are now mainstream, so operations runs on the same skill set as development. Operating by manually SSH-ing into servers to edit config files is debt in the AI era.

The dev/ops dichotomy is an afterimage of the old org chart. In practice it’s now one river.
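
To make the “declarative” part of IaC and GitOps concrete, here is a minimal sketch of the reconcile idea: compare the desired state stored in Git with the actual state of the environment and derive the actions that close the gap. The `Action` type, the `plan` function, and the service names are hypothetical and chosen only for illustration; real tools work on far richer resource models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "create", "update", or "delete"
    name: str


def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[Action]:
    """Diff desired state (as declared in Git) against actual state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    # Anything running that is no longer declared gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(Action("delete", name))
    return actions


# Hypothetical example state; the names are made up.
desired = {"web": {"replicas": 3}, "worker": {"replicas": 2}}
actual = {"web": {"replicas": 2}, "cron": {"replicas": 1}}

for action in plan(desired, actual):
    print(action.kind, action.name)
```

The point of the design is that nobody edits the running environment directly: the only way to change it is to change the declared state, which leaves a reviewable history in Git.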

The full lifecycle covered in this chapter

flowchart TB
    subgraph DEV["Development"]
        VCS[VCS] --> ENV[Dev environment] --> REV[Code review] --> TEST[Test] --> CI[CI]
    end
    subgraph REL["Release"]
        DEPLOY[Deploy strategy]
    end
    subgraph OPS["Operations"]
        OBS[Monitoring &<br/>Observability] --> LOG[Logs] --> SLO[SLO/SLI] --> INC[Incident response]
    end
    subgraph IMP["Continuous improvement"]
        SRE[SRE practices]
    end
    subgraph CROSS["Cross-cutting"]
        DOC[Documentation]
        TICKET[Tickets]
    end
    DEV --> REL --> OPS --> IMP
    CROSS -.-> DEV
    CROSS -.-> OPS
    classDef dev fill:#dbeafe,stroke:#2563eb;
    classDef rel fill:#fef3c7,stroke:#d97706;
    classDef ops fill:#fae8ff,stroke:#a21caf;
    classDef imp fill:#dcfce7,stroke:#16a34a;
    classDef cross fill:#f0f9ff,stroke:#0369a1;
    class DEV,VCS,ENV,REV,TEST,CI dev;
    class REL,DEPLOY rel;
    class OPS,OBS,LOG,SLO,INC ops;
    class IMP,SRE imp;
    class CROSS,DOC,TICKET cross;

Read the chapter in the order shown in the diagram and you have the full path from “becomes code” to “is delivered” to “keeps running” for one application. Each article stands alone, but starting with DevOps & SRE: The Big Picture makes the ordering click.

Article ordering

#	Article	Stage
01	DevOps & SRE: The Big Picture	Map of the chapter
02	VCS	Git, branching strategy
03	Dev environment & local execution	Developer experience
04	Code review	PR operation
05	Test design	Automated test strategy
06	CI/CD	Pipeline design
07	Deploy strategy	Canary, Blue-Green
08	Monitoring & observability	Metrics, traces
09	Logging design	Structured logs
10	SLOs and SLIs	Reliability targets
11	Incident response	On-call, postmortems
12	SRE practices	Continuous improvement, toil reduction
13	Documentation	Cross-cutting, long-lived
14	Tickets and project management	Cross-cutting, decisions

What you must decide 1: development process

Item	Examples
Git hosting	GitHub / GitLab / Bitbucket
Branching	GitHub Flow / Trunk-Based / GitFlow
CI/CD	GitHub Actions / GitLab CI / CircleCI
Test pyramid	Unit / integration / E2E ratios
Review policy	2-approver / CODEOWNERS / merge queue
Dev environment	Docker Compose / Dev Container / cloud IDE
Documentation home	In-repo md / Notion / Confluence

What you must decide 2: operations

Item	Examples
Monitoring tool	Prometheus / Datadog / New Relic
Logging	CloudWatch Logs / Loki / Splunk
Distributed tracing	OpenTelemetry / Jaeger / X-Ray
SLO/SLI	99.9% availability / p99 latency (the response time that 99% of requests stay under)
Alerting	Static thresholds / anomaly detection / SLO burn rate
Notifications	PagerDuty / Slack / Opsgenie
On-call	24/7 / business hours / weekly rotation
Error-budget operation	Freeze releases on overrun?

What you must decide 3: release & cross-cutting

Item	Examples
Deploy strategy	Blue-Green / Canary / Rolling
Feature flags	LaunchDarkly / Unleash / DIY
Rollback policy	Auto / manual / not possible
Backup	Frequency / retention / generations
Restore drills	Annual / quarterly / monthly
Capacity planning	Auto-scale / manual review
Tickets	Jira / Linear / GitHub Projects

Service-type × maturity ladder

Note: industry rates as of April 2026. Periodic refresh required.

DevOps investment levels vary heavily by service type. Running finance-grade SRE on an MVP and leaving manual deploys on a payment system are both sources of incidents.

Service type	SLO	Deploy	Monitoring	On-call	Annual ops cost
Internal tool	99%	Manual or light CD	CloudWatch standard	Business hours only	~$1k
General B2C web	99.9%	CD + Canary	Datadog free / Grafana Cloud	2-3 part-time + PagerDuty	~$10k
B2B SaaS	99.95%	Multiple/day + feature flags	Datadog / New Relic	2-3 dedicated SREs	~$100k
Finance / payment	99.99%	Strict staged release	SIEM + UEBA + APM	24/7 SRE + SOC	~$1M+
Telco / utilities	99.999%	Quarterly / annual	Enterprise integrated	Follow-the-Sun	~$10M+

Building for 99.99% costs several times more than building for 99.9%. Without a numeric agreement with the business, “as high as possible” is the road to bankruptcy. 100% is not a goal; that idea runs through the whole chapter.

An SLO is a numeric agreement with the business. A requirement phrased as “don’t go down” can never be negotiated to a conclusion.
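
To see why each extra nine changes the conversation with the business, it helps to convert the SLO column of the table into allowed downtime. The sketch below assumes a 365-day year and treats availability as simple uptime; the function name `allowed_downtime_minutes` is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted by an availability SLO over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


# The SLO tiers from the service-type table above.
for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

Under these assumptions, 99.9% leaves roughly 8.8 hours a year while 99.99% leaves under an hour, which is why the two tiers need very different deploy and on-call setups.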

The three pillars of operations design

The core of operations is monitoring, logging, and distributed tracing. Treating the three as one is called observability (a design philosophy that lets you investigate unknown problems after the fact). Missing any one of them turns the system into a black box.

Pillar	Role	Tools
Monitoring	Visualize state in numbers	Prometheus / Datadog / CloudWatch
Logging	Record events as text	Loki / Splunk / CloudWatch Logs
Distributed tracing	Trace request paths	Jaeger / Tempo / X-Ray

The current standard is to send unified data through OpenTelemetry (a vendor-neutral standard for metrics, logs, and traces) and view it across tools in Grafana or Datadog. The first decision is not which tool to use but how to standardize instrumentation.

SRE’s core — agree numerically on “how much breakage is okay”

The substance of SRE comes down to the SLO (Service Level Objective, an internally agreed reliability target) and the error budget (the “how much breakage is OK” within the SLO). With a 99.9% monthly availability SLO, about 43 minutes of downtime per month is allowed (0.1% of the 43,200 minutes in a 30-day month); that is the error budget.

Within budget, push releases. Past budget, freeze releases and focus on stabilization. That’s the SRE method of “running speed and reliability on the same metric.”
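
The budget rule above can be sketched in a few lines. This is a minimal illustration, assuming downtime minutes are the only thing that consumes the budget; the `ErrorBudget` class and its method names are hypothetical, and a real policy would usually be based on request-level SLIs rather than raw downtime.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float             # e.g. 0.999 for 99.9%
    period_minutes: float  # e.g. 30 days = 43_200 minutes

    @property
    def total_minutes(self) -> float:
        """Total downtime the SLO tolerates over the period."""
        return self.period_minutes * (1 - self.slo)

    def remaining_minutes(self, downtime_minutes: float) -> float:
        return self.total_minutes - downtime_minutes

    def release_allowed(self, downtime_minutes: float) -> bool:
        """Within budget: keep shipping. Past budget: freeze and stabilize."""
        return self.remaining_minutes(downtime_minutes) > 0


budget = ErrorBudget(slo=0.999, period_minutes=30 * 24 * 60)
print(round(budget.total_minutes, 1))  # 43.2
print(budget.release_allowed(20))      # True: budget remains, keep releasing
print(budget.release_allowed(50))      # False: budget overrun, freeze releases
```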

Concept	Meaning
SLI (Service Level Indicator)	Measured value (response time, success rate, …)
SLO (Service Level Objective)	Internally agreed target
SLA (Service Level Agreement)	Customer contract (compensation if missed)
Error budget	“How much breakage is OK” within the SLO

100% availability is impossible. Agree numerically, and trade off speed and reliability — the core of SRE.

DORA four metrics — team health check

Google’s DevOps Research & Assessment (DORA) program distilled the difference between strong and weak teams into four numbers. Because these numbers measure DevOps and SRE on the same scale, they are the foundation of this chapter’s framing.

Metric	Elite (top 10%)	Low
Deploy frequency	Multiple per day	Less than monthly
Change lead time	< 1 hour	> 1 month
MTTR (Mean Time To Recovery)	< 1 hour	1 week to 1 month
Change-failure rate	0-15%	46-60%

Details and improvement priorities are in DevOps & SRE: The Big Picture. Read each article in the chapter as a lever for moving one of the DORA metrics.
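
As a rough illustration of how the four numbers fall out of ordinary deploy records, the sketch below computes them from a list of deployments. The `Deployment` record and `dora_metrics` function are hypothetical, and the choices of median lead time and mean restore time are assumptions; how you define a “failed” change and a “restore” is something each team has to agree on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: list[Deployment], period_days: int) -> dict[str, float]:
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours": sum(restores) / len(restores) if restores else 0.0,
    }


# Made-up sample data for illustration only.
t0 = datetime(2026, 4, 1, 9, 0)
day = timedelta(days=1)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + day, t0 + day + timedelta(hours=4), failed=True,
               restored_at=t0 + day + timedelta(hours=4, minutes=30)),
    Deployment(t0 + 2 * day, t0 + 2 * day + timedelta(hours=6)),
    Deployment(t0 + 3 * day, t0 + 3 * day + timedelta(hours=8)),
]
print(dora_metrics(sample, period_days=2))
```

Two of the inputs come from the development side (commit and deploy timestamps) and two from operations (failure and recovery), which is the sense in which one formula measures both.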

Architecture-level traps

Forbidden move	Why
Monitoring/logs bolted on after	Cannot identify causes during incidents — days of guessing
Alerts only on static thresholds (CPU 80%)	False positives under varying load — switch to SLO burn rate
Targeting 100% availability	Infinite cost — buy speed with the error budget
One veteran handles incidents	Collapses when they leave — Runbook-as-code mandatory
Postmortems as blame hunts	Information hidden — Blameless is the rule
Big releases without feature flags	The Knight Capital 2012 pattern ($440M loss in 45 minutes)
DB change and code deploy together	Use expand/contract — keep rollback possible
SRE in name only, manual ops	Toil > 50% is the danger zone — without changing how hands work, it isn’t SRE
Dev team ships without knowing ops	First breaks in production — joining the on-call rotation is the fastest school
CI runs but isn’t a gate	Decoration if merges go through anyway — blocking is the assumption
“Let the field decide the dev process freely” and leave it alone	Without shared rules, adding people adds incidents; headcount and stability become inversely correlated
“Outages should be zero” and pursuing perfection	Investing in MTTR reduction yields better reliability and economics

DevOps is decided before you start building, not after you finish. Bolting on costs 10x.
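
The table recommends SLO burn rate over static thresholds. A minimal sketch of that check follows: burn rate is the observed error ratio divided by the error ratio the SLO allows, and a page fires only when both a long and a short window are burning fast. The function names are illustrative, and the 14.4 threshold (about 2% of a 30-day budget consumed in one hour) is a commonly cited starting point that is assumed here, not a value from this article.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, e.g. 0.999")
    return error_ratio / (1 - slo)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if the long window (e.g. 1h) and short window (e.g. 5m) both burn fast.

    The short window keeps the alert from firing after the problem has
    already stopped; the threshold value is an assumption.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


# 2% of requests failing against a 99.9% SLO burns budget about 20x too fast.
print(should_page(0.02, 0.02, slo=0.999))    # sustained: page
print(should_page(0.02, 0.0005, slo=0.999))  # already recovered: no page
```

Unlike a “CPU over 80%” rule, this check is tied to what users experience, so it stays quiet under heavy but healthy load.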

AI decision axes

AI-era favorable	AI-era unfavorable
Structured logs (JSON), OpenTelemetry	Human-targeted prose logs
Declarative management via IaC / GitOps	Manual SSH operations
Runbooks in Markdown, Git-managed	Confluence and oral tradition
Prometheus, standard metrics	Custom monitoring schemas

Design dev process and ops as one — DORA measures both with one formula.
Decide monitoring, logs, and SLOs at the most upstream point — bolting on costs 10x.
Use SLOs and error budgets to agree on speed × reliability — 100% is not a goal.
Produce machine-readable operational data — structured logs, IaC, Markdown runbooks.
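
As one example of “machine-readable operational data,” here is a minimal sketch of structured JSON logging using only the Python standard library. The `JsonFormatter` class, the `fields` key passed through `extra`, and the logger and field names are assumptions for illustration; in practice a library or an OpenTelemetry log pipeline would usually fill this role.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured context passed via extra={"fields": {...}} (an assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={"fields": {"order_id": "o-123", "latency_ms": 842}})
```

Because every line is a JSON object with stable keys, both a log query tool and an AI assistant can filter by `order_id` or `latency_ms` without parsing prose.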

Author’s note — both “no monitoring” and “DevOps team” are landmines

Two canonical scenes you hear about:

First: no-monitoring operations. You inherit a production environment with no monitoring or metrics, get paged at midnight, SSH in to stare at top and tail -f, and spend three hours guessing. This is not unusual. A problem a dashboard would catch in 5 minutes takes hours, and that future is locked in the moment ops design is deferred to “later.” The February 2017 AWS S3 outage (us-east-1) was a classic case: a mistyped command during debugging removed more servers than intended and took down a wide swath of SaaS. It is the industry’s poster child for the landmines of manual ops.

Second: the DevOps team landmine. Orgs that stand up a dedicated “DevOps team” and declare “DevOps adoption” almost always create a new silo within months. Dev says “the DevOps team has it”; the DevOps team says “dev won’t fix the CI”; the result is one more wall. This is widely known as a canonical anti-pattern. DevOps is about tearing down walls, not redistributing roles. Misreading this halts the actual improvements.

Both fail by “relying on a person” or “trying to solve with the org chart.” The answer is design through code and process.

Summary

This article covered the big picture of DevOps and operations architecture — DevOps and SRE as one thing, the DORA four metrics, SLO + error budget, and AI-era machine-readable operational data.

Design dev and ops as one, decide monitoring/logs/SLOs upstream, agree on speed × reliability via the error budget, and produce machine-readable operational data. That is the realistic answer for 2026.

The next article covers DevOps & SRE: The Big Picture (the DORA four metrics and org strategy).

I hope you’ll read the next article as well.