About this article
This article is the first in the “DevOps Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering the big picture of DevOps and operations architecture.
The “build machinery” (VCS, CI/CD, test, review, dev environment) and the “keep-it-running machinery” (monitoring, logs, SLO, incident response, SRE) are treated as a single connected lifecycle. With DevOps and SRE adoption, the line between dev and ops has dissolved; as of 2026, designing them as separate jobs is obsolete. This article is the map for all 14 articles in the category.
What is DevOps architecture in the first place
Picture a factory production line. If the product-design department and the line-operations department were completely separate, problems like “can’t build to spec” and “the line is down but design doesn’t know” would happen constantly. Modern factory management unifies both into a single line.
DevOps architecture is the same idea. It is the discipline of designing the code-writing machinery (VCS, CI/CD, test, review) and the keep-it-running machinery (monitoring, logs, SLO, incident response) as a single connected lifecycle.
If dev and ops are separate, every release triggers a tug-of-war between “dev wants to ship” and “ops wants to stop”, and incidents devolve into blame games.
Why design dev and ops as one
The same code flows from dev to production in a straight line
VCS -> CI -> deploy -> monitoring all sit on a single pipeline. Modern software is integrated to the point where optimizing only one side is pointless.
We measure on the same metrics now
The four DORA metrics (DevOps Research and Assessment, Google’s long-running study of team performance) are deployment frequency, lead time for changes, MTTR, and change-failure rate. They measure dev speed and ops stability with the same formula, so improving only one side won’t move the numbers.
“Operations as code” is the AI-era assumption
IaC (Infrastructure as Code) and GitOps (Git-driven operations automation) are now the mainstream, and operations runs on the same skill set as development. Operations that means SSH-ing into servers to hand-edit config files is debt in the AI era.
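To make “operations as code” concrete, here is a minimal sketch using Pulumi’s Python SDK, one of several IaC options (the resource names and tags are hypothetical). The point is that the desired state lives in Git and flows through the same review and CI pipeline as application code:

```python
"""Minimal IaC sketch (assumes the pulumi and pulumi-aws packages;
resource names and tags are hypothetical)."""
import pulumi
import pulumi_aws as aws

# Desired state is declared in code, reviewed in a PR, and applied by
# CI -- the same lifecycle as application code.
log_bucket = aws.s3.Bucket(
    "app-logs",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"team": "platform", "managed-by": "pulumi"},
)

pulumi.export("log_bucket_name", log_bucket.id)
```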
The dev/ops dichotomy is an afterimage of the old org chart. In practice it’s now one river.
The full lifecycle covered in this chapter
```mermaid
flowchart TB
    subgraph DEV["Development"]
        VCS[VCS] --> ENV[Dev environment] --> REV[Code review] --> TEST[Test] --> CI[CI]
    end
    subgraph REL["Release"]
        DEPLOY[Deploy strategy]
    end
    subgraph OPS["Operations"]
        OBS["Monitoring &<br/>Observability"] --> LOG[Logs] --> SLO[SLO/SLI] --> INC[Incident response]
    end
    subgraph IMP["Continuous improvement"]
        SRE[SRE practices]
    end
    subgraph CROSS["Cross-cutting"]
        DOC[Documentation]
        TICKET[Tickets]
    end
    DEV --> REL --> OPS --> IMP
    CROSS -.-> DEV
    CROSS -.-> OPS
    classDef dev fill:#dbeafe,stroke:#2563eb;
    classDef rel fill:#fef3c7,stroke:#d97706;
    classDef ops fill:#fae8ff,stroke:#a21caf;
    classDef imp fill:#dcfce7,stroke:#16a34a;
    classDef cross fill:#f0f9ff,stroke:#0369a1;
    class DEV,VCS,ENV,REV,TEST,CI dev;
    class REL,DEPLOY rel;
    class OPS,OBS,LOG,SLO,INC ops;
    class IMP,SRE imp;
    class CROSS,DOC,TICKET cross;
```
Follow the diagram from top to bottom and you have the full path from “becomes code” through “is delivered” to “keeps running” for one application. Each article stands alone, but starting with DevOps & SRE: The Big Picture makes the ordering click.
Article ordering
| # | Article | Stage |
|---|---|---|
| 01 | DevOps & SRE: The Big Picture | Map of the chapter |
| 02 | VCS | Git, branching strategy |
| 03 | Dev environment & local execution | Developer experience |
| 04 | Code review | PR operation |
| 05 | Test design | Automated test strategy |
| 06 | CI/CD | Pipeline design |
| 07 | Deploy strategy | Canary, Blue-Green |
| 08 | Monitoring & observability | Metrics, traces |
| 09 | Logging design | Structured logs |
| 10 | SLOs and SLIs | Reliability targets |
| 11 | Incident response | On-call, postmortems |
| 12 | SRE practices | Continuous improvement, toil reduction |
| 13 | Documentation | Cross-cutting, long-lived |
| 14 | Tickets and project management | Cross-cutting, decisions |
What you must decide 1: development process
| Item | Examples |
|---|---|
| Git hosting | GitHub / GitLab / Bitbucket |
| Branching | GitHub Flow / Trunk-Based / GitFlow |
| CI/CD | GitHub Actions / GitLab CI / CircleCI |
| Test pyramid | Unit / integration / E2E ratios |
| Review policy | 2-approver / CODEOWNERS / merge queue |
| Dev environment | Docker Compose / Dev Container / cloud IDE |
| Documentation home | In-repo md / Notion / Confluence |
What you must decide 2: operations
| Item | Examples |
|---|---|
| Monitoring tool | Prometheus / Datadog / New Relic |
| Logging | CloudWatch Logs / Loki / Splunk |
| Distributed tracing | OpenTelemetry / Jaeger / X-Ray |
| SLO/SLI | 99.9% availability / p99 latency (the response time that 99% of requests come in under) |
| Alerting | Static thresholds / anomaly detection / SLO burn rate |
| Notifications | PagerDuty / Slack / Opsgenie |
| On-call | 24/7 / business hours / weekly rotation |
| Error-budget operation | Freeze releases on overrun? |
What you must decide 3: release & cross-cutting
| Item | Examples |
|---|---|
| Deploy strategy | Blue-Green / Canary / Rolling |
| Feature flags | LaunchDarkly / Unleash / DIY |
| Rollback policy | Auto / manual / not possible |
| Backup | Frequency / retention / generations |
| Restore drills | Annual / quarterly / monthly |
| Capacity planning | Auto-scale / manual review |
| Tickets | Jira / Linear / GitHub Projects |
Service-type × maturity ladder
Note: industry rates as of April 2026. Periodic refresh required.
DevOps investment levels vary heavily by service type. Running finance-grade SRE on an MVP burns the runway; leaving manual deploys on a payment system invites incidents. Both are mismatches.
| Service type | SLO | Deploy | Monitoring | On-call | Annual ops cost |
|---|---|---|---|---|---|
| Internal tool | 99% | Manual or light CD | CloudWatch standard | Business hours only | ~$1k |
| General B2C web | 99.9% | CD + Canary | Datadog free / Grafana Cloud | 2-3 part-time + PagerDuty | ~$10k |
| B2B SaaS | 99.95% | Multiple/day + feature flags | Datadog / New Relic | 2-3 dedicated SREs | ~$100k |
| Finance / payment | 99.99% | Strict staged release | SIEM + UEBA + APM | 24/7 SRE + SOC | ~$1M+ |
| Telco / utilities | 99.999% | Quarterly / annual | Enterprise integrated | Follow-the-Sun | ~$10M+ |
Building for 99.99% costs a multiple of building for 99.9%. Without a numeric agreement with the business, “as high as possible” is the road to bankruptcy. 100% is not a goal; that idea threads through the whole chapter.
An SLO is a numeric agreement with the business. “Don’t go down”, as a sentence, never converges into a target.
The three pillars of operations design
The core of operations is monitoring, logs, and distributed tracing. Treating them as one is the framing called observability (a design philosophy that lets you investigate unknown problems after the fact). Miss any one of the three and the system turns into a black box.
| Pillar | Role | Tools |
|---|---|---|
| Monitoring | Visualize state in numbers | Prometheus / Datadog / CloudWatch |
| Logging | Record events as text | Loki / Splunk / CloudWatch Logs |
| Distributed tracing | Trace request paths | Jaeger / Tempo / X-Ray |
The current standard is to emit unified telemetry through OpenTelemetry (the vendor-neutral standard for traces, metrics, and logs) and view it across tools in Grafana or Datadog. The first decision is not the tool; it is standardizing instrumentation.
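To make “standardize instrumentation first” concrete, here is a minimal tracing sketch with the OpenTelemetry Python SDK (service and span names are placeholders; the console exporter stands in for an OTLP exporter pointed at Grafana, Datadog, or Jaeger):

```python
"""Minimal OpenTelemetry tracing sketch (assumes the opentelemetry-sdk
package; service and span names are placeholders)."""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Instrument against the vendor-neutral API; only the exporter decides
# where data goes (console here, OTLP -> Grafana/Datadog in production).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_request(order_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic; nested spans trace the request path ...

handle_request("order-42")
```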
SRE’s core — agree numerically on “how much breakage is okay”
The substance of SRE comes down to SLO (Service Level Objective — internally agreed reliability target) and error budget (the “how much breakage is OK” within the SLO). With a 99.9% monthly availability SLO, ~43 minutes/month of downtime is allowed; that is the error budget.
Within budget, push releases. Past budget, freeze releases and focus on stabilization. That’s the SRE method of “running speed and reliability on the same metric.”
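The arithmetic is simple enough to keep next to the dashboards. A minimal sketch, assuming a 99.9% target and a 30-day month:

```python
"""Error-budget arithmetic for a monthly availability SLO.
The 99.9% target and the 30-day month are illustrative assumptions."""

SLO = 0.999                   # 99.9% monthly availability target
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

budget_minutes = (1 - SLO) * MONTH_MINUTES
print(f"Error budget: {budget_minutes:.1f} min/month")  # ~43.2 min

# After an incident, compare consumption against the budget.
downtime_so_far = 12.0        # minutes of downtime this month (example)
remaining = budget_minutes - downtime_so_far
print(f"Remaining budget: {remaining:.1f} min "
      f"({remaining / budget_minutes:.0%})")
```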
| Concept | Meaning |
|---|---|
| SLI (Service Level Indicator) | Measured value (response time, success rate, …) |
| SLO (Service Level Objective) | Internally agreed target |
| SLA (Service Level Agreement) | Customer contract (compensation if missed) |
| Error budget | “How much breakage is OK” within the SLO |
100% availability is impossible. Agree numerically, and trade off speed and reliability — the core of SRE.
DORA four metrics — team health check
Google’s DevOps Research & Assessment program distilled the gap between strong and weak teams into four numbers. That DevOps and SRE are measured with the same formula is the foundation of this chapter’s framing.
| Metric | Elite (top 10%) | Low |
|---|---|---|
| Deploy frequency | Multiple per day | Less than monthly |
| Change lead time | < 1 hour | > 1 month |
| MTTR (Mean Time To Recovery) | < 1 hour | > 1 month |
| Change-failure rate | 0-15% | 46-60% |
Detail and improvement priorities are covered in DevOps & SRE: The Big Picture. Read each article in the chapter as a lever on one of the four metrics.
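As a sketch of how the four numbers fall out of a team’s own release history (the Deploy record format here is hypothetical; a real pipeline would pull it from the CI/CD system and the incident tracker):

```python
"""Sketch: deriving the four DORA metrics from deploy records.
The Deploy record format is hypothetical."""
from dataclasses import dataclass
from statistics import mean

@dataclass
class Deploy:
    lead_time_h: float  # commit-to-production, in hours
    failed: bool        # did this change cause a production failure?
    recovery_h: float   # time to restore if it failed, else 0

deploys = [
    Deploy(0.8, False, 0), Deploy(2.5, True, 0.5), Deploy(1.1, False, 0),
]
days_observed = 7

print(f"Deploy frequency : {len(deploys) / days_observed:.2f}/day")
print(f"Lead time (mean) : {mean(d.lead_time_h for d in deploys):.1f} h")
failures = [d for d in deploys if d.failed]
print(f"Change-failure % : {len(failures) / len(deploys):.0%}")
print(f"MTTR             : {mean(d.recovery_h for d in failures):.1f} h")
```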
Architecture-level traps
| Forbidden move | Why |
|---|---|
| Monitoring/logs bolted on after | Cannot identify causes during incidents — days of guessing |
| Alerts only on static thresholds (CPU 80%) | False positives under varying load; switch to SLO burn rate (see the sketch after this table) |
| Targeting 100% availability | Infinite cost — buy speed with the error budget |
| One veteran handles incidents | Collapses when they leave — Runbook-as-code mandatory |
| Postmortems as blame hunts | Information hidden — Blameless is the rule |
| Big releases without feature flags | The Knight Capital 2012 pattern ($440M loss in 45 minutes) |
| DB change and code deploy together | Use expand/contract (sketch after this table); keep rollback possible |
| SRE in name only, manual ops | Toil above 50% is the danger zone; if the way the work gets done doesn’t change, it isn’t SRE |
| Dev team ships without knowing ops | First breaks in production; joining the on-call rotation is the fastest school |
| CI runs but isn’t a gate | If merges go through anyway it’s decoration; CI must block |
| “Let each team decide its dev process freely” and leave it alone | Without shared guardrails, incidents grow as headcount grows |
| Pursuing “zero outages” perfection | Investing in MTTR reduction yields better reliability and economics |
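The SLO-burn-rate alternative from the alerting row deserves a sketch. Burn rate measures how fast the error budget is being consumed relative to an exactly-on-budget pace, so paging on a sustained high burn rate tracks user impact instead of machine load. A minimal sketch using the multiwindow pattern popularized by the Google SRE Workbook, with illustrative thresholds:

```python
"""Sketch: SLO burn-rate alerting instead of static CPU thresholds.
Windows and the 14.4x threshold follow the common multiwindow pattern,
but the exact numbers are illustrative."""

SLO = 0.999  # availability target; budget = 1 - SLO = 0.1%

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / (1 - SLO)

# Page only if BOTH a long and a short window burn hot: the long
# window proves real impact, the short window proves it is ongoing.
def should_page(err_1h: float, err_5m: float) -> bool:
    return burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4

print(should_page(err_1h=0.02, err_5m=0.03))    # True: burning ~20-30x
print(should_page(err_1h=0.0005, err_5m=0.02))  # False: brief blip only
```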
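Likewise, the expand/contract row in brief: split a breaking schema change into an additive phase, a transition phase where code handles both shapes, and a removal phase, so every deploy stays individually rollback-able. A minimal sketch (the SQL strings and helper functions are illustrative):

```python
"""Sketch: expand/contract for renaming users.name to users.full_name
without coupling the DB change to a code deploy. SQL and helpers are
illustrative."""

# Phase 1 -- expand: additive change only; old code keeps working.
EXPAND = "ALTER TABLE users ADD COLUMN full_name TEXT"

# Phase 2 -- transition: the application writes both columns and
# reads the new one with a fallback, so either deploy can roll back.
def read_name(row: dict) -> str:
    return row.get("full_name") or row["name"]

def write_name(row: dict, value: str) -> None:
    row["full_name"] = value
    row["name"] = value  # keep old readers (and rollback) working

# Phase 3 -- contract: only after every reader and writer has migrated.
CONTRACT = "ALTER TABLE users DROP COLUMN name"
```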
DevOps is decided before you start building, not after you finish. Bolting on costs 10x.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Structured logs (JSON; sketch below), OpenTelemetry | Human-targeted prose logs |
| Declarative management via IaC / GitOps | Manual SSH operations |
| Runbooks in Markdown, Git-managed | Confluence and oral tradition |
| Prometheus, standard metrics | Custom monitoring schemas |
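What “structured logs” in the first row means in practice, as a minimal sketch using only the standard library (field names are illustrative; real setups would typically use structlog or a JSON formatter):

```python
"""Sketch: the same event as a prose log vs. a structured (JSON) log.
Field names are illustrative."""
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(message)s")
log = logging.getLogger("checkout")

# Prose: a human can read it, but a query engine or an LLM agent
# has to parse it heuristically.
log.info("payment failed for order 42 after 3 retries (gateway timeout)")

# Structured: every field is addressable -- filter, aggregate, feed to tooling.
log.info(json.dumps({
    "event": "payment_failed",
    "order_id": 42,
    "retries": 3,
    "reason": "gateway_timeout",
}))
```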
Key takeaways:
- Design dev process and ops as one — DORA measures both with one formula.
- Decide monitoring, logs, and SLOs at the most upstream point — bolting on costs 10x.
- Use SLOs and error budgets to agree on speed × reliability — 100% is not a goal.
- Produce machine-readable operational data — structured logs, IaC, Markdown runbooks.
Author’s note — both “no monitoring” and “DevOps team” are landmines
Two canonical scenes you hear about:
First, no-monitoring operations. Inheriting a production environment with no monitoring or metrics, getting paged at midnight, SSH-ing in to stare at top and tail -f on intuition, and losing three hours to guesswork: not unusual. A problem a dashboard would surface in 5 minutes takes hours instead, and that future is locked in the moment ops design gets deferred to “later”. The February 2017 AWS S3 outage (us-east-1), where a typo in a debugging command took down a wide swath of SaaS, remains the industry’s poster child for the landmines of manual ops.
Second, the DevOps-team landmine. Orgs that stand up a dedicated “DevOps team” and declare “DevOps adopted” almost certainly create a new silo within months. Dev says “the DevOps team has it”; the DevOps team says “dev won’t fix the CI”; one more wall goes up. This is a widely known anti-pattern. DevOps is about tearing down walls, not redistributing roles; misreading that stalls the real improvements.
Both failures come down to relying on a person or trying to solve the problem with the org chart. The answer is design through code and process.
Summary
This article covered the big picture of DevOps and operations architecture — DevOps and SRE as one thing, the DORA four metrics, SLO + error budget, and AI-era machine-readable operational data.
Design dev and ops as one, decide monitoring/logs/SLOs upstream, agree on speed × reliability via the error budget, and produce machine-readable operational data. The realistic answer for 2026.
The next article covers DevOps & SRE: The Big Picture (the DORA four metrics and org strategy).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (54/89)