About this article
This is the first article in the “DevOps Architecture” category of the Architecture Crash Course for the Generative-AI Era series. It covers the big picture of DevOps and operations architecture.
The “build machinery (VCS, CI/CD, test, review, dev environment)” and the “keep-it-running machinery (monitoring, logs, SLO, incident, SRE)” are treated as a single connected lifecycle. With DevOps and SRE adoption, the line between dev and ops has dissolved; designing them as separate jobs is obsolete as of 2026. This article works as the map for all 15 articles in the category.
A full list of all articles in this category, with summaries and learning points, is available at the following page.
What is DevOps architecture in the first place
Picture a factory production line. If the product-design department and the line-operations department were completely separated, problems like “can’t build to spec” and “the line is down but design doesn’t know” would happen constantly. Unifying both into a single line is modern factory management.
DevOps architecture is the same idea. It is the discipline of designing the code-writing machinery (VCS, CI/CD, test, review) and the keep-it-running machinery (monitoring, logs, SLO, incident response) as a single connected lifecycle.
If dev and ops are separate, every release triggers a tug-of-war between “dev wants to ship” and “ops wants to stop”, and incidents devolve into blame games.
Why design dev and ops as one
The same code flows from dev to production in a straight line
VCS -> CI -> deploy -> monitoring all sit on a single pipeline. Modern software is integrated tightly enough that optimizing only one side accomplishes little.
Both are now measured with the same metrics
The DORA four metrics — deploy frequency, lead time, MTTR, change-failure rate — assume that dev speed and ops stability get measured by the same formula. Improving only one side won’t move the numbers.
“Operations as code” is the AI-era assumption
IaC and GitOps (Git-driven operations automation) are now the mainstream, so operations runs on the same skill set as development. Operations that depend on SSH-ing into servers and hand-editing config files are debt in the AI era.
The dev/ops dichotomy is a holdover from the old org chart. In practice it’s now one river.
The full lifecycle covered in this chapter
Read this chapter left to right and you have the full path from “becomes code” to “is delivered” to “keeps running” for one application. Each article stands alone, but starting with DevOps & SRE: The Big Picture makes the ordering click.
Why this order
All 15 articles are organized into 4 phases + 2 cross-cutting themes.
- Development phase (VCS -> dev environment -> review -> test -> CI): the flow from writing code to ensuring quality.
- Release phase (deploy strategy): delivering quality-assured code to production.
- Operations phase (monitoring -> logs -> SLO -> incident response): the machinery that keeps production running.
- Continuous-improvement phase (SRE practices): feeding ops insights back to dev, closing the cycle.
- Cross-cutting themes (documentation, ticket management): the process foundation spanning all phases.
These 4 phases are not one-directional — they cycle. Problems found in operations flow back to development; SRE practices improve the dev process. The DORA 4 metrics measure the speed and quality of this cycle with a single formula.
Article ordering
| # | Article | Stage |
|---|---|---|
| 01 | DevOps & SRE: The Big Picture | Map of the chapter |
| 02 | VCS | Git, branching strategy |
| 03 | Dev environment & local execution | Developer experience |
| 04 | Code review | PR operation |
| 05 | Test design | Automated test strategy |
| 06 | CI/CD | Pipeline design |
| 07 | Deploy strategy | Canary, Blue-Green |
| 08 | Monitoring & observability | Metrics, traces |
| 09 | Logging design | Structured logs |
| 10 | SLOs and SLIs | Reliability targets |
| 11 | Incident response | On-call, postmortems |
| 12 | SRE practices | Continuous improvement, toil reduction |
| 13 | Documentation | Cross-cutting, long-lived |
| 14 | Tickets and project management | Cross-cutting, decisions |
What you must decide 1: development process
| Item | Examples |
|---|---|
| Git hosting | GitHub / GitLab / Bitbucket |
| Branching | GitHub Flow / Trunk-Based / GitFlow |
| CI/CD | GitHub Actions / GitLab CI / CircleCI |
| Test pyramid | Unit / integration / E2E ratios |
| Review policy | 2-approver / CODEOWNERS / merge queue |
| Dev environment | Docker Compose / Dev Container / cloud IDE |
| Documentation home | In-repo md / Notion / Confluence |
What you must decide 2: operations
| Item | Examples |
|---|---|
| Monitoring tool | Prometheus / Datadog / New Relic |
| Logging | CloudWatch Logs / Loki / Splunk |
| Distributed tracing | OpenTelemetry / Jaeger / X-Ray |
| SLO/SLI | 99.9% availability / p99 latency (the response time that 99% of requests stay under) |
| Alerting | Static thresholds / anomaly detection / SLO burn rate |
| Notifications | PagerDuty / Slack / Opsgenie |
| On-call | 24/7 / business hours / weekly rotation |
| Error-budget operation | Freeze releases on overrun? |
What you must decide 3: release & cross-cutting
| Item | Examples |
|---|---|
| Deploy strategy | Blue-Green / Canary / Rolling |
| Feature flags | LaunchDarkly / Unleash / DIY |
| Rollback policy | Auto / manual / not possible |
| Backup | Frequency / retention / generations |
| Restore drills | Annual / quarterly / monthly |
| Capacity planning | Auto-scale / manual review |
| Tickets | Jira / Linear / GitHub Projects |
Service-type × maturity ladder
Note: industry rates as of April 2026. Periodic refresh required.
DevOps investment levels vary heavily by service type. Both running finance-grade SRE on an MVP and leaving manual deploys on a payment system are sources of incidents.
| Service type | SLO | Deploy | Monitoring | On-call | Annual ops cost |
|---|---|---|---|---|---|
| Internal tool | 99% | Manual or light CD | CloudWatch standard | Business hours only | ~$1k |
| General B2C web | 99.9% | CD + Canary | Datadog free / Grafana Cloud | 2-3 part-time + PagerDuty | ~$10k |
| B2B SaaS | 99.95% | Multiple/day + feature flags | Datadog / New Relic | 2-3 dedicated SREs | ~$100k |
| Finance / payment | 99.99% | Strict staged release | SIEM + UEBA + APM | 24/7 SRE + SOC | ~$1M+ |
| Telco / utilities | 99.999% | Quarterly / annual | Enterprise integrated | Follow-the-Sun | ~$10M+ |
Building for 99.99% costs several times more than building for 99.9%. Without a numeric agreement with the business, “as high as possible” is the road to bankruptcy. 100% is not a goal, and that idea runs through the whole chapter.
An SLO is a numeric agreement with the business. A bare “don’t go down” gives the discussion nothing to settle on.
The three pillars of operations design
The core of operations is three things: monitoring, logging, and distributed tracing. Treating them as one is the idea behind observability (a design philosophy that lets you investigate unknown problems after the fact). If any one is missing, the system becomes a black box.
| Pillar | Role | Tools |
|---|---|---|
| Monitoring | Visualize state in numbers | Prometheus / Datadog / CloudWatch |
| Logging | Record events as text | Loki / Splunk / CloudWatch Logs |
| Distributed tracing | Trace request paths | Jaeger / Tempo / X-Ray |
The current standard is to send unified data through OpenTelemetry (the standard spec for monitoring data) and view it across tools in Grafana or Datadog. The first decision is not the tool — it’s standardizing instrumentation.
SRE’s core — agree numerically on “how much breakage is okay”
The substance of SRE comes down to the SLO and the error budget (the “how much breakage is OK” within the SLO). With a 99.9% monthly availability SLO, about 43 minutes of downtime per 30-day month is allowed (0.1% of 43,200 minutes); that is the error budget.
Within budget, push releases. Past budget, freeze releases and focus on stabilization. That’s the SRE method of “running speed and reliability on the same metric.”
| Concept | Meaning |
|---|---|
| SLI | Measured value (response time, success rate, …) |
| SLO | Internally agreed target |
| SLA | Customer contract (compensation if missed) |
| Error budget | “How much breakage is OK” within the SLO |
100% availability is impossible. Agree numerically, and trade off speed and reliability — the core of SRE.
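The error-budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a standard API: the names `allowed_downtime_minutes`, `BudgetStatus`, and `budget_status` are hypothetical, the window is assumed to be a 30-day month, and the "freeze when the budget is spent" rule is just the policy described in this section.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO tolerates over the window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


@dataclass
class BudgetStatus:
    allowed_minutes: float  # total error budget for the window
    used_minutes: float     # downtime observed so far in the window

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.used_minutes

    @property
    def freeze_releases(self) -> bool:
        # Assumed policy: freeze releases once the budget is fully spent.
        return self.remaining_minutes <= 0


def budget_status(slo: float, downtime_minutes: float, window_days: int = 30) -> BudgetStatus:
    """Compare observed downtime against the budget implied by the SLO."""
    return BudgetStatus(allowed_downtime_minutes(slo, window_days), downtime_minutes)


print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
print(budget_status(0.999, downtime_minutes=50).freeze_releases)  # True
```

In practice many teams count failed requests rather than minutes of downtime, but the comparison against a pre-agreed number works the same way.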
DORA four metrics — team health check
Google’s DevOps Research & Assessment (DORA) program distilled the difference between strong and weak teams into four numbers. That DevOps and SRE are measured by the same formula is the foundation of this chapter’s framing.
| Metric | Elite (top 10%) | Low |
|---|---|---|
| Deploy frequency | Multiple per day | Less than monthly |
| Change lead time | < 1 hour | > 1 month |
| MTTR (Mean Time To Recovery) | < 1 hour | 1 week to 1 month |
| Change-failure rate | 0-15% | 46-60% |
Detail and improvement priority are in DevOps & SRE: The Big Picture. Read each chapter article as a piece moving one of the DORA metrics.
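To make the four metrics concrete, here is a rough sketch of computing them from deployment records. The `Deployment` record shape and the `dora_metrics` function are hypothetical, and the choices of median for lead time and mean for restore time are assumptions of this sketch rather than DORA's official definitions. Real data would come from your CI system and incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Summarize the four metrics for deployments observed in a window."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
    }


# Made-up sample data: four deployments over four days, one of which failed.
t0 = datetime(2026, 4, 1, 9, 0)
day = timedelta(days=1)
sample = [
    Deployment(t0, t0 + timedelta(minutes=30)),
    Deployment(t0 + day, t0 + day + timedelta(hours=1), failed=True,
               restored_at=t0 + day + timedelta(hours=1, minutes=20)),
    Deployment(t0 + 2 * day, t0 + 2 * day + timedelta(hours=2)),
    Deployment(t0 + 3 * day, t0 + 3 * day + timedelta(minutes=45)),
]
metrics = dora_metrics(sample, window_days=4)
```

Note that speed (frequency, lead time) and stability (failure rate, restore time) come out of the same records, which is the sense in which dev and ops are measured by one formula.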
Architecture-level traps
| Forbidden move | Why |
|---|---|
| Monitoring/logs bolted on after | Cannot identify causes during incidents — days of guessing |
| Alerts only on static thresholds (CPU 80%) | False positives under varying load — switch to SLO burn rate |
| Targeting 100% availability | Infinite cost — buy speed with the error budget |
| One veteran handles incidents | Collapses when they leave — Runbook-as-code mandatory |
| Postmortems as blame hunts | Information hidden — Blameless is the rule |
| Big releases without feature flags | The Knight Capital 2012 pattern ($440M loss in 45 minutes) |
| DB change and code deploy together | Use expand/contract — keep rollback possible |
| SRE in name only, manual ops | Toil > 50% is the danger zone — without changing how hands work, it isn’t SRE |
| Dev team ships without knowing ops | First breaks in production — joining the on-call rotation is the fastest school |
| CI runs but isn’t a gate | Decoration if merges go through anyway — blocking is the assumption |
| “Let the field decide the dev process freely” and leave it alone | Without a shared process, adding people increases incidents |
| “Outages should be zero”, pursuing perfection | Investing in MTTR reduction yields better reliability and economics |
DevOps is decided before you start building, not after you finish. Bolting on costs 10x.
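The table recommends SLO burn rate over static thresholds, so a small sketch of that calculation may help. Burn rate is the observed error ratio divided by the error ratio the SLO allows; a value of 1 spends the budget exactly over the SLO window. The function names and the two-window check are illustrative assumptions, and the 14.4 threshold is borrowed from a commonly cited multi-window scheme (at that rate a 30-day budget loses about 2% per hour). Tune it to your own SLO policy.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


# 2% of requests failing against a 99.9% SLO burns budget about 20x too fast.
print(should_page(0.02, 0.03, slo=0.999))    # True
# The short window has recovered, so nobody gets paged.
print(should_page(0.02, 0.0005, slo=0.999))  # False
```

Unlike a fixed "CPU 80%" rule, this alert scales with the reliability target itself: the same error ratio pages for a 99.9% service and stays quiet for a 99% one.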
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Structured logs (JSON), OpenTelemetry | Human-targeted prose logs |
| Declarative management via IaC / GitOps | Manual SSH operations |
| Runbooks in Markdown, Git-managed | Confluence and oral tradition |
| Prometheus, standard metrics | Custom monitoring schemas |
- Design dev process and ops as one — DORA measures both with one formula.
- Decide monitoring, logs, and SLOs at the most upstream point — bolting on costs 10x.
- Use SLOs and error budgets to agree on speed × reliability — 100% is not a goal.
- Produce machine-readable operational data — structured logs, IaC, Markdown runbooks.
Machine-readable operational data is the precondition for AI utilization
To delegate operational tasks to AI, data must exist in a form AI can read. Specifically: structured logs (JSON), IaC code, Markdown runbooks, and OpenTelemetry metrics. When all of these are retrievable via Git or APIs, AI can automate the sequence from failure detection to root-cause analysis to recovery proposals to execution.
Conversely, when procedures exist only in Confluence pages with embedded images or in Slack history, AI cannot reference them. Deciding at the DevOps design stage that “all processes and knowledge are managed as code or structured text” is the shortest path to AI-era operations automation.
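As one illustration of "machine-readable," the sketch below emits JSON logs using only Python's standard `logging` module. The `JsonFormatter` class, the `context` key passed through `extra`, and field names such as `order_id` are hypothetical choices for this example, not a standard schema.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields passed as extra={"context": {...}} (an assumed convention).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger("checkout", stream)
log.error("payment failed", extra={"context": {"order_id": "o-123", "latency_ms": 840}})

# A machine (or an AI agent) can parse the line back without regexes.
event = json.loads(stream.getvalue())
```

The same event written as prose ("payment for order o-123 failed after 840 ms") needs a custom parser per message; the JSON form can be filtered and aggregated by field.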
AI directly supports improving DORA metrics
The four metrics — deploy frequency, change lead time, change-failure rate, and MTTR — can be auto-measured from CI pipelines and Git logs. AI can analyze these metrics and surface patterns like “weeks with larger pull requests show lower deploy frequency” or “a specific service has a higher change-failure rate,” then propose improvements.
Author’s note — both “no monitoring” and “DevOps team” are landmines
Two canonical scenes you hear about:
First, no-monitoring operations. You inherit a production environment with no monitoring or metrics, get paged at midnight, SSH in to stare at top and tail -f, and spend three hours guessing by intuition. That is not unusual. A problem a dashboard would catch in 5 minutes takes hours, and that future is locked in the moment ops design is deferred to “later.” The February 2017 AWS S3 outage (us-east-1) is a classic case: a mistyped input to a command run during debugging took down a wide swath of SaaS, and it remains the industry’s poster child for the landmines of manual ops.
Second — DevOps team landmine. Orgs that stand up a dedicated “DevOps team” and declare “DevOps adoption” create a new silo within months, almost certainly. Dev says “the DevOps team has it”; the DevOps team says “dev won’t fix the CI”; one more wall. This is widely known as a canonical anti-pattern. DevOps is about tearing down walls, not redistributing roles. Misreading this halts the actual improvements.
Both fail by “relying on a person” or “trying to solve with the org chart.” The answer is design through code and process.
Summary
This article covered the big picture of DevOps and operations architecture — DevOps and SRE as one thing, the DORA four metrics, SLO + error budget, and AI-era machine-readable operational data.
Design dev and ops as one, decide monitoring/logs/SLOs upstream, agree on speed × reliability via the error budget, and produce machine-readable operational data. The realistic answer for 2026.
The next article covers DevOps & SRE: The Big Picture (the DORA four metrics and org strategy).
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (54/89)