About this article
As the eighth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains deploy strategy.
The mainstream culture today is to deploy frequently, in small increments, and safely. Deploy frequency, change failure rate, and recovery speed form the core of the DORA four key metrics. This article covers strategies such as Rolling, Blue-Green, Canary, Feature Flags, Shadow Deployment, and Dark Launch, along with rollback design and AI-era automated canary judgment.
What is deploy strategy
In a nutshell, deploy strategy is “the playbook for how you deliver a new version of software to the production environment.”
Think of road construction. You could shut down every lane at once and do the work (all-at-once deploy), or you could close one lane at a time and keep the rest open (rolling update). For extra caution, you could repave a side road first to check for problems before moving to the main road (canary release). Software is the same — deciding “to what scope, in what order, and with how much safety margin” you roll out a new version ahead of time is deploy strategy.
Why deploy strategy is needed
Deploy failures are the biggest incident factor
According to Google SRE reports, more than 70% of production incidents originate in changes (deploys and config changes). Deploy strategy exists to reduce such accidents. The 2012 Knight Capital incident (one of eight servers left running old code, a $440M loss in 45 minutes, effective bankruptcy) is the symbolic example of exactly the kind of deploy accident the strategies in this article are meant to prevent (details in the appendix "Critical Incident Cases").
Frequent deploys raise quality
"Big and rare" deploys mean a huge blast radius when something fails. Small and frequent deploys keep each failure's damage small and make rollback easy.
Supporting business speed
Building a new feature but being unable to release it for a month renders the work meaningless. Deploy strategy is as much a business decision as a technical one, directly linked to business outcomes.
Main strategies
There are multiple deploy-strategy patterns. Each differs in risk, cost, and complexity, so choose according to your system.
flowchart TB
subgraph BG["Blue-Green"]
BLUE[Blue<br/>v1 old] --> ROUTER1[Router] -.switch.-> GREEN[Green<br/>v2 new]
end
subgraph CN["Canary"]
ROUTER2[Router]
ROUTER2 -->|95%| OLD[v1]
ROUTER2 -->|5%->20%->50%->100%| NEW[v2]
end
subgraph FF["Feature Flag"]
APP[Production app v2<br/>feature OFF]
APP -.->|flag ON| FEATURE[unlock new feature]
end
subgraph SH["Shadow"]
REQ[production request] --> V1[v1<br/>response]
REQ -.->|parallel| V2[v2<br/>discard result]
end
classDef bg fill:#dbeafe,stroke:#2563eb;
classDef cn fill:#dcfce7,stroke:#16a34a;
classDef ff fill:#fef3c7,stroke:#d97706;
classDef sh fill:#fae8ff,stroke:#a21caf;
class BG,BLUE,GREEN,ROUTER1 bg;
class CN,OLD,NEW,ROUTER2 cn;
class FF,APP,FEATURE ff;
class SH,REQ,V1,V2 sh;
| Strategy | Content | Risk |
|---|---|---|
| Rolling Update | Sequential replacement | Mid |
| Blue-Green | Switch between 2 environments | Low, 2x cost |
| Canary | Only some users on new version | Lowest |
| Feature Flag | Code in production, feature OFF | Lowest |
| Recreate | Stop all → start new | High |
| Shadow | Run new in parallel with production | Low, verification-oriented |
Among these, Canary matters most because it can catch problems that only appear in production early. You expose the new version to a small slice of real user behavior, real data distributions, and real traffic patterns that no test environment can reproduce, and on any anomaly you retreat with a minimum of affected users. Cases where staging shows no problems but production breaks are not rare, and Canary is the last safety net against them.
Canary has corresponding implementation costs, however. You need traffic routing that splits requests by ratio (Service Mesh or weighted Ingress) and automated metric judgment that compares new and old error rates and latencies (Flagger, Argo Rollouts, and the like), so starting with Canary before you have a monitoring foundation is hard. Even so, "Canary + Feature Flag" is regarded as the modern standard, adopted as the toolkit for fine-grained control and for minimizing the impact of problems.
Rolling Update
The method of sequentially replacing the old version with the new one. It is Kubernetes's default strategy: pods are replaced with the new version a few at a time. It is simple and requires no additional infrastructure, but rollback is slow when problems occur.
| Pros | Cons |
|---|---|
| No additional infrastructure | Slow rollback |
| Simple | Long mixed old-new state |
| K8s standard | Trouble on schema changes |
Suited for minor updates but unsuited for schema changes or incompatible changes.
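The replacement loop can be sketched in a few lines of Python. This is an illustrative simulation only (the `rolling_update` function and health check are assumptions, not a Kubernetes API); real rolling updates are driven by the orchestrator:

```python
# Simplified simulation of a rolling update with maxUnavailable=1.
# Illustrative only: in reality the Kubernetes Deployment controller
# drives this process, gated by readiness probes.

def rolling_update(pods, new_version, is_healthy, max_unavailable=1):
    """Replace pods one batch at a time, aborting if a new pod is unhealthy."""
    updated = list(pods)
    for i in range(0, len(updated), max_unavailable):
        batch = range(i, min(i + max_unavailable, len(updated)))
        for j in batch:
            updated[j] = new_version          # take down old, start new
        for j in batch:
            if not is_healthy(updated[j]):    # readiness-probe equivalent
                return pods, f"aborted at pod {j}: {new_version} unhealthy"
    return updated, "success"

pods = ["v1"] * 4
result, status = rolling_update(pods, "v2", is_healthy=lambda v: v == "v2")
print(result, status)  # all four pods on v2, success
```

Note how an abort partway through leaves a mixed old-new fleet, which is exactly the "long mixed state" downside from the table above.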
Blue-Green Deployment
The method of preparing two production environments (Blue and Green) and switching between them with a load balancer. Blue runs the current version and Green the new one; if there are no problems, switch traffic to Green, and on problems simply switch back to Blue.
flowchart TB
LB([Load Balancer])
B1["Blue: v1.0<br/>all traffic"]
G1["Green: v1.1<br/>standby"]
SW{switch}
B2["Blue: v1.0<br/>standby"]
G2["Green: v1.1<br/>all traffic"]
LB --> B1
LB -.standby.-> G1
B1 --> SW
G1 --> SW
SW --> B2
SW --> G2
classDef lb fill:#fef3c7,stroke:#d97706;
classDef active fill:#dcfce7,stroke:#16a34a;
classDef standby fill:#f1f5f9,stroke:#64748b;
classDef sw fill:#fae8ff,stroke:#a21caf;
class LB lb;
class B1,G2 active;
class G1,B2 standby;
class SW sw;
| Pros | Cons |
|---|---|
| Instant switch / rollback | 2x infrastructure cost |
| Can handle DB schema changes | Care needed with shared DB |
| Usable as verification env | Data integrity issues |
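The reason switch and rollback are both instant is that traffic follows a single pointer. A minimal sketch, assuming a hypothetical `Router` class standing in for the load balancer:

```python
# Minimal Blue-Green router sketch: traffic follows one pointer,
# so cutover and rollback are both single-step operations.
# The Router class is illustrative; in practice this is a load
# balancer or DNS change.

class Router:
    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.active = "blue"          # all traffic starts on Blue

    def handle(self, request):
        return self.envs[self.active](request)

    def switch(self):                 # promote standby / roll back
        self.active = "green" if self.active == "blue" else "blue"

router = Router(blue=lambda r: f"v1:{r}", green=lambda r: f"v2:{r}")
print(router.handle("checkout"))  # v1:checkout  (Blue active)
router.switch()                   # cut over to Green
print(router.handle("checkout"))  # v2:checkout
router.switch()                   # instant rollback to Blue
print(router.handle("checkout"))  # v1:checkout
```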
Canary Deployment
The method of deploying the new version to a subset of users in phases (5% → 20% → 50% → 100%). Problems can be detected early and the number of affected users is minimized. It is regarded as the safest deploy strategy and is standard practice in SRE organizations.
| Phase | Ratio | Observation period |
|---|---|---|
| Phase 1 | 1-5% | 15 min |
| Phase 2 | 20% | 30 min |
| Phase 3 | 50% | 1 hour |
| Phase 4 | 100% | - |
Auto-monitor error rate and latency at each phase, ideally auto-rolling back on anomaly. Tools like Argo Rollouts, Flagger, Spinnaker support this.
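The phase table can be expressed as a control loop. A hedged sketch of the judgment logic (the phase list, metric names, and thresholds here are illustrative; Argo Rollouts and Flagger perform the equivalent against live monitoring data):

```python
# Canary rollout loop: increase traffic in phases, comparing new-version
# metrics against the old-version baseline between steps, and rolling
# back on the first anomaly. Thresholds (+0.5pp error rate, 1.2x P99)
# are illustrative defaults.

PHASES = [(5, 15), (20, 30), (50, 60), (100, 0)]  # (% traffic, observe minutes)

def run_canary(get_metrics, baseline):
    for percent, observe_min in PHASES:
        metrics = get_metrics(percent)  # metrics after the observation window
        if metrics["error_rate"] > baseline["error_rate"] + 0.005:
            return f"rolled back at {percent}%: error rate too high"
        if metrics["p99_ms"] > baseline["p99_ms"] * 1.2:
            return f"rolled back at {percent}%: latency regression"
    return "promoted to 100%"

baseline = {"error_rate": 0.001, "p99_ms": 200}
healthy = lambda pct: {"error_rate": 0.001, "p99_ms": 210}
print(run_canary(healthy, baseline))  # promoted to 100%
```

An unhealthy version fails at the very first 5% phase, which is the whole point: the anomaly is caught while 95% of users are still on the old version.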
Feature Flag
The method of deploying code to production while controlling each feature's ON/OFF state dynamically. It separates deploy from release, creating a state where code is "deployed but invisible to anyone." It is also usable for A/B tests, phased releases, and emergency kill switches.
| Tool | Characteristics |
|---|---|
| LaunchDarkly | Enterprise standard |
| Flagsmith | OSS version available |
| Unleash | OSS, GitLab-integrated |
| PostHog | Integrated with analytics |
| Custom implementation | Config files or DB |
Separating “code deploy” and “feature release” is the modern thinking, dramatically lowering risk.
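The "custom implementation" row in the table can be surprisingly small. A sketch assuming a hypothetical in-memory `FLAGS` store (in practice a config file or DB table); hashing the user ID gives each user a stable bucket, so percentage rollouts don't flip a user between variants on every request:

```python
# Minimal feature-flag check with a percentage rollout.
# The FLAGS store and flag name are illustrative assumptions.

import hashlib

FLAGS = {"new_checkout": {"enabled": True, "rollout_percent": 20}}

def is_enabled(flag_name, user_id):
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Stable bucket in [0, 100) derived from flag + user, so the same
    # user always gets the same answer for the same flag.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percent"]

# Roughly 20% of users see the new checkout; each user's answer is stable.
on = sum(is_enabled("new_checkout", f"user{i}") for i in range(1000))
print(on)  # close to 200
```

Flipping `rollout_percent` to 0 is the emergency stop: no redeploy needed, which is exactly the deploy/release separation described above.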
Database migration
Deploys that involve DB schema changes are the most difficult area. Special design is needed to keep code and DB consistent while changing the schema without downtime. The Expand and Contract pattern is the standard approach.
| Step | Content |
|---|---|
| Expand | Extend schema to dual-support old and new |
| Code update | Start using new column |
| Backfill | Fill new column with old data |
| Contract | Drop old column |
Trying to do everything in one release will stop production. Phased migration across multiple releases is the modern best practice.
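The four steps can be walked through concretely. A sketch using SQLite (table and column names are made up; `DROP COLUMN` requires SQLite 3.35+), showing that no single step breaks code still running the previous release:

```python
# Expand and Contract walkthrough with SQLite: rename a column
# (name -> full_name) without any single release breaking old or
# new code. Schema and data here are illustrative.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada'), ('Grace')")

# 1. Expand: add the new column; old code ignores it, nothing breaks.
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# 2. Code update: new code writes both columns (dual write).
db.execute("INSERT INTO users (name, full_name) VALUES ('Edsger', 'Edsger')")

# 3. Backfill: copy existing data into the new column.
db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# 4. Contract: once nothing reads the old column, drop it (a later release).
db.execute("ALTER TABLE users DROP COLUMN name")

print([row for row in db.execute("SELECT full_name FROM users ORDER BY id")])
# [('Ada',), ('Grace',), ('Edsger',)]
```

Because each step is individually reversible, a rollback at any point leaves a schema that both the old and new code can still read.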
Auto rollback
A mechanism that automatically reverts to the previous version when an anomaly is detected after a deploy. It does not wait for human judgment, so damage is minimized. The judgment is made automatically from monitoring metrics (error rate, latency, SLOs).
| Trigger | Example |
|---|---|
| Sudden error-rate spike | 5xx rate doubles |
| Latency degradation | P95 exceeds threshold |
| SLO violation | Burn rate exceeds 10x |
| Manual judgment | Roll back with 1 button |
Not depending on human operators is the modern approach: the system protects itself automatically, 24 hours a day.
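The trigger table maps naturally onto a pure decision function. A sketch (the metric dictionary shape and thresholds are assumptions based on the table above):

```python
# Auto-rollback decision from the trigger table: any single firing
# condition is enough to revert, with no human in the loop.
# Metric shapes and thresholds are illustrative assumptions.

def should_rollback(current, baseline, slo_burn_rate, manual=False):
    """Return (decision, reason) from monitoring metrics."""
    if manual:                                   # one-button manual trigger
        return True, "manual rollback requested"
    if current["error_rate"] >= baseline["error_rate"] * 2:
        return True, "5xx rate doubled"
    if current["p95_ms"] > baseline["p95_threshold_ms"]:
        return True, "P95 latency over threshold"
    if slo_burn_rate > 10:
        return True, "SLO burn rate over 10x"
    return False, "healthy"

baseline = {"error_rate": 0.002, "p95_threshold_ms": 300}
ok = {"error_rate": 0.002, "p95_ms": 250}
bad = {"error_rate": 0.006, "p95_ms": 250}

print(should_rollback(ok, baseline, slo_burn_rate=1.0))   # (False, 'healthy')
print(should_rollback(bad, baseline, slo_burn_rate=1.0))  # (True, '5xx rate doubled')
```

Keeping the decision a pure function of metrics is what makes it testable and therefore trustworthy enough to run unattended.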
CI/CD pipeline
Supporting deploy strategy is CI/CD (Continuous Integration / Continuous Deployment). Build pipelines that auto-build, test, and deploy when code is committed.
| Tool | Characteristics |
|---|---|
| GitHub Actions | GitHub-integrated, popular |
| GitLab CI | GitLab-integrated |
| CircleCI | Early SaaS |
| ArgoCD / Flux | GitOps K8s-specialized |
| Jenkins | Veteran, high customization |
GitOps (deploying with Git as the single source of truth) is the modern trend, and operations where a commit is a deploy are becoming mainstream.
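The GitOps model boils down to a reconciliation loop: compare the desired state committed to Git with the actual state of the cluster and converge toward Git. A minimal sketch (all names are illustrative stand-ins for what ArgoCD or Flux does against a real cluster):

```python
# Minimal GitOps reconcile loop. The states are plain dicts mapping
# service name -> version; real tools compare Kubernetes manifests.

def reconcile(desired, actual):
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, version in desired.items():
        if actual.get(name) != version:
            actions.append(f"deploy {name}={version}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions

git_state = {"api": "v1.4", "worker": "v2.0"}        # committed manifests
cluster_state = {"api": "v1.3", "frontend": "v0.9"}  # what is running

print(reconcile(git_state, cluster_state))
# ['deploy api=v1.4', 'deploy worker=v2.0', 'delete frontend']
```

This is also why rollback under GitOps is just `git revert`: the loop converges the cluster back to the previous commit the same way it converged forward.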
Progressive Delivery
The next-generation deploy strategy that combines Canary, Feature Flags, and automated judgment. Users are migrated to the new version "progressively," with each step decided automatically from observed data. Argo Rollouts and Flagger embody this thinking.
| Element | Role |
|---|---|
| Phased deployment | Gradually increase user ratio |
| Metric analysis | Auto-judge by SLI |
| Auto rollback | Instantly revert on anomaly |
| A/B test integration | Effect measurement simultaneously |
It realizes deploys without human intervention and is the standard in fully automated SRE organizations.
Decision criterion 1: traffic volume
The right strategy varies with traffic volume. At small volumes Rolling is enough; at large scale, Canary plus Progressive Delivery is required.
| Traffic | Recommended |
|---|---|
| Low (in-house tools) | Rolling Update |
| Mid (general B2C) | Blue-Green or Canary |
| Large (millions of requests per second) | Canary + Feature Flag |
| Super-large (GAFA scale) | Progressive + auto-judgment |
Decision criterion 2: risk tolerance
Industries with extremely high failure cost like finance, medical, payments need the most cautious deploys. Conversely, experimental B2C suits high-speed deploy culture.
| Risk tolerance | Recommended |
|---|---|
| High (SaaS experimental features) | Feature Flag-centric, high frequency |
| Mid (general B2C) | Canary + monitoring |
| Low (finance, medical) | Blue-Green + prior approval |
| Lowest (aviation, nuclear) | Phased release + long-term verification |
How to choose by case
Personal dev / small web service
GitHub Actions + Rolling Update (the K8s default). Build Feature Flags yourself with config files or DB columns; no additional infrastructure needed. Auto-deploy to production when CI passes, and manually revert to the previous commit on problems. That level of operation is sufficient.
Startup / growth-stage B2C SaaS
GitHub Actions + Canary (Argo Rollouts) + Feature Flags (Unleash / PostHog). Deploy in three stages (5% → 20% → 100%), auto-monitor error rate and P95 latency, and auto-rollback on anomaly. Migrate the DB without downtime via Expand and Contract.
Mid-size enterprise / many microservices
ArgoCD (GitOps) + Flagger + LaunchDarkly. Each service deploys independently, Progressive Delivery makes SLI-based automated judgments, and Feature Flags are unified org-wide. Hundreds of deploys per month become realistic.
Finance / medical / payments
Blue-Green + prior-approval workflow + Change Advisory Board. Release windows are fixed outside business hours, DB schema changes are pre-applied in separate releases, and audit logs are retained permanently. Automated handling by AI agents is considered case by case, depending on requirements.
DORA 4-metric numerical gates
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
The world standard for measuring deploy-strategy quality is the DORA 4 metrics.
| DORA metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deploy frequency | Multiple/day | Weekly-daily | Monthly-weekly | Less than monthly |
| Lead time for changes (commit → prod) | Within 1 hour | 1 day-1 week | 1 week-1 month | 1-6 months |
| Change failure rate | 0-15% | 16-30% | 16-30% | 46-60% |
| Recovery time (MTTR) | Within 1 hour | Within 1 day | 1 day-1 week | 1 week-1 month |
Canary deployment numerical gates: Phase 1 at 1-5% with 15 minutes of observation, Phase 2 at 20% for 30 minutes, Phase 3 at 50% for 1 hour, Phase 4 at 100%. Auto-rollback when the error rate exceeds the old version's by 0.5 percentage points or P99 latency exceeds 1.2x the old version's. This is the standard judgment criterion in Argo Rollouts / Flagger.
Aiming for Elite level means multiple deploys per day plus recovery within one hour; Canary + Feature Flag is the prerequisite.
Deploy-strategy pitfalls and forbidden moves
Typical accident patterns in deployment. Every one of them has the destructive power to tilt a company.
| Forbidden move | Why it’s bad |
|---|---|
| Deploy new version to all servers at once | The 2012 Knight Capital pattern (old code on 1 of 8, $440M loss in 45 min) |
| Public release without Feature Flag | All users affected on problems, no rollback |
| DB migration and code deploy simultaneously | No rollback on inconsistency. Separate via expand/contract |
| Leave Feature Flags unattended | 200 flags pile up, nobody knows which are alive, and onboarding takes half a day |
| CI tests pass = production 100% deploy | Reckless without Canary, phased deployment required |
| No auto-rollback | Damage expands waiting for human judgment, SLI-based auto-judgment required |
| Set deploy time at business peak | Maximizes incident impact, outside business hours recommended |
| Don’t leave release notes / change history | Cause-identification impossible on incidents, always record in Git and PR |
| Don’t conduct rollback drills | Can’t move when needed, quarterly drills required |
| ”Big and rare” monthly Big Bang releases | DORA Low level, one failure is fatal |
| Deploy steps in one person’s head | Full stop on owner vacation, Runbook + IaC-ization required |
| "Lower release frequency is safer" caution | Lower frequency means a larger change set per release, making the impact of a failure huge |
| "CI passed, so ship straight to production" | CI passing is only a necessary condition for quality; without Canary and phased deployment you miss bugs visible only in production |
The August 2012 Knight Capital incident (old code remained on 1 of 8 servers; unintended automated trading lost $440M in 45 minutes and the company disappeared) and the June 2021 Fastly global outage (one customer's config change took down major sites worldwide for about an hour) are the typical cases of how lax deploy strategy can erase a company.
The worst pattern is the "big and rare" deploy. Small and frequent, plus Canary, plus auto-rollback: that is the rule.
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Small changes, high-frequency deploys | Large batch deploys |
| Canary + auto-judgment | Manual judgment only |
| Feature Flag premise | Instant full publication |
| Rich auto-tests | Manual-test-centric |
- Deploy small and frequently — limit the impact of each failure and build deploy muscle memory
- Make Canary + Feature Flag standard — separate deploy and release, minimize impact
- Auto-rollback by SLI — don’t wait for human judgment, stop damage in seconds
- DB changes via Expand and Contract — never everything in one release; migrate in phases with zero downtime
What to decide - what is your project’s answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are still vague invariably invites the later question, "why did we decide this again?"
- Deploy strategy (Rolling / Blue-Green / Canary)
- CI/CD tool (GitHub Actions etc.)
- Feature Flag adoption (LaunchDarkly etc. or custom)
- Phased-release ratio (5→20→50→100)
- Auto-rollback criteria (SLI threshold)
- DB-migration strategy (Expand / Contract)
- Deploy-frequency target (daily / weekly / monthly)
Author’s note - “just one deploy” that erased a company
Cases of breaking down by underestimating deploy strategy are repeatedly carved into industry history.
The August 2012 Knight Capital incident is the starkest of these lessons. Knight Capital, a major US market maker, released a new automated-trading feature while old code remained on 1 of its 8 trading servers. The old code misinterpreted a repurposed flag, triggering massive unintended automated trades and losing roughly $440M in just 45 minutes (more than the firm's equity at the time); the company was effectively bankrupted. It is the symbolic case of a single human error, missing one server in the deploy steps, erasing a public company.
Another famous case is the June 2021 Fastly global outage. At Fastly, a major CDN vendor, a single customer's configuration change tripped a latent bug and simultaneously took down major sites worldwide, including Reddit, Amazon, UK government sites, the NYT, and CNN, for about an hour. It is cited to show the depth of modern deploy risk: even if you protect your own production, one upstream config change can stop the world.
In both cases the lethal blow was lax deploy strategy, driving home that without Canary, Feature Flags, and auto-rollback in place, a single human error links directly to corporate life or death.
Recording decision rationale
Deploy-strategy selection directly impacts incident risk and release speed, so recording why you chose that strategy as an ADR (Architecture Decision Record) is important.
| Item | Content |
|---|---|
| Title | Adopt Canary Release as deploy strategy |
| Status | Approved |
| Context | An EC site with 500K monthly active users experienced 2 full-deploy incidents over the past 6 months (total revenue impact: ~$60K). Want to limit incident blast radius while raising release frequency from weekly to daily |
| Decision | Use Canary Release (initial traffic 5% → phased expansion) as the standard deploy strategy |
| Rationale | Incident impact is limited to 5% of users, minimizing revenue loss. Error-rate and latency SLIs are monitored, with auto-rollback on threshold breach. Roughly half the infra cost of Blue-Green (no need to duplicate the entire environment) |
| Rejected alternatives | Blue-Green → Maintaining 2 full production environments adds ~$60K/year. Rolling Update → Risk of propagating to all nodes on failure is unacceptable at 500K-user scale |
| Outcome | Introduce Argo Rollouts for Canary phase control. SLI dashboard and auto-rollback rule setup are prerequisite tasks |
Store ADRs in docs/adr/ as Markdown, with a rule to always file a new ADR when changing deploy strategy - this keeps decision history traceable. The greatest value of ADRs is that when you look back later, “why we made this choice” is immediately clear.
Summary
This article covered deploy strategy, including Rolling, Blue-Green, Canary, Feature Flag, Progressive Delivery, auto-rollback, and zero-downtime DB migration.
Deploy small and frequently, make Canary+Feature Flag standard, auto-rollback by SLI, non-stop DB changes via Expand and Contract. That is the practical answer for deploy strategy in 2026.
Next time we’ll cover monitoring and observability (metrics, traces, log integration).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (61/89)