About this article
As the seventh installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains CI/CD.
The gap between Netflix deploying thousands of times per day and a team nervously shipping once a week is competitiveness itself. This article covers phased CI/CD practice (from pre-commit to production canary), the keep-PR-CI-under-10-minutes rule, expand/contract DB migrations, phased releases, and integration with IaC, GitOps, DevSecOps, and OIDC.
What is CI/CD
In a nutshell, CI/CD is "a system that automatically tests, builds, and delivers your code the moment you write it."
Think of the publishing process as an analogy. An author submits a manuscript, a proofreader checks it for errors, and if everything passes, it goes to the printing press and arrives at bookstores. CI/CD reproduces this flow for software development: the instant a developer pushes code, tests run automatically, and if nothing fails, the code is delivered to production.
| Acronym | Full name | What it means |
|---|---|---|
| CI | Continuous Integration | Auto-build/auto-test on every push |
| CD | Continuous Delivery | Maintain deploy-ready state (humans approve) |
| CD | Continuous Deployment | Full automation through production deploy |
The difference between Delivery and Deployment is whether a human approval step is involved. Regulated industries such as finance and healthcare tend to use Continuous Delivery with an approval gate, while web services and SaaS increasingly default to fully automated Continuous Deployment.
Why CI/CD matters
In a world without CI/CD, developers manually SSH into servers, copy files one by one, and run tests while ticking off a paper checklist. A midnight production deploy skips one step of the procedure, and the next morning an outage is discovered. Accidents caused by manual human work have repeated themselves endlessly at shops without CI/CD.
The benefits of adopting CI/CD are not limited to saving time. It is the fundamental mechanism that determines overall business speed.
| Benefit | Description |
|---|---|
| Early bug discovery | Tests run on every push, problem detected in minutes |
| No dependence on a specific "deploy person" | The same result is reproduced regardless of who pushes |
| Higher deploy frequency | Multiple releases per day become possible |
| Easier rollback | Roll back to previous version in minutes on problems |
| Reproducible deploys | No steps misread or skipped in manual runbooks |
The goal is a state where "merged to main" means "production-ready." With that in place, feature additions and bug fixes reach production within hours, and the development-speed gap with competitors becomes decisive. Netflix and Amazon deploying thousands of times per day would be impossible without CI/CD.
Deploy speed translates directly into business competitiveness.
Main CI/CD platforms
Multiple services and tools provide CI/CD; the choice is mostly determined by which source-code hosting you already use.
| Product | Characteristics | Suited for |
|---|---|---|
| GitHub Actions | GitHub-integrated, YAML config, overwhelming adoption | GitHub use, top candidate for new adoption |
| GitLab CI | GitLab-integrated, easy self-hosting | GitLab env, enterprise |
| CircleCI | Fast, flexible config | Large OSS, independent operation |
| AWS CodePipeline | AWS-integrated, IAM linkage | AWS-specialized env |
| Jenkins | OSS veteran, high customization | Legacy continuation, special requirements |
| Azure DevOps | Microsoft-integrated | .NET, Azure env |
For new adoption, GitHub Actions is the overwhelming front-runner: it is tightly integrated with GitHub-hosted source, configured in intuitive YAML, and backed by a Marketplace full of reusable actions. Jenkins is a veteran but carries heavy operational overhead, and there is almost no reason left to choose it for new projects.
Typical pipeline
A CI/CD pipeline runs multiple steps in order. When a step fails, the pipeline stops there and notifies immediately.
```mermaid
flowchart TB
A([git push]) --> B[Lint / Format<br/>seconds]
B --> C[Unit Test<br/>minutes]
C --> D[Build Image<br/>minutes]
D --> E[Integration Test<br/>tens of minutes]
E --> F[Security Scan<br/>SAST / SCA]
F --> G[Deploy to Staging<br/>auto]
G -->|manual approval/<br/>gate| H[Deploy to Production<br/>Canary -> 100%]
B -. fail .-> X([instant notification])
C -. fail .-> X
D -. fail .-> X
E -. fail .-> X
F -. fail .-> X
classDef step fill:#dbeafe,stroke:#2563eb;
classDef prod fill:#fef3c7,stroke:#d97706,stroke-width:2px;
classDef fail fill:#fee2e2,stroke:#dc2626;
class B,C,D,E,F,G step;
class H prod;
    class X fail;
```
The principle is to fail in the early steps (Fail Fast). If a 30-minute E2E suite runs first, you wait 30 minutes to learn about a trivial coding-convention violation. Order the steps Lint (seconds) → Unit Test (minutes) → Integration Test (tens of minutes), with the slower tests later.
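As a minimal sketch of this ordering (workflow, job, and command names are illustrative assumptions, not from the original article), a GitHub Actions pipeline can chain steps with `needs:` so a cheap failure stops everything downstream:

```yaml
# .github/workflows/ci.yml -- illustrative fail-fast ordering
name: ci
on: [push]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx biome ci .            # seconds: format and lint checks

  unit-test:
    needs: lint                        # runs only if lint passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test        # minutes: unit tests

  build-image:
    needs: unit-test                   # minutes: container build comes after tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .
```

In practice the fast jobs are usually parallelized as well; the speedup section below covers that.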
What to run in CI - phased practice
CI is not about "running everything at once on every push." A realistic approach is to split it into four phases by when a human touches the code, with a different role for each phase. Earlier phases are faster, more frequent, and local; later phases are heavier and broader.
| Phase | When it runs | What to run | Target time |
|---|---|---|---|
| 1. pre-commit | At commit creation (local) | Formatter, Lint, secret detection | Within 5s |
| 2. pre-push | Just before push (local or server) | Unit Tests near changed files | Within 30s |
| 3. PR creation/update | When pushed to GitHub | All Unit + type check + coverage + SAST + SCA | Within 10 min |
| 4. At merge | Moment merged to main | Integration Test + build + image push + IaC plan | Within 20 min |
The trick with pre-commit hooks (implemented via Husky, lefthook, or the pre-commit framework) is to narrow them down to coding conventions and secret detection only. Heavy hooks get skipped by developers, so put nothing that takes more than about 5 seconds. The standard tools at this stage are Prettier / Biome / ESLint / gitleaks.
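A minimal lefthook sketch of that idea (globs and commands are illustrative; gitleaks is the secret scanner named above):

```yaml
# lefthook.yml -- keep every hook in the seconds range
pre-commit:
  parallel: true
  commands:
    format:
      glob: "*.{ts,tsx,js,json}"
      run: npx prettier --check {staged_files}
    lint:
      glob: "*.{ts,tsx}"
      run: npx eslint {staged_files}
    secrets:
      run: gitleaks protect --staged    # blocks commits that contain credentials
```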
Keep CI runtime within 10 minutes
When the CI that runs on PR creation (phase 3) exceeds 10 minutes, development speed plummets. The rhythm of "submit the PR → switch to another task → CI finishes → request review" breaks down, and developers lose context while waiting. Google's internal research likewise shows daily commit counts dropping as CI response time grows.
| Approach | Effect | Caveats |
|---|---|---|
| Job parallelization | Lint, Unit, type check simultaneously, 3x faster | Runner billing increases linearly |
| Dependency caching | node_modules etc. become seconds from 2nd time | Wrong cache-key design pulls old dependencies |
| Run only change-scope tests | Donât run all in monorepo | Test-dependency analysis needed (Nx, Turborepo, Bazel) |
| Docker layer caching | Image build several times faster | Buildx, GHA cache config required |
| Separate slow tests into different jobs | E2E, load tests to nightly batch | Doesnât run on PR, separate route to guarantee coverage needed |
Apply them in the order parallelization → caching → change-scoped tests. If you still cannot get under 10 minutes, first check whether heavy non-test work (a full clean build every time, unneeded integration tests) has crept into CI.
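A hedged fragment showing the first two levers in GitHub Actions (job names are illustrative): `actions/setup-node` keys the dependency cache on the lockfile, and `concurrency` cancels runs made obsolete by a newer push to the same branch.

```yaml
# Fragment of .github/workflows/ci.yml -- parallel jobs plus dependency caching
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true             # drop superseded runs on the same branch

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "22", cache: "npm" }   # restores ~/.npm from cache
      - run: npm ci && npm run lint

  typecheck:                           # runs in parallel with lint and unit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "22", cache: "npm" }
      - run: npm ci && npx tsc --noEmit

  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "22", cache: "npm" }
      - run: npm ci && npm test -- --coverage
```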
Keep CI on PRs within 10 minutes. Organizations that cannot hold this line invariably see development speed decay.
What to run in CD - per-environment responsibilities
CD is not just "delivering to production"; it is a process of phased verification as a change passes through environments. When per-environment roles are vague, teams slide into the lazy assumption that "it worked in staging, so production will be fine too." Prepare at least 3 environments, ideally 4.
| Env | Purpose | Data | Access | Auto/manual |
|---|---|---|---|---|
| dev | Developer verification | Synthetic data | Dev team | Auto on main merge |
| staging | Production-equivalent integration verification | Production masked copy | QA, PM, stakeholders | Auto on main merge |
| pre-prod (canary) | Real-trial verification with partial production traffic | Production itself | Internal users, partial customers | Approval + auto |
| production | All users | Production | All customers | Auto-expansion after canary passage |
Making staging "an environment only developers touch" turns it into a formality. Staging exists so that QA, PMs, and stakeholders touch it regularly, running with data close to production. The rule is a daily-synced, masked copy of production (PII removed, payment data synthetic). Pre-prod (canary) tends to be omitted, but staging alone cannot catch problems with production's real traffic, data volume, and third-party integrations, so keep it as a separate stage for critical systems.
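One common way to implement the "Approval + auto" row, sketched here with illustrative names, is a GitHub Actions `environment` with required reviewers configured in the repository settings; the job pauses at the gate and proceeds automatically once approved:

```yaml
# CD workflow fragment -- "production" is a GitHub environment with required reviewers
deploy-production:
  needs: deploy-staging                     # staging deploy and checks must pass first
  runs-on: ubuntu-latest
  environment:
    name: production
    url: https://example.com                # illustrative URL shown on the deployment
  steps:
    - uses: actions/checkout@v4
    - run: ./scripts/deploy.sh production   # illustrative deploy script
```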
DB migration is the CD pitfall
The most accident-prone part of CD is DB schema changes. Even when code can be rolled back instantly via Blue/Green, schema changes are one-way and operate on live production data. They therefore have to move on a different lifecycle from code, in phases that preserve backward compatibility. This is called the expand/contract pattern.
```
[Renaming column user_name to full_name]
1. expand  : add the full_name column (write to both, keep reading user_name)  → deploy 1
2. backfill: copy user_name values into full_name (batch job)
3. switch  : switch reads over to full_name                                    → deploy 2
4. contract: drop the user_name column                                         → deploy 3
```
Concrete migration fragment example (PostgreSQL):
```sql
-- 1. expand: add the new column as NULL-able (instant, avoids a long lock)
ALTER TABLE users ADD COLUMN full_name TEXT;
-- 2. backfill: put values in existing rows (split batches to suppress prod load)
UPDATE users SET full_name = user_name
WHERE full_name IS NULL AND id BETWEEN $1 AND $2;
-- 3. switch: after app-side release, guarantee integrity
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
-- 4. contract: drop old column (after confirming new code 100% running)
ALTER TABLE users DROP COLUMN user_name;
```
Rewrite the application side in sync with these stages.
```typescript
// Just after expand: dual write (old code that still reads user_name keeps working)
await db.update(users).set({
user_name: name,
full_name: name,
}).where(eq(users.id, userId));
// after switch deploy: write only to new column
await db.update(users).set({ full_name: name }).where(eq(users.id, userId));
```
The reason for splitting into 4 stages is so that nothing breaks at any moment while old and new code run simultaneously. During a rolling update, old and new instances coexist for minutes to tens of minutes; unless the columns the old code reads and the columns the new code writes both exist during that window, production goes down.
Do not finish a schema change in one shot. The rule is to split it across 3 deploys.
DB-migration forbidden moves and operation
| Forbidden move | Why itâs bad |
|---|---|
| One-shot column rename | Instances running on old code instantly die |
| Immediate add of NOT NULL column | Constraint violation on existing records, migration failure |
| Adding an index without an online method | Writes are locked on large tables |
| Putting breaking schema changes and code changes in the same deploy | The schema cannot be rolled back when the code is rolled back |
| Expecting too much: "adopting CI/CD will reduce bugs" | Without tests, bugs simply flood production at high speed |
| Complacency: "IAM keys in Secrets are safe" | Secrets leak too; use OIDC integration so long-term keys are not needed at all |
Run migrations as an independent step early in the CD pipeline, completed before the code deploy. Use CREATE INDEX CONCURRENTLY on PostgreSQL and pt-online-schema-change or gh-ost on MySQL; operations that do not lock production tables are a hard requirement.
Migration tools such as Flyway, Liquibase, and Prisma Migrate are the standard, with changes managed as SQL files (or a declarative schema) in the codebase. Keeping them in Git and reviewing them in PRs is the modern norm; GUI work by a DBA leaves no audit trail. On large tables a backfill batch can run for hours to days, so in practice migration schedules are managed separately on a release calendar.
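As a sketch of "migration as an independent step that completes before the code deploy" (Prisma Migrate and all names here are illustrative choices, not a prescription):

```yaml
# CD workflow fragment -- schema change ships before the code rollout
migrate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with: { node-version: "22", cache: "npm" }
    - run: npm ci
    - run: npx prisma migrate deploy        # applies pending migrations from the repo
      env:
        DATABASE_URL: ${{ secrets.DATABASE_URL }}

deploy:
  needs: migrate                            # code rolls out only after the expand step exists
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: ./scripts/deploy.sh production   # illustrative deploy script
```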
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| GitHub Actions + standard YAML | Jenkins custom DSL / commercial custom config |
| OIDC integration (no long-term keys) | IAM keys saved in Secrets |
| Trunk-Based + Feature Flag | Long-life branches / manual release |
| AI-supporting code-generation/review integration | Private CI untouchable by AI agents |
- Choose CI/CD tightly coupled with source-code hosting â GitHub Actions for GitHub
- Make OIDC integration mandatory, exclude long-term IAM keys from Secrets
- Trunk-Based + Feature Flag to separate release and deploy
- Build IaC + DevSecOps into pipeline (shift-left)
Phased release - concrete percentages and gates
Canary releases tend to be explained as "release to a portion of users first and watch," but if "what percentage, what to look at, and how to advance" stays vague, they do not work. In practice, auto-advance through roughly the following ratios, judging each stage on metrics: error rate, latency, and business KPIs.
| Phase | Ratio | Observation time | Auto-rollback condition examples |
|---|---|---|---|
| Canary 1 | 1% | 15 min | Error rate > old version + 0.5% |
| Canary 2 | 5% | 30 min | p99 latency > old version x 1.2 |
| Canary 3 | 25% | 1 hour | Business metric (CVR etc.) > old version - 5% |
| Canary 4 | 50% | 2 hours | Any of the above 3 |
| Full deploy | 100% | - | - |
The rule is to set auto-rollback thresholds as a relative comparison against the old version (the baseline). Absolute thresholds fire falsely on time-of-day variation. AWS CodeDeploy, Argo Rollouts, and Flagger ship with this metric-linked, automated phased deployment built in.
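A minimal Argo Rollouts sketch of such a progression (weights, pause durations, and the analysis template name are illustrative; the pod template and selector are omitted for brevity):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service                      # illustrative name
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1                  # Canary 1: 1% of traffic
        - pause: { duration: 15m }
        - analysis:                     # metric gate compared against the baseline
            templates:
              - templateName: error-rate-vs-baseline   # assumed AnalysisTemplate
        - setWeight: 5                  # Canary 2
        - pause: { duration: 30m }
        - setWeight: 25                 # Canary 3
        - pause: { duration: 1h }
        - setWeight: 50                 # Canary 4
        - pause: { duration: 2h }
# A failed analysis aborts the rollout and shifts traffic back to the stable version.
```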
Relying on human eyeballs alone means nobody can stop a rollout during a middle-of-the-night incident. Mechanize the gates: keep approval where it is needed, but let machines judge first. That is modern CD design.
Branch strategy
Branch strategy is closely related to CI/CD. Choose one based on team size, release frequency, and the nature of the product.
| Strategy | Characteristics | Suited for |
|---|---|---|
| GitHub Flow | main + feature branches | Web services / continuous deploy |
| Git Flow | main + develop + release + hotfix | Product type / version-management type |
| Trunk-Based Development | Short-lived branches or direct pushes to main | Mature CI/CD / experienced teams |
In modern web development, GitHub Flow or Trunk-Based Development is mainstream. The more complex Git Flow is effective only for cases such as packaged products or on-prem distribution where per-version management is required; for continuously deployed SaaS it is just excess complexity.
Branch strategy should favor simplicity. Complex Git Flow is now in the minority.
Test-automation levels
Tests come in multiple levels with different roles. The point is not to write every kind everywhere but to distribute an appropriate amount at each level.
| Test | Speed | Scope | Ratio guideline |
|---|---|---|---|
| Unit Test | Fast (seconds) | Function/class unit | 70% |
| Integration Test | Mid (minutes) | Module integration | 20% |
| E2E Test | Slow (minutes-tens of minutes) | User-operation reproduction | 10% |
This ratio is known as the test pyramid. The principle is many Unit Tests and few E2E Tests. E2E tests are slow, flaky, and brittle, so narrow them down to the essential user flows. Always run Unit and Integration tests in CI; the typical setup runs E2E nightly against the staging environment.
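A hedged sketch of the nightly E2E job (the schedule, Playwright, and the staging URL are illustrative assumptions):

```yaml
# .github/workflows/e2e-nightly.yml -- nightly E2E against staging
name: e2e-nightly
on:
  schedule:
    - cron: "0 18 * * *"               # once a day; pick a quiet hour for staging
  workflow_dispatch:                   # also allow manual runs

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "22", cache: "npm" }
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
        env:
          BASE_URL: https://staging.example.com   # illustrative staging URL
```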
Deploy strategy
There are multiple ways to deploy. The rule: the bigger the impact of a release, the more phased and cautious the rollout should be.
| Strategy | Mechanism | Characteristics |
|---|---|---|
| Rolling Update | Sequential replacement | Simple, no downtime |
| Blue/Green | Verify in new env, switch all at once | Instant rollback, 2x cost |
| Canary Release | Pre-deploy to partial users | Localize risk |
| Feature Flag | Code in production but switch feature ON/OFF | Separate deploy and release |
Roll out large features and irreversible changes (DB migrations etc.) in phases with Blue/Green or Canary. The point: when something goes wrong, you move traffic back, not the code.
Canary and Feature Flag
Canary Release and Feature Flag are the two most-emphasized risk-management approaches today.
Canary Release:
```
Users
|- 95% --> [old version v1.0]
|- 5%  --> [new version v1.1]  <- expand gradually if no problems
```
Feature Flag:
```typescript
if (flag.newCheckout) {
renderV2()
} else {
renderV1()
}
```
A Feature Flag is a mechanism that deploys code to production while keeping the feature switchable ON/OFF via a flag. It severs the "deploy = release" coupling, separating code distribution from feature publication. Dedicated SaaS products such as LaunchDarkly, Unleash, and GrowthBook are widely used.
The essence of phased deployment is avoiding "everything breaks at once" and enabling "detect the problem on a small slice and roll back instantly."
Infrastructure as Code (IaC)
IaC is the approach of defining and managing infrastructure (servers, networks, databases, etc.) as code. Abolishing manual construction dramatically improves reproducibility, auditability, and change management.
| Tool | Characteristics |
|---|---|
| Terraform | Multi-cloud support, de facto standard |
| OpenTofu | Terraform's OSS fork (independent in 2024) |
| AWS CloudFormation | AWS-native, JSON/YAML |
| Azure Bicep | Azure-native, ARM successor |
| Pulumi | Written in TypeScript / Python etc. |
| AWS CDK | Define AWS in TypeScript etc. |
The standard is Terraform (or its community fork, OpenTofu). It supports multiple clouds, has a large community, and can manage almost every cloud resource as code. Building it into the CI/CD pipeline so that infrastructure changes are also reviewed via Pull Request is the modern rule.
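A hedged sketch of running the plan as a PR check (the `infra/` directory and workflow name are assumptions); cloud credentials for `init`/`plan` would come from the OIDC integration covered later in this article:

```yaml
# .github/workflows/terraform-plan.yml -- review infrastructure changes in the PR
name: terraform-plan
on:
  pull_request:
    paths: ["infra/**"]                 # assumed Terraform directory

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check
      - run: terraform init -input=false
      - run: terraform plan -input=false -no-color   # reviewers read the plan output
```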
GitOps
GitOps is a method in which the Git repository is the Single Source of Truth and pushes to Git are automatically reflected in the infrastructure. It is rapidly becoming the standard, especially in Kubernetes operations.
```
[Git push]
|
[ArgoCD / Flux] <- continuously monitors Git repos
|
[Kubernetes] <- auto-converges to Git contents
```
| Feature | Content |
|---|---|
| Declarative | Define the desired state and let the system converge to it |
| Audit-ability | All changes in Git history, clear who changed what when |
| Easy rollback | Revert with just git revert |
| Permission management | Git access permissions become deploy permissions |
In Kubernetes environments, Argo CD and Flux are the two major GitOps tools. They are close to standard in current K8s operations, and the move to abolish manual kubectl apply is spreading.
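A minimal Argo CD Application sketch (repository URL, path, and namespace are illustrative); the controller keeps the cluster converged to whatever is committed in Git:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests   # illustrative manifest repo
    targetRevision: main
    path: apps/my-service
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual kubectl drift back to the Git state
```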
DevSecOps (security integration)
DevSecOps is the practice of building security checks into the CI/CD pipeline. Through shift-left (finding problems at an early stage), vulnerabilities are crushed before they reach production.
| Target | Tool example |
|---|---|
| SAST (static analysis) | SonarQube / CodeQL / Semgrep |
| SCA (dependencies) | Dependabot / Snyk / Renovate |
| DAST (dynamic analysis) | OWASP ZAP / Burp Suite |
| Container scan | Trivy / Grype / Docker Scout |
| IaC scan | Checkov / tfsec / Terrascan |
GitHub's Dependabot is free, automatically detects vulnerabilities in dependency libraries, and even opens PRs to update them. Enabling it alone is highly worthwhile; treat it as the required minimum for vulnerability countermeasures.
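A minimal Dependabot configuration sketch (the ecosystems and cadence are illustrative):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions"   # keep workflow actions patched as well
    directory: "/"
    schedule:
      interval: "weekly"
```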
Building security into the pipeline is the modern baseline.
Secret management
Never put API keys or passwords into CI/CD configuration in plaintext. If they leak, they are abused almost instantly. Use a proper management method.
| Method | Characteristics |
|---|---|
| GitHub Actions Secrets | Standard feature, simple |
| OIDC integration (GitHub <-> AWS / GCP) | No long-term keys needed, top recommendation |
| HashiCorp Vault | Multi-env integration |
| AWS Secrets Manager | AWS-native, auto-rotation |
The current front-runner is OIDC (OpenID Connect, the standard authentication protocol built on OAuth 2.0) integration. When accessing AWS from GitHub Actions, instead of storing long-term IAM access keys, the workflow exchanges GitHub's own identity token for temporary AWS credentials. Because the risk of leaking IAM keys disappears, it is overwhelmingly superior from a security standpoint.
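A sketch of the GitHub Actions side of AWS OIDC integration (the role ARN and region are placeholders; AWS must also be configured with a GitHub OIDC identity provider and a trust policy on the role):

```yaml
# Workflow fragment -- no long-term IAM keys stored anywhere
permissions:
  id-token: write        # lets the job request a GitHub OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy   # placeholder
          aws-region: ap-northeast-1
      - run: aws s3 ls   # subsequent AWS calls use short-lived credentials
```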
OIDC is the current standard. Storing long-term keys in Secrets is an outdated practice.
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are vague always invites the later question, "why did we decide it this way again?"
- CI/CD platform (GitHub Actions / GitLab CI etc.)
- Branch strategy (GitHub Flow / Trunk-Based / Git Flow)
- What to run at CI stages (pre-commit / pre-push / PR / merge)
- CI target time on PR (within 10 min recommended)
- Env composition (dev / staging / canary / production) and per-env responsibilities
- DB-migration strategy (expand / contract, tool selection)
- Phased-release ratios, observation times, auto-rollback gates
- IaC tool (Terraform / CloudFormation / CDK)
- Secret-management method (OIDC recommended)
- Rollback procedures and drill frequency
Author's note - Knight Capital's 45 minutes
On August 1, 2012, the US securities firm Knight Capital rolled out a new trading system, but old code was left running on 1 of its 8 servers, which mass-issued erroneous orders. Roughly $460M was lost in 45 minutes, and the company effectively ceased to exist. It is a widely told industry case.
The lesson: an operation that allows some servers to be in a different state is itself the risk. Leaving deploy steps to manual human work will eventually produce exactly this kind of one-server mistake. Implementing the basics, deploying every server with the same artifact and the same procedure via CI/CD, is the minimum line of defense.
Humans are creatures who occasionally forget just 1 server.
Summary
This article covered CI/CD: phased practice, keeping PR CI within 10 minutes, DB migrations, phased releases, IaC, GitOps, DevSecOps, and OIDC integration.
Anchor on GitHub Actions + Trunk-Based Development + OIDC + IaC + Feature Flags, keep PR CI within 10 minutes, run DB migrations via expand/contract, and mechanize the phased-release gates. That is the practical answer for CI/CD design in 2026.
Next time we'll cover deploy strategy (Blue/Green, Canary, and Feature Flag in detail).
I hope youâll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (60/89)