[DevOps Architecture] CI/CD - GitHub Actions + OIDC + Feature Flag Is the Standard

About this article

As the seventh installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains CI/CD.

The gap between Netflix deploying thousands of times per day and a team nervously shipping once a week is competitiveness itself. This article covers phased CI/CD practice (from pre-commit to production canary), the 10-minute PR rule, expand/contract for DB migrations, phased releases, and IaC, GitOps, DevSecOps, and OIDC integration.

What is CI/CD

In a nutshell, CI/CD is “a system that automatically tests, builds, and distributes your code the moment you write it.”

Think of the publishing process as an analogy. An author submits a manuscript, the proofreader checks for errors, and if everything passes, it goes to the printing press and arrives at bookstores. CI/CD reproduces this flow for software development — the instant a developer saves code, tests run automatically, and if nothing fails, the code is delivered to production.

| Acronym | Full name | Content |
| --- | --- | --- |
| CI | Continuous Integration | Auto-build and auto-test on every push |
| CD | Continuous Delivery | Keep the code in a deploy-ready state (humans approve) |
| CD | Continuous Deployment | Full automation through to the production deploy |

The difference between Delivery and Deployment is whether a human approval step is involved. Regulated industries such as finance and healthcare use the approval-gated Delivery style, while in web services and SaaS the fully automated Deployment style is mainstream.

Why CI/CD matters

In a world without CI/CD, developers SSH into servers by hand, copy files one by one, and run tests while eyeballing a paper checklist. A midnight production deploy skips one step of the procedure, and the outage is discovered the next morning. Accidents caused by manual human work have repeated endlessly at shops without CI/CD.

The benefits of adopting CI/CD are not limited to saving time. It is the fundamental mechanism that determines overall business speed.

| Benefit | Content |
| --- | --- |
| Early bug discovery | Tests run on every push; problems are detected in minutes |
| No person-dependent deploys | The same result is reproduced regardless of who pushes |
| Higher deploy frequency | Multiple releases per day become possible |
| Easier rollback | Roll back to the previous version in minutes when problems occur |
| Reproducibility in production | No misread runbooks |

The goal is the state where “main merge = production-ready.” With this, feature additions and bug fixes reach production within hours, and the development-speed gap with competitors becomes decisive. Netflix and Amazon could not deploy thousands of times per day without CI/CD.

Deploy speed directly links to business competitiveness.

Main CI/CD platforms

Multiple services and tools provide CI/CD; the choice is mostly determined by which source-code hosting you pair it with.

| Product | Characteristics | Suited for |
| --- | --- | --- |
| GitHub Actions | GitHub-integrated, YAML config, overwhelming adoption | GitHub users, top candidate for new adoption |
| GitLab CI | GitLab-integrated, easy self-hosting | GitLab environments, enterprise |
| CircleCI | Fast, flexible config | Large OSS, independent operation |
| AWS CodePipeline | AWS-integrated, IAM linkage | AWS-specialized environments |
| Jenkins | OSS veteran, highly customizable | Legacy continuation, special requirements |
| Azure DevOps | Microsoft-integrated | .NET, Azure environments |

For new adoption, GitHub Actions is the overwhelming front-runner. It integrates tightly with GitHub-hosted source, its YAML configuration is intuitive, and the Marketplace is rich in reusable actions. Jenkins is a veteran but heavy in operational effort, and there is almost no reason left to choose it for new adoption.

Typical pipeline

A CI/CD pipeline runs multiple steps in order. If a step fails, the pipeline stops there and notifies immediately.

flowchart TB
    A([git push]) --> B[Lint / Format<br/>seconds]
    B --> C[Unit Test<br/>minutes]
    C --> D[Build Image<br/>minutes]
    D --> E[Integration Test<br/>tens of minutes]
    E --> F[Security Scan<br/>SAST / SCA]
    F --> G[Deploy to Staging<br/>auto]
    G -->|manual approval/<br/>gate| H[Deploy to Production<br/>Canary -> 100%]
    B -. fail .-> X([instant notification])
    C -. fail .-> X
    D -. fail .-> X
    E -. fail .-> X
    F -. fail .-> X
    classDef step fill:#dbeafe,stroke:#2563eb;
    classDef prod fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    classDef fail fill:#fee2e2,stroke:#dc2626;
    class B,C,D,E,F,G step;
    class H prod;
    class X fail;

The principle is to fail in the earliest step possible (Fail Fast). If a 30-minute E2E suite runs first, you wait 30 minutes to learn of a trivial coding-convention violation. Order the steps seconds-level Lint → minutes-level Unit Tests → tens-of-minutes Integration Tests, with slower tests later.
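
A minimal sketch of this ordering as a GitHub Actions workflow (the job names and npm scripts are illustrative assumptions, not prescriptions):

name: ci
on: push
jobs:
  lint:                      # seconds: fail fast on convention violations
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run lint
  unit-test:
    needs: lint              # minutes: runs only if lint passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test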

What to run in CI - phased practice

CI is not about running everything at once on every push. It is realistic to split it into four phases by when a human touches the code, with a different role per phase. Earlier phases are faster, more frequent, and local; later phases are heavier and broader.

| Phase | When it runs | What to run | Target time |
| --- | --- | --- | --- |
| 1. pre-commit | At commit creation (local) | Formatter, lint, secret detection | Within 5 s |
| 2. pre-push | Just before push (local or server) | Unit tests near the changed files | Within 30 s |
| 3. PR creation/update | When pushed to GitHub | All unit tests + type check + coverage + SAST + SCA | Within 10 min |
| 4. At merge | The moment it merges to main | Integration tests + build + image push + IaC plan | Within 20 min |

The trick with pre-commit (implemented via Husky, lefthook, or the pre-commit framework) is to narrow it down to coding conventions and secret-leak detection only. Developers skip heavy hooks, so don’t put anything over 5 seconds here. The standards at this stage are Prettier / Biome / ESLint / gitleaks.
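
A sketch of such a hook in lefthook.yml, assuming Biome and gitleaks (the globs and commands are illustrative):

# lefthook.yml - keep everything here under the 5-second budget
pre-commit:
  parallel: true
  commands:
    check:
      glob: "*.{ts,tsx,js}"
      run: npx biome check --write {staged_files}   # format + lint staged files only
    secrets:
      run: gitleaks protect --staged                # block commits that contain secrets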

Keep CI runtime within 10 minutes

When CI on PR creation (phase 3) exceeds 10 minutes, development speed plummets. The rhythm of “submit PR → switch to another task → CI finishes → request review” breaks down, and developers lose context while waiting. Google’s internal research also shows daily commit counts decreasing as CI response time lengthens.

| Approach | Effect | Caveats |
| --- | --- | --- |
| Job parallelization | Lint, unit tests, and type check run simultaneously, roughly 3x faster | Runner billing increases linearly |
| Dependency caching | node_modules etc. restore in seconds from the second run | A wrong cache-key design pulls in stale dependencies |
| Run only change-scoped tests | Avoid running everything in a monorepo | Requires test-dependency analysis (Nx, Turborepo, Bazel) |
| Docker layer caching | Image builds several times faster | Requires Buildx / GHA cache configuration |
| Split slow tests into separate jobs | Move E2E and load tests to a nightly batch | They don’t run on PRs; coverage must be guaranteed by another route |

Apply these in the order parallelization → caching → change-scope limiting. If you still can’t get under 10 minutes, first check whether heavy non-test processing (a full clean build every time, unneeded integration tests) has crept into CI.
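
Parallelization needs no special syntax: GitHub Actions jobs declared without needs: run concurrently by default. Caching is where the caveat above bites. A sketch of a cache step whose key embeds the lockfile hash, so a dependency change invalidates the cache instead of restoring stale modules (the paths and key names are illustrative):

- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}  # lockfile change => new key => fresh install
    restore-keys: |
      npm-${{ runner.os }}-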

CI on PR within 10 minutes. Organizations that can’t keep this inevitably see development speed decay.

What to run in CD - per-environment responsibilities

CD is not just “delivering to production”; it is a process of phased verification as changes pass through environments. When per-environment roles are vague, teams slide into the lazy assumption that “it worked in staging, so production will be fine too.” Prepare at least three environments, ideally four.

| Env | Purpose | Data | Access | Auto/manual |
| --- | --- | --- | --- | --- |
| dev | Developer verification | Synthetic data | Dev team | Auto on main merge |
| staging | Production-equivalent integration verification | Masked copy of production | QA, PM, stakeholders | Auto on main merge |
| pre-prod (canary) | Real trial with part of production traffic | Production itself | Internal users, some customers | Approval + auto |
| production | All users | Production | All customers | Auto-expansion after canary passes |

Making staging “an environment only developers touch” reduces it to a formality. Staging exists so that QA and PM touch it regularly, operating on data close to production. The rule is a daily-synced masked copy of production (PII removed, payment info synthetic). Pre-prod (canary) tends to be omitted, but staging alone cannot catch problems with production’s real traffic, data volume, and third-party integrations, so split it out for critical systems.
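
One way to encode these responsibilities in GitHub Actions is through environments. A sketch (the job names and deploy script are hypothetical; the approval gate is the "required reviewers" setting on the production environment in the repository settings):

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging           # deploys automatically on main merge
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh staging   # hypothetical deploy script
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production        # required reviewers here = the "approval + auto" gate
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh production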

DB migration is the CD pitfall

The most accident-prone part of CD is DB schema changes. Even when code can be rolled back instantly via Blue/Green, schema changes are one-way and operate on live production data. So advance them on a different lifecycle from code, in phases that preserve backward compatibility. This is called the expand/contract pattern.

[Renaming column user_name to full_name]
1. expand : add full_name column (write both, read user_name) ← deploy 1
2. backfill: copy user_name values to full_name (batch)
3. switch : switch reading to full_name                        ← deploy 2
4. contract: drop user_name column                              ← deploy 3

Concrete migration fragment example (PostgreSQL):

-- 1. expand: add new column NULL-allowed (instant, avoid lock)
ALTER TABLE users ADD COLUMN full_name TEXT;

-- 2. backfill: put values in existing rows (split batches to suppress prod load)
UPDATE users SET full_name = user_name
WHERE full_name IS NULL AND id BETWEEN $1 AND $2;

-- 3. switch: after app-side release, guarantee integrity
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;

-- 4. contract: drop old column (after confirming new code 100% running)
ALTER TABLE users DROP COLUMN user_name;

Rewrite the application side in step as well.

// Assumed setup: Drizzle ORM, with a users table whose columns are user_name and full_name
import { eq } from "drizzle-orm";
import { db } from "./db";        // the app's database client (assumed module)
import { users } from "./schema"; // the table definition (assumed module)

// just after expand: dual writes (won't break even if read by old code)
await db.update(users).set({
  user_name: name,
  full_name: name,
}).where(eq(users.id, userId));

// after the switch deploy: write only to the new column
await db.update(users).set({ full_name: name }).where(eq(users.id, userId));

The reason to split into four stages is so that nothing breaks at any moment while old and new code run simultaneously. During a rolling update, old and new instances coexist for minutes to tens of minutes; unless “the column old code reads” and “the column new code writes” both exist, production dies.

Never finish a schema change in one shot. The rule is to split it across three deploys.

DB-migration forbidden moves and operation

Forbidden moveWhy it’s bad
One-shot column renameInstances running on old code instantly die
Immediate add of NOT NULL columnConstraint violation on existing records, migration failure
Add index without onlineWrites locked on large tables
Put breaking changes and code changes in same deploySchema can’t roll back on rollback
”CI/CD adoption reduces bugs” — expecting too muchWithout tests, bugs just flood production at high speed
”IAM keys in Secrets are safe” — complacencySecrets also leak. Make long-term keys themselves unnecessary via OIDC integration

Run the migration as an independent step early in the CD pipeline, completing it before the code deploy. PostgreSQL has CREATE INDEX CONCURRENTLY; MySQL has pt-online-schema-change and gh-ost. Operations that don’t lock production tables are a requirement.

Migration tools such as Flyway, Liquibase, and Prisma Migrate are standard, with changes managed as SQL files (or a declarative schema) in code. Putting them in Git and reviewing them in PRs is the modern standard; avoid DBA GUI work that leaves no audit trail. On large tables a backfill batch can run for hours to days, so the practice is to manage migration schedules separately on a release calendar.
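
A sketch of “migration as an independent step that completes before the code deploy,” in GitHub Actions terms (the Prisma command is one example; the Flyway and Liquibase CLIs slot in the same way, and deploy.sh is hypothetical):

jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx prisma migrate deploy   # applies pending migrations; a failure stops the pipeline here
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
  deploy:
    needs: migrate                       # code ships only after the schema step succeeds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh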

AI decision axes

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| GitHub Actions + standard YAML | Jenkins custom DSL / proprietary commercial config |
| OIDC integration (no long-term keys) | IAM keys saved in Secrets |
| Trunk-Based + Feature Flag | Long-lived branches / manual releases |
| Code-generation/review integration that supports AI | Closed CI that AI agents cannot touch |

  1. Choose CI/CD tightly coupled with source-code hosting — GitHub Actions for GitHub
  2. Make OIDC integration mandatory, exclude long-term IAM keys from Secrets
  3. Trunk-Based + Feature Flag to separate release and deploy
  4. Build IaC + DevSecOps into pipeline (shift-left)

Phased release - concrete percentages and gates

Canary tends to be explained as “release to a portion of users first and watch,” but it doesn’t function if “what percentage, what to watch, and how to advance” are left vague. In practice, auto-progress through the following ratios, judging each stage on metrics: error rate, latency, and business metrics.

| Phase | Ratio | Observation time | Auto-rollback condition example |
| --- | --- | --- | --- |
| Canary 1 | 1% | 15 min | Error rate > old version + 0.5% |
| Canary 2 | 5% | 30 min | p99 latency > old version x 1.2 |
| Canary 3 | 25% | 1 hour | Business metric (CVR etc.) < old version - 5% |
| Canary 4 | 50% | 2 hours | Any of the three above |
| Full deploy | 100% | - | - |

The rule is to set auto-rollback thresholds as relative comparisons against the old version (the baseline). Absolute thresholds false-fire on time-of-day variation. AWS CodeDeploy, Argo Rollouts, and Flagger ship with this metric-linked automatic phased deployment as a standard feature.

Relying only on human visual judgment means no one can stop a midnight incident. Mechanize the gates: don’t omit approval, but have machines judge first. That is modern CD design.
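
In Argo Rollouts, for example, the table above maps almost directly onto a canary strategy. A sketch trimmed to the strategy section (the analysis template name is hypothetical; it would hold the baseline-relative error-rate query):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: error-rate-vs-baseline   # hypothetical AnalysisTemplate
        startingStep: 1                            # evaluate metrics from the first pause onward
      steps:
        - setWeight: 1
        - pause: { duration: 15m }
        - setWeight: 5
        - pause: { duration: 30m }
        - setWeight: 25
        - pause: { duration: 1h }
        - setWeight: 50
        - pause: { duration: 2h }   # then 100% if the analysis never failed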

Branch strategy

Closely related to CI/CD is branch strategy. Choose according to team scale, release frequency, and product nature.

| Strategy | Characteristics | Suited for |
| --- | --- | --- |
| GitHub Flow | main + feature branches | Web services / continuous deployment |
| Git Flow | main + develop + release + hotfix | Products / version-managed releases |
| Trunk-Based Development | Short-lived feature branches, direct pushes to main | Mature CI/CD / expert teams |

In modern web development, GitHub Flow or Trunk-Based Development is mainstream. Complex Git Flow is effective only where version-by-version management is needed, such as packaged products or on-prem distribution; for continuously deployed SaaS it is just excess complexity.

Branch strategy is simplicity-first. Complex Git Flow is now a minority.

Test-automation levels

Tests come in multiple levels with different roles. The point is not to write every kind everywhere but to distribute an appropriate amount at each level per use case.

| Test | Speed | Scope | Ratio guideline |
| --- | --- | --- | --- |
| Unit Test | Fast (seconds) | Function/class | 70% |
| Integration Test | Medium (minutes) | Module integration | 20% |
| E2E Test | Slow (minutes to tens of minutes) | User-operation reproduction | 10% |

This ratio is called the test pyramid. The principle: many unit tests, few E2E tests. E2E tests are slow, unstable, and brittle, so narrow them to essential user flows. The typical composition always runs unit + integration in CI and runs E2E nightly against the staging environment.
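
The nightly E2E run is just a scheduled workflow. A sketch (the cron time, Playwright, and the staging-URL variable are assumptions):

name: nightly-e2e
on:
  schedule:
    - cron: "0 18 * * *"   # daily; note that cron runs in UTC
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright test   # assumed E2E suite, pointed at staging
        env:
          BASE_URL: ${{ vars.STAGING_URL }}   # hypothetical repository variable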

Deploy strategy

There are multiple ways to “deploy.” The rule: the bigger a release’s impact, the more phased and cautious the rollout.

| Strategy | Mechanism | Characteristics |
| --- | --- | --- |
| Rolling Update | Replace instances sequentially | Simple, no downtime |
| Blue/Green | Verify in a new environment, switch all at once | Instant rollback, 2x cost |
| Canary Release | Deploy to a subset of users first | Localizes risk |
| Feature Flag | Code is in production; features toggle ON/OFF | Separates deploy from release |

Roll out large features and irreversible changes (DB migrations etc.) in phases via Blue/Green or Canary. The point: when problems occur, you move traffic back, not code.

Canary and Feature Flag

Canary Release and Feature Flag are the two risk-management approaches most emphasized today.

Canary Release:
  Users
   |- 95% --> [old version v1.0]
   |- 5%  --> [new version v1.1]   <- expand gradually if no problems

Feature Flag:
  if (flag.newCheckout) {
    renderV2()
  } else {
    renderV1()
  }

A Feature Flag is the mechanism of deploying code to production while keeping each feature switchable ON/OFF by a flag. It severs the joined-fate relationship of “deploy = release,” separating code distribution from feature publication. Dedicated SaaS such as LaunchDarkly, Unleash, and GrowthBook are widely used.

The essence of phased deployment is avoiding “everything breaks at once” in favor of “detect on a small slice and roll back instantly.”

Infrastructure as Code (IaC)

IaC is the approach of defining and managing infrastructure (servers, network, DB etc.) as code. Abolishing manual construction dramatically improves reproducibility, auditability, and change management.

| Tool | Characteristics |
| --- | --- |
| Terraform | Multi-cloud support, de facto standard |
| OpenTofu | OSS fork of Terraform (independent since 2024) |
| AWS CloudFormation | AWS-native, JSON/YAML |
| Azure Bicep | Azure-native, ARM successor |
| Pulumi | Written in TypeScript / Python etc. |
| AWS CDK | Define AWS in TypeScript etc. |

The standard is Terraform (or its fork OpenTofu). It supports multiple clouds, has a large community, and can manage almost every cloud resource as code. The modern rule is to build it into the CI/CD pipeline so infrastructure changes are also reviewed via Pull Request.
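
Reviewing infrastructure changes via PR usually means running terraform plan in CI. A sketch (the workflow name is illustrative; the OIDC permissions tie into the secret-management section below):

name: terraform-plan
on: pull_request
permissions:
  id-token: write   # for OIDC cloud credentials (see secret management below)
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -no-color   # the plan output is reviewed alongside the PR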

GitOps

GitOps is the method where the Git repo is the Single Source of Truth, and a push to Git is automatically reflected in the infrastructure. Especially in Kubernetes operations it is rapidly becoming the standard.

[Git push]
    |
[ArgoCD / Flux]        <- continuously monitors Git repos
    |
[Kubernetes]           <- auto-converges to Git contents

| Feature | Content |
| --- | --- |
| Declarative | Define the desired state and converge to it |
| Auditability | All changes are in Git history; who changed what, and when, is clear |
| Easy rollback | Revert with just git revert |
| Permission management | Git access permissions become deploy permissions |

In Kubernetes environments, ArgoCD and Flux are the two major GitOps tools. They are close to standard in current K8s operations, and the move to abolish manual kubectl apply is spreading.
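
The unit ArgoCD watches is an Application manifest. A sketch (the repo URL, path, and namespaces are hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests   # hypothetical manifest repo
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual kubectl changes back to the Git state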

DevSecOps (security integration)

DevSecOps is the thinking of building security checks into the CI/CD pipeline itself. By shifting left (finding problems at the earliest stage), vulnerabilities are crushed before the production deploy.

| Target | Tool examples |
| --- | --- |
| SAST (static analysis) | SonarQube / CodeQL / Semgrep |
| SCA (dependencies) | Dependabot / Snyk / Renovate |
| DAST (dynamic analysis) | OWASP ZAP / Burp Suite |
| Container scanning | Trivy / Grype / Docker Scout |
| IaC scanning | Checkov / tfsec / Terrascan |

GitHub’s Dependabot is free, auto-detects vulnerable dependency libraries, and even opens update PRs. This alone is well worth adopting; treat it as the required minimum vulnerability countermeasure.
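
Enabling it takes only a .github/dependabot.yml. A sketch (the ecosystem and interval are examples):

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"   # also: pip, gomod, docker, github-actions, ...
    directory: "/"
    schedule:
      interval: "weekly"       # opens PRs for vulnerable or outdated dependencies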

Security is built into the pipeline - the modern basis.

Secret management

Never put API keys and passwords into CI/CD environment variables in plaintext; if they leak, they are abused immediately. Use a proper management method.

| Method | Characteristics |
| --- | --- |
| GitHub Actions Secrets | Built-in feature, simple |
| OIDC integration (GitHub <-> AWS / GCP) | No long-term keys needed, top recommendation |
| HashiCorp Vault | Multi-environment integration |
| AWS Secrets Manager | AWS-native, auto-rotation |

The current front-runner is OIDC (OpenID Connect, the standard authentication protocol built on OAuth 2.0) integration. When accessing AWS from GitHub Actions, instead of storing long-term IAM access keys, the workflow exchanges GitHub’s own identity for temporary AWS credentials. Since the IAM-key leak risk disappears entirely, it is overwhelmingly superior on the security side.
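
Concretely, the job requests an OIDC token and trades it for short-lived credentials. A sketch (the role ARN is hypothetical; the IAM role must be pre-configured to trust GitHub's OIDC provider):

permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy   # hypothetical IAM role
          aws-region: ap-northeast-1
      - run: aws sts get-caller-identity   # authenticated with temporary credentials; no stored keys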

OIDC is the current standard. Putting long-term keys in Secrets is old operation.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in one or two sentences. Starting work while these are vague always invites later questions like “why did we decide this, again?”

  • CI/CD platform (GitHub Actions / GitLab CI etc.)
  • Branch strategy (GitHub Flow / Trunk-Based / Git Flow)
  • What to run at CI stages (pre-commit / pre-push / PR / merge)
  • CI target time on PR (within 10 min recommended)
  • Env composition (dev / staging / canary / production) and per-env responsibilities
  • DB-migration strategy (expand / contract, tool selection)
  • Phased-release ratios, observation times, auto-rollback gates
  • IaC tool (Terraform / CloudFormation / CDK)
  • Secret-management method (OIDC recommended)
  • Rollback procedures and drill frequency

Author’s note - Knight Capital’s 45 minutes

On August 1, 2012, the US securities firm Knight Capital rolled out a new trading system, but old code remained on one of eight servers, and at market open it mass-issued erroneous orders. About $460M was lost in 45 minutes, and the company effectively ceased to exist. It is a widely told industry cautionary tale.

The lesson: the very operation that allows a “partially different state” is the risk. Leaving deploy steps to manual human work inevitably produces one-server-missed mistakes. Implementing the basics via CI/CD, deploying the same artifact to every server by the same procedure, is the minimum line of defense.

Humans are creatures who occasionally forget just one server.

Summary

This article covered CI/CD: the phased practice, the 10-minute PR rule, DB migration, phased releases, IaC, GitOps, DevSecOps, and OIDC integration.

Anchor on GitHub Actions + Trunk-Based + OIDC + IaC + Feature Flag, keep CI within 10 minutes on PRs, handle DB migration via expand/contract, and mechanize the phased-release gates. That is the practical answer for CI/CD design in 2026.

Next time we’ll cover deploy strategy (Blue-Green, Canary, Feature Flag in detail).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.