[DevOps Architecture] CI/CD - GitHub Actions + OIDC + Feature Flag Is the Standard

About this article

As the seventh installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains CI/CD.

The gap between Netflix deploying thousands of times per day and a team nervously shipping once a week is competitiveness itself. This article covers phased CI/CD practice (from pre-commit to production canary), the 10-minute PR rule, expand/contract for DB migrations, phased releases, and IaC, GitOps, DevSecOps, and OIDC integration.

What is CI/CD

In a nutshell, CI/CD is “a system that automatically tests, builds, and distributes your code the moment you write it.”

Think of the publishing process as an analogy. An author submits a manuscript, the proofreader checks for errors, and if everything passes, it goes to the printing press and arrives at bookstores. CI/CD reproduces this flow for software development — the instant a developer saves code, tests run automatically, and if nothing fails, the code is delivered to production.

| Acronym | Full name | Content |
| --- | --- | --- |
| CI | Continuous Integration | Auto-build and auto-test on every push |
| CD | Continuous Delivery | Keep the code in a deploy-ready state (humans approve) |
| CD | Continuous Deployment | Full automation through to the production deploy |

The difference between Delivery and Deployment is whether a human approval step is involved. Regulated industries such as finance and healthcare use the approval-gated Delivery style, while in web services and SaaS the fully automated Deployment style is mainstream.

Why CI/CD matters

In a world without CI/CD, developers SSH into servers by hand, copy files one by one, and run tests while eyeballing a paper checklist. A midnight production deploy skips one step of the procedure, and the outage is discovered the next morning. Accidents caused by manual human work have repeated endlessly at shops without CI/CD.

The benefits of adopting CI/CD are not limited to saving time. It is the fundamental mechanism that determines overall business speed.

| Benefit | Content |
| --- | --- |
| Early bug discovery | Tests run on every push; problems are detected in minutes |
| No person-dependent deploys | The same result is reproduced regardless of who pushes |
| Higher deploy frequency | Multiple releases per day become possible |
| Easier rollback | Roll back to the previous version in minutes when problems occur |
| Reproducibility in production | No misread runbooks |

The goal is the state where “main merge = production-ready.” With this, feature additions and bug fixes reach production within hours, and the development-speed gap with competitors becomes decisive. Netflix and Amazon could not deploy thousands of times per day without CI/CD.

Deploy speed directly links to business competitiveness.

Main CI/CD platforms

Multiple services and tools provide CI/CD; the choice is mostly determined by which source-code hosting you pair it with.

| Product | Characteristics | Suited for |
| --- | --- | --- |
| GitHub Actions | GitHub-integrated, YAML config, overwhelming adoption | GitHub users, top candidate for new adoption |
| GitLab CI | GitLab-integrated, easy self-hosting | GitLab environments, enterprise |
| CircleCI | Fast, flexible config | Large OSS, independent operation |
| AWS CodePipeline | AWS-integrated, IAM linkage | AWS-specialized environments |
| Jenkins | OSS veteran, highly customizable | Legacy continuation, special requirements |
| Azure DevOps | Microsoft-integrated | .NET, Azure environments |

For new adoption, GitHub Actions is the overwhelming front-runner. It integrates tightly with GitHub-hosted source, its YAML configuration is intuitive, and the Marketplace is rich in reusable actions. Jenkins is a veteran but heavy in operational effort, and there is almost no reason left to choose it for new adoption.

Typical pipeline

A CI/CD pipeline runs multiple steps in order. If a step fails, the pipeline stops there and notifies immediately.

flowchart TB
    A([git push]) --> B[Lint / Format<br/>seconds]
    B --> C[Unit Test<br/>minutes]
    C --> D[Build Image<br/>minutes]
    D --> E[Integration Test<br/>tens of minutes]
    E --> F[Security Scan<br/>SAST / SCA]
    F --> G[Deploy to Staging<br/>auto]
    G -->|manual approval/<br/>gate| H[Deploy to Production<br/>Canary -> 100%]
    B -. fail .-> X([instant notification])
    C -. fail .-> X
    D -. fail .-> X
    E -. fail .-> X
    F -. fail .-> X
    classDef step fill:#dbeafe,stroke:#2563eb;
    classDef prod fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    classDef fail fill:#fee2e2,stroke:#dc2626;
    class B,C,D,E,F,G step;
    class H prod;
    class X fail;

The principle is to fail in the earliest step possible (Fail Fast). If a 30-minute E2E suite runs first, you wait 30 minutes to learn of a trivial coding-convention violation. Order the steps seconds-level Lint → minutes-level Unit Tests → tens-of-minutes Integration Tests, with slower tests later.
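
A minimal sketch of this ordering as a GitHub Actions workflow (the job names and npm scripts are illustrative assumptions, not prescriptions):

name: ci
on: push
jobs:
  lint:                      # seconds: fail fast on convention violations
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run lint
  unit-test:
    needs: lint              # minutes: runs only if lint passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test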

What to run in CI - phased practice

CI is not about running everything at once on every push. It is realistic to split it into four phases by when a human touches the code, with a different role per phase. Earlier phases are faster, more frequent, and local; later phases are heavier and broader.

| Phase | When it runs | What to run | Target time |
| --- | --- | --- | --- |
| 1. pre-commit | At commit creation (local) | Formatter, lint, secret detection | Within 5 s |
| 2. pre-push | Just before push (local or server) | Unit tests near the changed files | Within 30 s |
| 3. PR creation/update | When pushed to GitHub | All unit tests + type check + coverage + SAST + SCA | Within 10 min |
| 4. At merge | The moment it merges to main | Integration tests + build + image push + IaC plan | Within 20 min |

The trick with pre-commit (implemented via Husky, lefthook, or the pre-commit framework) is to narrow it down to coding conventions and secret-leak detection only. Developers skip heavy hooks, so don’t put anything over 5 seconds here. The standards at this stage are Prettier / Biome / ESLint / gitleaks.
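
A sketch of such a hook in lefthook.yml, assuming Biome and gitleaks (the globs and commands are illustrative):

# lefthook.yml - keep everything here under the 5-second budget
pre-commit:
  parallel: true
  commands:
    check:
      glob: "*.{ts,tsx,js}"
      run: npx biome check --write {staged_files}   # format + lint staged files only
    secrets:
      run: gitleaks protect --staged                # block commits that contain secrets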

Keep CI runtime within 10 minutes

When CI on PR creation (phase 3) exceeds 10 minutes, development speed plummets. The rhythm of “submit PR → switch to another task → CI finishes → request review” breaks down, and developers lose context while waiting. Google’s internal research also shows daily commit counts decreasing as CI response time lengthens.

| Approach | Effect | Caveats |
| --- | --- | --- |
| Job parallelization | Lint, unit tests, and type check run simultaneously, roughly 3x faster | Runner billing increases linearly |
| Dependency caching | node_modules etc. restore in seconds from the second run | A wrong cache-key design pulls in stale dependencies |
| Run only change-scoped tests | Avoid running everything in a monorepo | Requires test-dependency analysis (Nx, Turborepo, Bazel) |
| Docker layer caching | Image builds several times faster | Requires Buildx / GHA cache configuration |
| Split slow tests into separate jobs | Move E2E and load tests to a nightly batch | They don’t run on PRs; coverage must be guaranteed by another route |

Apply these in the order parallelization → caching → change-scope limiting. If you still can’t get under 10 minutes, first check whether heavy non-test processing (a full clean build every time, unneeded integration tests) has crept into CI.
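
Parallelization needs no special syntax: GitHub Actions jobs declared without needs: run concurrently by default. Caching is where the caveat above bites. A sketch of a cache step whose key embeds the lockfile hash, so a dependency change invalidates the cache instead of restoring stale modules (the paths and key names are illustrative):

- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}  # lockfile change => new key => fresh install
    restore-keys: |
      npm-${{ runner.os }}-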

CI on PR within 10 minutes. Organizations that can’t keep this inevitably see development speed decay.

What to run in CD - per-environment responsibilities

CD is not just “delivering to production”; it is a process of phased verification as changes pass through environments. When per-environment roles are vague, teams slide into the lazy assumption that “it worked in staging, so production will be fine too.” Prepare at least three environments, ideally four.

| Env | Purpose | Data | Access | Auto/manual |
| --- | --- | --- | --- | --- |
| dev | Developer verification | Synthetic data | Dev team | Auto on main merge |
| staging | Production-equivalent integration verification | Masked copy of production | QA, PM, stakeholders | Auto on main merge |
| pre-prod (canary) | Real trial with part of production traffic | Production itself | Internal users, some customers | Approval + auto |
| production | All users | Production | All customers | Auto-expansion after canary passes |

Making staging “an environment only developers touch” reduces it to a formality. Staging exists so that QA and PM touch it regularly, operating on data close to production. The rule is a daily-synced masked copy of production (PII removed, payment info synthetic). Pre-prod (canary) tends to be omitted, but staging alone cannot catch problems with production’s real traffic, data volume, and third-party integrations, so split it out for critical systems.
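
One way to encode these responsibilities in GitHub Actions is through environments. A sketch (the job names and deploy script are hypothetical; the approval gate is the "required reviewers" setting on the production environment in the repository settings):

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging           # deploys automatically on main merge
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh staging   # hypothetical deploy script
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production        # required reviewers here = the "approval + auto" gate
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh production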

DB migration is the CD pitfall

The most accident-prone part of CD is DB schema changes. Even when code can be rolled back instantly via Blue/Green, schema changes are one-way and operate on live production data. So advance them on a different lifecycle from code, in phases that preserve backward compatibility. This is called the expand/contract pattern.

[Renaming column user_name to full_name]
1. expand : add full_name column (write both, read user_name) ← deploy 1
2. backfill: copy user_name values to full_name (batch)
3. switch : switch reading to full_name                        ← deploy 2
4. contract: drop user_name column                              ← deploy 3

Concrete migration fragment example (PostgreSQL):

-- 1. expand: add new column NULL-allowed (instant, avoid lock)
ALTER TABLE users ADD COLUMN full_name TEXT;

-- 2. backfill: put values in existing rows (split batches to suppress prod load)
UPDATE users SET full_name = user_name
WHERE full_name IS NULL AND id BETWEEN $1 AND $2;

-- 3. switch: after app-side release, guarantee integrity
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;

-- 4. contract: drop old column (after confirming new code 100% running)
ALTER TABLE users DROP COLUMN user_name;

Rewrite the application side in step as well.

// Assumed setup: Drizzle ORM, with a users table whose columns are user_name and full_name
import { eq } from "drizzle-orm";
import { db } from "./db";        // the app's database client (assumed module)
import { users } from "./schema"; // the table definition (assumed module)

// just after expand: dual writes (won't break even if read by old code)
await db.update(users).set({
  user_name: name,
  full_name: name,
}).where(eq(users.id, userId));

// after the switch deploy: write only to the new column
await db.update(users).set({ full_name: name }).where(eq(users.id, userId));

The reason to split into four stages is so that nothing breaks at any moment while old and new code run simultaneously. During a rolling update, old and new instances coexist for minutes to tens of minutes; unless “the column old code reads” and “the column new code writes” both exist, production dies.

Never finish a schema change in one shot. The rule is to split it across three deploys.

DB-migration forbidden moves and operation

Forbidden moveWhy it’s bad
One-shot column renameInstances running on old code instantly die
Immediate add of NOT NULL columnConstraint violation on existing records, migration failure
Add index without onlineWrites locked on large tables
Put breaking changes and code changes in same deploySchema can’t roll back on rollback
”CI/CD adoption reduces bugs” — expecting too muchWithout tests, bugs just flood production at high speed
”IAM keys in Secrets are safe” — complacencySecrets also leak. Make long-term keys themselves unnecessary via OIDC integration

Run the migration as an independent step early in the CD pipeline, completing it before the code deploy. PostgreSQL has CREATE INDEX CONCURRENTLY; MySQL has pt-online-schema-change and gh-ost. Operations that don’t lock production tables are a requirement.

Migration tools such as Flyway, Liquibase, and Prisma Migrate are standard, with changes managed as SQL files (or a declarative schema) in code. Putting them in Git and reviewing them in PRs is the modern standard; avoid DBA GUI work that leaves no audit trail. On large tables a backfill batch can run for hours to days, so the practice is to manage migration schedules separately on a release calendar.
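
A sketch of “migration as an independent step that completes before the code deploy,” in GitHub Actions terms (the Prisma command is one example; the Flyway and Liquibase CLIs slot in the same way, and deploy.sh is hypothetical):

jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx prisma migrate deploy   # applies pending migrations; a failure stops the pipeline here
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
  deploy:
    needs: migrate                       # code ships only after the schema step succeeds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh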

AI decision axes

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| GitHub Actions + standard YAML | Jenkins custom DSL / proprietary commercial config |
| OIDC integration (no long-term keys) | IAM keys saved in Secrets |
| Trunk-Based + Feature Flag | Long-lived branches / manual releases |
| Code-generation/review integration that supports AI | Closed CI that AI agents cannot touch |

  1. Choose CI/CD tightly coupled with source-code hosting — GitHub Actions for GitHub
  2. Make OIDC integration mandatory, exclude long-term IAM keys from Secrets
  3. Trunk-Based + Feature Flag to separate release and deploy
  4. Build IaC + DevSecOps into pipeline (shift-left)

Phased release - concrete percentages and gates

Canary tends to be explained as “release to a portion of users first and watch,” but it doesn’t function if “what percentage, what to watch, and how to advance” are left vague. In practice, auto-progress through the following ratios, judging each stage on metrics: error rate, latency, and business metrics.

| Phase | Ratio | Observation time | Auto-rollback condition example |
| --- | --- | --- | --- |
| Canary 1 | 1% | 15 min | Error rate > old version + 0.5% |
| Canary 2 | 5% | 30 min | p99 latency > old version x 1.2 |
| Canary 3 | 25% | 1 hour | Business metric (CVR etc.) < old version - 5% |
| Canary 4 | 50% | 2 hours | Any of the three above |
| Full deploy | 100% | - | - |

The rule is to set auto-rollback thresholds as relative comparisons against the old version (the baseline). Absolute thresholds false-fire on time-of-day variation. AWS CodeDeploy, Argo Rollouts, and Flagger ship with this metric-linked automatic phased deployment as a standard feature.

Relying only on human visual judgment means no one can stop a midnight incident. Mechanize the gates: don’t omit approval, but have machines judge first. That is modern CD design.
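
In Argo Rollouts, for example, the table above maps almost directly onto a canary strategy. A sketch trimmed to the strategy section (the analysis template name is hypothetical; it would hold the baseline-relative error-rate query):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: error-rate-vs-baseline   # hypothetical AnalysisTemplate
        startingStep: 1                            # evaluate metrics from the first pause onward
      steps:
        - setWeight: 1
        - pause: { duration: 15m }
        - setWeight: 5
        - pause: { duration: 30m }
        - setWeight: 25
        - pause: { duration: 1h }
        - setWeight: 50
        - pause: { duration: 2h }   # then 100% if the analysis never failed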

Branch strategy

Closely related to CI/CD is branch strategy. Choose according to team scale, release frequency, and product nature.

| Strategy | Characteristics | Suited for |
| --- | --- | --- |
| GitHub Flow | main + feature branches | Web services / continuous deployment |
| Git Flow | main + develop + release + hotfix | Products / version-managed releases |
| Trunk-Based Development | Short-lived feature branches, direct pushes to main | Mature CI/CD / expert teams |

In modern web development, GitHub Flow or Trunk-Based Development is mainstream. Complex Git Flow is effective only where version-by-version management is needed, such as packaged products or on-prem distribution; for continuously deployed SaaS it is just excess complexity.

Branch strategy is simplicity-first. Complex Git Flow is now a minority.

Test-automation levels

Tests come in multiple levels with different roles. The point is not to write every kind everywhere but to distribute an appropriate amount at each level per use case.

| Test | Speed | Scope | Ratio guideline |
| --- | --- | --- | --- |
| Unit Test | Fast (seconds) | Function/class | 70% |
| Integration Test | Medium (minutes) | Module integration | 20% |
| E2E Test | Slow (minutes to tens of minutes) | User-operation reproduction | 10% |

This ratio is called the test pyramid. The principle: many unit tests, few E2E tests. E2E tests are slow, unstable, and brittle, so narrow them to essential user flows. The typical composition always runs unit + integration in CI and runs E2E nightly against the staging environment.
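
The nightly E2E run is just a scheduled workflow. A sketch (the cron time, Playwright, and the staging-URL variable are assumptions):

name: nightly-e2e
on:
  schedule:
    - cron: "0 18 * * *"   # daily; note that cron runs in UTC
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright test   # assumed E2E suite, pointed at staging
        env:
          BASE_URL: ${{ vars.STAGING_URL }}   # hypothetical repository variable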

Deploy strategy

There are multiple ways to “deploy.” The rule: the bigger a release’s impact, the more phased and cautious the rollout.

| Strategy | Mechanism | Characteristics |
| --- | --- | --- |
| Rolling Update | Replace instances sequentially | Simple, no downtime |
| Blue/Green | Verify in a new environment, switch all at once | Instant rollback, 2x cost |
| Canary Release | Deploy to a subset of users first | Localizes risk |
| Feature Flag | Code is in production; features toggle ON/OFF | Separates deploy from release |

Roll out large features and irreversible changes (DB migrations etc.) in phases via Blue/Green or Canary. The point: when problems occur, you move traffic back, not code.

Canary and Feature Flag

Canary Release and Feature Flag are the two risk-management approaches most emphasized today.

Canary Release:
  Users
   |- 95% --> [old version v1.0]
   |- 5%  --> [new version v1.1]   <- expand gradually if no problems

Feature Flag:
  if (flag.newCheckout) {
    renderV2()
  } else {
    renderV1()
  }

A Feature Flag is the mechanism of deploying code to production while keeping each feature switchable ON/OFF by a flag. It severs the joined-fate relationship of “deploy = release,” separating code distribution from feature publication. Dedicated SaaS such as LaunchDarkly, Unleash, and GrowthBook are widely used.

The essence of phased deployment is avoiding “everything breaks at once” in favor of “detect on a small slice and roll back instantly.”

Infrastructure as Code (IaC)

IaC is the approach of defining and managing infrastructure (servers, network, DB etc.) as code. Abolishing manual construction dramatically improves reproducibility, auditability, and change management.

| Tool | Characteristics |
| --- | --- |
| Terraform | Multi-cloud support, de facto standard |
| OpenTofu | OSS fork of Terraform (independent since 2024) |
| AWS CloudFormation | AWS-native, JSON/YAML |
| Azure Bicep | Azure-native, ARM successor |
| Pulumi | Written in TypeScript / Python etc. |
| AWS CDK | Define AWS in TypeScript etc. |

The standard is Terraform (or its fork OpenTofu). It supports multiple clouds, has a large community, and can manage almost every cloud resource as code. The modern rule is to build it into the CI/CD pipeline so infrastructure changes are also reviewed via Pull Request.
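
Reviewing infrastructure changes via PR usually means running terraform plan in CI. A sketch (the workflow name is illustrative; the OIDC permissions tie into the secret-management section below):

name: terraform-plan
on: pull_request
permissions:
  id-token: write   # for OIDC cloud credentials (see secret management below)
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -no-color   # the plan output is reviewed alongside the PR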

GitOps

GitOps is the method where the Git repo is the Single Source of Truth, and a push to Git is automatically reflected in the infrastructure. Especially in Kubernetes operations it is rapidly becoming the standard.

[Git push]
    |
[ArgoCD / Flux]        <- continuously monitors Git repos
    |
[Kubernetes]           <- auto-converges to Git contents

| Feature | Content |
| --- | --- |
| Declarative | Define the desired state and converge to it |
| Auditability | All changes are in Git history; who changed what, and when, is clear |
| Easy rollback | Revert with just git revert |
| Permission management | Git access permissions become deploy permissions |

In Kubernetes environments, ArgoCD and Flux are the two major GitOps tools. They are close to standard in current K8s operations, and the move to abolish manual kubectl apply is spreading.
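
The unit ArgoCD watches is an Application manifest. A sketch (the repo URL, path, and namespaces are hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests   # hypothetical manifest repo
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual kubectl changes back to the Git state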

DevSecOps (security integration)

DevSecOps is the thinking of building security checks into the CI/CD pipeline itself. By shifting left (finding problems at the earliest stage), vulnerabilities are crushed before the production deploy.

| Target | Tool examples |
| --- | --- |
| SAST (static analysis) | SonarQube / CodeQL / Semgrep |
| SCA (dependencies) | Dependabot / Snyk / Renovate |
| DAST (dynamic analysis) | OWASP ZAP / Burp Suite |
| Container scanning | Trivy / Grype / Docker Scout |
| IaC scanning | Checkov / tfsec / Terrascan |

GitHub’s Dependabot is free, auto-detects vulnerable dependency libraries, and even opens update PRs. This alone is well worth adopting; treat it as the required minimum vulnerability countermeasure.
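
Enabling it takes only a .github/dependabot.yml. A sketch (the ecosystem and interval are examples):

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"   # also: pip, gomod, docker, github-actions, ...
    directory: "/"
    schedule:
      interval: "weekly"       # opens PRs for vulnerable or outdated dependencies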

Security is built into the pipeline - the modern basis.

Secret management

Never put API keys and passwords into CI/CD environment variables in plaintext; if they leak, they are abused immediately. Use a proper management method.

| Method | Characteristics |
| --- | --- |
| GitHub Actions Secrets | Built-in feature, simple |
| OIDC integration (GitHub <-> AWS / GCP) | No long-term keys needed, top recommendation |
| HashiCorp Vault | Multi-environment integration |
| AWS Secrets Manager | AWS-native, auto-rotation |

The current front-runner is OIDC (OpenID Connect, the standard authentication protocol built on OAuth 2.0) integration. When accessing AWS from GitHub Actions, instead of storing long-term IAM access keys, the workflow exchanges GitHub’s own identity for temporary AWS credentials. Since the IAM-key leak risk disappears entirely, it is overwhelmingly superior on the security side.
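
Concretely, the job requests an OIDC token and trades it for short-lived credentials. A sketch (the role ARN is hypothetical; the IAM role must be pre-configured to trust GitHub's OIDC provider):

permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy   # hypothetical IAM role
          aws-region: ap-northeast-1
      - run: aws sts get-caller-identity   # authenticated with temporary credentials; no stored keys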

OIDC is the current standard. Putting long-term keys in Secrets is old operation.

What to decide - what is your project’s answer?

For each of the following, try to articulate your project’s answer in one or two sentences. Starting work while these are vague always invites later questions like “why did we decide this, again?”

  • CI/CD platform (GitHub Actions / GitLab CI etc.)
  • Branch strategy (GitHub Flow / Trunk-Based / Git Flow)
  • What to run at CI stages (pre-commit / pre-push / PR / merge)
  • CI target time on PR (within 10 min recommended)
  • Env composition (dev / staging / canary / production) and per-env responsibilities
  • DB-migration strategy (expand / contract, tool selection)
  • Phased-release ratios, observation times, auto-rollback gates
  • IaC tool (Terraform / CloudFormation / CDK)
  • Secret-management method (OIDC recommended)
  • Rollback procedures and drill frequency

Author’s note - Knight Capital’s 45 minutes

On August 1, 2012, the US securities firm Knight Capital rolled out a new trading system, but old code remained on one of eight servers, and at market open it mass-issued erroneous orders. About $460M was lost in 45 minutes, and the company effectively ceased to exist. It is a widely told industry cautionary tale.

The lesson: the very operation that allows a “partially different state” is the risk. Leaving deploy steps to manual human work inevitably produces one-server-missed mistakes. Implementing the basics via CI/CD, deploying the same artifact to every server by the same procedure, is the minimum line of defense.

Humans are creatures who occasionally forget just one server.

Summary

This article covered CI/CD: the phased practice, the 10-minute PR rule, DB migration, phased releases, IaC, GitOps, DevSecOps, and OIDC integration.

Anchor on GitHub Actions + Trunk-Based + OIDC + IaC + Feature Flag, keep CI within 10 minutes on PRs, handle DB migration via expand/contract, and mechanize the phased-release gates. That is the practical answer for CI/CD design in 2026.

Next time we’ll cover deploy strategy (Blue-Green, Canary, Feature Flag in detail).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.