
[DevOps Architecture] Test Design - Pyramid + Testcontainers + Branch Coverage

About this article

As the sixth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains test design.

The goal of testing is not the coverage number but being able to grasp the cause within 5 minutes when something breaks. This article works through the practical decisions - the test pyramid, coverage targets, TDD, and flake countermeasures - such as what percentage of coverage to aim for, how far to take E2E, and what to run in CI.

What is test design in the first place

Picture a car inspection. Brakes, lights, emissions - there are inspection criteria for each item, and you can’t drive on public roads without passing. Driving without inspection means an accident could happen at any time.

Test design is about deciding how to build mechanisms that automatically verify whether software is working correctly. You design what to inspect, at what granularity, and at what timing, creating a state where automated checks run every time code changes.

Without test design, every time you change a single line of code, you’d have to manually verify “whether other features are broken”, and fear accompanies every change. The result: nobody wants to touch the code anymore.

Why test design matters

Eliminating fear of change

Without tests, every single-line code change requires manually checking “whether other features are broken”. With automated tests, feedback returns within minutes after a change, letting you refactor with confidence.

Grasping the cause within 5 minutes of failure

Looking at where tests broke tells you immediately which change caused the problem. Without tests, root-cause identification can take hours to days.

Ensuring quality of AI-generated code

Having humans review every piece of AI-written code does not scale. Mechanically verifying it with tests is the realistic safety net.

Think of tests split by 3 responsibilities

Organizing tests into unit, integration, and E2E layers is the industry norm, but rather than memorizing layer names, what really shapes your judgment is grasping what each layer is responsible for guaranteeing.

| Layer | Responsibility (what it guarantees) | Representative tools |
| --- | --- | --- |
| Unit Test | Logic correctness of functions/classes | Jest / Vitest / pytest / JUnit |
| Integration Test | Module integration, DB / external-API connection | Testcontainers / Supertest / pytest + Docker |
| E2E Test | Through-screen flows of user operations | Playwright / Cypress |
| Contract Test | API-contract compatibility between services | Pact / Spring Cloud Contract |
| Performance Test | Numerical boundaries of load/latency | k6 / Gatling / Locust |

Contract tests (machine verification of the API spec agreed between caller and callee) gain importance in microservices but are unneeded in monoliths. Choosing which layers each project actually needs is also part of the architect's job.

The ratio of the test pyramid

The test pyramid (proposed by Mike Cohn in 2009) is a model that stacks fast, numerous tests at the base and slow, few tests at the top. It remains the standard starting point for test strategy today.

flowchart TB
    E2E["E2E Test (10%)<br/>slow, brittle<br/>user flows only"]
    INT["Integration Test (20%)<br/>verification incl. DB / external API"]
    UNIT["Unit Test (70%)<br/>finishes in seconds, write many"]
    E2E --> INT --> UNIT
    LEFT[Speed: slow<br/>Count: few] -.- E2E
    UNIT -.- RIGHT[Speed: fast<br/>Count: many]
    classDef e2e fill:#fee2e2,stroke:#dc2626;
    classDef int fill:#fef3c7,stroke:#d97706;
    classDef unit fill:#dcfce7,stroke:#16a34a;
    class E2E e2e;
    class INT int;
    class UNIT unit;

The Unit 70 / Integration 20 / E2E 10 ratio is only a guideline, and it shifts by domain. Logic-centric domains like payments and inventory lean toward Unit 80%, while admin-panel-centric CRUD business apps push Integration toward 30-40%. What you most need to be able to trust decides the ratio.

Antipattern: ice-cream cone

What actually happens at many sites is an inverted test pyramid: many E2E tests on top of a thin Unit layer. The industry calls this the Ice-Cream Cone antipattern.

  ____________
  \          /   E2E Test     60%   ← brittle, slow, maintenance abandoned
   \--------/
    \      /     Integration  30%
     \----/
      \  /       Unit Test    10%
       \/

It tends to happen for two reasons: "E2E feels safer" and "unit tests are tedious to write." But each E2E run takes tens of seconds to minutes, flakes easily (unstable results despite identical code) because of async timing, and skip tags pile up because nobody fixes them. You end up in the hell of 500 E2E tests, half of them skipped and the other half turning red three times a day.

Stacking up E2E tests to feel safe is a trap. A thin Unit layer cannot be compensated for by E2E.

What to run in tests - phased practice

Running all tests every time in CI is unrealistic. The practical answer is, as with CI/CD itself, to split execution into phases by when the code is touched and to vary the kinds and amounts of tests run in each phase.

| Phase | When it runs | What to run | Target time |
| --- | --- | --- | --- |
| 1. pre-commit | At commit creation (local) | Lint of changed files + single test | Within 5 s |
| 2. pre-push | Just before push | Unit Tests in the change scope | Within 30 s |
| 3. PR creation/update | When pushed to GitHub | All Unit + type check + change-scope Integration | Within 10 min |
| 4. At merge | The moment it is merged to main | All Integration + smoke E2E | Within 20 min |
| 5. Nightly | Overnight batch | All E2E + load tests + Contract Tests | Several hours |

Running all E2E tests on every PR is a bad choice: development speed plummets and nobody adds more tests. The composition that works in practice is to split E2E into a post-merge smoke run (the minimum flow) plus a nightly run, with Unit + Integration being sufficient at the PR stage.

What % coverage to aim for

Coverage (the proportion of code executed by tests) should be used as a lower-bound guardrail, not as a target value. Aiming for 80% is fine, but the reality in the field is that meaningless tests start being mass-produced the moment 90% is chased.

| Target | Realistic? | Comment |
| --- | --- | --- |
| Under 40% | Dangerous | Test foundation too thin; refactoring is instant death |
| 60-70% | Common | The typical line for new projects / SaaS |
| 80% | Front-runner | Design the domain-logic core to exceed 80% |
| 90%+ | Caution | Ritualization begins; getter/setter tests get mass-produced |
| 100% | Forbidden | The cost is not worth the achievement; religious |

Coverage comes in three flavors - line, branch, and function - and the modern rule is to anchor on branch coverage (whether both arms of each if-condition were exercised). If you look only at line coverage, you can sit at 90% while only one arm of every if-statement is taken, so false positives are easy. Both jest --coverage and pytest-cov (with the --cov-branch option) can report branch coverage.

A realistic line is 80%+ for domain logic and around 60% for the rest.
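
As one concrete way to anchor on branch coverage, here is a minimal jest.config.ts sketch with a stricter bar for the domain core. The src/domain path and the exact percentages are assumptions to adapt to your own project, not prescriptions.

```typescript
// jest.config.ts - a sketch of using branch coverage as the lower-bound guardrail.
// The "./src/domain/" path and the thresholds below are assumptions, not prescriptions.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageProvider: "v8",
  // Fail the run when coverage drops below the floor, with branches as the anchor.
  coverageThreshold: {
    global: { branches: 60, lines: 60 },
    // Hold the domain-logic core to a stricter bar than the rest of the codebase.
    "./src/domain/": { branches: 80, lines: 80 },
  },
};

export default config;
```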

Coverage pitfalls - chasing only numbers breaks it

Because coverage is easy to measure, it tends to get reported to management as a KPI, but test quality collapses the moment the number alone is chased. Below are forbidden moves seen frequently in the field.

| Forbidden move | Why it's bad |
| --- | --- |
| Adopting coverage % as an executive metric | "Coverage-up" PRs sprout tests on getters/setters |
| Using a coverage threshold as a PR-merge condition | Diffs sitting just at the threshold can't pass, blocking work before review even starts |
| Tying coverage shortfalls to individual evaluation | Number-grinding starts to win over test quality |
| A uniform 80% target on all files | Auto-generated code and library wrappers get caught |
| Believing "80% coverage means no bugs" | Line coverage alone gives false positives where only one branch of an if-statement is taken |
| Deferring with "no time to write tests" | The time melted away in debugging, production incidents, and customer explanations is definitely longer |

Coverage is an indicator the team should watch internally as a quality sensor; it degrades the moment it becomes an evaluation axis or an approval blocker. For PR approval it is safer to look at coverage of new code (changed lines) rather than the absolute coverage value - Codecov's distinction between project and patch coverage is the canonical example.

AI decision axes

| AI-favored | AI-disfavored |
| --- | --- |
| Test-first (TDD) to hand the AI the spec | "It works, so OK" / tests deferred |
| Jest / Vitest / pytest (abundant AI training data) | Custom test frameworks / in-house wrappers |
| Production-equivalent DB verification with Testcontainers | DB mocks that miss "the SQL doesn't run" |
| Type checks + coverage + branch tests | A culture of relaxing because line coverage alone shows 90% |
| Human review of AI-generated tests as a culture | Unconditional merging of AI-written tests |

Concrete checkpoints:

  1. Decide the test-pyramid shape first and check that it isn't an inverted triangle
  2. Build a production-equivalent Integration environment with Testcontainers
  3. Use branch coverage + changed-line coverage as the operational metrics
  4. Establish a flake-isolation flow and a "passing on retry isn't passing" culture

TDD - the school of writing tests first

TDD (Test-Driven Development) is a development style popularized by Kent Beck in the early 2000s that cycles through three steps: write a failing test, write the minimal implementation that makes it pass, then refactor.

| Step | What to do | Red/Green |
| --- | --- | --- |
| Red | Write a failing test first | Red |
| Green | Write the minimal implementation that passes | Green |
| Refactor | Tidy the design while keeping tests green | Green |

TDD's essence is not "writing tests first" but the thinking order: verbalize the spec first, then move to implementation. When you work out the spec while implementing, it is only human to fall into "it works, so OK," and TDD functions as the device that avoids that trap.

That said, running TDD on every feature is unrealistic. The balance that works in the field is TDD for the domain-logic core (payments, inventory calculation, pricing-plan decisions) and tests written after the fact for UI wiring and CRUD.
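
To make the Red and Green steps concrete, here is a minimal sketch in Vitest. The calculateVolumeDiscount function and its discount rule are hypothetical examples invented for illustration, not something from this article.

```typescript
// Red: write the spec as a failing test first.
// calculateVolumeDiscount and the 10-item / 10% rule are hypothetical examples.
import { describe, expect, it } from "vitest";
import { calculateVolumeDiscount } from "./pricing";

describe("calculateVolumeDiscount", () => {
  it("applies no discount below 10 items", () => {
    expect(calculateVolumeDiscount({ quantity: 9, unitPrice: 1000 })).toBe(9000);
  });

  it("applies a 10% discount from 10 items", () => {
    expect(calculateVolumeDiscount({ quantity: 10, unitPrice: 1000 })).toBe(9000);
  });
});

// Green: the minimal implementation in pricing.ts that makes the tests above pass.
export function calculateVolumeDiscount(input: { quantity: number; unitPrice: number }): number {
  const total = input.quantity * input.unitPrice;
  return input.quantity >= 10 ? total * 0.9 : total;
}
```

Refactor then happens with these tests kept green, which is exactly what makes the spec-first thinking order stick.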

Flaky-test pitfalls

Flake (a test that passes or fails across runs despite identical code) is the biggest enemy of test reliability. With ten tests that each fail one run in three, a green CI only appears probabilistically and nobody trusts red anymore. Once a culture of ignoring red takes root, the organization turns into one that misses real bugs.

| Cause | Treatment |
| --- | --- |
| Time-dependent (Date.now()) | Fix the time inside the test (Jest/Vitest fake timers) |
| Dependent on async timing | Wait explicitly with waitFor / expect.poll |
| State shared between tests | Always clean the DB/cache per test |
| External-API calls over the network | Mock them or isolate with Testcontainers |
| Dependent on random values / UUIDs | Fix the seed or relax the assertion |
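
As an illustration of the first row, here is a sketch of pinning a time-dependent test with Vitest fake timers. The isExpired function and its 30-day rule are hypothetical examples.

```typescript
// Pinning down a time-dependent test so Date.now() stops being a source of flake.
// isExpired and the 30-day expiry rule are hypothetical examples.
import { afterEach, expect, it, vi } from "vitest";
import { isExpired } from "./subscription";

afterEach(() => {
  vi.useRealTimers(); // always restore real timers so state never leaks between tests
});

it("treats a subscription as expired 30 days after purchase", () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date("2026-02-01T00:00:00Z")); // Date.now() is now deterministic

  expect(isExpired({ purchasedAt: new Date("2026-01-01T00:00:00Z") })).toBe(true);
});
```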

The rule for a flaky test is to choose immediately between isolating it, fixing it, or deleting it. Operating on "it failed again, but it passed on retry, so OK" turns the retry feature itself into a breeding ground for incidents. Jest, Vitest, and Playwright all offer retries, but a healthy team treats a test that only passes on retry as not passing.

Tests that pass on retry are not passing. Two choices: isolate or fix.

Test DB and Testcontainers

The most accident-prone part of integration testing is the DB connection. Replacing the DB with a mock (a stand-in for the real thing) frequently produces "the SQL should be right, but it fails in production," and substituting a locally installed SQLite misses bugs caused by dialect differences from PostgreSQL/MySQL.

Today's standard is Testcontainers: a library that spins up real DB/Redis/Kafka instances in Docker containers for the duration of the tests and throws them away afterwards, realizing "integration tests against the same middleware as production."

| Method | Difference from prod | Speed | Verdict |
| --- | --- | --- | --- |
| Mock (replace the DB connection itself) | Large (can't verify SQL) | Very fast | SQL bugs leak through |
| Local SQLite substitute | Medium (dialect differences) | Fast | Small scale only |
| Shared staging DB | Small | Medium | Tests can't be parallelized |
| Testcontainers | Near zero | Medium (only the first run is slow) | Front-runner |

Verifying behavior down to the DB dialect is an investment worth far more than pushing up the coverage number.
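
Here is a minimal sketch of what that looks like in a Node project, assuming the @testcontainers/postgresql and pg packages; the users table and its unique constraint are hypothetical examples.

```typescript
// Integration test against a real, disposable PostgreSQL started by Testcontainers.
// The users table and the duplicate-email scenario are hypothetical examples.
import { PostgreSqlContainer, StartedPostgreSqlContainer } from "@testcontainers/postgresql";
import { Client } from "pg";
import { afterAll, beforeAll, expect, it } from "vitest";

let container: StartedPostgreSqlContainer;
let client: Client;

beforeAll(async () => {
  // Start a throwaway PostgreSQL container, ideally the same major version as production.
  container = await new PostgreSqlContainer("postgres:16").start();
  client = new Client({ connectionString: container.getConnectionUri() });
  await client.connect();
  await client.query("CREATE TABLE users (id serial PRIMARY KEY, email text UNIQUE NOT NULL)");
}, 120_000);

afterAll(async () => {
  await client.end();
  await container.stop(); // discard the container after the run
});

it("rejects duplicate emails with the same constraint behavior as production", async () => {
  await client.query("INSERT INTO users (email) VALUES ($1)", ["a@example.com"]);
  await expect(
    client.query("INSERT INTO users (email) VALUES ($1)", ["a@example.com"]),
  ).rejects.toThrow(/duplicate key/);
});
```

The first run pays the image-pull cost; after that the container starts in seconds, which is why the speed column above reads "only the first run is slow."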

Where to use mocks - mocks not to break vs mocks to use

Mocks are convenient, but overusing them erases what a test means. Misjudging the boundary leads to the worst state of all: every test green while production is broken.

  • Mock targets: external SaaS (Stripe, SendGrid, Slack), time/random, costly computations
  • Don’t mock: own project’s DB, code we wrote, modules in same process

A typical failure is mocking DB access at the repository layer: SQL bugs then sail through to production undetected. Likewise, mocking neighboring use cases in a use-case-layer test is a bad choice, because you lose the chance to make the coupling between them visible.

Don’t mock your own code, mock only the external world - the empirical iron rule.
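
A sketch of where that boundary sits in practice: mock the external SaaS call, keep your own code and the (containerized) DB real. The stripeGateway module, createOrder use case, and ordersRepository are hypothetical examples.

```typescript
// Mock only the outside world (the Stripe wrapper); exercise our own code and DB for real.
// "./payment/stripeGateway", createOrder, and ordersRepository are hypothetical examples.
import { expect, it, vi } from "vitest";

vi.mock("./payment/stripeGateway", () => ({
  chargeCard: vi.fn().mockResolvedValue({ status: "succeeded", chargeId: "ch_test_1" }),
}));

import { chargeCard } from "./payment/stripeGateway";
import { createOrder } from "./createOrder"; // our own code: not mocked
import { ordersRepository } from "./ordersRepository"; // talks to the Testcontainers DB

it("persists the order and charges the card exactly once", async () => {
  const order = await createOrder({ userId: 1, amount: 4200 });

  expect(chargeCard).toHaveBeenCalledTimes(1); // the external call, verified via the mock
  expect(await ordersRepository.findById(order.id)).toMatchObject({ amount: 4200 }); // real SQL, verified
});
```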

Contract tests

When APIs are shared across microservices and multiple teams, divergence between the caller's and the callee's API specs becomes a major cause of production incidents. A Contract Test is a mechanism for machine-verifying that the contract the caller expects matches what the callee actually returns.

| Method | Representative tools | Characteristics |
| --- | --- | --- |
| Consumer-Driven Contract | Pact | The caller writes the contract; the callee verifies it |
| Spec-First (OpenAPI contract verification) | Dredd / Prism | Verify responses against the OpenAPI spec |
| Spring family | Spring Cloud Contract | The default choice in JVM environments |

They are unneeded in a monolith or a single team, but the moment two or more teams operate microservices or a BFF (Backend-for-Frontend, an API aggregation layer dedicated to the frontend), introduction becomes worth considering. It is an investment that replaces coordination cost that melts human time with machine verification.
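
To make the consumer-driven flavor concrete, here is a minimal consumer-side sketch with Pact (@pact-foundation/pact). The user-service provider, the /users/1 endpoint, and the fetchUser client are hypothetical examples.

```typescript
// A consumer-driven contract: the caller records what it expects, the callee later verifies it.
// The provider name, endpoint, and fetchUser client are hypothetical examples.
import { PactV3, MatchersV3 } from "@pact-foundation/pact";
import { expect, it } from "vitest";
import { fetchUser } from "./userClient";

const provider = new PactV3({ consumer: "web-frontend", provider: "user-service" });

it("GET /users/1 returns the id and email the frontend relies on", async () => {
  provider
    .given("user 1 exists")
    .uponReceiving("a request for user 1")
    .withRequest({ method: "GET", path: "/users/1" })
    .willRespondWith({
      status: 200,
      headers: { "Content-Type": "application/json" },
      body: MatchersV3.like({ id: 1, email: "a@example.com" }),
    });

  // Pact starts a mock provider, runs the client against it, and writes the contract file;
  // the user-service team then replays that contract against their real implementation.
  await provider.executeTest(async (mockServer) => {
    const user = await fetchUser(mockServer.url, 1);
    expect(user.email).toBe("a@example.com");
  });
});
```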

Test-data design

Another underestimated part of testing is how test data is created. If everything is filled with bland fixed data like user_01 / product_01, then as tests multiply you can no longer tell what case each one was actually testing.

| Approach | Content | Suited for |
| --- | --- | --- |
| Fixture (fixed data files) | Inject preset data via YAML/JSON | Small, read-heavy apps |
| Factory (factory functions) | Generate via userFactory({role: 'admin'}) | Mid-to-large projects (front-runner) |
| Builder pattern | Assemble via method chains | Complex entities |
| Faker (random generation) | Auto-generate names/addresses | Load tests, mass data |

The Factory pattern (FactoryBot, fishery, @mikro-orm/seeder, etc.) is the modern standard. The style of "write only the diff that matters, default everything else" makes tests readable as a spec. Writing everything as fixtures hides the dependencies between tests and breaks down over the medium term.
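
A sketch of that style with fishery; the User type and its default field values are hypothetical examples.

```typescript
// Factory style with fishery: defaults live in one place, tests state only the meaningful diff.
// The User type and the default values are hypothetical examples.
import { Factory } from "fishery";

interface User {
  id: number;
  email: string;
  role: "member" | "admin";
}

// sequence keeps generated rows unique without hand-picking ids.
const userFactory = Factory.define<User>(({ sequence }) => ({
  id: sequence,
  email: `user-${sequence}@example.com`,
  role: "member",
}));

// In a test, only the property the case is about gets spelled out.
const admin = userFactory.build({ role: "admin" });
const members = userFactory.buildList(3);
```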

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are still vague invariably invites the later question "why did we decide it this way again?"

  • Test-pyramid ratio target (Unit / Integration / E2E)
  • Coverage target (overall %, domain core %, branch coverage adoption)
  • CI scope on PR (change scope only or all Unit)
  • E2E execution timing (post-merge smoke + nightly)
  • Test-DB strategy (Testcontainers / mock / shared DB)
  • Flaky-test handling (isolate → fix → delete flow)
  • Contract Test introduction (judge by microservices / BFF presence)
  • Test-data approach (Factory / Fixture / Builder)

Author’s note - team broken by KPI-izing coverage

A widely known industry case: a mid-size SaaS company set "coverage 80%" as a team KPI, and within three months getter/setter tests had grown by thousands of lines while the essential domain-logic bugs did not decrease. The number goes up easily, but quality doesn't change, because "raising coverage" and "reducing bugs" are different things.

The team then switched its KPIs to two axes - "new bugs reaching production" and "PR-diff coverage (changed lines)" - and abolished the absolute coverage target. As a result test quality recovered, and, tellingly, the junk getter/setter tests ended up being deleted in bulk.

Test value can only be measured by track record of reducing bugs.

Summary

This article covered test design, including test pyramid, coverage, TDD, flake countermeasures, Testcontainers, mock boundaries, and AI-era test-first.

Keep the test-pyramid shape in mind, verify against a production-equivalent DB with Testcontainers, run branch + changed-line coverage as operational metrics, and isolate flaky tests immediately. That is the practical answer for test design in 2026.

Next time we’ll cover CI/CD (pipeline design, deploy automation).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.