
[DevOps Architecture] Test Design - Pyramid + Testcontainers + Branch Coverage

About this article

As the sixth installment of the “DevOps Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains test design.

The goal of testing is not the coverage number but being able to grasp the cause within 5 minutes when something breaks. This article works through the practical decisions - the test pyramid, coverage targets, TDD, and flake countermeasures - such as what percentage of coverage to aim for, how far to take E2E, and what to run in CI.

What is test design in the first place

Picture a car inspection. Brakes, lights, emissions - there are inspection criteria for each item, and you can’t drive on public roads without passing. Driving without inspection means an accident could happen at any time.

Test design is about deciding how to build mechanisms that automatically verify whether software is working correctly. You design what to inspect, at what granularity, and at what timing, creating a state where automated checks run every time code changes.

Without test design, every time you change a single line of code, you’d have to manually verify “whether other features are broken”, and fear accompanies every change. The result: nobody wants to touch the code anymore.

Why test design matters

Eliminating fear of change

Without tests, every single-line code change requires manually checking “whether other features are broken”. With automated tests, feedback returns within minutes after a change, letting you refactor with confidence.

Grasping the cause within 5 minutes of failure

Looking at where tests broke tells you immediately which change caused the problem. Without tests, root-cause identification can take hours to days.

Ensuring quality of AI-generated code

Having humans review every piece of AI-written code does not scale. Mechanically verifying it with tests is the realistic safety net.

Think of tests split by 3 responsibilities

Organizing tests into unit, integration, and E2E layers is the industry norm, but rather than memorizing layer names, what really shapes your judgment is grasping what each layer is responsible for guaranteeing.

| Layer | Responsibility (what it guarantees) | Representative tools |
| --- | --- | --- |
| Unit Test | Logic correctness of functions/classes | Jest / Vitest / pytest / JUnit |
| Integration Test | Module integration, DB / external-API connection | Testcontainers / Supertest / pytest + Docker |
| E2E Test | Through-screen flows of user operations | Playwright / Cypress |
| Contract Test | API-contract compatibility between services | Pact / Spring Cloud Contract |
| Performance Test | Numerical boundaries of load/latency | k6 / Gatling / Locust |

Contract tests (machine verification of the API spec agreed between caller and callee) gain importance in microservices but are unneeded in monoliths. Choosing which layers each project actually needs is also part of the architect's job.

The ratio of the test pyramid

The test pyramid (proposed by Mike Cohn in 2009) is a model that stacks fast, numerous tests at the base and slow, few tests at the top. It remains the standard starting point for test strategy today.

flowchart TB
    E2E["E2E Test (10%)<br/>slow, brittle<br/>user flows only"]
    INT["Integration Test (20%)<br/>verification incl. DB / external API"]
    UNIT["Unit Test (70%)<br/>finishes in seconds, write many"]
    E2E --> INT --> UNIT
    LEFT[Speed: slow<br/>Count: few] -.- E2E
    UNIT -.- RIGHT[Speed: fast<br/>Count: many]
    classDef e2e fill:#fee2e2,stroke:#dc2626;
    classDef int fill:#fef3c7,stroke:#d97706;
    classDef unit fill:#dcfce7,stroke:#16a34a;
    class E2E e2e;
    class INT int;
    class UNIT unit;

The Unit 70 / Integration 20 / E2E 10 ratio is only a guideline, and it shifts by domain. Logic-centric domains like payments and inventory lean toward Unit 80%, while admin-panel-centric CRUD business apps push Integration toward 30-40%. What you most need to be able to trust decides the ratio.

Antipattern: ice-cream cone

What actually happens at many sites is an inverted test pyramid: many E2E tests on top of a thin Unit layer. The industry calls this the Ice-Cream Cone antipattern.

  ____________
  \          /   E2E Test     60%   ← brittle, slow, maintenance abandoned
   \--------/
    \      /     Integration  30%
     \----/
      \  /       Unit Test    10%
       \/

It tends to happen for two reasons: "E2E feels safer" and "unit tests are tedious to write." But each E2E run takes tens of seconds to minutes, flakes easily (unstable results despite identical code) because of async timing, and skip tags pile up because nobody fixes them. You end up in the hell of 500 E2E tests, half of them skipped and the other half turning red three times a day.

Stacking up E2E tests to feel safe is a trap. A thin Unit layer cannot be compensated for by E2E.

What to run in tests - phased practice

Running all tests every time in CI is unrealistic. The practical answer is, as with CI/CD itself, to split execution into phases by when the code is touched and to vary the kinds and amounts of tests run in each phase.

| Phase | When it runs | What to run | Target time |
| --- | --- | --- | --- |
| 1. pre-commit | At commit creation (local) | Lint of changed files + single test | Within 5 s |
| 2. pre-push | Just before push | Unit Tests in the change scope | Within 30 s |
| 3. PR creation/update | When pushed to GitHub | All Unit + type check + change-scope Integration | Within 10 min |
| 4. At merge | The moment it is merged to main | All Integration + smoke E2E | Within 20 min |
| 5. Nightly | Overnight batch | All E2E + load tests + Contract Tests | Several hours |

Running all E2E tests on every PR is a bad choice: development speed plummets and nobody adds more tests. The composition that works in practice is to split E2E into a post-merge smoke run (the minimum flow) plus a nightly run, with Unit + Integration being sufficient at the PR stage.

What % coverage to aim for

Coverage (the proportion of code executed by tests) should be used as a lower-bound guardrail, not as a target value. Aiming for 80% is fine, but the reality in the field is that meaningless tests start being mass-produced the moment 90% is chased.

| Target | Realistic? | Comment |
| --- | --- | --- |
| Under 40% | Dangerous | Test foundation too thin; refactoring is instant death |
| 60-70% | Common | The typical line for new projects / SaaS |
| 80% | Front-runner | Design the domain-logic core to exceed 80% |
| 90%+ | Caution | Ritualization begins; getter/setter tests get mass-produced |
| 100% | Forbidden | The cost is not worth the achievement; religious |

Coverage comes in three flavors - line, branch, and function - and the modern rule is to anchor on branch coverage (whether both arms of each if-condition were exercised). If you look only at line coverage, you can sit at 90% while only one arm of every if-statement is taken, so false positives are easy. Both jest --coverage and pytest-cov (with the --cov-branch option) can report branch coverage.

A realistic line is 80%+ for domain logic and around 60% for the rest.
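
As one concrete way to anchor on branch coverage, here is a minimal jest.config.ts sketch with a stricter bar for the domain core. The src/domain path and the exact percentages are assumptions to adapt to your own project, not prescriptions.

```typescript
// jest.config.ts - a sketch of using branch coverage as the lower-bound guardrail.
// The "./src/domain/" path and the thresholds below are assumptions, not prescriptions.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageProvider: "v8",
  // Fail the run when coverage drops below the floor, with branches as the anchor.
  coverageThreshold: {
    global: { branches: 60, lines: 60 },
    // Hold the domain-logic core to a stricter bar than the rest of the codebase.
    "./src/domain/": { branches: 80, lines: 80 },
  },
};

export default config;
```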

Coverage pitfalls - chasing only numbers breaks it

Because coverage is easy to measure, it tends to get reported to management as a KPI, but test quality collapses the moment the number alone is chased. Below are forbidden moves seen frequently in the field.

| Forbidden move | Why it's bad |
| --- | --- |
| Adopting coverage % as an executive metric | "Coverage-up" PRs sprout tests on getters/setters |
| Using a coverage threshold as a PR-merge condition | Diffs sitting just at the threshold can't pass, blocking work before review even starts |
| Tying coverage shortfalls to individual evaluation | Number-grinding starts to win over test quality |
| A uniform 80% target on all files | Auto-generated code and library wrappers get caught |
| Believing "80% coverage means no bugs" | Line coverage alone gives false positives where only one branch of an if-statement is taken |
| Deferring with "no time to write tests" | The time melted away in debugging, production incidents, and customer explanations is definitely longer |

Coverage is an indicator the team should watch internally as a quality sensor; it degrades the moment it becomes an evaluation axis or an approval blocker. For PR approval it is safer to look at coverage of new code (changed lines) rather than the absolute coverage value - Codecov's distinction between project and patch coverage is the canonical example.

AI decision axes

| AI-favored | AI-disfavored |
| --- | --- |
| Test-first (TDD) to hand the AI the spec | "It works, so OK" / tests deferred |
| Jest / Vitest / pytest (abundant AI training data) | Custom test frameworks / in-house wrappers |
| Production-equivalent DB verification with Testcontainers | DB mocks that miss "the SQL doesn't run" |
| Type checks + coverage + branch tests | A culture of relaxing because line coverage alone shows 90% |
| Human review of AI-generated tests as a culture | Unconditional merging of AI-written tests |

Concrete checkpoints:

  1. Decide the test-pyramid shape first and check that it isn't an inverted triangle
  2. Build a production-equivalent Integration environment with Testcontainers
  3. Use branch coverage + changed-line coverage as the operational metrics
  4. Establish a flake-isolation flow and a "passing on retry isn't passing" culture

TDD - the school of writing tests first

TDD (Test-Driven Development) is a development style popularized by Kent Beck in the early 2000s that cycles through three steps: write a failing test, write the minimal implementation that makes it pass, then refactor.

| Step | What to do | Red/Green |
| --- | --- | --- |
| Red | Write a failing test first | Red |
| Green | Write the minimal implementation that passes | Green |
| Refactor | Tidy the design while keeping tests green | Green |

TDD's essence is not "writing tests first" but the thinking order: verbalize the spec first, then move to implementation. When you work out the spec while implementing, it is only human to fall into "it works, so OK," and TDD functions as the device that avoids that trap.

That said, running TDD on every feature is unrealistic. The balance that works in the field is TDD for the domain-logic core (payments, inventory calculation, pricing-plan decisions) and tests written after the fact for UI wiring and CRUD.
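
To make the Red and Green steps concrete, here is a minimal sketch in Vitest. The calculateVolumeDiscount function and its discount rule are hypothetical examples invented for illustration, not something from this article.

```typescript
// Red: write the spec as a failing test first.
// calculateVolumeDiscount and the 10-item / 10% rule are hypothetical examples.
import { describe, expect, it } from "vitest";
import { calculateVolumeDiscount } from "./pricing";

describe("calculateVolumeDiscount", () => {
  it("applies no discount below 10 items", () => {
    expect(calculateVolumeDiscount({ quantity: 9, unitPrice: 1000 })).toBe(9000);
  });

  it("applies a 10% discount from 10 items", () => {
    expect(calculateVolumeDiscount({ quantity: 10, unitPrice: 1000 })).toBe(9000);
  });
});

// Green: the minimal implementation in pricing.ts that makes the tests above pass.
export function calculateVolumeDiscount(input: { quantity: number; unitPrice: number }): number {
  const total = input.quantity * input.unitPrice;
  return input.quantity >= 10 ? total * 0.9 : total;
}
```

Refactor then happens with these tests kept green, which is exactly what makes the spec-first thinking order stick.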

Flaky-test pitfalls

Flake (a test that passes or fails across runs despite identical code) is the biggest enemy of test reliability. With ten tests that each fail one run in three, a green CI only appears probabilistically and nobody trusts red anymore. Once a culture of ignoring red takes root, the organization turns into one that misses real bugs.

| Cause | Treatment |
| --- | --- |
| Time-dependent (Date.now()) | Fix the time inside the test (Jest/Vitest fake timers) |
| Dependent on async timing | Wait explicitly with waitFor / expect.poll |
| State shared between tests | Always clean the DB/cache per test |
| External-API calls over the network | Mock them or isolate with Testcontainers |
| Dependent on random values / UUIDs | Fix the seed or relax the assertion |
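
As an illustration of the first row, here is a sketch of pinning a time-dependent test with Vitest fake timers. The isExpired function and its 30-day rule are hypothetical examples.

```typescript
// Pinning down a time-dependent test so Date.now() stops being a source of flake.
// isExpired and the 30-day expiry rule are hypothetical examples.
import { afterEach, expect, it, vi } from "vitest";
import { isExpired } from "./subscription";

afterEach(() => {
  vi.useRealTimers(); // always restore real timers so state never leaks between tests
});

it("treats a subscription as expired 30 days after purchase", () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date("2026-02-01T00:00:00Z")); // Date.now() is now deterministic

  expect(isExpired({ purchasedAt: new Date("2026-01-01T00:00:00Z") })).toBe(true);
});
```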

The rule for a flaky test is to choose immediately between isolating it, fixing it, or deleting it. Operating on "it failed again, but it passed on retry, so OK" turns the retry feature itself into a breeding ground for incidents. Jest, Vitest, and Playwright all offer retries, but a healthy team treats a test that only passes on retry as not passing.

Tests that pass on retry are not passing. Two choices: isolate or fix.

Test DB and Testcontainers

The most accident-prone part of integration testing is the DB connection. Replacing the DB with a mock (a stand-in for the real thing) frequently produces "the SQL should be right, but it fails in production," and substituting a locally installed SQLite misses bugs caused by dialect differences from PostgreSQL/MySQL.

Today's standard is Testcontainers: a library that spins up real DB/Redis/Kafka instances in Docker containers for the duration of the tests and throws them away afterwards, realizing "integration tests against the same middleware as production."

| Method | Difference from prod | Speed | Verdict |
| --- | --- | --- | --- |
| Mock (replace the DB connection itself) | Large (can't verify SQL) | Very fast | SQL bugs leak through |
| Local SQLite substitute | Medium (dialect differences) | Fast | Small scale only |
| Shared staging DB | Small | Medium | Tests can't be parallelized |
| Testcontainers | Near zero | Medium (only the first run is slow) | Front-runner |

Verifying behavior down to the DB dialect is an investment worth far more than pushing up the coverage number.
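
Here is a minimal sketch of what that looks like in a Node project, assuming the @testcontainers/postgresql and pg packages; the users table and its unique constraint are hypothetical examples.

```typescript
// Integration test against a real, disposable PostgreSQL started by Testcontainers.
// The users table and the duplicate-email scenario are hypothetical examples.
import { PostgreSqlContainer, StartedPostgreSqlContainer } from "@testcontainers/postgresql";
import { Client } from "pg";
import { afterAll, beforeAll, expect, it } from "vitest";

let container: StartedPostgreSqlContainer;
let client: Client;

beforeAll(async () => {
  // Start a throwaway PostgreSQL container, ideally the same major version as production.
  container = await new PostgreSqlContainer("postgres:16").start();
  client = new Client({ connectionString: container.getConnectionUri() });
  await client.connect();
  await client.query("CREATE TABLE users (id serial PRIMARY KEY, email text UNIQUE NOT NULL)");
}, 120_000);

afterAll(async () => {
  await client.end();
  await container.stop(); // discard the container after the run
});

it("rejects duplicate emails with the same constraint behavior as production", async () => {
  await client.query("INSERT INTO users (email) VALUES ($1)", ["a@example.com"]);
  await expect(
    client.query("INSERT INTO users (email) VALUES ($1)", ["a@example.com"]),
  ).rejects.toThrow(/duplicate key/);
});
```

The first run pays the image-pull cost; after that the container starts in seconds, which is why the speed column above reads "only the first run is slow."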

Where to use mocks - mocks not to break vs mocks to use

Mocks are convenient, but overusing them erases what a test means. Misjudging the boundary leads to the worst state of all: every test green while production is broken.

  • Mock targets: external SaaS (Stripe, SendGrid, Slack), time/random, costly computations
  • Don’t mock: own project’s DB, code we wrote, modules in same process

A typical failure is mocking DB access at the repository layer: SQL bugs then sail through to production undetected. Likewise, mocking neighboring use cases in a use-case-layer test is a bad choice, because you lose the chance to make the coupling between them visible.

Don’t mock your own code, mock only the external world - the empirical iron rule.
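
A sketch of where that boundary sits in practice: mock the external SaaS call, keep your own code and the (containerized) DB real. The stripeGateway module, createOrder use case, and ordersRepository are hypothetical examples.

```typescript
// Mock only the outside world (the Stripe wrapper); exercise our own code and DB for real.
// "./payment/stripeGateway", createOrder, and ordersRepository are hypothetical examples.
import { expect, it, vi } from "vitest";

vi.mock("./payment/stripeGateway", () => ({
  chargeCard: vi.fn().mockResolvedValue({ status: "succeeded", chargeId: "ch_test_1" }),
}));

import { chargeCard } from "./payment/stripeGateway";
import { createOrder } from "./createOrder"; // our own code: not mocked
import { ordersRepository } from "./ordersRepository"; // talks to the Testcontainers DB

it("persists the order and charges the card exactly once", async () => {
  const order = await createOrder({ userId: 1, amount: 4200 });

  expect(chargeCard).toHaveBeenCalledTimes(1); // the external call, verified via the mock
  expect(await ordersRepository.findById(order.id)).toMatchObject({ amount: 4200 }); // real SQL, verified
});
```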

Contract tests

When APIs are shared across microservices and multiple teams, divergence between the caller's and the callee's API specs becomes a major cause of production incidents. A Contract Test is a mechanism for machine-verifying that the contract the caller expects matches what the callee actually returns.

| Method | Representative tools | Characteristics |
| --- | --- | --- |
| Consumer-Driven Contract | Pact | The caller writes the contract; the callee verifies it |
| Spec-First (OpenAPI contract verification) | Dredd / Prism | Verify responses against the OpenAPI spec |
| Spring family | Spring Cloud Contract | The default choice in JVM environments |

They are unneeded in a monolith or a single team, but the moment two or more teams operate microservices or a BFF (Backend-for-Frontend, an API aggregation layer dedicated to the frontend), introduction becomes worth considering. It is an investment that replaces coordination cost that melts human time with machine verification.
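
To make the consumer-driven flavor concrete, here is a minimal consumer-side sketch with Pact (@pact-foundation/pact). The user-service provider, the /users/1 endpoint, and the fetchUser client are hypothetical examples.

```typescript
// A consumer-driven contract: the caller records what it expects, the callee later verifies it.
// The provider name, endpoint, and fetchUser client are hypothetical examples.
import { PactV3, MatchersV3 } from "@pact-foundation/pact";
import { expect, it } from "vitest";
import { fetchUser } from "./userClient";

const provider = new PactV3({ consumer: "web-frontend", provider: "user-service" });

it("GET /users/1 returns the id and email the frontend relies on", async () => {
  provider
    .given("user 1 exists")
    .uponReceiving("a request for user 1")
    .withRequest({ method: "GET", path: "/users/1" })
    .willRespondWith({
      status: 200,
      headers: { "Content-Type": "application/json" },
      body: MatchersV3.like({ id: 1, email: "a@example.com" }),
    });

  // Pact starts a mock provider, runs the client against it, and writes the contract file;
  // the user-service team then replays that contract against their real implementation.
  await provider.executeTest(async (mockServer) => {
    const user = await fetchUser(mockServer.url, 1);
    expect(user.email).toBe("a@example.com");
  });
});
```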

Test-data design

Another underestimated part of testing is how test data is created. If everything is filled with bland fixed data like user_01 / product_01, then as tests multiply you can no longer tell what case each one was actually testing.

| Approach | Content | Suited for |
| --- | --- | --- |
| Fixture (fixed data files) | Inject preset data via YAML/JSON | Small, read-heavy apps |
| Factory (factory functions) | Generate via userFactory({role: 'admin'}) | Mid-to-large projects (front-runner) |
| Builder pattern | Assemble via method chains | Complex entities |
| Faker (random generation) | Auto-generate names/addresses | Load tests, mass data |

The Factory pattern (FactoryBot, fishery, @mikro-orm/seeder, etc.) is the modern standard. The style of "write only the diff that matters, default everything else" makes tests readable as a spec. Writing everything as fixtures hides the dependencies between tests and breaks down over the medium term.
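
A sketch of that style with fishery; the User type and its default field values are hypothetical examples.

```typescript
// Factory style with fishery: defaults live in one place, tests state only the meaningful diff.
// The User type and the default values are hypothetical examples.
import { Factory } from "fishery";

interface User {
  id: number;
  email: string;
  role: "member" | "admin";
}

// sequence keeps generated rows unique without hand-picking ids.
const userFactory = Factory.define<User>(({ sequence }) => ({
  id: sequence,
  email: `user-${sequence}@example.com`,
  role: "member",
}));

// In a test, only the property the case is about gets spelled out.
const admin = userFactory.build({ role: "admin" });
const members = userFactory.buildList(3);
```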

What to decide - what is your project’s answer?

For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these are still vague invariably invites the later question "why did we decide it this way again?"

  • Test-pyramid ratio target (Unit / Integration / E2E)
  • Coverage target (overall %, domain core %, branch coverage adoption)
  • CI scope on PR (change scope only or all Unit)
  • E2E execution timing (post-merge smoke + nightly)
  • Test-DB strategy (Testcontainers / mock / shared DB)
  • Flaky-test handling (isolate → fix → delete flow)
  • Contract Test introduction (judge by microservices / BFF presence)
  • Test-data approach (Factory / Fixture / Builder)

Author’s note - team broken by KPI-izing coverage

A widely known industry case: a mid-size SaaS company set "coverage 80%" as a team KPI, and within three months getter/setter tests had grown by thousands of lines while the essential domain-logic bugs did not decrease. The number goes up easily, but quality doesn't change, because "raising coverage" and "reducing bugs" are different things.

The team then switched its KPIs to two axes - "new bugs reaching production" and "PR-diff coverage (changed lines)" - and abolished the absolute coverage target. As a result test quality recovered, and, tellingly, the junk getter/setter tests ended up being deleted in bulk.

Test value can only be measured by track record of reducing bugs.

Summary

This article covered test design, including test pyramid, coverage, TDD, flake countermeasures, Testcontainers, mock boundaries, and AI-era test-first.

Keep the test-pyramid shape in mind, verify against a production-equivalent DB with Testcontainers, run branch + changed-line coverage as operational metrics, and isolate flaky tests immediately. That is the practical answer for test design in 2026.

Next time we’ll cover CI/CD (pipeline design, deploy automation).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.