About this article
As the sixth installment of the "DevOps Architecture" category in the series "Architecture Crash Course for the Generative-AI Era," this article explains test design.
The goal of testing is not the coverage number but being able to grasp the cause within 5 minutes of a break. This article covers practical decisions such as the test pyramid, coverage targets, TDD, and flake countermeasures: what coverage percentage to aim for, how far to take E2E, and what to run in CI.
What is test design in the first place
Picture a car inspection. Brakes, lights, emissions: each item has inspection criteria, and you can't drive on public roads without passing. Driving without an inspection means an accident could happen at any time.
Test design is about deciding how to build mechanisms that automatically verify whether software is working correctly. You design what to inspect, at what granularity, and at what timing, creating a state where automated checks run every time code changes.
Without test design, every one-line code change would force you to manually verify that no other feature broke, and fear would accompany every change. The result: nobody wants to touch the code anymore.
Why test design matters
Eliminating fear of change
Without tests, every one-line code change requires manually checking that no other feature broke. With automated tests, feedback comes back within minutes of a change, letting you refactor with confidence.
Grasping the cause within 5 minutes of failure
Looking at where tests broke tells you immediately which change caused the problem. Without tests, root-cause identification can take hours to days.
Ensuring quality of AI-generated code
Having humans review every piece of AI-written code has its limits. Verifying it mechanically with tests is the realistic safety net.
Think of tests as split by 3 responsibilities
Organizing tests into the layers "unit," "integration," and "E2E" is the industry convention, but rather than memorizing the layer names, what sharpens your judgment is grasping what each layer is responsible for guaranteeing.
| Layer | Responsibility (what it guarantees) | Representative tools |
|---|---|---|
| Unit Test | Logic correctness of functions/classes | Jest / Vitest / pytest / JUnit |
| Integration Test | Module integration, DB / external-API connection | Testcontainers / Supertest / pytest + Docker |
| E2E Test | Through-screen flows of user operations | Playwright / Cypress |
| Contract Test | API-contract compatibility between services | Pact / Spring Cloud Contract |
| Performance Test | Numerical boundaries of load/latency | k6 / Gatling / Locust |
Contract tests (which machine-verify the API contract agreed between caller and callee) gain importance with microservices but are unnecessary in a monolith. Choosing which layers each project needs is also part of the architect's job.
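As a reference point, here is a minimal sketch of what the bottom layer looks like in practice: pure logic, no DB, no network, so it runs in milliseconds and can exist in the hundreds. The priceWithTax function is a hypothetical example invented for illustration, not code from this article.

```typescript
// price.test.ts - a minimal unit test: pure logic, no I/O,
// the kind of test that should make up the bulk of the suite.
import { describe, expect, test } from '@jest/globals';

// Hypothetical function under test.
function priceWithTax(net: number, rate = 0.1): number {
  if (net < 0) throw new Error('price must be non-negative');
  return Math.round(net * (1 + rate));
}

describe('priceWithTax', () => {
  test('applies the default 10% rate', () => {
    expect(priceWithTax(100)).toBe(110);
  });

  test('rejects negative prices', () => {
    expect(() => priceWithTax(-1)).toThrow('price must be non-negative');
  });
});
```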
The ratio of the test pyramid
The test pyramid (proposed by Mike Cohn in 2009) is a model that organizes tests as "fast, numerous, stacked low" versus "slow, few, stacked high." It remains the standard starting point for test strategy today.
```mermaid
flowchart TB
    E2E["E2E Test (10%)<br/>slow, brittle<br/>user flows only"]
    INT["Integration Test (20%)<br/>verification incl. DB / external API"]
    UNIT["Unit Test (70%)<br/>finishes in seconds, write many"]
    E2E --> INT --> UNIT
    LEFT[Speed: slow<br/>Count: few] -.- E2E
    UNIT -.- RIGHT[Speed: fast<br/>Count: many]
    classDef e2e fill:#fee2e2,stroke:#dc2626;
    classDef int fill:#fef3c7,stroke:#d97706;
    classDef unit fill:#dcfce7,stroke:#16a34a;
    class E2E e2e;
    class INT int;
    class UNIT unit;
```
The Unit 70 / Integration 20 / E2E 10 ratio is only a guideline, and it shifts by domain. Logic-centric domains such as payments or inventory lean toward Unit 80%, while CRUD-heavy business apps built around admin panels can reasonably push Integration to 30-40%. What you most need to be able to trust decides the ratio.
Antipattern: ice-cream cone
What actually happens at many sites is an inverted test pyramid: lots of E2E, a thin Unit layer. The industry calls this the Ice-Cream Cone antipattern.
```
 ____________
 \          /   E2E Test 60%  <- brittle, slow, maintenance abandoned
  \--------/
   \      /     Integration 30%
    \----/
     \  /       Unit Test 10%
      \/
```
It tends to happen for two reasons: "E2E feels safer" and "unit tests are tedious to write." But E2E tests take tens of seconds to minutes each, flake easily (unstable results despite identical code) because of async timing, and skip tags pile up because nobody fixes them. You end up in the hell of 500 E2E tests, half of them skipped and the other half turning red three times a day.
Stacking up E2E tests to feel safe is a trap. A thin Unit layer cannot be compensated for by E2E.
What to run in tests - phased practice
Running "all tests every time" in CI is unrealistic. The practical answer, as with CI/CD in general, is to split execution into phases keyed to when the code is touched and to vary the kind and amount of tests run in each phase.
| Phase | When it runs | What to run | Target time |
|---|---|---|---|
| 1. pre-commit | At commit creation (local) | Lint of changed files + single test | Within 5s |
| 2. pre-push | Just before push | Unit Tests of change scope | Within 30s |
| 3. PR creation/update | When pushed to GitHub | All Unit + type check + change-scope Integration | Within 10 min |
| 4. At merge | Moment merged to main | All Integration + smoke E2E | Within 20 min |
| 5. Nightly | Overnight batch | All E2E + load tests + Contract Tests | Several hours |
Running all E2E tests on every PR is a bad choice: development speed plummets and nobody adds new tests. Split E2E into a post-merge smoke run (the minimum flow) and a nightly run, and let Unit + Integration suffice at the PR stage; that composition works in the field.
What % coverage to aim for
Coverage (the percentage of code executed by tests) is used as a lower bound, not a target value. Aiming for 80% is fine, but the reality in the field is that meaningless tests start being mass-produced the moment 90% is chased.
| Target | Realistic? | Comment |
|---|---|---|
| Under 40% | Dangerous | Test foundation too thin; refactoring is instant death |
| 60-70% | Common | The typical line for new projects and SaaS |
| 80% | Recommended | Design the domain-logic core to exceed 80% |
| 90%+ | Caution | Ritualization sets in; getter/setter test mass-production starts |
| 100% | Forbidden | The cost is not worth the achievement; dogma, not engineering |
Coverage comes in 3 types (line, branch, function), and anchoring on branch coverage (whether both arms of each if-condition were exercised) is the modern rule. Line coverage alone can read 90% while only one arm of your if-statements is ever taken, so false positives come easily. Both jest --coverage and pytest-cov can output branch coverage.
Realistically, coverage settles at 80%+ for domain logic and around 60% for the rest.
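If you use Jest, that posture can be encoded directly in the config so CI enforces it mechanically. A minimal sketch, assuming domain logic lives under src/domain/ (an assumption; adjust the path to your layout):

```typescript
// jest.config.ts - a sketch of coverage gates: a modest global floor,
// plus a stricter branch-coverage floor for the domain core only.
import type { Config } from 'jest';

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    // Lower-bound line for the codebase as a whole.
    global: { branches: 60, lines: 60 },
    // Files matching a path key are held to their own floor
    // (and excluded from the global one).
    './src/domain/': { branches: 80, lines: 80 },
  },
};

export default config;
```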
Coverage pitfalls - chasing only numbers breaks it
Because coverage is easy to measure, it tends to get reported to management as a KPI, but test quality collapses the moment the number alone is chased. Below are forbidden moves that happen frequently in the field.
| Forbidden move | Why itâs bad |
|---|---|
| Adopt coverage % as an executive metric | "Coverage-up" PRs sprout tests for getters/setters |
| Use a coverage threshold as a PR-merge gate | Diffs hovering at the threshold fail to pass, blocking work before review |
| Tie coverage shortfalls to individual evaluations | Grinding numbers starts to win over test quality |
| Apply a uniform 80% target to every file | Auto-generated code and library wrappers get caught |
| Believe "80% coverage means no bugs" | Line coverage alone gives false positives where only one branch of an if is taken |
| Defer with "no time to write tests" | The time lost to debugging, production incidents, and customer explanations is invariably longer |
Coverage is an indicator best consumed internally as the team's quality sensor; it degrades when used as an evaluation axis or an approval blocker. For PR approval, gating on the coverage of new code (changed lines) is safer than gating on the absolute value; Codecov's distinction between project and patch coverage is the canonical example.
AI decision axes
| AI-favored | AI-disfavored |
|---|---|
| Test-first (TDD) that hands AI the spec | "It works, so it's fine" / tests deferred |
| Jest / Vitest / pytest (abundant in AI training data) | Custom test frameworks / in-house wrappers |
| Production-equivalent DB verification with Testcontainers | DB mocks that miss "the SQL doesn't actually run" |
| Type checks + coverage + branch tests | A culture satisfied with 90% line coverage alone |
| A culture of humans reviewing AI-generated tests | Merging AI-written tests unconditionally |
- Decide the test-pyramid shape first; check that it isn't an inverted triangle
- Build a production-equivalent Integration environment with Testcontainers
- Use branch coverage plus changed-line coverage as operational metrics
- Have a flake-isolation flow and a culture of "passing on retry isn't passing"
TDD - the school of writing tests first
TDD (Test-Driven Development) is a development style popularized by Kent Beck in the 2000s, cycling through 3 steps: write a failing test → write the minimal implementation that passes → refactor.
| Step | What to do | Red/Green |
|---|---|---|
| Red | Write a failing test first | Red |
| Green | Write minimal implementation to pass | Green |
| Refactor | Tidy design while keeping tests green | Green |
TDD's essence isn't "writing tests first" but the thinking order: verbalize the spec first, then enter implementation. Working out the spec while implementing makes humans slide into "it works, so it's fine," and TDD functions as the device that avoids that trap.
But running TDD on every feature is unrealistic. The balance that holds up in the field is TDD for the domain-logic core (payments, inventory calculation, pricing-plan decisions) and after-the-fact tests for UI wiring and CRUD.
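As an illustration of the Red and Green steps, here is what "verbalize the spec first" can look like. Both files are hypothetical pricing-plan logic invented for this sketch, not code from the article.

```typescript
// discount.test.ts - Red: written before discountRate exists,
// so it fails until the minimal implementation is added.
import { expect, test } from '@jest/globals';
import { discountRate } from './discount';

test('annual plans get a 20% discount, monthly plans none', () => {
  expect(discountRate('annual')).toBe(0.2);
  expect(discountRate('monthly')).toBe(0);
});
```

```typescript
// discount.ts - Green: the minimal implementation that passes.
// Refactor comes afterward, with the test pinning the behavior down.
export function discountRate(plan: 'annual' | 'monthly'): number {
  return plan === 'annual' ? 0.2 : 0;
}
```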
Flaky-test pitfalls
Flakes (tests that pass and fail across runs despite identical code) are the biggest enemy of test reliability. With 10 tests that each fail one run in three, a green CI emerges only probabilistically and nobody trusts a red one. Once a culture of ignoring red takes root, the organization turns into one that misses real bugs.
| Cause | Treatment |
|---|---|
| Time-dependent (Date.now()) | Freeze time inside the test (Jest's fake timers) |
| Async-timing dependent | Wait with explicit waitFor / expect.poll |
| State sharing between tests | Always clean DB/cache per test |
| External-API calls over network | Mock or isolate with Testcontainers |
| Random/UUID dependent | Fix seed or relax contract |
The rule for a flaky test is to choose immediately among isolation, fix, or deletion. Operating on "it failed again, but it passed on retry, so fine" turns the retry feature itself into a hotbed of incidents. Jest, Vitest, and Playwright all ship retry features, but a sound team treats a test that only passes on retry as not passing.
A test that passes on retry is not passing. Two choices: isolate or fix.
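For the most common cause in the table above, time dependence, Jest's fake timers remove the flake at the root. A minimal sketch; isTrialExpired is a hypothetical helper built on Date.now(), invented for illustration.

```typescript
// trial.test.ts - freeze "now" so the result cannot depend on
// when CI happens to run the test.
import { afterEach, beforeEach, expect, jest, test } from '@jest/globals';

// Hypothetical function under test.
function isTrialExpired(startedAt: Date, days = 14): boolean {
  return Date.now() - startedAt.getTime() > days * 24 * 60 * 60 * 1000;
}

beforeEach(() => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2026-01-15T09:00:00Z')); // pin the clock
});

afterEach(() => {
  jest.useRealTimers();
});

test('a trial started 15 days ago is expired', () => {
  expect(isTrialExpired(new Date('2025-12-31T09:00:00Z'))).toBe(true);
});

test('a trial started yesterday is not', () => {
  expect(isTrialExpired(new Date('2026-01-14T09:00:00Z'))).toBe(false);
});
```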
Test DB and Testcontainers
The most accident-prone part of integration testing is the DB connection. Replacing the DB with a mock (a stand-in for the real thing) frequently produces "the SQL should be right, yet it fails in production," and substituting a locally installed SQLite misses bugs caused by dialect differences from PostgreSQL/MySQL.
Today's standard is Testcontainers: a library that starts a real DB/Redis/Kafka in Docker containers for the duration of the tests and throws them away afterward, realizing "integration tests on the same middleware as production."
| Method | Gap from production | Speed | Verdict |
|---|---|---|---|
| Mock (replace the DB connection itself) | Large (can't verify SQL) | Fastest | SQL bugs leak through |
| Local SQLite substitute | Medium (dialect differences) | Fast | Small scale only |
| Shared staging DB | Small | Medium | Tests can't run in parallel |
| Testcontainers | Near zero | Medium (only the first run is slow) | Recommended |
Verifying all the way down to the DB dialect is an investment worth 100x more than pushing up coverage numbers.
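A minimal sketch with the Node.js Testcontainers module (@testcontainers/postgresql) and the pg driver; it assumes Docker is available where the tests run, and the table schema is invented for illustration. The point is the ON CONFLICT clause: PostgreSQL-specific SQL that a mock or an SQLite substitute would never exercise.

```typescript
// users.integration.test.ts - run the test against a real, throwaway PostgreSQL.
import { PostgreSqlContainer, StartedPostgreSqlContainer } from '@testcontainers/postgresql';
import { Client } from 'pg';
import { afterAll, beforeAll, expect, test } from '@jest/globals';

let container: StartedPostgreSqlContainer;
let db: Client;

beforeAll(async () => {
  container = await new PostgreSqlContainer('postgres:16').start(); // real PostgreSQL in Docker
  db = new Client({ connectionString: container.getConnectionUri() });
  await db.connect();
  await db.query('CREATE TABLE users (id serial PRIMARY KEY, email text UNIQUE)');
}, 60_000); // the first run pulls the image, so allow extra time

afterAll(async () => {
  await db.end();
  await container.stop(); // throw the container away
});

test('duplicate emails are ignored via PostgreSQL upsert', async () => {
  const insert = 'INSERT INTO users (email) VALUES ($1) ON CONFLICT (email) DO NOTHING';
  await db.query(insert, ['a@example.com']);
  await db.query(insert, ['a@example.com']); // dialect-specific path a mock would skip
  const { rows } = await db.query('SELECT count(*)::int AS n FROM users');
  expect(rows[0].n).toBe(1);
});
```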
Where to use mocks - what to mock vs what not to mock
Mocks are convenient, but overused they erase the meaning of a test. Misjudging the boundary leads to the worst possible state: all tests green while production is broken.
- Mock targets: external SaaS (Stripe, SendGrid, Slack), time/random, costly computations
- Don't mock: your own project's DB, code you wrote, modules in the same process
A typical failure is mocking DB access at the repository layer, which lets SQL bugs ride into production undetected. Similarly, mocking neighboring use cases in a use-case-layer test is a bad choice: you lose the chance to make the coupling visible.
Don't mock your own code; mock only the external world. That is the iron rule borne out by experience.
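A sketch of that boundary in test form. All module names here (stripe.client, charge, orders.repo) are hypothetical; the point is which side of the line each dependency falls on.

```typescript
// charge.test.ts - mock only the external world (Stripe), keep our own code real.
import { expect, jest, test } from '@jest/globals';
import { chargeOrder } from './charge';          // our code: runs for real
import { ordersRepo } from './orders.repo';      // our code: real DB (e.g. via Testcontainers)
import { stripeClient } from './stripe.client';  // external SaaS: the one thing we mock

jest.mock('./stripe.client'); // replace only the boundary to the outside world

test('a successful charge marks the order paid', async () => {
  jest.mocked(stripeClient.createCharge).mockResolvedValue({ status: 'succeeded' });

  const order = await ordersRepo.insert({ totalYen: 4200 }); // real SQL, real dialect
  await chargeOrder(order.id);

  expect((await ordersRepo.findById(order.id)).status).toBe('paid');
});
```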
Contract tests
When APIs are shared across microservices and multiple teams, divergence between the caller's and the callee's API specs becomes a major cause of production incidents. A contract test is a mechanism that machine-verifies whether the contract the caller expects matches what the callee actually returns.
| Method | Representative tools | Characteristics |
|---|---|---|
| Consumer-Driven Contract | Pact | Caller writes contract, callee verifies |
| Spec-First (OpenAPI contract verification) | Dredd / Prism | Verify response against OpenAPI spec |
| Spring family | Spring Cloud Contract | The default choice in JVM environments |
Unneeded for a monolith or a single team, but the moment two or more teams operate microservices or a BFF (Backend-for-Frontend, an API aggregation layer dedicated to the frontend), introduction is worth considering. It is an investment that swaps coordination cost, which melts human time, for machine verification.
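For a flavor of consumer-driven contracts, here is a sketch with Pact's JS library (@pact-foundation/pact); the service and field names are hypothetical. The consumer test below both verifies the client against a Pact mock server and records a contract file that the provider team replays against their real implementation.

```typescript
// user-client.pact.test.ts - consumer side: record what we expect from the provider.
import { PactV3, MatchersV3 } from '@pact-foundation/pact';
import { expect, test } from '@jest/globals';

const provider = new PactV3({ consumer: 'web-frontend', provider: 'user-service' });

test('GET /users/1 returns the user shape we rely on', async () => {
  provider.addInteraction({
    states: [{ description: 'user 1 exists' }],
    uponReceiving: 'a request for user 1',
    withRequest: { method: 'GET', path: '/users/1' },
    willRespondWith: {
      status: 200,
      body: MatchersV3.like({ id: 1, name: 'Alice' }), // match the shape, not exact values
    },
  });

  await provider.executeTest(async (mockServer) => {
    // Node 18+ global fetch assumed.
    const res = await fetch(`${mockServer.url}/users/1`);
    const user = await res.json();
    expect(user.id).toBe(1); // the recorded pact file is what the provider verifies
  });
});
```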
Test-data design
Another underestimated part of testing is how to create test data. Fill everything with bland fixed data like user_01 / product_01 and, as tests multiply, you can no longer tell what case each test was checking.
| Approach | Content | Suited for |
|---|---|---|
| Fixture (fixed data files) | Inject preset data via YAML/JSON | Small, read-heavy apps |
| Factory (factory functions) | Generate via userFactory({role: 'admin'}) | Mid-to-large scale (recommended) |
| Builder pattern | Assemble via chains | Complex entities |
| Faker (random generation) | Auto-generate names/addresses | Load tests, mass data |
The factory pattern (FactoryBot, fishery, @mikro-orm/seeder, etc.) is the modern standard. The style of writing only the meaningful diff and defaulting the rest turns tests into something readable as a spec. Putting everything into fixtures makes inter-test dependencies invisible and breaks down in the medium term.
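A sketch with fishery, the TypeScript factory library named above; the User shape is hypothetical. Only the diff that matters to the test appears at the call site.

```typescript
// user.factory.ts - defaults live in one place; tests state only meaningful diffs.
import { Factory } from 'fishery';

interface User {
  id: number;
  email: string;
  role: 'member' | 'admin';
}

export const userFactory = Factory.define<User>(({ sequence }) => ({
  id: sequence,
  email: `user-${sequence}@example.com`,
  role: 'member',
}));

// In a test: the "admin" diff is the spec; everything else stays out of sight.
const admin = userFactory.build({ role: 'admin' });
const members = userFactory.buildList(3); // three default members
```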
What to decide - what is your project's answer?
For each of the following, try to articulate your project's answer in 1-2 sentences. Starting work while these are vague invariably invites the later question "why did we decide it this way again?"
- Test-pyramid ratio target (Unit / Integration / E2E)
- Coverage target (overall %, domain core %, branch coverage adoption)
- CI scope on PR (change scope only or all Unit)
- E2E execution timing (post-merge smoke + nightly)
- Test-DB strategy (Testcontainers / mock / shared DB)
- Flaky-test handling (the isolate → fix → delete flow)
- Contract Test introduction (judge by microservices / BFF presence)
- Test-data approach (Factory / Fixture / Builder)
Author's note - a team broken by turning coverage into a KPI
A widely known industry case: a mid-size SaaS company set "80% coverage" as a team KPI, and within 3 months getter/setter tests had grown by thousands of lines while bugs in the essential domain logic didn't decrease. The number rises easily, but quality doesn't move, because raising coverage and reducing bugs are different things.
The team then switched its KPIs to two axes, "count of new bugs reaching production" and "PR-diff coverage (changed lines only)," abolishing the absolute coverage target. Test quality recovered, and in a telling outcome, the junk getter/setter tests were deleted en masse.
The value of tests can only be measured by their track record of reducing bugs.
Summary
This article covered test design: the test pyramid, coverage, TDD, flake countermeasures, Testcontainers, mock boundaries, and test-first in the AI era.
Keep the test pyramid in mind, verify against a production-equivalent DB with Testcontainers, operate on branch and changed-line coverage as metrics, and isolate flakes immediately. That is the practical answer for test design in 2026.
Next time we'll cover CI/CD (pipeline design and deploy automation).
I hope you'll read the next article as well.
Series: Architecture Crash Course for the Generative-AI Era (59/89)