About this article
As the third installment of the “Solution Architecture” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains non-functional requirements.
Functional requirements can be written by the business side; non-functional requirements cannot be written without specialists. Vague non-functional requirements lead to post-completion firestorms of "it runs, but it's slow, it stops, and operations are hell." This article covers quantifying performance, availability, security, and operability in numbers; the IPA non-functional requirement grades; and automating non-functional tests in the AI era.
What are non-functional requirements in the first place
In a nutshell, non-functional requirements are “rules that define not ‘what the system does’ but ‘how well it runs.’”
Think of earthquake resistance and insulation specs when building a house. The floor plan (functional requirements) can be decided by the residents, but “can it withstand a magnitude 6 earthquake?” “can it maintain winter room temperature at a certain degree?” — only specialists can design these. And raising the seismic rating after construction is essentially a rebuild. Software is the same: without settling quality standards like “respond within 1 second” or “maintain 99.9% monthly uptime” upfront, you end up with the post-completion firestorm of “runs but slow, stops, ops is hell.”
Why non-functional requirements are needed
Prevent “completed but unusable”
Even with perfect features, a system that takes 10 seconds to respond goes unused. Without agreed numbers, disputes arise at acceptance.
Becomes basis for cost estimates
"99.9% uptime" and "99.99% uptime" can differ in build cost by a factor of five. Estimates only emerge once the numbers are settled.
Alignment with regulatory requirements
Finance, medical, and personal-information systems often have non-functional requirement levels mandated by law; without clarifying them early, the risk of violations emerges.
Main NFR categories
The non-functional requirement grades published by IPA (Information-technology Promotion Agency) are the standard classification in Japan, organizing the field comprehensively into six major categories.
```mermaid
flowchart TB
    NFR([Non-functional requirements])
    AVAIL[Availability<br/>uptime 99.9% etc.]
    PERF[Performance / scalability<br/>response, TPS]
    OPS[Operability / maintainability<br/>monitoring/backup]
    MIG[Migratability<br/>data/env migration]
    SEC[Security<br/>authentication/encryption]
    ENV[System environment<br/>OS/browser premise]
    NFR --> AVAIL
    NFR --> PERF
    NFR --> OPS
    NFR --> MIG
    NFR --> SEC
    NFR --> ENV
    BAD[Making non-functionals vague<br/>= post-completion firestorm]
    BAD -.->|common failure| NFR
    classDef root fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    classDef cat fill:#dbeafe,stroke:#2563eb;
    classDef bad fill:#fee2e2,stroke:#dc2626;
    class NFR root;
    class AVAIL,PERF,OPS,MIG,SEC,ENV cat;
    class BAD bad;
```
| Category | Content |
|---|---|
| Availability | Degree of not stopping |
| Performance / scalability | Speed, scale |
| Operability / maintainability | Ease of operations |
| Migratability | Ease of migration |
| Security | Whether protected |
| System environment | Premise environment, requirements |
IPA’s non-functional-requirement grades are a free-to-use template, widely used in Japanese companies.
Availability
Define the degree to which the system does not stop. This is the same viewpoint as an SLO; numerical targets are required.
| Metric | Content | Typical |
|---|---|---|
| Uptime | What % running monthly | 99.9% (43 min monthly down) |
| RTO | Recovery target time on failure | 1 hour |
| RPO | Allowed data-loss time | 15 min |
| MTBF | Mean time between failures | 30 days |
| MTTR | Mean time to repair | 30 min |
Promising 99.99% is an extremely strict level that allows only 4.3 minutes of downtime per month. Costs spike accordingly, so choose a level that matches the business requirements.
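As a sanity check, the downtime figures quoted above can be derived directly from the availability percentage. A minimal sketch, assuming a 30-day month:

```python
# Convert an availability target into an allowed monthly downtime budget.
# Assumes a 30-day month (43,200 minutes), matching the figures above.

def monthly_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed downtime (minutes per month) for a given availability %."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {monthly_downtime_minutes(target):.1f} min/month")
```

This is where the "43 min vs 4.3 min per month" gap between 99.9% and 99.99% comes from.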
Performance
Define how fast and how much the system can process. For each type of business operation, quantify response time and throughput in concrete numbers.
| Metric | Content | Typical example |
|---|---|---|
| Response time | Per-request processing time | Within 300ms at P95 |
| Throughput | Per-unit-time processing count | 1000 req/sec |
| Concurrent connections | Parallel users | 10,000 |
| Peak multiplier | Peak-time load | 10x normal |
| Latency | Network delay | Under 50ms |
For "response within 3 seconds," clarify whether that means the average or the maximum. The modern practice is to define it at P95 / P99 (the 95th / 99th percentile).
Scalability
Define whether the system can respond to future growth. Design beyond launch-time scale, factoring in growth projections for several years ahead.
| Metric | Content |
|---|---|
| Horizontal | Can add servers to handle |
| Vertical | Can boost CPU / memory to handle |
| Data | DB-capacity growth |
| User | 10x / 100x growth |
| Geographic | Overseas expansion |
Designing all from the start is excessive, but scenarios for phased expansion need consideration.
Operability / maintainability
Define ease of operations. Weakness here makes ops-team load explode, increasing incidents.
| Item | Content |
|---|---|
| Backup | Frequency, retention, restoration test |
| Monitoring | What to monitor at what frequency |
| Log retention | Period, capacity |
| Deploy | Frequency, downtime |
| Documentation | Operational manual setup |
| On-call | 24/7 response regime |
If operations are premised on outsourcing, define them at a level that can actually be outsourced.
Security
Define levels to protect. Levels vary with handled-data sensitivity and law/regulation.
| Item | Content |
|---|---|
| Authentication | MFA (Multi-Factor Authentication) required, password strength |
| Authorization | Permission design, least privilege |
| Encryption | Communication, storage, key management |
| Audit logs | Retention, tamper-prevention |
| Vulnerability response | Patch-application SLA |
| Penetration testing | Frequency, scope |
Always weave in regulatory requirements like Personal Information Protection Act, GDPR, and PCI DSS.
Migratability
Define the ease of migration from existing systems. Many projects break down because migration planning was neglected, so this should not be underestimated.
| Item | Content |
|---|---|
| Data-migration method | Bulk / phased |
| Parallel operation | New-old coexistence period |
| Rollback | Reversion procedure, conditions |
| System-stop time | At cutover |
| User training | Education plan |
| Business-stop impact | Business-department coordination |
Countermeasures against NFR gaps
NFRs contain many easily forgotten items. Use a comprehensive checklist such as IPA's non-functional requirement grades to eliminate gaps.
| Easily-forgotten items | Content |
|---|---|
| Browser-support scope | IE11? Latest Chrome only? |
| Character encoding | UTF-8, emoji support |
| Timezone | UTC, JST, multiple regions |
| Multilingual support | i18n (internationalization), L10n (localization) |
| Accessibility | WCAG (Web Content Accessibility Guidelines) 2.1 compliance |
| Disaster countermeasure | DR (Disaster Recovery), geo-distribution |
| Log retention | Legal requirements |
These are the items that easily turn into post-completion firestorms of "we didn't know we had to support that." Define them from the start.
Relationship with SLA / SLO
NFRs are closely linked to SLAs and SLOs. The SLA is the external contractual promise, the SLO the internal operational target, and the NFR the design-time target value.
| | NFR | SLO | SLA |
|---|---|---|---|
| Phase | At design | At operation | At contract |
| Nature | Target | Internal target | External contract |
| On violation | Design change | Improvement investment | Penalty / reduction |
For NFRs that affect the SLA (availability, performance), the iron rule is to set them stricter than the SLA.
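One common way to make "stricter than the SLA" concrete is to give the internal target only a fraction of the SLA's error budget. A minimal sketch; the 50% margin is an illustrative assumption, not a standard:

```python
# Derive an internal SLO stricter than the contractual SLA by keeping
# only a fraction (`margin`) of the SLA's error budget for internal use.

def stricter_slo(sla_pct: float, margin: float = 0.5) -> float:
    sla_error_budget = 100.0 - sla_pct        # e.g. SLA 99.9% leaves 0.1%
    return 100.0 - sla_error_budget * margin  # internal SLO uses half of it

print(stricter_slo(99.9))    # internal SLO derived from SLA 99.9%
print(stricter_slo(99.95))   # internal SLO derived from SLA 99.95%
```

With a 50% margin, an SLA of 99.9% yields an internal SLO of 99.95%, leaving headroom before a contractual breach triggers penalties.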
Decision criterion 1: system nature
The strictness of NFRs varies with business importance and how widely the system is exposed.
| System nature | Availability guideline |
|---|---|
| Internal tools | 99% |
| General B2C services | 99.9% |
| B2B SaaS | 99.95% |
| Finance / payments | 99.99% |
| Power / telecom | 99.999% |
Decision criterion 2: org regime
The achievability of NFRs depends on the scale of the ops team. Without a 24/7 regime, 99.99% cannot be kept.
| Ops regime | Possible availability |
|---|---|
| Business hours only | 99% |
| Extended hours | 99.5% |
| 24/7 on-call | 99.9% |
| 24/7 SRE (Site Reliability Engineering) dedicated | 99.95% |
| Multi-region / Follow-the-Sun | 99.99%+ |
How to choose by case
In-house tools / business-hours use
Availability 99% + response within 3 seconds + daily backup. Equivalent to IPA non-functional grade "Model System 1." No SLA needed; RTO 24 hours / RPO 1 day is enough. For security, internal ID federation plus TLS covers the minimum.
General B2C web service
Availability 99.9% + P95 500ms + 24/7 on-call + automated backup. IPA "Model 2"; introduce SLO management, PII (Personally Identifiable Information) masking for the Personal Information Protection Act, and an annual pentest. Optimize cost with AWS / GCP managed services.
B2B SaaS / enterprise customers
Availability 99.95% + SLA contract + 7-year audit logs + SOC 2 (a US audit standard for service organizations' security and availability) compliance. Equivalent to IPA "Model 3"; individual SLA agreements per customer, RTO / RPO stated explicitly in the contract, with ISO 27001 certification in view. Include multi-tenant isolation design in the NFRs.
Finance / payments / medical
Availability 99.99%+ + multi-region DR + FISC / PCI DSS / HIPAA compliance. IPA "Model 4"; 24/7 dedicated SRE, annual pentest and quarterly vulnerability scans, encryption with FIPS 140-2-certified HSMs (Hardware Security Modules), tamper-proof audit logs. Here the NFRs are integrated with the regulations themselves.
Service-type x NFR numerical gates
Note: Industry baseline values as of April 2026. Will become outdated as technology and the talent market shift, so requires periodic updates.
NFRs are an area where real discussion only starts once numbers are on the table. Below is a correspondence table of industry-standard values.
| Service type | Availability | RTO | RPO | Response time (P95) | Monthly-cost guideline |
|---|---|---|---|---|---|
| Internal tools | 99% | 24 hours | 1 day | 3 sec | Tens of thousands of yen |
| General B2C web | 99.9% | 1 hour | 15 min | 500ms | Hundreds of thousands |
| B2B SaaS | 99.95% | 30 min | 5 min | 300ms | Hundreds of thousands to millions |
| Finance / payments | 99.99% | 5 min | 1 min | 100ms | Millions+ |
| Telecom / power | 99.999% | 30 sec | 10 sec | 50ms | Tens of millions+ |
The empirical rule: 99.9% and 99.99% differ in build cost by several times. Even when the business side asks for a "system that never stops," presenting the numbers often gets the answer "99.9% is enough." IPA's non-functional requirement grades are Japan's standard checklist, comprehensively covering easily forgotten items (timezone, browser support, i18n, WCAG, etc.).
The "don't let it stop" discussion only starts once numbers are presented. In words alone, agreement is never reached.
NFR-design pitfalls and forbidden moves
Typical accident patterns in NFR design. All of them produce systems that run but cannot be used.
| Forbidden move | Why it’s bad |
|---|---|
| Decide NFR later | The Knight Capital incident (phased deployment / auto-rollback missing, $440M loss in 45 min) |
| Vaguely agree availability “as high as possible” | Without numbers, design and quote impossible |
| Apply “somehow 99.99%” to all systems | Several-times cost difference 99.9% vs 99.99%, over-investment |
| Define response time by average | Slow 1% of users invisible, measure with P95 / P99 |
| Promise 99.99% without ops regime | Can’t keep without 24/7 dedicated SRE |
| Add disaster countermeasures (DR) at the end | After-fitting costs 10x, design from start |
| Release with undefined browser-support scope | Pointed out “not supporting IE11” after completion, major revision |
| Forget timezone / character encoding | Fatal bugs in overseas deployment / emoji |
| Ignore WCAG (accessibility) | Risk of violating revised Act for Eliminating Discrimination against Persons with Disabilities, effective April 2024 |
| Don’t test-ize NFR | Just written in design doc, no one verifies, surfaces in production |
In the 2012 Knight Capital incident, the lethal blow was the complete absence of NFRs such as phased deployment (canary), auto-rollback, and monitoring (details in the appendix "Critical Incident Cases"). The case of a major EC site that released a new UI for the year-end shopping season, suffered 30-second delays and an SNS firestorm, and emergency-rolled back to the old UI four hours later likewise shows the cost of an undefined response-time NFR.
NFR is insurance preventing “running but unusable”. Define numerically from the start.
| Forbidden move | Why it's bad |
|---|---|
| "NFRs can be decided later" (postponing) | Later additions cost 10x; deciding first is the iron rule |
| "99.9% and 99.99% aren't much different" (being casual) | 43 min/month vs 4.3 min/month of downtime and a several-times build-cost gap; choose the level matching business requirements |
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Numerically-quantified NFR | Words like “fast” “safe” |
| Automated load tests | One-time manual tests |
| Automated security scans | Manual review only |
| SLO-based monitoring | Threshold-based |
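The "SLO-based monitoring" row can be made concrete with an error-budget burn rate, the mechanism that replaces static per-metric thresholds. A minimal sketch with illustrative numbers:

```python
# Error-budget burn rate: how fast the current error ratio consumes the
# SLO's error budget. 1.0 means exactly on budget; alerting on a high
# burn rate replaces fixed thresholds like "alert at 500 errors/min".

def burn_rate(error_ratio: float, slo_pct: float) -> float:
    error_budget = 1.0 - slo_pct / 100.0  # e.g. SLO 99.9% -> 0.1% budget
    return error_ratio / error_budget

# An SLO of 99.9% leaves a 0.1% error budget; a 0.5% error ratio
# burns that budget 5x faster than allowed.
print(round(burn_rate(0.005, 99.9), 2))
```

An alert on, say, burn rate > 10 over a short window fires only when the SLO itself is genuinely at risk, which is what distinguishes SLO-based monitoring from a raw threshold.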
- Decide numerically first — postponement is 10x cost, vagueness is fatal
- Comprehensive check via IPA grades — even easily-forgotten items (timezone, browser, etc.) without gaps
- Align with ops regime — can’t keep 99.99% without 24/7, levels matching capability
- Auto-test-ize in CI/CD — continuously verify performance / security of AI-generated code
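The last bullet, turning NFRs into automated tests, can be as simple as a pytest-style gate in CI. In this sketch `measure_p95_ms` is a hypothetical stand-in for a real load-test harness (for example, parsing k6 or Locust output); the budget value comes straight from the NFR sheet:

```python
# A performance NFR expressed as a CI test gate. The measurement
# function below is a placeholder; in a real pipeline it would run
# the load test and return the observed P95 response time.

P95_BUDGET_MS = 300.0  # from the NFR: "within 300ms at P95"

def measure_p95_ms() -> float:
    # Hypothetical stub standing in for an actual load-test run.
    return 240.0

def test_response_time_p95() -> None:
    measured = measure_p95_ms()
    assert measured <= P95_BUDGET_MS, (
        f"P95 {measured}ms exceeds the {P95_BUDGET_MS}ms NFR"
    )

test_response_time_p95()  # in CI, pytest would collect and run this
```

Once the NFR lives in the test suite rather than only in the design doc, every deploy re-verifies it, including AI-generated code.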
What to decide - what is your project’s answer?
For each of the following, try to articulate your project's answer in one or two sentences. Starting work while these remain vague invariably invites later questions like "why did we decide this again?"
- Availability target (99.X%, RTO, RPO)
- Performance target (response time, throughput)
- Scalability (growth scenarios)
- Operational requirements (monitoring, backup, on-call)
- Security level (authentication, encryption, audit)
- Migration plan (parallel operation, rollback)
- Comprehensive check (IPA non-functional-requirement grade, etc.)
Author’s note - cases of “no NFR” producing firestorms
Stories of postponed NFRs ending in firestorms are told again and again in the SI industry.
The 2012 Knight Capital incident is symbolic of what underestimating NFRs (especially deployment safety) leads to: the subsequent investigation concluded that the complete absence of NFRs such as phased deployment (canary), auto-rollback, and monitoring was the lethal blow (details in the appendix "Critical Incident Cases").
Another often-cited case is the outage on Amazon's first Prime Day. Sloppy estimation of the performance NFR for peak traffic left checkout down for hours, with billion-dollar-class opportunity loss estimated. Since then, Amazon has updated its performance requirements quarterly based on actuals and built in mechanisms that verify them automatically with chaos engineering.
Domestically too, the story is still told of a major EC site that released a new UI for the year-end shopping season: with no response-time NFR defined, responses slipped past 30 seconds at peak, SNS caught fire, and the old UI was emergency-restored four hours later. "Even with complete features, undefined non-functionals make the system unusable": this reality has been hammered home repeatedly in the history of NFR underestimation.
Summary
This article covered non-functional requirements design, including availability, performance, operations, security, IPA grades, SLA/SLO relationship, and AI-era auto-test-ization.
Decide the numbers first, check comprehensively with the IPA grades, align with the ops regime, and turn the NFRs into automated tests. That is the practical answer for NFR design in 2026.
Next time we’ll cover “estimation and ROI.” Plan to dig into the practice of 3-point estimation, buffers, 3-year ROI, break-even points, and how to build numbers to pass approvals.
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (77/89)