About this article
This article is the entry point of the “System Architecture” category in the Architecture Crash Course for the Generative-AI Era series. It surveys the whole skeleton of a system: hardware, software, and network together. In construction terms this is the foundation and the structural framing, and of all the architectural layers it is the hardest to redo.
This article covers what gets decided here, why it must be decided first, and the selection criteria for AI-driven development.
What is system architecture in the first place
Imagine a house’s foundation work. If you say “actually, I want a basement” after the foundation is poured, redoing the foundation costs almost as much as rebuilding the house. You can change the floor plan and wallpaper later, but the foundation and structural framing must be decided correctly from the start.
System architecture is the domain that decides the overall skeleton encompassing hardware, software, and network: cloud or your own data center, OS, DB, container platform, network design. Of all architectural layers, this is the hardest to redo.
If you neglect system architecture, you end up in situations such as switching cloud vendors mid-project, which is effectively a rebuild, and the design decisions made on top of the old skeleton lose their value.
The first decision, and the hardest one to undo
As with a building’s foundation and structural framing, changing the system skeleton later costs essentially as much as rebuilding the house. That is the decisive difference from other layers.
Many teams use “infrastructure architecture” more or less synonymously, and some don’t distinguish the two at all. The naming doesn’t matter. What matters is the framing: “the layer that handles the skeleton as a whole.”
Other architectural layers — software, data, security, etc. — are interior fittings built on top of this skeleton, fully constrained by it. Of every architectural domain, this one has the highest difficulty; the other domains are essentially refinements within the constraints set here.
Why decide it first
This domain is a concentration of One-way Doors (Bezos’s term for “decisions hard to reverse”). Downstream judgments are bound by what’s settled here. Changing the skeleton later has consequences like these:
- Switching cloud vendor mid-project is essentially “rebuild from scratch.”
- Switching between on-prem and cloud later is also major construction work.
- OS and DB product changes propagate widely.
Even on small projects, deciding the broad strokes up front is the rule. “We’ll think about that later” becomes more fatal the smaller the project: by the time code and operations have grown into the skeleton, peeling them apart becomes the bulk of the work.
The mindset of “it’s an MVP, so just wing it” turns out to be the most expensive choice in retrospect. Sketch out the system architecture in week 1 of the project.
What you must decide — what’s your project’s answer?
For each of the following, articulate your project’s answer in 1-2 sentences. Leaving them ambiguous now will always come back as “why did we decide that?” later.
The decisions in this domain split into three groups:
flowchart TB
A[1. Foundation<br/>App form / Deploy / Cloud / Runtime / OS]
B[2. Where data lives<br/>DB / Storage / Network]
C[3. Operational machinery<br/>Monitoring / BCP / Cost / Security]
A -->|Upstream decisions<br/>bind downstream| B
B --> C
classDef base fill:#dbeafe,stroke:#2563eb,stroke-width:2px;
classDef data fill:#fef3c7,stroke:#d97706;
classDef ops fill:#fae8ff,stroke:#a21caf;
class A base;
class B data;
class C ops;
Foundation selection
| Item | Examples |
|---|---|
| Application form | Native app / Web app / Hybrid app |
| Deployment model | On-prem / Cloud / Hybrid |
| Cloud vendor | AWS / GCP / Azure |
| Runtime | VM / Container / Serverless |
| OS | Linux / Windows / UNIX |
| Data persistence | RDBMS / NoSQL / Filesystem |
This layer is covered in articles 01-06 below. It is the substrate of all later decisions: get it wrong and everything downstream goes off the rails.
Network, security, operations
| Item | Examples |
|---|---|
| DB vendor | Oracle / PostgreSQL / DynamoDB |
| Batch processing | Long-running / Scheduled / Event-driven |
| Network | IP range design, subnet partitioning |
| Communication protocols | HTTPS / gRPC / WebSocket |
| Security foundation | WAF (Web Application Firewall — defends against web attacks), IDS/IPS (Intrusion Detection / Prevention), Zero Trust (re-authenticating every request, no implicit internal trust) |
| Monitoring / alerting | CloudWatch / Datadog / PagerDuty |
Network, security, and monitoring are extremely difficult to bolt on after the fact; they need to be built into the design. They are covered in articles 07-09 in this category.
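To make “IP range design, subnet partitioning” concrete, here is a minimal sketch using Python's standard `ipaddress` module. The `10.0.0.0/16` range, the zone names, and the one-public-plus-one-private layout per zone are all assumptions for illustration, not a recommended allocation.

```python
import ipaddress

# Hypothetical VPC range; replace with whatever your organization allocates.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the /16 into /20 blocks (16 blocks of 4096 addresses each).
blocks = list(vpc.subnets(new_prefix=20))

# Assumed layout: one public and one private subnet per availability zone.
zones = ["az-a", "az-b", "az-c"]
plan = {}
for i, zone in enumerate(zones):
    plan[f"public-{zone}"] = blocks[i]
    plan[f"private-{zone}"] = blocks[len(zones) + i]


def overlaps(networks):
    """Return True if any two networks in the iterable overlap."""
    nets = list(networks)
    return any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])


for name, net in plan.items():
    print(f"{name:15} {net}  ({net.num_addresses} addresses)")
```

Working the plan out on paper (or in a script like this) before the first resource is created leaves unused blocks for later growth, which is much cheaper than renumbering a live network.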
BCP, cost, ops automation
| Item | Examples |
|---|---|
| Storage / backup | S3 / Blob / Archive strategy |
| External connectivity | Internet GW / NAT / VPN / Dedicated lines |
| BCP | Multi-AZ (multiple data centers within the same region) / Multi-region (geographically separated) / DR site (Disaster Recovery — standby for disasters) |
| CI/CD platform | GitHub Actions / GitLab CI / CodePipeline |
| IaC (Infrastructure as Code — manage infra config as code) | Terraform / CloudFormation / Pulumi |
Business continuity and cost management appear in articles 10-11. CI/CD, IaC, and configuration management — the development-process layer — are consolidated in the separate “DevOps Architecture” category.
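The gain from Multi-AZ or multi-region redundancy can be estimated with simple arithmetic. The sketch below assumes that each copy fails independently and that one healthy copy is enough to serve traffic; real zones share some failure modes, so treat the result as an optimistic upper bound. The function name `combined_availability` is illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when one copy suffices.

    `single` is the availability of one copy as a fraction (0.99 for 99%).
    Assumes failures are independent, which is optimistic in practice.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two copies at 99% each come out at about 99.99%, which is why adding a second zone is usually the first redundancy step.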
How to proceed
System architecture must be decided strictly upstream-first. Deciding downstream and back-propagating is a recipe for accidents and guaranteed rework.
1. Decide application form <- Top of the chain (01_application-types)
|
2. Decide deployment model <- On-prem / Cloud (02_deployment-model)
|
3. Decide cloud vendor <- AWS / Azure / GCP (03_cloud-vendor)
|
4. Decide runtime <- VM / Container / FaaS (04_runtime)
|
5. Detail design of DB, network, security, ...
Following this order means each downstream decision automatically inherits the assumptions it needs. Going in reverse (say, “pick the DB, then pick the cloud”) forces awkward workarounds and artificially narrows your options.
Treat this order as having no exceptions.
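The chain above can be written down as a dependency graph and checked mechanically. This is a minimal sketch using `graphlib.TopologicalSorter` from the standard library (Python 3.9+). The first four nodes follow the steps above; splitting step 5 into separate `database`, `network`, and `security` nodes, and the names `DEPENDS_ON`, `decision_order`, and `premature`, are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each decision maps to the upstream decisions it directly depends on.
DEPENDS_ON = {
    "application_form": set(),
    "deployment_model": {"application_form"},
    "cloud_vendor": {"deployment_model"},
    "runtime": {"cloud_vendor"},
    # Assumed breakdown of step 5 (detail design).
    "database": {"runtime"},
    "network": {"runtime"},
    "security": {"runtime"},
}


def decision_order():
    """Return one valid upstream-first order for all decisions."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def premature(decided):
    """Map each already-made decision to the direct upstream decisions it skipped."""
    decided = set(decided)
    return {
        name: sorted(DEPENDS_ON[name] - decided)
        for name in sorted(decided)
        if DEPENDS_ON[name] - decided
    }


print(decision_order())
# Picking the DB before the runtime is settled gets flagged:
print(premature({"application_form", "database"}))
```

A check like this is mostly useful as a review aid: it will not make the decisions, but it makes a skipped upstream step visible before the downstream choice hardens.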
System configurations by scale and phase
Optimal system architecture changes with org scale. The conclusion up front: from individuals to mid-sized SaaS, single cloud + managed services + IaC is the default. As of 2026, hybrid and multi-cloud should be limited to large enterprises.
| Phase | Monthly infra cost | Recommended config | Availability target | Dedicated infra engineers |
|---|---|---|---|---|
| MVP / individual | up to $30 | Single cloud, single region, managed | 99% | 0 |
| Early startup | $300-3k | Single cloud, multi-AZ, IaC mandatory | 99.9% | 0.5 |
| Mid-sized SaaS | $3k-30k | Single cloud + 2 regions for DR | 99.95% | 1-3 |
| Enterprise | $30k+ | Hybrid + dedicated lines + AWS Organizations | 99.99% | 5+ |
| Finance / healthcare / public | Industry-dependent | Private cloud, compliance certifications | Industry-required | 10+ |
The practical floor for multi-cloud or hybrid is 3+ dedicated infra engineers. Below that, operations work alone burns out the team.
The courage to lean on a single cloud is what reconciles operations with AI productivity. Startups copying enterprise topologies is a textbook failure pattern: they exhaust themselves within months and end up rolling back. An architecture that does not fit your scale is pure debt.
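The percentage targets in the table above are easier to reason about as a yearly downtime budget. The following sketch does the conversion, assuming a 365-day year and counting all downtime against the budget; the names `TARGETS` and `allowed_downtime_hours` are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Percentage targets taken from the table above.
TARGETS = {
    "MVP / individual": 99.0,
    "Early startup": 99.9,
    "Mid-sized SaaS": 99.95,
    "Enterprise": 99.99,
}


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a system may be down while still meeting the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


for phase, target in TARGETS.items():
    hours = allowed_downtime_hours(target)
    print(f"{phase:18} {target:6.2f}%  {hours:7.2f} h/year")
```

The budget shrinks from roughly 87.6 hours a year at 99% to under one hour at 99.99%, which is why each extra nine tends to demand a different topology and a larger team.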
Architecture-level traps
The forbidden moves at the system-wide design level:
| Forbidden move | Why |
|---|---|
| Decide downstream and back-propagate | Breaking the app-form -> cloud -> runtime order forces rework |
| Adopt multi-cloud without enough people | Duplicated IAM, monitoring, and IaC roughly double operations cost |
| No IaC, manual GUI setup | Environment drift, not reproducible, becomes debt in the AI era |
| BCP doc, no drills | Like GitLab 2017 — backups silently broken |
| Add security later | Auth, encryption, audit logs must be designed in |
| Cloud usage with no FinOps | Six-figure monthly bills shake the business |
| DB / Redis in a public subnet | The MongoDB ransomware pattern of 2017 |
| Single AZ in production | One AZ outage kills everything; multi-AZ is the floor |
| Adopting the latest tech stack unconditionally | Little public information exists yet, so AI generation accuracy is low. Battle-tested standards are safer |
| Copying big-company architectures verbatim | Scale assumptions differ; over-engineering makes it unmaintainable |
| Going all-managed for everything | Lock-in and fixed costs pile up. Pick the right tool for the right job |
| Assuming your own data center is safer | Patching, monitoring, BCP, and staffing in-house actually raises risk |
FinOps, mentioned above, is the practice of continuous cloud-cost optimization. It is covered in detail in the “Cost Management” article in this category.
System architecture is the least reversible domain. Get four things right (fit to scale, upstream-first ordering, IaC, and security as standard equipment) or pay forever.
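Several of the forbidden moves in the table can be checked mechanically once the plan is written down. The sketch below is one possible shape for such a check, not an established tool: the `SystemPlan` fields and the `find_traps` function are hypothetical, and the threshold of 3 engineers comes from the staffing floor stated earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class SystemPlan:
    """Hypothetical summary of an architecture; field names are illustrative."""
    cloud_count: int
    infra_engineers: float
    uses_iac: bool
    production_az_count: int
    datastore_in_public_subnet: bool
    bcp_drill_done: bool


def find_traps(plan: SystemPlan) -> list[str]:
    """Return the forbidden moves from the table that this plan commits."""
    findings = []
    if plan.cloud_count > 1 and plan.infra_engineers < 3:
        findings.append("multi-cloud with fewer than 3 dedicated infra engineers")
    if not plan.uses_iac:
        findings.append("no IaC: environments are set up by hand")
    if plan.production_az_count < 2:
        findings.append("production runs in a single AZ")
    if plan.datastore_in_public_subnet:
        findings.append("DB / Redis sits in a public subnet")
    if not plan.bcp_drill_done:
        findings.append("BCP exists on paper but has never been drilled")
    return findings


startup = SystemPlan(
    cloud_count=2, infra_engineers=0.5, uses_iac=False,
    production_az_count=1, datastore_in_public_subnet=False, bcp_drill_done=True,
)
for finding in find_traps(startup):
    print("-", finding)
```

Traps that depend on judgment, such as copying a big-company architecture, do not reduce to a boolean and still need a human review.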
AI decision axes
| AI-era favorable | AI-era unfavorable |
|---|---|
| Public cloud + IaC (Terraform) | On-prem, GUI-dependent operations |
| Containers + standards (k8s, OCI) | Vendor-proprietary runtimes |
| OIDC (OpenID Connect), SSO | Custom auth, long-lived keys |
| OpenTelemetry (vendor-neutral observability) | Vendor-proprietary SDKs |
- Decide upstream first (app form → deploy → vendor → runtime)
- IaC-ability as a mandatory check on every selection
- Lean on a single cloud (multi-cloud only with a clear reason)
- Standard-protocol compliance (OIDC, OpenTelemetry, etc.) as a non-negotiable AI-era requirement
Judgments that “feel intuitively right” deserve the most scrutiny. Make a habit of cross-checking each one against your scale and assumptions; within a year it will move you up a tier as a system architect.
What you must decide — priority-ordered checklist (recap + additions)
These are the decisions from the section above, reordered by priority for project kickoff. Leaving them ambiguous now always comes back as “why did we decide that?” later.
- Application form (Native / Web / Hybrid)
- Deployment model (Cloud / On-prem / Hybrid)
- Cloud vendor (AWS / GCP / Azure, or multi)
- Runtime (VM / Container / Serverless)
- Datastore (RDB / NoSQL / Mixed)
- Network design (VPC / Subnets / Connectivity)
- Monitoring (Tooling and alerting policy)
- BCP (RTO / RPO / Redundancy level)
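One lightweight way to keep this checklist honest is to store the answers as data and list what is still open. The sketch below assumes answers are kept as short free-text strings keyed by checklist item; the `CHECKLIST` constant, the `unanswered` function, and the sample answers are illustrative only.

```python
# Checklist items in the priority order given above.
CHECKLIST = [
    "Application form",
    "Deployment model",
    "Cloud vendor",
    "Runtime",
    "Datastore",
    "Network design",
    "Monitoring",
    "BCP",
]


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return checklist items with no answer yet, highest priority first."""
    return [item for item in CHECKLIST if not answers.get(item, "").strip()]


# Hypothetical project state: sample answers, not recommendations.
answers = {
    "Application form": "Web app; no native client planned.",
    "Deployment model": "Cloud only.",
    "Cloud vendor": "",
}

print("Still open:", unanswered(answers))
```

Because the list is ordered, the first open item is also the next decision to make under the upstream-first rule.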
Summary
This article served as the entry point of the “System Architecture” category, surveying the domain.
System architecture is the least reversible domain. Locking down scale, upstream-first ordering, IaC, and security as standard equipment at the start determines your operations cost and AI-era readiness for the next 5-10 years.
The next article begins the deep dive: how to choose the application form (Native / Web / Hybrid).
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (5/89)