System Architecture Overview — The Skeleton You Decide First

About this article

This article is the entry point of the “System Architecture” category in the Architecture Crash Course for the Generative-AI Era series. It surveys the entire skeleton (hardware, software, and network) as a whole. In construction terms this is the foundation and the structural framing, and of all architectural layers it is the hardest to redo.

This article surveys what gets decided here, why it must be decided first, and the AI-driven-development selection criteria.

A full list of all articles in this category, with summaries and learning points, is available at the following page.

System Architecture Article Index: en.senkohome.com/arch-intro-index-system/

What is system architecture in the first place

Imagine a house’s foundation work. If you say “actually, I want a basement” after the foundation is poured, redoing the foundation costs almost as much as rebuilding the house. You can change the floor plan and wallpaper later, but the foundation and structural framing must be decided correctly from the start.

System architecture is the domain that decides the overall skeleton encompassing hardware, software, and network. Cloud or your own data center, OS, DB, container platform, network design — of all architectural layers, this is the one where redo is hardest.

If you neglect system architecture, you end up in situations such as switching cloud vendors mid-project, which is effectively a rebuild, and every other design decision made so far is wasted.

The first decision, and the hardest one to undo

System architecture is the layer that decides the overall skeleton across hardware, software, and network. In building terms, it is the foundation and structural framing. Changing it later costs essentially the same as rebuilding the house, and that is the decisive difference from other layers.

Many sites use “infrastructure architecture” more or less synonymously, and some camps don’t distinguish them at all. The naming doesn’t matter. What matters is the framing: “the layer that handles the skeleton as a whole.”

Other architectural layers — software, data, security, etc. — are interior fittings built on top of this skeleton, fully constrained by it. Of every architectural domain, this one has the highest difficulty; the other domains are essentially refinements within the constraints set here.

Why decide it first

This domain is a concentration of One-way Doors (Bezos’s term for “decisions hard to reverse”). Downstream judgments are bound by what’s settled here. Changing the skeleton later means tasks like:

Switching cloud vendor mid-project is essentially “rebuild from scratch.”
Switching between on-prem and cloud later is also major construction work.
OS and DB product changes propagate widely.

Even on small projects, deciding the broad strokes up front is the rule. “We’ll think about that later” becomes more fatal the smaller the project, because by the time code and operations have grown into the skeleton, peeling them apart is the entire job.

The mindset of “MVP, so just wing it” turns into the most expensive choice in retrospect. System architecture should be sketched out in week 1 of the project; deferral is most fatal at smaller scale.

What you must decide — what’s your project’s answer?

For each of the following, articulate your project’s answer in 1-2 sentences. Leaving them ambiguous now will always come back as “why did we decide that?” later.

The decisions in this domain split into three groups:

Foundation selection

Item	Examples
Application form	Native app / Web app / Hybrid app
Deployment model	On-prem / Cloud / Hybrid
Cloud vendor	AWS / GCP / Azure
Runtime	VM / Container / Serverless
OS	Linux / Windows / UNIX
Data persistence	RDBMS / NoSQL / Filesystem

This layer is covered in articles 01-06 below. The substrate of all later decisions — get this wrong and everything downstream goes off the rails.

Network, security, operations

Item	Examples
DB vendor	Oracle / PostgreSQL / DynamoDB
Batch processing	Long-running / Scheduled / Event-driven
Network	IP range design, subnet partitioning
Communication protocols	HTTPS / gRPC / WebSocket
Security foundation	WAF, IDS/IPS, Zero Trust (re-authenticating every request, no implicit internal trust)
Monitoring / alerting	CloudWatch / Datadog / PagerDuty

Network, security, and monitoring are extremely difficult to bolt on after the fact; they need to be built into the design. Articles 07-09 in this category.
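
As an illustration of the “IP range design, subnet partitioning” item, here is a minimal sketch using Python’s standard ipaddress module. The 10.0.0.0/16 range, the /20 subnet size, the zone names, and the public/private layout are all assumptions made for the example, not recommendations from this article.

```python
import ipaddress

# Hypothetical VPC range; choose one that does not overlap networks you must reach.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the /16 into /20 blocks (16 blocks of 4096 addresses each).
blocks = list(vpc.subnets(new_prefix=20))

# Assumed layout: one public and one private subnet per availability zone.
zones = ["az-a", "az-b", "az-c"]
plan = {}
for i, zone in enumerate(zones):
    plan[f"public-{zone}"] = blocks[i]
    plan[f"private-{zone}"] = blocks[i + len(zones)]

for name, net in plan.items():
    print(f"{name:<14} {net}  ({net.num_addresses} addresses)")

# Sanity check: no two planned subnets overlap.
nets = list(plan.values())
assert not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```

Only six of the sixteen blocks are used in this sketch. Leaving the rest unallocated is one way to keep room for later growth without renumbering.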

BCP, cost, ops automation

Item	Examples
Storage / backup	S3 / Blob / Archive strategy
External connectivity	Internet GW / NAT / VPN / Dedicated lines
BCP	Multi-AZ (multiple data centers within the same region) / Multi-region (geographically separated) / DR site
CI/CD platform	GitHub Actions / GitLab CI / CodePipeline
IaC	Terraform / CloudFormation / Pulumi

Business continuity and cost management appear in articles 10-11. CI/CD, IaC, and configuration management — the development-process layer — are consolidated in the separate “DevOps Architecture” category.

Knowledge structure of this category

This category comprises 12 articles, but you don’t need to read them all at once. They divide into three groups, where decisions in the upstream group constrain the downstream ones.

Group 1 (Foundation selection) follows the order: app form → deployment → vendor → runtime → OS. Decisions upstream narrow choices downstream, so reversing this order guarantees rework. For example, choosing a DB before choosing a cloud vendor artificially narrows your options.

Group 2 (Data and network) designs where data lives, how it connects, how it’s protected, and how it’s monitored on top of the foundation. Network and security are extremely hard to bolt on later, so they should be started in parallel with foundation selection.

Group 3 (Continuity and cost) designs “how to recover when things break” and “how to manage expenses” once the topology is locked. Writing a BCP plan and never drilling is a textbook failure pattern.

Across the entire category, deciding upstream first is the iron rule of system architecture.
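
The upstream-first rule can be checked mechanically. The sketch below is a hypothetical helper, not part of any tool: it takes the Group 1 order from this article and reports every pair of decisions that were made in the reverse of that order.

```python
# Upstream-to-downstream order, taken from Group 1 above.
ORDER = ["app form", "deployment", "vendor", "runtime", "os"]

def order_violations(decided):
    """Return (earlier, later) pairs where a downstream item was decided first.

    `decided` lists decisions in the order they were actually made.
    Names not present in ORDER raise KeyError.
    """
    rank = {name: i for i, name in enumerate(ORDER)}
    violations = []
    for i, early in enumerate(decided):
        for late in decided[i + 1:]:
            if rank[early] > rank[late]:
                violations.append((early, late))
    return violations

# Example: the runtime was picked before deployment model and vendor.
history = ["app form", "runtime", "deployment", "vendor"]
print(order_violations(history))
# -> [('runtime', 'deployment'), ('runtime', 'vendor')]
```

Each reported pair marks a decision that may need to be revisited once its upstream counterpart is settled.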

System configurations by scale and phase

Optimal system architecture changes with org scale. The conclusion up front: from individuals to mid-sized SaaS, single cloud + managed services + IaC is the default. Hybrid and multi-cloud should be limited to large enterprises in 2026.

Phase	Monthly infra cost (USD)	Recommended config	Availability target (BCP)	Dedicated infra staff (FTE)
MVP / individual	up to $30	Single cloud, single region, managed	99%	0
Early startup	$300-3k	Single cloud, multi-AZ, IaC mandatory	99.9%	0.5
Mid-sized SaaS	$3k-30k	Single cloud + 2 regions for DR	99.95%	1-3
Enterprise	$30k+	Hybrid + dedicated lines + AWS Organizations	99.99%	5+
Finance / healthcare / public	Industry-dependent	Private cloud, compliance certifications	Industry-required	10+

The practical floor for multi-cloud or hybrid is 3+ dedicated infra engineers. Going below that just melts the team in operations.
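
To make the targets in the table concrete, the following sketch converts a percentage into an allowed-downtime budget. It assumes the BCP target column is an availability percentage and uses a 30-day month. Both are assumptions for illustration.

```python
def downtime_minutes(availability_percent, days=30):
    """Minutes of downtime allowed over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_minutes(target):.1f} min per 30 days")
```

Under these assumptions the budgets come out to roughly 432, 43, 22, and 4 minutes per month. Each step up the table leaves far less time to notice and recover from a failure, which is why the higher tiers come with more redundancy and more dedicated staff.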

The courage to lean on a single cloud is what reconciles operations with AI productivity. Startups copying enterprise topologies is a textbook failure pattern — they exhaust themselves in months and end up rolling back. An out-of-scale architecture is purely debt.

Architecture-level traps

The forbidden moves at the system-wide design level:

Forbidden move	Why
Decide downstream and back-propagate	Breaking the app form → cloud → runtime order forces rework
Adopt multi-cloud without enough people	IAM, monitoring, IaC duplication doubles operations cost
No IaC, manual GUI setup	Environment drift, not reproducible, becomes debt in the AI era
BCP doc, no drills	As in the 2017 GitLab incident: backups turn out to have been silently failing
Add security later	Auth, encryption, audit logs must be designed in
Cloud usage with no FinOps	Six-figure monthly bills shake the business
DB / Redis in a public subnet	The MongoDB ransomware pattern of 2017
Single AZ in production	One AZ outage kills everything; multi-AZ is the floor
Adopting the latest tech stack unconditionally	Little public information exists, so AI generation accuracy is low. Battle-tested standards are safer
Copying big-company architectures verbatim	Scale assumptions differ; over-engineering makes it unmaintainable
Going all-managed for everything	Lock-in and fixed costs pile up. Pick the right tool for the right job
Assuming your own data center is safer	Patching, monitoring, BCP, and staffing in-house actually raises risk

FinOps above is the practice of continuous cloud-cost optimization. Covered in detail in the “Cost Management” article in this category.

System architecture is the least reversible domain. Scale, upstream-first, IaC, security as standard equipment — get those four right or pay forever.

AI decision axes

AI-era favorable	AI-era unfavorable
Public cloud + IaC (Terraform)	On-prem, GUI-dependent operations
Containers + standards (k8s, OCI)	Vendor-proprietary runtimes
OIDC, SSO	Custom auth, long-lived keys
OpenTelemetry (vendor-neutral observability)	Vendor-proprietary SDKs

Decide upstream first (app form → deploy → vendor → runtime)
IaC-ability as a mandatory check on every selection
Lean on a single cloud (multi-cloud only with a clear reason)
Standard-protocol compliance (OIDC, OpenTelemetry, etc.) as a non-negotiable AI-era requirement

IaC-managed infrastructure is “readable code” for AI

Infrastructure defined in Terraform or CDK is treated as ordinary source code by AI. VPC configuration, security-group rules, and IAM policies all exist as text files, so AI can understand the topology and submit change proposals as PRs.

In contrast, infrastructure built through the management console GUI is invisible to AI. The configuration can only be read back through API calls, changes leave no reviewable diff, and the setup falls outside the scope of AI review and auto-remediation.
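
Once rules exist as text, even a few lines of code can review them. The sketch below assumes ingress rules have already been loaded into plain dictionaries. The dictionary shape, the rule names, and the port list are hypothetical and do not follow Terraform’s or any cloud vendor’s actual schema. It flags rules that expose a common database port to every address, the pattern listed in the traps table above.

```python
import ipaddress

# Common default ports: PostgreSQL, MySQL, Redis, MongoDB.
DB_PORTS = {5432, 3306, 6379, 27017}

def open_db_rules(rules):
    """Return ingress rules that expose a database port to any address."""
    flagged = []
    for rule in rules:
        net = ipaddress.ip_network(rule["cidr"])
        ports = set(range(rule["from_port"], rule["to_port"] + 1))
        # A prefix length of 0 means "every address" (0.0.0.0/0 or ::/0).
        if net.prefixlen == 0 and DB_PORTS & ports:
            flagged.append(rule)
    return flagged

# Hypothetical rules for illustration only.
rules = [
    {"name": "web-https", "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"name": "redis-anywhere", "cidr": "0.0.0.0/0", "from_port": 6379, "to_port": 6379},
    {"name": "pg-internal", "cidr": "10.0.48.0/20", "from_port": 5432, "to_port": 5432},
]
print([r["name"] for r in open_db_rules(rules)])  # -> ['redis-anywhere']
```

The same check has nothing to read when the rules live only in a console.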

Why a single cloud is structurally advantaged for AI adoption

The volume of public information on AWS is several times that on Azure and GCP, and AI training data follows suit. For instance, the large number of public samples of AWS IAM policies and CloudFormation templates helps AI generate accurate code. A multi-cloud configuration forces dual management of each vendor’s IaC, IAM, and network design, which doubles the context load on AI as well.

Multi-cloud is justified only when there are regulatory requirements, M&A constraints, or a specific service’s exclusive advantage (e.g., BigQuery).

The judgments that “feel intuitively right” deserve the most scrutiny. Make a habit of cross-checking every such judgment against your actual scale and assumptions; within a year that habit makes you a noticeably better system architect.

What you must decide — priority-ordered checklist (recap + additions)

Reordered from the section above by priority for project kickoff. Leaving them ambiguous now always comes back as “why did we decide that?” later.

Application form (Native / Web / Hybrid)
Deployment model (Cloud / On-prem / Hybrid)
Cloud vendor (AWS / GCP / Azure, or multi)
Runtime (VM / Container / Serverless)
Datastore (RDB / NoSQL / Mixed)
Network design (VPC / Subnets / Connectivity)
Monitoring (Tooling and alerting policy)
BCP (RTO / RPO / Redundancy level)
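
For the BCP item, RPO (recovery point objective) is the maximum tolerable data loss and RTO (recovery time objective) is the maximum tolerable time to restore service. A minimal sketch of checking a plan against both, with hypothetical numbers:

```python
from datetime import timedelta

def meets_rpo(backup_interval, rpo):
    """Worst-case data loss is the gap between two backups."""
    return backup_interval <= rpo

def meets_rto(measured_restore_time, rto):
    """Compare against a restore time measured in a drill, not an estimate."""
    return measured_restore_time <= rto

# Hypothetical targets and measurements for illustration.
rpo = timedelta(hours=1)
rto = timedelta(hours=4)

print(meets_rpo(timedelta(hours=24), rpo))   # nightly backups -> False
print(meets_rpo(timedelta(minutes=15), rpo)) # 15-minute snapshots -> True
print(meets_rto(timedelta(hours=6), rto))    # drill took 6 hours -> False
```

The second function only means something if the restore time comes from an actual drill, which is the reason an undrilled BCP document is listed among the traps above.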

Summary

This article served as the entry point of the “System Architecture” category, surveying the domain.

System architecture is the least reversible domain. Locking down scale, upstream-first ordering, IaC, and security as standard equipment at the start determines your operations cost and AI-era readiness for the next 5-10 years.

The next article begins the deep dive: how to choose the application form (Native / Web / Hybrid).


I hope you’ll read the next article as well.