About this article
This article is the seventh deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering network design.
Network design fixes the physical and logical skeleton of communication between servers and to the outside world; once a CIDR (IP-address range) is decided, it is effectively unchangeable. The article covers CIDR design, the 3-tier subnet structure, multi-AZ, and external connectivity, and gives criteria for “draw it wide, with margin, aligned to internal standards.”
What is network design in the first place
Network design is, roughly speaking, “the work of designing communication paths between servers and to the Internet.”
Imagine the plumbing in an apartment building. The pipe diameter and branching must be decided before construction. If you decide “actually, I want independent pipes per room” after move-in, it’s a major tear-down-the-walls operation. Cloud network design is the same — once IP-address ranges and subnet partitioning are decided, changing them requires rebuilding every resource.
Why network design matters
What happens if you move forward with a tentative network design? A fresh VPC gets a casual 10.0.0.0/16; three months later, another department discovers they have been using the same range on-prem for years, and the VPC has to be rebuilt from scratch. This is a commonly reported failure pattern in the industry.
Network design is a project-week-1 decision. Tentative-and-defer here collapses everything once serious construction starts. This is the layer where “get it running and then think about it” is most catastrophic.
A floor plan you can’t redraw
If IP ranges overlap with another system at integration time, you can’t connect, and fixing it later means rebuilding every resource.
The network is “a floor plan you can’t redraw.” The first design is everything.
Basic terms
| Term | Meaning |
|---|---|
| VPC | Virtual Private Cloud — virtual private network for your org |
| Subnet | A small IP range carved out of a VPC |
| CIDR | IP-address range notation like 10.0.0.0/16 |
| Route table | Defines next-hop routing |
| IGW | Internet Gateway — connection point to the Internet |
| NAT GW | NAT Gateway, which relays outbound traffic from private subnets (no direct inbound) |
| Security Group | Resource-level firewall |
You combine these to draw “areas that can talk outside” vs “areas that cannot.”
CIDR design principles
CIDR (Classless Inter-Domain Routing) is the IP-range notation. Numbers like /16 and /24 indicate “the bit length of the network portion.” Smaller number = wider range.
The default for the entire VPC is /16 (65,536 addresses), taken wide. Subdivide into /24 (256 addresses) per subnet.
VPC 10.0.0.0/16 <- 65,536 addresses (whole)
+ Public-a 10.0.0.0/24 <- 256 addresses
+ Public-c 10.0.1.0/24
+ Private-a 10.0.10.0/24
+ Private-c 10.0.11.0/24
+ DB-a 10.0.20.0/24
+ DB-c 10.0.21.0/24
CIDR overlap with other companies, sites, or internal systems is forbidden. VPN and VPC peering connections fail when IPs overlap. If your org has a standard for IP ranges, follow it; using 10.0.0.0/16 based on personal preference is a landmine.
CIDR: take wide (/16), confirm no conflict with internal standards before locking it.
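As a rough sketch of what “take it wide and code it” can look like, here is the /16 VPC and one of the /24 subnets from the plan above in Terraform. The region, resource names, and tags are placeholder assumptions; the CIDR itself must come from your internal standards, not from this example.

```hcl
# Whole VPC: /16 taken wide, per the plan above.
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16" # confirm against internal CIDR standards first
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "main-vpc" }
}

# One /24 per tier per AZ; only Public-a is shown here.
resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.0.0/24"
  availability_zone       = "ap-northeast-1a"
  map_public_ip_on_launch = true

  tags = { Name = "public-a" }
}
```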
The 3-tier subnet structure
Subnets divide a VPC into purpose-specific compartments. Public / Private / Isolated as a 3-tier structure is the modern standard.
```mermaid
flowchart TB
NET([Internet]) --> PUB
subgraph VPC["VPC"]
subgraph PUB["Public Subnet"]
ALB[ALB / NAT GW / Bastion]
end
subgraph PRIV["Private Subnet"]
APP[App servers / Workers]
end
subgraph ISO["Isolated Subnet"]
DB[(DB / Internal batch)]
end
ALB --> APP
APP --> DB
APP -.NAT only.-> NET
end
classDef pub fill:#fee2e2,stroke:#dc2626;
classDef priv fill:#fef3c7,stroke:#d97706;
classDef iso fill:#dbeafe,stroke:#2563eb;
class PUB,ALB pub;
class PRIV,APP priv;
class ISO,DB iso;
```
| Tier | Use | Internet |
|---|---|---|
| Public Subnet | ALB / NAT GW / Bastion | Direct connection allowed |
| Private Subnet | App servers, workers | Outbound only via NAT |
| Isolated Subnet | DB, internal batch | No external connectivity |
The 3-tier structure preserves the basic rule of “DB unreachable from the Internet.” Exposing a DB directly to the Internet is forbidden by every security baseline; violating it is a common root cause of major data breaches.
DB lives in the Isolated subnet. Architecturally unreachable from outside.
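Continuing the same sketch, the Private and Isolated tiers are simply additional subnets; what makes the DB tier isolated is that its route table (shown in the routing section below) has no path to the Internet. Names, CIDRs, and the AZ are assumptions carried over from the plan above.

```hcl
# Private tier: app servers / workers. Outbound only, via NAT (see the routing section).
resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "ap-northeast-1a"

  tags = { Name = "private-a" }
}

# Isolated tier: DB / internal batch. Its route table carries only the local
# VPC route, so it is architecturally unreachable from the Internet.
resource "aws_subnet" "db_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.20.0/24"
  availability_zone = "ap-northeast-1a"

  tags = { Name = "db-a" }
}
```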
Multi-AZ configuration
An AZ (Availability Zone) is a physically separated data center within the same region. Power and network failures can hit one AZ at a time, so placing the same configuration across multiple AZs is multi-AZ design.
Region: ap-northeast-1 (Tokyo)
+ AZ-a: Public-a / Private-a / DB-a
+ AZ-c: Public-c / Private-c / DB-c
ALB (Application Load Balancer, AWS’s app-layer LB) automatically distributes traffic across AZs; RDS Multi-AZ places primary and standby in different AZs and fails over automatically in anywhere from tens of seconds to a couple of minutes. Production: multi-AZ as the rule. Cutting corners here means a single AZ outage drops the service entirely.
Production is multi-AZ, no exceptions. Single-AZ is dev/staging only.
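A minimal sketch of the RDS side of multi-AZ, assuming a second DB subnet (db_c) exists in AZ-c alongside db_a from the earlier snippet; the engine, instance size, and credential handling are placeholders, not recommendations.

```hcl
# DB subnets must span at least two AZs for Multi-AZ RDS.
resource "aws_db_subnet_group" "db" {
  name       = "db-subnets"
  subnet_ids = [aws_subnet.db_a.id, aws_subnet.db_c.id]
}

resource "aws_db_instance" "main" {
  identifier           = "app-db"
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 50
  db_subnet_group_name = aws_db_subnet_group.db.name
  multi_az             = true # standby in a different AZ, automatic failover
  username             = "app"
  password             = var.db_password # placeholder; prefer Secrets Manager in practice
  skip_final_snapshot  = true
}
```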
Routing
Route tables define “for this destination, send via this next hop” for resources in a subnet. Different route tables per subnet produce public vs private behavior.
| Source subnet | Destination | Next hop |
|---|---|---|
| Public | 0.0.0.0/0 (Internet) | IGW |
| Private | 0.0.0.0/0 | NAT GW |
| Isolated | VPC only | local |
Public -> IGW, Private -> NAT GW is the default. NAT GW is for “private subnets that need outbound but no inbound,” but with hourly + data-transfer billing it’s surprisingly costly — use only when needed.
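The route-table split above is where public / private / isolated actually gets enforced. A sketch under the same assumptions as the earlier snippets (one AZ shown; resource names are placeholders):

```hcl
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

# A NAT GW lives in a public subnet and needs an Elastic IP.
resource "aws_eip" "nat_a" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat_a" {
  allocation_id = aws_eip.nat_a.id
  subnet_id     = aws_subnet.public_a.id
}

# Public: default route to the IGW.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

# Private: default route to the NAT GW (outbound only).
resource "aws_route_table" "private_a" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_a.id
  }
}

# Isolated: no route block at all, so only the implicit local VPC route remains.
resource "aws_route_table" "isolated" {
  vpc_id = aws_vpc.main.id
}

# Each subnet is then associated with its tier's route table (Public-a shown).
resource "aws_route_table_association" "public_a" {
  subnet_id      = aws_subnet.public_a.id
  route_table_id = aws_route_table.public.id
}
```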
Security Groups vs NACLs
The two firewall types are Security Groups (SG) and Network ACLs (NACL). Different roles; use them appropriately.
| Aspect | Security Group | Network ACL |
|---|---|---|
| Applies to | Per resource (EC2, etc.) | Per subnet |
| Stateful | Yes (return traffic auto-allowed) | No |
| Deny rules | Not possible (allow only) | Possible |
| Main use | General firewall | Coarse subnet-wide control |
| Evaluation | All rules evaluated | In rule-number order, first match wins |
Default firewall control via SG. NACLs are for cases like “block this specific IP from the entire subnet.” New builds usually leave NACLs at default (allow all) and control via SG.
SG is enough by default. NACL is for special subnet-wide control.
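A sketch of the SG-only default, assuming an ALB security group (aws_security_group.alb) already exists; the port and names are placeholders.

```hcl
# App servers accept traffic only from the ALB's security group.
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "App port, from the ALB only"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # SGs are stateful, so return traffic needs no extra rules.
  # Terraform drops the console's default allow-all egress, so it is restored here.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```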
External connectivity
Multiple methods connect a cloud VPC to the outside (sites, other clouds, Internet). Pick by security, bandwidth, and cost.
| Method | Use | Trait |
|---|---|---|
| IGW + public IP | Internet-facing | Simplest, public |
| NAT GW | Private outbound | No direct inbound |
| VPN | Site <-> cloud | Internet-based, encrypted |
| Dedicated line (Direct Connect, etc.) | High reliability / bandwidth / low latency | $1k+/month, expensive |
| VPC Peering | VPC-to-VPC direct | Same / different region |
| Transit Gateway | Multi-VPC / multi-account hub | Large orgs |
Small/mid in-house integrations: VPN is enough. Finance and large enterprises run dedicated lines for closed-network connectivity that bypasses the Internet.
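For the simple VPC-to-VPC case, a peering sketch might look like the following. The peer VPC ID and its 10.1.0.0/16 CIDR are placeholder assumptions, and the connection only works because the CIDRs do not overlap.

```hcl
# Peering to another in-house VPC; shared_vpc_id and 10.1.0.0/16 are placeholders.
resource "aws_vpc_peering_connection" "to_shared" {
  vpc_id      = aws_vpc.main.id
  peer_vpc_id = var.shared_vpc_id
  auto_accept = true # only valid when both VPCs are in the same account and region
}

# Each side also needs a route toward the peer's (non-overlapping) CIDR.
resource "aws_route" "to_shared" {
  route_table_id            = aws_route_table.private_a.id
  destination_cidr_block    = "10.1.0.0/16"
  vpc_peering_connection_id = aws_vpc_peering_connection.to_shared.id
}
```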
PrivateLink (private connectivity)
PrivateLink connects to AWS / Azure / GCP services without going through the Internet. Services normally accessed publicly (S3, RDS, SSM, CloudWatch) can be reached over closed networks via VPC endpoints.
| Cloud | Feature name |
|---|---|
| AWS | VPC Endpoint / PrivateLink |
| Azure | Private Endpoint |
| GCP | Private Service Connect |
In tightly regulated industries (finance, healthcare, public sector), routing all major AWS services through VPC Endpoints is the rule.
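A sketch of the two endpoint flavors on AWS, using Tokyo-region service names to match this article’s examples; the subnets and security group referenced are the placeholders from earlier snippets.

```hcl
# Gateway endpoint: S3 traffic stays inside AWS, and the endpoint itself is free.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.ap-northeast-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private_a.id]
}

# Interface endpoint (PrivateLink): e.g. SSM, reachable without leaving the VPC.
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.ap-northeast-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id]
  security_group_ids  = [aws_security_group.app.id] # placeholder; a dedicated endpoint SG is cleaner
  private_dns_enabled = true
}
```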
DNS design
DNS (Domain Name System) maps domain names to IPs. Cloud uses both public DNS for external services and private DNS for internal name resolution.
| Type | Use |
|---|---|
| Public DNS | Service exposure (api.example.com, etc.) |
| Private DNS | Internal name resolution (Route 53 Private Hosted Zone) |
| Split-horizon | Same domain, different resolution inside vs outside |
For microservice-to-microservice traffic, “resolve by service name” (Service Discovery) matters. Hardcoding IPs is a landmine — connections drop on scale events. Internal DNS that absorbs IP changes is the rule.
Internal traffic uses DNS names, not IPs. Resilient to service additions and removals.
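On AWS this often means a Route 53 private hosted zone; a minimal sketch, with the zone name and record as placeholder assumptions.

```hcl
# Private hosted zone, resolvable only from inside the VPC.
resource "aws_route53_zone" "internal" {
  name = "internal.example.com"

  vpc {
    vpc_id = aws_vpc.main.id
  }
}

# Apps connect to db.internal.example.com instead of a hardcoded IP.
resource "aws_route53_record" "db" {
  zone_id = aws_route53_zone.internal.zone_id
  name    = "db.internal.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [aws_db_instance.main.address]
}
```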
Bandwidth and latency
Network design considers bandwidth (throughput) and latency (delay). In cloud, by communication distance:
| Range | Latency target | Cost |
|---|---|---|
| Same AZ | < 1ms | Free (often) |
| Same region, different AZ | 1-2ms | Low |
| Same country, different region | 10-30ms | Medium |
| Cross-country (e.g. Japan-US) | > 100ms | High |
DB and app servers belong in the same AZ as a rule, since cross-AZ traffic adds latency and cost. Cross-region transfer is almost always billed; if it is missing from the cost estimate, it turns into a surprise on the bill.
Network configurations by scale
“Wide and from day one” is the rule, but realistic configurations differ by phase.
| Phase | VPC count | CIDR policy | External | Typical cost |
|---|---|---|---|---|
| Personal / MVP | 1 | /16 (10.0.0.0/16) | IGW + single NAT GW | up to $30/mo |
| Startup | 1-2 | /16 (consider per-use VPCs) | IGW + multi-AZ NAT GW | up to $300/mo |
| Mid-sized SaaS | 3-10 | Per environment (prod/stg/dev separation) | Transit Gateway + VPN | $300-3k/mo |
| Enterprise | 50 to hundreds | Hierarchical: BU / env / country | Direct Connect + TGW + VPC Lattice | $30k+/mo |
NAT Gateway is hourly + transfer billing, an invisible cost source at ~$30-60/month per gateway plus data transfer. Multi-AZ with one per AZ is the production rule, but applying that to dev environments is wasteful — varying NAT structure per environment is the practical play.
One NAT GW per AZ. Single-AZ NAT means an AZ outage kills outbound from other AZs too.
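One way to vary the NAT structure per environment is to drive the gateway count from a variable. A standalone sketch that assumes public subnets are also defined with count; names and AZs are placeholders.

```hcl
variable "environment" {
  type    = string
  default = "dev"
}

locals {
  azs = ["ap-northeast-1a", "ap-northeast-1c"]
  # One NAT GW per AZ in production, a single shared one everywhere else.
  nat_count = var.environment == "prod" ? length(local.azs) : 1
}

resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  count         = local.nat_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id # assumes public subnets defined with count
}
```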
Network-design traps
| Forbidden move | Why |
|---|---|
| Start with a /24 CIDR (256 addresses) | Subnet partitioning exhausts it immediately; take /16 wide |
| Create a VPC without checking internal CIDR standards | IP overlap surfaces at VPN / Direct Connect time; the VPC must be rebuilt |
| Open admin ports to 0.0.0.0/0 in an SG | Bots find them within hours |
| DB in a Public subnet | Forbidden by every security baseline; Isolated mandatory |
| NAT GW in only one AZ | That AZ outage stops outbound from other AZs |
| Hardcoding IPs in internal traffic | Connections drop on scale; Service Discovery / DNS required |
| Routing everything via NAT instead of PrivateLink / VPC Endpoint | Data transfer costs spike; S3 / DynamoDB use Gateway Endpoints (free) |
| Writing SG egress rules to allow return traffic | SG is stateful; return traffic for allowed connections is auto-allowed. Per-direction, stateless control is NACL’s job |
CIDR design is effectively unchangeable once decided, so check the network-management team’s standards and the existing-system CIDR list before designing. Cases where a team started with /24 thinking “we can widen later” and ended up rebuilding the entire VPC still happen.
CIDR and SG are decided once at the start. The largest area for retroactive cost.
AI decision axes
With AI-driven development as the assumption, “can it be coded in Terraform / CloudFormation” is the absolute requirement.
| AI-era favorable | AI-era unfavorable |
|---|---|
| Terraform / CDK (AWS Cloud Development Kit) for full network definition | Manual setup via management console |
| Auto-generating network diagrams from code | Hand-drawn Visio for config management |
| Policy as Code (defining policies in code with automated checks) for change guardrails (see the sketch below the table) | Oral-tradition change rules |
| Transit Gateway etc. declared too | Manual VPC-to-VPC connections |
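Policy as Code can start as small as a plan-time guardrail. A sketch using Terraform’s built-in variable validation (dedicated tools such as OPA/Conftest or Sentinel go further); the /16 threshold mirrors this article’s rule and is otherwise an assumption.

```hcl
# Guardrail example: reject VPC CIDRs narrower than /16 at plan time.
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"

  validation {
    condition     = tonumber(split("/", var.vpc_cidr)[1]) <= 16
    error_message = "VPC CIDR must be /16 or wider; narrow CIDRs exhaust subnets quickly."
  }
}
```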
- Check internal CIDR standards and existing systems’ IP ranges first (no overlap).
- Adopt the 3-tier subnet structure (Public/Private/Isolated) as default.
- Multi-AZ as a production requirement (NAT GW per AZ).
- Code the entire configuration (Terraform / CDK) so AI can generate PRs.
“VPN won’t connect because of CIDR overlap” (industry case)
A junior engineer assigns 10.0.0.0/16 to a VPC on an enterprise cloud-migration project; another department has been using the same range on-prem for years. When it comes time to set up Direct Connect / VPN, the overlap surfaces, the VPC is rebuilt with a different range, and every resource (EC2 / RDS / ALB) is recreated. This pattern is reported repeatedly in the field.
I’ve heard infra-engineer friends say things like, “I was told to redo, by next month, the VPC I spent three months building.”
The lesson: “CIDR sits on the same map as your physical infrastructure.” These are numbers you cannot decide alone by looking only at your own VPC. The habit of checking with the network-management team and the internal standards documents before designing prevents months of lost work.
The moment an IP range enters the connected world, it is no longer yours alone to decide.
What you must decide — what’s your project’s answer?
Articulate your project’s answer in 1-2 sentences for each:
- VPC CIDR (no overlap with other systems)
- AZ count and placement
- Subnet types and per-subnet CIDRs
- NAT Gateway necessity and placement
- External connectivity (VPN / dedicated line / PrivateLink)
- Security Group design policy
- DNS / name-resolution policy
- Integration with internal systems
Summary
This article covered cloud network design — CIDR, 3-tier subnets, multi-AZ, external connectivity, SG/NACL.
New VPCs: “wide /16 + 3-tier subnets + multi-AZ + code-managed” is the default. Confirm CIDR doesn’t conflict with internal standards first, DB always in Isolated, NAT GW per AZ. Stay within these and major incidents don’t happen.
The next article covers security foundation (the system-architecture-stage map of security functions to build in).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (12/89)