About this article
This article is the tenth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering BCP (Business Continuity Plan).
Designs and procedures so the service keeps running, or recovers fast, through earthquakes, power outages, cloud incidents, cyber attacks, and human error: setting RPO/RTO, the availability ladder, the four DR-strategy patterns, 3-2-1 backups, ransomware mitigation, and the reality that redundancy without drills is worthless.
“It doesn’t apply to us” doesn’t hold up
Major outages happen somewhere every few years: the 2011 Tohoku earthquake, the 2021 AWS Tokyo region outage, the 2024 CrowdStrike global outage. The “unforeseen” happens routinely. Without preparation, it leads straight to days of downtime: customer trust destroyed, revenue lost, contracts breached.
BCP is preparation for “when things go wrong,” not “if.” Optimism like “we’re small” or “nothing has happened so far” evaporates after one major incident. For SaaS especially, one long outage sends customers to competitors, and they don’t come back.
BCP ties directly to customer trust.
RPO and RTO
The center of BCP design is RPO and RTO, the two target values to agree on with the business before anything else.
| Indicator | Meaning |
|---|---|
| RPO (Recovery Point Objective) | How far back can data be restored (allowable data loss) |
| RTO (Recovery Time Objective) | How fast to recover (allowable downtime) |
| Requirement | RPO | RTO | Examples |
|---|---|---|---|
| Mission-critical | Zero | Seconds to minutes | Financial trading, payments |
| Business-critical | Minutes | Within 1 hour | E-commerce, internal core systems |
| Normal business | 24 hours | Days | Internal tools |
Stricter requirements raise cost exponentially. Tightening RPO/RTO without thought triggers massive, unnecessary infrastructure investment, a classic landmine. “How much downtime can we accept?” must be agreed coldly with the business first.
RPO/RTO requires business agreement first. Don’t lead with technology.
The availability ladder
Each availability tier demands an order-of-magnitude different investment. Moving from 99.9% to 99.99% narrows annual downtime from 8.76 hours to 52.6 minutes, but cost goes up several-fold.
| Availability | Annual downtime | Configuration | Cost |
|---|---|---|---|
| 99.0% | 3.65 days | Single server | Cheapest |
| 99.9% | 8.76 hours | Redundant, multi-AZ | Mid |
| 99.99% | 52.6 minutes | Multi-region | High |
| 99.999% | 5.26 minutes | Multi-cloud, Active/Active | Highest |
Multi-AZ (99.9-99.95%) is the realistic standard line. Higher tiers belong only to specialty domains (finance, healthcare, telecom); for most others it’s over-investment.
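The downtime column is simple arithmetic: annual downtime = (1 − availability) × hours per year. A few lines of Python make the order-of-magnitude jumps concrete:

```python
# Annual downtime implied by each availability tier.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.3%} -> {downtime_h * 60:7.1f} min/year ({downtime_h:5.2f} h)")

# 99.000% ->  5256.0 min/year (87.60 h)   ~ 3.65 days
# 99.900% ->   525.6 min/year ( 8.76 h)
# 99.990% ->    52.6 min/year ( 0.88 h)
# 99.999% ->     5.3 min/year ( 0.09 h)
```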
Multi-AZ as the standard line. Higher tiers require careful requirements review.
The four DR strategies
The AWS Well-Architected Framework defines four DR-strategy patterns. Moving left to right, RTO shrinks and cost climbs.
```mermaid
flowchart LR
    BR["Backup & Restore<br/>RTO: hours to 1 day<br/>Cost: cheapest"]
    PL["Pilot Light<br/>RTO: tens of minutes to 1 hour<br/>Cost: mid"]
    WS["Warm Standby<br/>RTO: minutes<br/>Cost: high"]
    AA["Multi-site<br/>Active/Active<br/>RTO: near zero<br/>Cost: highest"]
    BR --> PL --> WS --> AA
    BR -.- L1["Startups<br/>Small/mid scale"]
    PL -.- L2["Business core"]
    WS -.- L3["Customer-impact systems"]
    AA -.- L4["Finance / payments<br/>Zero downtime"]
    classDef cheap fill:#dcfce7,stroke:#16a34a;
    classDef mid fill:#fef3c7,stroke:#d97706;
    classDef high fill:#fee2e2,stroke:#dc2626;
    class BR cheap;
    class PL,WS mid;
    class AA high;
```
| Strategy | Mechanism | RTO | Cost |
|---|---|---|---|
| Backup & Restore | Restore from backup | Hours to 1 day | Cheapest |
| Pilot Light | DB always-on, app stopped | Tens of minutes to 1 hour | Mid |
| Warm Standby | Always-on at smaller scale | Minutes | High |
| Multi-site Active/Active | Both sites fully running | Near zero | Highest |
For startups and small-to-mid-scale systems, Backup & Restore is enough. For business-core or customer-impact systems, Pilot Light or Warm Standby is realistic. Active/Active is reserved for zero-downtime requirements: finance, payments, telecom.
Pilot Light vs Warm Standby
The two are often compared. Understanding the difference balances budget and recovery speed.
Pilot Light keeps a small flame always lit: the secondary environment runs only a continuously replicated DB, with app servers stopped. On disaster, start the app tier and scale it out.
Warm Standby keeps the secondary always running, but at a smaller scale than production; recovery is faster because switchover only requires scaling up.
| Aspect | Pilot Light | Warm Standby |
|---|---|---|
| Secondary running | DB only | Smaller-scale full stack |
| Switchover time | Tens of minutes or more | Minutes |
| Monthly cost | Mid (DB + storage) | High (always-on full stack) |
| Fits | About 1 hour of downtime acceptable | Only minutes tolerated |
Pick by RPO/RTO requirements. Over-provisioning is waste.
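As a concrete image of “lighting” the pilot light, here is a minimal sketch, assuming the DR app tier sits in an Auto Scaling group kept at zero instances; the group name, region, and sizes are illustrative, not a fixed recipe:

```python
# Minimal sketch of activating a Pilot Light secondary: the DB is already
# replicating, so recovery means scaling the stopped app tier up from zero.
import boto3

def activate_pilot_light(asg_name: str = "app-asg-dr",   # hypothetical name
                         region: str = "us-west-2",       # assumed DR region
                         desired: int = 4) -> None:
    autoscaling = boto3.client("autoscaling", region_name=region)
    # Scale the dormant Auto Scaling group up to production capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired,
        MaxSize=desired * 2,
        DesiredCapacity=desired,
    )
    print(f"Pilot Light activated: {asg_name} scaling to {desired} instances")
```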
Multi-site Active/Active
Multi-site Active/Active runs both regions in normal operation and load-balances between them. It is the top-tier configuration, with RTO near zero, but implementation complexity and cost are an order of magnitude higher.
| Required component | Substance |
|---|---|
| Global DNS | Route 53 / Global Accelerator distribution |
| Bidirectional DB replication | Aurora Global / DynamoDB Global Tables |
| Session sharing | Both regions handle the same session |
| Data-consistency design | Allow eventual consistency |
The hard part is data consistency: conflict resolution when both regions write at the same time, application architecture that allows eventual consistency, deciding which side is canonical. “Adopt only when truly necessary” is the rule; over-adoption sinks projects in complexity.
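For the “Global DNS” row, a hedged sketch of latency-based Route 53 records pointing at two active regions; the hosted-zone IDs, domain, and load-balancer targets are placeholder assumptions, and real setups would attach health checks as well:

```python
# Latency-based Route 53 records: both regions serve traffic, and DNS
# answers with the lower-latency one, failing away from an unhealthy region.
import boto3

route53 = boto3.client("route53")

def upsert_latency_record(region: str, target_dns: str, target_zone: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",                   # hypothetical hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": f"app-{region}",   # one record per region
                "Region": region,                   # latency-based routing key
                "AliasTarget": {
                    "HostedZoneId": target_zone,
                    "DNSName": target_dns,
                    "EvaluateTargetHealth": True,   # skip an unhealthy region
                },
            },
        }]},
    )

upsert_latency_record("ap-northeast-1", "elb-tokyo.example.amazonaws.com", "Z_TOKYO")
upsert_latency_record("us-west-2", "elb-oregon.example.amazonaws.com", "Z_OREGON")
```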
Active/Active is a tier above in difficulty. Question its necessity repeatedly.
The 3-2-1 backup rule
The 3-2-1 rule is the de facto global standard for backups. Following it covers disaster, ransomware, and human error.
| Rule | Substance |
|---|---|
| 3 copies | Original + 2 backups |
| 2 media types | Different storage types |
| 1 offsite | Geographically separate / different cloud |
Cloud implementation examples:
- Auto-replicate to another region via S3 Cross-Region Replication.
- Use AWS Backup for unified backups across services.
- Store copies in a separate AWS account to guard against accidental deletion and attacks.
- Keep cold copies in offline-class storage (S3 Glacier Deep Archive).
Just “taking backups” isn’t enough: explicitly design where they live, how many generations you keep, and for how long.
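A minimal sketch of the first bullet above, enabling S3 Cross-Region Replication with boto3; the bucket names, IAM role ARN, and destination account are assumptions, and versioning must already be enabled on both buckets:

```python
# Replicate every new object to a bucket in another region (and ideally
# another account), covering the "1 offsite" leg of the 3-2-1 rule.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="prod-data",                                        # source (hypothetical)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication",  # assumed role
        "Rules": [{
            "ID": "dr-copy",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},                                      # replicate everything
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                # Different region AND different account limits blast radius.
                "Bucket": "arn:aws:s3:::prod-data-dr",
            },
        }],
    },
)
```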
Ransomware mitigation
The largest current threat is ransomware. If the backups get encrypted along with production, paying the ransom can become the only option left, so designing to protect the backups themselves is essential.
| Mitigation | Substance |
|---|---|
| Object Lock / WORM (Write Once Read Many) | Once written, objects are immutable for the retention period |
| Isolated backup account | A compromise of production permissions cannot reach the backups |
| Offline / cold storage | Stored without network connectivity |
| Audit-log immutability | CloudTrail -> S3 + Object Lock |
| MFA Delete | S3 deletion requires MFA |
The principle: design so that even a compromised admin cannot delete the backups. Backups sitting in S3 inside the production AWS account can all be deleted the moment the account is compromised. Multi-layer storage across a different account, a different organization, and offline media is required.
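As a sketch of that principle: creating a backup bucket with Object Lock in COMPLIANCE mode, run from the isolated backup account. The bucket name, region, and 30-day retention are illustrative assumptions.

```python
# A versioned bucket with Object Lock in COMPLIANCE mode: once an object
# lands, nobody (including root) can delete it until retention expires.
import boto3

# Credentials here should belong to the isolated backup account,
# not the production account.
s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="backup-vault-example",                    # hypothetical name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    ObjectLockEnabledForBucket=True,                  # must be set at creation
)

s3.put_object_lock_configuration(
    Bucket="backup-vault-example",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {
            "Mode": "COMPLIANCE",   # cannot be shortened or removed
            "Days": 30,             # assumed retention window
        }},
    },
)
```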
Redundancy patterns and drills
Redundancy has typical patterns. Use the right one for the use case.
| Configuration | Mechanism | Use |
|---|---|---|
| Active-Active | Both running, load-balanced | Web apps, API servers |
| Active-Standby | One on standby, started on switchover | DB master, file servers |
| N+1 | N required + 1 spare | Workers |
| N+M | N + multiple spares | Large clusters |
Web apps use Active-Active (the load balancer distributes traffic). Databases use Active-Standby (a single writer preserves consistency). Aurora and Cloud SQL automate this failover in managed form.
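Drilling the managed Active-Standby pattern can be a few lines, assuming an Aurora cluster named prod-aurora (a placeholder): force a failover and measure how long the switch actually takes.

```python
# Forced Aurora failover: promotes a reader to writer, the same path a
# real incident would take. Run in a drill window and record the RTO.
import time
import boto3

rds = boto3.client("rds")

start = time.monotonic()
rds.failover_db_cluster(DBClusterIdentifier="prod-aurora")  # placeholder name

# Wait until the cluster reports available again, then compare the
# observed switchover time with the agreed RTO target.
rds.get_waiter("db_cluster_available").wait(DBClusterIdentifier="prod-aurora")
print(f"writer failover completed in {time.monotonic() - start:.0f}s")
```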
Failure drills (Chaos Engineering)
Redundancy alone isn’t enough; “does it actually switch?” must be drilled regularly or it won’t work when needed. That’s Chaos Engineering.
| Method | Substance |
|---|---|
| Game Day | Planned outage drill with the whole team |
| Chaos Monkey | Random instance termination (Netflix-origin) |
| Failover test | Switchover drill in prod-equivalent environment |
| DR drill | Region-switch exercise |
“Redundancy that isn’t drilled doesn’t work” is the harsh truth. Netflix runs Chaos Monkey to randomly kill servers in production, building a culture resilient to outages. Quarterly failover drills are recommended even for non-Netflix organizations.
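A deliberately tiny, Chaos Monkey-style sketch: terminate one random instance from an opt-in tagged pool and watch whether the service self-heals. The tag name is an assumption, and this belongs in staging before anyone points it at production.

```python
# Pick one running instance that has opted in via a tag and terminate it;
# dashboards and alerts then show whether redundancy actually works.
import random
import boto3

ec2 = boto3.client("ec2")

def terminate_random_instance(chaos_tag: str = "chaos-eligible") -> None:
    resp = ec2.describe_instances(Filters=[
        {"Name": f"tag:{chaos_tag}", "Values": ["true"]},      # opt-in only
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"] for i in r["Instances"]]
    if not instances:
        return
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"chaos: terminated {victim}; watch dashboards for recovery")
```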
Redundancy gets value only via drills. Paper redundancy is powerless.
BCP / DR operational traps
BCP gains value only through operations that repeatedly confirm it works, not as paperwork on a shelf. The canonical failure patterns:
| Forbidden move | Why |
|---|---|
| Take backups, never drill restore | As at GitLab on January 31, 2017: only when production fails do you discover that none of the 5 backup methods works |
| Backup in the same AWS account as production | Encrypted or deleted on account compromise or ransomware. Different account + Object Lock is the rule |
| Verify DR config only via an annual human-driven drill | Infra, permissions, and network change within a year; it won’t work when invoked |
| Lead with tech without RPO/RTO agreement with the business | Over- or under-investment results. Agree on “how much downtime is acceptable” first |
| Multi-region / Active-Active without staffing | Configuration drifts in normal times and arrives broken at incident time |
| Backups without Object Lock / WORM | Ransomware encrypts production and backups together |
| Switchover without estimating capacity at the destination | The destination buckles under production load, causing a secondary outage |
| Designs that ignore “the entire cloud region might go down” | The September 2021 AWS Tokyo region outage and the July 2024 CrowdStrike update outage show that region-wide and even global-scale failures are real |
The 2024 CrowdStrike outage (a faulty update that slipped past validation and blue-screened 8.5 million Windows machines worldwide, halting airports, banks, and hospitals) confronted modern BCP with a new premise: even when your cloud is healthy, a single supply-chain component going down can stop everything (details in Appendix: Major Incident Catalog).
Redundancy without drills is worthless. Run failover live at least quarterly.
The AI-era lens
Once AI-driven development is the assumption, BCP / DR design has an absolute requirement: everything declared in IaC, with the drills scripted too.
A runbook sitting on a shelf as a Word document is obsolete. The modern style is to keep incident-response procedures as Lambda or Step Functions code; during an incident, AI interprets and executes the runbook while humans focus on decisions and approvals.
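As one hedged example of a runbook step as code: a Lambda handler that promotes a cross-region read replica. The event fields and identifiers are illustrative assumptions, not a fixed schema.

```python
# Runbook step as executable code instead of a Word document.
import boto3

def handler(event, context):
    """Runbook step: promote the DR read replica to a standalone writer."""
    rds = boto3.client("rds", region_name=event["dr_region"])
    rds.promote_read_replica(
        DBInstanceIdentifier=event["replica_id"],   # e.g. "app-db-replica-dr"
    )
    # Subsequent runbook steps (DNS cutover, cache warm-up) would be
    # separate, equally executable functions chained by Step Functions.
    return {"status": "promotion started", "replica": event["replica_id"]}
```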
| AI-era favorable | AI-era unfavorable |
|---|---|
| DR config in IaC (Terraform, CDK) | Manual failover procedures |
| Chaos Engineering automation scripts | Annual human-only drill |
| Runbooks as executable code | Word / PDF procedures |
| Recovery procedures as AI-readable Markdown | Tribal oral knowledge |
Agreeing RPO/RTO with the business stays human work in the AI era. AI can restore infrastructure, but “how much loss is acceptable” remains an executive-level decision.
AI-era BCP runs on “code + AI-executable.” Paper runbooks become debt.
Common misreadings
- “We have backups, so we’re safe” -> Whether you can restore is a different question. Backups without restore drills are equivalent to no backups. Periodically restore for real and measure the RTO; that is what operations means.
- “Higher availability is always better” -> Going from 99.9% to 99.99% costs several times more. Choosing “higher is better” without weighing requirements is a textbook failure to think.
- “Multi-region means we’re set” -> Multi-region multiplies operational difficulty. Without the staffing and skills, it is broken in normal times and useless during incidents.
- “On-prem is more disaster-resistant” -> Your own data center suffers physical damage and staffing constraints simultaneously. A major cloud’s multi-region setup is actually stronger, thanks to geographic redundancy and automatic recovery.
GitLab “all 5 wiped out” (industry case)
January 31, 2017: a GitLab operator accidentally deleted the production DB, and none of the 5 prepared backup methods worked (full details in Appendix: Major Incident Catalog). From the BCP angle, the hard truth it exposed: backups taken != backups restorable.
Lesson: follow the 3-2-1 rule and run actual restore drills at least quarterly, or every backup fails together exactly when it’s needed.
Backup mechanisms break, permissions change, logs age out. Confirming each time that restore works, as part of routine operations, is what “prepared” actually means.
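A minimal sketch of such a drill, assuming an RDS instance with automated snapshots (identifiers are placeholders): restore the latest snapshot into a scratch instance and measure how long “restorable” actually takes.

```python
# Periodic restore drill: the number that matters is the measured restore
# time versus the agreed RTO, not how many snapshots exist.
import time
import boto3

rds = boto3.client("rds")

def restore_drill(source_db: str = "prod-db") -> float:
    snaps = rds.describe_db_snapshots(
        DBInstanceIdentifier=source_db, SnapshotType="automated"
    )["DBSnapshots"]
    latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])

    start = time.monotonic()
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=f"{source_db}-restore-drill",  # scratch instance
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
    )
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=f"{source_db}-restore-drill"
    )
    # Remember to delete the scratch instance after verifying the data.
    return time.monotonic() - start
```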
Evaluate by “how many times you successfully restored”, not how many copies you have.
What you must decide — what’s your project’s answer?
Articulate your project’s answer in 1-2 sentences for each:
- Per-feature RPO / RTO (agreed with business)
- DR strategy (Backup / Pilot / Warm / Active)
- Backup retention period and generations
- Geographic distribution of backups
- Failover procedure and owner
- Ransomware mitigation (Object Lock, isolated account)
- Failure-drill frequency (semi-annual, quarterly)
- Communication / escalation flow
Common failure patterns
- Implementing without setting RPO/RTO requirements -> A later “this isn’t enough” rebuild.
- Backups exist, restore unverified -> A production failure surfaces “we actually can’t restore.” GitLab, January 31, 2017: all 5 backups inoperative; recovery was livestreamed on YouTube, an unprecedented situation.
- Backups stored in the same AWS account as production -> Account compromise wipes all.
- No failover drills before launch -> When invoked, procedures don’t work; hours of downtime.
- Adopted multi-region but stopped operating it for cost -> Requirements / budget mismatch.
How to make the final call
The most dangerous BCP-design mistake is “setting strict RPO/RTO without thinking about cost.” Availability from 99.9% to 99.99% improves annual downtime 10x, but cost goes up several-fold.
The starting point isn’t technology; it’s agreement with the business. The core: coldly estimate how much downtime is acceptable, and avoid both over- and under-investment.
The realistic answer: “Multi-AZ + 3-2-1 backups” as the standard line, with anything beyond reserved as a luxury for mission-critical industries (finance, healthcare, payments). Even more important is the premise that redundancy without drills is worthless: embed periodic restore drills and failover drills in operations from day one.
Paper runbooks effectively don’t run, and the AI era requires runbooks-as-code to function.
Selection priority:
- Agree RPO/RTO with the business before picking technology (don’t reverse).
- Multi-AZ + 3-2-1 backups as the baseline.
- Ransomware mitigation (Object Lock, isolated account) embedded from day one.
- Drill plans embedded in operations (Chaos Engineering, DR drills).
“Redundancy without drills is meaningless.” BCP gains value only when it can move, not as paper equipment.
Summary
This article covered BCP / DR design — RPO/RTO, the availability ladder, DR strategies, 3-2-1 backups, ransomware mitigation.
Agree RPO/RTO with the business, then pick tech. Multi-AZ + 3-2-1 backups as the standard line. Redundancy is worthless without drills. These three points anchor BCP design’s core.
The next article covers the System Architecture category’s final installment: Cost Management (FinOps).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (15/89)