BCP / DR Design Rules — RPO, RTO, and the 3-2-1 Backup Rule

About this article

This article is the tenth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering BCP (Business Continuity Plan).

This article covers the designs and procedures that keep a service running, or recovering quickly, through earthquakes, power outages, cloud incidents, cyber attacks, and human error: setting RPO/RTO, the availability ladder, the four DR-strategy patterns, 3-2-1 backups, ransomware mitigation, and the reality that redundancy without drills is worthless.

“It doesn’t apply to us” doesn’t hold up

Major outages happen somewhere every few years: the 2011 Tohoku earthquake, the 2021 AWS Tokyo region outage, the 2024 CrowdStrike global outage. The "unforeseen" happens routinely. Without preparation, it leads directly to days of downtime: customer trust destroyed, revenue lost, contracts breached.

BCP is “preparation for when things go wrong,” not “just in case.” Optimism like “we’re small” or “nothing has happened so far” evaporates after one major incident. SaaS especially: one long outage and customers churn to competitors and never come back.

BCP ties directly to customer trust.

RPO and RTO

The center of BCP design is RPO and RTO — the two target values to agree with the business first.

| Indicator | Meaning |
|---|---|
| RPO (Recovery Point Objective) | How far back data can be restored (allowable data loss) |
| RTO (Recovery Time Objective) | How fast service must be recovered (allowable downtime) |

Typical tiers by requirement:

| Requirement | RPO | RTO | Examples |
|---|---|---|---|
| Mission-critical | Zero | Seconds to minutes | Financial trading, payments |
| Business-critical | Minutes | Within 1 hour | E-commerce, internal core systems |
| Normal business | 24 hours | Days | Internal tools |

Stricter requirements raise cost exponentially. Tightening RPO/RTO without thought triggers massive unnecessary infra investment — a landmine. “How much downtime can we accept?” must be agreed coldly with the business first.

RPO/RTO requires business agreement first. Don’t lead with technology.

The availability ladder

Each step up the availability ladder means an order-of-magnitude larger investment. Going from 99.9% to 99.99% narrows annual downtime from 8.76 hours to 52.6 minutes, but cost goes up several-fold.

| Availability | Annual downtime | Configuration | Cost |
|---|---|---|---|
| 99.0% | 3.65 days | Single server | Cheapest |
| 99.9% | 8.76 hours | Redundant, multi-AZ | Mid |
| 99.99% | 52.6 minutes | Multi-region | High |
| 99.999% | 5.26 minutes | Multi-cloud, Active/Active | Highest |
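The downtime figures follow from a one-line conversion: allowed downtime = (1 - availability) x length of the period. A minimal Python sketch that reproduces the numbers in the table:

```python
# Convert an availability target into allowed downtime per year and per month.
def allowed_downtime(availability: float) -> tuple[float, float]:
    """Return (hours of downtime per year, minutes of downtime per month)."""
    hours_per_year = (1 - availability) * 365 * 24
    minutes_per_month = (1 - availability) * 30 * 24 * 60
    return hours_per_year, minutes_per_month

for a in (0.99, 0.999, 0.9999, 0.99999):
    year_h, month_m = allowed_downtime(a)
    print(f"{a:.3%}: {year_h:7.2f} h/year, {month_m:7.1f} min/month")
```

Running the numbers like this during the RPO/RTO discussion makes the cost conversation concrete: each extra nine removes roughly 90% of the remaining downtime budget.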

Multi-AZ (99.9-99.95%) is the realistic standard line. Higher tiers belong only to specialty domains (finance, healthcare, telecom); for most others it’s over-investment.

Multi-AZ as the standard line. Higher tiers require careful requirements review.

The four DR strategies

The AWS Well-Architected Framework defines four DR-strategy patterns. Moving left to right in the diagram below, RTO drops and cost climbs.

```mermaid
flowchart LR
    BR["Backup & Restore<br/>RTO: hours-1 day<br/>Cost: cheapest"]
    PL["Pilot Light<br/>RTO: tens of minutes-1 hour<br/>Cost: mid"]
    WS["Warm Standby<br/>RTO: minutes<br/>Cost: high"]
    AA["Multi-site<br/>Active/Active<br/>RTO: 0-seconds<br/>Cost: highest"]
    BR --> PL --> WS --> AA
    BR -.- L1[Startup<br/>Small/mid scale]
    PL -.- L2[Business core]
    WS -.- L3[Customer-impact systems]
    AA -.- L4[Finance / payments<br/>Zero-downtime]
    classDef cheap fill:#dcfce7,stroke:#16a34a;
    classDef mid fill:#fef3c7,stroke:#d97706;
    classDef high fill:#fee2e2,stroke:#dc2626;
    class BR cheap;
    class PL,WS mid;
    class AA high;
```
| Strategy | Mechanism | RTO | Cost |
|---|---|---|---|
| Backup & Restore | Restore from backup | Hours to 1 day | Cheapest |
| Pilot Light | DB always on, app stopped | Tens of minutes to 1 hour | Mid |
| Warm Standby | Always on at smaller scale | Minutes | High |
| Multi-site Active/Active | Both sites fully running | Zero to seconds | Highest |

For startups and small to mid-size systems, Backup & Restore is enough. For business-core or customer-facing systems, Pilot Light or Warm Standby is realistic. Active/Active is reserved for finance, payments, and telecom, where zero downtime is a requirement.

Pilot Light vs Warm Standby

The two are often compared; understanding the difference is how you balance budget against recovery speed.

Pilot Light keeps a small flame always lit: the secondary environment has only the database continuously replicated, while the app servers stay stopped. On disaster, you start and scale out the application tier.

Warm Standby keeps the secondary always running, but at a smaller scale than production; recovery is faster because it only needs a scale-up.

| Aspect | Pilot Light | Warm Standby |
|---|---|---|
| Secondary running | DB only | Smaller-scale full stack |
| Switchover time | Tens of minutes or more | Minutes |
| Monthly cost | Mid (DB + storage) | High (always-on full stack) |
| Fits | Up to 1 hour of downtime acceptable | Only minutes acceptable |
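To make the trade-off concrete, here is a minimal Pilot Light activation sketch with boto3, assuming the DR region already holds a continuously replicated read replica and an app-tier Auto Scaling group kept at zero capacity; the region, group, and database identifiers are placeholders:

```python
import boto3

# Pilot Light: the DR region keeps only the database replicated; the app tier
# normally sits at zero capacity. On disaster, scale it out and promote the DB.
DR_REGION = "ap-northeast-3"         # placeholder DR region
APP_ASG = "app-dr-asg"               # placeholder Auto Scaling group (normally capacity 0)
DR_DB_REPLICA = "orders-db-replica"  # placeholder cross-region read replica

def activate_pilot_light(desired_capacity: int = 4) -> None:
    autoscaling = boto3.client("autoscaling", region_name=DR_REGION)
    rds = boto3.client("rds", region_name=DR_REGION)

    # 1. Wake up the application tier that is normally stopped.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=APP_ASG,
        MinSize=desired_capacity,
        MaxSize=desired_capacity,
        DesiredCapacity=desired_capacity,
    )

    # 2. Promote the replicated database so the DR region can accept writes.
    rds.promote_read_replica(DBInstanceIdentifier=DR_DB_REPLICA)

    # 3. DNS cutover (not shown): point traffic at the DR load balancer.
```

Warm Standby differs only in that the same group already runs at a small non-zero capacity, so activation becomes a scale-up rather than a cold start.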

Pick by RPO/RTO requirements. Over-provisioning is waste.

Multi-site Active/Active

Multi-site Active/Active runs both regions in normal operation and load-balances between them. It is the top-tier configuration, with an RTO of zero to seconds, but implementation complexity and cost are an order of magnitude higher.

| Required component | Substance |
|---|---|
| Global DNS | Distribution via Route 53 / Global Accelerator |
| Bidirectional DB replication | Aurora Global Database / DynamoDB Global Tables |
| Session sharing | Both regions can handle the same session |
| Data-consistency design | Application tolerates eventual consistency |

The hard part is data consistency: conflict resolution when both regions write at the same time, application architecture that allows eventual consistency, deciding which side is canonical. “Adopt only when truly necessary” is the rule; over-adoption sinks projects in complexity.
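To illustrate just the Global DNS row, a sketch that splits traffic 50/50 across two regional endpoints with Route 53 weighted records; the hosted zone ID, record name, and endpoint hostnames are placeholders, and health checks are omitted for brevity:

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000000000"   # placeholder hosted zone
RECORD_NAME = "api.example.com."          # placeholder record name

# Two regional endpoints (e.g. ALB DNS names) sharing traffic 50/50.
ENDPOINTS = {
    "tokyo": "api-tokyo.example.com",
    "osaka": "api-osaka.example.com",
}

changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD_NAME,
            "Type": "CNAME",
            "SetIdentifier": region,   # distinguishes the weighted records
            "Weight": 50,              # 50/50 split between the two regions
            "TTL": 60,                 # short TTL so weight changes take effect quickly
            "ResourceRecords": [{"Value": endpoint}],
        },
    }
    for region, endpoint in ENDPOINTS.items()
]

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "Active/Active weighted routing", "Changes": changes},
)
```

In a real Active/Active setup the records would also carry health checks so a failing region is removed automatically, and setting one weight to 0 doubles as a controlled failover lever.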

Active/Active is a tier above in difficulty. Question its necessity repeatedly.

The 3-2-1 backup rule

The 3-2-1 rule is the world standard guidance for backups. Following it covers disaster, ransomware, and human error.

| Rule | Substance |
|---|---|
| 3 copies | Original + 2 backups |
| 2 media types | Two different storage types |
| 1 offsite | Geographically separate location / different cloud |

Cloud implementation examples (the first item is sketched after this list):

  • Auto-replicate to another region via S3 Cross-Region Replication.
  • AWS Backup for unified backups across services.
  • Storage in another AWS account against accidental deletion / attacks.
  • Cold copies in offline storage (Glacier Deep Archive).
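S3 Cross-Region Replication might look roughly like this in boto3; the bucket names and replication-role ARN are placeholders, and the destination bucket must already exist in the other region with versioning enabled:

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "prod-data"                    # placeholder source bucket
DEST_BUCKET_ARN = "arn:aws:s3:::prod-data-dr"  # placeholder bucket in another region
REPLICATION_ROLE_ARN = "arn:aws:iam::111111111111:role/s3-replication"  # placeholder role

# Cross-Region Replication requires versioning on both buckets.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every new object to the DR bucket in the other region.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-all-to-dr",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},                   # empty prefix = all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)
```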

Just "taking backups" isn't enough: explicitly design where the backups live, how many generations to keep, and for how long.

Ransomware mitigation

The largest current threat is ransomware. If the backups get encrypted too, paying the ransom becomes the only way out, so designing to protect the backups themselves is essential.

| Mitigation | Substance |
|---|---|
| Object Lock / WORM (Write Once Read Many) | Objects immutable for a set retention period (write once, read-only thereafter) |
| Isolated account | Compromised production permissions cannot reach the backups |
| Offline / cold storage | Stored without network connectivity |
| Audit-log immutability | CloudTrail -> S3 + Object Lock |
| MFA Delete | S3 deletion requires MFA |

The principle is a design in which even a compromised administrator cannot delete the backups. Backups kept in S3 inside the production AWS account can all be deleted once that account is compromised; multi-layered storage across a different account, a different organization, and offline media is required.
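As one concrete piece, a sketch that creates a backup bucket with Object Lock and a default compliance-mode retention, intended to run in a dedicated backup account; the bucket name, region, and 30-day retention are placeholder assumptions:

```python
import boto3

s3 = boto3.client("s3")

BACKUP_BUCKET = "backup-vault-example"  # placeholder; lives in a dedicated backup account

# Object Lock can only be enabled at bucket creation time.
s3.create_bucket(
    Bucket=BACKUP_BUCKET,
    CreateBucketConfiguration={"LocationConstraint": "ap-northeast-1"},
    ObjectLockEnabledForBucket=True,
)

# COMPLIANCE mode: during the retention period nobody, including the root user,
# can delete or overwrite locked object versions.
s3.put_object_lock_configuration(
    Bucket=BACKUP_BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```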

Redundancy patterns and drills

Redundancy has typical patterns. Use the right one for the use case.

| Configuration | Mechanism | Use |
|---|---|---|
| Active-Active | Both running, load-balanced | Web apps, API servers |
| Active-Standby | One on standby, started on switchover | DB master, file servers |
| N+1 | N required + 1 spare | Workers |
| N+M | N required + multiple spares | Large clusters |

Web apps: Active-Active (load balancer distributes). DBs: Active-Standby (single writer for consistency). Aurora and Cloud SQL automate this in managed form.
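For the database side, managed Active-Standby is often a single flag. A boto3 sketch creating a Multi-AZ RDS instance; the identifiers, instance class, and sizes are placeholders, and in practice this would be declared in IaC rather than called imperatively:

```python
import boto3

rds = boto3.client("rds", region_name="ap-northeast-1")

# MultiAZ=True gives a synchronous standby in another AZ with automatic failover,
# i.e. the Active-Standby pattern in managed form.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",   # placeholder identifier
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="app",
    ManageMasterUserPassword=True,      # let RDS manage the secret instead of hard-coding it
    MultiAZ=True,
    BackupRetentionPeriod=7,            # keep daily automated backups for 7 days
)
```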

Failure drills (Chaos Engineering)

Redundancy alone isn’t enough; “does it actually switch?” must be drilled regularly or it won’t work when needed. That’s Chaos Engineering.

| Method | Substance |
|---|---|
| Game Day | Planned outage drill with the whole team |
| Chaos Monkey | Random instance termination (originated at Netflix) |
| Failover test | Switchover drill in a production-equivalent environment |
| DR drill | Region-switch exercise |

“Redundancy that isn’t drilled doesn’t work” is the harsh truth. Netflix runs Chaos Monkey to randomly kill servers in production, building a culture resilient to outages. Quarterly failover drills are recommended even for non-Netflix organizations.
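A minimal Chaos-Monkey-style sketch of the idea, meant for a game day or staging environment rather than an unprepared production system; the opt-in tag and region are assumptions:

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="ap-northeast-1")

# Pick one running instance that has explicitly opted in to chaos experiments
# (pagination omitted for brevity).
response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},   # placeholder opt-in tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
instances = [
    i["InstanceId"]
    for reservation in response["Reservations"]
    for i in reservation["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim}; monitoring should show traffic rerouting with no user impact")
    # If the redundancy works, the load balancer and Auto Scaling absorb this.
    ec2.terminate_instances(InstanceIds=[victim])
```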

Redundancy gets value only via drills. Paper redundancy is powerless.

BCP / DR operational traps

BCP gains value only through operations that repeatedly confirm it works, not as paperwork. Canonical failure patterns:

| Forbidden move | Why |
|---|---|
| Taking backups but never drilling restore | Like GitLab on January 31, 2017: only when production fails do you discover that none of the 5 backup methods work |
| Backups in the same AWS account as production | Encrypted or deleted along with production on account compromise or ransomware; a separate account plus Object Lock is the rule |
| Verifying the DR configuration only in an annual human-driven drill | Infra, permissions, and networks change within a year, so it fails when invoked |
| Leading with technology without RPO/RTO agreement with the business | Results in over- or under-investment; agree on "how much downtime is acceptable" first |
| Multi-region / Active-Active without the staffing to run it | Configuration management can't keep up in normal times, so it arrives broken at incident time |
| Backups without Object Lock / WORM | Ransomware encrypts production and backups together |
| Switching over without estimating capacity at the destination | Production load on the failover target causes a secondary outage |
| Designs that ignore "the entire cloud region might go down" | The September 2021 AWS Tokyo region outage and the July 2024 CrowdStrike incident show that region-scale and platform-wide outages really happen |

The 2024 CrowdStrike outage (a faulty update that slipped past validation crashed roughly 8.5 million Windows machines worldwide, halting airports, banks, and hospitals) confronted modern BCP with a new premise: even when your own cloud is healthy, a single supply-chain component going down can stop everything (details in the Appendix: Major Incident Catalog).

Redundancy without drills is worthless. Run failover live at least quarterly.

The AI-era lens

With AI-driven development as the baseline, BCP / DR design has an absolute requirement: everything declared in IaC, with drills scripted as well.

A runbook sitting on a shelf as a Word document is obsolete. The modern style is to express incident-response procedures as Lambda or Step Functions code: during an incident, AI interprets and executes the runbook while humans focus on decisions and approvals.
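A minimal sketch of a runbook as executable code, here as a Lambda handler that walks through a hypothetical three-step failover runbook; the step functions are placeholders for real actions such as scaling the DR app tier or notifying the incident channel:

```python
import json
from datetime import datetime, timezone

# A runbook expressed as code: each step is a function, the handler executes
# them in order and returns a machine-readable record of what happened.

def check_primary_health(event):    # placeholder: call your health-check endpoint
    return {"healthy": False}

def scale_out_dr_app_tier(event):   # placeholder: e.g. update an Auto Scaling group
    return {"desired_capacity": 4}

def notify_incident_channel(event): # placeholder: e.g. publish to SNS / chat
    return {"notified": True}

STEPS = [check_primary_health, scale_out_dr_app_tier, notify_incident_channel]

def lambda_handler(event, context):
    log = []
    for step in STEPS:
        result = step(event)
        log.append({
            "step": step.__name__,
            "at": datetime.now(timezone.utc).isoformat(),
            "result": result,
        })
    return {"statusCode": 200, "body": json.dumps(log)}
```

Because each step returns structured data, the same runbook can be read by an AI assistant, orchestrated by Step Functions, and audited after the incident.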

| AI-era favorable | AI-era unfavorable |
|---|---|
| DR configuration in IaC (Terraform, CDK) | Manual failover procedures |
| Chaos Engineering automation scripts | An annual human-only drill |
| Runbooks as executable code | Word / PDF procedures |
| Recovery procedures as AI-readable Markdown | Tribal oral knowledge |

RPO/RTO agreement with business requirements stays human work in the AI era. AI restores infrastructure, but “how much loss is acceptable” remains an executive-level decision.

AI-era BCP runs on “code + AI-executable.” Paper runbooks become debt.

Common misreadings

  • We have backups, so we're safe -> "Can you restore?" is a different question. Backups without restore drills are equivalent to no backups. Periodically restore for real and measure RTO; that is what operations means.
  • Higher availability is always better -> Going from 99.9% to 99.99% costs several times more. Choosing "higher is better" without weighing requirements is a classic lazy default.
  • Multi-region means we're set -> Multi-region multiplies operational difficulty. Without the staffing and skills, it is broken in normal times and useless during incidents.
  • On-prem is more disaster-resistant -> Your own data center takes physical damage and personnel constraints at the same time. Major-cloud multi-region is actually stronger thanks to geographic redundancy and automatic recovery.

GitLab “all 5 wiped out” (industry case)

On January 31, 2017, a GitLab operator accidentally deleted the production database, and none of the 5 prepared backup methods worked (full details in the Appendix: Major Incident Catalog). What matters from the BCP angle is the hard truth that "backups taken" does not mean "backups restorable".

Lesson: follow the 3-2-1 rule and run actual restore drills at least quarterly, or you may find that every backup is unusable at the moment you need it.
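The restore drill itself can be scripted so that "how many times you successfully restored" becomes a measured number. A sketch assuming an RDS instance with automated snapshots; the identifiers are placeholders, and the drill instance should be checked and then deleted afterwards:

```python
import time
import boto3

rds = boto3.client("rds", region_name="ap-northeast-1")

SOURCE_DB = "orders-db"                # placeholder production instance
DRILL_DB = "orders-db-restore-drill"   # throwaway instance for the drill

# Find the most recent automated snapshot of the production database.
snapshots = rds.describe_db_snapshots(
    DBInstanceIdentifier=SOURCE_DB, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# Restore it into a drill instance and measure how long that takes: a real RTO input.
start = time.monotonic()
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_DB,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier=DRILL_DB,
    WaiterConfig={"Delay": 30, "MaxAttempts": 120},  # wait up to 60 minutes
)
elapsed_min = (time.monotonic() - start) / 60
print(f"Restored {latest['DBSnapshotIdentifier']} in {elapsed_min:.1f} minutes")

# Next steps (not shown): run application-level checks against the drill instance,
# record the result, then delete it.
```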

Backup mechanisms break, permissions change, logs rotate away. Confirming each time that recovery works, as part of operations, is what "prepared" actually means.

Evaluate by “how many times you successfully restored”, not how many copies you have.

What you must decide — what’s your project’s answer?

Articulate your project’s answer in 1-2 sentences for each:

  • Per-feature RPO / RTO (agreed with business)
  • DR strategy (Backup / Pilot / Warm / Active)
  • Backup retention period and generations
  • Geographic distribution of backups
  • Failover procedure and owner
  • Ransomware mitigation (Object Lock, isolated account)
  • Failure-drill frequency (semi-annual, quarterly)
  • Communication / escalation flow

Common failure patterns

  • Implementing without setting RPO/RTO requirements -> Later “this isn’t enough” rebuild.
  • Backups exist but restore is unverified -> A production failure reveals "we actually can't restore." In GitLab's January 31, 2017 incident, all five backup methods were inoperative and the recovery had to be livestreamed on YouTube, an unprecedented situation.
  • Backups stored in the same AWS account as production -> Account compromise wipes all.
  • No failover drills before launch -> When invoked, procedures don’t work; hours of downtime.
  • Adopted multi-region but stopped operating it for cost -> Requirements / budget mismatch.

How to make the final call

The most dangerous BCP-design mistake is setting strict RPO/RTO without thinking about cost. Raising availability from 99.9% to 99.99% cuts annual downtime by a factor of ten, but cost goes up several-fold.

The starting point isn't technology; it's agreement with the business. The core is to coldly estimate how much downtime is acceptable and avoid both over- and under-investment.

The realistic answer: “Multi-AZ + 3-2-1 backups” as the standard line, with anything beyond reserved as luxury for mission-critical industries (finance, healthcare, payments). What’s even more important is the premise “redundancy without drills is worthless” — embed periodic restore drills and failover drills as part of operations from day one.

Paper runbooks effectively don’t run, and the AI era requires runbooks-as-code to function.

Selection priority:

  1. Agree RPO/RTO with the business before picking technology (don’t reverse).
  2. Multi-AZ + 3-2-1 backups as the baseline.
  3. Ransomware mitigation (Object Lock, isolated account) embedded from day one.
  4. Drill plans embedded in operations (Chaos Engineering, DR drills).

“Redundancy without drills is meaningless.” BCP gains value only when it can move, not as paper equipment.

Summary

This article covered BCP / DR design: RPO/RTO, the availability ladder, DR strategies, 3-2-1 backups, and ransomware mitigation.

Agree RPO/RTO with the business, then pick tech. Multi-AZ + 3-2-1 backups as the standard line. Redundancy is worthless without drills. These three points anchor BCP design’s core.

The next article covers the System Architecture category’s final installment: Cost Management (FinOps).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.