System Architecture

BCP / DR Design Rules — RPO, RTO, and the 3-2-1 Backup Rule

About this article

This article is the tenth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering BCP.

BCP covers the designs and procedures that keep a service running, or bring it back quickly, through earthquakes, power outages, cloud incidents, cyber attacks, and human error. This article walks through setting RPO/RTO, the availability ladder, the four DR-strategy patterns, 3-2-1 backups, ransomware mitigation, and the reality that redundancy without drills is worthless.

What is BCP/DR design in the first place

BCP/DR design is, roughly speaking, “the plan for how fast to recover when systems go down due to earthquakes, power outages, cyber attacks, or other disasters.”

Imagine an emergency preparedness bag. Searching for water and food after a disaster strikes is too late — preparing in advance is what makes it meaningful. BCP works the same way — you decide in advance “how many hours until recovery (RTO)” and “how far back can data be restored (RPO),” then set up backups, redundant configurations, and recovery procedures. Just like disaster preparedness, plans without regular drills won’t work when you need them.

Why BCP/DR design matters

What happens if you skip BCP/DR design? Major outages happen somewhere every few years: the Tohoku earthquake in 2011, the AWS Tokyo region outage in 2021, the CrowdStrike global outage in 2024. The “unforeseen” happens routinely. Without preparation, such an event leads directly to days of downtime, with customer trust destroyed, revenue lost, and contracts breached. SaaS is especially exposed: after one long outage, customers churn to competitors and may never come back.

“It doesn’t apply to us” doesn’t hold up

BCP is “preparation for when things go wrong,” not “just in case.” Optimism like “we’re small” or “nothing has happened so far” evaporates after one major incident.

BCP ties directly to customer trust.

RPO and RTO

The center of BCP design is RPO and RTO — the two target values to agree with the business first.

| Indicator | Meaning |
| --- | --- |
| RPO (Recovery Point Objective) | How far back can data be restored (allowable data loss) |
| RTO (Recovery Time Objective) | How fast to recover (allowable downtime) |

| Requirement | RPO | RTO | Examples |
| --- | --- | --- | --- |
| Mission-critical | Zero | Seconds to minutes | Financial trading, payments |
| Business-critical | Minutes | Within 1 hour | E-commerce, internal core systems |
| Normal business | 24 hours | Days | Internal tools |

Stricter requirements raise cost exponentially. Tightening RPO/RTO without thought triggers massive, unnecessary infrastructure investment, which makes it a classic landmine. “How much downtime can we accept?” must be agreed with the business first, soberly and without wishful thinking.

RPO/RTO requires business agreement first. Don’t lead with technology.
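Once targets are agreed, they can be written down and checked mechanically. The following is a minimal sketch, not a definitive method: `RecoveryTarget` and `meets_target` are hypothetical names, worst-case data loss is assumed to equal one full backup interval, and the concrete values chosen for “Minutes” (5 minutes) and “Days” (3 days) are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Targets agreed with the business for one system."""
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


def meets_target(target: RecoveryTarget,
                 backup_interval: timedelta,
                 measured_restore_time: timedelta) -> dict:
    """Compare a design against its targets.

    Assumption: worst-case data loss is one full backup interval
    (a failure just before the next backup would have run).
    """
    return {
        "rpo_ok": backup_interval <= target.rpo,
        "rto_ok": measured_restore_time <= target.rto,
    }


# "Normal business" row: RPO 24 hours, RTO in days (3 days assumed here).
internal_tool = RecoveryTarget(rpo=timedelta(hours=24), rto=timedelta(days=3))
# "Business-critical" row: RPO in minutes (5 assumed), RTO within 1 hour.
ecommerce = RecoveryTarget(rpo=timedelta(minutes=5), rto=timedelta(hours=1))

# The same nightly backup with a 6-hour restore, judged against both targets.
nightly = dict(backup_interval=timedelta(hours=24),
               measured_restore_time=timedelta(hours=6))
print("internal tool:", meets_target(internal_tool, **nightly))
print("e-commerce:   ", meets_target(ecommerce, **nightly))
```

The same nightly backup passes for the internal tool and fails both checks for the e-commerce system, which is the point of agreeing targets per system before choosing technology.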

The availability ladder

Each step up the availability ladder demands a very different level of investment. Going from 99.9% to 99.99% narrows annual downtime from 8.76 hours to 52.6 minutes, but cost goes up several-fold.

| Availability | Annual downtime | Configuration | Cost |
| --- | --- | --- | --- |
| 99.0% | 3.65 days | Single server | Cheapest |
| 99.9% | 8.76 hours | Redundant, multi-AZ | Mid |
| 99.99% | 52.6 minutes | Multi-region | High |
| 99.999% | 5.26 minutes | Multi-cloud, Active/Active | Highest |

Multi-AZ (99.9-99.95%) is the realistic standard line. Higher tiers belong only to specialty domains (finance, healthcare, telecom); for most others it’s over-investment.

Multi-AZ as the standard line. Higher tiers require careful requirements review.
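The annual downtime figures in the ladder follow from simple arithmetic: the unavailable fraction multiplied by the hours in a year. A small sketch, assuming a 365-day year (`annual_downtime_hours` is an illustrative name):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.3f} hours/year = {hours * 60:.1f} minutes/year")
```

Each extra nine divides the downtime budget by ten, which is why each step costs so much more than the last.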

The four DR strategies

The AWS Well-Architected Framework describes four DR-strategy patterns. Moving from Backup & Restore toward Multi-site Active/Active, RTO shrinks while cost multiplies.

[Figure: Four DR Strategy Patterns (Cost vs RTO). The further right, the faster the recovery, but cost increases exponentially. Agree RPO/RTO with business units: overspec is waste, underspec is an accident.]

| Strategy | Mechanism | RTO | Cost |
| --- | --- | --- | --- |
| Backup & Restore | Restore from backup | Hours to 1 day | Cheapest |
| Pilot Light | DB always-on, app stopped | Tens of minutes to 1 hour | Mid |
| Warm Standby | Always-on at smaller scale | Minutes | High |
| Multi-site Active/Active | Both sites fully running | Zero to seconds | Highest |

For startups and small to mid-sized services, Backup & Restore is enough. For business-core or customer-facing systems, Pilot Light or Warm Standby is the realistic choice. Active/Active is reserved for domains with zero-downtime requirements, such as finance, payments, and telecom.
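The selection logic can be sketched as “pick the cheapest strategy whose worst-case RTO still fits the agreed target.” The bounds below are one reading of the table, not official figures: “Minutes” is assumed to mean 10 minutes and “Zero to seconds” 10 seconds, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (strategy, assumed worst-case RTO), ordered cheapest first.
# The 10-minute and 10-second bounds are assumptions for illustration.
STRATEGIES = [
    ("Backup & Restore", timedelta(days=1)),
    ("Pilot Light", timedelta(hours=1)),
    ("Warm Standby", timedelta(minutes=10)),
    ("Multi-site Active/Active", timedelta(seconds=10)),
]


def cheapest_strategy(rto_target: timedelta) -> str:
    """Return the first (cheapest) strategy whose worst-case RTO fits."""
    for name, worst_case_rto in STRATEGIES:
        if worst_case_rto <= rto_target:
            return name
    raise ValueError("no listed strategy meets this RTO target")


for target in (timedelta(days=2), timedelta(hours=4),
               timedelta(minutes=15), timedelta(seconds=30)):
    print(target, "->", cheapest_strategy(target))
```

Iterating from cheapest to most expensive encodes the rule that over-provisioning is waste: a stricter strategy is chosen only when the target forces it.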

Pilot Light vs Warm Standby

The two are often compared. Understanding the difference balances budget and recovery speed.

Pilot Light keeps a small flame lit at all times. In the secondary environment only the DB is kept continuously replicated; app servers are stopped. On disaster, you start the apps and scale them out.

Warm Standby keeps the secondary always running but at a smaller scale than production. Faster recovery via scale-up.

| Aspect | Pilot Light | Warm Standby |
| --- | --- | --- |
| Secondary running | DB only | Smaller-scale full stack |
| Switchover time | Tens of minutes or more | Minutes |
| Monthly cost | Mid (DB + storage) | High (always-on full stack) |
| Fits | 1-hour downtime acceptable | Only minutes of downtime acceptable |

Pick by RPO/RTO requirements. Over-provisioning is waste.

Multi-site Active/Active

Multi-site Active/Active runs both regions and load-balances across them in normal operation. It is the top-tier configuration, with an RTO of zero to a few seconds, but implementation complexity and cost are an order of magnitude higher.

| Required component | Substance |
| --- | --- |
| Global DNS | Route 53 / Global Accelerator distribution |
| Bidirectional DB replication | Aurora Global / DynamoDB Global Tables |
| Session sharing | Both regions handle the same session |
| Data-consistency design | Allow eventual consistency |

The hard part is data consistency: conflict resolution when both regions write at the same time, application architecture that allows eventual consistency, deciding which side is canonical. “Adopt only when truly necessary” is the rule; over-adoption sinks projects in complexity.

Active/Active is a tier above in difficulty. Question its necessity repeatedly.
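To make the conflict-resolution problem concrete, here is a minimal sketch of last-writer-wins, one common policy for concurrent writes to the same key. It is an illustration under assumptions, not the article’s prescribed design: `Version` and `resolve` are hypothetical names, and the region name is used only as a deterministic tie-breaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time as seen by the writing region
    region: str        # used only to break exact-timestamp ties


def resolve(a: Version, b: Version) -> Version:
    """Last-writer-wins: the later timestamp survives, the other is dropped."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


tokyo = Version("cart=3", timestamp_ms=1_000, region="ap-northeast-1")
osaka = Version("cart=5", timestamp_ms=1_002, region="ap-northeast-3")

# Both regions must reach the same answer regardless of argument order.
winner = resolve(tokyo, osaka)
print(winner.value, resolve(osaka, tokyo) == winner)
```

The policy converges, but the Tokyo write is silently discarded. Whether that loss is acceptable is exactly the kind of question the application design has to answer before adopting Active/Active.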

The 3-2-1 backup rule

The 3-2-1 rule is the widely accepted baseline guidance for backups. Following it protects against disaster, ransomware, and human error.

| Rule | Substance |
| --- | --- |
| 3 copies | Original + 2 backups |
| 2 media types | Different storage types |
| 1 offsite | Geographically separate / different cloud |

Cloud implementation examples:

  • Auto-replicate to another region via S3 Cross-Region Replication.
  • AWS Backup for unified backups across services.
  • Storage in another AWS account against accidental deletion / attacks.
  • Cold copies in archival storage (Glacier Deep Archive).

Just “taking backups” isn’t enough. Design explicitly where they are stored, how many generations are kept, and how long they are retained.
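A backup inventory can be checked against the rule mechanically. The sketch below is a simplified illustration: `Copy` and `check_3_2_1` are hypothetical names, media and location are free-form labels, and “offsite” is assumed to mean “any backup whose location differs from the primary’s.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str      # e.g. "block-volume", "object-storage", "archive"
    location: str   # e.g. a region or site identifier
    is_primary: bool = False


def check_3_2_1(copies: list[Copy]) -> dict:
    """Evaluate each leg of the 3-2-1 rule for one data set."""
    primary_locations = {c.location for c in copies if c.is_primary}
    backups = [c for c in copies if not c.is_primary]
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.location not in primary_locations for c in backups),
    }


inventory = [
    Copy("prod-db", "block-volume", "ap-northeast-1", is_primary=True),
    Copy("daily-snapshot", "object-storage", "ap-northeast-1"),
    Copy("cross-region-copy", "object-storage", "ap-northeast-3"),
]
report = check_3_2_1(inventory)
print(report)
```

Dropping the cross-region copy from `inventory` fails both the copy count and the offsite check, which is the situation a same-region-only backup setup is in.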

Ransomware mitigation

The largest current threat is ransomware. If the backups get encrypted too, you are left choosing between paying the ransom and losing the data. Designing to protect the backups themselves is essential.

| Mitigation | Substance |
| --- | --- |
| Object Lock / WORM (Write Once Read Many: written once, read-only thereafter) | Block overwrites and deletes for a retention period |
| Isolated account | A compromise of production permissions doesn’t reach the backups |
| Offline / cold storage | Stored without network connectivity |
| Audit-log immutability | CloudTrail -> S3 + Object Lock |
| MFA Delete | S3 deletion requires MFA |

The principle: “a design where admins can be compromised but backups can’t be deleted.” Backups kept in S3 inside the production AWS account can all be deleted if that account is compromised. Multi-layer storage across a different account, a different organization, and offline media is required.
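As a sketch of what an Object Lock default-retention rule looks like, the helper below builds the configuration structure that S3 expects. `object_lock_config` is a hypothetical helper, the 30-day retention and bucket name are placeholders, and the boto3 call is shown only as a comment because it needs credentials and a bucket that already has Object Lock enabled.

```python
def object_lock_config(days: int, mode: str = "COMPLIANCE") -> dict:
    """Build an S3 Object Lock default-retention configuration.

    COMPLIANCE mode is intended to prevent deletion by any user during
    the retention period; GOVERNANCE mode allows specially permitted
    users to bypass it.
    """
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown Object Lock mode: {mode}")
    if days < 1:
        raise ValueError("retention must be at least 1 day")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


config = object_lock_config(30)
print(config)

# Applying it (not executed here; bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_object_lock_configuration(
#       Bucket="example-backup-vault",
#       ObjectLockConfiguration=config,
#   )
```

Placing the locked bucket in a separate account from production is what turns this setting into the “admins can be compromised but backups can’t be deleted” property described above.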

Redundancy patterns and drills

Redundancy has typical patterns. Use the right one for the use case.

| Configuration | Mechanism | Use |
| --- | --- | --- |
| Active-Active | Both running, load-balanced | Web apps, API servers |
| Active-Standby | One on standby, started on switchover | DB master, file servers |
| N+1 | N required + 1 spare | Workers |
| N+M | N + multiple spares | Large clusters |

Web apps: Active-Active (load balancer distributes). DBs: Active-Standby (single writer for consistency). Aurora and Cloud SQL automate this in managed form.
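The N+1 and N+M rows reduce to a small capacity calculation: N is the node count needed to carry peak load, and the spares decide how many simultaneous failures can be absorbed. A sketch with made-up load figures (`nodes_required` and `survives_failures` are illustrative names):

```python
import math


def nodes_required(peak_load: float, per_node_capacity: float, spares: int = 1) -> int:
    """N nodes to carry peak load, plus M spares (N+1 when spares=1)."""
    n = math.ceil(peak_load / per_node_capacity)
    return n + spares


def survives_failures(total_nodes: int, failed: int,
                      peak_load: float, per_node_capacity: float) -> bool:
    """True if the remaining nodes can still carry peak load."""
    return (total_nodes - failed) * per_node_capacity >= peak_load


# Hypothetical figures: 900 req/s at peak, 250 req/s per worker.
total = nodes_required(peak_load=900, per_node_capacity=250, spares=1)
print("N+1 cluster size:", total)
print("one failure ok: ", survives_failures(total, 1, 900, 250))
print("two failures ok:", survives_failures(total, 2, 900, 250))
```

The same check applies to a failover destination: if the standby side cannot carry production peak load, switching over causes a second outage.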

Failure drills (Chaos Engineering)

Redundancy alone isn’t enough; “does it actually switch?” must be drilled regularly or it won’t work when needed. That’s Chaos Engineering.

| Method | Substance |
| --- | --- |
| Game Day | Planned outage drill with the whole team |
| Chaos Monkey | Random instance termination (Netflix-origin) |
| Failover test | Switchover drill in prod-equivalent environment |
| DR drill | Region-switch exercise |

“Redundancy that isn’t drilled doesn’t work” is the harsh truth. Netflix runs Chaos Monkey to randomly kill servers in production, building a culture resilient to outages. Quarterly failover drills are recommended even for non-Netflix organizations.

Redundancy gets value only via drills. Paper redundancy is powerless.
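A drill is only useful if it produces a measured recovery time that can be compared with the agreed RTO. The harness below is a minimal sketch: `run_failover_drill` is a hypothetical function, the failure injection and health check are supplied by the caller, and giving up after twice the RTO is an arbitrary assumption. The demo uses a fake clock so it runs instantly.

```python
import time
from typing import Callable


def run_failover_drill(inject_failure: Callable[[], None],
                       is_healthy: Callable[[], bool],
                       rto_seconds: float,
                       poll_seconds: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Inject a failure, then poll until healthy and report elapsed time."""
    inject_failure()
    start = clock()
    deadline = start + 2 * rto_seconds  # assumption: abort at twice the target
    while clock() < deadline:
        if is_healthy():
            elapsed = clock() - start
            return {"recovered": True, "seconds": elapsed,
                    "within_rto": elapsed <= rto_seconds}
        sleep(poll_seconds)
    return {"recovered": False, "seconds": clock() - start, "within_rto": False}


# Simulated demo: the "standby" becomes healthy 3 seconds after the failure.
now = [0.0]
result = run_failover_drill(
    inject_failure=lambda: None,
    is_healthy=lambda: now[0] >= 3,
    rto_seconds=5,
    clock=lambda: now[0],
    sleep=lambda s: now.__setitem__(0, now[0] + s),
)
print(result)
```

In a real drill, `inject_failure` would stop an instance or block a network path and `is_healthy` would probe the service endpoint. Recording the result each quarter shows whether recovery time is drifting away from the target.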

BCP / DR operational traps

BCP gains value only through “operations that confirm it works each time,” not as a plan that exists only on paper. The canonical incident patterns:

| Forbidden move | Why |
| --- | --- |
| Take backups, never drill restore | As with GitLab on January 31, 2017, it is only when production fails that “none of the 5 backup methods work” surfaces |
| Backup in the same AWS account as production | Encrypted or deleted on account compromise or ransomware. Different account + Object Lock is the rule |
| Verify DR config only via an annual human-driven drill | Infra, permissions, and network all change within a year; the config doesn’t work when invoked |
| Lead with tech without RPO/RTO agreement with the business | Over- or under-investment results. Agree on “how much downtime is acceptable” first |
| Multi-region / Active-Active without staffing | Config management can’t keep up in normal times; the setup arrives broken at incident time |
| Backups without Object Lock / WORM | Ransomware encrypts production and backups together |
| Switchover without estimating capacity at the destination | Production load on the switchover destination causes a secondary outage |
| Designs that never assume “the entire cloud region might go down” | The September 2021 AWS Tokyo region outage and the July 2024 CrowdStrike update outage show that failures with region-wide or wider impact are real |
| Choosing highest availability without cost consideration | Going from 99.9% to 99.99% costs several times more. Choosing high availability without weighing requirements against cost is textbook over-investment |
| Assuming on-prem is more disaster-resistant | Your own DC takes physical damage and personnel constraints simultaneously. Major-cloud multi-region is actually stronger thanks to geographic redundancy and auto-recovery |

The CrowdStrike 2024 outage (missing update validation; 8.5M Windows machines hit BSOD globally; airports, banks, and hospitals halted) confronted modern BCP with a new premise: even when your cloud is healthy, a single supply-chain component going down can stop everything (details in Appendix: Major Incident Catalog).

Redundancy without drills is worthless. Run failover live at least quarterly.

AI decision axes

Once AI-driven development is the assumption, “everything declared in IaC, drills also scripted” becomes an absolute requirement for BCP / DR design.

| AI-era favorable | AI-era unfavorable |
| --- | --- |
| DR config in IaC (Terraform, CDK) | Manual failover procedures |
| Chaos Engineering automation scripts | Annual human-only drill |
| Runbooks as executable code | Word / PDF procedures |
| Recovery procedures as AI-readable Markdown | Tribal oral knowledge |

RPO/RTO agreement with business requirements stays human work in the AI era. AI restores infrastructure, but “how much loss is acceptable” remains an executive-level decision.

  1. Agree RPO/RTO with the business before picking technology (don’t reverse).
  2. Multi-AZ + 3-2-1 backups as the baseline.
  3. Ransomware mitigation (Object Lock, isolated account) embedded from day one.
  4. Drill plans embedded in operations (Chaos Engineering, DR drills).

IaC-defined DR configs can be verified by AI

When DR configurations are codified in Terraform/CDK, AI can detect “diffs between production and DR environments.” The problem of forgetting to reflect new production services on the DR side becomes visible as a code diff, enabling an operational flow where AI flags it during PR review.

With manually-built DR environments, config diffs against production can only be tracked via documentation, and situations where the DR environment has been left on a months-old configuration are frequent.
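The comparison itself is a set difference once both environments are described as data. The sketch below assumes each environment has already been reduced to a `{service name: version}` mapping, for example by parsing IaC state; that extraction step is not shown, and `dr_drift` and the service names are hypothetical.

```python
def dr_drift(prod: dict[str, str], dr: dict[str, str]) -> dict:
    """Report where the DR environment has drifted from production."""
    shared = prod.keys() & dr.keys()
    return {
        "missing_in_dr": sorted(prod.keys() - dr.keys()),
        "only_in_dr": sorted(dr.keys() - prod.keys()),
        "version_mismatch": sorted(k for k in shared if prod[k] != dr[k]),
    }


# Hypothetical inventories: a new service was added to production only,
# and the API was upgraded in production but not on the DR side.
prod_services = {"api": "v42", "worker": "v17", "search": "v3"}
dr_services = {"api": "v40", "worker": "v17"}

drift = dr_drift(prod_services, dr_services)
print(drift)
```

Run as a CI step on every pull request, a non-empty report becomes a reviewable signal instead of a surprise discovered during failover.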

Auto-generating Chaos Engineering scripts

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) templates and LitmusChaos YAML are formats AI can generate easily. For instructions like “write an FIS template simulating an AZ failure” or “generate a LitmusChaos YAML that injects network latency to a specific Pod,” AI can produce code following standard templates.

This lowers the cost of creating DR drill scenarios, making it easier to increase drill frequency. Instead of one large annual drill, monthly small-scale fault-injection tests become realistic.

GitLab “all 5 wiped out” (industry case)

January 31, 2017: a GitLab operator accidentally deleted the production DB, and none of the 5 prepared backup methods worked (full details in Appendix: Major Incident Catalog). What’s important from the BCP angle: backups taken != restorable is the hard truth.

Lesson: keep the 3-2-1 rule and also run actual restore drills at least quarterly, or you may find every backup unusable at the moment you need it.

Backup mechanisms break, permissions change, logs flow. Confirming it works each time, as part of operations, is what “prepared” actually means.

Evaluate by “how many times you successfully restored”, not how many copies you have.
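A restore drill can be as simple as “restore to a scratch location and compare checksums.” The sketch below uses local files as stand-ins: `restore_drill` is a hypothetical helper, `shutil.copy2` takes the place of the real restore procedure, and the empty backup file models a job that reported success but wrote nothing.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(source: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare with `source`."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        shutil.copy2(backup, restored)  # stand-in for the real restore step
        return sha256(restored) == sha256(source)


with tempfile.TemporaryDirectory() as work:
    data = Path(work) / "db.dump"
    data.write_bytes(b"orders: 42\n")

    good = Path(work) / "db.dump.bak"
    shutil.copy2(data, good)

    empty = Path(work) / "db.dump.empty"
    empty.write_bytes(b"")  # backup job "succeeded" but wrote nothing

    good_ok = restore_drill(data, good)
    empty_ok = restore_drill(data, empty)

print("good backup restores: ", good_ok)
print("empty backup restores:", empty_ok)
```

Both backup files exist, so a check that only counts copies would pass for each. Only the restore-and-compare step tells them apart.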

What you must decide — what’s your project’s answer?

Articulate your project’s answer in 1-2 sentences for each:

  • Per-feature RPO / RTO (agreed with business)
  • DR strategy (Backup / Pilot / Warm / Active)
  • Backup retention period and generations
  • Geographic distribution of backups
  • Failover procedure and owner
  • Ransomware mitigation (Object Lock, isolated account)
  • Failure-drill frequency (semi-annual, quarterly)
  • Communication / escalation flow

Summary

This article covered BCP / DR design: RPO/RTO, the availability ladder, DR strategies, 3-2-1 backups, and ransomware mitigation.

Agree RPO/RTO with the business, then pick tech. Multi-AZ + 3-2-1 backups as the standard line. Redundancy is worthless without drills. These three points anchor BCP design’s core.

The next article covers the System Architecture category’s final installment: Cost Management (FinOps).

I hope you’ll read the next article as well.

📚 Series: Architecture Crash Course for the Generative-AI Era (15/89)