About this article
This article is the tenth deep dive in the “System Architecture” category of the Architecture Crash Course for the Generative-AI Era series, covering BCP (Business Continuity Plan).
Designs and procedures so the service keeps running, or recovers fast, through earthquakes, power outages, cloud incidents, cyber attacks, and human error: setting RPO/RTO, the availability ladder, the four DR-strategy patterns, 3-2-1 backups, ransomware mitigation, and the reality that redundancy without drills is worthless.
“It doesn’t apply to us” doesn’t hold up
Major outages happen somewhere every few years: the 2011 Tohoku earthquake, the 2021 AWS Tokyo region outage, the 2024 CrowdStrike global outage. The “unforeseen” happens routinely. Without preparation, it leads straight to days of downtime: customer trust destroyed, revenue lost, contracts breached.
BCP is preparation for “when things go wrong,” not “if.” Optimism like “we’re small” or “nothing has happened so far” evaporates after one major incident. For SaaS especially, one long outage sends customers to competitors, and they don’t come back.
BCP ties directly to customer trust.
RPO and RTO
The center of BCP design is RPO and RTO, the two target values to agree on with the business before anything else.
| Indicator | Meaning |
|---|---|
| RPO (Recovery Point Objective) | How far back can data be restored (allowable data loss) |
| RTO (Recovery Time Objective) | How fast to recover (allowable downtime) |
| Requirement | RPO | RTO | Examples |
|---|---|---|---|
| Mission-critical | Zero | Seconds to minutes | Financial trading, payments |
| Business-critical | Minutes | Within 1 hour | E-commerce, internal core systems |
| Normal business | 24 hours | Days | Internal tools |
Stricter requirements raise cost exponentially. Tightening RPO/RTO without thought triggers massive, unnecessary infrastructure investment, a classic landmine. “How much downtime can we accept?” must be agreed coldly with the business first.
RPO/RTO requires business agreement first. Don’t lead with technology.
The availability ladder
Each availability tier demands an order-of-magnitude different investment. Moving from 99.9% to 99.99% narrows annual downtime from 8.76 hours to 52.6 minutes, but cost goes up several-fold.
| Availability | Annual downtime | Configuration | Cost |
|---|---|---|---|
| 99.0% | 3.65 days | Single server | Cheapest |
| 99.9% | 8.76 hours | Redundant, multi-AZ | Mid |
| 99.99% | 52.6 minutes | Multi-region | High |
| 99.999% | 5.26 minutes | Multi-cloud, Active/Active | Highest |
Multi-AZ (99.9-99.95%) is the realistic standard line. Higher tiers belong only to specialty domains (finance, healthcare, telecom); for most others it’s over-investment.
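The downtime column is simple arithmetic: annual downtime = (1 − availability) × hours per year. A few lines of Python make the order-of-magnitude jumps concrete:

```python
# Annual downtime implied by each availability tier.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.3%} -> {downtime_h * 60:7.1f} min/year ({downtime_h:5.2f} h)")

# 99.000% ->  5256.0 min/year (87.60 h)   ~ 3.65 days
# 99.900% ->   525.6 min/year ( 8.76 h)
# 99.990% ->    52.6 min/year ( 0.88 h)
# 99.999% ->     5.3 min/year ( 0.09 h)
```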
Multi-AZ as the standard line. Higher tiers require careful requirements review.
The four DR strategies
The AWS Well-Architected Framework defines four DR-strategy patterns. Moving left to right, RTO shrinks and cost climbs.
```mermaid
flowchart LR
    BR["Backup & Restore<br/>RTO: hours to 1 day<br/>Cost: cheapest"]
    PL["Pilot Light<br/>RTO: tens of minutes to 1 hour<br/>Cost: mid"]
    WS["Warm Standby<br/>RTO: minutes<br/>Cost: high"]
    AA["Multi-site<br/>Active/Active<br/>RTO: near zero<br/>Cost: highest"]
    BR --> PL --> WS --> AA
    BR -.- L1["Startups<br/>Small/mid scale"]
    PL -.- L2["Business core"]
    WS -.- L3["Customer-impact systems"]
    AA -.- L4["Finance / payments<br/>Zero downtime"]
    classDef cheap fill:#dcfce7,stroke:#16a34a;
    classDef mid fill:#fef3c7,stroke:#d97706;
    classDef high fill:#fee2e2,stroke:#dc2626;
    class BR cheap;
    class PL,WS mid;
    class AA high;
```
| Strategy | Mechanism | RTO | Cost |
|---|---|---|---|
| Backup & Restore | Restore from backup | Hours to 1 day | Cheapest |
| Pilot Light | DB always-on, app stopped | Tens of minutes to 1 hour | Mid |
| Warm Standby | Always-on at smaller scale | Minutes | High |
| Multi-site Active/Active | Both sites fully running | Near zero | Highest |
For startups and small-to-mid-scale systems, Backup & Restore is enough. For business-core or customer-impact systems, Pilot Light or Warm Standby is realistic. Active/Active is reserved for zero-downtime requirements: finance, payments, telecom.
Pilot Light vs Warm Standby
The two are often compared. Understanding the difference balances budget and recovery speed.
Pilot Light keeps a small flame always lit: the secondary environment runs only a continuously replicated DB, with app servers stopped. On disaster, start the app tier and scale it out.
Warm Standby keeps the secondary always running, but at a smaller scale than production; recovery is faster because switchover only requires scaling up.
| Aspect | Pilot Light | Warm Standby |
|---|---|---|
| Secondary running | DB only | Smaller-scale full stack |
| Switchover time | Tens of minutes or more | Minutes |
| Monthly cost | Mid (DB + storage) | High (always-on full stack) |
| Fits | About 1 hour of downtime acceptable | Only minutes tolerated |
Pick by RPO/RTO requirements. Over-provisioning is waste.
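As a concrete image of “lighting” the pilot light, here is a minimal sketch, assuming the DR app tier sits in an Auto Scaling group kept at zero instances; the group name, region, and sizes are illustrative, not a fixed recipe:

```python
# Minimal sketch of activating a Pilot Light secondary: the DB is already
# replicating, so recovery means scaling the stopped app tier up from zero.
import boto3

def activate_pilot_light(asg_name: str = "app-asg-dr",   # hypothetical name
                         region: str = "us-west-2",       # assumed DR region
                         desired: int = 4) -> None:
    autoscaling = boto3.client("autoscaling", region_name=region)
    # Scale the dormant Auto Scaling group up to production capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired,
        MaxSize=desired * 2,
        DesiredCapacity=desired,
    )
    print(f"Pilot Light activated: {asg_name} scaling to {desired} instances")
```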
Multi-site Active/Active
Multi-site Active/Active runs both regions in normal operation and load-balances between them. It is the top-tier configuration, with RTO near zero, but implementation complexity and cost are an order of magnitude higher.
| Required component | Substance |
|---|---|
| Global DNS | Route 53 / Global Accelerator distribution |
| Bidirectional DB replication | Aurora Global / DynamoDB Global Tables |
| Session sharing | Both regions handle the same session |
| Data-consistency design | Allow eventual consistency |
The hard part is data consistency: conflict resolution when both regions write at the same time, application architecture that allows eventual consistency, deciding which side is canonical. “Adopt only when truly necessary” is the rule; over-adoption sinks projects in complexity.
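For the “Global DNS” row, a hedged sketch of latency-based Route 53 records pointing at two active regions; the hosted-zone IDs, domain, and load-balancer targets are placeholder assumptions, and real setups would attach health checks as well:

```python
# Latency-based Route 53 records: both regions serve traffic, and DNS
# answers with the lower-latency one, failing away from an unhealthy region.
import boto3

route53 = boto3.client("route53")

def upsert_latency_record(region: str, target_dns: str, target_zone: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",                   # hypothetical hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": f"app-{region}",   # one record per region
                "Region": region,                   # latency-based routing key
                "AliasTarget": {
                    "HostedZoneId": target_zone,
                    "DNSName": target_dns,
                    "EvaluateTargetHealth": True,   # skip an unhealthy region
                },
            },
        }]},
    )

upsert_latency_record("ap-northeast-1", "elb-tokyo.example.amazonaws.com", "Z_TOKYO")
upsert_latency_record("us-west-2", "elb-oregon.example.amazonaws.com", "Z_OREGON")
```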
Active/Active is a tier above in difficulty. Question its necessity repeatedly.
The 3-2-1 backup rule
The 3-2-1 rule is the de facto global standard for backups. Following it covers disaster, ransomware, and human error.
| Rule | Substance |
|---|---|
| 3 copies | Original + 2 backups |
| 2 media types | Different storage types |
| 1 offsite | Geographically separate / different cloud |
Cloud implementation examples:
- Auto-replicate to another region via S3 Cross-Region Replication.
- Use AWS Backup for unified backups across services.
- Store copies in a separate AWS account to guard against accidental deletion and attacks.
- Keep cold copies in offline-class storage (S3 Glacier Deep Archive).
Just “taking backups” isn’t enough: explicitly design where they live, how many generations you keep, and for how long.
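A minimal sketch of the first bullet above, enabling S3 Cross-Region Replication with boto3; the bucket names, IAM role ARN, and destination account are assumptions, and versioning must already be enabled on both buckets:

```python
# Replicate every new object to a bucket in another region (and ideally
# another account), covering the "1 offsite" leg of the 3-2-1 rule.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="prod-data",                                        # source (hypothetical)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication",  # assumed role
        "Rules": [{
            "ID": "dr-copy",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},                                      # replicate everything
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                # Different region AND different account limits blast radius.
                "Bucket": "arn:aws:s3:::prod-data-dr",
            },
        }],
    },
)
```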
Ransomware mitigation
The largest current threat is ransomware. If the backups get encrypted along with production, paying the ransom can become the only option left, so designing to protect the backups themselves is essential.
| Mitigation | Substance |
|---|---|
| Object Lock / WORM (Write Once Read Many) | Once written, objects are immutable for the retention period |
| Isolated backup account | A compromise of production permissions cannot reach the backups |
| Offline / cold storage | Stored without network connectivity |
| Audit-log immutability | CloudTrail -> S3 + Object Lock |
| MFA Delete | S3 deletion requires MFA |
The principle: design so that even a compromised admin cannot delete the backups. Backups sitting in S3 inside the production AWS account can all be deleted the moment the account is compromised. Multi-layer storage across a different account, a different organization, and offline media is required.
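As a sketch of that principle: creating a backup bucket with Object Lock in COMPLIANCE mode, run from the isolated backup account. The bucket name, region, and 30-day retention are illustrative assumptions.

```python
# A versioned bucket with Object Lock in COMPLIANCE mode: once an object
# lands, nobody (including root) can delete it until retention expires.
import boto3

# Credentials here should belong to the isolated backup account,
# not the production account.
s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="backup-vault-example",                    # hypothetical name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    ObjectLockEnabledForBucket=True,                  # must be set at creation
)

s3.put_object_lock_configuration(
    Bucket="backup-vault-example",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {
            "Mode": "COMPLIANCE",   # cannot be shortened or removed
            "Days": 30,             # assumed retention window
        }},
    },
)
```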
Redundancy patterns and drills
Redundancy has typical patterns. Use the right one for the use case.
| Configuration | Mechanism | Use |
|---|---|---|
| Active-Active | Both running, load-balanced | Web apps, API servers |
| Active-Standby | One on standby, started on switchover | DB master, file servers |
| N+1 | N required + 1 spare | Workers |
| N+M | N + multiple spares | Large clusters |
Web apps use Active-Active (the load balancer distributes traffic). Databases use Active-Standby (a single writer preserves consistency). Aurora and Cloud SQL automate this failover in managed form.
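Drilling the managed Active-Standby pattern can be a few lines, assuming an Aurora cluster named prod-aurora (a placeholder): force a failover and measure how long the switch actually takes.

```python
# Forced Aurora failover: promotes a reader to writer, the same path a
# real incident would take. Run in a drill window and record the RTO.
import time
import boto3

rds = boto3.client("rds")

start = time.monotonic()
rds.failover_db_cluster(DBClusterIdentifier="prod-aurora")  # placeholder name

# Wait until the cluster reports available again, then compare the
# observed switchover time with the agreed RTO target.
rds.get_waiter("db_cluster_available").wait(DBClusterIdentifier="prod-aurora")
print(f"writer failover completed in {time.monotonic() - start:.0f}s")
```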
Failure drills (Chaos Engineering)
Redundancy alone isn’t enough; “does it actually switch?” must be drilled regularly or it won’t work when needed. That’s Chaos Engineering.
| Method | Substance |
|---|---|
| Game Day | Planned outage drill with the whole team |
| Chaos Monkey | Random instance termination (Netflix-origin) |
| Failover test | Switchover drill in prod-equivalent environment |
| DR drill | Region-switch exercise |
“Redundancy that isn’t drilled doesn’t work” is the harsh truth. Netflix runs Chaos Monkey to randomly kill servers in production, building a culture resilient to outages. Quarterly failover drills are recommended even for non-Netflix organizations.
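A deliberately tiny, Chaos Monkey-style sketch: terminate one random instance from an opt-in tagged pool and watch whether the service self-heals. The tag name is an assumption, and this belongs in staging before anyone points it at production.

```python
# Pick one running instance that has opted in via a tag and terminate it;
# dashboards and alerts then show whether redundancy actually works.
import random
import boto3

ec2 = boto3.client("ec2")

def terminate_random_instance(chaos_tag: str = "chaos-eligible") -> None:
    resp = ec2.describe_instances(Filters=[
        {"Name": f"tag:{chaos_tag}", "Values": ["true"]},      # opt-in only
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"] for i in r["Instances"]]
    if not instances:
        return
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"chaos: terminated {victim}; watch dashboards for recovery")
```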
Redundancy gets value only via drills. Paper redundancy is powerless.
BCP / DR operational traps
BCP gains value only through operations that repeatedly confirm it works, not as paperwork on a shelf. The canonical failure patterns:
| Forbidden move | Why |
|---|---|
| Take backups, never drill restore | As at GitLab on January 31, 2017: only when production fails do you discover that none of the 5 backup methods works |
| Backup in the same AWS account as production | Encrypted or deleted on account compromise or ransomware. Different account + Object Lock is the rule |
| Verify DR config only via an annual human-driven drill | Infra, permissions, and network change within a year; it won’t work when invoked |
| Lead with tech without RPO/RTO agreement with the business | Over- or under-investment results. Agree on “how much downtime is acceptable” first |
| Multi-region / Active-Active without staffing | Configuration drifts in normal times and arrives broken at incident time |
| Backups without Object Lock / WORM | Ransomware encrypts production and backups together |
| Switchover without estimating capacity at the destination | The destination buckles under production load, causing a secondary outage |
| Designs that ignore “the entire cloud region might go down” | The September 2021 AWS Tokyo region outage and the July 2024 CrowdStrike update outage show that region-wide and even global-scale failures are real |
The 2024 CrowdStrike outage (a faulty update that slipped past validation and blue-screened 8.5 million Windows machines worldwide, halting airports, banks, and hospitals) confronted modern BCP with a new premise: even when your cloud is healthy, a single supply-chain component going down can stop everything (details in Appendix: Major Incident Catalog).
Redundancy without drills is worthless. Run failover live at least quarterly.
The AI-era lens
Once AI-driven development is the assumption, BCP / DR design has an absolute requirement: everything declared in IaC, with the drills scripted too.
A runbook sitting on a shelf as a Word document is obsolete. The modern style is to keep incident-response procedures as Lambda or Step Functions code; during an incident, AI interprets and executes the runbook while humans focus on decisions and approvals.
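As one hedged example of a runbook step as code: a Lambda handler that promotes a cross-region read replica. The event fields and identifiers are illustrative assumptions, not a fixed schema.

```python
# Runbook step as executable code instead of a Word document.
import boto3

def handler(event, context):
    """Runbook step: promote the DR read replica to a standalone writer."""
    rds = boto3.client("rds", region_name=event["dr_region"])
    rds.promote_read_replica(
        DBInstanceIdentifier=event["replica_id"],   # e.g. "app-db-replica-dr"
    )
    # Subsequent runbook steps (DNS cutover, cache warm-up) would be
    # separate, equally executable functions chained by Step Functions.
    return {"status": "promotion started", "replica": event["replica_id"]}
```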
| AI-era favorable | AI-era unfavorable |
|---|---|
| DR config in IaC (Terraform, CDK) | Manual failover procedures |
| Chaos Engineering automation scripts | Annual human-only drill |
| Runbooks as executable code | Word / PDF procedures |
| Recovery procedures as AI-readable Markdown | Tribal oral knowledge |
Agreeing RPO/RTO with the business stays human work in the AI era. AI can restore infrastructure, but “how much loss is acceptable” remains an executive-level decision.
AI-era BCP runs on “code + AI-executable.” Paper runbooks become debt.
Common misreadings
- “We have backups, so we’re safe” -> Whether you can restore is a different question. Backups without restore drills are equivalent to no backups. Periodically restore for real and measure the RTO; that is what operations means.
- “Higher availability is always better” -> Going from 99.9% to 99.99% costs several times more. Choosing “higher is better” without weighing requirements is a textbook failure to think.
- “Multi-region means we’re set” -> Multi-region multiplies operational difficulty. Without the staffing and skills, it is broken in normal times and useless during incidents.
- “On-prem is more disaster-resistant” -> Your own data center suffers physical damage and staffing constraints simultaneously. A major cloud’s multi-region setup is actually stronger, thanks to geographic redundancy and automatic recovery.
GitLab “all 5 wiped out” (industry case)
January 31, 2017: a GitLab operator accidentally deleted the production DB, and none of the 5 prepared backup methods worked (full details in Appendix: Major Incident Catalog). From the BCP angle, the hard truth it exposed: backups taken != backups restorable.
Lesson: follow the 3-2-1 rule and run actual restore drills at least quarterly, or every backup fails together exactly when it’s needed.
Backup mechanisms break, permissions change, logs age out. Confirming each time that restore works, as part of routine operations, is what “prepared” actually means.
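A minimal sketch of such a drill, assuming an RDS instance with automated snapshots (identifiers are placeholders): restore the latest snapshot into a scratch instance and measure how long “restorable” actually takes.

```python
# Periodic restore drill: the number that matters is the measured restore
# time versus the agreed RTO, not how many snapshots exist.
import time
import boto3

rds = boto3.client("rds")

def restore_drill(source_db: str = "prod-db") -> float:
    snaps = rds.describe_db_snapshots(
        DBInstanceIdentifier=source_db, SnapshotType="automated"
    )["DBSnapshots"]
    latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])

    start = time.monotonic()
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=f"{source_db}-restore-drill",  # scratch instance
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
    )
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=f"{source_db}-restore-drill"
    )
    # Remember to delete the scratch instance after verifying the data.
    return time.monotonic() - start
```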
Evaluate by “how many times you successfully restored”, not how many copies you have.
What you must decide — what’s your project’s answer?
Articulate your project’s answer in 1-2 sentences for each:
- Per-feature RPO / RTO (agreed with business)
- DR strategy (Backup / Pilot / Warm / Active)
- Backup retention period and generations
- Geographic distribution of backups
- Failover procedure and owner
- Ransomware mitigation (Object Lock, isolated account)
- Failure-drill frequency (semi-annual, quarterly)
- Communication / escalation flow
Common failure patterns
- Implementing without setting RPO/RTO requirements -> A later “this isn’t enough” rebuild.
- Backups exist, restore unverified -> A production failure surfaces “we actually can’t restore.” GitLab, January 31, 2017: all 5 backups inoperative; recovery was livestreamed on YouTube, an unprecedented situation.
- Backups stored in the same AWS account as production -> Account compromise wipes all.
- No failover drills before launch -> When invoked, procedures don’t work; hours of downtime.
- Adopted multi-region but stopped operating it for cost -> Requirements / budget mismatch.
How to make the final call
The most dangerous BCP-design mistake is “setting strict RPO/RTO without thinking about cost.” Availability from 99.9% to 99.99% improves annual downtime 10x, but cost goes up several-fold.
The starting point isn’t technology; it’s agreement with the business. The core: coldly estimate how much downtime is acceptable, and avoid both over- and under-investment.
The realistic answer: “Multi-AZ + 3-2-1 backups” as the standard line, with anything beyond reserved as a luxury for mission-critical industries (finance, healthcare, payments). Even more important is the premise that redundancy without drills is worthless: embed periodic restore drills and failover drills in operations from day one.
Paper runbooks effectively don’t run, and the AI era requires runbooks-as-code to function.
Selection priority:
- Agree RPO/RTO with the business before picking technology (don’t reverse).
- Multi-AZ + 3-2-1 backups as the baseline.
- Ransomware mitigation (Object Lock, isolated account) embedded from day one.
- Drill plans embedded in operations (Chaos Engineering, DR drills).
“Redundancy without drills is meaningless.” BCP gains value only when it can move, not as paper equipment.
Summary
This article covered BCP / DR design — RPO/RTO, the availability ladder, DR strategies, 3-2-1 backups, ransomware mitigation.
Agree RPO/RTO with the business, then pick tech. Multi-AZ + 3-2-1 backups as the standard line. Redundancy is worthless without drills. These three points anchor BCP design’s core.
The next article covers the System Architecture category’s final installment: Cost Management (FinOps).
Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (15/89)