DR (Disaster Recovery)

Definition

Disaster Recovery (DR) is the set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems after a natural or human-caused disaster. DR is a subset of business continuity planning.

DR focuses on restoring IT systems and data, while business continuity encompasses the entire organization’s operations.

Key Metrics

Metric	Definition	Typical Target
RTO (Recovery Time Objective)	Maximum acceptable downtime before service is restored	Minutes to hours
RPO (Recovery Point Objective)	Maximum acceptable data loss, measured as time since the last recoverable copy	Minutes to hours
MTO (Maximum Tolerable Outage)	Maximum outage duration before the business suffers irreversible damage	Hours to days
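
The metrics above can be checked against a real or simulated incident. The following is a minimal sketch, not a standard tool: the Incident class, the evaluate function, and all timestamps are hypothetical names and values chosen for illustration. It measures downtime from outage start to restoration and measures the data-loss window from the last recoverable copy to the outage start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_backup: datetime   # most recent recoverable copy of the data
    outage_start: datetime       # moment the disaster took the service down
    service_restored: datetime   # moment the service was usable again


def evaluate(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Compare achieved recovery time and data loss against the objectives."""
    downtime = incident.service_restored - incident.outage_start
    data_loss = incident.outage_start - incident.last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical incident: 15 minutes of lost data, 2.5 hours of downtime.
incident = Incident(
    last_good_backup=datetime(2024, 1, 1, 11, 45),
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 14, 30),
)
result = evaluate(incident, rto=timedelta(hours=2), rpo=timedelta(minutes=30))
print(result["rto_met"], result["rpo_met"])  # False True
```

In this example the RPO of 30 minutes is met, but the 2-hour RTO is missed by 30 minutes.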

DR Strategies (from most to least expensive)

Strategy	RTO	RPO	Cost	Description
Multi-site Active-Active	Seconds	Near-zero	Very High	All sites serve traffic simultaneously
Hot Standby	Minutes	Near-zero	High	Full-scale DR environment, always running
Warm Standby	Minutes	Minutes	Medium-High	Scaled-down DR environment, always running and ready to expand
Pilot Light	Tens of minutes	Minutes	Low-Medium	Data replicated continuously; core services provisioned in the DR site, with application servers started only on failover
Backup & Restore	Hours-Days	Hours	Low	Restore from backups after disaster
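
Choosing a strategy usually means picking the cheapest one whose recovery figures still satisfy the required RTO and RPO. The sketch below shows that selection under loud assumptions: the STRATEGIES list, the cheapest_strategy function, and every numeric figure are hypothetical placeholders, not vendor guarantees. They follow the common ordering in which pilot light recovers more slowly than warm standby, and should be replaced with measured values from actual DR tests.

```python
from datetime import timedelta
from typing import Optional

# Illustrative worst-case (RTO, RPO) figures only; these are assumptions.
# Listed from cheapest to most expensive.
STRATEGIES = [
    ("Backup & Restore", timedelta(days=1), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=15), timedelta(minutes=5)),
    ("Hot Standby", timedelta(minutes=5), timedelta(minutes=1)),
    ("Multi-site Active-Active", timedelta(seconds=30), timedelta(seconds=1)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[str]:
    """Return the first (cheapest) strategy meeting both targets, else None."""
    for name, worst_rto, worst_rpo in STRATEGIES:
        if worst_rto <= rto and worst_rpo <= rpo:
            return name
    return None


print(cheapest_strategy(timedelta(minutes=30), timedelta(minutes=10)))  # Warm Standby
print(cheapest_strategy(timedelta(days=2), timedelta(days=1)))          # Backup & Restore
```

A result of None means no listed strategy meets the targets, which is a signal to revisit the targets or the architecture.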

DR Components

Component	Purpose
Backup systems	Data backups (full, incremental, differential)
DR site	Physical or cloud location for recovery
Replication	Real-time or near-real-time data replication
Failover	Automatic or manual switch to DR site
Failback	Return to primary site after recovery
DNS failover	Route traffic to DR site
Load balancer failover	Redirect traffic to DR infrastructure
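
Automatic failover is commonly triggered only after several consecutive failed health checks, so that a single transient error does not move traffic. The following sketch illustrates that idea together with a deliberate, manual failback. FailoverMonitor, its threshold of 3, and the site names are hypothetical; a real deployment would wire this decision into DNS or load balancer configuration.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    failure_threshold: int = 3      # consecutive failed checks before failover
    active_site: str = "primary"
    consecutive_failures: int = 0

    def record_check(self, primary_healthy: bool) -> str:
        """Feed one health-check result; return the site that should serve traffic."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "dr"
        return self.active_site

    def fail_back(self) -> None:
        """Manual failback: return to primary once it has been verified."""
        self.active_site = "primary"
        self.consecutive_failures = 0


monitor = FailoverMonitor()
checks = [True, False, True, False, False, False, True]
sites = [monitor.record_check(ok) for ok in checks]
print(sites)
# ['primary', 'primary', 'primary', 'primary', 'primary', 'dr', 'dr']
```

Traffic stays on the DR site even after the primary reports healthy again. Failback is a separate, explicit step, which avoids flapping between sites.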

Cloud DR

Provider	DR Service	Notes
AWS	AWS Backup, AWS Elastic Disaster Recovery, S3 Cross-Region Replication	Covers backup, server replication, and object replication
Azure	Azure Site Recovery, Azure Backup	Supports on-premises-to-Azure and Azure-to-Azure recovery
GCP	Backup and DR Service, cross-region replication	Integrated with GCP services
OpenStack	Freezer, Heat templates	Open-source backup and DR automation

DR Testing

Tabletop exercise: Walk through DR plan verbally
Simulation: Simulate disaster without affecting production
Partial failover: Fail over non-critical systems
Full failover: Complete DR test with full system switch
Frequency: Quarterly or twice a year recommended
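
A test schedule is easier to keep when overdue tests are flagged automatically. The sketch below assumes a maximum gap per test type. The MAX_INTERVAL values, the overdue_tests function, and the dates are hypothetical and should be replaced with the organization's own policy and records.

```python
from datetime import date, timedelta

# Assumed maximum gaps between tests; tune to the organization's policy.
MAX_INTERVAL = {
    "tabletop": timedelta(days=91),
    "simulation": timedelta(days=91),
    "partial_failover": timedelta(days=182),
    "full_failover": timedelta(days=182),
}


def overdue_tests(last_run: dict, today: date) -> list:
    """Return test types never run or whose last run is older than allowed."""
    overdue = []
    for test_type, max_gap in MAX_INTERVAL.items():
        ran = last_run.get(test_type)
        if ran is None or today - ran > max_gap:
            overdue.append(test_type)
    return overdue


# Hypothetical record of the most recent run of each test type.
last_run = {
    "tabletop": date(2024, 5, 1),
    "simulation": date(2024, 1, 10),
    "full_failover": date(2024, 2, 1),
}
print(overdue_tests(last_run, today=date(2024, 6, 1)))
# ['simulation', 'partial_failover']
```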

Related Terms

Backup: DR relies on backups but also covers failover and RTO/RPO targets
High Availability: HA prevents downtime; DR recovers from it
Cloud: Cloud platforms enable cost-effective DR strategies
VPC: Isolated cloud network through which the DR site is reached and managed

References

Wikipedia: https://en.wikipedia.org/wiki/Disaster_recovery
NIST DR guidelines: https://csrc.nist.gov/pubs/sp/800-34/final
AWS Disaster Recovery: https://aws.amazon.com/disaster-recovery/