DR (Disaster Recovery)
Definition
Disaster Recovery (DR) is the set of policies, tools, and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-caused disaster. DR is a subset of business continuity planning.
DR focuses on restoring IT systems and data, while business continuity encompasses the entire organization’s operations.
Key Metrics
| Metric | Definition | Typical Target |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Maximum acceptable downtime before service is restored | Minutes to hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss, measured as the time since the last recoverable copy | Minutes to hours |
| MTO (Maximum Tolerable Outage) | Maximum outage duration before irreversible damage to the business | Hours to days |
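
To make the RTO and RPO definitions concrete, the following minimal sketch measures what a single incident actually cost and compares it with the targets. The `Incident` class, its field names, and the helper functions are hypothetical names chosen for illustration; they do not come from any DR tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps of one outage (hypothetical structure for illustration)."""
    last_good_backup: datetime   # most recent recoverable copy of the data
    outage_start: datetime       # moment the service was lost
    service_restored: datetime   # moment the service was usable again


def actual_data_loss(incident: Incident) -> timedelta:
    """Window of data lost: everything written after the last recoverable copy."""
    return incident.outage_start - incident.last_good_backup


def actual_downtime(incident: Incident) -> timedelta:
    """Time the service was unavailable."""
    return incident.service_restored - incident.outage_start


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Compare the measured downtime and data loss against the RTO and RPO targets."""
    return {
        "rto_met": actual_downtime(incident) <= rto,
        "rpo_met": actual_data_loss(incident) <= rpo,
    }


# Example: backup at 02:00, outage at 02:45, service back at 04:15.
incident = Incident(
    last_good_backup=datetime(2024, 1, 1, 2, 0),
    outage_start=datetime(2024, 1, 1, 2, 45),
    service_restored=datetime(2024, 1, 1, 4, 15),
)
print(meets_objectives(incident, rto=timedelta(hours=1), rpo=timedelta(hours=1)))
# {'rto_met': False, 'rpo_met': True}
```

In this example 45 minutes of data were lost (within a 1-hour RPO), but the service was down for 90 minutes, which misses a 1-hour RTO.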
DR Strategies (from most to least expensive)
| Strategy | RTO | RPO | Cost | Description |
| --- | --- | --- | --- | --- |
| Multi-site Active-Active | Seconds | Near-zero | Very High | All sites serve traffic simultaneously |
| Hot Standby | Minutes | Near-zero | High | Full-scale DR environment, always running and ready to take traffic |
| Warm Standby | Minutes | Minutes | Medium-High | Scaled-down but functional DR environment, ready to scale up |
| Pilot Light | Tens of minutes to hours | Minutes | Medium | Data replicated and core systems always running in the DR site; remaining systems are started on failover |
| Backup & Restore | Hours to days | Hours | Low | Restore from backups after a disaster |
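
The trade-off in the table can be sketched as a lookup: pick the cheapest strategy whose worst-case RTO and RPO still fit the targets. The numeric ceilings below are illustrative assumptions, not vendor figures or values taken from the table, and `Strategy` and `cheapest_strategy` are hypothetical names. Only the cost ordering follows the Cost column.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto: timedelta  # assumed worst-case recovery time
    worst_rpo: timedelta  # assumed worst-case data-loss window
    cost_rank: int        # 1 = cheapest, following the Cost column


# Illustrative ceilings only (assumptions); real values depend on the workload.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=48), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=15), 2),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5), 3),
    Strategy("Hot Standby", timedelta(minutes=5), timedelta(minutes=1), 4),
    Strategy("Multi-site Active-Active", timedelta(seconds=30), timedelta(0), 5),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.worst_rto <= rto and s.worst_rpo <= rpo
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


choice = cheapest_strategy(rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(choice.name if choice else "no strategy meets the targets")
# Pilot Light
```

Tightening either target pushes the choice toward the more expensive strategies, which is the main reason RTO and RPO are agreed with the business before a strategy is selected.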
DR Components
| Component | Purpose |
| --- | --- |
| Backup systems | Data backups (full, incremental, differential) |
| DR site | Physical or cloud location for recovery |
| Replication | Real-time or near-real-time data replication |
| Failover | Automatic or manual switch to DR site |
| Failback | Return to primary site after recovery |
| DNS failover | Route traffic to DR site |
| Load balancer failover | Redirect traffic to DR infrastructure |
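
The failover and failback rows can be illustrated with a small state machine that switches sites only after several consecutive health-check results, so that a single failed probe does not trigger a switch. This is a minimal sketch under stated assumptions: `FailoverMonitor`, its thresholds, and the site labels are hypothetical, and the health check itself (HTTP probe, DNS health check, load balancer probe) is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks primary-site health and decides which site should serve traffic."""
    failure_threshold: int = 3    # consecutive failures before failing over
    recovery_threshold: int = 5   # consecutive successes before failing back
    active_site: str = "primary"
    _failures: int = 0
    _successes: int = 0

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should be active."""
        if primary_healthy:
            self._failures = 0
            self._successes += 1
        else:
            self._successes = 0
            self._failures += 1

        if self.active_site == "primary" and self._failures >= self.failure_threshold:
            self.active_site = "dr"        # failover: switch to the DR site
        elif self.active_site == "dr" and self._successes >= self.recovery_threshold:
            self.active_site = "primary"   # failback: return to the primary site
        return self.active_site


monitor = FailoverMonitor(failure_threshold=3, recovery_threshold=2)
for healthy in [True, False, False, False, True, True]:
    site = monitor.record_check(healthy)
print(site)
# primary
```

In practice failback is often a manual, planned step, because data written at the DR site has to be replicated back to the primary first; the automatic failback here only keeps the sketch short.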
Cloud DR
| Provider | DR Service | Notes |
| --- | --- | --- |
| AWS | AWS Backup, AWS Elastic Disaster Recovery, S3 Cross-Region Replication | Wide range of native backup and replication services |
| Azure | Azure Site Recovery, Azure Backup (Backup vault) | Strong hybrid DR |
| GCP | Backup and DR Service, cross-region replication | Integrated with GCP services |
| OpenStack | Freezer, Heat templates | Open-source DR automation |
DR Testing
- Tabletop exercise: Walk through DR plan verbally
- Simulation: Simulate disaster without affecting production
- Partial failover: Fail over non-critical systems
- Full failover: Complete DR test with full system switch
- Frequency: Quarterly or twice a year recommended
Related Concepts
- Backup: DR relies on backups but also covers failover and RTO/RPO targets
- High Availability: HA minimizes downtime from component failures; DR recovers from outages that HA could not prevent
- Cloud: Cloud platforms enable cost-effective DR strategies
- VPC: Network access point for DR site management