DR (Disaster Recovery)

Definition

Disaster Recovery (DR) is the set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems after a natural or human-caused disaster. DR is a subset of business continuity planning.

DR focuses on restoring IT systems and data, while business continuity encompasses the entire organization’s operations.

Key Metrics

Metric | Definition | Typical Target
RTO (Recovery Time Objective) | Max acceptable downtime | Minutes to hours
RPO (Recovery Point Objective) | Max acceptable data loss | Minutes to hours
MTO (Maximum Tolerable Outage) | Max time before irreversible damage | Hours to days
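
To make the RTO and RPO definitions concrete, the following minimal sketch measures one incident against both targets. The helper name evaluate_recovery, the timestamps, and the target values are illustrative assumptions, not figures from any real plan.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_recovery_point, outage_start, service_restored,
                      rto_target, rpo_target):
    """Compare one incident against RTO/RPO targets (hypothetical helper)."""
    # Achieved recovery time: how long the service was unavailable.
    downtime = service_restored - outage_start
    # Achieved recovery point: data written after the last good copy is lost.
    data_loss_window = outage_start - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto_target,
        "rpo_met": data_loss_window <= rpo_target,
    }


# Illustrative incident: last replica sync at 02:45, outage at 03:00,
# service restored at 04:30. Targets are assumed: RTO 1 hour, RPO 30 minutes.
result = evaluate_recovery(
    last_recovery_point=datetime(2024, 1, 1, 2, 45),
    outage_start=datetime(2024, 1, 1, 3, 0),
    service_restored=datetime(2024, 1, 1, 4, 30),
    rto_target=timedelta(hours=1),
    rpo_target=timedelta(minutes=30),
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this example the 90 minutes of downtime miss the one-hour RTO, while the 15-minute data-loss window stays within the 30-minute RPO.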

DR Strategies (from most to least expensive)

Strategy | RTO | RPO | Cost | Description
Multi-site Active-Active | Seconds | Near-zero to zero | Very High | All sites serve traffic simultaneously
Hot Standby | Minutes | Near-zero | High | Full-scale DR environment, always running
Warm Standby | Minutes | Minutes | Medium-High | Scaled-down DR environment, always running and ready to expand
Pilot Light | Tens of minutes to hours | Minutes | Medium | Data replicated continuously; core infrastructure provisioned in the DR site, with compute started only on failover
Backup & Restore | Hours to days | Hours | Low | Restore from backups after disaster
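
One way to use a comparison like this is as a lookup: choose the cheapest strategy whose typical RTO and RPO still fit the targets. The sketch below assumes rough numeric bounds for each strategy. Those bounds and the name cheapest_strategy are illustrative assumptions, not vendor guarantees, and real figures depend on the workload.

```python
from datetime import timedelta

# (name, assumed worst-case RTO, assumed worst-case RPO), cheapest first.
# All bounds are illustrative assumptions chosen for this example only.
STRATEGIES = [
    ("Backup & Restore", timedelta(days=1), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=15), timedelta(minutes=5)),
    ("Hot Standby", timedelta(minutes=5), timedelta(minutes=1)),
    ("Multi-site Active-Active", timedelta(seconds=30), timedelta(seconds=0)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the cheapest strategy meeting both targets, or None."""
    for name, rto, rpo in STRATEGIES:
        # A strategy qualifies only if both of its bounds fit the targets.
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None


print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))
print(cheapest_strategy(timedelta(minutes=10), timedelta(minutes=2)))
```

Under these assumed bounds, a 4-hour RTO with a 1-hour RPO selects Pilot Light, and a 10-minute RTO with a 2-minute RPO selects Hot Standby. Targets that no strategy can meet return None, which signals that the targets or the budget need revisiting.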

DR Components

Component | Purpose
Backup systems | Data backups (full, incremental, differential)
DR site | Physical or cloud location for recovery
Replication | Real-time or near-real-time data replication
Failover | Automatic or manual switch to DR site
Failback | Return to primary site after recovery
DNS failover | Route traffic to DR site
Load balancer failover | Redirect traffic to DR infrastructure
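
The failover and failback components can be sketched as a small decision rule: switch to the DR site only after several consecutive failed health checks, and return to the primary site only through an explicit step once it is healthy. The class FailoverMonitor, its threshold of three, and the site labels are hypothetical. A real setup would drive DNS or load balancer changes from a decision like this one.

```python
class FailoverMonitor:
    """Tracks health-check results and decides which site serves traffic."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold  # assumed value
        self.consecutive_failures = 0
        self.active_site = "primary"

    def record_check(self, primary_healthy):
        """Record one health-check result and return the active site."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        # Require several consecutive failures so a single blip
        # does not trigger a failover.
        if (self.active_site == "primary"
                and self.consecutive_failures >= self.failure_threshold):
            self.active_site = "dr"
        return self.active_site

    def fail_back(self):
        """Manual failback: return to primary only if its last check passed."""
        if self.consecutive_failures == 0:
            self.active_site = "primary"
        return self.active_site


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, True, False, False, False, True]
history = [monitor.record_check(ok) for ok in checks]
print(history[-2:])         # ['dr', 'dr']
print(monitor.fail_back())  # primary
```

Note that a healthy check after failover does not move traffic back automatically. Keeping failback as a separate, deliberate step avoids flapping between sites while the primary is still unstable.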

Cloud DR

Provider | DR Services | Notes
AWS | AWS Backup, AWS Elastic Disaster Recovery, S3 Cross-Region Replication | Covers backup, server replication, and storage replication
Azure | Azure Site Recovery, Azure Backup (Backup vault) | Strong hybrid (on-premises to cloud) DR
GCP | Backup and DR Service, cross-region replication | Integrated with GCP services
OpenStack | Freezer, Heat templates | Open-source DR automation

DR Testing

  • Tabletop exercise: Walk through DR plan verbally
  • Simulation: Simulate disaster without affecting production
  • Partial failover: Fail over non-critical systems
  • Full failover: Complete DR test with full system switch
  • Frequency: Quarterly or twice a year recommended

Related Terms

  • Backup: DR relies on backups but also covers failover and RTO/RPO targets
  • High Availability: HA prevents downtime; DR recovers from it
  • Cloud: cloud platforms enable cost-effective DR strategies
  • VPC: access point for DR site management
