Thursday, 15 June 2023

AWS Disaster Recovery Strategies

Disaster recovery is one of the main requirements when designing cloud architectures today. A disaster may be a production bug, a mistake made by the developers, or a failure in an AWS service itself. Before diving into AWS disaster recovery strategies, let’s understand some terms related to disaster recovery.

Recovery Time Objective (RTO): RTO is the maximum length of time a service can remain unavailable before the outage becomes damaging to the business.

Recovery Point Objective (RPO): RPO is the maximum period of time, measured back from the moment of failure, for which data may be lost if a system goes down.

RTO-RPO Image


In the above example, the system goes down at 2 pm and is restored to its normal state by 6 pm. The recovery takes 4 hours, so this outage is acceptable only if the Recovery Time Objective is 4 hours or more. Similarly, suppose the system is backed up every 2 hours and the last backup was taken at 12 pm (marked by the green arrow). Since the system went down at 2 pm, the data written between 12 pm and 2 pm is lost and only the system state as of 12 pm can be recovered. The data loss is 2 hours, which is acceptable only if the Recovery Point Objective is 2 hours or more.
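The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the calendar date and the function names are assumptions made for the sketch, and only the three clock times come from the example above.

```python
from datetime import datetime


def hours_between(earlier, later):
    """Return the gap between two timestamps in hours."""
    return (later - earlier).total_seconds() / 3600


def meets_objectives(downtime_hours, data_loss_hours, rto_hours, rpo_hours):
    """True when both the downtime and the data loss stay within the objectives."""
    return downtime_hours <= rto_hours and data_loss_hours <= rpo_hours


# The date is an arbitrary assumption; the times are the ones from the example.
last_backup = datetime(2023, 6, 15, 12, 0)       # 12 pm, last backup
outage_start = datetime(2023, 6, 15, 14, 0)      # 2 pm, system goes down
service_restored = datetime(2023, 6, 15, 18, 0)  # 6 pm, system is back

downtime_hours = hours_between(outage_start, service_restored)
data_loss_hours = hours_between(last_backup, outage_start)

print(downtime_hours)   # 4.0
print(data_loss_hours)  # 2.0
```

With these numbers, `meets_objectives(downtime_hours, data_loss_hours, rto_hours=4, rpo_hours=2)` is true, while a stricter RTO of 3 hours would not be met.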

The choice of architecture and data backup solution depends on how much downtime (RTO) and data loss (RPO) your application can tolerate without harming your business.

Different disaster recovery strategies

Backup and restore:

In this strategy, you take frequent snapshots of the data stored in your EBS volumes and RDS databases and keep them in durable storage such as Amazon S3. You can also regularly create AMIs of your servers to preserve their state, including all the software and software updates installed on them. Note that an IAM role is not part of the AMI, so it has to be specified again when an instance is launched from the image. Backup and Restore essentially uses AWS as your virtual tape library. The strategy works not only for AWS applications but also for on-premises applications: AWS Storage Gateway lets you take snapshots of your local volumes and store them in Amazon S3. This is the slowest of the disaster recovery strategies and is best used in conjunction with other strategies. Storing backup data in Amazon S3 Glacier can further reduce the cost of this strategy.

  • RTO- High (Example: 10-24 Hrs)
  • RPO- Depends on the frequency of the backups, which can be hourly, every 3 hours, every 6 hours, or daily.
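As a rough sketch of the snapshot step, the function below creates one EBS snapshot per volume. It assumes you pass in a boto3 EC2 client (for example `boto3.client("ec2")`); the function name, the description format, and the volume IDs are made up for illustration, and a real job would also need scheduling, tagging, and cleanup of old snapshots.

```python
from datetime import datetime, timezone


def snapshot_volumes(ec2_client, volume_ids, now=None):
    """Create one EBS snapshot per volume and return the new snapshot IDs.

    ec2_client is assumed to behave like a boto3 EC2 client, whose
    create_snapshot call returns a dict containing "SnapshotId".
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"DR backup of {volume_id} at {stamp}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids


# Hypothetical usage (needs AWS credentials and real volume IDs):
#   import boto3
#   snapshot_volumes(boto3.client("ec2"), ["vol-0123456789abcdef0"])
```

How often this runs sets the RPO: running it every hour means that, at worst, about one hour of data is lost.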

Pilot Light: 

In this strategy, a minimal version of the production environment is kept running on AWS. This does not mean running the entire application scaled down (that is warm standby) but configuring and running only the core, most critical components of the production environment. When disaster strikes, the full-scale application is provisioned around this running core. Pilot Light is more costly than Backup and Restore because some minimal AWS services are running all the time. This strategy also involves provisioning infrastructure from templates, such as AWS CloudFormation templates, for efficient and quick restoration of the system.

  • RTO- High but less than backup and restore. Example: 5-10 hours.
  • RPO- Same as RPO for Backup and Restore, i.e. it depends on the frequency of backups. Even though a minimal core environment is running, data recovery still depends on backups.
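To give an idea of how the rest of the environment could be brought up around the running core, here is a minimal sketch that launches a CloudFormation stack. It assumes a boto3 CloudFormation client is passed in; the stack name, template URL, and parameter names (`WebInstanceCount`, `InstanceType`) are hypothetical and would have to match your own template.

```python
def build_stack_parameters(overrides):
    """Convert a plain dict into the Parameters list that create_stack expects."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(overrides.items())
    ]


def launch_full_environment(cfn_client, stack_name, template_url, overrides):
    """Create the full-scale stack and return its stack ID.

    cfn_client is assumed to behave like a boto3 CloudFormation client.
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=build_stack_parameters(overrides),
    )
    return response["StackId"]


# Hypothetical usage (needs AWS credentials and a template stored in S3):
#   import boto3
#   launch_full_environment(
#       boto3.client("cloudformation"),
#       "dr-full-environment",
#       "https://example-bucket.s3.amazonaws.com/app.yaml",
#       {"WebInstanceCount": 4, "InstanceType": "m5.large"},
#   )
```

Keeping the template ready and tested ahead of time is what makes the recovery faster than a plain restore from backups.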

Warm Standby:

As the name suggests, the warm standby strategy involves keeping an extremely scaled-down but fully functional copy of your production application always running in the cloud. In case of failure or disaster, the warm standby application can be immediately scaled up to serve as the production application. EC2 servers can be kept running at a minimal count and instance size and then scaled up using AWS Auto Scaling. Also, in case of failure, all DNS records and traffic routing tables are changed to point to the standby application rather than the production application. For data that changes quickly, architects will have to replicate data back from the standby site to the primary site before the primary production environment takes over again.

  • RTO: Lower than Pilot Light. Example: < 5 hours.
  • RPO: The data written since the last write that was replicated to the standby database (for example, a Multi-AZ primary/standby database).
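The two failover steps described for warm standby, scaling up and repointing DNS, could look roughly like the sketch below. It assumes boto3 clients for Auto Scaling and Route 53; the group name, record name, capacities, and the use of a CNAME record are all illustrative assumptions.

```python
def scale_up_standby(autoscaling_client, group_name, desired, maximum):
    """Resize the standby Auto Scaling group to production capacity.

    autoscaling_client is assumed to behave like a boto3 Auto Scaling client.
    """
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired,
        MaxSize=maximum,
        DesiredCapacity=desired,
    )


def build_failover_change_batch(record_name, standby_dns_name, ttl_seconds=60):
    """Build a Route 53 change batch that points the record at the standby site."""
    return {
        "Comment": "Fail over to the warm standby environment",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": standby_dns_name}],
                },
            }
        ],
    }


# Hypothetical usage (needs AWS credentials and a real hosted zone ID):
#   import boto3
#   scale_up_standby(boto3.client("autoscaling"), "standby-web-asg", 6, 12)
#   batch = build_failover_change_batch("app.example.com", "standby.example.com")
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch
#   )
```

A short TTL on the record helps here, since clients pick up the new target only after their cached DNS answer expires.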

Multi-Site:

As the name suggests, the multi-site strategy involves running a fully functional copy of the production environment as a backup in the cloud. This is a one-to-one copy of your primary application, typically run in a different Availability Zone or an entirely different Region for resilience. This is the most expensive of all the DR options because it roughly doubles the running cost of a single application. The cost overhead is offset by the smallest RPO and RTO of any DR strategy. The RPO, however, may vary from system to system according to the choice of data replication method (synchronous or asynchronous). As soon as failure strikes, the developers only have to change DNS records and routing tables to point to the secondary application.

  • RTO: Lowest of all DR strategies. Example: < 1 hour.
  • RPO: Lowest of all DR strategies. The choice of data replication method affects RPO: with synchronous replication, data can be recovered up to the last write committed to the database.
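To tie the four strategies together, here is a small sketch that picks the cheapest strategy whose example RTO fits a given requirement. The RTO figures are the upper bounds of the examples in this post; the cost ranking (1 = cheapest) is an assumption that follows the order in which the strategies are presented, and real numbers will differ for every system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    example_rto_hours: float  # upper bound of the example RTO in this post
    relative_cost: int        # assumed ranking, 1 = cheapest


STRATEGIES = [
    Strategy("Backup and Restore", 24, 1),
    Strategy("Pilot Light", 10, 2),
    Strategy("Warm Standby", 5, 3),
    Strategy("Multi-Site", 1, 4),
]


def cheapest_strategy(required_rto_hours):
    """Return the cheapest strategy whose example RTO meets the requirement.

    Returns None when even Multi-Site is too slow for the requirement.
    """
    candidates = [
        s for s in STRATEGIES if s.example_rto_hours <= required_rto_hours
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


print(cheapest_strategy(12).name)  # Pilot Light
print(cheapest_strategy(3).name)   # Multi-Site
```

A fuller version would also filter on RPO, which depends on backup frequency or replication method rather than on the strategy alone.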

Cloud computing is one of the biggest assets available to developers and investors for building efficient, simple applications at a lower cost. Backups done the traditional (non-cloud) way can be more costly and inefficient, and are prone to hardware issues and manual errors. AWS offers backup strategies not only for AWS applications but also for on-premises applications, which can use AWS as a backup target. Cloud backups provide many benefits over traditional backup systems, such as:

  • Low costs
  • Fully managed by AWS
  • Secure and reliable
  • No hardware maintenance
  • Off-site backup
  • Easy to access and test using cloud infrastructure
