Sunday, 20 March 2022

Backup and Restore vs Pilot Light vs Warm Standby vs Multi-site

 

Backup and Restore

  • This DR plan provides the slowest system restoration after a DR event.

  • You take frequent snapshots of your data such as those in Amazon EBS Volumes and Amazon RDS databases, and you store them in a durable and secure storage location such as Amazon S3.

  • There are many ways for you to move data in and out of S3:
      • Transfer over the network via S3 Transfer Acceleration

      • Transfer over a dedicated network line using AWS Direct Connect

      • Transfer using transport hardware such as AWS Snowball Edge and Snowmobile
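  • Storing backups in S3 Glacier costs considerably less than storing them in S3 Standard, since Glacier is designed for long-term archival storage. That makes it a good fit for backups you rarely need to read, though retrieving archived data is slower and adds to your recovery time.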

  • AWS Storage Gateway enables snapshots of your on-premises data volumes to be transparently copied into S3 for backup.
      • Cached volumes (Volume Gateway in cached mode) allow you to store your primary data in S3, but keep your frequently accessed data local for low-latency access.
  • The gateway-VTL (virtual tape library) configuration of AWS Storage Gateway, now called Tape Gateway, serves as a replacement for traditional magnetic tape backup.

  • You can quickly create local volumes or Amazon EBS volumes from snapshots in S3.

  • You can create AMIs out of your EC2 instances which preserve the following:
      • A template for the root volume for the instance (for example, an operating system, an application server, and applications)

      • Launch permissions that control which AWS accounts can use the AMI to launch instances

      • A block device mapping that specifies the volumes to attach to the instance when it’s launched
  • Backup and restore is used in combination with other DR plans since it is crucial to always have a working backup of your system.
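
As a rough illustration of the S3 Glacier point above, the sketch below builds an S3 lifecycle configuration that moves backup objects to Glacier and later expires them. The `backups/` prefix, the 30-day and 365-day thresholds, the bucket name in the comment, and the helper name `backup_lifecycle_rule` are all assumptions made for this example, not values from any particular setup.

```python
import json


def backup_lifecycle_rule(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration that archives, then expires, backups.

    Hypothetical helper: it only assembles the dictionary and makes no AWS calls.
    """
    if not 0 <= archive_after_days < expire_after_days:
        raise ValueError("archive_after_days must be >= 0 and less than expire_after_days")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects under the prefix to Glacier after N days.
                "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
                # Delete them once they are older than the retention period.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Assumed values for illustration only.
config = backup_lifecycle_rule("backups/", archive_after_days=30, expire_after_days=365)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dictionary could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-dr-backups",            # hypothetical bucket name
#       LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review the retention policy before anything is applied to a real bucket.
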
Pilot Light

  • The pilot light method gives you a quicker recovery time than backup and restore because the core pieces of the system are already running and are continually kept up to date, but it is not as fast as warm standby.

  • You can maintain a pilot light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.

  • Pilot light is an example of active/passive failover configuration.

  • Infrastructure elements for the pilot light itself typically include your database servers, which would be configured for data mirroring or replication.

  • To restore the rest of the system, you use the EBS snapshots and EC2 AMIs that you should be generating regularly.

  • Pilot light tends to be more costly than backup and restore since you leave a few core AWS resources running all the time.

  • From a networking point of view, you have two main options for provisioning web servers:
      • Use Elastic IP addresses, which can be pre-allocated and pre-identified, and associate them with your instances.

      • Use Elastic Load Balancing (ELB) to distribute traffic to multiple instances. You would then update your DNS records to point at your EC2 instance or point to your load balancer using a CNAME.
  • Consider redundancy, especially at your data layer (enable Multi-AZ, cluster sharding, etc.).

  • If your data is constantly changing and failover occurs, you need to replicate in reverse, from the DR site back to the primary site, so that any updates received while the primary site was down are copied back without data loss.
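
The DNS update mentioned in the load balancer option can be sketched as a Route 53 change batch. This is a minimal sketch under stated assumptions: the record name, the load balancer DNS name, the 60-second TTL, the hosted zone ID in the comment, and the helper name `failover_cname_change` are all hypothetical, and the code only builds the request body rather than calling AWS.

```python
def failover_cname_change(record_name: str, target_dns_name: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that repoints a CNAME at the recovery site.

    Hypothetical helper: it returns the request body and makes no AWS calls.
    """
    return {
        "Comment": "DR failover: repoint to the recovery environment",
        "Changes": [
            {
                # UPSERT creates the record if it is missing, otherwise updates it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    # A short TTL lets resolvers pick up the change sooner.
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_dns_name}],
                },
            }
        ],
    }


# Assumed names for illustration only.
batch = failover_cname_change(
    "www.example.com.",
    "dr-web-123456.us-west-2.elb.amazonaws.com",
)
print(batch["Changes"][0]["ResourceRecordSet"])

# With boto3 installed and credentials configured, this could be submitted with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z0000000EXAMPLE",    # hypothetical hosted zone ID
#       ChangeBatch=batch)
```

A CNAME cannot be placed at the zone apex, which is why the example uses a `www` subdomain.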

Warm Standby

  • This DR plan restores the system faster than pilot light after a DR event, but not as fast as multi-site.

  • Warm standby describes a DR scenario in which a scaled-down version of a fully functional environment is always running in the cloud.

  • Because more than just your core elements are running all the time, warm standby is usually more costly than pilot light.

  • Warm standby is another example of active/passive failover configuration.

  • Servers can be left running as a minimum number of EC2 instances on the smallest sizes possible. Once failover occurs, you quickly resize them and add scaling capabilities. It is best to place these instances behind a load balancer as well.

  • For the data layer, the practice is similar to pilot light: a standby resource is present, and changing data is constantly replicated to it.

  • If the production system fails, the standby environment is scaled up for production load, and DNS records are changed to route all traffic to AWS.

  • If your data is constantly changing and failover occurs, you need to replicate in reverse, from the DR site back to the primary site, so that any updates received while the primary site was down are copied back without data loss.
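
The scale-up step for a warm standby can be sketched as a small capacity calculation. Everything numeric here is an assumption for illustration: the peak request rate, the per-instance throughput, the 25% headroom, the choice to cap the group at twice the desired size, and the group name `web-standby`. The resulting dictionary uses the parameter names of the Auto Scaling `update_auto_scaling_group` call, but the sketch itself does not contact AWS.

```python
import math


def scale_up_plan(group_name: str, standby_instances: int, peak_rps: float,
                  rps_per_instance: float, headroom: float = 0.25) -> dict:
    """Estimate the fleet size needed to take production load after failover.

    Hypothetical helper: sizing by requests per second is one simple heuristic.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    # Instances needed for peak load plus a safety margin.
    needed = math.ceil(peak_rps * (1 + headroom) / rps_per_instance)
    # Never shrink below what the standby is already running.
    desired = max(needed, standby_instances)
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": desired,
        "MaxSize": desired * 2,  # assumed ceiling so the group can still scale out
        "DesiredCapacity": desired,
    }


# Assumed workload figures for illustration only.
plan = scale_up_plan("web-standby", standby_instances=2, peak_rps=1800, rps_per_instance=250)
print(plan)

# With boto3 installed and credentials configured, the plan could be applied with:
#   boto3.client("autoscaling").update_auto_scaling_group(**plan)
```

Working out the target size ahead of time means the failover runbook only has to apply a number, not derive one during an outage.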


Multi-site

  • This DR plan is the fastest in system restoration during a DR event.

  • Multi-site is a one-to-one copy of your infrastructure that is located and running in another Region or AZ. This is known as an active-active configuration.

  • Because of this, multi-site is the most expensive among all DR plans.

  • Multi-site gives you the best RTO and RPO as no downtime is expected and little to no data loss should be experienced.

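  • Your recovery point also depends on the replication method you choose. Synchronous replication commits each write at both sites before acknowledging it, which keeps data loss near zero at the cost of write latency. Asynchronous replication acknowledges the write first and copies it afterward, which is faster but can lose the most recent updates.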

  • You can use a DNS service that supports weighted routing, such as Amazon Route 53, to route production traffic to different sites that deliver the same application or service.

  • During failover, you can quickly increase compute capacity by using AWS Auto Scaling or by resizing your instances to a larger size.

  • Several AWS services, such as RDS, offer a Multi-AZ feature, which allows you to provision resources in a different location for a more fault-tolerant setup.

  • If your data is constantly changing and failover occurs, you need to replicate in reverse, from the DR site back to the primary site, so that any updates received while the primary site was down are copied back without data loss.
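
To make the weighted routing point concrete, the sketch below computes the share of traffic each site receives. With Route 53 weighted records, a record is chosen with probability equal to its weight divided by the sum of all weights in the group. The site names and weights are assumptions for illustration, and `traffic_shares` is a hypothetical helper that models the arithmetic only; it does not query DNS.

```python
from fractions import Fraction


def traffic_shares(weights: dict[str, int]) -> dict[str, Fraction]:
    """Return each site's expected share of traffic under weighted routing."""
    total = sum(weights.values())
    if total == 0:
        # Route 53 treats an all-zero group as equal probability for every record.
        return {site: Fraction(1, len(weights)) for site in weights}
    return {site: Fraction(weight, total) for site, weight in weights.items()}


# Normal operation: most traffic stays on the primary site (assumed weights).
normal = traffic_shares({"primary": 3, "aws-dr": 1})
print({site: str(share) for site, share in normal.items()})

# Failover: setting the primary's weight to 0 sends all traffic to the DR site.
failover = traffic_shares({"primary": 0, "aws-dr": 1})
print({site: str(share) for site, share in failover.items()})
```

Shifting weights gradually, rather than flipping them in one step, is one way to rehearse a failover against a small slice of real traffic.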
