Do You Run Routine Backup/Disaster Recovery Audits For Your AWS Cloud?

Disasters can strike anytime, anywhere. They range from fires and floods that damage data centers to power outages and hardware failures that disrupt operations. Each can hit with little or no warning and severely impact your business, and if you have not made Disaster Recovery preparations in advance, the result can be a nightmare. So, how well prepared are you?

With this concern in mind, Botmetric runs regular Disaster Recovery and Backup Audits that cover the compute, storage, networking, deployment, security and planning needs of your AWS Cloud Infrastructure, helping you avoid unexpected problems in the event of an outage or disaster. The Botmetric Disaster Recovery & Backup Audit service follows the ‘as-needed’ and ‘pay-as-you-go’ model of AWS.

In this blog post, we’ll discuss the automated Disaster Recovery backup audit checklist that Botmetric runs to ensure business continuity and data backup during a DR event:

ELB Optimisation

Botmetric provides a list of ELBs that either have only one Availability Zone enabled or have EC2 instances distributed unevenly across Availability Zones. We recommend that you maintain approximately equal numbers of instances in each Availability Zone for better fault tolerance.
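To make the check concrete, here is a minimal sketch of how an uneven spread could be detected. It is not Botmetric's implementation: the function name, its inputs, and the max_spread threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def is_unevenly_distributed(enabled_zones, instance_zones, max_spread=1):
    """Return True if a load balancer's instances are poorly spread.

    enabled_zones: AZ names enabled on the load balancer (hypothetical input).
    instance_zones: one AZ name per registered instance (hypothetical input).
    max_spread: largest tolerated gap between the fullest and emptiest AZ
                (an assumed threshold, not an AWS or Botmetric value).
    """
    counts = Counter({zone: 0 for zone in enabled_zones})
    counts.update(instance_zones)
    # A single zone, or an enabled zone with no instances, gives no redundancy.
    if len(counts) < 2 or min(counts.values()) == 0:
        return True
    return max(counts.values()) - min(counts.values()) > max_spread
```

With two enabled zones holding two and one instances, the function returns False; with three and one, it returns True.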

RDS Multi AZ

Botmetric provides a list of DB instances deployed in a single Availability Zone and recommends that you enable Amazon RDS Multi-AZ deployments to protect your applications from the failure of a single location. With the Multi-AZ feature enabled, Amazon RDS maintains a standby replica in another Availability Zone and automatically switches to it on failure, which provides data redundancy, eliminates I/O freezes, and minimizes latency spikes during system backups.
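A check like this can be sketched in a few lines. The example assumes its input has the shape of the DBInstances list returned by boto3's describe_db_instances call. The function name is hypothetical and this is not Botmetric's own code.

```python
def single_az_instances(db_instances):
    """Return identifiers of DB instances that are not Multi-AZ.

    db_instances: a list shaped like the "DBInstances" entries from
    boto3.client("rds").describe_db_instances() (pagination omitted here).
    """
    return [
        db["DBInstanceIdentifier"]
        for db in db_instances
        if not db.get("MultiAZ", False)
    ]
```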

ELB Connection Draining

Connection Draining, a feature of ELB, lets in-progress requests complete before a back-end instance is deregistered, which makes it easier to manage the capacity behind your ELB. If a back-end instance fails its health check, the load balancer stops sending new requests to the unhealthy EC2 instance but allows the existing requests to complete.

Botmetric identifies whether your load balancers have connection draining configured. It recommends that you enable connection draining so that in-progress requests are handled gracefully when instances are terminated by auto-scaling or removed as unhealthy.
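As an illustration, the sketch below flags load balancers without connection draining. It assumes each value has the shape of the LoadBalancerAttributes dict that boto3 returns for a Classic Load Balancer. The function names are hypothetical.

```python
def lacks_connection_draining(attributes):
    """attributes: the "LoadBalancerAttributes" dict for one Classic ELB, e.g.
    boto3.client("elb").describe_load_balancer_attributes(
        LoadBalancerName=name)["LoadBalancerAttributes"]
    """
    return not attributes.get("ConnectionDraining", {}).get("Enabled", False)


def load_balancers_to_fix(attributes_by_name):
    """Return the names of load balancers that need draining enabled."""
    return sorted(
        name
        for name, attributes in attributes_by_name.items()
        if lacks_connection_draining(attributes)
    )
```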

ELB Cross Zone

By default, your load balancer evenly distributes incoming requests across its enabled Availability Zones.

However, to ensure your ELB distributes incoming requests evenly across all back-end instances (irrespective of their AZs), enable cross-zone load balancing. Botmetric identifies which load balancers should be configured to use the cross-zone load balancing option.

Even so, as a best practice, we recommend that you distribute your EC2 instances evenly across AZs for higher fault tolerance.
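A small worked calculation shows why this matters. The sketch below is a simplified model, not an AWS API: it assumes traffic is split evenly per zone when cross-zone balancing is off and evenly per instance when it is on, and that every listed zone has at least one healthy instance.

```python
def per_instance_share(instances_per_zone, cross_zone):
    """Return the fraction of total traffic each instance receives, per zone.

    instances_per_zone: mapping of AZ name -> number of healthy instances
    (each count must be at least 1 in this simplified model).
    """
    if cross_zone:
        total = sum(instances_per_zone.values())
        return {zone: 1 / total for zone in instances_per_zone}
    zone_share = 1 / len(instances_per_zone)
    return {zone: zone_share / n for zone, n in instances_per_zone.items()}
```

With two instances in one zone and eight in another, each instance in the smaller zone handles 25% of the traffic without cross-zone balancing, while each instance in the larger zone handles 6.25%. With cross-zone balancing enabled, every instance handles 10%.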

EC2 Availability Zone

Amazon EC2 is hosted in several regions worldwide, and each region has multiple isolated locations called Availability Zones. We recommend hosting instances in multiple Availability Zones rather than a single one. If all your instances are in a single Availability Zone and that zone is affected by a failure (rare as that is), none of your instances will be available.

Botmetric identifies regions where all instances are in the same Availability Zone, or where instances span multiple zones but are unevenly distributed. It then gives you smart recommendations to fix the uneven distribution in seconds with its ‘Click-To-Fix’ button feature.
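To show what an even distribution looks like in numbers, here is a rough sketch that computes per-zone targets and the change needed in each zone. It says nothing about how ‘Click-To-Fix’ works internally. Both function names are hypothetical, and the input must list every zone you want to use, including zones that currently hold zero instances.

```python
def even_targets(counts_by_zone):
    """Return the per-zone instance counts of an even spread.

    counts_by_zone: non-empty mapping of AZ name -> current instance count.
    Any remainder goes to zones in sorted-name order so the result is stable.
    """
    zones = sorted(counts_by_zone)
    base, extra = divmod(sum(counts_by_zone.values()), len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def moves_needed(counts_by_zone):
    """Positive values mean instances to add in a zone, negative to remove."""
    targets = even_targets(counts_by_zone)
    return {zone: targets[zone] - counts_by_zone[zone] for zone in targets}
```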

Auto Scaling Group

Auto Scaling regularly runs health checks on the instances in your Auto Scaling Group and reports any unhealthy instance. Botmetric diagnoses all your EC2 instances and recommends setting the health check type to ‘ELB’ if you use a load balancer with your Auto Scaling Group. If you are not using a load balancer with the Auto Scaling Group, keep the default health check type, ‘EC2’.
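The rule above can be expressed as a short sketch. It assumes each group has the shape of an AutoScalingGroups entry from boto3's describe_auto_scaling_groups call. The function names are hypothetical, and treating attached target groups as a load balancer is our assumption.

```python
def recommended_health_check(group):
    """Return "ELB" if the group sits behind a load balancer, else "EC2"."""
    uses_lb = bool(group.get("LoadBalancerNames") or group.get("TargetGroupARNs"))
    return "ELB" if uses_lb else "EC2"


def misconfigured_groups(groups):
    """Return (name, current type, recommended type) for mismatched groups.

    groups: a list shaped like the "AutoScalingGroups" entries from
    boto3.client("autoscaling").describe_auto_scaling_groups().
    """
    return [
        (g["AutoScalingGroupName"], g.get("HealthCheckType"), recommended_health_check(g))
        for g in groups
        if g.get("HealthCheckType") != recommended_health_check(g)
    ]
```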

Auto Scaling Group resource Audit

An Auto Scaling group resource ensures that your applications have enough capacity to handle current traffic demands. To make your applications highly available and fault tolerant, you should use an Auto Scaling Group. What’s more, Auto Scaling itself does not incur any additional cost: you pay only for the Amazon EC2 resources you use.

Botmetric runs an audit and identifies which auto scaling group is associated with a deleted load balancer or which launch configuration is associated with a deleted Amazon Machine Image (AMI).
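Both checks come down to comparing references against what still exists. The sketch below assumes inputs shaped like boto3's describe_auto_scaling_groups and describe_launch_configurations results, plus lists of the load balancer names and AMI IDs that currently exist. The function names are hypothetical.

```python
def groups_with_deleted_load_balancers(groups, existing_lb_names):
    """Map each group name to the load balancers it references that are gone."""
    existing = set(existing_lb_names)
    findings = {}
    for group in groups:
        missing = sorted(set(group.get("LoadBalancerNames", [])) - existing)
        if missing:
            findings[group["AutoScalingGroupName"]] = missing
    return findings


def launch_configs_with_deleted_ami(launch_configs, existing_image_ids):
    """Return names of launch configurations whose AMI no longer exists."""
    existing = set(existing_image_ids)
    return [
        lc["LaunchConfigurationName"]
        for lc in launch_configs
        if lc["ImageId"] not in existing
    ]
```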

Route53 High TTL RR Set

This check examines resource record sets that can benefit from having a lower time-to-live (TTL) value. A long TTL can cause unnecessary delays in rerouting traffic.

Botmetric identifies resource record sets that have a TTL greater than 60 seconds and are associated with a health check. It also checks whether the routing policy is set to ‘Failover’.
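A filter along these lines could look like the sketch below. It assumes records shaped like the ResourceRecordSets entries from boto3's list_resource_record_sets call and combines the three conditions with a logical AND, which is our reading of the check. The function name is hypothetical.

```python
def high_ttl_failover_records(record_sets, max_ttl=60):
    """Return (name, type, ttl) for failover records with a long TTL.

    record_sets: a list shaped like the "ResourceRecordSets" entries from
    boto3.client("route53").list_resource_record_sets(HostedZoneId=...).
    Alias records carry no TTL of their own and are skipped.
    """
    return [
        (record["Name"], record["Type"], record["TTL"])
        for record in record_sets
        if record.get("TTL", 0) > max_ttl
        and "HealthCheckId" in record
        and "Failover" in record
    ]
```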

Volume Snapshot

Never forget to take regular snapshots of your EBS volumes. Snapshots are incremental backups stored in Amazon S3. Botmetric provides a list of EBS volumes that either have no snapshot or lack a recent one. It recommends taking regular snapshots of the required volumes for disaster recovery purposes. With Botmetric’s DevOps Cloud Automation, you can schedule a job that automatically takes EBS volume snapshots based on specified instance or volume tags.
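As a rough sketch of such a check, the function below lists volumes with no completed snapshot or none taken recently. It assumes inputs shaped like boto3's describe_volumes and describe_snapshots results. The seven-day max_age default is an arbitrary assumption, not a Botmetric setting.

```python
from datetime import datetime, timedelta, timezone


def volumes_needing_snapshot(volumes, snapshots, max_age=timedelta(days=7), now=None):
    """Return IDs of volumes with no completed snapshot newer than max_age.

    volumes / snapshots: lists shaped like the "Volumes" and "Snapshots"
    entries returned by the EC2 describe calls (StartTime is timezone-aware).
    """
    now = now or datetime.now(timezone.utc)
    latest = {}
    for snapshot in snapshots:
        if snapshot.get("State") != "completed":
            continue  # pending or failed snapshots do not count as backups
        volume_id = snapshot["VolumeId"]
        if volume_id not in latest or snapshot["StartTime"] > latest[volume_id]:
            latest[volume_id] = snapshot["StartTime"]
    return [
        volume["VolumeId"]
        for volume in volumes
        if volume["VolumeId"] not in latest
        or now - latest[volume["VolumeId"]] > max_age
    ]
```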

RDS Backup

Amazon RDS has an automatic backup feature that enables point-in-time recovery for your DB Instance, allowing you to restore it to any second during your retention period, up to the last five minutes. Botmetric provides a list of RDS instances that either have no backups or have a backup retention period below the recommended level. Automated backups can be retained for up to thirty-five days, so you can keep more than a month of backups. Botmetric’s DevOps Cloud Automation feature can schedule a job that automatically backs up your RDS data based on specified instance tags.
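The retention check can be sketched as follows, assuming input shaped like the DBInstances list from boto3's describe_db_instances call, where a BackupRetentionPeriod of 0 means automated backups are disabled. The min_days default of 7 is a placeholder assumption, since the recommended level is not stated here.

```python
def backup_findings(db_instances, min_days=7):
    """Map DB instance identifiers to a short description of the problem."""
    findings = {}
    for db in db_instances:
        days = db.get("BackupRetentionPeriod", 0)
        if days == 0:
            findings[db["DBInstanceIdentifier"]] = "automated backups disabled"
        elif days < min_days:
            findings[db["DBInstanceIdentifier"]] = (
                f"retention of {days} days is below {min_days} days"
            )
    return findings
```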

S3 Access Configuration

Botmetric identifies S3 buckets that don’t have correct logging configurations enabled. By default, Amazon S3 buckets and all their objects are private, and only the owner can grant Read/Write permissions to other resources and users. When logging is first enabled, the configuration is validated automatically. However, later modifications can result in logging failures. To avoid this, write an access policy for each permission that you grant to other resources.
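One way to picture such a check is the sketch below. It assumes a mapping of bucket name to the response of boto3's get_bucket_logging call for every bucket in the account, so a logging target in another account would be reported as missing. The function name is hypothetical.

```python
def logging_findings(logging_by_bucket):
    """Map bucket names to a logging problem, if any.

    logging_by_bucket: bucket name -> response of
    boto3.client("s3").get_bucket_logging(Bucket=name); the response contains
    a "LoggingEnabled" key only when logging is turned on.
    """
    findings = {}
    for name, response in logging_by_bucket.items():
        config = response.get("LoggingEnabled")
        if config is None:
            findings[name] = "logging disabled"
        elif config["TargetBucket"] not in logging_by_bucket:
            findings[name] = f"target bucket {config['TargetBucket']} not found"
    return findings
```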

EC2 Instance Backup

As with individual EBS volumes, regularly back up your EC2 instances by taking Amazon EBS snapshots of their volumes.

EC2 Instance Scheduled Retirement

When an EC2 instance reaches its scheduled retirement date, AWS automatically stops or terminates it. An instance whose root device is an instance store volume is terminated, and you lose access to the data on it. An instance whose root device is an EBS volume is stopped. If your instance’s root device is an EBS volume, you can replace the instance by creating an AMI of your instance and launching a new instance from the AMI. If you are unfamiliar with the process, reach out to us. We would love to guide you through it.
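To spot upcoming retirements yourself, you could scan instance status events for the instance-retirement code. The sketch below assumes input shaped like the InstanceStatuses list from boto3's describe_instance_status call. The function name is hypothetical.

```python
def instances_scheduled_for_retirement(instance_statuses):
    """Map instance IDs to the earliest time their retirement may begin.

    instance_statuses: a list shaped like the "InstanceStatuses" entries from
    boto3.client("ec2").describe_instance_status().
    """
    scheduled = {}
    for status in instance_statuses:
        for event in status.get("Events", []):
            if event.get("Code") == "instance-retirement":
                scheduled[status["InstanceId"]] = event.get("NotBefore")
    return scheduled
```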

Apart from these audits, it is highly recommended to periodically copy your data backups across the AWS regions.

With Botmetric, you can do so by scheduling a job for cross-region copy:

  • Copy EBS Volume snapshot (based on volume tags) across regions
  • Copy RDS snapshot (based on RDS tags) across regions
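For illustration, a tag-based cross-region copy of EBS snapshots could be planned as in the sketch below. It is an assumption-laden example rather than Botmetric's job definition: it matches on tags attached to the snapshots themselves, the function names are hypothetical, and the input is assumed to be shaped like the Snapshots list from boto3's describe_snapshots call.

```python
def has_tags(resource, required_tags):
    """Return True if the resource carries every required key/value tag."""
    tags = {tag["Key"]: tag["Value"] for tag in resource.get("Tags", [])}
    return all(tags.get(key) == value for key, value in required_tags.items())


def copy_requests(snapshots, required_tags, source_region):
    """Build keyword arguments for EC2 copy_snapshot, one dict per snapshot.

    Each dict is meant to be passed to copy_snapshot on an EC2 client created
    in the destination region, for example:
        boto3.client("ec2", region_name="us-west-2").copy_snapshot(**request)
    """
    return [
        {
            "SourceRegion": source_region,
            "SourceSnapshotId": snapshot["SnapshotId"],
            "Description": f"DR copy of {snapshot['SnapshotId']} from {source_region}",
        }
        for snapshot in snapshots
        if snapshot.get("State") == "completed" and has_tags(snapshot, required_tags)
    ]
```

Building the requests separately from sending them keeps the selection logic easy to review before any copy is started.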

Thus, Botmetric’s extensive DR audit not only makes it cost-effective to back up data in the cloud, but also makes it easy, secure and reliable. With timely automation of Disaster Recovery backup tasks and the right APIs in place, Botmetric’s DevOps Automation makes your AWS Cloud Infrastructure DR ready.

So what are you waiting for? Sign up for Botmetric’s 14-day trial and make your business on AWS Cloud ‘disaster-proof’.

How have you been using the cloud for Disaster Recovery? We would love to hear. Tweet to Us.