What is a DR Plan?
A disaster recovery (DR) plan is a set of policies, tools and procedures that enable the continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
If and when a disaster strikes, you need a well-tested disaster recovery process in place to quickly access and restore your company’s backup data.
Types of DR Plans
DR plans come in many flavors; the most common methods are listed below.
- Backup and Restore: A simple, straightforward, cost-effective method that backs up and restores data as needed. Keep in mind that because none of your data is on standby, this method, while cheap, can be quite time-consuming.
- Pilot Light: This method keeps critical applications and data at the ready so that they can be quickly retrieved if needed.
- Warm Standby: This method keeps a duplicate version of your business’ core elements running on standby at all times, which makes for little downtime and an almost seamless transition.
- Multi-Site Solution: Also known as a Hot Standby, this method fully replicates your company’s data/applications between two or more active locations and splits your traffic/usage between them. If a disaster strikes, everything is simply rerouted to the unaffected area, which means you’ll suffer almost zero downtime. However, by running two separate environments simultaneously, you will obviously incur much higher costs.
uAchieve Architecture in AWS
uAchieve consists of a relational database, uAchieve Engine, uAchieve web applications, AMIs, and storage for various logging. The environment is in Amazon Web Services (AWS) and takes advantage of all the benefits and services AWS offers to be highly scalable, available and fault-tolerant. Below, we'll talk briefly about each component of the cloud.
AWS Regions & Availability Zones
Amazon cloud computing resources are hosted in multiple locations world-wide. These locations are composed of AWS Regions and Availability Zones. Each AWS Region is a separate geographic area. Each AWS Region has multiple, isolated locations known as Availability Zones.
An availability zone is a logical data center in a region available for use by any AWS customer. Each zone in a region has redundant and separate power, networking and connectivity to reduce the likelihood of two zones failing simultaneously. A common misconception is that a single zone equals a single data center. In fact, each zone is backed by one or more physical data centers, with the largest backed by five. While a single availability zone can span multiple data centers, no two zones share a data center.
Currently, uAchieve is in the N. Virginia Region of AWS which has 6 distinct Availability Zones.
The uAchieve databases in AWS run on Amazon RDS for Oracle. Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching and backups. It frees you to focus on your applications, so you can give them the fast performance, high availability, security and compatibility they need.
uAchieve utilizes a Multi-AZ database instance. With a Multi-AZ DB Instance, Amazon RDS automatically creates a primary DB Instance and synchronously replicates the data to a standby instance in a different Availability Zone (AZ). Each AZ runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. In case of an infrastructure failure, Amazon RDS performs an automatic failover to the standby (or to a read replica in the case of Amazon Aurora), so that we can resume database operations as soon as the failover is complete. Since the endpoint for the DB Instance remains the same after a failover, our application can resume database operation without the need for manual administrative intervention.
The uAchieve Engine is a Java program that runs on a Linux VM with the required Java version installed. In the cloud, a minimum of 2 EC2 (Elastic Compute Cloud) instances in 2 different Availability Zones are provisioned for each client. The 2 instances sit in an autoscaling group that can scale out to a maximum number of instances (defined in the subscription agreement) and back in to the minimum of 2 instances based on traffic and CPU utilization. Autoscaling is a managed AWS service, so it requires no installation or OS patching on our part, and AWS takes care of its availability.
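The scaling behavior described above can be pictured as a capacity value that moves with load but is always clamped between the minimum and the maximum. The following is a minimal sketch of that idea only; it is not how AWS Auto Scaling is implemented or configured. The CPU thresholds (70% and 30%), the maximum of 4, and the names `ScalingPolicy` and `next_desired_capacity` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    min_instances: int = 2       # minimum stated in the text
    max_instances: int = 4       # hypothetical; the real value comes from the subscription agreement
    scale_out_cpu: float = 70.0  # hypothetical threshold (average CPU %)
    scale_in_cpu: float = 30.0   # hypothetical threshold (average CPU %)


def next_desired_capacity(current: int, avg_cpu_percent: float,
                          policy: ScalingPolicy) -> int:
    """Return the instance count after one scaling decision."""
    if avg_cpu_percent > policy.scale_out_cpu:
        target = current + 1     # busy: add one instance
    elif avg_cpu_percent < policy.scale_in_cpu:
        target = current - 1     # idle: remove one instance
    else:
        target = current         # within the band: no change
    # Never go below the minimum or above the maximum.
    return max(policy.min_instances, min(policy.max_instances, target))


policy = ScalingPolicy()
print(next_desired_capacity(2, 85.0, policy))  # load is high, one instance is added
print(next_desired_capacity(2, 10.0, policy))  # load is low, but the minimum of 2 holds
```

The clamp in the last line of the function is the part that matters for availability: no matter how quiet the system is, the group never drops below 2 instances.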
uAchieve Web Applications
The uAchieve Web Applications run on a Linux VM with the required Java and Tomcat versions installed. In the cloud, a minimum of 2 EC2 (Elastic Compute Cloud) instances in 2 different Availability Zones are provisioned for each client. The 2 instances sit in an autoscaling group that can scale out to a maximum number of instances (defined in the subscription agreement) and back in to the minimum of 2 instances based on traffic and CPU utilization. Autoscaling is a managed AWS service, so it requires no installation or OS patching on our part, and AWS takes care of its availability.
The web applications sit behind an AWS Elastic Load Balancer (ELB). The ELB automatically distributes incoming application traffic across multiple targets, in this case the EC2 instances running the uAchieve Web Applications. The ELB works with the autoscaling group to check the health of newly added instances and distribute traffic to them. Once an instance becomes unhealthy, the ELB stops sending traffic to that instance until it passes the health check again.
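To make the health-check behavior concrete, here is a small sketch of a load balancer that rotates requests across targets and skips any target currently marked unhealthy. It is a simplified model, not the ELB's actual algorithm; the class name `LoadBalancerSketch` and the target names `web-a` and `web-b` are hypothetical.

```python
from typing import Dict, Iterable


class LoadBalancerSketch:
    """Round-robin routing over the targets that currently pass health checks."""

    def __init__(self, targets: Iterable[str]) -> None:
        # dict preserves insertion order, so rotation order is stable
        self._healthy: Dict[str, bool] = {t: True for t in targets}
        self._counter = 0

    def report_health(self, target: str, healthy: bool) -> None:
        """Record the result of the latest health check for a target."""
        self._healthy[target] = healthy

    def route(self) -> str:
        """Pick the target for the next request, skipping unhealthy ones."""
        candidates = [t for t, ok in self._healthy.items() if ok]
        if not candidates:
            raise RuntimeError("no healthy targets available")
        target = candidates[self._counter % len(candidates)]
        self._counter += 1
        return target


lb = LoadBalancerSketch(["web-a", "web-b"])   # hypothetical instance names
print([lb.route() for _ in range(4)])         # traffic alternates between both
lb.report_health("web-a", False)              # web-a fails its health check
print([lb.route() for _ in range(3)])         # all traffic goes to web-b
lb.report_health("web-a", True)               # web-a passes again and rejoins
```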
uAchieve in the cloud utilizes 2 types of storage in AWS, S3 and EFS.
Amazon S3 is object storage built to store and retrieve any amount of data from anywhere: web sites and mobile apps, corporate applications, and data from IoT sensors or devices. It is designed to deliver 99.999999999% durability and 99.99% availability, and it stores data for millions of applications used by market leaders in every industry. S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements. It gives customers flexibility in the way they manage data for cost optimization, access control, and compliance. Amazon S3 is also the most supported cloud storage service available, with integration from the largest community of third-party solutions, systems integrator partners, and other AWS services. In uAchieve, S3 is used for storing uAchieve audit logs. The S3 bucket is encrypted at rest and private (inaccessible to the public).
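The durability and availability percentages above are easier to judge when converted into concrete numbers. The sketch below does that arithmetic, assuming both figures are annual design targets and using a hypothetical bucket of 10,000,000 objects. The function names are illustrative, and exact fractions are used so that rounding does not blur an eleven-nines figure.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def failure_fraction(percent: str) -> Fraction:
    """Turn a percentage such as '99.99' into the exact fraction that is NOT covered."""
    return 1 - Fraction(percent) / 100


def max_unavailable_minutes_per_year(availability_percent: str) -> Fraction:
    """Minutes per year the service may be unavailable at the given availability."""
    return failure_fraction(availability_percent) * MINUTES_PER_YEAR


def expected_objects_lost_per_year(object_count: int, durability_percent: str) -> Fraction:
    """Average number of objects lost per year at the given durability."""
    return object_count * failure_fraction(durability_percent)


print(float(max_unavailable_minutes_per_year("99.99")))
print(float(expected_objects_lost_per_year(10_000_000, "99.999999999")))
```

Under these assumptions, 99.99% availability allows roughly 52.6 minutes of unavailability per year, and eleven nines of durability on 10,000,000 objects works out to an average of one lost object every 10,000 years.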
Amazon Elastic File System (Amazon EFS) provides simple, scalable, elastic file storage for use with AWS Cloud services. Amazon EFS is built to elastically scale on demand without disrupting applications, growing and shrinking automatically as files are added and removed, so applications have the storage they need, when they need it. It is designed to provide massively parallel shared access to thousands of Amazon EC2 instances, enabling applications to achieve high levels of aggregate throughput and IOPS that scale as a file system grows, with consistent low latencies. As a regional service, Amazon EFS is designed for high availability and durability, storing data redundantly across multiple Availability Zones. In uAchieve, EFS is used for storing active logs for the uAchieve Engine and uAchieve Web Applications.
An Amazon Machine Image (AMI) provides the information required to launch an EC2 instance. You can launch multiple instances from a single AMI when you need multiple instances with the same configuration.
An AMI includes the following:
- A template for the root volume for the instance (for example, an operating system, an application server, and applications)
- Launch permissions that control which AWS accounts can use the AMI to launch instances
- A block device mapping that specifies the volumes to attach to the instance when it's launched
When we set up the uAchieve Engine instance and the uAchieve Web Apps instance, an AMI is created and saved in AWS' S3 storage. Instances launched from this new custom AMI include all the customizations made when we created the AMI. AMIs are used with autoscaling groups in AWS. Whenever the autoscaling group needs to add an instance based on the pre-defined rules, it retrieves the required AMI and launches the instance.
Here is a typical setup of uAchieve in AWS.
uAchieve Disaster Recovery Plans
For uAchieve in the cloud, we have 2 disaster recovery plans in place. This allows uAchieve to be very resilient in the event of a disaster.
The first plan covers a scenario where one or two Availability Zones (AZs) hosted by AWS become unavailable. The affected AZs would have one or more instances running one or more components of uAchieve. The impacted instances could belong to the uAchieve Engine, the uAchieve Web Applications or the uAchieve database.
The second plan covers a scenario where the North Virginia (NV) region of AWS becomes unavailable. To make sure uAchieve can be restored easily and quickly, all the resources and services it uses need to be in place in another region in advance.
If AZs Become Unavailable
The uAchieve Engine runs on a minimum of 2 instances spread across 2 AZs. In the event that one or two AZs become unavailable, the autoscaling group will automatically launch a new instance in one of the unaffected zones. Currently, uAchieve uses 3 AZs to distribute the environment.
In addition, a production copy of the uAchieve Engine instance is kept as an AMI. If needed, the AMI can be launched manually and placed in the autoscaling group within seconds.
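The automatic replacement described above amounts to spreading the required number of instances over whichever zones are still healthy. The sketch below models that placement under stated assumptions; it is not the AWS Auto Scaling placement algorithm. The zone names `az-1` to `az-3` and the function `place_instances` are hypothetical.

```python
from typing import Dict, FrozenSet, List


def place_instances(count: int, zones: List[str],
                    unavailable: FrozenSet[str] = frozenset()) -> Dict[str, int]:
    """Spread `count` instances evenly over the zones that are still available."""
    healthy = [z for z in zones if z not in unavailable]
    if not healthy:
        raise RuntimeError("no Availability Zone left in this region")
    placement = {z: 0 for z in healthy}
    for i in range(count):
        # Rotate through the healthy zones so no zone is overloaded.
        placement[healthy[i % len(healthy)]] += 1
    return placement


ZONES = ["az-1", "az-2", "az-3"]  # hypothetical names for the 3 AZs in use

print(place_instances(2, ZONES))                                # normal operation
print(place_instances(2, ZONES, frozenset({"az-1"})))           # one AZ lost
print(place_instances(2, ZONES, frozenset({"az-1", "az-2"})))   # two AZs lost
```

With 3 zones in use, the minimum of 2 instances can still be met after losing one or even two zones; only the loss of all three requires the regional plan.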
uAchieve Web Applications
The uAchieve Web Applications run on a minimum of 2 instances spread across 2 AZs. In the event that one or two AZs become unavailable, the autoscaling group will automatically launch a new instance in one of the unaffected zones. Currently, uAchieve in the cloud uses 3 AZs to distribute the environment.
In addition, a production copy of the uAchieve Web Applications instance is kept as an AMI. If needed, the AMI can be launched manually and placed in the autoscaling group within seconds.
The uAchieve database is a Multi-AZ RDS instance. With a Multi-AZ DB Instance, Amazon RDS automatically creates a primary DB Instance and synchronously replicates the data to a standby instance in a different Availability Zone (AZ). Daily snapshots of the entire database are automatically taken and saved to S3. A new database that is an exact point-in-time copy of the original can be provisioned from a snapshot in minutes.
If the AZ hosting the primary RDS instance is unavailable, AWS will switch the DNS to point to the secondary RDS instance without any intervention by us or the applications. The RDS instance that went down will be restored by AWS automatically and will then act as the standby.
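The key point of the failover described above is that the applications keep using one endpoint name while the instance behind it changes. The following sketch models only that idea; it does not call any AWS API, and the endpoint and zone names are made up for illustration.

```python
class MultiAzDatabaseSketch:
    """A fixed endpoint name that always resolves to the current primary's AZ."""

    def __init__(self, endpoint: str, primary_az: str, standby_az: str) -> None:
        self.endpoint = endpoint      # what the applications connect to; never changes
        self.primary_az = primary_az
        self.standby_az = standby_az

    def resolve(self) -> str:
        """Return the AZ currently serving traffic for the endpoint."""
        return self.primary_az

    def az_outage(self, az: str) -> None:
        """Simulate an AZ failure. Losing the primary promotes the standby."""
        if az == self.primary_az:
            # The standby takes over; the failed instance later returns as the new standby.
            self.primary_az, self.standby_az = self.standby_az, self.primary_az


# All names below are hypothetical.
db = MultiAzDatabaseSketch("client-db.example.internal", "az-1", "az-2")
endpoint_before = db.endpoint
db.az_outage("az-1")
print(db.resolve())                     # traffic is now served from the former standby
print(db.endpoint == endpoint_before)   # the applications' connection string is unchanged
```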
If North Virginia Region Becomes Unavailable
If the N. Virginia region is unavailable, the AWS Oregon region will be used to restore uAchieve. Oregon is located across the country from NV and has most of the NV services that we need for uAchieve.
The Oregon uAchieve environment has been provisioned with settings similar to those of the N. Virginia environment. This includes all the needed settings for:
- Internet Gateway
- NAT Gateway
- Route Tables
- Security Groups
- S3 Buckets
- Amazon Machine Images (AMIs)
- RDS snapshots
- Route 53
- Autoscaling & Load Balancing
- Alarms and Notifications
The latest AMI for production clients will be copied to the Oregon region. Whenever an AMI is updated, a new copy will be made and moved to Oregon.
In AWS NV, the RDS instances are snapshotted every night. We have set up automated processes that copy each RDS snapshot to Oregon once it completes. Oregon keeps the snapshots from the past 2 days at any given time. Each snapshot contains all the data needed to restore an RDS instance in Oregon.
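The two-day retention rule can be expressed as keeping the newest snapshots and pruning the rest. The sketch below assumes one snapshot per night, so "the past 2 days" is treated as "the 2 most recent snapshots". The snapshot identifiers and the `split_by_retention` helper are hypothetical, and the real automated process is not shown here.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Snapshot(NamedTuple):
    snapshot_id: str
    created: date


def split_by_retention(snapshots: List[Snapshot],
                       keep: int = 2) -> Tuple[List[Snapshot], List[Snapshot]]:
    """Return (snapshots to keep, snapshots to delete), newest first."""
    ordered = sorted(snapshots, key=lambda s: s.created, reverse=True)
    return ordered[:keep], ordered[keep:]


# Hypothetical copies present in the DR region after three nights.
copies = [
    Snapshot("client-db-night-1", date(2024, 1, 1)),
    Snapshot("client-db-night-2", date(2024, 1, 2)),
    Snapshot("client-db-night-3", date(2024, 1, 3)),
]
kept, expired = split_by_retention(copies)
print([s.snapshot_id for s in kept])     # the two most recent nights
print([s.snapshot_id for s in expired])  # older copies that can be removed
```

In a restore, the first entry of the kept list is the snapshot to use, which is why the list is ordered newest first.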
Steps to Take When NV is Unavailable
When NV becomes unavailable and the Oregon region needs to be activated, the following steps will be taken to bring up each client’s environment.
- Restore the RDS snapshot, giving the new RDS instance the same name it had in NV.
- Add RDS security groups.
- Start the NAT Gateway using the static IP.
- Associate the NAT Gateway with the 3 private subnets.
- Update autoscaling properties:
  - Desired Capacity: 2
  - Min: 2
  - Max: 4 (or whatever the client needs)
- Update Route 53 record set for client to point to ELB in Oregon instead of NV.
- Verify applications are running and working properly.
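Because the steps above depend on one another (for example, the applications cannot be verified before the database is restored), it can help to treat them as an ordered runbook that stops at the first failure. The sketch below shows that control flow only. Every action is a placeholder that simply returns True, and the names `run_runbook`, `OREGON_STEPS` and `valid_capacity` are assumptions made for illustration, not part of any existing tooling.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def valid_capacity(minimum: int, desired: int, maximum: int) -> bool:
    """Autoscaling settings are consistent when min <= desired <= max."""
    return 0 <= minimum <= desired <= maximum


def run_runbook(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step failed: {name}; completed so far: {completed}")
        completed.append(name)
    return completed


# Placeholder actions: a real runbook would perform the AWS operation here.
OREGON_STEPS: List[Step] = [
    ("Restore RDS snapshot under the NV name", lambda: True),
    ("Add RDS security groups", lambda: True),
    ("Start NAT Gateway using the static IP", lambda: True),
    ("Associate NAT Gateway with the 3 private subnets", lambda: True),
    ("Update autoscaling properties", lambda: valid_capacity(2, 2, 4)),
    ("Update Route 53 record set to the Oregon ELB", lambda: True),
    ("Verify applications are running", lambda: True),
]

for finished in run_runbook(OREGON_STEPS):
    print("done:", finished)
```

Stopping at the first failed step keeps the DNS switch from happening before the environment behind it is ready.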
When a disaster occurs, cloud-based uAchieve will be ready to be restored automatically in other AZs or in the Oregon region. If the disaster is isolated to some AZs within N. Virginia, the current DR setup in AWS will handle the recovery with no interruption of services or loss of data. If the entire N. Virginia region becomes unavailable, the Oregon region has been provisioned with all the needed images, RDS snapshots, autoscaling groups, load balancers and DNS records. The restoration time in Oregon will be less than 30 minutes and data loss will be limited to RDS changes made since the last snapshot.
We never hope for a disaster to occur, but if one does, uAchieve in the cloud will be ready and your applications and data will be restored quickly and with minimal loss.