Higher Availability with Global-Active-Device Cloud Quorum using AWS Auto Scaling

By Jonathan De La Torre posted 10-31-2022 18:34

  

Introduction

At Hitachi Vantara, we are always searching for ways to make sure that our storage products are fault tolerant and highly available. Although Amazon Web Services (AWS) Availability Zone outages are rare, every organization should be prepared in case one occurs. My team wondered what would happen to Global-Active-Device (GAD) Cloud Quorum if an Availability Zone went down. After running a few tests, we found that the quorum experienced a blockage on the Virtual Storage Platform (VSP), while host I/O remained unaffected. The blockage also resolved automatically after the Availability Zone was restored.

We then wondered what would happen if an Availability Zone were down for more than a day. Although this is very unlikely, it is still a possibility. If you’re interested in learning more about how often AWS outages occur, check out this blog where Wojciech Gawronski explores the complete history of AWS outages. Finally, we decided to investigate whether the quorum could keep running during an extended Availability Zone outage without depending on the zone coming back up.

What is AWS EC2 Auto Scaling?

AWS EC2 Auto Scaling is a free service that monitors your EC2 instance and automatically launches a new instance if the original instance becomes “unhealthy.” In this blog, we discuss the benefits of using AWS EC2 Auto Scaling with Hitachi GAD Cloud Quorum to ensure that your cloud quorum is up and running in a different, unaffected zone if there is an Availability Zone outage.

Terms to Know

  • AWS EC2 Auto Scaling
    • A service that monitors the health of running instances and automatically replaces impaired EC2 instances.
  • Auto Scaling Group
    • A collection of EC2 instances that are managed as a single group for scaling purposes.
  • Launch Templates
    • A template that provides all the information for instance configuration when a new EC2 is created.
  • Availability Zone
    • One or more discrete data centers with networking and connectivity in an AWS Region.
  • Global-Active-Device Cloud Quorum
    • An Amazon Machine Image in the AWS marketplace that automates setting up a GAD quorum in a cloud environment.

 

Testing Auto Scaling with GAD Cloud Quorum

By using AWS EC2 Auto Scaling with GAD Cloud Quorum, you do not need to wait for an Availability Zone to come back up to keep the GAD Cloud Quorum running. If you experience hardware problems or an extended Availability Zone outage, you can use the GAD Cloud Quorum AMI, which is available in the AWS Marketplace, as the basis of the launch template, and a new cloud quorum is automatically created in an unaffected Availability Zone. The process is easy because the GAD Cloud Quorum AMI launch template takes care of most of the setup required to launch a GAD Cloud Quorum in the cloud. For the Auto Scaling group configuration, you only need to specify the desired capacity, minimum capacity, maximum capacity, and the Availability Zones in which you want to launch the EC2 instances. Because we typically want exactly one quorum available, we set the desired, minimum, and maximum capacity to 1 (1:1:1); however, if your environment uses multiple EC2 instances, you can use other values.
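To make the 1:1:1 configuration concrete, here is a minimal sketch of the request we would expect to send when creating the Auto Scaling group. It only builds the parameters and does not call AWS. The group name, launch template name, and subnet IDs are hypothetical placeholders; the keys follow the parameter names of the boto3 `create_auto_scaling_group` call, and the choice of boto3 itself is our assumption, because the setup described here could equally be done in the AWS console.

```python
def build_quorum_asg_request(group_name, launch_template_name, subnet_ids):
    """Build parameters for an Auto Scaling group that keeps exactly one quorum instance.

    subnet_ids should contain one subnet per Availability Zone that is allowed
    to host the quorum; all names passed in are hypothetical placeholders.
    """
    if len(subnet_ids) < 2:
        # With a single subnet there is no other zone to fail over to.
        raise ValueError("provide subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # 1:1:1 capacity: desired, minimum, and maximum are all one instance.
        "DesiredCapacity": 1,
        "MinSize": 1,
        "MaxSize": 1,
        # Subnets are passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


request = build_quorum_asg_request(
    "gad-cloud-quorum-asg",       # hypothetical group name
    "gad-cloud-quorum-template",  # hypothetical launch template name
    ["subnet-aaa", "subnet-bbb"], # hypothetical subnets in two zones
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```

Keeping the maximum at 1 means Auto Scaling replaces the quorum rather than adding a second one next to it.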



After we had Auto Scaling working with GAD Cloud Quorum, we wanted to see how the failover worked by replacing a quorum and launching a new one in a different Availability Zone. At first, it was challenging to see it in action because emulating an Availability Zone outage was not as simple as clicking a button to bring down an Availability Zone. However, after exploring the many technologies that AWS offers, we came up with a unique way to emulate an outage. We attached a Network Load Balancer and modified the security groups to make it appear as though no traffic could go through. As a result, Auto Scaling detected an unhealthy instance in a particular zone and launched a new instance in a different unaffected Availability Zone. When we checked the Activity History, we saw that the instance was replaced in a different Availability Zone, which was the Auto Scaling failover in action.
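The Activity History check described above can also be scripted. The sketch below assumes activity records shaped like the entries returned by the boto3 `describe_scaling_activities` call, where `Details` is a JSON string that includes an "Availability Zone" field. The sample records, instance IDs, zone names, and timestamps are hypothetical and are not taken from our test.

```python
import json


def launch_zones(activities):
    """Return the Availability Zone of each launch activity, oldest first."""
    zones = []
    for activity in sorted(activities, key=lambda a: a["StartTime"]):
        if not activity["Description"].startswith("Launching"):
            continue  # skip terminations and other activity types
        details = json.loads(activity.get("Details", "{}"))
        zone = details.get("Availability Zone")
        if zone:
            zones.append(zone)
    return zones


def zone_changed(activities):
    """True if the most recent launch landed in a different zone than the one before it."""
    zones = launch_zones(activities)
    return len(zones) >= 2 and zones[-1] != zones[-2]


# Hypothetical sample data, newest first, as the console lists it.
sample_activities = [
    {
        "Description": "Launching a new EC2 instance: i-0bbb",
        "StartTime": "2022-10-20T10:07:00Z",
        "Details": '{"Subnet ID": "subnet-bbb", "Availability Zone": "us-west-2b"}',
    },
    {
        "Description": "Terminating EC2 instance: i-0aaa",
        "StartTime": "2022-10-20T10:05:00Z",
        "Details": '{"Subnet ID": "subnet-aaa", "Availability Zone": "us-west-2a"}',
    },
    {
        "Description": "Launching a new EC2 instance: i-0aaa",
        "StartTime": "2022-10-19T09:00:00Z",
        "Details": '{"Subnet ID": "subnet-aaa", "Availability Zone": "us-west-2a"}',
    },
]

print(launch_zones(sample_activities))
print(zone_changed(sample_activities))
```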




It was very cool to see Auto Scaling automatically replace an unhealthy instance. This is what we expected, but we did not want to wait for a real zone outage to find out whether it would actually work.

Pros and Cons of Auto Scaling with GAD Cloud Quorum

PROS:

  • A new GAD Cloud Quorum is automatically created if there are any hardware issues or Availability Zone outages.
  • If a quorum is launched in a new zone because of an Availability Zone outage, you are notified through different forms of communication such as SMS and email.
  • Extra redundancy across Availability Zones.
  • No user interaction is required through AWS to launch a new quorum.
  • No extra cost is required to use AWS Auto Scaling with GAD Cloud Quorum.
  • Auto Scaling only needs to be set up once for use with GAD Cloud Quorum.

CONS:

  • To complete the setup process, user interaction with Hitachi Command Control Interface (CCI) or Hitachi Storage Navigator is still required.
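The SMS and email notifications mentioned in the list above are typically delivered through an Amazon SNS topic attached to the Auto Scaling group. SNS is our assumption here, because the list does not name the mechanism. This minimal sketch builds the parameters for the boto3 `put_notification_configuration` call; the group name and topic ARN are hypothetical.

```python
# Events worth hearing about when a quorum is replaced.
QUORUM_EVENTS = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE",
]


def build_notification_request(group_name, topic_arn):
    """Build parameters that subscribe an SNS topic to quorum launch and terminate events."""
    parts = topic_arn.split(":")
    # An ARN looks like arn:partition:service:region:account:resource.
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError("topic_arn does not look like an SNS topic ARN")
    return {
        "AutoScalingGroupName": group_name,
        "TopicARN": topic_arn,
        "NotificationTypes": list(QUORUM_EVENTS),
    }


notification = build_notification_request(
    "gad-cloud-quorum-asg",                                   # hypothetical group name
    "arn:aws:sns:us-west-2:123456789012:gad-quorum-alerts",  # hypothetical topic
)
print(notification)

# With boto3 installed and credentials configured:
#   boto3.client("autoscaling").put_notification_configuration(**notification)
```

Email addresses and phone numbers would then be subscribed to the topic itself, so the Auto Scaling group never needs to change when recipients do.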

 

Tips and Reminders for Auto Scaling with GAD Cloud Quorum

If a new instance is launched in a different Availability Zone, its IP address changes because the instance uses a different subnet in the new zone. Additionally, because EBS volumes are specific to an Availability Zone, the original EBS volume stays behind in the old zone. You can recover the EBS volume by creating a snapshot through AWS and using that snapshot to create a new EBS volume in the new Availability Zone. However, the quorum disk mainly stores metadata about the status of the primary and secondary storage systems, and it is mainly used when a storage system goes down and the data becomes inconsistent. For that reason, we found that recovering the EBS volume wasn’t worth the hassle when migrating a quorum to a different Availability Zone. After the migration is complete, the new EBS volume is repopulated with the metadata about the status of the primary and secondary storage systems.

Finally, there are a few steps that you must complete using Hitachi Storage Navigator or Hitachi CCI. You can complete the setup with six clicks in Hitachi Storage Navigator or six commands in Hitachi CCI.
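For readers who do want to carry the quorum disk across zones, the snapshot-based recovery described above comes down to two EC2 API calls. The sketch below only lists those calls in order and does not contact AWS. The operation and parameter names follow boto3 `create_snapshot` and `create_volume`; the volume ID and target zone are hypothetical.

```python
def plan_volume_move(volume_id, target_zone, snapshot_id="<snapshot id from step 1>"):
    """Return the ordered EC2 operations that copy an EBS volume into another zone.

    snapshot_id is only known after the first call returns, so a placeholder
    is used until then.
    """
    return [
        # Step 1: snapshots are regional, so they are not tied to the old zone.
        ("create_snapshot", {
            "VolumeId": volume_id,
            "Description": f"copy of quorum disk {volume_id}",
        }),
        # Step 2: build a new volume from the snapshot in the target zone.
        ("create_volume", {
            "SnapshotId": snapshot_id,
            "AvailabilityZone": target_zone,
        }),
    ]


plan = plan_volume_move("vol-0123456789abcdef0", "us-west-2b")  # hypothetical values
for step, (operation, params) in enumerate(plan, start=1):
    print(step, operation, params)
```

In a real run, the snapshot must reach the completed state before the second call, and the new volume still has to be attached to the replacement instance.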

 

High-level steps using Hitachi CCI raidcom commands:

  1. raidcom add external_iscsi_name
  2. raidcom add path
  3. raidcom add external_grp
  4. raidcom add ldev
  5. raidcom disconnect external_grp
  6. raidcom replace quorum

 

 

 

Summary

Overall, it was interesting to investigate a way to protect the GAD Cloud Quorum against extended Availability Zone outages. By looking at all the available AWS offerings, we found that AWS EC2 Auto Scaling, which is typically used to scale applications based on usage, could also solve other problems, such as protecting a quorum against zone outages. To see the Auto Scaling failover in action, we also came up with a unique way to emulate a zone outage by using a Network Load Balancer.

Visit AWS Marketplace

You can check out and use the Hitachi Global-Active-Device Cloud Quorum for free in the AWS Marketplace. It saves a lot of time and makes deploying a GAD quorum in the cloud very simple while ensuring that you follow the best practices for redundancy by having the quorum in a third site. Additionally, it has a friendly user interface, and you can easily make changes to your quorum setup using the menu.

Comments

12 days ago

crisp and clear

12 days ago

good read

15 days ago

Innovative and informative article for sure. Thanks for sharing it with all of us.

17 days ago

Very useful information.

17 days ago

Very useful and productive.

18 days ago

Thanks for sharing this informative article...

11-01-2022 21:01

Nice perspective on a use case that has a low probability of occurring but could have a big impact on customers' data.  I liked the approach you took to show how Network Load Balancer can be used to protect a quorum against zone outages

11-01-2022 07:09

Very helpful.