Dual SVP Failover Testing with OpenShift (HSPC)

By Vikash Taank posted 02-13-2023 14:43

  

Table of Contents (Quick Links)

Executive Summary
Test Environment Configuration
Test Design and Implementation
Test Procedures

Environment Prerequisite
Test Items

Test Results
Summary

Executive Summary

Hitachi Virtual Storage Platform (VSP) 5000 series storage systems provide high availability and represent the industry’s highest performing and most scalable storage solution. The VSP 5000 series offers high performance, high availability, and reliability for enterprise-class data centers and features the industry's most comprehensive suite of local and remote data protection capabilities. In a high-availability environment, the primary service processor (SVP) is the active unit, while the secondary SVP acts as the hot standby. If the primary SVP fails, the hot standby SVP takes over. The dual-SVP configuration therefore eliminates the SVP as a single point of failure.

In this case, we used an OpenShift environment along with Hitachi Storage Plug-in for Containers (HSPC) to test the high availability configuration for the SVP.

The primary purposes of the tests are as follows:

  • Check whether the Dual SVP failover completes successfully (observe the Primary/Standby status change), then reinstate the SVPs to their original state after failover. Also measure the time taken for the SVP failover (the complete move from Primary to Standby).
  • Run OpenShift operations during the failover. This includes a mix of storage operations and container operations (both using HSPC).

Test Environment Configuration

A detailed component summary of the test environment is provided in Table 1.

Table 1: Testbed Information

Testbed Configuration

The following image shows a high-level overview of the test bed configuration:

Figure 1: High-level Test Bed Configuration Overview

The following lists the environment prerequisites:

  • One VSP 5600-2N storage system used as the target storage system (Dual SVP configuration).
  • A six-node Red Hat OpenShift cluster installed using the “BareMetal (x86-64) Assisted Installer” from the official Red Hat site.
  • The OpenShift cluster consisted of three Worker nodes and three Controller/Master nodes.
  • HSPC v3.10 installed from OperatorHub (after the cluster was deployed).
  • After deploying HSPC v3.10, the following was configured:
    • A Persistent Volume Claim (PVC) was created; the underlying Persistent Volume (PV) is created by default.
    • A StatefulSet application consisting of WordPress and MariaDB was installed using Helm, and the related pods were created automatically.
    • Initially, two pods for MariaDB were created for testing.
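The PVC mentioned in the prerequisites can be sketched as a manifest built in Python. This is only an illustration under stated assumptions: the claim name `mariadb-data`, the namespace `demo`, the StorageClass name `hspc-sc`, the access mode, and the 1Gi size are all hypothetical and are not values from the test bed.

```python
import json

def build_pvc(name, namespace, storage_class, size):
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

# Hypothetical names and size, for illustration only.
pvc = build_pvc("mariadb-data", "demo", "hspc-sc", "1Gi")
print(json.dumps(pvc, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. Note that an online resize of such a claim is only accepted when its StorageClass sets `allowVolumeExpansion: true`.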

Test Items

The following lists the test items targeted in this project:

  • Test the failover of the SVP in a Dual SVP configuration.
  • During an SVP failover, trigger the ‘Resize existing PVC (ONLINE - Storage Operation using HSPC)’ operation in OpenShift and observe its behavior as recorded in the OpenShift UI.
  • During an SVP failover, initiate the ‘Scale the StatefulSet Replica from 2 to 3 (DB Pods)’ operation in OpenShift and observe its behavior as recorded in the OpenShift UI.
  • During an SVP failover, trigger the ‘Scale down the underlying pods for StatefulSets app from 3 to 2’ operation in OpenShift and observe its behavior as recorded in the OpenShift UI.
  • Test items 2, 3, and 4 (the three operations above) were run serially, with multiple ‘Switch SVP’ operations triggered as needed.
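In this test the three operations were driven from the OpenShift UI. For readers who prefer the command line, the sketch below builds what should be the equivalent `kubectl` invocations (`oc` accepts the same subcommands). The resource names, namespace, and sizes are hypothetical, and the code only builds and prints the commands; it does not run them.

```python
import json
import shlex

def resize_pvc_cmd(pvc, namespace, new_size):
    """Build the kubectl command that requests a larger size for a PVC."""
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return ["kubectl", "-n", namespace, "patch", "pvc", pvc,
            "-p", json.dumps(patch)]

def scale_statefulset_cmd(name, namespace, replicas):
    """Build the kubectl command that sets a StatefulSet's replica count."""
    return ["kubectl", "-n", namespace, "scale", "statefulset", name,
            f"--replicas={replicas}"]

# Hypothetical names and sizes, in the order of test items 2, 3, and 4.
commands = [
    resize_pvc_cmd("data-mariadb-0", "demo", "2Gi"),
    scale_statefulset_cmd("mariadb", "demo", 3),
    scale_statefulset_cmd("mariadb", "demo", 2),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run` on a machine that is logged in to the cluster.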

Test Results

Checking the Failover of SVP in Dual SVP Configuration

In the lab, the ‘SWITCH SVP’ operation was triggered, and it was verified that the SVP switch works successfully. 

Observations are as follows:

  • ‘Switch SVP’ takes about 25 minutes to complete, with most of that time spent copying the configuration. The SVP remains available for most of this period.
  • At the end of the process, the actual switchover takes approximately three minutes, during which the SVP stops responding (RDP connections to the primary SVP time out because of the restart). These three minutes are the critical failover period, during which all HSPC and OpenShift testing was performed.
  • Following the switchover (failover), the SVP continued to operate with the same primary IP address.
  • Other than the change in desktops, it is very difficult to tell the primary and secondary SVP operating systems apart.
  • Initiating another round of the Switch SVP operation reinstates the SVPs to their original state.
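One way to time the unavailable window is to probe the SVP address at a fixed interval and look for the longest run of failed probes. The sketch below assumes the probe results are already collected as (timestamp in seconds, reachable) pairs; the function name is hypothetical, and the sample data is synthetic, shaped to resemble a three-minute outage near the end of a 25-minute switch. It is not measured data from this test.

```python
def longest_outage(probes):
    """Return the longest unreachable window in seconds.

    `probes` is a time-ordered list of (timestamp_seconds, reachable) pairs.
    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for ts, reachable in probes:
        if not reachable and outage_start is None:
            outage_start = ts          # outage begins
        elif reachable and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None        # outage ends
    # An outage still open at the last probe is measured up to that probe.
    if outage_start is not None and probes:
        longest = max(longest, probes[-1][0] - outage_start)
    return longest

# Synthetic probes every 30 s over 25 minutes; unreachable from 1290 s to 1470 s.
probes = [(t, not (1290 <= t < 1470)) for t in range(0, 1501, 30)]
outage = longest_outage(probes)
print(f"longest outage: {outage / 60:.1f} min")
```

The resolution of the result is limited by the probe interval, so a shorter interval gives a tighter estimate.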


The following are some screenshots of the full procedure:

Figure 1: Switch SVP initiate operation.

Figure 2: Transfer of configuration data from Master to standby SVP.

Figure 3: Started the Switch SVP operation.

Figure 4: As the actual switch process occurred, the primary RDP became inactive.

Figure 5: Switch SVP completed successfully.

Resizing the existing PVC (ONLINE- Storage Operation using HSPC)

The ‘Resize existing PVC’ operation was performed during the SVP failover period (approximately three minutes of downtime). The operation was triggered a minute after the actual switching started. It initially failed and was retried while the storage was unavailable. After the storage became available, it completed successfully, as seen in the OpenShift UI.
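The retry behavior seen in the UI follows a common controller pattern: a failed request is repeated with increasing delays until the backend answers. The following is a simplified simulation of that pattern, not the actual HSPC or Kubernetes code; `StorageUnavailable`, `retry_until_available`, `fake_resize`, and the delay values are all hypothetical.

```python
import time

class StorageUnavailable(Exception):
    """Raised by the simulated storage call while the SVP is switching."""

def retry_until_available(operation, initial_delay=1.0, max_delay=60.0,
                          max_attempts=20, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure.

    Returns (result, attempts). Re-raises after `max_attempts` failures.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except StorageUnavailable:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)

# Simulated resize: fails three times (storage unavailable), then succeeds.
calls = {"n": 0}

def fake_resize():
    calls["n"] += 1
    if calls["n"] <= 3:
        raise StorageUnavailable("SVP not responding")
    return "resized"

waits = []  # record the delays instead of actually sleeping
result, attempts = retry_until_available(fake_resize, sleep=waits.append)
print(result, attempts, waits)
```

Passing `sleep=waits.append` keeps the simulation instant; with the default `time.sleep` the same loop would wait for real.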

The screenshots and HSPC logs are as follows: 

Figure 6: Initial size of the PVC.

Figure 7: Target size of the PVC.

Figure 8: Operation is being retried as the storage is unavailable.

Figure 9: Operation completes after the storage becomes available.

Figure 10: The HSPC logs show that the Resize operation failed because the storage was unavailable.

Figure 11: The HSPC logs show that the Resize operation completed when storage became available.

Figure 12: Kubectl snippet for PVC.

Scaling the StatefulSet Replica from two to three (DB Pods)

Scaling up the StatefulSet replicas (DB pods from two to three) produced the same result as the PVC resize. The operation was triggered a minute after the SVP failover began (approximately three minutes of downtime). It initially failed and waited for the storage to become available. After the storage was available, it completed successfully, as seen in the OpenShift UI.

The operation screenshots and HSPC logs are as follows:

Figure 13: Initial number of PODs.

Figure 14: Initiating the scale up operation of PODs (two to three).

Figure 15: The POD creation is in pending state because the storage is unavailable.

Figure 16: The operation continues to wait until the storage is available.

Figure 17: The operation completes after the storage is available.
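Whether a scale-up has finished can also be read from the StatefulSet object itself: `spec.replicas` holds the requested count and `status.readyReplicas` the number of ready pods. The sketch below takes the JSON that `kubectl get statefulset <name> -o json` returns; the function name is hypothetical and the two sample documents are hand-written, trimmed to the relevant fields.

```python
import json

def scale_progress(statefulset):
    """Return (ready, desired, done) for a StatefulSet object given as a dict."""
    desired = statefulset.get("spec", {}).get("replicas", 1)
    # readyReplicas is omitted from the status while no pod is ready.
    ready = statefulset.get("status", {}).get("readyReplicas", 0)
    return ready, desired, ready >= desired

# Hand-written samples: third pod pending, then all three pods ready.
during = json.loads('{"spec": {"replicas": 3}, "status": {"replicas": 3, "readyReplicas": 2}}')
after = json.loads('{"spec": {"replicas": 3}, "status": {"replicas": 3, "readyReplicas": 3}}')

for label, sts in (("during switch", during), ("after switch", after)):
    ready, desired, done = scale_progress(sts)
    print(f"{label}: {ready}/{desired} ready, done={done}")
```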

Figure 18: The HSPC logs show that when the storage is unavailable, the scale up of POD operation fails.

Figure 19: The HSPC logs show that when the storage is available, the scale up of POD operation completes.

Scaling down the underlying pods for Stateful app (from three to two)

The final test run was ‘Scale down the underlying Pods of stateful app from three to two’. The operation was triggered a minute after the SVP failover began (approximately three minutes of downtime). The OpenShift UI reports that the Pod (PVC) is deleted even though the Switch SVP operation is still in progress. However, the persistent volume remains in ‘Released’ status until the Switch SVP operation is completed, after which it is deleted.
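Volumes waiting for deletion can be spotted by their phase: Kubernetes reports it in `status.phase` of each PersistentVolume, with `Released` meaning the claim is gone but the volume has not yet been reclaimed. The sketch below filters the JSON list that `kubectl get pv -o json` returns; the function name is hypothetical and the sample list, including the volume names, is hand-written.

```python
def pvs_in_phase(pv_list, phase):
    """Return the names of PersistentVolumes whose status.phase equals `phase`."""
    return [
        pv["metadata"]["name"]
        for pv in pv_list.get("items", [])
        if pv.get("status", {}).get("phase") == phase
    ]

# Hand-written sample: two volumes still bound, one left over from the scale-down.
sample = {
    "kind": "List",
    "items": [
        {"metadata": {"name": "pvc-aaa"}, "status": {"phase": "Bound"}},
        {"metadata": {"name": "pvc-bbb"}, "status": {"phase": "Bound"}},
        {"metadata": {"name": "pvc-ccc"}, "status": {"phase": "Released"}},
    ],
}

released = pvs_in_phase(sample, "Released")
print("released volumes:", released)
```

Running the same check again after the switch completes should return an empty list once the volume has been deleted.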

The operation screenshots and HSPC logs are as follows:     

Figure 20: Initial number of PODs, to be scaled down from three to two.

Figure 21: Triggered the scale down operation (from three to two)

Figure 22: Persistent volume moves to ‘Released’ status until the switch SVP operation is completed.

Figure 23: Persistent volume gets deleted when the storage becomes available.

Figure 24: The HSPC logs show that the scale down of POD operation fails when the storage is unavailable.

Figure 25: The HSPC logs show that the scale down of POD operation completes when the storage is available.

Summary

  • ‘Switch SVP’ in a dual SVP configuration was successful and took about three minutes for the actual switch/failover. During this period, the SVP was unavailable (downtime).
  • When triggered during SVP failover (downtime), ‘Resize PVC’ initially fails. The OpenShift UI shows that it keeps retrying until the Switch SVP completes, after which it succeeds.
  • When triggered during SVP failover (downtime), ‘Scaling up the Replica’ also initially fails. The OpenShift UI shows that it keeps retrying until the Switch SVP completes, after which it succeeds.
  • When triggered during SVP failover (downtime), ‘Scaling down the underlying pods for Stateful app’ passes. However, the Persistent Volume remains in the ‘Released’ state until the SVP is available.

All tests were completed successfully under lab conditions.

#DataProtection  #FlashStorage 
