Cris Danci

GAD - Active/Active and why it matters

Blog Post created by Cris Danci on May 12, 2015

After writing my previous series of blog posts, it was brought to my attention that I really didn't dedicate any space to GAD (Global-Active Device), a key feature which will be made available within a few months across the entire VSP G series (200, 400, 600, 800 (when it's released) and, of course, the existing G1000).


For those of you who don't know, GAD provides what would be strictly defined as active/active clustered storage. It allows a LUN to exist on two storage arrays at the same time, both of which offer full read/write access and both of which can serve host I/O. This is very different from something like HAM (High Availability Manager), which was available on the HUS VM and the VSP and is best described as active/passive clustered storage. HAM, like GAD, essentially presented a LUN with the same ID across two storage arrays, but one LUN was always defined as the primary, which accepted host I/O, while the other was the secondary and would not accept host I/O. In a HAM environment, multi-pathing software (HDLM) was used to direct all host I/O to the primary LUN while still maintaining active paths to the secondary LUN. In the event of a failure of the primary array or LUN, HAM (using TrueCopy) would promote the secondary LUN (making it read/write) and the multi-pathing software would route all I/O to the secondary array.
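To make the active/passive mechanics concrete, here is a toy Python sketch of the HAM-style behaviour described above. The class and method names are my own inventions for illustration only; the real promotion logic (TrueCopy role swaps, HDLM path states) is of course far more involved than this.

```python
# Toy model of HAM-style active/passive failover.
# Illustration only; names and logic are simplified assumptions, not HDS internals.

class Array:
    def __init__(self, name, role):
        self.name = name
        self.role = role          # "primary" accepts host I/O, "secondary" does not
        self.failed = False

    def accepts_io(self):
        return self.role == "primary" and not self.failed


class MultipathDriver:
    """Stands in for HDLM: routes host I/O to whichever array is primary."""
    def __init__(self, arrays):
        self.arrays = arrays

    def write(self, data):
        target = next((a for a in self.arrays if a.accepts_io()), None)
        if target is None:
            # Primary lost: promote the surviving secondary (the HAM/TrueCopy role swap)
            survivor = next(a for a in self.arrays if not a.failed)
            survivor.role = "primary"
            target = survivor
        return f"wrote to {target.name}"


paths = MultipathDriver([Array("VSP-A", "primary"), Array("VSP-B", "secondary")])
print(paths.write("block-1"))   # all I/O lands on VSP-A while it is healthy
paths.arrays[0].failed = True
print(paths.write("block-2"))   # VSP-B is promoted and I/O continues
```

In a GAD (active/active) world there is no promotion step at all: both arrays accept I/O from the start, which is precisely the difference the paragraph above describes.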


Regardless of whether you use active/passive or active/active storage, a failure in the storage layer is always handled non-disruptively from a host's perspective; the real difference lies in the underlying operations and the effect they have on design. In an active/active storage cluster, since each array can write to the volume directly, there is considerably less dependency on the link between the arrays. In an active/passive cluster, all reads must come from the primary LUN, and it is possible for a write to take two hops before an acknowledgement is sent back to the host. This of course means more latency, which directly affects how you might architect the layout of the compute and services running above it. Needless to say, storage running in an active/active cluster is much more efficient.
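The latency difference can be sketched with some back-of-the-envelope arithmetic. The figures below are illustrative assumptions, not measurements from any real array; the point is simply that the active/passive "two hop" write can pay the inter-array round trip twice before the host sees an acknowledgement.

```python
# Back-of-the-envelope synchronous write latency.
# All numbers are illustrative assumptions, not measured figures for any array.

LOCAL_WRITE_MS = 0.5      # host -> local array commit
INTER_ARRAY_RTT_MS = 2.0  # round trip on the replication link between arrays

def active_active_write():
    # Host writes to its nearest array; a synchronous pair still costs one
    # inter-array round trip to replicate to the peer before the ack.
    return LOCAL_WRITE_MS + INTER_ARRAY_RTT_MS

def active_passive_write_from_secondary_site():
    # Host sitting at the secondary site: the I/O first hops across to the
    # primary array, then replicates back, before the ack returns -- the
    # "two hop" case described above.
    return INTER_ARRAY_RTT_MS + LOCAL_WRITE_MS + INTER_ARRAY_RTT_MS

print(active_active_write())                       # 2.5
print(active_passive_write_from_secondary_site())  # 4.5
```

Even with these toy numbers, the worst-case active/passive write is nearly double the active/active one, which is why link latency dominates active/passive designs.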


GAD is not the only strict active/active storage solution on the market that provides this capability. EMC with VPLEX and, more recently, IBM with updates to SVC offer it as a separate appliance. Most of the other major vendors, such as HP with Peer Persistence, Dell with Live Volume and NetApp with MetroCluster, provide active/passive clustered storage solutions. There are also a couple of edge cases without a global presence, like Fujitsu, Unisys and Huawei, that offer active/passive solutions. A careful reader will notice that the vendors providing these solutions are all either traditional storage vendors or technology behemoths. There is good reason for this: storage clustering solutions are VERY complex (particularly under the hood), generally come with very strict requirements, and are intensive to develop and support, which is why none of the startups have such solutions.


So why is this so important? As I alluded to in a previous post, there has been increased interest in 3rd platform style applications in recent times, mainly those that are highly decoupled, scalable (outwards) and highly resilient to failure (they degrade gracefully), and which can of course run on clouds or cheap commodity hardware. Most people who aren't programmers or technologists (the business people inside the organisation who have some involvement in IT - shadow IT?) don't really care about the semantics of how these are implemented; they're interested in the underlying characteristics such as availability and reliability. As I've also alluded to, the reality is that most applications these days don't have these characteristics; they're built on open systems (2nd platform concepts) which are designed to run on a single operating system instance. This makes it extremely difficult to obtain these characteristics without making fundamental changes to the application itself or re-platforming it to run in a 3rd platform manner. That in itself is not easy, and there are a myriad of challenges, not only from a development perspective but also around organisational and process change. This means we cannot simply expect a shift to happen overnight. As we mature as an industry, and more importantly as our developers mature, re-platforming will become increasingly easier as the ancillary concepts around it rapidly diffuse into the market - however, this could take years, if not generations, to occur.


In the meantime, we need to provide similar capabilities to existing applications without modification. This is exactly what storage clustering technology helps enable. I say enable because it's only one part of the equation within the infrastructure layer: it simply provides a global or distributed LUN spanning multiple arrays. It is great to have a LUN served from multiple arrays, but if we lose the operating system or the application in the middle of an operation, the underlying storage does nothing to prevent a loss of service. There still needs to be intelligence to ensure that the application and operating system state is retained and is as resilient to failures as the underlying infrastructure. We don't have to look far to see that virtualisation operates at the correct layer and is already capable of providing what is required to maintain operating system and application state during a failure. VMware Fault Tolerance (FT) allows a VM to be mirrored to another host within the same cluster through a lock-step process. This ensures that when a physical host is lost, the underlying VM state is not, and its running state is retained. One of the issues with VMware FT was that it depended on shared storage, meaning a VM running in a fault tolerant state would still be lost if the underlying storage system experienced an outage. Clustered storage combined with VMware FT solves this problem, since you can run an FT configuration across two separate physical arrays. Two separate physical arrays can also mean two physical sites, so when combined, it is possible to run an application across two physical data centres (network bandwidth permitting). Together, these technologies actively protect against failure at every level in the stack. It's not quite the same as 3rd platform availability, but it provides the same value.
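The division of labour in the stack can be summarised with a trivial sketch: FT alone covers a host failure, and only the combination with clustered storage covers an array failure. This is pure illustration under a single-failure assumption; the component names are my own.

```python
# Toy single-failure model: does the service survive a given component failure?
# Pure illustration; component names and the model are assumptions, not vendor logic.

def survives(failure, ft=True, gad=False):
    if failure == "host":
        return ft        # FT's lock-step mirror takes over on another host
    if failure == "array":
        return gad       # needs a second array actively serving the same LUN
    return False

for failure in ("host", "array"):
    print(failure,
          "| FT only:", survives(failure, ft=True, gad=False),
          "| FT + clustered storage:", survives(failure, ft=True, gad=True))
```

The "array" row is the gap the paragraph above describes: FT on shared storage leaves the array as a single point of failure, and clustered storage is what closes it.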


Now for the really important part of why it matters. HDS and VMware combined actually make this a viable solution for EVERYONE! With the release of the VSP Gx00 series, HDS enables this function for customers of all sizes: NO extra hardware appliances, NO extra maintenance and NO complex design (because it's active/active and happening natively)! You can build an active/active storage solution for tens of thousands of dollars, not hundreds of thousands. VMware, with the release of vSphere 6, has dropped FT into standard licenses (limited to 2 vCPUs, 4 vCPUs in Enterprise). The one thing this duo (HDS and VMware) doesn't provide is the network required to get this up and running, which unfortunately still requires high bandwidth, low latency links (amongst other things). For many customers (particularly smaller ones) this might mean containing the environment within a single data centre, but it's still damn cool.


Of course, we will need to wait for HDS and VMware to certify the solution (vSphere Metro Storage Cluster in VMware terminology) on the VSP Gx00; however, it's only a matter of time given that it's already certified for the VSP G1000 and, as we all know from previous posts, all the VSP G arrays run the same version of SVOS.