Cris Danci

GAD - Active/Active and why it matters - Part 2, the reality check

Blog Post created by Cris Danci on May 14, 2015

So, less than 24 hours after my post on GAD - Active/Active and why it matters, someone reached out to me directly - with a slight caution.  This someone, a former customer, essentially wanted me to get one point across: HA solutions like the ones I described do come with pitfalls. They felt I had somewhat glorified the solution, and that I should set the record straight on where such solutions fit in the bigger picture.


The back story comes from this particular customer’s personal experience.  Some time ago, the customer went through an infrastructure refresh and was convinced by another vendor (not going to burn anyone here) that 'traditional DR' was dead and that the new craze was active/active data centres providing DR capabilities - which the vendor could, obviously, provide through a new solution.   This was really a sales tactic: the other vendor wasn’t the incumbent, so they had to differentiate themselves and justify a higher-cost solution. As usual, different stakeholders perceived the solution differently, in classic tree-swing fashion.  The technical people saw it as a pure high availability solution and a way to enable things like VMware HA and vMotion over distance (this particular customer did have dark fibre, but the latency was too high for something like FT over distance); the management people (mainly middle management) saw it as an opportunity to simplify the overall DR process and remove the need for "expensive" licensing and the long DR test times associated with running VMware SRM; and the business people saw it as a way to remove downtime entirely, since that’s how both the vendor and middle management presented and sold the solution to them.


Believing the dream, and against all good technical advice, this particular customer dramatically altered their BCP to incorporate an active/active data centre design, with traditional disk-to-tape backup providing the final fail-safe and the offsite copies.  Everything ran quite well for some time - the solution looked like it was holding up, and the customer had several positive experiences using it to keep services up during data centre maintenance events by vMotioning machines to compute nodes at the other data centre... until one fateful afternoon, when a developer executed a DELETE SQL query against the wrong database.  Fortunately, IT (and the developers) had continued to run traditional backups and database maintenance jobs, and took proactive backups of the database before making any major changes.  This meant the database could easily be restored without any data loss, within a very short timeframe - fairly simple. That said, the event, despite not resulting in data loss, was not well received by the business.  This came down to the simple fact that the business had been convinced that a HA solution would eliminate downtime when a disaster occurred. Of course, what a disaster actually is was ill-defined, and from the business’s perspective, the loss of customer records from a primary database was definitely a disaster.  Unfortunately, the solution they forked out all that money for was pretty much useless in this scenario, because they had to fall back to traditional restores.   This brought to light questions about how they would recover the environment if there was a fundamental loss of the cluster itself, or if data consistency issues occurred at scale, since the data was replicated at both sites.
To add insult to injury, as both their data centres were essential to production, they concluded that they’d taken several steps backward architecturally, now depending on tape for a large part of their recovery strategy - particularly when they previously had a mature process around SRM.


The realisation and critical point here is that HA != DR.  Active/active storage solutions that enable functionality like VMware HA and VMware FT over distance are HA solutions, not DR solutions.  They provide high availability to address particular failure use cases within a data centre or at the hardware level.  They provide no protection against logical errors that occur from the hypervisor layer upwards, including operating system failures, application failures or human errors.  A disaster recovery (DR) strategy still needs to be in place to protect the HA system itself from failure, and it must still meet the business’s RPOs and RTOs for all disaster use cases.
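To make that point concrete, here's a toy sketch in Python (the class and record names are purely illustrative, not any vendor's API) of why synchronous active/active replication can't protect you from a logical error: a destructive write is mirrored to both sites just as faithfully as a good one, and only an earlier point-in-time backup still holds the lost data.

```python
import copy

class MirroredVolume:
    """Toy model of an active/active volume pair: every write is
    applied synchronously to both sites."""
    def __init__(self):
        self.site_a = {}
        self.site_b = {}

    def write(self, key, value):
        # A good write lands at both sites.
        self.site_a[key] = value
        self.site_b[key] = value

    def delete(self, key):
        # A mistaken delete is replicated just as faithfully.
        self.site_a.pop(key, None)
        self.site_b.pop(key, None)

volume = MirroredVolume()
volume.write("customer:1001", {"name": "Acme Pty Ltd"})

# Traditional backup: a point-in-time copy, decoupled from replication.
backup = copy.deepcopy(volume.site_a)

# The "wrong database" moment: the delete hits both sites at once.
volume.delete("customer:1001")

assert "customer:1001" not in volume.site_a   # gone at site A
assert "customer:1001" not in volume.site_b   # ...and at site B
assert "customer:1001" in backup              # only the backup can restore it
```

The replication layer did exactly its job - which is the problem: its job is availability, not recoverability.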


To be honest, I thought this was obvious as I was writing my previous post. Speaking to customers, everyone understands that traditional operating system clusters need to be backed up.  It was only when I was reflecting on the conversation last night that I realised I have this conversation frequently with customers in a different context.  These days it's almost always around the Cloud, and the way highly decoupled, distributed applications (running across availability zones) still need backup!  I don’t know why, but it’s as if abstraction has gone to everyone’s head, and they assume that somewhere in the system, fail-safes will exist to cover every failure use case.  Nothing could be further from the truth! This is especially evident in this day and age, when programming standards have become somewhat bastardised compared to their original core values, and changes in and to technologies occur so frequently.