Patrick Allaire

Lies, Damned Lies and Uptime Statistics

Blog Post created by Patrick Allaire Employee on Jan 30, 2017

Digital Transformation Reflections

 

The holidays were a welcome break from Silicon Valley’s hectic pace and I stepped away from technology marketing and enjoyed family time. I wanted to share some reflections from publications’ insights, customers’ deployments and enquiries that I received in the last few months.

 

I stumbled on some good reading in CIO’s “Guide to Digital Transformation” article from Forbes that HDS sponsored. The interesting point was that IT executives who are pursuing closer alignment between IT and lines of business will appreciate lessons from pioneers and a strawman plan.

 

Many IT executives recognized early that in a digital world, service level objectives and agreements need to be revised upward to the same level as hyperscale data centers (e.g. Google, Facebook, AWS). The simplest justification I found for reframing the customer experience is the research below on the impact of increased webpage load times on sales, traffic and user satisfaction. An outage is simply a far more extreme case, with a correspondingly larger impact on revenue, brand, liabilities and user satisfaction (see below).

 

Milliseconds are Money.png

While the traditional justification for a business continuity solution requires sizing the business impact against the risk and investment profile needed, I have found that IT professionals who are new to business continuity practice are often confused about risk and investment profiles.

 

For customers already engaged in their own digital transformation, the 2017 Society for Information Management (SIM) IT Trends Comprehensive report is a good reminder that these initiatives are often more about data management and legacy system integration than infrastructure. The survey supports this generalization: the most prevalent software development investments reported are in systems integration, legacy system improvements or customizations (see below).

 

SIM 2017 Software Development.png

So, does infrastructure still matter in digital transformation projects? Newer generations of mobile or internet of things (IoT) applications may be resilient to infrastructure failures, but several web services still require integration with the system of record whenever payment, authentication or database access is needed.

Judging from this 2017 SIM survey, business continuity rose to the #5 spot among the most important or worrisome IT management issues in 2016. It appears that these availability expectations are still top of mind…

 

SIM 2017 Most Worrisome.png

Even if many IT leaders report that the “availability/uptime” metric is still the #1 measurement of internal or outsourced IT performance, I suspect the rise and fall of this issue in the rankings is, in part, influenced by the amount of bad press coverage.

 

 

I am surprised that industry analysts are not pushing back more on vendors’ uptime claims. Many all-flash array (AFA) vendors claim six-9s (99.9999%) uptime, but The Register keeps reporting “snafus” such as “HPE 3PAR storage SNAFU takes Australian Tax Office offline” and “XtremIO 'outages bork US hospital patient records system'”.
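To put a six-9s claim in perspective, the short sketch below converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are illustrative assumptions, not part of any vendor's specification.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


# Compare a few common "nines" levels.
for label, pct in [("three-9s", 99.9), ("five-9s", 99.999), ("six-9s", 99.9999)]:
    seconds = allowed_downtime_seconds(pct)
    print(f"{label} ({pct}%): {seconds:,.1f} seconds of downtime per year")
```

Under these assumptions, six-9s leaves a budget of roughly 31.5 seconds of downtime per year, which is why a single publicized outage lasting hours is hard to reconcile with such a claim.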

 

Auditors should be wary of infrastructure deployments that hinge on a single access point after so many infamous airline outages in 2016, such as “Delta computer outage costs $100m”, “British Airways check-in system checks out: Staff flung back to cruel '90s world of paper” and “JetBlue blames Verizon after data center outage cripples flights”.

 

Bottom line: if a single system’s uptime matters to your project, you should be concerned, because this post provides evidence that this specification is often irrelevant. Worse, if your architecture or vendor justification is based on single-system uptime, update your resume quickly, as you are about to make a career-limiting move…

 

An Intel presentation at the 2016 Storage Developer Conference broke down the root causes of an outage in different environments.

 

Intel RAS.png

In traditional enterprise data centers, hardware faults are responsible for about 20% of outages, software errors for 43%, and operator (configuration) faults for 37%.

 

Compare these statistics with the fact that many AFA vendors’ uptime claims are tied solely to the hardware, or reflect only the last 12 months between two software releases; exclude any planned outage needed for an upgrade or a technology refresh during a system migration; come with support that excludes the network and server environment; and offer no commercial guarantee with financial penalties. Can you really rely on self-reported vendor uptime specifications?
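As a rough illustration of why a hardware-only figure can mislead, the sketch below scales a hardware-only uptime claim by the outage shares quoted above. It rests on a strong, purely illustrative assumption: that downtime is split across root causes in the same proportion as outage counts. The function and variable names are hypothetical.

```python
# Share of outages by root cause in traditional enterprise data centers,
# as quoted in this post from the Intel presentation.
OUTAGE_SHARE = {"hardware": 0.20, "software": 0.43, "operator": 0.37}

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def implied_total_availability(hardware_availability_percent: float) -> float:
    """Estimate overall availability from a hardware-only availability claim.

    ASSUMPTION (illustrative only): downtime is distributed across causes
    in the same proportion as the number of outages.
    """
    hardware_downtime = (1.0 - hardware_availability_percent / 100.0) * SECONDS_PER_YEAR
    total_downtime = hardware_downtime / OUTAGE_SHARE["hardware"]
    return 100.0 * (1.0 - total_downtime / SECONDS_PER_YEAR)


print(f"{implied_total_availability(99.9999):.4f}%")
```

With these inputs, a hardware-only six-9s claim corresponds to roughly 99.9995% overall, or about five times the downtime, before counting planned outages or anything outside the array.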

 

Lies Damned Lies and Uptime Statistics .png

Hitachi Data Systems does not provide any narrowly defined uptime specifications; instead, we offer a 100% data availability guarantee. We also recognize that no vendor’s financial penalty can compensate for the business impact associated with a service disruption. This is why Hitachi Virtual Storage Platform (VSP) F series all-flash storage systems were designed with unified active-active capabilities.
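The general case for active-active designs can be sketched with the textbook formula for redundant systems. This is not a model of any Hitachi product: it assumes the two systems fail independently and that failover is instantaneous, which ignores shared software or operator faults. The function name is hypothetical.

```python
def redundant_pair_availability(single_availability_percent: float) -> float:
    """Availability of two redundant systems, in percent.

    ASSUMPTIONS (illustrative only): failures are independent, either system
    can serve all traffic alone, and failover takes no time.
    """
    unavailable = 1.0 - single_availability_percent / 100.0
    # The pair is down only when both systems are down at the same time.
    return 100.0 * (1.0 - unavailable ** 2)


for pct in (99.0, 99.9, 99.99):
    print(f"single system {pct}% -> redundant pair {redundant_pair_availability(pct):.6f}%")
```

Under these idealized assumptions, two systems at 99.9% each reach about 99.9999% as a pair. In practice, correlated faults reduce that gain, which is one reason architecture matters more than any single-system specification.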

 

The growth in Hitachi’s high availability solutions deployment validates that in the real world the expectations of business continuity are on the rise. Our customers are relying more and more on multi-data-center deployments to support their critical business. Here are a few factoids on our unified Global-Active Device solution available for block and file data: In the last 2 years, in a shrinking market, these active-active deployments have grown more than 40% quarter after quarter, across thousands of deployments supporting hundreds of petabytes. In any quarter, up to one out of three systems is targeted for deployment in an active-active cluster…

 

As AFA deployment capacity increases, the risk to the business keeps growing as more users and applications will be impacted by an outage incident. If IT agility matters to you, make sure your next AFA supports active-active deployment, and that clustering can be enabled nondisruptively if it’s needed in the future. In a digital world, the question is not IF but WHEN you will be asked to deliver the same class of service as Google, Facebook or AWS on your flash platform.

 

Happy new year.

 

Patrick Allaire
