Patrick Allaire

Hitachi Artificial Intelligence: Cognitive Insight Against System Outages

Blog Post created by Patrick Allaire on Feb 28, 2017

Like many Silicon Valley professionals, I juggle my time between work-related activities and being a parent; I have little time for myself until everybody is asleep. At that point, I am so exhausted that I last 30 minutes or so before I fall asleep too. As the dad of a 3-year-old son who is in preschool, I have experienced a different pattern, where every winter brings a new strain of a nasty cold virus. This winter was no different: one day, my son came back from school with a runny nose, sneezing and so needy that avoiding these critters was like fighting gravity.¹

 

After a week of extra-careful hand washing, using hand sanitizer everywhere and dodging every sneeze, I still caught something nasty. I felt it in the middle of the night as a scratchy throat, and within 12 hours it had moved up to my sinuses and my head wanted to explode. While you can call in sick at work, there is no sick day as a dad, even less so when mom has caught the same thing…

 

This experience reminded me of the value of health; there is no good time to be out sick when your family needs you. While we take for granted that we have the strength and energy to get through our daily tasks, a minuscule bug can bring down the healthiest parent…

 

I see a parallel with the digital world IT professionals are building. I am not talking about the bugs or critters your kids bring home but about something even worse: quality defects in hardware and software that, once awakened, can bring down any production application (see the February 28th, 2017 Amazon cloud service outage, for example).

 

Early in my career, I thought some of my peers working in the government and financial industries were overzealous about IT changes and processes. My position changed 180 degrees over the course of 20+ years as I saw many IT careers shattered by blind trust in a new technology.

 

 

The most sophisticated IT buyers remind me that every vendor claims the same thing in terms of uptime, promising upward of six 9s (99.9999%) availability, but only a few deliver on their promise. So, what is the value of quality for your organization? These same customers will be quick to say that there is no monetary value² in trust, but like a parent looking for a sitter for one night out, relying on a trusted family member to watch over your child keeps your mind at peace.

 

If the all-flash array (AFA) market uptime standard is 32 seconds of unplanned downtime annually, how can any vendor deliver on this promise when one incident can take hours to resolve?

[Image: Table of Nines]
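
To put numbers behind that question, here is a minimal Python sketch of the arithmetic. It assumes a 365-day year, and the helper names (`annual_downtime_seconds`, `years_of_budget_consumed`) are hypothetical; none of this comes from Hitachi tooling. It converts a count of nines into an annual downtime allowance and shows how many years of that allowance a single incident uses up.

```python
# Illustrative arithmetic only; the function names are hypothetical.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def annual_downtime_seconds(nines: int) -> float:
    """Downtime allowed per year at an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def years_of_budget_consumed(incident_seconds: float, nines: int) -> float:
    """Years of downtime allowance that a single incident uses up."""
    return incident_seconds / annual_downtime_seconds(nines)


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} nines: {annual_downtime_seconds(n):,.1f} seconds per year")
    one_hour = 60 * 60
    print(f"A one-hour incident uses {years_of_budget_consumed(one_hour, 6):.0f} "
          "years of a six-nines allowance")
```

At six 9s the allowance is about 31.5 seconds a year, which matches the 32-second figure once rounded, so a single one-hour incident consumes more than a century of allowance.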

 

I reached out to our Hitachi Virtual Storage Platform (VSP) family and Storage Virtualization Operating System (SVOS) engineers and consulted the HDS customer services and support (CS&S) organization to find the answer. While Hitachi quality and resilience have always been considered the gold standard in the industry, I never understood why others failed to emulate the same methodology or practices. Uninitiated buyers wrongly assume that this trusted status is achievable by any vendor with the financial resources to commit to quality. The reality is quite the opposite; see the prior blog “Lies, Damned Lies and Uptime Statistics” for how vendors get away with their inflated uptime specifications. No new AFA vendor can re-create Hitachi storage quality and resilience overnight, as their engineers can’t predict the interaction of all edge conditions across the stack from the host down to the array.

 

With full access to the Hitachi Data Systems support database, I gathered a random sample of 150 million operating hours of Hitachi Virtual Storage Platform systems to better understand whether, and how, we were delivering on the quality our customers expect.

 

Here are the facts I gathered on Hitachi VSP and SVOS quality and resilience from my research:

 

Over the same period, HDS collected more than 500,000 Storage Virtualization Operating System (SVOS) predictive monitoring system information message (SIM) alerts annually. More than 50% of HDS support calls were initiated by these SVOS predictive SIMs, with HDS customer support informing the system administrator that service was needed before the customer noticed any issue.

 

I followed up with Hitachi engineering to better understand what these SIM alerts were. In short, SVOS predictive monitoring was built over time in partnership with our support organization. It is the fruit of more than 28 years of experience in system engineering and support, embedded in every Hitachi storage platform. SVOS ongoing data collection and “phone home” reporting capability (aka SIM communication) ensure that quality is maintained over time by monitoring more than 450 hardware, software, environmental and data path metrics to proactively confirm that the system is operating under ideal conditions.

 

On a daily basis, SVOS predictive monitoring collects more than 6,000 system performance data points to track system quality of service, enabling quick root cause analysis of any response time abnormalities. Annually, SVOS smart predictive monitoring analyzes over 100 billion drive information events to ensure data integrity and data availability and to identify quality issues before they impact customers.

 

The simplest way to describe what all this predictive capability means in the data center world is to look at a drive failure use case. More than 9 out of 10 hard disk drive, solid state drive (SSD) or flash module (FMD) failure incidents are triggered by SVOS predictive health insights, which elect to spare a device before it crashes hard, avoiding long rebuild times and performance degradation.

 

The superior resilience enabled by Hitachi SVOS comes from identifying quality abnormalities early, at their source, before a bug affects a production application. Note that SVOS cognitive insight extends beyond system components, storage devices and software/firmware: it also monitors interoperability between software functions across the logical and physical data path, out to the network and the host.

 

Drilling down on high-priority service requests helped me understand why application resilience can only be achieved with visibility of the entire stack from the host down. Classic hardware failures represented about 12% of HDS service requests in that sample, while software/firmware bugs accounted for less than 5% of service calls. To reduce data availability and quality of service risks, a predictive monitoring engine needs visibility into configuration-related issues, which represented 19% of HDS service calls, and non-Hitachi data path issues, which represented 13% of HDS calls.

 

In this population of 150 million system operating hours, Hitachi delivered 100% data availability for more than 99.9% of Hitachi Virtual Storage Platform customers. Average system uptime in this sample was greater than six 9s (99.9999%), and 100% of acute system incidents were caused by software, configuration, user or change management issues. These acute system incidents were related to partial data access or quality of service issues; zero hardware incidents caused a system-down outage, thanks to the Hitachi Virtual Storage Platform fault-tolerant architecture.
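
As a rough sanity check on what six 9s means at this scale, the sketch below computes the cumulative downtime ceiling implied by 99.9999% average uptime across 150 million operating hours. It is plain arithmetic with a hypothetical helper name and an assumed 8,760-hour year. The sample only establishes that uptime was greater than six 9s, so the result is an upper bound, not a measured figure.

```python
# Illustrative arithmetic only; the helper name is hypothetical.
HOURS_PER_YEAR = 365 * 24  # 8,760; assumes a 365-day year


def downtime_ceiling_hours(operating_hours: float, nines: int) -> float:
    """Maximum cumulative downtime consistent with `nines` nines of uptime."""
    return operating_hours * 10.0 ** -nines


sample_hours = 150_000_000  # sample size quoted in the post
ceiling = downtime_ceiling_hours(sample_hours, 6)
system_years = sample_hours / HOURS_PER_YEAR

print(f"Sample size: about {system_years:,.0f} system-years")
print(f"Downtime ceiling at six nines: about {ceiling:,.0f} hours in total")
```

That works out to at most roughly 150 hours of cumulative downtime spread over about 17,000 system-years of operation.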

 

Which AFA vendor can you rely on to compete in a digital world and have peace of mind when you get back to your family at night?

 

Sincerely,

 

 

Patrick Allaire

 

¹ I had no chance; according to the UK's National Health Service, it's possible for a cold virus to survive outside the body for more than one week. Viruses last longer in indoor environments.

 

² Analyst research on data center outage costs, such as that from the Ponemon Institute, points to a median outage cost of $648,174.
