Hitachi Vantara DataOps Advantage for Data Copies

By Hubert Yoshida posted 09-24-2019 05:59


IDC predicts that our global datasphere – the digital data we create, capture, replicate and consume – will grow from approximately 40 zettabytes of data in 2019 to 175 zettabytes in 2025. There are so many predictions like this that we are no longer shocked by them. What is different in this prediction is that not only will the amount of IoT data and real-time data balloon, but so will the amount of data that is created and managed by enterprises. IDC claims that by 2025, nearly 60% of the 175 zettabytes of existing data will be created and managed by enterprises rather than consumers (compared to just 30% in 2015).

I can believe that more data will be captured by enterprises from consumer devices since consumer data will become more connected and more valuable to enterprises. However, this means that enterprises will be managing 105 zettabytes in 2025 versus about 12 zettabytes in 2019! Unless there is a major breakthrough in technology, I do not believe that we will have the manufacturing capability to provide that amount of storage capacity by 2025. The only way that we can keep up with this data deluge is to apply DataOps to manage this growth.
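
The arithmetic behind those two figures can be checked with a quick sketch. It rests on one assumption that is mine rather than IDC's: the 2019 figure applies the 30% enterprise share reported for 2015 to the 40 zettabytes estimated for 2019. The function name is made up for illustration.

```python
# Back-of-the-envelope check of the enterprise data figures quoted above.
# Shares are whole percentages so the arithmetic stays exact.

def enterprise_zettabytes(total_zb: int, enterprise_share_pct: int) -> float:
    """Return the portion of the datasphere managed by enterprises, in ZB."""
    return total_zb * enterprise_share_pct / 100


# IDC figures cited in the post: 175 ZB in 2025, about 60% enterprise.
zb_2025 = enterprise_zettabytes(175, 60)

# Assumption: the 2015 enterprise share (30%) still holds for 2019's 40 ZB.
zb_2019 = enterprise_zettabytes(40, 30)

growth_factor = zb_2025 / zb_2019

print(f"2019: {zb_2019:.0f} ZB, 2025: {zb_2025:.0f} ZB, growth: {growth_factor:.2f}x")
```

Under that assumption, enterprise-managed data grows nearly ninefold in six years, while the datasphere as a whole grows a little more than fourfold.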

When one thinks of reducing storage capacity, the first thing that comes to mind is dedupe and compression. But that is a drop in the bucket compared to the number of copies of data that are generated from a single source, and the number of dormant or orphaned data copies. I have not been able to find any recent surveys on how much stored data consists of copies. There was a survey done in 2012 by IDC which reported that 65% of external storage systems’ capacity was used for non-primary data such as snapshots, clones, replicas, archives, and backup data. Other studies at that time estimated that 13 to 20 copies were made of working data files, most of which were for data protection and disaster recovery.

Today the number of copies as a proportion of total data has grown much larger as applications and data are bound more closely together and copies of data are distributed across compute clusters for distributed processing and locality of reference. Copies serve a very useful purpose in an agile IT environment. Eliminating silos of data means creating ETL copies, while DevOps means more copies of data for parallel development streams.

Last month I blogged about the deluge of data that is generated by AI/ML. Mike Foley, our marketing data scientist, gave me an example of how many data files are created in the course of building an AI model. In his experience, a relatively small data science team would begin by designing a schema of approximately 30 analytical data views for their data science workbench. From that relatively small start, over the course of a couple of years, models would be developed, trained, tested and validated to produce predictions, and the volume of data generated would approach 10,500 data files! All these files would have to be retained in order to explain or prove the validity of the models and the training data that went into them.

This creates huge challenges in controlling the growth of copies while maintaining different types and methods of copies, such as full, incremental, differential, time-dependent, virtual and physical copies. Maintaining the consistency of copies, protecting them, ensuring governance, and enabling the exploration and data engineering of copies for analytics purposes will also be major challenges. A coordinated approach to managing the growth of data copies may be the only hope we have of meeting the data challenges of the next few years.

At Hitachi Vantara, the data arm of Hitachi Ltd, we have assembled a unique and extensive portfolio of products, solutions and intellectual property to support the implementation of your DataOps initiatives around data copies. We allow your organization to store, enrich, activate, and monetize ALL of the data you own, from the edge to the core to the cloud. We call it “Your DataOps Advantage”. Your DataOps Advantage is based on a suite of tools that will help you operationalize and monetize your data while managing the growth of data copies.

Here are some of the key building blocks in our DataOps portfolio:

Hitachi Data Instance Director (HDID)
Hitachi Data Instance Director is the most important tool for managing the growth of copies. HDID can simplify data protection and recovery by combining local operational recovery with remote business continuity and disaster recovery in a single workflow. It can instantly back up critical applications, create application snapshots and clones without impact on performance, and restore applications from a snapshot, clone or offsite replica in a single step. HDID data protection ensures continuous availability for applications and data while tracking and eliminating expired copies.

In addition to data protection, HDID can manage the explosion of copies required for data discovery, reporting and analytics. As more applications compete for the same data, data sets are copied across multiple compute clusters as they are needed and deleted when they are no longer needed. While having local copies of data can help applications to be more efficient, HDID ensures that dormant copies are not forgotten. HDID can automatically create, manage, refresh and expire copies and enable better use of data, including the repurposing of backup data for other productive uses. 
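
To make the idea of tracking and expiring copies concrete, here is a minimal sketch of a copy catalog that flags expired and dormant copies for reclamation. It is not HDID's interface or implementation. The record layout, the function names and the seven-day idle limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataCopy:
    """One tracked copy of a source data set (hypothetical record layout)."""
    name: str
    source: str
    created: datetime
    last_accessed: datetime
    retention: timedelta  # how long the copy may live after creation


def expired(copy: DataCopy, now: datetime) -> bool:
    """A copy is expired once its retention period has elapsed."""
    return now >= copy.created + copy.retention


def dormant(copy: DataCopy, now: datetime, idle_limit: timedelta) -> bool:
    """A copy is dormant if it has not been accessed within idle_limit."""
    return now - copy.last_accessed > idle_limit


def sweep(catalog, now, idle_limit):
    """Split the catalog into copies to keep and copies to reclaim."""
    keep, reclaim = [], []
    for copy in catalog:
        if expired(copy, now) or dormant(copy, now, idle_limit):
            reclaim.append(copy)
        else:
            keep.append(copy)
    return keep, reclaim


now = datetime(2019, 9, 24)
catalog = [
    # Fresh and recently used: stays.
    DataCopy("sales-dev-clone", "sales-db", now - timedelta(days=3),
             now - timedelta(days=1), timedelta(days=30)),
    # Past its 30-day retention: reclaimed even though it is still in use.
    DataCopy("sales-test-snap", "sales-db", now - timedelta(days=45),
             now - timedelta(days=2), timedelta(days=30)),
    # Within retention but untouched for nine days: reclaimed as dormant.
    DataCopy("sales-report-copy", "sales-db", now - timedelta(days=10),
             now - timedelta(days=9), timedelta(days=90)),
]

keep, reclaim = sweep(catalog, now, idle_limit=timedelta(days=7))
print("keep:", [c.name for c in keep])
print("reclaim:", [c.name for c in reclaim])
```

The point of the sketch is that every copy carries its own retention policy and access history, so nothing has to be remembered by the person who created it.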

Hitachi Content Platform (HCP)
Hitachi Content Platform (HCP) is an object storage software solution that connects data producers, users, applications and devices into a central cloud storage platform. It enables users to better understand, govern and control the degree of mobility of their data, as well as to identify insights and extract value for data-driven decisions and faster time to market.

Object storage is an architecture that manages data as objects rather than as a hierarchy of files, as file systems do. This makes an object store like HCP very scalable and cost effective for large data stores. HCP also has built-in mechanisms for high availability, durability, disaster recovery, privacy and immutability. HCP automates day-to-day IT operations like data governance and protection. This approach readily adapts to changes in scale, scope, regulatory compliance, applications, storage, server and cloud technologies over the life of data. HCP also automates the governance of data to ensure proper retention, access control, encryption and disposal of data, while simplifying e-discovery and search. HCP eliminates the need for backup since data is replicated and versioned when the data is updated. This would be the place to store all those data sets that are generated by AI/ML and retained to prove their validity. Versioning also counters the threat of ransomware without the need for multiple backup copies, since an earlier, unencrypted version of an object can be restored. In IT environments where data grows quickly or must live for years or even indefinitely, these capabilities are invaluable.
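
Why versioning helps against ransomware can be shown with a toy in-memory store in which an update never overwrites what came before. This is only a sketch of the general technique, not HCP's API or its internals. The class and method names are hypothetical.

```python
from typing import Optional


class VersionedStore:
    """Toy in-memory object store that never overwrites a prior version."""

    def __init__(self) -> None:
        self._versions = {}  # key -> list of bytes, oldest first

    def put(self, key: str, data: bytes) -> int:
        """Append a new version and return its 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(data)
        return len(history)

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the latest version, or a specific 1-based version."""
        history = self._versions[key]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise ValueError(f"{key} has no version {version}")
        return history[version - 1]

    def restore(self, key: str, version: int) -> int:
        """Make an earlier version current by writing it as a new version."""
        return self.put(key, self.get(key, version))

    def version_count(self, key: str) -> int:
        return len(self._versions[key])


store = VersionedStore()
store.put("training-set.csv", b"id,label\n1,cat\n")       # version 1
store.put("training-set.csv", b"\x8f\x02ENCRYPTED\x00")   # version 2: malicious overwrite
store.restore("training-set.csv", 1)                      # version 3: good data again

print(store.get("training-set.csv"))
print(store.version_count("training-set.csv"))
```

The malicious write becomes just another version, so the clean data is still there to be restored, and the history itself records what happened.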

One of the biggest culprits for copies of data is the sharing of files. HCP has a feature, HCP Anywhere, that enables the sharing of files without the need to copy the file to everybody on the email copy list. With HCP Anywhere we just point to a link that can be shared with multiple users rather than sending everyone their own copy.

Hitachi Content Intelligence (HCI)
Hitachi Content Intelligence automates the extraction, classification, enrichment and categorization of data residing in both Hitachi Vantara and third-party repositories, whether on-premises or in the cloud, internal or external. This approach drastically reduces time spent searching for what is needed or recreating what already exists. HCI can be used to explore the data for signs of fraudulent activity or money laundering, or to support know-your-customer (KYC) checks. HCI could also provide pseudonymization of data, a safeguard that GDPR explicitly recognizes, to protect privacy without sacrificing the analytic value of the data. With the appropriate understanding of what the pattern of sensitive data looks like, and what constitutes direct and indirect personally identifiable information, HCI could find the data and redact it before it is shared.
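
As a rough illustration of pattern-based pseudonymization, the sketch below finds email addresses with a regular expression and replaces each one with a keyed hash token. It is not how HCI is implemented. The simplified email pattern, the function names, the token format and the sample key are all assumptions made for the example.

```python
import hashlib
import hmac
import re

# Simplified pattern for illustration; real-world email matching is messier.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonym(value: str, secret_key: bytes) -> str:
    """Map a value to a stable token using HMAC-SHA256 with a secret key.

    The value is lowercased first so that case variants of the same
    address map to the same token.
    """
    digest = hmac.new(secret_key, value.lower().encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"


def pseudonymize_emails(text: str, secret_key: bytes) -> str:
    """Replace every email address in text with its pseudonym."""
    return EMAIL_PATTERN.sub(lambda m: pseudonym(m.group(0), secret_key), text)


key = b"example-key-kept-outside-the-data-set"  # hypothetical secret
record = "Refund for alice@example.com approved; ALICE@example.com and bob@example.org notified"
print(pseudonymize_emails(record, key))
```

Because the same address always yields the same token, analysts can still count and join records per customer, while only the holder of the key can link a token back to a known address.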

Pentaho
Pentaho tightly couples data integration with business analytics. The Pentaho platform brings together IT and business users to ingest, prepare, blend and analyze all data that impacts business results. Pentaho’s open source heritage drives continued innovation in a modern, unified, flexible analytics platform that helps organizations accelerate their analytics data pipelines. Pentaho’s open, embeddable platform supports flexible analytics that both leverage existing data infrastructure and future-proof deployments against tomorrow’s inevitable changes. Intuitive data integration and preparation capabilities drastically reduce the hand coding required to bring data together for insight. Pentaho originated the concept of data lakes to support data analytics workflows. The clear advantage of a data lake is that it eliminates the need to copy data sets to various analytics clusters. At the same time, Pentaho Business Analytics provides a spectrum of analytics for all user roles, from visual data analysis for business analysts to tailored dashboards for executives. Pentaho is fast to deploy, easy to use, and purpose-built for the future of analytics.

Hitachi Vantara’s DataOps advantage provides a suite of tools to manage data growth and to operationalize and monetize your data. The Hitachi Data Instance Director, Hitachi Content Platform, Hitachi Content Intelligence, and Pentaho tools are particularly suited to address the largest component of data growth, which is the growth of data copies.