Hu Yoshida

Big Data System Can Count on Another V, Virtualization

Blog post created by Hu Yoshida on Sep 17, 2018

Big data refers to data sets that are so large and complex that traditional data processing infrastructure and application software struggle to deal with them. Big data is associated with the coming of the digital age, in which the growth of unstructured data begins to outpace that of structured data. Initially, the characteristics of big data were described as volume, variety, and velocity. Later, the concepts of veracity and value were added as enterprises sought to find the ROI in capturing and storing big data. Big data systems are becoming the center of gravity in terms of storage, access, and operations, and businesses will look to build a global data fabric that gives comprehensive access to data from many sources, along with computation for truly multi-tenant systems. The following chart is a composite of IDC’s 2018 Worldwide Enterprise Storage Systems Market Forecast 2018-2022 and Volume of big data in data center storage worldwide from 2015 to 2021. This composite shows that big data will account for more than half of enterprise data capacity by 2021.

 

[Chart: Big Data Growth]

 

Unfortunately, deployments of large data hubs over the last 25 years (e.g., data warehouses, master data management, data lakes, ERP, Salesforce, and Hadoop) have resulted in more data silos that are not easily understood, related, or shared. In the May-June 2017 issue of the Harvard Business Review, Leandro DalleMule and Thomas Davenport published an article, “What’s Your Data Strategy?”, in which they claimed that less than 50% of an organization’s structured data is actively used in making decisions and less than 1% of its unstructured data is analyzed or used at all. While the ability to manage and gain value from the increasing flood of data is more critical than ever, most organizations are falling behind. I contend that much of this is due not to a lack of interest or need but to the difficulty of accessing the silos of data. I believe we need to add another “V” word in association with big data. That word is virtualization.

 

In IT there have been two opposite approaches to virtualization: server virtualization, where you make one physical server look like multiple servers, and storage virtualization, where you make multiple physical storage units look like a single storage system, that is, a pool of available storage capacity that can be managed and enhanced by a central control unit. The virtualization of big data is more like storage virtualization: multiple data silos can be managed and accessed as one pool of data. This virtualization is done through the use of metadata, which enables diverse data to be stored and managed as objects.

 

Object storage can handle the volume challenge. It is essentially boundless because it uses a flat namespace and is not bound by directories. Hitachi’s HCP (Hitachi Content Platform) can scale the volume of data across internal and external storage systems, and from edge to cloud. HCP metadata management enables it to store a variety of unstructured data. Hitachi Vantara’s object store was designed for immutability and compliance, and Hitachi Content Intelligence has been added to ensure the veracity of the data that it stores. HCP with Hitachi Content Intelligence provides a fully federated content and metadata search across all data assets, with the ability to classify, curate, enrich, and analyze the data you have. Together, Hitachi Content Platform and Hitachi Content Intelligence provide the intelligent generation and curation of metadata needed to break down those silos of unstructured data. Pentaho, with its Pentaho Data Integration (PDI), provides a similar capability to break down the silos of structured data. In this way we can virtualize the silos of data for easier access and analysis and transform these separate data hubs into a data fabric.
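
To make the idea of a metadata layer over several silos concrete, here is a minimal sketch. It is not the HCP or Hitachi Content Intelligence API. The class names, the silo names, and the metadata fields are all hypothetical, and the "catalog" is just an in-memory list. It only shows how objects addressed by flat keys in different silos can be found through one metadata search.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    """One stored object, described only by its metadata (hypothetical schema)."""
    silo: str      # which backing system holds the bytes
    key: str       # flat object key, no directory hierarchy
    metadata: dict = field(default_factory=dict)


class MetadataCatalog:
    """A toy catalog that indexes objects from many silos in one place."""

    def __init__(self):
        self._records = []

    def register(self, silo, key, **metadata):
        # Only metadata is recorded here; the data itself stays in its silo.
        self._records.append(ObjectRecord(silo, key, dict(metadata)))

    def search(self, **criteria):
        """Return every record whose metadata matches all given criteria."""
        return [
            r for r in self._records
            if all(r.metadata.get(k) == v for k, v in criteria.items())
        ]


catalog = MetadataCatalog()
catalog.register("data-lake", "scan-0001", type="invoice", year=2018)
catalog.register("file-share", "scan-0002", type="invoice", year=2017)
catalog.register("archive", "img-0420", type="x-ray", year=2018)

# One query spans all three silos.
hits = catalog.search(type="invoice")
print([(r.silo, r.key) for r in hits])
```

The caller never needs to know which silo holds an object until the search returns it, which is the sense in which the silos look like one pool.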

 

Earlier, we mentioned that velocity was one of the attributes of big data. That referred to the velocity at which unstructured data was being generated. It did not refer to the access speed of object storage systems. Because object storage systems are accessed over internet protocols, they cannot deliver data as fast as directly attached file or block systems. As a result, large analytic systems like Hadoop would ETL the data into a file system for processing. For years Hadoop has been the go-to data storage and processing platform for large enterprises. While Hadoop has solved the problem of storing and processing massive amounts of data, it has created a new one: storing data in Hadoop is expensive, because fault tolerance comes from 3x data redundancy. Keeping this data indefinitely is costly, but the data is still valuable, so customers do not want to throw it away. HCP, with its lower cost object storage options and 2x data redundancy, can solve the problem of storing and protecting massive amounts of data. But what about the lack of processing speed? This is where virtualization can also help.
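
The redundancy comparison above comes down to simple arithmetic. The sketch below assumes that "3x" and "2x" mean three and two full copies of the data, and it uses a hypothetical 1,000 TB of usable data. It counts raw capacity only and ignores differences in cost per terabyte between the two platforms.

```python
def raw_capacity_tb(usable_tb: float, copies: int) -> float:
    """Raw capacity needed to hold `usable_tb` of data with `copies` full copies."""
    return usable_tb * copies


usable_tb = 1000.0  # hypothetical data set size, not a figure from the post

three_copy = raw_capacity_tb(usable_tb, 3)  # 3x redundancy
two_copy = raw_capacity_tb(usable_tb, 2)    # 2x redundancy
saving = 1 - two_copy / three_copy          # fraction of raw capacity avoided

print(f"3 copies: {three_copy:.0f} TB raw")
print(f"2 copies: {two_copy:.0f} TB raw")
print(f"Raw capacity saved: {saving:.0%}")
```

Under these assumptions, moving from three copies to two removes one third of the raw capacity, before any difference in media cost is considered.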

 

Hitachi Content Platform (HCP) has partnered with Alluxio to use its memory-speed virtual distributed file system in a certified solution. The solution simplifies connecting big data applications, like Hadoop, to Hitachi Content Platform, which reduces storage costs and provides high-performance, simplified access to data. Alluxio sits between compute and storage and is compatible with both Hadoop and object storage. Alluxio intelligently caches only the required blocks of data, not the entire file, which provides fast local access to frequently used data without maintaining a permanent copy. Existing data analytics applications, such as Hive, HBase, and Spark SQL, can run on Alluxio without any code change, storing data to HCP and accessing it at memory speed.
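
Caching blocks rather than whole files can be illustrated with a small read-through cache. This is a sketch of the general technique only, not Alluxio's implementation or API. The `ObjectStore` and `BlockCache` classes, the tiny block size, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


class ObjectStore:
    """Stand-in for a remote object store; counts ranged reads."""

    def __init__(self, objects):
        self._objects = objects
        self.remote_reads = 0

    def read_range(self, key, start, length):
        self.remote_reads += 1
        return self._objects[key][start:start + length]


class BlockCache:
    """Read-through LRU cache that holds individual blocks, not whole files."""

    def __init__(self, store, max_blocks):
        self._store = store
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()  # (key, block_index) -> bytes

    def read_block(self, key, index):
        block_id = (key, index)
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as recently used
            return self._blocks[block_id]
        # Cache miss: fetch only this block from the backing store.
        data = self._store.read_range(key, index * BLOCK_SIZE, BLOCK_SIZE)
        self._blocks[block_id] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)    # evict least recently used
        return data


store = ObjectStore({"events.log": b"aaaabbbbccccdddd"})
cache = BlockCache(store, max_blocks=2)

first = cache.read_block("events.log", 1)  # miss: fetched from the store
again = cache.read_block("events.log", 1)  # hit: served from memory
print(first, again, store.remote_reads)
```

The second read never touches the object store, and the other three blocks of the file are never fetched at all. Evicted blocks are simply dropped, since the object store remains the permanent copy.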

 

This is similar to what HCP provides as a virtual bottomless file system for NFS and SMB filers, where files are accessed through a local Hitachi Data Ingestor (HDI), which acts as a caching device to provide remote users and applications with seemingly endless storage. As the local file storage fills up, older data is stubbed out to an HCP system over S3. Cloud storage gateways such as HDI, HCP Anywhere edge file servers, or third-party tools connect remote sites to the data center without application restructuring or changes in the way users get their data. This eases the migration from traditional NAS to cloud-based file services.
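
The stubbing behavior described above can be sketched as a small tiering policy. This is an illustrative model, not how HDI is implemented. The `StubbingFiler` class is hypothetical, a plain dictionary stands in for the object store, "older" is approximated by write order, and reads of stubbed files pass through to the remote copy without being re-cached.

```python
class StubbingFiler:
    """Toy edge filer: keeps recent files locally, stubs older ones out."""

    def __init__(self, local_capacity, object_store):
        self._capacity = local_capacity  # maximum bytes held locally
        self._local = {}                 # name -> bytes, oldest write first
        self._stubs = set()              # names whose data lives remotely
        self._remote = object_store      # dict standing in for the object store

    def _local_bytes(self):
        return sum(len(data) for data in self._local.values())

    def write(self, name, data):
        # Re-inserting moves the file to the "newest" end of the dict.
        self._local.pop(name, None)
        self._stubs.discard(name)
        self._local[name] = data
        # Move the oldest files out until the local tier fits again.
        while self._local_bytes() > self._capacity and len(self._local) > 1:
            oldest = next(iter(self._local))
            self._remote[oldest] = self._local.pop(oldest)
            self._stubs.add(oldest)

    def read(self, name):
        if name in self._local:
            return self._local[name]
        if name in self._stubs:
            return self._remote[name]    # fetched transparently for the caller
        raise FileNotFoundError(name)

    def is_stub(self, name):
        return name in self._stubs


remote = {}
filer = StubbingFiler(local_capacity=10, object_store=remote)
filer.write("q1.csv", b"111111")
filer.write("q2.csv", b"222222")  # 12 bytes exceeds 10, so q1.csv is stubbed out
print(filer.is_stub("q1.csv"), filer.read("q1.csv"))
```

The caller uses the same `read` call whether the file is local or stubbed, which mirrors the point that users and applications do not change how they get their data.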

 

Now is the time to add Virtualization to the Big Data “Vs” of Volume, Variety, Velocity, Veracity, Value.
