Artificial intelligence: Everybody’s doing it, but few are doing it well. According to the 2019 MIT SMR-BCG Artificial Intelligence Global Executive Study and Research Report, nine out of ten companies have made some investment in AI, yet 70 percent report minimal or no impact from AI so far. 2020 is the year that organizations will have to show ROI on their investments in AI and machine learning (ML). The problem, says Forrester in its 2020 AI predictions report, is “sourcing data from a complex portfolio of applications.” Data scientists often struggle to acquire, transform, and prepare the data they need to start an ML project. Data lakes, data engineers, and data prep tools have helped, but the real challenge remains sourcing data from that complex portfolio of applications and convincing the various data gatekeepers to play along.
Businesses continue to struggle to manage and gain full value from their data in this new edge-to-core-to-multicloud world. Data breaches are common, rogue data sets propagate in silos, and companies’ data technology often isn’t up to the demands put on it. Few if any data management frameworks are business focused: they don’t promote efficient use of data and allocation of resources, and they don’t curate data to capture its meaning, or the technologies applied to it, so that engineers can move and transform the essential pieces that data consumers need for data monetization. Customers need to shift their strategies toward more collaborative, unified, and automated processes so they can better leverage data to advance their business. The way to accomplish this is through DataOps.
DataOps is data management for the AI era. It unleashes data’s full potential by automating the processes that get the right data to the right place at the right time, all while ensuring it remains secure, trustworthy, and accessible only to authorized employees. DataOps is needed to understand both the meaning of data and the technologies applied to it, so that data engineers can move, automate, and transform the essential data that data consumers need. Hitachi Vantara offers a proven, end-to-end DataOps methodology that lets businesses deliver better data quality, superior data management, and reduced cycle times for analytics.
Metadata is key to our approach to DataOps. Metadata is data that provides information about the content and context of other data; it summarizes basic information about a data set, which makes tracking and working with specific data easier. Comprehensive metadata enables data sets designed for a single purpose to be reused for other purposes and over the longer term.
A key management tool for metadata is a data catalog: a tool designed to help organizations find and manage large amounts of data, including tables, files, and databases, in their enterprise data stores. A data catalog centralizes metadata in one location, providing a full view of each piece of data across databases, and contains information about the data’s location, profile, statistics, summaries, and comments. It also enables collaboration among the various data gatekeepers. This systematized service makes data sources more discoverable and manageable for users and helps organizations make more informed decisions about how to use their data.
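To make the idea concrete, here is a minimal sketch of what a catalog entry and a centralized search over entries might look like. All names and fields here are illustrative assumptions, not the schema of any actual data catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Hypothetical catalog entry; the fields mirror what the text describes:
    # location, profile, statistics, and comments about a piece of data.
    name: str
    location: str                                     # e.g., database/table or file path
    profile: dict = field(default_factory=dict)       # column types, null rates, etc.
    statistics: dict = field(default_factory=dict)    # row counts, min/max values
    comments: list = field(default_factory=list)      # annotations from data stewards

# A tiny in-memory "catalog": metadata centralized in one place.
catalog = {}

def register(entry: CatalogEntry):
    catalog[entry.name] = entry

def find(keyword: str):
    # Search across names and steward comments, regardless of where the data lives.
    kw = keyword.lower()
    return [e for e in catalog.values()
            if kw in e.name.lower()
            or any(kw in c.lower() for c in e.comments)]

register(CatalogEntry("claims_2019", "s3://lake/claims/2019/",
                      comments=["insurance claims"]))
print([e.name for e in find("claims")])  # -> ['claims_2019']
```

Even this toy version shows the key property: data consumers query one service instead of each underlying store.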
Hitachi Vantara has acquired Waterline Data, a pioneer and one of the key players in the data catalog space. The Waterline Data Catalog uses metadata and machine learning to catalog and tag data across multiple clouds such as AWS, Azure, and Google Cloud Platform; on-premises big data platforms such as Cloudera and MapR; cloud databases such as Redshift and Snowflake; and on-premises relational databases.
Most data has technical metadata, the names that developers gave it, but this technical metadata is inconsistent and frequently misleading. Data consumers want to use business metadata, that is, standard business terms, to find the data that they need. Imagine an analyst looking for insurance claim numbers. The fields containing those numbers may have been called claim_num, cnum, claim_number, num, claimed, or a variety of other names. Trying to search through all the possibilities is not practical. It gets even worse when the data consumer needs to find data sets that contain claim number, customer name, and customer tax ID together. Each can have dozens of possible technical names, and the permutations of those dozens would require thousands of possible searches. This is what is called the operational gap: the gap that keeps data consumers from finding, understanding, and using data that is described only in technical metadata.
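The combinatorial explosion described above is easy to quantify. The snippet below uses made-up candidate field names (only a handful per concept, where real estates may have dozens) to contrast searching by technical metadata with searching by curated business terms:

```python
from itertools import product

# Hypothetical technical names for each business concept (illustrative only;
# the claim-number variants come from the example in the text).
claim_number    = ["claim_num", "cnum", "claim_number", "num", "claimed"]
customer_name   = ["cust_name", "cname", "customer", "name_1"]
customer_tax_id = ["tax_id", "tin", "cust_tax", "taxpayer_id"]

# Searching by technical metadata: every combination is a separate query.
combinations = list(product(claim_number, customer_name, customer_tax_id))
print(len(combinations))  # 5 * 4 * 4 = 80 queries for just three concepts

# With dozens of names per concept, the count reaches into the thousands.
print(30 * 30 * 30)  # 27000 combinations at 30 names per concept

# Searching by business metadata: one query over three curated tags.
business_tags = {"Claim Number", "Customer Name", "Customer Tax ID"}
print(len(business_tags))  # 3
```

The multiplication is the whole point: mapping technical names to business terms collapses thousands of candidate searches into one.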
Waterline’s machine learning tool, Aristotle, associates technical metadata with business terms, or business metadata, thereby closing the operational gap. Aristotle uses patented fingerprinting technology to automate the discovery, classification, management, and governance of data scattered across the enterprise. Fingerprinting combines AI-based and rule-based techniques to discover, classify, and analyze distributed and diverse data assets, accurately and efficiently tagging large volumes of data based on common characteristics.
For example, to correctly identify “insurance claim numbers” in a petabyte-scale data lake, Waterline Data requires only a single field to be identified as a claim number. The technology then generates a unique “fingerprint” that lets it recognize and label all similar fields as “insurance claim numbers” across the entire data lake and beyond with extremely high precision, regardless of file formats, field names, or data sources. This makes discovering valuable insights from data much easier. Waterline offers a pragmatic and advanced approach to data catalogs and metadata management.
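The general shape of fingerprint-based tagging can be sketched in a few lines. This is a deliberately simplified stand-in, not Waterline’s patented algorithm: it summarizes a field’s values with two crude statistics and tags any look-alike field, whatever name a developer gave it. All sample data and thresholds are assumptions.

```python
import statistics

def fingerprint(values):
    """Toy fingerprint: characterize a field by average value length
    and the ratio of digit characters. Real systems use far richer features."""
    lengths = [len(v) for v in values]
    digits = sum(c.isdigit() for v in values for c in v)
    return (statistics.mean(lengths), digits / max(1, sum(lengths)))

def similar(fp_a, fp_b, tol=0.5):
    # Match if both average length and digit ratio are close.
    return abs(fp_a[0] - fp_b[0]) <= tol and abs(fp_a[1] - fp_b[1]) <= tol

# One field a data steward labeled "insurance claim number"...
labeled = fingerprint(["CLM10293", "CLM55120", "CLM00987"])

# ...lets us tag look-alike fields elsewhere, regardless of their technical names.
candidates = {
    "cnum":      ["CLM66001", "CLM12345", "CLM90807"],
    "cust_name": ["Alice Smith", "Bob Jones", "Carol Diaz"],
}
tags = {name: "insurance claim number"
        for name, vals in candidates.items()
        if similar(labeled, fingerprint(vals))}
print(tags)  # only "cnum" matches the fingerprint
```

The design choice worth noting is that matching is driven by the data’s characteristics rather than its field names, which is exactly what lets a single labeled example propagate across differently named fields.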
For more information on Hitachi Vantara’s DataOps advantage and the benefits of Waterline’s data catalog, see the blog by Brad Surak, President, Digital Solutions, Hitachi Vantara. #Hu'sPlace #Blog