Lumada Data Lake: A DataOps Repository for Lumada Data Services

By Hubert Yoshida posted 02-27-2020 21:39


Lumada Data Services is a suite of products designed to help you easily and securely connect data producers and data users without locking you into proprietary data stores or cloud silos. Lumada Data Services consists of horizontal solutions: Lumada Edge Intelligence, Lumada Data Lake, Pentaho, and Lumada Data Optimizer. There are also Lumada solutions for vertical applications: Lumada Maintenance Insights, Lumada Manufacturing Insights and Lumada Visual Insights.

In my previous post I featured Lumada Edge Intelligence and explained how it was architected. In this post I will provide more detail on the Lumada Data Lake.


Data lakes have been around for some time and have mainly been associated with Hadoop. However, only about 15% of Hadoop deployments have been successful, due to the complexity of diverse workloads and the prohibitive costs of scaling as the data lakes expand. Data lakes have turned into swamps due to improper cataloging, curation, and governance of data. Public-cloud-based solutions also become expensive to scale and create undesirable vendor lock-in. Complexity creates an overreliance on IT for access requests, which may take days or weeks and inhibits the agility of data users.


Hitachi’s Lumada Data Lake kick-starts the DataOps journey. Built on a flexible cloud-native architecture, it integrates and catalogs data onto a cost-effective and metadata-rich object store and offers simple self-service management with low to no coding. Lumada Data Lake is a proven, highly flexible and secure hybrid cloud solution. It offers multi-petabyte scalability with low total cost of ownership (TCO) from edge to multicloud architectures.


LUMADA DATA LAKE

Hitachi’s Lumada Data Lake is based on the award-winning Hitachi Content Platform (HCP). HCP is a software-defined object storage solution that addresses modern data lake requirements. It maintains high performance at hyperscale, is compatible with the Amazon Web Services (AWS) S3 API and supports strong data consistency. HCP avoids the cost of ownership and management limitations associated with Hadoop HDFS by allowing storage and compute resources to be scaled independently. It offers hybrid cloud storage that places the right data in the right place, resulting in better storage economics than public cloud offerings.

With an innovative, elastic, microservices-based architecture, HCP provides massive scalability to support hundreds of data nodes, trillions of objects and exabytes of data. It also features rich, policy-driven data management and enrichment capabilities with strong data consistency. Its hardware-agnostic architecture can be deployed on any bare-metal server, virtual machine or container, on premises or in the public cloud. When combined with HCP S series nodes (plug-and-play storage nodes that provide 13 PB of storage in a single rack!), it delivers highly dense, cost-optimized, on-premises storage with significantly lower TCO than Hadoop storage nodes. Erasure coding functionality and automated data integrity checking processes also ensure long-term data protection and availability.
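For readers who have not worked with erasure coding, the idea can be illustrated with the simplest possible case: split an object into k fragments, store one extra parity fragment, and rebuild any single lost fragment from the survivors. The sketch below uses plain XOR parity purely for illustration. It is not HCP's actual scheme, which is not described in this post, and the function names are hypothetical.

```python
from functools import reduce


def xor_parity(fragments):
    """Return the bytewise XOR of a list of equal-length fragments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*fragments))


def split_with_parity(data: bytes, k: int):
    """Split data into k equal-length fragments (zero-padded) plus one parity fragment."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    return fragments + [xor_parity(fragments)]


def rebuild_missing(fragments):
    """Rebuild the fragment marked None; assumes exactly one fragment is missing."""
    missing = fragments.index(None)
    survivors = [f for f in fragments if f is not None]
    restored = list(fragments)
    # XOR of all surviving fragments (data and parity) equals the lost one.
    restored[missing] = xor_parity(survivors)
    return restored


stored = split_with_parity(b"Lumada Data Lake", k=4)
damaged = list(stored)
damaged[2] = None  # simulate losing one storage node
recovered = rebuild_missing(damaged)
```

With k = 4 the parity fragment adds 25% storage overhead and tolerates one lost fragment. Production erasure codes generalize this to several parity fragments so that multiple simultaneous failures can be survived.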


Multicloud support is based on a global namespace that allows for unified management across multiple on-premises cloud deployments as well as AWS. Offering broad infrastructure flexibility, HCP supports any S3 storage endpoint. Hitachi’s multicloud data management lets you put the right data in the right place at the right time. You can leverage public cloud services for specific use cases and conserve on-premises compute resources by copying data to the appropriate cloud-service-based application, returning only related insights to the on-premises location.


Data integration is facilitated through the containerized deployment of Hitachi Vantara’s Pentaho suite and Hitachi Content Intelligence. Pentaho provides broad, future-proofed flexibility to integrate with many popular big data stores. Connectors are available for Hadoop distributions, including Cloudera and MapR, AWS Elastic MapReduce (EMR), Google Cloud Platform and Microsoft Azure HDInsight, as well as popular NoSQL databases, such as MongoDB and Cassandra. With broad connectivity to any data type coupled with high-performance Spark and MapReduce execution, Pentaho simplifies and accelerates the process of integrating existing databases with new sources of structured and unstructured data. Lumada Data Lake also offers a containerized deployment of Hitachi Content Intelligence for indexing, querying and integrating unstructured data and metadata. Content Intelligence enables better understanding of the stored files, how the content is being used and, ultimately, the inter-object relationships. Effective curation of this data can then be achieved by adding searchable metadata tags. When combined, Pentaho and Content Intelligence with Lumada Data Lake offer unmatched integration of structured, unstructured and semi-structured data.
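To make the idea of searchable metadata tags concrete, here is a minimal sketch of a tag index over object keys. It is only an illustration of the concept and does not reflect the Content Intelligence API. The class name, method names, object keys and tag names are all hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index mapping (tag name, tag value) pairs to object keys."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, object_key, **tags):
        """Attach one or more name=value tags to an object."""
        for name, value in tags.items():
            self._by_tag[(name, value)].add(object_key)

    def search(self, **tags):
        """Return the sorted keys of objects that carry every requested tag."""
        matches = [self._by_tag.get(pair, set()) for pair in tags.items()]
        return sorted(set.intersection(*matches)) if matches else []


index = TagIndex()
index.tag("scans/a.pdf", department="claims", year="2020")
index.tag("scans/b.pdf", department="claims", year="2019")
claims_2020 = index.search(department="claims", year="2020")
```

Once objects carry tags like these, a data user can find content by what it is about instead of by where it happens to be stored.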


To reduce the reliance on IT for access to data, Lumada Data Lake creates built-in data zones that promote industry best practices in the separation of raw and clean data for curation and compliance purposes. Lumada Data Lake offers data catalog and dataflow studio capabilities for self-service cataloging and curation. This creates IT resource efficiencies while promoting a highly iterative and collaborative DataOps practice. The recently acquired Waterline Data Catalog, with its patented fingerprinting technology for automating the discovery, classification, management and governance of data scattered across the enterprise, will be used to accurately and efficiently tag large volumes of distributed and diverse data assets based on common characteristics.
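The separation of raw and clean data can be pictured as a simple convention on object keys: data lands untouched in a raw zone, and a curated copy is written to a clean zone. The sketch below assumes an in-memory store and zone prefixes named "raw" and "clean". These names and the ZonedStore class are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass, field

RAW, CLEAN = "raw", "clean"  # hypothetical zone prefixes


@dataclass
class ZonedStore:
    """Toy object store that keeps raw and curated copies in separate zones."""

    objects: dict = field(default_factory=dict)

    def ingest(self, name, payload):
        """Land incoming data in the raw zone exactly as received."""
        key = f"{RAW}/{name}"
        self.objects[key] = payload
        return key

    def promote(self, name, clean_fn):
        """Write a curated copy to the clean zone; the raw object is left as is."""
        key = f"{CLEAN}/{name}"
        self.objects[key] = clean_fn(self.objects[f"{RAW}/{name}"])
        return key

    def list_zone(self, zone):
        """Return the sorted keys stored under one zone."""
        return sorted(k for k in self.objects if k.startswith(zone + "/"))


def strip_blank_lines(text):
    """Example curation step: trim whitespace and drop empty lines."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


store = ZonedStore()
store.ingest("sensors.csv", " 21.5 \n\n 22.0 \n")
store.promote("sensors.csv", strip_blank_lines)
```

Keeping the raw copy unchanged means a curation step can be rerun or audited later, which is the compliance benefit the zones are meant to provide.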



For more information on Lumada Data Lake please see the following link
