
Simplify Data Pipelines across On-Premise and Cloud Hadoop Data Lakes

By Anand Sagar Rao Vala posted 02-13-2020 21:43


Organizations seek to run any workload from any location without the burden of re-architecting or refactoring applications, including data integration pipelines. For storage, they want to leverage their existing on-premise Hadoop investments and provide a seamless experience to data consumers when they migrate to the cloud to take advantage of the usability, scalability and elasticity of cloud-native solutions. In addition, Hadoop in the cloud comes with the complicated setup already taken care of, so it is ready to use immediately and on demand.

A quick primer on multi-cluster support

What is multi-cluster support?

The transition of Hadoop from on-premise to cloud has created additional data silos. Previously, accessing data from different distributions required building a dedicated pipeline for each of them. Multi-cluster support allows users to access data from all these different Hadoop distributions with a single pipeline, without staging the data.

Why are organizations excited about this capability?

Individual pipelines can now be more diverse: they can connect across clusters and seamlessly blend and integrate data from them. Configuring a single integration server instance to use multiple clusters also increases operational efficiency in production environments.

What use cases does it solve?

This setup streamlines security. A single pipeline can be configured with different user credentials to seamlessly connect to any number of secured or unsecured data lake clusters. Organizations can respond to increasing data silos by optimizing the storage and cost of data processing workloads.


Why multi-cluster support with Pentaho?

With Pentaho, organizations can use a single Pentaho server instance to:

1. Access data from multiple versions and vendor distributions of Hadoop clusters from within the same or multiple PDI pipelines, without having to create multiple configurations or restart the Pentaho server.

2. Simplify the setup within the development, test, and production environments, without having to reconfigure and restart Spoon.

3. Seamlessly set up multiple Hadoop clusters that contain raw data, curated data, or publishable data. A single pipeline can cleanse, normalize and standardize raw data into curated data, then anonymize, mask, merge, blend and prepare the curated data to be publishable.

Here is what a few of our customers were able to do:

A large multi-national bank wanted to reconcile data across different geographic locations within the US, EMEA and APAC for periodic reporting. They wanted to use a single pipeline to connect to multiple Cloudera Hadoop clusters and to read and write data across them.

A different large bank wanted to move to a hybrid architecture with Cloudera onsite and AWS EMR in the cloud.

The multi-cluster capability enables these Pentaho pipelines to connect to any number of secured or unsecured data lake clusters with the same or different user credentials. This avoids using a temporary store or breaking the pipeline into multiple pieces that execute in different Pentaho instances.
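To make the idea concrete, here is a minimal Python sketch of one pipeline whose steps each resolve their own cluster and identity from a shared registry. It is not Pentaho's API: `NamedCluster`, `ClusterRegistry`, `plan_pipeline`, the cluster names and the user name are all hypothetical, chosen only to mirror the hybrid Cloudera plus AWS EMR scenario described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class NamedCluster:
    """Hypothetical description of one cluster a pipeline can target."""
    name: str
    distribution: str            # e.g. "CDH", "HDP", "EMR"
    secured: bool
    user: Optional[str] = None   # identity used on secured clusters


class ClusterRegistry:
    """Holds every cluster definition known to a single server instance."""

    def __init__(self) -> None:
        self._clusters: Dict[str, NamedCluster] = {}

    def register(self, cluster: NamedCluster) -> None:
        # A secured cluster without a user cannot be connected to.
        if cluster.secured and not cluster.user:
            raise ValueError(f"secured cluster {cluster.name!r} needs a user")
        self._clusters[cluster.name] = cluster

    def resolve(self, name: str) -> NamedCluster:
        if name not in self._clusters:
            raise KeyError(f"unknown named cluster: {name}")
        return self._clusters[name]


def plan_pipeline(registry: ClusterRegistry,
                  steps: List[Tuple[str, str]]) -> List[str]:
    """Resolve each (step, cluster) pair independently, so one pipeline
    can span several clusters without an intermediate staging store."""
    plan = []
    for step_name, cluster_name in steps:
        cluster = registry.resolve(cluster_name)
        identity = cluster.user if cluster.secured else "anonymous"
        plan.append(
            f"{step_name} -> {cluster.name} ({cluster.distribution}) as {identity}"
        )
    return plan


registry = ClusterRegistry()
registry.register(NamedCluster("onprem", "CDH", secured=True, user="etl_onprem"))
registry.register(NamedCluster("cloud", "EMR", secured=False))

plan = plan_pipeline(registry, [("read_raw", "onprem"), ("write_curated", "cloud")])
for line in plan:
    print(line)
```

The point of the sketch is that the cluster is a property of each step rather than of the whole pipeline or server, which is what removes the need for one pipeline per cluster.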


How to use PDI for multi-cluster support?


Let’s now look at a transformation inside the PDI client that uses the multi-cluster functionality.

This transformation uses two defined named clusters, one Cloudera CDH and one Hortonworks HDP, which makes it a multi-cluster capable transformation.

The pipeline blends data from each cluster in Stream Lookup steps. Zip code and product data from the incoming raw source file is fed into the PDI stream and then processed with data from a CDH Hive table. After processing completes, we write the output files in Avro, ORC, Parquet and CSV formats to each of the two clusters.
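For readers unfamiliar with the pattern, the following Python sketch shows the general idea of a stream lookup followed by a fan-out of outputs. It illustrates the technique only and is not PDI code: the function names, field names and sample rows are hypothetical, and keeping the first lookup row per key is an assumption of this sketch rather than a statement about how the PDI step behaves.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def stream_lookup(stream: Iterable[Row],
                  lookup_rows: Iterable[Row],
                  key: str,
                  fields: List[str],
                  default: Optional[str] = None) -> Iterator[Row]:
    """Enrich each streamed row with fields from an in-memory lookup set."""
    # Load the (smaller) lookup side into a dictionary keyed on the join field.
    # Assumption of this sketch: the first row seen for a key wins.
    index: Dict[Optional[str], Row] = {}
    for row in lookup_rows:
        index.setdefault(row[key], row)

    # Pass the main stream through once, adding the looked-up fields.
    for row in stream:
        match = index.get(row[key])
        enriched = dict(row)
        for field in fields:
            enriched[field] = match[field] if match is not None else default
        yield enriched


def output_targets(clusters: List[str],
                   formats: List[str]) -> List[Tuple[str, str]]:
    """Every (cluster, format) pair the blended rows are written to."""
    return [(cluster, fmt) for cluster in clusters for fmt in formats]


# Hypothetical sample data standing in for the raw file and the Hive table.
raw_rows = [
    {"zip_code": "94105", "product": "A"},
    {"zip_code": "00000", "product": "B"},
]
hive_rows = [{"zip_code": "94105", "city": "San Francisco"}]

blended = list(stream_lookup(raw_rows, hive_rows, "zip_code", ["city"]))
targets = output_targets(["cdh", "hdp"], ["avro", "orc", "parquet", "csv"])
```

Rows without a match keep their original fields and receive the default value, so the main stream is never filtered by the lookup. Two clusters and four formats give eight output targets in total.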



Organizations are increasing their sophistication around hybrid and multi-cloud approaches, leading to "multi-hybrid" architectures. Assumptions that a given cloud provider has the lowest or best prices, or that the cost of networking between clouds is prohibitive, have become less and less true. Enterprises will continue to invest in hybrid cloud and look for greater interoperability between private and public clouds for all workloads, legacy as well as cloud native. Each such enterprise will benefit from Pentaho Data Integration’s new multi-cluster support.
