
Pentaho 9.0: Faster Data Delivery with Less Pipeline Friction

By Arik Pelkey, Sr. Director, Product Marketing
Posted 10-09-2019 05:18

 
“Can you please just send me the data?” 

Last September, our chairman, Toshiaki Tokunaga, published a “Better Together” blog reiterating Hitachi Vantara’s vision of helping our customers extract more value from their data and improve their businesses, while powering good for our society. 

It may sound trite, but for enterprises – and society – to become more data-driven, analytics data pipelines are often the name of the game. 



This is where the latest release of our Pentaho suite, Pentaho 9.0, comes in.       

Pentaho 9.0 is a shining example of how Hitachi Vantara is removing even more data pipeline friction.  With this release, we’re tackling some of the most challenging friction points in delivering data faster to data consumers such as business and data analysts, data scientists, applications and even AI bots.  We are doing so by helping teams shift towards more automated, collaborative, unified approaches – all tenets of modern data operations (DataOps) practices.

Here’s a summary of what we’re announcing: 
 
1.  Lumada Data Flow Studio: Manage Your Data Lifecycle

A.  Data Flow Orchestration Capabilities
A big friction point between data and insight is giving end users rapid access to the data they need. 

Today we are turning the “80% is just data prep” paradigm on its head.     

We are introducing new data flow orchestration capabilities inside a modern, collaborative data flow environment designed to improve data pipeline management for structured and unstructured data.

With these new orchestration capabilities, data analysts can now apply data flow templates to perform automated data integration tasks for on-boarding and preparing data. What we’re delivering are “self-service data pipelines,” a natural evolution of self-service analytics.  What’s more, and what’s unique about our approach, is that analysts can connect to ANY data source.
  

Here’s how it works:

The Old Way:  Slow and Resource Intensive 
- The business analyst would send an email with a new data request (and then wait one week, maybe more for unstructured data requests). 
- A data engineer would build a new data pipeline for each new data request from the business (and spend hours, days or weeks doing so).

The New Way:  Faster and Easier with Self-Service Data Pipelines 
- A data engineer builds one template for each common data integration pattern they expect data analysts to need.  The template defines the types of data sources, targets and filters that can be selected and easily configured by the data analyst.
- Data analysts – or any business user for that matter – can then select the specific sources and targets exposed by the data engineer to get self-service access to their own curated, prepared data sets using an approachable drag-and-drop UI (see the sketch after this list).
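
To make the template idea concrete, here is a minimal conceptual sketch in Python.  Everything in it is hypothetical – the class, function and option names are not the actual Pentaho or Lumada Data Flow Studio API – but it shows the division of labor: the engineer defines the template once, and the analyst only fills in the exposed choices.

```python
# Hypothetical sketch of the "self-service pipeline template" concept.
# None of these names correspond to actual Pentaho/Lumada APIs: a data
# engineer defines the template once, and an analyst only fills in the
# exposed choices (source, target, filter) without writing pipeline code.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class PipelineTemplate:
    name: str
    allowed_sources: list                               # choices the engineer exposes
    allowed_targets: list
    read: Callable[[str], Iterable[dict]]               # how records are fetched
    write: Callable[[str, Iterable[dict]], None]        # how records are landed

    def run(self, source: str, target: str, row_filter=None):
        """Execute the template with the analyst's selections."""
        if source not in self.allowed_sources:
            raise ValueError(f"source {source!r} not exposed by this template")
        if target not in self.allowed_targets:
            raise ValueError(f"target {target!r} not exposed by this template")
        rows = self.read(source)
        if row_filter is not None:
            rows = (r for r in rows if row_filter(r))
        self.write(target, rows)

# The engineer defines the template once...
template = PipelineTemplate(
    name="csv_onboarding",
    allowed_sources=["sales.csv", "returns.csv"],
    allowed_targets=["analytics_db"],
    read=lambda src: [{"region": "EMEA", "amount": 120}],   # stub reader
    write=lambda tgt, rows: print(tgt, list(rows)),         # stub writer
)

# ...and an analyst self-serves by picking only the exposed options.
template.run("sales.csv", "analytics_db", row_filter=lambda r: r["amount"] > 100)
```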

Customers using the data flow deploy capabilities in beta have described three benefits:
-  Empowering business users with rapid access to new data
-  Allowing data engineers and ETL developers to focus on higher-value projects
-  Enabling their organizations to become more data-driven
      
Productivity and more data-driven decisions – who could ask for more?


B.  Data Flow Monitor Capabilities

Many enterprises we talk to have hundreds or even thousands of data pipelines in production, with even more on the horizon.  However, pipelines can be fragile and often break, leading to frustrated data consumers and, potentially, bad decisions.  Some broken pipelines take a long time to fix because developers need to get involved to understand the problem.  Was it a bug?  Is it an infrastructure problem?  Or is it the data itself?  Answering these questions quickly is another common friction point. 

With 9.0, we’re introducing web-based data flow monitoring capabilities to help you intelligently monitor and manage pipelines from a single operational console. These capabilities improve operational efficiency by providing smart monitoring recommendations, and they enable DataOps teams to respond faster to production data pipeline incidents with integrated logging and troubleshooting.  Over time this will grow to include alerting, troubleshooting suggestions, audit trails and lineage, delivering even more operational efficiency by using active metadata such as canonical data structures and semantics, expected SLAs, change events and resource availability.
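
For a feel of what such monitoring automates, here is a minimal sketch assuming a hypothetical REST endpoint and payload shape – this is not the actual Lumada monitoring API.  It polls pipeline status and flags failures and SLA breaches, the kind of check a DataOps console runs continuously.

```python
# Illustrative sketch of the kind of health check a data flow monitor
# automates.  The endpoint, payload shape and thresholds are hypothetical,
# not the actual Lumada Data Flow Studio API.
import requests

MONITOR_URL = "https://dataops.example.com/api/pipelines"  # hypothetical

def failing_pipelines():
    """Return pipelines that failed or breached their expected runtime SLA."""
    pipelines = requests.get(MONITOR_URL, timeout=10).json()
    problems = []
    for p in pipelines:
        if p["status"] == "failed":
            problems.append((p["name"], "failed run"))
        elif p["runtime_seconds"] > p["sla_seconds"]:
            problems.append((p["name"], "SLA breach"))
    return problems

for name, reason in failing_pipelines():
    print(f"ALERT: {name}: {reason}")  # in practice: page the DataOps team
```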
 

The new Lumada Data Flow Studio can manage standalone Pentaho instances as well as the other Lumada Data Services we announced at Next 2019, helping customers build intelligent DataOps practices.  You will also see us come out with more data services over time.  

2.  Now Deployable in Modern Edge-to-Cloud Architectures

Data, and the big data infrastructure that supports it, are more distributed and complex than ever before.  SLAs from the business have become more difficult to fulfill, and DevOps has turned software deployment upside down.  To address these complexities, we are introducing three new capabilities:
 
Containerization:  Containers have fast become the de facto way to deploy enterprise applications.  Pentaho 9.0 now provides standardized containers in addition to our server-based deployment option.  This makes it easy for DevOps professionals to use container orchestration frameworks such as Kubernetes to run and manage Pentaho containers in the cloud, in your data center, or in edge environments such as factory floors.
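
To make this concrete, here is a minimal sketch of how a DevOps engineer might scale containerized instances with the official Kubernetes Python client.  The deployment name, namespace and replica count are hypothetical placeholders, not tied to Pentaho’s actual container images or charts.

```python
# Sketch: scaling containerized instances with the official Kubernetes
# Python client (pip install kubernetes).  The deployment name, namespace
# and replica count are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()          # read credentials from ~/.kube/config
apps = client.AppsV1Api()

# Scale a hypothetical "pentaho-server" deployment to three replicas,
# e.g. ahead of an anticipated end-of-quarter reporting peak.
apps.patch_namespaced_deployment_scale(
    name="pentaho-server",
    namespace="analytics",
    body={"spec": {"replicas": 3}},
)
```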
 
 
Worker Nodes:  Common challenges we’ve seen in enterprise accounts are the need to meet analytics delivery SLAs and to get more out of existing IT resources.  Under normal circumstances, a large enterprise may only need to process 30 to 50 transformation jobs – accessing data, creating reports and so on.  But during peak times like budget planning or end-of-quarter periods, demand may spike to 300 jobs.  With limited server capacity, the only way to get the data through was to run jobs sequentially, which made it difficult to hit SLAs.  And no sales manager or Chief Marketing Officer in the world wants to wait overnight for a sales report – especially at the end of a quarter. 

To solve this, Pentaho 9.0 lets you process more jobs in parallel at peak load times across multiple nodes using a technology called worker nodes.  At peak times you can now run more jobs in parallel, so you are no longer bound by the constraints of a single Java Virtual Machine.  

Take a simple example of a banking customer that currently processes five transformation jobs sequentially on one server node.  With worker nodes, that customer can spread the same five jobs across four nodes: four jobs run simultaneously and the fifth follows, so the whole batch completes in roughly the time of two jobs instead of five – a 2.5x speedup, as the toy sketch below illustrates.  Wow! 
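
Here is that scheduling effect in miniature, using Python threads purely as a stand-in for worker nodes – real worker nodes distribute Pentaho jobs across machines, not threads.

```python
# Toy illustration of the worker-node idea: the same five jobs finish far
# sooner when dispatched across four workers than when run one at a time.
# This simulates scheduling only; real worker nodes distribute Pentaho jobs
# across machines, not threads.
import time
from concurrent.futures import ThreadPoolExecutor

def transformation_job(job_id: int) -> int:
    time.sleep(1)          # stand-in for a 1-second transformation
    return job_id

for workers in (1, 4):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(transformation_job, range(5)))
    print(f"{workers} worker(s): {time.perf_counter() - start:.1f}s")
# Typical output: "1 worker(s): 5.0s" then "4 worker(s): 2.0s"
```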

Support for Multiple Hadoop Clusters:  Hadoop isn’t going away any time soon, but it sure is in transition.  Whatever happens, most organizations will still have multiple instances of Hadoop if you count on-prem distributions (say, Cloudera), distros in the cloud (say, EMR or HDInsight) and those in different lines of business (throw in some MapR for the data science team).  There are also likely to be different versions of the same distribution (Cloudera 6.1 and 6.2, for example) installed in different business units.  Some large global customers maintain multiple Hadoop clusters – often on different versions – to meet data sovereignty and other organizational compliance requirements.  Previously, accessing data from all these distributions required building unique pipelines for every version and distribution, and creating a staging area. 

With Pentaho 9.0, customers can connect to all Hadoop clusters, and all versions of those clusters, across an enterprise using the same centralized Pentaho instance, which saves an enormous amount of time.  Your data consumers in data science or in the lines of business also benefit because they now have access to even more data.   
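
Conceptually, it is the difference between wiring every pipeline to a specific cluster and keeping one registry of named cluster definitions that pipelines reference by name.  Here is a minimal sketch of that idea; the field names and values are hypothetical, not Pentaho’s actual cluster configuration format.

```python
# Conceptual sketch of the "one instance, many clusters" idea: cluster
# definitions live in one registry keyed by name, and each pipeline simply
# refers to a name.  Field names and values here are hypothetical, not
# Pentaho's actual cluster configuration format.
HADOOP_CLUSTERS = {
    "cloudera-prod-6.1": {"distro": "CDH", "version": "6.1", "namenode": "hdfs://nn-prod:8020"},
    "cloudera-dev-6.2":  {"distro": "CDH", "version": "6.2", "namenode": "hdfs://nn-dev:8020"},
    "emr-marketing":     {"distro": "EMR", "version": "5.29", "namenode": "hdfs://nn-emr:8020"},
}

def resolve_cluster(name: str) -> dict:
    """Look up the connection details a pipeline should use."""
    try:
        return HADOOP_CLUSTERS[name]
    except KeyError:
        raise ValueError(f"no cluster named {name!r} is registered") from None

# A pipeline declares only the cluster *name* -- no per-version plumbing.
print(resolve_cluster("cloudera-prod-6.1")["namenode"])
```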
 
3.  Expanded Analytics Ecosystem Support

Pentaho 9.0 continues to expand support for third-party products and technologies that help optimize data pipelines and extract value out of all your data.
   
Snowflake bulk load.  One thing slowing the stampede to the cloud is simply getting data into the cloud.  Today, it can be a highly manual process.  Building on the Snowflake connector from Pentaho 8.3, which enables blending, enrichment and analysis of Snowflake data alongside other data sources, we are introducing a bulk loader. 

Today, the most common way to move data into Snowflake is through repetitive SQL scripting to orchestrate bulk loads.  Using our new Snowflake bulk load capabilities to automate loading, customers can significantly reduce load times with higher performance, and apply policies and schedules that govern when data onboarding occurs. 
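
For context, the manual pattern the bulk loader automates is Snowflake’s standard PUT-then-COPY-INTO SQL sequence.  Here is a minimal sketch using the snowflake-connector-python package; the account, credentials and object names are placeholders, and this is not Pentaho’s bulk loader step itself.

```python
# The repetitive SQL pattern a bulk loader automates: stage local files with
# PUT, then load them with COPY INTO (both standard Snowflake SQL).  Uses the
# snowflake-connector-python package; account, credentials and object names
# are placeholders -- this is not Pentaho's bulk loader step itself.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="***",
    warehouse="LOAD_WH", database="SALES", schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("PUT file:///data/orders_2019_*.csv @%ORDERS")   # stage the files
cur.execute("""
    COPY INTO ORDERS
    FROM @%ORDERS
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
conn.close()
```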

Mainframe data access.  Mainframes are still crucial in many industries such as banking, insurance, healthcare, aviation, retail and others.  Yet, there are challenges that come with mainframe data access, privacy and security.  To address this, we are introducing a new drag and drop step available in Pentaho Data Integration to convert data from the most common mainframe format, EBCDIC, and make the data available in data pipelines.  By making traditionally siloed mainframe data available, you can now govern the data, blend it with other modern enterprise data sources and answer more business questions with greater precision in use cases such as fraud detection. 
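
Under the hood, the character conversion itself is well-trodden ground: Python, for instance, ships codecs for the common EBCDIC code pages.  Here is a minimal sketch of decoding a fixed-width EBCDIC record with the built-in cp037 codec (EBCDIC US/Canada); the record layout is hypothetical, and the new PDI step handles all of this visually, while real mainframe copybooks add trickier cases such as packed-decimal fields.

```python
# Minimal sketch of the underlying conversion: decoding an EBCDIC record
# with Python's built-in cp037 codec (EBCDIC US/Canada).  The fixed-width
# record layout is hypothetical; this is not the new PDI step itself.
record = b"\xc1\xc3\xd4\xc5" + b"\xf0\xf0\xf4\xf2"  # "ACME" + "0042" in EBCDIC

text = record.decode("cp037")
customer, account_no = text[:4], text[4:]
print(customer, int(account_no))   # -> ACME 42
```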

Conclusion:    

Legacy data management practices just don’t work anymore – they’re too slow to enable the agile approach the business demands today.  Pentaho 9.0 delivers a more automated, policy-based approach to data management that gets the right (governed) data to the right place at the right time, and it enables customers on their hybrid or cloud journeys to meet demanding SLAs. 

But above all, 9.0 helps us continue our mission of helping customers extract more value from their data and improve their businesses, while powering good for our society. 

Are you ready to extract more value from your data?  Visit www.HitachiVantara.com/Pentaho to learn more.  Better yet, download the 30-day trial and try it out yourself!

Disclaimers:
  • Pentaho 9.0 capabilities initially offered in beta with restricted distribution:  worker nodes, containers, data flow deploy and data flow monitoring




