Skip navigation


3 Posts authored by: Evan Cropper Employee

by Carl Speshock, Senior Technical Product Manager


1. Why Model Management


Data Scientists are constantly exploring and working with the latest in Machine Learning (ML) and Deep Learning (DL) libraries. This has increased in the number of model taxonomies Data Engineers manage and monitor in production environments. To address this need, Hitachi Vantara developed a model management solution that supports the evolving requirements of the data scientist and engineer community.


Pentaho offers the concept of the Model Management Reference Architecture as shown below.




2. Why Model Management with Pentaho?


Data Engineers choose PDI for model management for the following reasons, it:


  • Operationalizes ML/DL Models for Production usage, expanding upon Pentaho’s Data Science abilities via the ability to implement a Model Management solution.
  • Enables data engineers to work with PDI GUI drag and drop environment to implement Python script files received from Data Scientists within the new PDI Python Executor Step.
  • Utilizes input(s) from previous PDI steps as variables, Python Numpy, and/or Pandas DataFrame objects within DL processing. This integrates DL Python Script file with PDI workflow and data stream


The Pentaho model management architecture includes:

  • PDI: Performs Data Wrangling/Feature Engineering performed by the Data Engineer to fulfill production data set request(s) from the Data Scientist
  • Python Executor Step: Executes Python scripts containing ML and DL libraries, and more.
  • The ML/DL Model: Developed by a Data Scientist, within a Data Science notebook such as Jupyter Notebook, etc.
  • The Model Catalog: Stores versioned model within a RDBMS, NoSQL database, and other data stores).
  • Champion/Challenger Multi-Model Analysis: Pulls models from Model Catalog and runs iteratively with hyperparameters to analyze and evaluate model performance.


3. How do we Manage Models within PDI?


Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step
  • 2.7.x or Python 3.5.x

Review Pentaho 8.2 Python Executor Step in Pentaho online help for list of dependencies.

Basic process:


1. Build a Model Management workflow in PDI following the example below::


  • Create Model Catalog: Build models and store model run info, evaluation metrics, model version, etc. into the Model Catalog.
  • Setup Champion/Challenger runs: For each run competing models will compete against the Champion Model. – which isthe current model in production that Data Scientists are using.
  • Run Champion/Challenger Models: Pre-determined Challenger models will be utilized from the Model Catalog and run against the current Champion. The Model with the best evaluation metrics will be set to Champion. This keeps the current Champion or replaces it with a more accurate Challenger model.




2. In PDI, use the new 8.2 Python Executor Step




3. Model Experimentation is the first part of the Model Management workflow, which builds and trains models as well as stores evaluation metric information in a Model Catalog (figure below showcases 5 functional areas of processing). PDI offers a GUI drag and drop environment that could also add Drift Detection, Data Deviation Detection, Canary Pipelines and more.




This example uses four models. One of the models has injected in-line model hyperparameters using a subset of the labeled production data. All of this this can be adjusted to fit your environment, (i.e. modified to incorporate labeled or unlabeled and non-production data to accommodate your requirements).



4. Next stage in the workflow is the Setup Champion/Challenger run. Place a model into a Champion folder and the remaining models in the Challenger folder. This can be accomplished via using the PDI Copy Files step with an example shown below:






5. Now, perform a subset of a Champion/Challenger run with the three highest performing models. Decide on an evaluation metric or other value for comparing which model performed the best. The output of the Champion/Challenger run can produce log entries or send emails, to save the results.



The example shows the models using production data, which could be modified to incorporate either labeled or unlabeled, non-production data to accommodate your requirements. PDI offers a GUI drag and drop environment that could add Drift Detection, Data Deviation Detection, Canary Pipelines and more.



6. Run the PDI Job to execute the Model Management Workflow. It will write its output to the PDI log, with the Champion/Challenger results displayed (see below):




4. Why do Organizations use PDI and Python for Deep Learning?

It is no longer an option to operationalize your models in a production environment without a dedicated data preparation tool, it is a necessity. Pentaho 8.2 can assist you in this effort.


  • Ø PDI makes the implementation and execution of Model Management solutions easier using its' graphical development environment for related pipelines and workflows. It is easy to customize your Model Management workflow to meet specific requirements.
  • Ø Data Engineers and Data Scientist can collaborate on the workflow and utilize their skills and time more efficiently.
  • Ø The Data Engineer can use PDI to create the workflow in preparation of the Python scripts/models received from the Data Scientist and implemented in the Model Catalog.

Pentaho Data Integration (PDI) and Data Science Notebook Integration

Evan Cropper – PMM - Analytics

1. Using PDI, Data Science Notebook, and Python Together


Data Scientists are great at developing analytical models to achieve specific business results. However, deploying a model in a production environment requires a different skillset than data exploration and model development. The result is wasted human capital. The data scientist spends a significant amount of engineering the data, which is often re-written by your data engineers for production deployments.


So why not allow the data scientist and data engineer focus on what they do best? With Pentaho PDI’s Drag and Drop GUI environment, data engineers can prepare and orchestrate data to seamlessly flow into a Data Scientist’s Notebook, i.e. Jupyter (which is the focus of this blog). With Pentaho, Data Scientists can explore analytical models using the Python programming language and related Machine Learning and Deep Learning frameworks with cleansed data. The Result? Models ready for production environments at a fraction of the cost and delivered in a fraction of the time.


2. Why would you want to use PDI and the Jupyter Notebook together to develop models in Python?

Pentaho allows Data Scientists to spend their time on data science models instead of data prep tasks and makes it easier to share Python scripts between data scientists and data engineers. By choosing Pentaho to operationalize data science, organizations can:


  1. Utilize a graphical drag and drop development environment, which makes data engineering easier with a toolbox of connectors to many data sources - easily configured instead of coded, tools that can blend data from multiple sources, and transformation steps to cleanse and normalize the data.
  2. Migrate data to production environments with minimal changes.
  3. Scale seamlessly to address growing production data volumes, and
  4. Share production quality data Sets and Python scripts between Data Engineers and Data Scientists, as shown below in a collaborative workflow between the two personas:



3. How do you develop Python models using PDI and Jupyter Notebooks?


Dependencies and components tested with:

  • Pentaho versions applicable with: 8.1/8.2
  • 2.7.x or Python 3.5.x
  • Jupyter Notebook 5.6.0
  • Python JDBC package dependencies, i.e. JayDeBeApi and jpype

Pentaho PDI Data Service On-line help link that includes configuration, installation, client jars, etc. (


Basic process:


  1. In PDI, create a new Transformation connected to the Pentaho Server repository. Implement all of your data connections, blending, filtering, cleansing, etc., as shown in below example,




2. Use PDI's Data Service feature to export rows from the PDI transformation (which later will be consumed in a Jupyter Notebook). Create a New Data Service by right-clicking on the last step in the transformation. Test the Data Service within the UI and select Save Transformation As to save the Data Service to Pentaho Server.




3. Before the Data Scientist can work in the Jupyter Notebook utilize a Data Grid Step to review the Data Grid Fields and Data Values. These input variables will flow into the Python Executor Step. Note they can be easily changed by the Data Engineer for new PDI Data Services.







4. Below, the Python Executor – Create Jupyter Notebook –Python API contains Python Script, Input and Output references and more. From here, the Data Engineer can create the Jupyter Notebook for the Data Scientist to consume.



5. Python Executor – Create Jupyter Notebook –Python API step automatically populates the Jupyter Notebook (shown below) with the cleansed and orchestrated data from the transformation. The Data Scientist is connected directly to the PDI Data Service created earlier by the Data Engineer.




6. Data Scientists will retrieve, i.e. Enterprise Data Catalog, File Share, etc., the Jupyter Notebook file created by the Data Engineer and PDI. The Data Scientist will confirm the output from the Python Pandas Data Frame named df in the last cell.


7. From here, the Data Scientist can begin building, evaluating, processing and saving the machine and deep learning models by utilizing the Pandas Data Frame named df. An example is shown below using a Machine Learning Decision Tree Classifier.







4. How can Data Engineers and Data Scientists using Python collaborate better with PDI and Jupyter Notebooks?


  1. PDI’s graphical development environment makes data engineering easier.
  2. Data Engineers can easily migrate PDI applications to production environments with minimal changes.
  3. Data Engineers can scale PDI applications to meet production data volumes.
  4. Data Engineers can quickly respond to Data Scientist’s data set requests with PDI Data Services.
  5. Data Scientists can easily access Jupyter Notebook templates connected to a PDI Data Service.
  6. Data Scientists can quickly pull data on demand from the Data Service and get to work on what they do best!

Anand Rao, Principal Product Marketing Manager, Pentaho



1 Deep Learning – What is the Hype?


According to Zion Market Research, the deep learning (DL) market will increase from $2.3 billion in 2017 to over $23.6 billion by 2024. With annual CAGR of almost 40%, DL has become one of the hottest areas for Data Scientists to create models[1]. Before we jump into how Pentaho can help operationalize your organization’s DL models within product environments, let’s take a step back and review why DL can be so disruptive. Below are some of the characteristics of DL, it:






  • Uses Artificial Neural Networks that have multiple hidden layers that can perform powerful image recognition, computer visioning/object detection, video stream processing, natural language processing and more. Improvements in DL offerings and in processing power, such as the GPU, cloud, have accelerated the DL boom in last few years.
  • Attempts to mimic the activity in the human brain via layers of neurons, DL learns to recognize patterns in digital representations of sounds, video streams, images, and other data.
  • Reduces need to perform feature engineering prior to running the model through use of multiple hidden layers, performing feature extraction on the fly when the model runs.
  • Improves on performance and accuracy over traditional Machine Learning algorithms due to updated frameworks, availability of very large data sets, (i.e. Big Data), and major improvements in processing power, i.e. GPUs, etc.
  • Provides development frameworks, environments, and offerings, i.e. Tensorflow, Keras, Caffe, PyTorch, etc that make DL more accessible to data scientists.


2 Why should you use PDI to develop and operationalize Deep Learning models in Python?


Today, Data Scientists and Data Engineers have collaborated on hundreds of data science projects built in PDI. With Pentaho, they’ve been able to migrate complex data science models to production environments at a fraction of the costs as compared to traditional data preparation tools. We are excited to announce that Pentaho can now bring this ease of use to DL frameworks, furthering Hitachi Vantara’s goal to enable organizations to innovate with all their data. With PDI’s new Python executor step, Pentaho can:

  • Integrate with popular DL frameworks in a transformation step, expanding upon Pentaho’s existing robust data science capabilities.
  • Easily implement DL Python script files received from Data Scientists within the new PDI Python Executor Step
  • Run DL models on any CPU/GPU hardware, enabling organizations to use GPU acceleration to enhance performance of their DL models.
  • Incorporate data from previous PDI steps, via data pipeline flow, as Python Pandas Data Frame of Numpy Array within the Python Executor step for DL processing
  • Integrate with Hitachi Content Platform (HDFS, Local, S3, Google Storage, etc.) allowing for the movement and positioning of unstructured data files to a locale, (i.e. Data Lake, etc.) and reducing DL storage and processing costs.



  • PDI supports most widely used DL frameworks, i.e. Tensorflow, Keras, PyTorch and others that have a Python API, allowing Data Scientists to work within their favorite libraries.
  • PDI enables Data Engineers and Data Scientists to collaborate while implementing DL
  • PDI allows for efficient allocation of skills and resources of the Data Scientist (i.e. build, evaluate and run DL models) and the Data Engineer (Create Data pipelines in PDI for DL processing) personas.


3 How does PDI operationalize Deep Learning?


Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step, Hitachi Content Platform (HCP) VFS
  • 2.7.x or Python 3.5.x
  • Tensorflow 1.10
  • Keras 2.2.0

Review Pentaho 8.2 Python Executor Step in Pentaho On-line Help for list of dependencies. Python Executor - Pentaho Documentation

Basic process:


1. HCP VFS file location within a PDI step. Copy and stage unstructured data files for use by DL framework processing within PDI Python Executor Step.



Additional info: 2. Utilize a new Transformation that will implement workflows for processing DL frameworks and associated data sets, etc. Inject Hyperparameters (values to be used for tuning and execution of models) to evaluate the best performing model. Below is an example that implements four DL framework workflows, three using Tensorflow and one using Keras, with the Python Executor step.









3. Focusing on the Tensorflow DNN Classifier workflow (which implements injection of hyperparameters), utilize a PDI Data Grid Step, ie named Injected Hyperparameters, with values used by corresponding Python Script Executor steps.



4. Within the Python Script Executor step use Pandas DF and implement the Injected Hyperparameters and values as variables in the Input Tab



5. Execute the DL related Python script (either via Embedding or a URL to a file) and reference a DL framework and Injected Hyperparameters from inputs. Also, you can set the Python Virtual Environment to a path other than what is the default Python install.



6. Verify that you have Tensorflow installed, configured and is correctly importing into a Python shell.




7. Going back to the Python Executor Step, click on the Output Tab and then click on the Get Fields button. PDI will do a pre-check on the script file as to verify it for errors, outputs, etc.



8. This completes the configurations for the running of the transformation.


4 Does Hitachi Vantara also offer GPU offerings to accelerate Deep Learning execution?


DL frameworks can benefit substantially from executing with a GPU rather than a CPU because most DL frameworks have some type of GPU accelerators. Hitachi Vantara has engineered and delivered the DS225 Advanced Server with NVIDIA Tesla V100 GPUs in 2018. This is Hitachi Vantara’s first GPU server designed specifically for DL implementation.



More details about the GPU offering are available here:


5 Why Organization use PDI and Python with Deep Learning:


  • Intuitive drag and drop tools: PDI makes the implementation and execution of DL frameworks easier with its' graphical development environment for DL related pipelines and workflows.
  • Improved collaboration: Data Engineers and Data Scientist can work on a shared workflow and utilize their skills and time efficiently.
  • Better allocation of valuable resources: The Data Engineer can use PDI to create the workflows, move and stage unstructured data files from/to HCP, and configure injected hyperparameters in preparation for the Python script received from the Data Scientist.
  • Best-in-Class GPU Processing: Hitachi Vantara offers the DS225 Advanced Server with NVIDIA Tesla V100 GPUs that allow DL frameworks to benefit from GPU acceleration.




1. Global Deep Learning Market Will Reach USD 23.6 Billion By 2024 End: Zion Market Research