Evan Cropper

Pentaho Data Integration (PDI), Python and Model Management

Blog post created by Evan Cropper on Dec 12, 2018

by Carl Speshock, Senior Technical Product Manager


1. Why Model Management


Data Scientists are constantly exploring and working with the latest Machine Learning (ML) and Deep Learning (DL) libraries. This has increased the number of model taxonomies that Data Engineers must manage and monitor in production environments. To address this need, Hitachi Vantara developed a model management solution that supports the evolving requirements of the data scientist and data engineer community.


Pentaho offers a Model Management Reference Architecture, shown below.




2. Why Model Management with Pentaho?


Data Engineers choose PDI for model management for the following reasons. PDI:


  • Operationalizes ML/DL models for production use, extending Pentaho’s data science capabilities with a Model Management solution.
  • Enables Data Engineers to use the PDI drag-and-drop GUI to implement Python script files received from Data Scientists within the new PDI Python Executor Step.
  • Uses input(s) from previous PDI steps as variables, NumPy arrays, and/or pandas DataFrame objects within DL processing. This integrates the DL Python script file with the PDI workflow and data stream.
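
As a rough illustration of the last point, a script handed over by a Data Scientist might look like the sketch below. It assumes the Python Executor Step has been configured to pass the incoming rows to the script as a pandas DataFrame. The variable name `pdi_input`, the column names, and the `score_rows` function are hypothetical, and the scoring rule is a stand-in rather than a real model.

```python
import numpy as np
import pandas as pd


def score_rows(df, threshold=0.5):
    """Add 'probability' and 'prediction' columns to the incoming rows.

    A real script would load a trained ML/DL model here; the logistic
    squashing of 'feature_a' is only a stand-in so the sketch runs.
    """
    out = df.copy()
    # NumPy operates directly on the values of the DataFrame column.
    values = out["feature_a"].to_numpy(dtype=float)
    probability = 1.0 / (1.0 + np.exp(-values))
    out["probability"] = probability
    out["prediction"] = (probability >= threshold).astype(int)
    return out


# Inside PDI the step would supply the DataFrame (hypothetical name: pdi_input).
# Here a small frame is built by hand so the sketch can run on its own.
pdi_input = pd.DataFrame({"id": [1, 2, 3], "feature_a": [-2.0, 0.0, 3.0]})
result = score_rows(pdi_input)
```

In this arrangement `result` would be the variable handed back to the PDI data stream, so downstream steps see the original fields plus the two new columns.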


The Pentaho model management architecture includes:

  • PDI: Used by the Data Engineer for Data Wrangling/Feature Engineering to fulfill production data set request(s) from the Data Scientist.
  • Python Executor Step: Executes Python scripts that use ML and DL libraries, and more.
  • The ML/DL Model: Developed by a Data Scientist within a data science notebook such as Jupyter Notebook.
  • The Model Catalog: Stores versioned models within an RDBMS, a NoSQL database, or another data store.
  • Champion/Challenger Multi-Model Analysis: Pulls models from the Model Catalog and runs them iteratively with hyperparameters to analyze and evaluate model performance.
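
To make the Model Catalog idea concrete, here is a minimal sketch of a catalog kept in a relational table. It uses Python's built-in `sqlite3` with an in-memory database purely for illustration. The table name, column names, model names, file paths, and metric values are all hypothetical and are not taken from the Pentaho example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS model_catalog (
    model_name  TEXT    NOT NULL,
    version     INTEGER NOT NULL,
    metric_name TEXT    NOT NULL,
    metric      REAL    NOT NULL,
    artifact    TEXT    NOT NULL,
    PRIMARY KEY (model_name, version)
)
"""


def register_model(conn, model_name, metric_name, metric, artifact):
    """Store a new version of a model and return its version number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM model_catalog WHERE model_name = ?",
        (model_name,),
    ).fetchone()
    version = row[0] + 1  # versions count up from 1 per model name
    conn.execute(
        "INSERT INTO model_catalog VALUES (?, ?, ?, ?, ?)",
        (model_name, version, metric_name, metric, artifact),
    )
    conn.commit()
    return version


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up entries for illustration only.
v1 = register_model(conn, "churn_rf", "accuracy", 0.81, "models/churn_rf_v1.pkl")
v2 = register_model(conn, "churn_rf", "accuracy", 0.84, "models/churn_rf_v2.pkl")
```

Keeping the version as part of the primary key means earlier runs are never overwritten, which is what lets a later Champion/Challenger run pull any stored version.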


3. How do we Manage Models within PDI?


Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step
  • Python.org 2.7.x or Python 3.5.x

Review the Pentaho 8.2 Python Executor Step page in the Pentaho online help for the list of dependencies: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Transformation_Step_Reference/Python_Executor

Basic process:


1. Build a Model Management workflow in PDI following the example below:


  • Create Model Catalog: Build models and store model run info, evaluation metrics, model version, etc. in the Model Catalog.
  • Set up Champion/Challenger runs: For each run, Challenger models compete against the Champion model, which is the current model in production that Data Scientists are using.
  • Run Champion/Challenger Models: Pre-determined Challenger models are pulled from the Model Catalog and run against the current Champion. The model with the best evaluation metrics is set to Champion, which either keeps the current Champion or replaces it with a more accurate Challenger model.
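
The promotion rule in the last bullet can be sketched in a few lines. This is only an illustration of the rule, assuming a single evaluation metric where higher is better. The `ModelRun` type, the `pick_champion` function, and the numbers are hypothetical.

```python
from collections import namedtuple

# One evaluated model; 'metric' is assumed to be higher-is-better (e.g. accuracy).
ModelRun = namedtuple("ModelRun", ["name", "metric"])


def pick_champion(champion, challengers):
    """Return the model that should be Champion after this run.

    The current Champion is kept unless a Challenger scores strictly higher.
    """
    best = champion
    for challenger in challengers:
        if challenger.metric > best.metric:
            best = challenger
    return best


# Illustrative numbers only, not results from the example workflow.
current = ModelRun("model_a", 0.82)
challengers = [ModelRun("model_b", 0.79), ModelRun("model_c", 0.86)]
new_champion = pick_champion(current, challengers)
```

Using a strict comparison means a tie leaves the existing Champion in place, which avoids swapping production models without a measurable gain.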




2. In PDI, use the new 8.2 Python Executor Step.




3. Model Experimentation is the first part of the Model Management workflow. It builds and trains models and stores evaluation metric information in a Model Catalog (the figure below shows five functional areas of processing). PDI offers a drag-and-drop GUI that could also be used to add Drift Detection, Data Deviation Detection, Canary Pipelines and more.




This example uses four models. One of the models has its hyperparameters injected in-line, using a subset of the labeled production data. All of this can be adjusted to fit your environment (for example, modified to use labeled or unlabeled non-production data to accommodate your requirements).
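
The shape of such an experimentation run can be sketched without any ML library. In the sketch below the four "models" are trivial threshold classifiers standing in for real ML/DL models, the threshold plays the role of a hyperparameter, and the labeled rows are made up. All names are hypothetical.

```python
# Labeled sample as (feature value, label) pairs. Made-up rows for illustration.
LABELED_SUBSET = [(0.1, 0), (0.35, 0), (0.4, 1), (0.6, 1), (0.8, 1), (0.9, 1)]


def make_threshold_model(threshold):
    """Build a trivial classifier; 'threshold' stands in for a hyperparameter."""
    return lambda x: int(x >= threshold)


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for x, label in rows if model(x) == label)
    return correct / len(rows)


# Four candidate models; the last one has its hyperparameter set in-line.
candidates = [
    ("model_1", make_threshold_model(0.2)),
    ("model_2", make_threshold_model(0.5)),
    ("model_3", make_threshold_model(0.7)),
    ("model_4", make_threshold_model(threshold=0.4)),
]

# One record per model, in a form that could be written to a Model Catalog.
metrics = [
    {"model": name, "metric_name": "accuracy", "metric": accuracy(model, LABELED_SUBSET)}
    for name, model in candidates
]
```

Each record carries the model name, the metric name, and the metric value, which is the minimum a later Champion/Challenger run needs to rank the candidates.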



4. The next stage in the workflow is the Setup Champion/Challenger run. Place one model into a Champion folder and the remaining models in the Challenger folder. This can be accomplished with the PDI Copy Files step, as in the example shown below.
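
Outside of PDI, the same folder layout can be sketched in plain Python. This is not how the Copy Files step is configured; it only shows the intended end state. The folder names `champion` and `challenger`, the file names, and the `stage_models` function are hypothetical.

```python
import os
import shutil
import tempfile


def stage_models(model_files, champion_file, root_dir):
    """Copy one model into <root>/champion and the rest into <root>/challenger."""
    champion_dir = os.path.join(root_dir, "champion")
    challenger_dir = os.path.join(root_dir, "challenger")
    os.makedirs(champion_dir, exist_ok=True)
    os.makedirs(challenger_dir, exist_ok=True)
    for path in model_files:
        target = champion_dir if path == champion_file else challenger_dir
        shutil.copy2(path, target)  # copies into the directory, keeping the file name
    return champion_dir, challenger_dir


# Demo with empty placeholder files in a temporary directory.
work_dir = tempfile.mkdtemp()
source_dir = os.path.join(work_dir, "trained")
os.makedirs(source_dir)
files = []
for name in ["model_1.pkl", "model_2.pkl", "model_3.pkl", "model_4.pkl"]:
    path = os.path.join(source_dir, name)
    open(path, "wb").close()
    files.append(path)

champion_dir, challenger_dir = stage_models(files, files[0], work_dir)
```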






5. Now, perform a Champion/Challenger run on a subset of the models: the three highest performing ones. Decide on an evaluation metric or other value for comparing which model performed best. The Champion/Challenger run can write log entries or send emails to record the results.
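
A minimal sketch of this step, assuming the catalog rows are available as dictionaries with a single higher-is-better metric: select the three best models, then write one log line per finalist. The record layout, the logger name, and the metric values are hypothetical, and Python's standard `logging` module stands in for the PDI log.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("champion_challenger")


def top_n(records, n=3):
    """Return the n records with the highest metric, best first."""
    return sorted(records, key=lambda r: r["metric"], reverse=True)[:n]


def report(records):
    """Log one line per model, best first, and return the winner's name."""
    ranked = top_n(records, n=len(records))
    for rank, rec in enumerate(ranked, start=1):
        log.info("rank=%d model=%s metric=%.3f", rank, rec["model"], rec["metric"])
    return ranked[0]["model"]


# Made-up catalog rows for illustration only.
catalog = [
    {"model": "model_1", "metric": 0.78},
    {"model": "model_2", "metric": 0.83},
    {"model": "model_3", "metric": 0.71},
    {"model": "model_4", "metric": 0.88},
]
finalists = top_n(catalog)
winner = report(finalists)
```

Sending an email instead of logging would only change the body of `report`; the ranking logic stays the same.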



The example runs the models on production data. This could be modified to use labeled or unlabeled non-production data to accommodate your requirements. PDI offers a drag-and-drop GUI that could also be used to add Drift Detection, Data Deviation Detection, Canary Pipelines and more.



6. Run the PDI Job to execute the Model Management Workflow. It will write its output to the PDI log, with the Champion/Challenger results displayed (see below):




4. Why do Organizations use PDI and Python for Deep Learning?

A dedicated data preparation tool is no longer optional when operationalizing models in a production environment; it is a necessity. Pentaho 8.2 can assist you in this effort.


  • PDI makes the implementation and execution of Model Management solutions easier using its graphical development environment for related pipelines and workflows. It is easy to customize your Model Management workflow to meet specific requirements.
  • Data Engineers and Data Scientists can collaborate on the workflow and use their skills and time more efficiently.
  • The Data Engineer can use PDI to create the workflow in preparation for the Python scripts/models received from the Data Scientist and implemented in the Model Catalog.