Introducing Plug-in Machine Intelligence
Today, the need to swiftly operationalize machine learning based solutions to meet the challenges of businesses is more pressing than ever. The ability to create, deploy and scale a company’s business logic to quickly take advantage of opportunities or react to changes is exceeding the capabilities of people and legacy thinking. Better and more machine learning is vital going forward but, more importantly, easier machine learning is essential. Leveraging an organization’s existing staff levels, business domain knowledge, and skillsets by lowering the entry into the realm of data science can dramatically expand business opportunities.
“Everytime I am in PMI, I am seeing more and more of its value!!! Great stuff!!!”
Carl Speshock - Hitachi Vantara Product Manager, Hitachi Vantara Analytics Group
The world of Machine Learning is empowering an ever-increasing breadth of applications and services from IoT to Healthcare to Manufacturing to Energy to Telecom, and everything in between. Yet the skills gap between business domain knowledge and the analytic tools used to solve these challenges needs to be bridged. People are doing their part through education, training and experimentation in order to become data scientists, but that’s only half of the equation. Making the analytic tools easier to use can help bridge this gap quickly. Throw in the ability to access and blend different data sources, cleanse, format and engineer features into these datasets, and you have a unique and powerful tool. In fact, the combination of PDI and PMI is an evolution of the PDI tool suite for deeper analytics and data integration capabilities
"While exploring solutions with a major healthcare provider that was using predictive
analytics to reduce the costs and negative patient care incurred from complications
from surgery, we were given internal access to PMI. Working with python and using
the Scikit-Learn library required 2 weeks of coding and prototyping to perform just the
machine learning model selection and training. With PDI and PMI, I was able to prep
the data, engineer in the features and train the models in about 3 hours. And, I could
include other machine learning engines from R and Weka and evaluate the results. The
combination of PDI and PMI makes machine learning solutions easier to use and maintain."
Dave Huh - Data Scientist - Hitachi Vantara Analytics Services
Hitachi Vantara Labs is excited to introduce a new PDI capability, Plug-in Machine Intelligence (PMI) to the PDI Marketplace. PMI is a series of steps for Pentaho Data Integration (PDI) that provides direct access to various supervised machine learning algorithms as full PDI steps that can be designed directly into your PDI data flow transformations. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). Installation Guides for your platform, the Developer's Document, and the sample transformation, and datasets are available here. The motivation for PMI is:
- To make machine learning easier to use by combining it with our data integration tool as a suite of easy to consume steps, and ensuring these steps guide the developer through its usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
- To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. AND! No more jumping between multiple tools with inconsistent data passing methods or, complex and tedious performance evaluation manipulation.
- To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently implements: WEKA, python scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic PDI step generation feature.
PMI also incorporates revamped versions of the Weka steps that have been originally part of the Pentaho Data Science pack. Essentially, this could be looked at as version 2 of the Data Science Pack. These include:
- PMI Forecasting for deploying time-series models learned in WEKA’s time series forecasting environment.
- PMI Scoring for deploying trained supervised and unsupervised ML models. This includes new features to support evaluation/monitoring of existing supervised models on fresh data (when class labels are available).
- PMI Flow Executor for executing arbitrary WEKA Knowledge Flow processes. This revamped step supports WEKA’s new Knowledge Flow execution engine and UI.
PMI tightly integrates, into the PDI “Data Mining” category, four popular machine learning “engines” via their machine learning libraries. These four engines are, Weka, Python, R and Spark. This first phase of PMI incorporates the supervised machine learning algorithms from these four engines from their associates machine learning libraries - Weka, Scikit-Learn, MLR and MLlib, respectively. Not all of the engines support all of the same algorithms evenly. Essentially, there are 12 new PMI algorithms added to the Data Mining category that executes across the four different engines;
- Decision Tree Classifier – Weka, Python, Spark & R
- Decision Tree Regressor – Weka, Python, Spark & R
- Gradient Boosted Trees – Weka, Python, Spark & R
- Linear Regression – Weka, Python, Spark & R
- Logistic Regression – Weka, Python, Spark & R
- Naive Bayes – Weka, Python, Spark & R
- Naive Bayes Multinomial – Weka, Python & Spark
- Random Forest Classifier – Weka, Python, Spark & R
- Random Forest Regressor – Weka, Python & Spark
- 1Support Vector Classifier – Weka, Python, Spark & R
- Support Vector Regressor – Weka, Python, & R
- Naive Bayes Incremental – Weka
As such, eventually the existing “Weka Scoring” step will be deprecated and replaced with the new “PMI Scoring” step. This step can consume (and evaluate for model management monitoring processes) any model produced by PMI, regardless of which underlying engine is employed.
I know what you’re thinking, “why implement machine learning across four engines?”. Good question. Believe it or not, data scientists are picky and set in their ways, and… not all engines and algorithms perform (think accuracy and speed) the same or yield the same accuracies for any given dataset. Many analysts, data scientists, data engineers and others that look to these tools to solve their challenges, tend to use their favorite tool/engine. With PMI, you can compare up to four different engines and up to 12 different algorithms against each other to determine the best fit for your requirement.
What Happens When Data Patterns Change?
An important benefit to PMI is the evaluation metrics used to measure accuracy is uniform and unified. Since all steps are now built into the same PMI framework - unified, the resulting metrics are all calculated uniformly and can be used to easily compare performance even across the different engine's algorithms. This unique characteristic has resulted in a whole new use case in the form of model management. Concepts around model management with the PMI framework has enabled the ability to Auto-Retrain models, Auto-Re-evaluate, Dynamic-Deploy models and so on. A concept that we have recently proven with demonstration is the Champion / Challenger model management strategy. This Champion / Challenger strategy easily allows currently active model(s) to be re-evaluated and compared with other candidate models' performance and "hot-swap' deploy the new Champion model. A more detailed discussion on Machine Learning Model Management can be found with this accompanying blog called "4-Steps to Machine Learning Model Management".
The Fail Fast Approach
Thomas Edison is quoted as saying "I have not failed. I've just found 10,000 ways that won't work.". And back in the days leading up to the invention of the light bulb, this 10,000 ways that won't work took years to iterate through. What if you could eliminate candidates in days or hours? PMI allows a “Fail Fast” approach to achieving results. With the ease of using PMI on datasets, many combinations of algorithms and configurations can be tried and testing very fast, weeding out the approaches that won’t work and narrowing down to promising candidates quickly. The days of churning on code until it finally works, then finding out the results aren’t good enough and a new approach is needed, are coming to an end.
Over the next few month, Hitachi Vantara Labs will continue to provide blogs and videos to demonstrate how to use PMI, how to extend the PMI framework and how to add additional algorithms to PMI.
It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time. It is recommended that this experimental feature be used for testing only and not used in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.