

11 Posts authored by: Ken Wood Employee

“New Hey Ray!” or “Hey Shin Ray!” or “Hey Ray!”

HeyShinRayVideoDemo3 - YouTube

(click on this link to watch a video of "Hey Shin Ray!" in action on youtube)



by Ken Wood


The word 'shin' means 'new' in Japanese.


Back in November, Mark Hall and I shared with you a uniquely interactive demonstration that used the Plugin Machine Intelligence’s (PMI) new feature, Deep Learning with DL4j, to analyze an x-ray film through voice commands and speech responses. This original demonstration was featured at Hitachi NEXT 2018 in September 2018 and was used to demonstrate how Pentaho could use deep learning in an application.


The apparatus used to build the original “Hey Ray!” had multiple practical functions. First, it was easy to build since we were up against a deadline for getting “Hey Ray!” up and running in time for the conference. Second, “Hey Ray!” is an example application of how to use PMI, deep learning, Pentaho, speech recognition, IoT, text-to-speech and a Raspberry Pi in a uniquely integrated way to solve a particular problem and demonstrate the power of deep learning and Pentaho. And finally, the steampunk themed design, or as I like to say “… a Jules Verne inspired artificial intelligence”, allowed us to let everyone interested in this application understand that this is not a product or solution to be purchased but an example of the simplicity of building advanced solutions with Pentaho and the new Hitachi Vantara Labs plugin, PMI.


As you can imagine, there were a couple of snafus during the conference. Speech recognition in a crowded, noisy environment is challenging at best. So, to overcome the loud ambient noise, I quickly built a remote control on my iPhone to mimic the voice command set that “Hey Ray!” used. Since we used the IoT protocol MQTT, it was easy to slide commands into the demonstration apparatus through a remote command queue. Another hiccup was the physical x-ray film. I only had 16 usable (purchased off eBay) x-ray film pictures to choose from, so it started to get a little monotonous rotating through these physical x-rays. While this wasn’t a major problem and it flowed nicely with the whole interactive steampunk theme of the demonstration, x-ray film is a little too antiquated, and having additional access to digital x-ray images would improve the overall experience. Lastly, the initial size and "non-slickness" of the physical apparatus and the number of components made it difficult to travel with. Since I live in San Diego, the location of the 2018 conference, it was easy for me to transport and set up this demonstration.


So, with that, I started on two major upgrades: the first was to reduce the size of the demo to something that could fit in one Pelican shipping case (or smaller), and the second was to change the format of the “Hey Ray!” demonstration. I only used the first upgrade once and it was fine, but still bulky, and it still only used physical x-ray film.


Over the 2018-2019 holidays I learned Apple Swift and created an iPhone application that used the internal camera or the internal photo library as the source of x-ray images. Overall, this is so much better. I can browse the internet looking for interesting x-ray images and save them to the photo library, or I can use the internal camera to take live pictures of physical x-rays for the more interactive experience we were originally looking for. The resulting analysis is still delivered as a text-to-speech response, but I have dropped the speech recognition portion for the time being. However, since the Swift environment has access to the Siri framework, I could still incorporate speech recognition into the overall application later.


The iPhone application is basically a user interface to the analytics server. There is still a Pentaho server running Pentaho Data Integration (PDI) and using PMI with deep learning to analyze the incoming image. In fact, the original analytic transformation is mostly the same, and the two deep learning models used to detect injuries and identify the body part being analyzed are the same as the originals. A text analysis is formed with the findings and sent back to the iPhone, where the results are spoken using the iPhone voice synthesizer, the same voice synthesizer that Siri uses. The iPhone based application is still able to store all analysis artifacts to a Hitachi Content Platform (HCP) system, and now it includes the ingestion of custom metadata (the results of the deep learning analysis - body part and probability, and injury detection and probability). It can also tweet a movie of the analysis - a rendering of the audio analysis and the x-ray image - to Twitter.


I am planning to build similar iPhone applications with other datasets to use as demonstrations and as examples of how to build artificial intelligence based applications with Pentaho and PMI, and possibly other Hitachi Vantara Labs experimental prototypes. Stay tuned for more...

Blog 1 of 2



I have a few pent up blogs that need to be written, but I’ve been waiting for a couple of Pentaho User Meeting speaking events to finish before I share them. I don’t want to "steal my own thunder" if you know what I mean.


I did start tweeting some thoughts the other day on one of these viewpoints – hidden results from machine learning models – which led me to start writing this first of 2 blogs on the matter.


Much of this idea comes from the “Hey Ray!” demonstration that uses deep learning in Pentaho with the Plugin Machine Intelligence plugin. In this demonstration, occasionally a “wrong” body part would be identified. I say wrong, because I personally see a “forearm” but “Hey Ray!” sees an “elbow”. In working with this example application and the interpretation of the results, it began to dawn on me, “maybe the neural networks in this transformation do 'see' a forearm”. Technically, both answers, forearm and elbow, are correct. Then you have to ask yourself, “what is the dominant feature of the image being analyzed?” or more importantly, “what image perspective is being analyzed?”. When looking at the results from the deep learning models used, you can see that both classes score high as a probability output. So what are you supposed to do with no "clear" result?




This is where hidden results come in. In my talks, I typically describe starting and using machine learning on problems and datasets that can be described with 2 classes or binary classes, "Yes or No", "True or False", "Left or Right", "Up or Down", and so on. Ask the question about the data for example, “is this a fraudulent transaction, Yes or No?”. With supervised machine learning, you have the answers from historical data. You are just training the models to recognize the patterns in the historical data to recognize future fraudulent transactions in live or production data.


Here's the issue: there could actually be three possible answers from binary classes – "Yes, No or Unknown". However, you have to interpret the Unknown answer with thresholds and flag the implied Unknown results. Typically, the Type:Yes_Predicted_prob outcomes from a machine learning model (assuming you are using PMI with Pentaho) are based on the halfway point – 0.5 or 50%. Any prediction above the 50% line is the Yes class and any prediction below the 50% line is the No class. This means that a record can be labeled Yes with a Type:Yes_Predicted_prob of 0.51 (51%) while its Type:No_Predicted_prob is 0.49 (49%). These statistical “ties” need to be contained in a range and this range needs to be defined as your policy.


For example, you could test and retest your models and define a range of results for Type:Yes_Predicted_prob or Type:No_Predicted_prob and the extrapolated Unknown results. The predicted results would be,


  • 4 predicted “yes” diabetic,
  • 4 predicted “no” diabetic
  • and 3 “unknown” or “undetermined” predictions.


The figure below shows the results of a PMI Scoring step on the last 11 case study records from the Pima Indians diabetes dataset study.



For experimental purposes, I split the 768 records in the diabetes case study dataset into 3 separate datasets: a training dataset, a validation dataset and a “production” dataset. For the production dataset, I use the last 11 records from the dataset, remove the “outcome” column (the study’s results) and score this dataset for the results you see below.





This example illustrates that any result predicted with a probability score between 0.30 and 0.70 (30% to 70%) should be interpreted as Unknown, while the top 30% and bottom 30% ranges should be used to define your Yes or No results. A result needs a predicted Yes score of 0.70 or higher to be interpreted as a Yes, and a score of 0.30 or lower to be interpreted as a No. All other results should be considered Unknown.
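As a rough sketch of this policy, the snippet below maps a predicted Yes probability to Yes, No or Unknown using the 0.30 and 0.70 cut-offs, and compares that with simply taking the larger probability. The function names and the sample scores are hypothetical and are not the values from the scoring figure.

```python
def classify_with_unknown(yes_prob, low=0.30, high=0.70):
    """Map a predicted Yes probability to "Yes", "No" or "Unknown".

    The low/high cut-offs are the policy range described in the text.
    """
    if yes_prob >= high:
        return "Yes"
    if yes_prob <= low:
        return "No"
    return "Unknown"


def classify_by_max(yes_prob):
    """Baseline: pick whichever of the two classes has the larger probability."""
    return "Yes" if yes_prob > 0.5 else "No"


# Hypothetical Yes probabilities, invented for illustration only.
sample_scores = [0.92, 0.55, 0.12, 0.70, 0.31]

labels = [classify_with_unknown(p) for p in sample_scores]
baseline = [classify_by_max(p) for p in sample_scores]

for score, label, base in zip(sample_scores, labels, baseline):
    print(f"{score:.2f} -> policy: {label:7s} max-prob: {base}")
```

The two middling scores (0.55 and 0.31) are forced into Yes and No by the max-probability rule, but are flagged as Unknown once the range is applied.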


If you only used the type_MAX_prob to determine the outcome, the results would be,


  • 5 predicted “yes” diabetic,
  • 6 predicted “no” diabetic


So, use thresholds and ranges to define your results from machine learning models. Do not just go by the type_MAX_prob, Type:No_Predicted_prob or Type:Yes_Predicted_prob. By doing so, you will discover hidden results in binary classes.


Hidden results are more pronounced when working with datasets with more than 2 classes. Multi-class datasets may actually include combinations of answers as well as unknown results. This will be the subject of the next blog. Stay tuned!


Easy to Use Deep Learning and Neural Networks with Pentaho

By Ken Wood and Mark Hall



Hitachi Vantara Labs is excited to release a new version of the experimental plugin, Plugin Machine Intelligence version 1.4. Along with several minor fixes and enhancements is the addition of a new execution engine for performing deep learning and executing other machine learning algorithms using neural networks. The whole mission of Pentaho and Hitachi Vantara Labs is to make complex technology simple to use and deploy, and the Plugin Machine Intelligence (PMI) is a huge advancement towards making machine learning and artificial intelligence part of this mission.


Back in October, I shared a glimpse of what's coming with a blog, Artificial Intelligence with Pentaho, that describes a demonstration using artificial intelligence elements. PMI and Pentaho Data Integration with deep learning is the main artificial intelligence capability that enables that demonstration. Feel free to ask us more questions about the use of deep learning models in PDI transformations. We will also be blogging more details and "how to" posts about that demonstration and how to build some of those elements with PDI.


We call this plugin "experimental" because it is a research project from HV Labs and is released openly for the Pentaho community and users to try out and experiment with. We refer to this as "early access to advanced, experimental capabilities". As such, it is not a supported product or service at this time.


Deep learning is a recent addition to the artificial intelligence domain of machine learning. PMI initially focuses on supervised machine learning schemes which means there is a continuous or categorical target variable that is being "learned" from a dataset of labeled training data. This deep learning integration is also a supervised learning scheme.




The new release of PMI v1.4 can be downloaded and installed from the PDI and Spoon Marketplace. If you are already running a previous version of PMI, check the installation documentation for guidance on getting your system ready for PMI v1.4. If you are not using PMI at all, the Marketplace will install the new PMI v1.4 for you. During the PMI v1.4 installation from the Marketplace, PMI will automatically install WEKA, Spark MLlib and Deep Learning for Java (DL4j) as included machine learning engines. You will need to install and set up Python with the scikit-learn library, and R with the Machine Learning with R (MLR) library, yourself. Once they are installed and set up correctly, the installation process will configure them into PMI. Again, check the installation documentation for your system.


This means there are now 5 machine learning execution engines integrated in PMI for PDI providing you with many options for training, building, evaluating and executing machine learning models. In fact, some of the existing machine learning algorithms that are available for WEKA, scikit-learn, MLlib and MLR, can also execute on DL4j, like Logistic Regression, Linear Regression and Support Vector Classifier. There are also 2 new machine learning algorithms "exposed" from the scikit-learn, Weka and MLR libraries. They are the Multi-layer Perceptron Classifier and a Multi-layer Perceptron Regressor. These algorithms were exposed from the scikit-learn library to help us write some additional developer documentation on how to expose algorithms to the PMI framework.


Of course, the most exciting part of this release is the ability to train, build, evaluate and execute deep learning models with PDI. Stated another way, the ability to analyze unstructured data with PDI. In addition, by using DL4j, you can train your deep learning models using a locally attached graphics processing unit (GPU) that is either internal to your system or externally attached, like an eGPU. DL4j uses the CUDA API from NVIDIA and thus only uses NVIDIA GPUs at this time. Training time for image processing on a GPU is much shorter than training time on a CPU.






There is a lot of reference material available to help you get started with PMI, including some new installation documents that cover how to set up PMI v1.4 and how to set up your GPU and CUDA environment for DL4j. The list of materials and references can be found at this location.






It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time. It is recommended that this experimental feature be used for testing, educational and exploration purposes only. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.

Please Note!

Due to remodeling construction in the lobby area where this sensor is located, the data feeds are down. As such, this phase of this project is complete. Sorry for the abrupt closure, but this is an unforeseen and sudden event. However! We are planning a new sensor setup soon in a convenience store environment.

Stay tuned!



In addition to the LiDAR Motion Sensor real-time data feed from the 8th floor lobby of the HLDS facility, we've added another sensor to the configuration. The new real-time sensor data comes from a prototype sensor that is being developed by the same Hitachi LG Data Systems (HLDS) LiDAR development team. This sensor is a Particulate Matter sensor, or dust sensor. We thought it would be an interesting combination of sensor data to detect human traffic AND the amount of dust or particles being "kicked up" from this traffic. The lobby is a carpeted area.



In Korea, there is an increasing concern with particulate matter and pollution in the environment coming from their neighboring country. This new sensor allows monitoring of air quality by the detection of particulate matter. There is a Particulate Matter, or PM, standard for defining dust in the air. While the eventual sensor device will be used both indoors and outdoors, today we are deploying the sensor indoors and making the data from this sensor available to everyone to analyze. In the future, we will deploy an outdoor sensor to monitor the air pollution in the city of Seoul.


The PM sensor data uses MQTT to publish its data. The real-time data feed can be accessed at the following MQTT broker and topic.



There is a problem with the original broker and we have moved this
data stream to a new broker. Please note the new broker URL below.
Sorry for the inconvenience.



Broker location - tcp://


Topic - hlds/korea/8thFloor/lobbyDust



The data streamed from this sensor is a JSON-formatted message that has the following definition,


  • Event: AirQuality - the event type
  • Time: TimeStamp - time of the sample
  • PM1_0: Particulate Matter at 1 micrometer and smaller - quantity of sample
  • PM2_5: Particulate Matter at 2.5 micrometer and smaller - quantity of sample
  • PM10: Particulate Matter at 10 micrometer and smaller - quantity of sample
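As a minimal sketch of consuming one of these messages, the snippet below decodes a JSON payload with the five fields listed above. The field names come from the list. The sample values, the timestamp format and the function name are assumptions made up for illustration, since the real payload is not reproduced here.

```python
import json

# Hypothetical payload: field names follow the definition above,
# but the values and the timestamp format are invented.
sample_payload = (
    '{"Event": "AirQuality", "Time": "2018-06-01T09:30:00", '
    '"PM1_0": 4, "PM2_5": 7, "PM10": 9}'
)


def parse_air_quality(payload):
    """Decode one dust sensor message into a flat dict of readings."""
    record = json.loads(payload)
    if record.get("Event") != "AirQuality":
        raise ValueError("unexpected event type: %r" % record.get("Event"))
    return {
        "time": record["Time"],           # kept as a string; format is an assumption
        "pm1_0": float(record["PM1_0"]),  # particles 1 micrometer and smaller
        "pm2_5": float(record["PM2_5"]),  # particles 2.5 micrometers and smaller
        "pm10": float(record["PM10"]),    # particles 10 micrometers and smaller
    }


reading = parse_air_quality(sample_payload)
print(reading)
```

In a PDI transformation the same decoding would normally be done with a JSON input step after the MQTT consumer. The Python version is only meant to make the message layout concrete.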


Here is a screenshot of MQTT Spy inspecting these messages.



What kind of Pentaho transformations, dashboards and analysis can you create with this data? Is there a correlation between human traffic through the lobby and the amount of dust detected? We want to see your creations. Please share your work in the comments area below, or write up your own blog and share it with us. Who knows, there might be something in it for you.

There are currently 3 Installation Guides to accompany the Plug-In Machine Intelligence (PMI) plug-in and one Developers Guide. Also, the demonstration transformations and sample datasets are available. These sample transformations and sample datasets are for demonstration and educational purposes. They are downloadable at the following,


Download Link - Document Name
  • PMI_1.4_Installation_Linux.pdf - Installation guide for the Linux OS platform.
  • PMI_1.4_Installation_Windows.pdf - Installation guide for the Windows OS platform.
  • PMI_1.4_Installation_Mac_OSX.pdf - Installation guide for Mac OS X platform.
  • PMI_Developer_Docs.pdf - A developer's guide to extending and contributing to the PMI framework.
  • PMI_MLChampionChallengeSamples.zip - This zip file contains all of the sample transformations, sample folder layouts and datasets for running the Machine Learning demonstrations and the Machine Learning Model Management samples. This is for demonstration and educational purposes.
  • PMI_AddingANewScheme.pdf - This document describes the development process of exposing the Multi-Layer Perceptron (MLP) regressor and classifier in the Weka and scikit-learn engines.

Please Note!

Due to remodeling construction in the lobby area where these sensors are located, the data feeds are down. As such, this phase of this project is complete. Sorry for the abrupt closure, but this is an unforeseen and sudden event. However! We are planning a new sensor setup soon in a convenience store environment and will include additional data beyond Human Direction Detection. Stay tuned!



REAL! Real-time IoT data stream available for Pentaho Analysis and Visualization

Everyone knows how hard it is to get access to real-time data feeds. Well, here is a chance to access real-time data using a 3D LiDAR motion sensor.





There has been a lot of talk recently about the new 3D LiDAR (Light Detection and Ranging) motion sensor from Hitachi LG Data Systems (HLDS). The 3D LiDAR is a Time of Flight (ToF) motion sensor that calculates distance by measuring the time it takes for an infrared laser to emit light and receive the reflection back. Because it measures a pixel-by-pixel image via the sensor, it shows the shape, size and position of a human and/or an object in 3D at 10 to 30 fps (frames per second), so it is possible to detect and track the motion, direction, height, volume, etc. of humans or objects.
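To make the Time of Flight idea concrete, here is a small worked example. The light travels to the object and back, so the distance is half the round trip time multiplied by the speed of light. The function name and the 20 nanosecond sample value are illustrative assumptions and are not taken from the HLDS sensor specification.

```python
# Speed of light in a vacuum, in meters per second (exact by definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def tof_distance_m(round_trip_seconds):
    """Distance to the reflecting object for a measured round trip time."""
    # Divide by 2 because the pulse covers the distance twice: out and back.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2


# Hypothetical measurement: a reflection received 20 nanoseconds after emission.
distance = tof_distance_m(20e-9)
print(f"{distance:.3f} m")
```

A round trip of 20 nanoseconds corresponds to an object roughly 3 meters away, which gives a feel for the timing precision a ceiling mounted sensor needs.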


Unfortunately, general access to this sensor is a bit difficult to come by at the moment, and setting one up in a useful location, like a bank, retail store or casino, is also a challenge. So, in a partnership with HLDS, we have set up a LiDAR configuration at a company lobby on the 8th floor at HLDS in Seoul, South Korea and will make the real-time output stream available to Hitachi Vantara Pentaho developers to use and develop against. The real-time data stream will be published from an MQTT broker at,



There is a problem with the original broker and we have moved this
data stream to a new broker. Please note the new broker URL below.
Sorry for the inconvenience.


Broker location – tcp://

Topic – hlds/korea/TOFData



An example .json formatted data record published from this broker and topic looks like this,




The data stream will be published in clear text. The data is not sensitive. We are looking for real-time dashboards, visuals, analytics and integration transformations.


To help start this off, there is a collection of transformations to start from here.





The setup scenario is a “Human Direction Detection” challenge using the filter processor "Human Counter Pro". There are two zones being monitored by the 2 ceiling mounted LiDARs (the two LiDARs are grouped together to cover the wide area). The first zone is the entrance area called “entrance” and the second zone is the lobby area called the “hallway”. Several things can happen in this configuration scenario,


  • People arrive (out of the elevator) and enter the “entrance” area, then they enter the “hallway” area, and are either walking towards the South Wing doorway or the North Wing doorway. This is the most common scenario and is basically employees arriving on their floor and heading to their work area.
    • This scenario can also happen in reverse order where people enter in the "hallway" from either the North Wing or South Wing and enter the "entrance" signifying leaving.
  • Someone enters and stays in the “hallway” for a period of time. Someone or others arrive in the entrance area and the group heads to one of the doorways. This scenario is basically an employee waiting for visitors to be escorted to a meeting or other activity.
  • Someone or a group crosses the “hallway” from the South Wing to the North Wing, or from the North Wing to the South Wing. This is a scenario where people are crossing over from one side of the building to the other side.
  • Someone enters the “hallway” area and stays there for a period of time, then heads to one of the doorways. In this scenario, someone is probably looking at one of the live demos or items in the lobby’s display area.
  • There could be other scenarios that you can identify with the data from the LiDARs, these are just a few that we came up with.






The published data stream will have identified and tracked people as they move into the “entrance” area and then move to the “hallway” area. It includes timing information for when each person enters a zone (Appear) and when they leave it (Disappear). You will need to calculate the duration spent in each zone yourself.
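One way to derive the time spent in a zone is to pair each Appear event with the matching Disappear event for the same person and zone. The sketch below assumes a simplified, hypothetical record layout (`id`, `zone`, `event`, `time` in seconds), because the actual message format is not reproduced here. Only the Appear and Disappear event names and the zone names come from the description above.

```python
def zone_durations(events):
    """Return {(person_id, zone): seconds} from Appear/Disappear events."""
    appeared = {}   # (person_id, zone) -> time of the open Appear event
    durations = {}  # (person_id, zone) -> accumulated seconds in the zone
    for event in sorted(events, key=lambda e: e["time"]):
        key = (event["id"], event["zone"])
        if event["event"] == "Appear":
            appeared[key] = event["time"]
        elif event["event"] == "Disappear" and key in appeared:
            elapsed = event["time"] - appeared.pop(key)
            durations[key] = durations.get(key, 0.0) + elapsed
    return durations


# Hypothetical events for one tracked person; all field names are assumptions.
sample_events = [
    {"id": 1, "zone": "entrance", "event": "Appear", "time": 0.0},
    {"id": 1, "zone": "entrance", "event": "Disappear", "time": 4.5},
    {"id": 1, "zone": "hallway", "event": "Appear", "time": 4.5},
    {"id": 1, "zone": "hallway", "event": "Disappear", "time": 16.5},
]

durations = zone_durations(sample_events)
print(durations)
```

A Disappear event with no earlier Appear for the same person and zone is ignored here. In a live stream you may prefer to log or count those cases instead.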


Lastly, remember that South Korea is 16 hours ahead of Pacific Daylight Time (17 hours ahead of Pacific Standard Time), so the work day and work week activity is very skewed. It will be busy in the evening Pacific time, and the Korean weekend begins on Friday Pacific time.
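To see the skew in practice, the sketch below converts a Seoul timestamp to Pacific time using fixed UTC offsets. It assumes Pacific Daylight Time (UTC-7), which gives the 16 hour difference. The function name and the sample date are illustrative only.

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9), "KST")   # Korea Standard Time, UTC+9
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time, UTC-7


def seoul_to_pacific_daylight(seoul_local):
    """Convert a naive Seoul wall-clock time to Pacific Daylight Time."""
    return seoul_local.replace(tzinfo=KST).astimezone(PDT)


# Start of the Seoul work week: Monday, June 4, 2018 at 09:00.
monday_start = seoul_to_pacific_daylight(datetime(2018, 6, 4, 9, 0))
print(monday_start.strftime("%A %Y-%m-%d %H:%M %Z"))
```

The Seoul work week therefore starts on Sunday at 17:00 Pacific Daylight Time. During Pacific Standard Time (UTC-8) the same moment would fall at 16:00.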


You can use an MQTT inspection tool like "MQTT Spy" to explore and examine the data coming from the sensor.




Some background


Originally, this was going to be set up just for me. Then it was discussed that, since this is an MQTT design, we could open it up company wide. Access to real world IoT data is hard to come by.


There are other Processor Filters in the LiDAR device middleware suite that provide different functions from the sensor. We are starting with the Human Counter Pro because this one publishes via MQTT. If this is successful, the other Processor Filters will also be integrated with MQTT as a simple mechanism for integrating Pentaho with the LiDAR sensor, and with future physical sensors and Processor Filters.


No special plugin development is required to integrate a state-of-the-art motion sensor with Pentaho. We’ve had access to MQTT steps for PDI for a few years now. There are a few blogs in the Vantara Community here and here describing how to use MQTT with Pentaho.


Some analysis ideas,


  • How many people entered the “entrance” only and then “Disappeared” (wrong floor?)?
  • How many people exited from “entrance”?
  • How many people went to North Wing?
  • How many people went to South Wing?
  • How many people crossed the “hallway”?
  • How long did people stay in the “hallway”?
  • Most people in the “hallway” at what times of the day?
  • Does the time of day matter?
  • What reports, visuals, dashboards and/or real-time dashboards can be created from this data?


Please share what you come up with in the comments section and/or submit your own write-up or blog. Who knows, there might be some recognition in it for you. Enjoy!



What Can You Do with Deep Learning in Pentaho?


By Ken Wood and Mark Hall


For those of you that have installed and are using the Plugin Machine Intelligence (PMI) plugin that Hitachi Vantara Labs released to the Pentaho Marketplace back in March 2018, get ready for an exciting new update. This fall, we will release PMI version 1.4 as an update to the existing PMI, which is an experimental plugin for Pentaho Data Integration (PDI). Our initial release of PMI focused on classical machine learning and the ability to build, use and manage machine learning models from four popular machine learning libraries – Python’s scikit-learn, R’s Machine Learning with R, Spark’s Machine Learning library and WEKA.


I say classical machine learning because, traditionally, classical machine learning has had its best success on structured data. With the next release of PMI, we integrate a new machine learning library, or what we refer to as an “execution engine” – Deep Learning for Java (DL4J). This means PMI can now perform deep learning operations - training, validating, testing, building, evaluating and using deep learning models - directly from PDI.




Deep Learning is gaining lots of attention in the industry for its ability to operate on unstructured data like images, video, audio etc. Deep Learning is a recent addition to the Artificial Intelligence domain of machine learning, though technically the technology has been around for quite some time.




Deep learning to some degree gets its name from the deep, complex, hidden neural network layers the technology creates to analyze data. To be clear, both machine learning and deep learning can operate on both structured and unstructured data. It’s just that the current general practice, with the greater success rate, is to apply deep learning to unstructured data and classical machine learning to structured data. That is the state of understanding at this time.


The reason we’re blogging about this now is because we showcased and demonstrated PMI v1.4 with deep learning at Hitachi NEXT 2018 in San Diego. Along with a series of one-on-one workshops showing the new deep learning step with PDI and PMI v1.4, we demonstrated an example application using deep learning in an interactive apparatus that uses two deep learning models in a PDI transformation, and then uses PDI to drive the entire application.




This PDI transformation contains several parts when called,

  • The “Data Capture and Data Preparation” phase
    • This portion of the transformation starts by narrating what the entire transformation will do
    • Then communicates with a Raspberry Pi to capture a picture of a physical x-ray - essentially analog to digital conversion
    • Information about the image is then transformed into image metadata. Basically, an in-memory location of the actual digital image
  • The PDI transformation then executes the two deep learning models on the x-ray image. The two deep learning models vectorize the image into usable numbers, determine the probability of identifying the body part focused on in the image and detect whether an injury or anomaly exists.
  • The results of the two deep learning models are the probabilities of,
    • A multi-class classifier – Shoulder, Humerus, Elbow, Forearm and Hand
    • And a 2-class classifier, injury or anomaly detected – yes or no
    • These probabilities are numbers between 0 and 1
  • The next phase of the PDI transformation, “Results Preparation” takes the output probabilities (numbers between 0 and 1) from the deep learning models and prepares the result for use.
    • Determine the most likely value – max value is the “answer”
    • Format the 5 decimal digit value into a percentage and into a string
      • This formatting allows the next phase to say “Forty seven percent” instead of "4, 7, percent sign"
  • The last phase, “Confidence Dialog Preparation”, builds logic for the different speaking phrases and applies confidence to the result as an analysis.
    • For example, instead of saying “There is a 98% chance that this elbow is injured”, just say “I detect that this elbow is injured”. At 98%, we’ve determined that it is injured, but at 47%, we’re not too sure, so the spoken analysis would be “I detect a 47% probability that this elbow is injured, you might want to have it checked out”.
    • This confidence logic applies to both the body part identification and the injury detection parts of the spoken analysis.
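The “Results Preparation” and “Confidence Dialog Preparation” phases above can be sketched in a few lines. This is not the actual PDI transformation. The function names, the 0.90 confidence cut-off and the sample probabilities are assumptions chosen for illustration, while the spoken phrases follow the examples in the list.

```python
def most_likely(probabilities):
    """Return the (label, probability) pair with the highest probability."""
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


def as_percent(probability):
    """Format a 0..1 probability as a string a speech synthesizer reads naturally."""
    return "%d percent" % round(probability * 100)


def injury_phrase(body_part, injury_prob, confident_at=0.90):
    """Pick the spoken phrase; the confident_at cut-off is an assumed value."""
    if injury_prob >= confident_at:
        return "I detect that this %s is injured." % body_part
    return (
        "I detect a %s probability that this %s is injured, "
        "you might want to have it checked out."
        % (as_percent(injury_prob), body_part)
    )


# Hypothetical output of the multi-class body part model.
body_part_probs = {
    "Shoulder": 0.02, "Humerus": 0.03, "Elbow": 0.81, "Forearm": 0.12, "Hand": 0.02,
}
part, score = most_likely(body_part_probs)

unsure = injury_phrase(part.lower(), 0.47)
confident = injury_phrase(part.lower(), 0.98)
print(unsure)
print(confident)
```

Writing the value as "47 percent" rather than "47%" leaves the pronunciation of the number to the text-to-speech engine and avoids the symbol being spelled out.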


A diagram of the "Deep Learning Pipeline" can be seen here.

  • We use a "Speech Recognition Module" written in Python to capture spoken phrases and determine the actions to be taken.
  • In case the environment is too noisy for speech recognition, a special remote control application is available to manually execute the "Hey Ray!" command set.
    • A main transformation is used to interpret the incoming tasks and orchestrate the execution of other transformations as needed.
      • The tasks include,
        • Introduction narration
        • Help on how to use "Hey Ray!"
        • Analyze the x-ray film and provide the results speech
        • Save the current analysis session to the Hitachi Content Platform (HCP)
          • During this operation, the content, x-ray image and analysis phrases, are converted into a single image movie file, then all of the content is saved to HCP
        • You can have "Hey Ray!" tweet the movie file
        • Provide insightful thoughts and opinions
        • And finally, "Hey Ray!" can tell radiologist jokes




We call this demonstration “Hey Ray!”. “Hey Ray!” is just an example of applying deep learning to a situation. We came up with "Hey Ray!" because of the dataset we had access to, which just happens to be x-ray images. We could have created something with flowers, food, automobiles, etc. We also decided to speak the results and add speech recognition for demonstration and "Wow Factor" for the Hitachi NEXT conference. Also, we felt that creating charts of probability distributions of numbers between 0 and 1 would take too long to explain, so why not have the demonstration state the results? This demonstration turned out to be highly interactive as the attendees could select an x-ray picture, insert it into the x-ray viewing screen and tell the device to "Analyze the x-ray".






We will be providing more blogs about PMI 1.4 with deep learning and other information on the artificial intelligence that goes into “Hey Ray!” in the coming months to help support this release. Stay tuned!


What can you do with machine learning and now deep learning in Pentaho?





We apologize! This blog from Mark Hall went offline when the website was replaced with the site. We know many of our followers and supporters in the Pentaho community, as well as the data science community, still refer to this great piece of work. So, here it is back online at its new location here in the Hitachi Vantara Community. Hopefully, this wasn't a huge inconvenience. Thank you for your understanding.





by Mark Hall | March 14, 2017



The power of Pentaho Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. However, perhaps less well known is how PDI as a platform, with all its data munging[1] power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM[2] life-cycle for the data science practitioner: generic data preparation/feature engineering, predictive modeling, and model deployment.



By "generic data preparation" we are referring to the process of connecting to (potentially) multiple heterogeneous data sources and then joining, blending, cleaning, filtering, deriving and denormalizing data so that it is ready for consumption by machine learning (ML) algorithms. Further ML-specific data transformations, such as supervised discretization, one-hot encoding etc., can then be applied as needed in an ML tool. For the data scientist, PDI can be used to remove the drudgery of manually repeating similar data preparation processes from one dataset to the next. Furthermore, Pentaho's Streamlined Data Refinery can be used to deliver modeling-ready datasets to the data scientist at the click of a button, removing the need to burden the IT department with requests for such data.

The CRISP-DM Process


When it comes to deploying a predictive solution, PDI accelerates the process of operationalizing machine learning by working seamlessly with popular libraries and languages, such as R, Python, WEKA and Spark MLlib. This allows output from team members developing in different environments to be integrated within the same framework, without dictating the use of a single predictive tool.


In this blog, we present a common predictive use case, and step through the typical workflow involved in developing a predictive application using Pentaho Data Integration and Pentaho Data Mining.


Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their web site, and ship goods directly to the customer. Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database. Orders, as they come in, are stored in a MongoDB database. There is also a report of historical instances of fraud contained in a CSV spreadsheet.


Step 1



With the goal of preparing a dataset for ML, we can use PDI to combine these disparate data sources and engineer some features for learning from it. The following figure shows a transformation demonstrating an example of just that, and includes some steps for deriving new fields. To begin with, customer data is joined from several relational database tables, and then blended with transactional data from MongoDB and historical fraud occurrences contained in a CSV file. Following this, there are steps for deriving additional fields that might be useful for predictive modeling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.
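To make the three derived fields concrete, here is a minimal Python sketch of the same logic outside PDI. It is only an illustration, not the transformation shown in the figure: the input field names (date_of_birth, order_timestamp, billing_zip, shipping_zip), the output names customer_age and order_hour, and the helper derive_features are all assumptions made for this example.

```python
from datetime import date, datetime


def derive_features(order: dict, today: date) -> dict:
    """Add the three derived fields to a single blended order record."""
    dob = order["date_of_birth"]
    # Age in whole years: subtract one if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return {
        **order,
        "customer_age": age,
        "order_hour": order["order_timestamp"].hour,
        "billing_shipping_zip_equal": order["billing_zip"] == order["shipping_zip"],
    }


# Hypothetical record, as it might look after the join/blend steps.
example = {
    "date_of_birth": date(1990, 6, 15),
    "order_timestamp": datetime(2017, 3, 14, 23, 5),
    "billing_zip": "92101",
    "shipping_zip": "10001",
}
features = derive_features(example, today=date(2017, 3, 14))
```

In PDI these would typically be separate calculation steps; the point here is only that each derived field is a simple, deterministic function of the blended row.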



Blending data and engineering features


This process culminates in flattened data (a data scientist's preferred data shape) being output in both CSV and ARFF (Attribute-Relation File Format), the latter being the native file format used by PDM (Pentaho Data Mining, AKA WEKA). We end up with 100,000 examples (rows) containing the following fields:
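For readers who have not seen ARFF before, the following sketch shows the general shape of the format: a relation name, one attribute declaration per field, then comma-separated data rows. The helper to_arff, the attribute subset and the nominal labels are assumptions for illustration; the sketch does not quote or escape values, which a real writer would need to do.

```python
def to_arff(relation: str, attributes: list, rows: list) -> str:
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical subset of the fraud dataset's fields.
arff_text = to_arff(
    "fraud",
    [
        ("customer_age", "numeric"),
        ("order_hour", "numeric"),
        ("billing_shipping_zip_equal", ["true", "false"]),
        ("reported_as_fraud_historic", ["yes", "no"]),
    ],
    [[26, 23, "false", "yes"]],
)
```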



























From this list, for the purposes of predictive modeling, we can drop the customer name, ID fields, email addresses, phone numbers and physical addresses. These fields are unlikely to be useful for learning purposes and, in fact, can be detrimental due to the large number of distinct values they contain.


Step 2




So, what does the data scientist do at this point? Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations, followed by applying quick techniques for assessing the relationship between individual attributes (fields) and the target of interest which, in this example, is the "reported_as_fraud_historic" field. Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining - i.e. the process of determining which predictive techniques are going to give the best result for a given problem.


The following figure shows an ML process, for initial exploration, designed in WEKA's Knowledge Flow environment. It demonstrates three main exploratory activities:


    1. Assessment of variable importance. In this example, the top five variables most correlated with "reported_as_fraud_historic" are found, and can be visualized as stacked bar charts/histograms.
    2. Knowledge discovery via decision tree learning to find key variable interactions.
    3. Initial predictive evaluation. Four ML classifiers—two from WEKA, and one each from Python Scikit-learn and R respectively—are evaluated via 10-fold cross validation.



Exploratory Data Mining


Visualization of the top five variables (ordered from left-to-right, top-to-bottom) correlated with fraud show some clear patterns. In the figure below, blue indicates fraud, and red the non-fraudulent orders. There are more instances of fraud when the billing and shipping zip codes are different. Fraudulent orders also tend to have a higher total dollar value attached to them, involve more individual items and be perpetrated by younger customers.



Top Drivers of Fraud


The next figure visualizes attribute interactions in a WEKA decision tree viewer. The tree has been limited to a depth of five in order to focus on the strongest (most likely to be statistically stable) interactions – i.e., those closest to the root of the tree. As expected, given the correlation analysis, the attribute "billing_shipping_zip_equal" forms the decision at the root of the tree. Inner (decision) nodes are shown in green, and predictions (leaves) are white. The first number in parentheses at a leaf shows how many training examples reach that leaf; the second how many were misclassified. The numbers in brackets are similar, but apply to the examples that were held out by the algorithm to use when pruning the tree. Variable interactions can be seen by tracing a path from the root of the tree to a leaf. For example, in the top half of the tree, where billing and shipping zip codes are different, we can see that young, first-time customers, who spend a lot on a given order (of which there are 5,530 in the combined training and pruning sets), have a high likelihood of committing credit card fraud.



Variable Interactions


The last part of the exploratory process involves an initial evaluation of four different supervised classification algorithms. Given that our visualization shows that decision trees appear to be capturing some strong relationships between the input variables, it is worthwhile including them in the analysis. Furthermore, because WEKA has no-coding integration with ML algorithms in the R[4] statistical software and the Python Scikit-learn[5] package, we can get a quick comparison of decision tree implementations from all three tools. Also included is the ever-popular logistic regression learner. This will give us a feel for how well a linear method does in comparison to the non-linear decision trees. There are many other learning schemes that could be considered; however, trees and linear functions are popular starting points.



Four Different Supervised Classification Algorithms


The WEKA Knowledge Flow process captures metrics relating to the predictive performance of the classifiers in a Text Viewer step, and ROC curves - a type of graphical performance evaluation - are captured in the Image Viewer step. The figure below shows WEKA's standard evaluation output for the J48 decision tree learner.



Evaluation Output for the J48 Decision Tree


It is beyond the scope of this article to discuss all the evaluation metrics shown in the figure but, suffice to say, decision trees appear to perform quite well on this problem. J48 only misclassifies 2.7% of the instances. The Scikit-learn decision tree's performance is similar to that of WEKA's J48 (2.63% incorrect), but the R "rpart" decision tree fares worse, with 14.9% incorrectly classified. The logistic regression method performs the worst with 17.3% incorrectly classified. It is worth noting that default settings were used with all four algorithms.


For a problem like this, where a fielded solution would produce a top-n report listing those recently received orders that have the greatest likelihood of being fraudulent according to the model, we are particularly interested in the ranking performance of the different classifiers. That is, how well each does at ranking actual historic fraud cases above non-fraud ones when the examples are sorted in descending order of predicted likelihood of fraud. This is important because we'll want to manually investigate the cases that the algorithm is most confident about, and not waste time on potential red herrings. Receiver Operating Characteristic (ROC) curves graphically depict ranking performance, and the area under such a curve is a statistic that conveniently summarizes the curve[6]. The figure below shows the ROC curves for the four classifiers, with the number of true positives shown on the y axis and false positives shown on the x axis. Each point on the curve, increasing from left to right, shows the number of true and false positives in the n rows taken from the top of our top-n report. In a nutshell, the more a curve bulges towards the upper left-hand corner, the better the ranking performance of the associated classifier is.
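The ranking view of ROC can be sketched in a few lines of plain Python. The example below is an illustration only (it is not how WEKA computes its curves): roc_points walks down a top-n list counting true and false positives, and auc_by_ranking computes the area under the curve directly from its ranking interpretation. The scores and labels are made up, and tied scores are not grouped when building the curve points.

```python
def roc_points(scores, labels):
    """Walk down the list sorted by descending score, counting (false, true) positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    points = [(0, 0)]
    for _, is_fraud in ranked:
        if is_fraud:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))
    return points


def auc_by_ranking(scores, labels):
    """Fraction of (fraud, non-fraud) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up predicted fraud likelihoods and historic outcomes (1 = fraud).
scores = [0.95, 0.80, 0.70, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
auc = auc_by_ranking(scores, labels)
```

A classifier that ranked every fraud case above every non-fraud case would score 1.0 here, while random ordering would be expected to score about 0.5.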



Comparing Performance with ROC Curves


At this stage, the practitioner might be satisfied with the analysis and be ready to build a final production-ready model. Clearly decision trees are performing best, but is there a (statistically) significant difference between the different implementations? Is it possible to improve performance further? There might be more than one dataset (from different stores/sites) that needs to be considered. In such situations, it is a good idea to perform a more principled experiment to answer these questions. WEKA has a dedicated graphical environment, aptly called the Experimenter, for just this purpose. The Experimenter allows multiple algorithm and parameter setting combinations to be applied to multiple datasets, using repeated cross-validation or hold-out set testing. All of WEKA's evaluation metrics are computed and results are presented in tabular fashion, along with tests for statistically significant differences in performance. The figure below shows the WEKA Experimenter configured to run a 10 x 10-fold cross-validation[3] experiment involving seven learning algorithms on the fraud dataset. We've used decision tree and random forest implementations from WEKA and Scikit-learn, and gradient tree boosting from WEKA, Scikit-learn and R. Random forests and boosting are two ensemble learning methods that can improve the performance of decision trees. Parameter settings for implementations of these in WEKA, R and Python have been kept as similar as possible to make a fair comparison.
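As a rough picture of what a 10 x 10-fold cross-validation involves, the sketch below generates the train/test splits in plain Python. It is an assumption-laden illustration rather than the Experimenter's implementation: the function name repeated_kfold is hypothetical, and the folds are not stratified by class.

```python
import random


def repeated_kfold(n_rows: int, k: int = 10, repeats: int = 10, seed: int = 1):
    """Yield (run, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for run in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # reshuffle the data before every run
        for fold in range(k):
            test = indices[fold::k]  # every k-th shuffled row, starting at `fold`
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield run, fold, train, test


splits = list(repeated_kfold(n_rows=1000))
```

Each learning scheme is trained and evaluated once per split, which is where the 100 models and 100 sets of performance statistics per scheme come from.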


The next figure shows the analysis of the results once the experiment has completed. Average area under the ROC curve is compared, with the J48 decision tree classifier set as the base comparison on the left. Asterisks and "v" symbols indicate where a scheme performs significantly worse or better than J48 according to a paired corrected t-test. Although Scikit-learn's decision trees are less accurate than J48, when boosted they slightly (but significantly) outperform boosted versions in R and WEKA. However, when analyzing elapsed time, they are significantly slower to train and test than the R and WEKA versions.
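The "corrected" part of the t-test matters because the training sets in repeated cross-validation overlap heavily, so the 100 per-split results are not independent. A common correction (the corrected resampled t-test of Nadeau and Bengio, which we assume here is the one meant) inflates the variance term. The sketch below shows that statistic; the function name and the score differences are hypothetical.

```python
import math
import statistics


def corrected_paired_t(diffs, test_fraction):
    """Corrected resampled t statistic for per-split score differences.

    diffs: one difference (scheme A minus scheme B) per train/test split.
    test_fraction: n_test / n_train for each split (1/9 for 10-fold CV).
    """
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    variance = statistics.variance(diffs)  # sample variance (n - 1 denominator)
    # The uncorrected test uses only 1/n here; the extra term widens the interval.
    return mean_diff / math.sqrt((1.0 / n + test_fraction) * variance)


# Made-up AUC differences from 10 x 10-fold cross-validation (100 splits).
diffs = [0.01, 0.03] * 50
t_corrected = corrected_paired_t(diffs, test_fraction=1 / 9)
t_uncorrected = statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / len(diffs))
```

With these numbers the corrected statistic is much smaller than the uncorrected one, so fewer differences are declared significant.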



Configuring an Experiment



Analyzing Results


Step 3



Now that the best predictive scheme for the problem has been identified, we can return to PDI to see how the model can be deployed and then periodically re-built on up-to-date historic data. Rebuilding the model from time-to-time will ensure that it remains accurate with respect to underlying patterns in the data. If a trained model is exported from WEKA, then it can be imported directly into a PDI step called Weka Scoring. This step handles passing each incoming row of data to the model for prediction, and then outputting the row with predictions appended. The step can import any WEKA classification or clustering model, including those that invoke a different environment (such as R or Python). The following figure shows a PDI transformation for scoring orders using the Scikit-learn gradient boosting model trained in WEKA. Note that we don't need the historic fraud spreadsheet in this case as that is what we want the model to predict for the new orders!



Deploy a Predictive Model in PDI


PDI also supports the data scientist who prefers to work directly in R or Python when developing predictive models and engineering features. Scripting steps for R and Python allow existing code to be executed on PDI data that has been converted into data frames. With respect to machine learning, care needs to be taken when dealing with separate training and test sets in R and Python, especially with respect to categorical variables. Factor levels in R need to be consistent between datasets (same values and order); the same is true for Scikit-learn and, furthermore, because only numeric inputs are allowed, all categorical variables need to be converted to binary indicators via one-hot encoding (or similar). WEKA's wrappers around MLR and Scikit-learn take care of these details automatically, and ensure consistency between training and test sets.
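The consistency issue is easy to see in a small example. In the plain-Python sketch below, the category list is learned once from the training data and then reused, in the same order, for the test data. The column name payment_type and the helper functions are hypothetical, and mapping an unseen test value to all zeros is just one possible policy.

```python
def fit_categories(rows, column):
    """Record the sorted distinct values of a categorical column in the training data."""
    return sorted({row[column] for row in rows})


def one_hot(row, column, categories):
    """Return binary indicators in the fixed category order learned from training."""
    return [1 if row[column] == category else 0 for category in categories]


train = [{"payment_type": "visa"}, {"payment_type": "amex"}, {"payment_type": "mastercard"}]
test = [{"payment_type": "visa"}, {"payment_type": "discover"}]

# Learn the categories from the training set only, then reuse them everywhere.
categories = fit_categories(train, "payment_type")
encoded_train = [one_hot(row, "payment_type", categories) for row in train]
encoded_test = [one_hot(row, "payment_type", categories) for row in test]
```

Encoding the test set with its own category list would instead produce columns in a different order (and a different number of them), silently misaligning the model's inputs.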


Step 4



The following figure shows how to automate the creation of a predictive model using the PDI WEKA Knowledge Flow step. This step takes incoming rows and injects them into a WEKA Knowledge Flow process. The user can select either an existing flow to be executed, or design one on-the-fly in the step's embedded graphical Knowledge Flow editor. Using this step to rebuild a predictive model is simply an exercise in adding it to the end of our original data prep transformation.



Building a WEKA Model in PDI


To build a model directly in Python (rather than via WEKA's wrapper classifiers), we can simply add a CPython Script Executor step to the transformation. PDI materializes incoming batches of rows as a pandas data frame in the Python environment. The following figure shows using this step to execute code that builds and saves a Scikit-learn gradient boosted trees classifier.



Scripting to Build a Python Scikit-Learn Model in PDI



A similar script, as shown in the figure below, can be used to leverage the saved model for predicting the likelihood of new orders being fraudulent.



Scripting to Make Predictions with a Python Scikit-Learn Model


This predictive use-case walkthrough demonstrates the power and flexibility of Pentaho afforded to the data engineer and data scientist. From data preparation through to model deployment, Pentaho provides machine learning orchestration capabilities that streamline the entire workflow.


[1] Also known as data wrangling, this is the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data by semi-automated tools.

[2] The Cross Industry Standard Process for Data Mining.

[3] 10 separate runs of 10-fold cross validation, where the data is randomly shuffled between each run. This results in 100 models being learned and 100 sets of performance statistics for each learning scheme on each dataset.

[4] 132 classification and regression algorithms in the MLR package.

[5] 55 classification and regression algorithms.

[6] The area under the ROC curve has a nice interpretation as the probability (on average) that a classifier will rank a randomly selected positive case higher than a randomly selected negative case.

Introducing Plug-in Machine Intelligence

by Mark Hall and Ken Wood



Today, the need to swiftly operationalize machine learning based solutions to meet the challenges of businesses is more pressing than ever. The ability to create, deploy and scale a company’s business logic to quickly take advantage of opportunities or react to changes is exceeding the capabilities of people and legacy thinking. Better and more machine learning is vital going forward but, more importantly, easier machine learning is essential. Leveraging an organization’s existing staff levels, business domain knowledge, and skillsets by lowering the entry into the realm of data science can dramatically expand business opportunities.


“Every time I am in PMI, I am seeing more and more of its value!!! Great stuff!!!”

Carl Speshock - Hitachi Vantara Product Manager, Hitachi Vantara Analytics Group


The world of Machine Learning is empowering an ever-increasing breadth of applications and services from IoT to Healthcare to Manufacturing to Energy to Telecom, and everything in between. Yet the skills gap between business domain knowledge and the analytic tools used to solve these challenges needs to be bridged. People are doing their part through education, training and experimentation in order to become data scientists, but that’s only half of the equation. Making the analytic tools easier to use can help bridge this gap quickly. Throw in the ability to access and blend different data sources, cleanse, format and engineer features into these datasets, and you have a unique and powerful tool. In fact, the combination of PDI and PMI is an evolution of the PDI tool suite for deeper analytics and data integration capabilities.


"While exploring solutions with a major healthcare provider that was using predictive
analytics to reduce the costs and negative patient care incurred from surgery, we were given internal access to PMI. Working with Python and using the Scikit-Learn library required 2 weeks of coding and prototyping to perform just the learning model selection and training. With PDI and PMI, I was able to [...] the data, engineer in the features and train the models in about 3 hours. And, I could include other machine learning engines from R and Weka and evaluate the results. The combination of PDI and PMI makes machine learning solutions easier to use and maintain."

Dave Huh - Data Scientist - Hitachi Vantara Analytics Services


Hitachi Vantara Labs is excited to introduce a new PDI capability, Plug-in Machine Intelligence (PMI), to the PDI Marketplace. PMI is a series of steps for Pentaho Data Integration (PDI) that provides direct access to various supervised machine learning algorithms as full PDI steps that can be designed directly into your PDI data flow transformations. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). Installation guides for your platform, the Developer's Document, and the sample transformations and datasets are available here. The motivation for PMI is:


  1. To make machine learning easier to use by combining it with our data integration tool as a suite of easy to consume steps, and ensuring these steps guide the developer through its usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
  2. To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. No more jumping between multiple tools with inconsistent data passing methods, or complex and tedious manipulation of performance evaluations.
  3. To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently implements: WEKA, Python Scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic PDI step generation feature.


PMI also incorporates revamped versions of the Weka steps that were originally part of the Pentaho Data Science Pack. Essentially, this could be looked at as version 2 of the Data Science Pack. These include:


  • PMI Forecasting for deploying time-series models learned in WEKA’s time series forecasting environment.
  • PMI Scoring for deploying trained supervised and unsupervised ML models. This includes new features to support evaluation/monitoring of existing supervised models on fresh data (when class labels are available).
  • PMI Flow Executor for executing arbitrary WEKA Knowledge Flow processes. This revamped step supports WEKA’s new Knowledge Flow execution engine and UI.


PMI tightly integrates four popular machine learning “engines”, via their machine learning libraries, into the PDI “Data Mining” category. These four engines are Weka, Python, R and Spark. This first phase of PMI incorporates the supervised machine learning algorithms from these four engines via their associated machine learning libraries: Weka, Scikit-Learn, MLR and MLlib, respectively. Not all of the engines support all of the same algorithms. Essentially, there are 12 new PMI algorithms added to the Data Mining category that execute across the four different engines:



  1. Decision Tree Classifier – Weka, Python, Spark & R
  2. Decision Tree Regressor – Weka, Python, Spark & R
  3. Gradient Boosted Trees – Weka, Python, Spark & R
  4. Linear Regression – Weka, Python, Spark & R
  5. Logistic Regression – Weka, Python, Spark & R
  6. Naive Bayes – Weka, Python, Spark & R
  7. Naive Bayes Multinomial – Weka, Python & Spark
  8. Random Forest Classifier – Weka, Python, Spark & R
  9. Random Forest Regressor – Weka, Python & Spark
  10. Support Vector Classifier – Weka, Python, Spark & R
  11. Support Vector Regressor – Weka, Python & R
  12. Naive Bayes Incremental – Weka
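Because engine coverage is uneven, it can help to treat the list above as a lookup table when planning a comparison. The sketch below simply transcribes the list into a Python dictionary; the names ENGINES_BY_ALGORITHM and algorithms_for are hypothetical and are not part of PMI.

```python
ALL_FOUR = {"Weka", "Python", "Spark", "R"}

# Transcribed from the list of the 12 PMI algorithms above.
ENGINES_BY_ALGORITHM = {
    "Decision Tree Classifier": ALL_FOUR,
    "Decision Tree Regressor": ALL_FOUR,
    "Gradient Boosted Trees": ALL_FOUR,
    "Linear Regression": ALL_FOUR,
    "Logistic Regression": ALL_FOUR,
    "Naive Bayes": ALL_FOUR,
    "Naive Bayes Multinomial": {"Weka", "Python", "Spark"},
    "Random Forest Classifier": ALL_FOUR,
    "Random Forest Regressor": {"Weka", "Python", "Spark"},
    "Support Vector Classifier": ALL_FOUR,
    "Support Vector Regressor": {"Weka", "Python", "R"},
    "Naive Bayes Incremental": {"Weka"},
}


def algorithms_for(engine: str) -> list:
    """List the algorithms that the given engine supports."""
    return [name for name, engines in ENGINES_BY_ALGORITHM.items() if engine in engines]


spark_algorithms = algorithms_for("Spark")
r_algorithms = algorithms_for("R")
```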


As such, eventually the existing “Weka Scoring” step will be deprecated and replaced with the new “PMI Scoring” step. This step can consume (and evaluate for model management monitoring processes) any model produced by PMI, regardless of which underlying engine is employed.


I know what you’re thinking, “why implement machine learning across four engines?”. Good question. Believe it or not, data scientists are picky and set in their ways, and… not all engines and algorithms perform the same (think accuracy and speed) for any given dataset. Many analysts, data scientists, data engineers and others who look to these tools to solve their challenges tend to use their favorite tool/engine. With PMI, you can compare up to four different engines and up to 12 different algorithms against each other to determine the best fit for your requirement.



What Happens When Data Patterns Change?


An important benefit of PMI is that the evaluation metrics used to measure accuracy are uniform and unified. Since all steps are now built into the same PMI framework, the resulting metrics are all calculated uniformly and can be used to easily compare performance, even across the different engines' algorithms. This unique characteristic has resulted in a whole new use case in the form of model management. Concepts around model management with the PMI framework have enabled the ability to Auto-Retrain models, Auto-Re-evaluate, Dynamic-Deploy models and so on. A concept that we have recently proven with a demonstration is the Champion / Challenger model management strategy. This Champion / Challenger strategy easily allows currently active model(s) to be re-evaluated and compared with other candidate models' performance, and the new Champion model to be "hot-swap" deployed. A more detailed discussion on Machine Learning Model Management can be found in the accompanying blog called "4-Steps to Machine Learning Model Management".






The Fail Fast Approach


Thomas Edison is quoted as saying "I have not failed. I've just found 10,000 ways that won't work." And back in the days leading up to the invention of the light bulb, these 10,000 ways that won't work took years to iterate through. What if you could eliminate candidates in days or hours? PMI allows a “Fail Fast” approach to achieving results. With the ease of using PMI on datasets, many combinations of algorithms and configurations can be tried and tested very quickly, weeding out the approaches that won’t work and narrowing down to promising candidates quickly. The days of churning on code until it finally works, then finding out the results aren’t good enough and a new approach is needed, are coming to an end.


Over the next few months, Hitachi Vantara Labs will continue to provide blogs and videos to demonstrate how to use PMI, how to extend the PMI framework and how to add additional algorithms to PMI.




It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time.  It is recommended that this experimental feature be used for testing only and not used in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.

Eliminating Machine Learning Model Management Complexity

By Mark Hall and Ken Wood




Last year in 4-Steps to Machine Learning with Pentaho, we looked at how the Pentaho Data Integration (PDI) product provides the ideal platform for operationalizing machine learning pipelines – i.e. processes that, typically, ingest raw source data and take it all the way through a series of transformations that culminate in actionable predictions from predictive machine learning models. The enterprise-grade features in PDI provide a robust and maintainable way to encode tedious data preparation and feature engineering tasks that data scientists often write (and re-write) code for, accelerating the process of deploying machine learning processes and models.



“According to our research, two-thirds of organizations do not have an automated process to update their predictive analytics models seamlessly. As a result, less than one-quarter of machine learning models are updated daily, approximately one third are updated weekly, and just over half are updated monthly. Out of date models can create a significant risk to organizations.”

- David Menninger, SVP & Research Director, Ventana Research



It is well known that, once operationalized, machine learning models need to be updated periodically in order to take into account changes in the underlying distribution of the data for which they are being used to predict. That is, model predictions can become less accurate over time as the nature of the data changes. The frequency with which models get updated is application dependent, and itself can be dynamic. This necessitates an ability to automatically monitor the performance of models and, if necessary, swap the current best model for a better performing alternative. There should be facilities for the application of business rules that can trigger re-building of all models or manual intervention if performance drops dramatically across the board. These sorts of activities fall under the umbrella of what is referred to as model management. In the original diagram for the 4-Steps to Machine Learning with Pentaho blog, the last step was entitled “Update Models.” We can expand the original "Update Models" step to detail the underlying steps that are necessary to automatically manage the models, and then relabel it "Machine Learning Model Management" (MLMM). The MLMM step includes the 4-Steps to Machine Learning Model Management, “Monitor, Evaluate, Compare, and Rebuild all Models”, in order to cover what we are describing here. This concept now looks like this diagram.





The 4-Steps to Machine Learning Model Management, as highlighted, include Monitor, Evaluate, Compare and Rebuild. Each of these steps implements a phase of a concept called a "Champion / Challenger" strategy. In a Champion / Challenger strategy applied to machine learning, the idea is to compare two or more models against each other in order to promote the one model that performs the best. There can be only one Champion model, in our case the model that is currently deployed, and there can be one or more Challengers, in our case other models that are trained differently, use different algorithms and so forth, but all running against the same dataset. The implementation of the Champion / Challenger strategy for MLMM goes like this:


  1. Monitor - constant monitoring of all of the models is needed to determine the performance accuracy of the models in the Champion / Challenger strategy. Detecting degraded model performance should be viewed as a positive result for your business strategy, because it means the characteristics of the underlying data have changed. In other words, the behaviors you are striving for are being achieved, and external behaviors are changing to get around your current model strategy. In the case of our retail fraud prediction scenario, the degradation of our current Champion model's performance is due to a change in the nature of the initial data. The predictions worked and are preventing further fraudulent transactions, so new fraud techniques are being used which the current Champion model wasn't trained to predict.
  2. Evaluate - an evaluation of the current Champion model needs to be performed to provide evaluation metrics of the model's current accuracy. This evaluation results in performance metrics on the current situation and can provide a detailed set of both visual and programmatic data to use to determine what is happening. Based on business rules, if the accuracy level has dropped to a determined threshold level, then this event can trigger notifications of the degraded performance or initiate automated mechanisms. In our retail fraud prediction scenario, since the characteristic of the data has changed, the Champion model's accuracy has degraded. Metrics from the evaluation can be used to determine that model retraining, tuning and/or a new algorithm is needed. Simultaneously, all models in the Champion / Challenger strategy could be evaluated against the data to ensure an even evaluation on the same data.
  3. Compare - the performance accuracy of all the models from the evaluation step is compared to determine which model, Champion or Challenger, performs best at this time. Since the most likely case is that the current Champion and all the Challenger models were built and trained against the initial state of the data, these models will need to be rebuilt.
  4. Rebuild - by rebuilding (retraining) all the models against the current state of the data, the best performing model on the current state of the data is promoted to Champion. The new Champion can be hot-swapped and deployed or redeployed into the environment by using a PDI transformation to orchestrate this action.
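
To make the four steps a little more concrete, here is a minimal Python sketch of one pass through the cycle. This is not how PMI or PDI implements it (there the steps are PDI transformations); the function name `mlmm_cycle`, the `evaluate` and `rebuild` callables and the `threshold` parameter are all assumptions made up for illustration.

```python
from typing import Callable, Dict


def mlmm_cycle(
    champion: str,
    models: Dict[str, object],
    evaluate: Callable[[object], float],
    rebuild: Callable[[object], object],
    threshold: float,
) -> str:
    """Run one Monitor/Evaluate/Compare/Rebuild pass; return the champion's name."""
    # Monitor + Evaluate: score every model against the same current data.
    scores = {name: evaluate(model) for name, model in models.items()}

    # Business rule (hypothetical): do nothing while the champion's score
    # stays at or above the agreed threshold.
    if scores[champion] >= threshold:
        return champion

    # Rebuild: retrain every model on the current state of the data.
    for name in list(models):
        models[name] = rebuild(models[name])

    # Compare: promote whichever rebuilt model now scores best.
    rebuilt_scores = {name: evaluate(model) for name, model in models.items()}
    return max(rebuilt_scores, key=rebuilt_scores.get)
```

In a scheduled deployment, something like this would run on a periodic basis, with `evaluate` returning whichever metric the business rules care about.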


This 4-Steps to Machine Learning Model Management is a continuous process, usually scheduled to run on a periodic basis. This blog describes how to implement a Champion / Challenger strategy using PDI for both the machine learning and the model management orchestration.


The new functionality that provides a new set of supervised machine learning capabilities and the model management enablers to PDI is called Plug-in Machine Intelligence (PMI). PMI provides a suite of steps to PDI that gives direct access to various supervised machine learning algorithms as full PDI steps that can be designed directly into your PDI data flow transformations with no coding. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). The motivation for PMI is:


  • To make machine learning easier to use by combining it with our data integration tool as a suite of easy-to-consume steps that do not require writing code, and ensuring these steps guide the developer through their usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
  • To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. No more jumping between multiple tools with inconsistent data passing methods, or complex and tedious performance evaluation manipulation.
  • To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently supports: WEKA, python scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic step generation feature.


A more detailed introduction of the Plug-in Machine Intelligence plug-in can be found in this accompanying blog.


PMI also provides a unified evaluation framework. That is, the ability to output a comprehensive set of performance evaluation metrics that can be used to facilitate model management. We call this unified because data shuffling, splitting and the computation of evaluation metrics are performed in the same way regardless of which of the underlying execution engines is used. Again, no coding is required which, in turn, translates into significant savings in time and effort for the practitioner. Evaluation metrics computed by PMI include (for supervised learning): percent correct, root mean squared error (RMSE) and mean absolute error (MAE) of the class probability estimates in the case of classification problems, F-measure, and area under the ROC (AUC) and precision-recall curves (AUPRC). Such metrics provide the input to model management mechanisms that can decide whether a given “challenger” model (maintained in parallel to the current “champion”) should be deployed, or whether champion and all challengers should be re-built on current historical data, or whether something fundamental has been altered in the system and manual intervention is needed to determine data processing problems or to investigate new models/parameter settings. It is this unified evaluation framework that enables PDI to do model management.
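
For readers who want a feel for what a few of these metrics measure, here is a small plain-Python sketch for a two-class problem. It only illustrates the textbook definitions; PMI computes these for you with no coding, and its exact computations may differ in detail. The function names and the 0.5 cutoff are my own assumptions.

```python
import math
from typing import Sequence


def percent_correct(labels: Sequence[int], probs: Sequence[float], cutoff: float = 0.5) -> float:
    """Share of rows (in percent) where the thresholded probability matches the label."""
    hits = sum((p >= cutoff) == bool(y) for y, p in zip(labels, probs))
    return 100.0 * hits / len(labels)


def mae(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean absolute error of the positive-class probability estimates."""
    return sum(abs(y - p) for y, p in zip(labels, probs)) / len(labels)


def rmse(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Root mean squared error of the positive-class probability estimates."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, probs)) / len(labels))


def auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Area under the ROC curve: the chance a positive outranks a negative (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

Each function takes the true labels (0 or 1) and the predicted probability of the positive class, so the same inputs can feed every metric.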



Implementing MLMM in PDI

The PDI transformations below are also included in the PMI plugin download complete with the sample datasets.


The following figure shows a PDI transformation for (re)building models and evaluating their performance on the retail fraud application introduced in the 4-Steps to Machine Learning with Pentaho blog. It also shows some of the evaluation metrics produced under a 2/3rd training - 1/3rd test split of the data. These stats can be easily visualized within PDI via DET (Data Exploration Tool), or the transformation can be used as a data service for driving reports and dashboards in the Business Analytics (BA) server.





The following figure shows a PDI transformation that implements champion/challenger monitoring of model performance. In this example, an evaluation metric of interest (area under the ROC curve) is computed for three static models: the current champion, and two challengers. Models are arranged on the file system such that the current champion always resides in one directory and challenger models in a separate directory. If the best challenger achieves a higher AUC score than the current champion, then it is copied to the champion directory. In this way, hot-swapping of models can be made on-the-fly in the environment.
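
The directory-based hot-swap described above can be sketched in a few lines of Python. In the actual sample this is done by a PDI transformation, so treat this only as an illustration of the idea; the function name `promote_best_challenger`, the file layout and the AUC values passed in are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict


def promote_best_challenger(
    champion_file: Path,
    champion_auc: float,
    challenger_aucs: Dict[Path, float],
) -> bool:
    """Copy the best challenger over the champion file if it scores a higher AUC.

    Returns True when a swap happened, False when the champion was kept.
    """
    # Pick the challenger model file with the highest AUC.
    best_path, best_auc = max(challenger_aucs.items(), key=lambda item: item[1])

    # Keep the current champion unless the challenger is strictly better.
    if best_auc <= champion_auc:
        return False

    # Hot-swap: whatever reads champion_file picks up the new model
    # the next time it loads it.
    shutil.copyfile(best_path, champion_file)
    return True
```

Because the scoring process always reads the model from the same champion location, nothing else has to change when a challenger is promoted.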




PMI provides the ability to build processes for model management very easily. This, along with its no-coding access to heterogeneous algorithms and its automation of “last mile” algorithm-specific data transformations, when combined with enterprise-grade features in PDI – such as data blending, governance, lineage and versioning – results in a robust platform for addressing the needs of citizen data scientists and modern MI deployments.


Installation documentation for your specific platform and a developer's guide, as well as the sample transformations and datasets used in this blog, can be found here. The sample transformations and sample datasets are for demonstration and educational purposes.


It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time.  It is recommended that this experimental feature be used for testing only and not used in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all comments, ideas and opinions.

"IoT Hello World"




It was brought to my attention at PCM2017 in November, that I still owe the third blog on using the MQTT plugin for Pentaho Data Integration (PDI). So, thanks for reminding me. For those that haven’t been keeping track, there were two other blogs written earlier this year introducing the MQTT plugin at Pentaho + MQTT = IoT and applying some security techniques to MQTT in Securing IoT Data with Pentaho’s MQTT Plugin. So, to close out this 3 blog series, let’s kick it up a bit. I would like to describe a small project I’ve been working on that uses PDI and MQTT in a bi-directional communication configuration. This means using both the MQTT Publisher and MQTT Subscriber steps within the same transformation. I have tweeted snippets of this project throughout the year as I developed it, and now I’ll explain how it works. Refer to the diagram in Diagram-1 which shows the full architecture of this project, which I call “IoT Hello World”. I use this project for demonstrations, education and as a way to generate streaming IoT data for participants in the #hitachicode1 hack-a-thon events.



Diagram-1: The overall architecture of project "Pentaho - IoT Hello World"


Project Overview


A quick description of what this project does before I explain how it works. This is a 6 degree of freedom (DoF) robotic arm that performs several robotic routines: displaying "Hello" and "World", then picking up and placing a car on the track (refer to the video above). While looping through these routines, components of the robotic arm, the servo motors, will begin to heat up. Temperature sensors have been attached to all 6 servo motors so that the temperature of the servo motors can be sensed and reported. The data stream generated from these readings is published in comma-separated values (CSV) format and includes,


  • Robotic Arm Controller (RAC) serial number (read from a configuration file)
  • A RAC system timestamp
  • The "Shoulder" servo motor temperature
  • The "Twist" servo motor temperature
  • The "Wrist" servo motor temperature
  • The "Elbow" servo motor temperature
  • The "Base" servo motor temperature
  • The "Claw" servo motor temperature
  • An MD5 hash of the first 8 fields of this message
  • The MQTT Topic that this message is published under


An example MQTT message looks like this and is published under the topic "Pen/Robot/Temp",


SN84112-2,2017-07-26 13:57:28,26.0,25.0,25.0,27.0,24.0,24.0,6c18c1515dea2c5ddfcc6c69a18cbedf,Pen/Robot/Temp
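
As a rough illustration, a subscriber could split a message like this into named fields and recompute the hash as shown below. Two things here are assumptions on my part: that the six temperatures arrive in the same order as the field list above, and that the MD5 is taken over the first 8 fields re-joined with commas as UTF-8 text. The post does not spell out the exact hashing recipe, so a mismatch on a real message may simply mean the recipe is different.

```python
import hashlib
from typing import Dict, List

# Assumed order, taken from the field list in this post.
SERVOS = ["shoulder", "twist", "wrist", "elbow", "base", "claw"]


def checksum(fields: List[str]) -> str:
    """MD5 hex digest of the first 8 fields (ASSUMPTION: comma-joined, UTF-8)."""
    return hashlib.md5(",".join(fields[:8]).encode("utf-8")).hexdigest()


def parse_temperature_message(message: str) -> Dict[str, object]:
    """Split one 'Pen/Robot/Temp' CSV message into named values."""
    fields = message.strip().split(",")
    if len(fields) != 10:
        raise ValueError("expected 10 fields, got %d" % len(fields))
    return {
        "serial": fields[0],
        "timestamp": fields[1],
        "temps": dict(zip(SERVOS, (float(v) for v in fields[2:8]))),
        "md5": fields[8],
        "topic": fields[9],
        # True only if the sender used the same hashing recipe assumed above.
        "md5_matches": checksum(fields) == fields[8],
    }
```

The `md5_matches` flag gives the receiver a cheap way to spot a message that was truncated or altered in transit, provided both ends agree on how the hash is built.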


There is another message that is published by the RAC during startup. When the RAC is booted and the "Start" button is pushed, a registration message is published. This message "registers" the RAC with the Corporate Server Application, which stores it in the device registration table. The message consists of the following fields in csv format,


  • RAC serial number (read from a configuration file)
  • A RAC description (read from a configuration file)
  • A RAC system timestamp
  • IP address of the RAC
  • The node name of the RAC
  • The PDI software version
  • Latitude
  • Longitude


An example of the device registration MQTT message looks like this and is published under the topic "Pen/Robot/DeviceRegister",


SN84112-2,SN84112-2 Robot Arm Hello World 6 DOF 6 Servos,2017-07-26 13:33:50.138,,ironman,,30.6291065,-100.995013667
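
Here is a minimal Python sketch of how the receiving side might turn that registration message into a record. The `DeviceRegistration` class and `parse_registration` function are hypothetical names, and the sketch assumes the description field never contains a comma. Note that in the example above the IP address and PDI version fields are empty, so they are kept as plain (possibly empty) strings.

```python
from dataclasses import dataclass


@dataclass
class DeviceRegistration:
    serial: str
    description: str
    timestamp: str
    ip_address: str   # may be empty, as in the example message
    node_name: str
    pdi_version: str  # may be empty, as in the example message
    latitude: float
    longitude: float


def parse_registration(message: str) -> DeviceRegistration:
    """Split one 'Pen/Robot/DeviceRegister' CSV message into a record."""
    fields = message.strip().split(",")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    return DeviceRegistration(
        serial=fields[0],
        description=fields[1],
        timestamp=fields[2],
        ip_address=fields[3],
        node_name=fields[4],
        pdi_version=fields[5],
        latitude=float(fields[6]),
        longitude=float(fields[7]),
    )
```

A record like this maps one-to-one onto a row in the device registration table.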


Again, refer to Diagram-1. The four main sections of this project, with the fourth being optional, are,


  1. The “Corporate Server Application”, which subscribes and publishes to several message queues as follows,
    1. Subscribes to the "Device Registration" message queue to receive a one time registration information message when the RAC comes online
    2. Subscribes to the "Device Data Stream" message queue for the operational information from the RAC
    3. Publishes the response to the "Corporate Response" message queue about corporate operating temperature specifications
    4. Publishes a message stream to the "Mobile Stream" message queue for a mobile application to monitor the robot arm remotely
  2. The Robot Controller Application and Sensor Monitoring which registers the RAC with the Corporate Server, collects sensor data and other controller information, and manages all the control buttons, LED indicators and the robotic arm itself.
  3. The MQTT Broker, which is a free MQTT Broker at (but any MQTT Broker can be used) for hosting the four message queues used in this system. The four message queues used are defined as,
    1. Device Data Stream – used for the device data stream
    2. Device Registration – used for the device registration message
    3. Corp Response – used for the corporate server app to send messages back to the robot arm controller.
    4. Mobile Stream – a dedicated message stream for the mobile device app from the corporate server app
  4. A fourth component is a mobile app for remote monitoring. This is not required and the system will run with or without the mobile device present.


As mentioned earlier, what’s happening with this system is while the robot arm is running its routines, the servo motors will begin to heat up. The sensors attached to each servo motor are read continuously and a status LED associated with the temperature readings provides a local visual status. This is called the “Local Vendor Temperature Status” and it covers a temperature range that corresponds to Green (OK), Yellow (WARNING) and Red (OVERHEAT) LEDs. The other set of status LEDs are called the “Corporate Temperature Specifications”. The Corporate Server Application is subscribing to the RAC’s data stream in real-time and responding based on a different definition of Green (OK), Yellow (WARNING) and Red (OVERHEAT) conditions. Basically, the corporate temperature specifications are lower than the vendor temperature specifications.
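
The two sets of specifications boil down to the same reading being judged against two different threshold tables. A minimal Python sketch of that idea follows. The threshold numbers are invented purely for illustration (the post does not give the real vendor or corporate values), as are the names `TempSpec`, `VENDOR_SPEC`, `CORPORATE_SPEC` and `status`.

```python
from typing import NamedTuple


class TempSpec(NamedTuple):
    warning_at: float   # readings at or above this are Yellow
    overheat_at: float  # readings at or above this are Red


# HYPOTHETICAL thresholds; the only property taken from the post is that
# the corporate specification is lower than the vendor specification.
VENDOR_SPEC = TempSpec(warning_at=40.0, overheat_at=50.0)
CORPORATE_SPEC = TempSpec(warning_at=30.0, overheat_at=40.0)


def status(temperature: float, spec: TempSpec) -> str:
    """Map one temperature reading to an LED status under the given spec."""
    if temperature >= spec.overheat_at:
        return "Red (OVERHEAT)"
    if temperature >= spec.warning_at:
        return "Yellow (WARNING)"
    return "Green (OK)"
```

With these made-up numbers, a reading of 35.0 is still Green for the vendor LEDs but already Yellow for the corporate LEDs, which is exactly the situation the two LED sets are meant to show.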


I call this the “fishing boat situation”. If you've ever asked a fishing boat captain if the boat will go faster, they will respond with “...yes, the vessel is designed to go faster, but by running the engines at half speed, the engines will last twice as long”. This is the same situation here. The robot arm vendor will say that the robot can run at a higher temperature, but the owner of the robot arm (Corporation ABC) wants to operate the robot arm at a lower temperature, or in this case, indicate that the robot arm is operating at a higher temperature than they want it to. This, of course, is just a simulated scenario in order to tell the story of what’s going on.



The Robot Arm Controller


The robot, robotic arm controller (RAC) and monitoring application are a collection of orchestrated PDI transformations, executed through a job, that all run on a Raspberry Pi. These PDI components do the following,


  1. Registers the robot with the current GPS location using a GPS module installed on the RPi, various system information, the device’s serial number and device description. This registration information is formatted and published to the “Device Registration” message queue.
  2. The Data Collection component monitors the 6 temperature sensors that are installed on the RAC's servo motors and connected to the RPi. This component publishes the formatted message to the “Device Data Stream” message queue.
  3. Subscribes to the “Corp Response” message queue to receive the current corporate operating temperature specifications.
  4. PDI also orchestrates a suite of python programs that are used to do the physical IO associated with the RAC like,
    1. Turn LED indicators on and off
    2. Collect temperature readings from the temperature sensors
    3. Interface with the control buttons
    4. Collect the latitude and longitude data from the GPS module



All the physical general purpose IO (GPIO) connections for LEDs, sensors, GPS coordinates, and robotic arm manipulation and control are performed through various python programs executed from PDI. An example python program looks like this for reading the 6 temperature sensors and outputting a list of temperatures. I picked this code sample to show you because it reads all the sensors in parallel versus sequentially. Each sensor takes several seconds to read and the time adds up if read in sequence.
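
The parallel-read idea can be sketched as follows. This is not the program that runs on the RAC: the sensor hardware access is replaced by a stand-in function (`read_sensor_stub`, which just sleeps and returns a fixed value), and the sensor names and `read_all` helper are hypothetical. The point is only the pattern of one worker thread per sensor so the slow reads overlap instead of adding up.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical sensor identifiers, one per servo motor.
SENSOR_IDS = ["shoulder", "twist", "wrist", "elbow", "base", "claw"]


def read_sensor_stub(sensor_id: str) -> float:
    """Stand-in for a real sensor read; sleeps to mimic a slow device."""
    time.sleep(0.2)
    return 25.0


def read_all(
    sensor_ids: Sequence[str],
    read_one: Callable[[str], float] = read_sensor_stub,
) -> List[float]:
    """Read every sensor in its own thread and return temperatures in input order."""
    with ThreadPoolExecutor(max_workers=len(sensor_ids)) as pool:
        # pool.map returns results in the same order as sensor_ids,
        # regardless of which read finishes first.
        return list(pool.map(read_one, sensor_ids))
```

Threads are a reasonable fit here because each read mostly waits on the device, so six reads take roughly as long as the slowest one rather than the sum of all six.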



The Corporate Server


The Corporate Server plays the role of maintaining the company’s operational specification on preferred temperature levels versus the vendor's operating specification. Let’s take a closer look at the transformation running the Corporate Server. In the transformation shown below, see Transformation-1, we can see that this transformation uses both the MQTT Subscriber and MQTT Publisher step simultaneously within the same transformation.


From left to right, the transformation subscribes to the topic “Pen/Robot/Temp” at the MQTT Broker. It then copies the message stream into four threads.


  1. This thread goes to another transformation that builds another message and publishes it to the "Mobile Stream" message queue.
  2. This thread splits the message into fields and updates a table in the database (the commit for this database is set to 1 so every record is immediately committed).
  3. This thread also gets a copy of the message and splits out the temperature values and does some calculations on the temperatures from the RAC's temperature sensors. This value is then used in the Switch/Case step to compare it to a table of corporate-defined temperature values. Based on the result, an MQTT message is published to the “Corp Response” queue on the MQTT Broker with a Green (OK), Yellow (WARNING) or Red (OVERHEAT) status. This status is then used by the RAC to light the corporate LED indicators. This is what I’m referring to as bi-directional communication or full-duplex communication using MQTT. The transformation is subscribing to a topic and receiving a message, then within the same transformation publishing a message.
  4. This thread is just dumping the raw messages straight into a log file locally for posterity.
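
To picture the "copy to four threads" pattern outside of PDI, here is a small Python sketch in which every incoming message is handed to four independent branches. In PDI the copies run as parallel step threads; this sketch calls them one after another for simplicity. The branch names, the in-memory lists standing in for the database table and log file, and the use of the hottest reading as the "calculation" are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def split_fields(message: str) -> List[str]:
    return message.strip().split(",")


def peak_temperature(message: str) -> float:
    # ASSUMPTION: the hottest of the six readings stands in for the
    # calculation whose result feeds the Switch/Case step.
    return max(float(v) for v in split_fields(message)[2:8])


def fan_out(message: str, branches: Dict[str, Callable[[str], object]]) -> Dict[str, object]:
    """Give every branch its own copy of the message and collect the results."""
    return {name: branch(message) for name, branch in branches.items()}


db_rows: List[Tuple[str, ...]] = []  # stand-in for the database table
log_lines: List[str] = []            # stand-in for the local log file

branches = {
    "mobile": lambda m: f"{split_fields(m)[0]} peak={peak_temperature(m)}",
    "database": lambda m: db_rows.append(tuple(split_fields(m))),
    "corp_response": peak_temperature,
    "raw_log": log_lines.append,
}

SAMPLE = (
    "SN84112-2,2017-07-26 13:57:28,26.0,25.0,25.0,27.0,24.0,24.0,"
    "6c18c1515dea2c5ddfcc6c69a18cbedf,Pen/Robot/Temp"
)
results = fan_out(SAMPLE, branches)
```

Each branch sees the same raw message and does its own thing with it, which is why adding or removing a branch does not disturb the others.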



Transformation-1: The main transformation for the Corporate Server Application.


There is a second (floating) transformation within this main transformation for catching the Device Registration information when the RAC is started. This should work when there are multiple controllers using this system. In fact, the mobile device stream is a join of the Device Registration table (to get the GPS latitude and longitude and the robot arm controller description information) and the table storing the temperature readings already stored in the database.


That's it! The entire process on the RAC is set up when the Raspberry Pi boots up. The push button interface kicks off scripts that launch kitchen to start the RAC application. The Corporate Server Application is started with a script to start pan.


This is a great project that has great IoT demonstration value, plus it was just a blast to do. It is also portable so I can travel with it as necessary. I have started on a second version of this robot concept with many more sensors, more complicated routines and other updated components. I am looking forward to showing that project some day soon.


Let me know if you have any questions or comments. I know this project is less industrial and more hobby-ish in nature, but it is a great tool for demonstrating complex concepts with Pentaho and generating real IoT data streams.