Blog 1 of 2
I have a few pent-up blogs that need to be written, but I’ve been waiting for a couple of Pentaho User Meeting speaking events to finish before I share them. I don’t want to "steal my own thunder", if you know what I mean.
I did start tweeting some thoughts the other day on one of these viewpoints: hidden results from machine learning models. That led me to start writing this first of 2 blogs on the matter.
Much of this idea comes from the “Hey Ray!” demonstration, which uses deep learning in Pentaho with the Plug-in Machine Intelligence (PMI) plugin. In this demonstration, occasionally a “wrong” body part would be identified. I say wrong because I personally see a “forearm” but “Hey Ray!” sees an “elbow”. In working with this example application and interpreting its results, it began to dawn on me: “maybe the neural networks in this transformation do 'see' a forearm”. Technically, both answers, forearm and elbow, are correct. Then you have to ask yourself, “what is the dominant feature of the image being analyzed?” or, more importantly, “what image perspective is being analyzed?”. When looking at the results from the deep learning models used, you can see that both classes score high as a probability output. So what are you supposed to do with no "clear" result?
This is where hidden results come in. In my talks, I typically describe starting with machine learning on problems and datasets that can be described with 2 classes, or binary classes: "Yes or No", "True or False", "Left or Right", "Up or Down", and so on. Ask a question of the data, for example, “is this a fraudulent transaction, Yes or No?”. With supervised machine learning, you have the answers from historical data. You are just training the models to recognize the patterns in the historical data so they can recognize future fraudulent transactions in live or production data.
Here's the issue: there could actually be three possible answers from binary classes: "Yes, No or Unknown". However, you have to interpret the Unknown answer with thresholds and flag the implied Unknown results. Typically, the Type:Yes_Predicted_prob outcomes from a machine learning model (assuming you are using PMI with Pentaho) are based on the halfway point, 0.5 or 50%. Any prediction above the 50% line is the Yes class and any prediction below the 50% line is the No class. This means that a record with a Type:No_Predicted_prob of 0.49 (49%) is still labeled Yes, because its Type:Yes_Predicted_prob is 0.51, even though the two classes are practically tied. These statistical “ties” need to be contained in a range, and this range needs to be defined as your policy.
For example, you could test and retest your models and define a range of results for Type:Yes_Predicted_prob or Type:No_Predicted_prob, with everything in between treated as the extrapolated Unknown results. With such a range, the predicted results would be:
- 4 predicted “yes” diabetic,
- 4 predicted “no” diabetic,
- and 3 “unknown” or “undetermined” predictions.
The figure below shows the results of a PMI Scoring step on the last 11 case study records from the Pima Indians diabetes dataset study.
For experimental purposes, I split the 768 records in the diabetes case study dataset into 3 separate datasets: a training dataset, a validation dataset and a “production” dataset. For the production dataset, I use the last 11 records from the dataset, remove the “outcome” column (the study’s results) and score this dataset for the results you see below.
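The split described above can be sketched in a few lines of plain Python. This is only an illustration, not the PMI workflow itself: the function name `split_dataset`, the `train_fraction` of 0.8, and the placeholder rows are all my assumptions (the post does not say how the remaining records were divided between training and validation). Only the "last 11 records, outcome column removed" part comes from the text.

```python
def split_dataset(records, n_production=11, train_fraction=0.8, label="outcome"):
    """Split records into training, validation and unlabeled production sets.

    The last n_production records become the production set, with the
    label column removed. train_fraction is an assumed value; the post
    does not state the training/validation ratio.
    """
    if not 0 < n_production < len(records):
        raise ValueError("n_production must be between 1 and len(records) - 1")

    labeled = records[:-n_production]
    held_out = records[-n_production:]

    # Keep the original record order; no shuffling is assumed here.
    cut = int(len(labeled) * train_fraction)
    training = labeled[:cut]
    validation = labeled[cut:]

    # Drop the known answer so the model has to predict it.
    production = [
        {key: value for key, value in row.items() if key != label}
        for row in held_out
    ]
    return training, validation, production


# Placeholder rows standing in for the 768 records (not the real measurements).
records = [{"id": i, "outcome": i % 2} for i in range(768)]
training, validation, production = split_dataset(records)
print(len(training), len(validation), len(production))
```

The production rows no longer carry the answer, which is the point: the scoring step has to supply a probability for each one.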
This scoring example illustrates that any result predicted with a probability score between 0.30 and 0.70 (30% to 70%) should be interpreted as Unknown, while the top 30% and bottom 30% ranges should be used to define your Yes or No results. A result needs a predicted Yes score of 0.70 or higher to be interpreted as a Yes, and a score of 0.30 or lower to be interpreted as a No; all other results should be considered Unknown.
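As a minimal sketch of that policy, here is the rule written as a small Python function. The function name `interpret` and its parameter names are hypothetical; the 0.30 and 0.70 cutoffs are the ones from this example, and in practice you would set them from your own testing. The input is assumed to be the Yes-class probability (the Type:Yes_Predicted_prob value).

```python
def interpret(yes_prob, no_max=0.30, yes_min=0.70):
    """Map a Yes-class probability to "Yes", "No" or "Unknown".

    yes_min and no_max are policy choices, not model outputs:
    0.70 or higher is a Yes, 0.30 or lower is a No, and anything
    in between is flagged as Unknown.
    """
    if not 0.0 <= yes_prob <= 1.0:
        raise ValueError("yes_prob must be a probability between 0 and 1")
    if yes_prob >= yes_min:
        return "Yes"
    if yes_prob <= no_max:
        return "No"
    return "Unknown"


# A few made-up probabilities, including the two boundary values.
for p in (0.92, 0.70, 0.55, 0.30, 0.08):
    print(p, interpret(p))
```

Keeping the cutoffs as parameters makes the policy explicit and easy to change without touching the model.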
If you only used the type_MAX_prob to determine the outcome, the results would be:
- 5 predicted “yes” diabetic,
- 6 predicted “no” diabetic.
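To see how the two readings diverge, here is a rough side-by-side tally in Python. The 11 probabilities are invented for illustration, chosen only so the tallies match the counts quoted in this post; they are not the values from the PMI Scoring figure. The helper names `by_max_prob` and `by_policy` are hypothetical, and treating an exact 0.5 as No is my own tie-break assumption.

```python
from collections import Counter

# Invented Yes-class probabilities (NOT the actual PMI scoring output).
yes_probs = [0.91, 0.84, 0.77, 0.72, 0.58, 0.44, 0.36, 0.28, 0.21, 0.12, 0.05]


def by_max_prob(yes_prob):
    """Pick whichever class has the larger probability (the 0.5 cutoff)."""
    return "Yes" if yes_prob > 0.5 else "No"


def by_policy(yes_prob, no_max=0.30, yes_min=0.70):
    """Apply the banded policy: confident Yes, confident No, else Unknown."""
    if yes_prob >= yes_min:
        return "Yes"
    if yes_prob <= no_max:
        return "No"
    return "Unknown"


max_counts = Counter(by_max_prob(p) for p in yes_probs)
policy_counts = Counter(by_policy(p) for p in yes_probs)

print("max prob:", dict(max_counts))
print("policy:  ", dict(policy_counts))

# The records the 0.5 cutoff labeled with more confidence than it had.
hidden = [p for p in yes_probs if by_policy(p) == "Unknown"]
print("hidden unknowns:", hidden)
```

With these invented scores, the 0.5 cutoff tallies 5 Yes and 6 No, while the banded policy tallies 4 Yes, 4 No and 3 Unknown. The three records in between are the hidden results.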
So, use thresholds and ranges to define your results from machine learning models. Do not just go by the type_MAX_prob, Type:No_Predicted_prob or Type:Yes_Predicted_prob. By applying a threshold policy instead, you will discover hidden results in binary classes.
Hidden results are more pronounced when working with datasets with more than 2 classes. Multi-class datasets may actually include combinations of answers as well as unknown results. This will be the subject of the next blog. Stay tuned!