AutoML + Pentaho + Grafana for fast solution prototyping

By Elena Salova posted 05-28-2020 09:15


Written by Gwyn Evans and Elena Salova

When building machine learning solutions, data scientists are often following their noses to select appropriate features, modelling techniques and hyperparameters to tackle the problem at hand. In practice, this involves drawing on experience (and code!) from previous projects, incorporating domain knowledge from subject matter experts, and many (many, many, many, many, many….!?!) iterative development cycles.

Whether the experimentation phase is fun, excruciatingly tedious or somewhere in between, one thing is for sure: this development process does not lend itself well to fast solution prototyping, especially for new or unfamiliar problem domains.

In this post, we’ll cover the use of automated machine learning (AutoML) with Pentaho (our data integration platform of choice) and Grafana (visualisation) to rapidly design, build and evaluate a number of modelling approaches to predict air pressure system (APS) failures in heavy Scania trucks (Industrial Challenge 2016 set at The 15th International Symposium on Intelligent Data Analysis). We wanted to see if this approach could:

(1)    Speed up time to an adequately performing model.

(2)    Help us explore alternative modelling/data prep approaches.

(3)    Give us a repeatable, rapid development template for future projects.


We decided to use the dataset released by Scania CV AB on the UCI Machine Learning Repository. It is split into a training set and a test set, each with 171 attributes. Attribute names were anonymised for proprietary reasons, but each field corresponds to sensors on the truck. The training set has 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. To deal with the many missing values in the dataset, we used median imputation. The dataset is highly unbalanced, so we undersampled the negative class using a combination of Random Undersampling and the Condensed Nearest Neighbour Rule to create a balanced dataset. For scoring, we used the same indicative costs as in this article. Each False Positive costs $10 (an unnecessary check), and each False Negative costs $500 in associated repairs. Based on these values, we calculate a total cost.
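The cost metric described above can be written down in a few lines. This is a minimal sketch using the $10 and $500 figures from the text; the function names `total_cost` and `cost_from_labels` are our own and purely illustrative.

```python
# Indicative unit costs quoted in the text above.
COST_FALSE_POSITIVE = 10   # unnecessary check
COST_FALSE_NEGATIVE = 500  # missed APS failure and associated repairs


def total_cost(false_positives: int, false_negatives: int) -> int:
    """Return the total misclassification cost in dollars."""
    return (COST_FALSE_POSITIVE * false_positives
            + COST_FALSE_NEGATIVE * false_negatives)


def cost_from_labels(y_true, y_pred) -> int:
    """Count FP/FN from 0/1 labels and price them with total_cost."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return total_cost(fp, fn)


# Toy example: one false alarm and one missed failure.
print(cost_from_labels([1, 0, 0, 1], [1, 1, 0, 0]))  # 510
```

Because a missed failure is 50 times more expensive than a false alarm, this metric rewards models that trade some precision for higher recall.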

Our AutoML Toolkit

AutoML libraries aim to automate the process of feature engineering and model selection. We decided to use the Python TPOT (Tree-based Pipeline Optimization Tool) library, which uses an evolutionary algorithm to evaluate multiple machine learning techniques, feature engineering steps and hyperparameters. In other words, TPOT tries out a particular machine learning pipeline, evaluates its performance and randomly changes parts of the pipeline in search of better overall predictive power.
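To make the "try, evaluate, randomly change" idea concrete, here is a toy sketch in plain Python. It is not TPOT's implementation: the search space, the `fitness` function and every name in it are hypothetical stand-ins for "build a pipeline and cross-validate it".

```python
import random

# Hypothetical search space: each "pipeline" is one choice per slot.
SEARCH_SPACE = {
    "scaler": ["none", "minmax", "standard"],
    "model": ["tree", "forest", "boosting"],
    "max_depth": [3, 5, 10, 20],
}


def random_pipeline(rng):
    return {slot: rng.choice(options) for slot, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy the pipeline and randomly re-draw one slot."""
    child = dict(pipeline)
    slot = rng.choice(list(SEARCH_SPACE))
    child[slot] = rng.choice(SEARCH_SPACE[slot])
    return child


def fitness(pipeline):
    """Stand-in score; a real search would cross-validate the pipeline here."""
    score = {"tree": 0.6, "forest": 0.8, "boosting": 0.75}[pipeline["model"]]
    score += 0.05 if pipeline["scaler"] != "none" else 0.0
    score += 0.01 * (10 - abs(10 - pipeline["max_depth"])) / 10
    return score


def evolve(generations=20, population_size=10, mutation_rate=0.9, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        # Mutate a fraction of the population, then keep the fittest individuals.
        children = [mutate(p, rng) for p in population if rng.random() < mutation_rate]
        survivors = sorted(population + children, key=fitness, reverse=True)
        population = survivors[:population_size]
        best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


best, history = evolve()
print(best, round(history[-1], 3))
```

Since parents compete with their mutated children for survival, the best score in this toy loop can only stay level or improve from one generation to the next.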

In addition to this, we used Pentaho to build a reusable AutoML template for future projects, curate a real-time model catalogue and manage data prep. Grafana was used to visualise the performance of the various pipeline/algorithm combinations as the search progressed.

Getting started

In Pentaho we imported the Scania APS data, sampled to re-balance classes and embedded the required Python code into the data flow to make use of the TPOT functions. 

Figure 1: Data cleaning and undersampling in Pentaho
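For readers without Pentaho to hand, the two simplest data prep steps can be sketched in plain Python. This is an illustration under assumptions, not our Pentaho transformation: missing values are represented as `None`, the positive class (label 1) is assumed to be the minority, every column is assumed to have at least one observed value, and the Condensed Nearest Neighbour step is left out.

```python
import random
from statistics import median


def impute_median(rows):
    """Replace None entries with the median of the observed values in that column."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def random_undersample(rows, labels, seed=0):
    """Randomly drop negative (label 0) rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = sorted(pos + rng.sample(neg, k=len(pos)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: two sensor columns with gaps, four negatives and two positives.
rows = [[1.0, None], [3.0, 4.0], [None, 8.0], [5.0, 6.0], [7.0, 2.0], [9.0, None]]
labels = [0, 0, 0, 0, 1, 1]

filled = impute_median(rows)
balanced_rows, balanced_labels = random_undersample(filled, labels)
print(filled[0], balanced_labels)
```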

Our current task is a classification task (APS system failure/non-failure), so we’ll be using the TPOT Classifier. TPOT usage and syntax are similar to those of any other sklearn classifier. A full list of classifier parameters can be found in the TPOT documentation. For our development purposes we used:

generations: Number of iterations to run the pipeline optimization process.

population_size: Number of individuals to retain in the genetic programming population every generation.

mutation_rate: Tells the evolutionary algorithm how many pipelines to apply random changes to every generation.

scoring: Function used to evaluate the quality of a given pipeline for the classification problem. We used “recall” for starters, as it made sense to initially optimise for identifying all of the true APS failures due to their larger associated cost, rather than minimising unnecessary call-outs.

periodic_checkpoint_folder: If supplied, a folder in which TPOT will periodically save the pipelines on the Pareto front so far while optimizing. We wanted to record the best performing pipelines in real time as the search progressed, to get a feel for the best performing algorithms and to keep them for comparison later.

verbosity: How much information TPOT communicates while it's running. We wanted to see everything as the search progressed.

import pickle
import time

import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# x, y, path_to_folder, path_to_pipeline and file_name are assumed to be
# provided by the surrounding Pentaho step.

# Scale the training data and save the scaler
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
scaler_name = '/home/demouser/Desktop/Aldrin/scalers/scaler' + str(int(time.time())) + '.save'
joblib.dump(min_max_scaler, scaler_name)
X = np.asarray(x_scaled)

# Split into train and CV sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=None)

# Train the TPOT classifier
# (this listing scores with 'roc_auc'; 'recall' was the starting metric described above)
tpot = TPOTClassifier(generations=50, verbosity=3, population_size=100,
                      periodic_checkpoint_folder=path_to_folder,
                      scoring='roc_auc', mutation_rate=0.9)
tpot.fit(X_train, y_train)
output_score = str(tpot.score(X_test, y_test))

# Export the best pipeline as Python code and pickle the fitted pipeline
tpot.export(path_to_pipeline)
with open(file_name, 'wb') as f:
    pickle.dump(tpot.fitted_pipeline_, f)


Building and Visualising the Model Catalogue

We also ran a Pentaho job to collect the ML pipelines generated by TPOT and create a real-time model catalogue (Figures 2 & 3). We visualised the search for the best performing pipelines in Grafana (Figure 4), which allowed us to tweak our initial TPOT search parameters to get the right balance of running time vs accuracy.   

Figure 2: Pentaho job collecting TPOT pipelines and streaming results to Grafana.

Figure 3: Pentaho transformation for data ingestion, wrangling and executing the TPOT model search algorithm.


Figure 4: An initial example of the Grafana AutoML dashboard for tracking TPOT search progress, including best performing model pipelines and recall score over time.

Fast prototyping?

Our test set consisted of 16k samples, of which 375 belong to the positive class, i.e. failures in the APS system. To put that into context, these failures would cost Scania 375 * $500 = $187,500.

We were able to get pretty good results on our first pass: 82% recall and 78% precision on the test set’s positive class after less than 30 minutes of searching on our training dataset. Overall, however, the cost was still quite high: $33,400. After this 30-minute run, TPOT suggested an XGBoost-based model. How long TPOT needs to run to work well depends on data size, dimensionality, accuracy requirements and the computational power you have to throw at it. Users normally run the TPOT algorithm for hours or even days to get the best pipeline. Our aim was to bring the overall cost below $20k, cutting the initial cost by roughly a factor of two.

Model Explainability

Next, we wanted to understand how much each of the 170 features (in this case sensor feeds) contributed to our APS failure predictions. This would help us to understand which sensor feeds are important (given our lack of domain knowledge specific to Scania heavy trucks!), as well as those which actively hinder our success. Using the SHAP (SHapley Additive exPlanations) library and embedding the required Python code into our Pentaho data flow, we were able to identify the 30 most important features out of 170 and reduce the number used in the model training. 
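Once SHAP values have been computed, picking the most important features is a simple ranking step. The sketch below assumes the common convention of ranking by mean absolute SHAP value (the exact criterion is an assumption here), takes the SHAP matrix as given rather than computing it, and uses made-up feature names and numbers.

```python
def top_k_features(shap_values, feature_names, k):
    """Rank features by mean absolute SHAP value and return the top k names."""
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k], importance


# Toy SHAP matrix: 3 samples x 4 hypothetical sensor features.
shap_values = [
    [0.30, -0.01, 0.10, 0.00],
    [-0.25, 0.02, 0.05, 0.00],
    [0.35, 0.00, -0.15, 0.01],
]
names = ["sensor_a", "sensor_b", "sensor_c", "sensor_d"]

selected, importance = top_k_features(shap_values, names, k=2)
print(selected)  # ['sensor_a', 'sensor_c']
```

Taking the absolute value matters: a feature that pushes some predictions up and others down would otherwise average out to near zero and look unimportant.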

Exploring different modelling approaches

Our AutoML dashboard in Grafana allowed us to see which modelling approaches were performing well for our use case (example view in Figure 4). Further to this, the SHAP values allowed us to explore which features are most important for model predictions, the distributions for each feature contributing to the probability of a positive classification, and the dependencies between different features (Figure 5).

Figure 5: Visualising SHAP values in Grafana to explore which features are most important for model predictions and the dependencies between different features.


Initial Results

After running our refined data flow for 8 hours, TPOT came up with a RandomForest model, which is in line with the previous work [1], where RandomForests also produced the best results. We identified a threshold of 0.2 as the most promising in terms of balancing False Positives and False Negatives. Our final model, with a 0.2 threshold for positive class probability, identified 357 True Positives, 639 False Positives, 14,986 True Negatives and 18 False Negatives, or 95.2% recall and 35.8% precision. This equated to a total cost of $6,390 for unnecessary maintenance call-outs and $9,000 for missed APS failures, comparing well to the costs reported in previous work and during the initial challenge. The total cost was $15,390, more than a twofold improvement over the initial run.
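One plausible way to pick such a threshold is to sweep a few candidates and keep the one with the lowest total cost. The sketch below is an assumption about the procedure rather than a record of it; the probabilities and candidate thresholds are toy values, and the $10/$500 costs are the ones quoted earlier.

```python
def confusion_at(threshold, y_true, proba):
    """Return (tp, fp, tn, fn) when predicting positive for proba >= threshold."""
    tp = fp = tn = fn = 0
    for truth, p in zip(y_true, proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def best_threshold(y_true, proba, thresholds, fp_cost=10, fn_cost=500):
    """Pick the candidate threshold with the lowest total cost."""
    def cost(t):
        _, fp, _, fn = confusion_at(t, y_true, proba)
        return fp_cost * fp + fn_cost * fn
    return min(thresholds, key=cost)


# Toy data: four negatives, two positives, with predicted failure probabilities.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.30, 0.60, 0.25, 0.90]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

print(best_threshold(y_true, proba, candidates))  # 0.2
```

With a 50:1 cost ratio, the sweep tends to settle on low thresholds: accepting a couple of extra false alarms is cheaper than missing a single failure.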

Further improvements

We were able to significantly improve the model performance, but it is still far from the best model performance shown in [1]. We decided to tune the hyperparameters of the best pipeline output by TPOT. After performing a hyperparameter grid search for the Random Forest model (best parameters n_estimators=200, max_depth=10), we were able to get to an overall cost of $12,320, with 13 False Negatives and 582 False Positives, 38.3% precision and 96.5% recall. This brings us closer to the best cost produced by models in [1] and the original challenge.
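The reported figures can be cross-checked from the confusion counts alone. This short sketch recomputes recall, precision and cost for the 0.2-threshold model and for the tuned model; the True Positive count for the tuned model is derived on the assumption that the test set still contains 375 positives, and `summarise` is an illustrative helper name.

```python
def summarise(tp, fp, fn, fp_cost=10, fn_cost=500):
    """Return recall, precision and total cost for one set of confusion counts."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "cost": fp_cost * fp + fn_cost * fn,
    }


# Counts reported for the 0.2-threshold model before tuning.
before = summarise(tp=357, fp=639, fn=18)

# Counts reported after the grid search; TP derived from 375 positives.
after = summarise(tp=375 - 13, fp=582, fn=13)

for label, result in (("before tuning", before), ("after tuning", after)):
    print(f"{label}: recall={result['recall']:.1%}, "
          f"precision={result['precision']:.1%}, cost=${result['cost']:,}")
```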


The TPOT library paired with SHAP provides a nice toolset for exploring a variety of models, whilst building a better understanding of which features are influencing model behaviour. Pros of this approach include its relative ease of use and the small number of parameters that require configuration. On the other hand, TPOT was able to find the best model configuration, but couldn’t optimally tune the hyperparameters within a 10-hour running period. This highlights one of the main cons from our perspective: the computational cost of both methods (TPOT and SHAP). One needs to wait quite a long time (e.g. 10 hours) to get to a result but, as stated by the creators of TPOT, AutoML algorithms are not meant to run for half an hour. In all, we felt that the AutoML approach reduced our experimentation time, or “time to an adequate prototype solution”, given our unfamiliarity with the dataset. There may be little benefit in terms of reduced development time when working with familiar data sources; however, TPOT could be used in an ancillary fashion to search for alternative modelling approaches, or to justify the current modelling techniques used.


[1] APS Failure at Scania Trucks by mrunal sawant in @thestartup_

[2]  Ferreira Costa, Camila & Nascimento, Mario. (2016). IDA 2016 Industrial Challenge: Using Machine Learning for Predicting Failures. 381-386. 10.1007/978-3-319-46349-0_33.




