
Pentaho


   

 

What Can You Do with Deep Learning in Pentaho?

 

By Ken Wood and Mark Hall

 

For those of you who have installed and are using the Plugin Machine Intelligence (PMI) plugin that Hitachi Vantara Labs released to the Pentaho Marketplace back in March 2018, get ready for an exciting new update. This fall, we will release PMI version 1.4 as an update to the existing PMI, which is an experimental plugin for Pentaho Data Integration (PDI). Our initial release of PMI focused on classical machine learning and the ability to build, use and manage machine learning models from four popular machine learning libraries: Python’s Scikit-learn, R’s MLR (Machine Learning in R), Spark’s MLlib and WEKA.

PMIDLLogo.png

 

I say classical machine learning because classical machine learning has traditionally had its greatest success on structured data. With the next release of PMI, we integrate a new machine learning library, or “execution engine” as we call them: Deep Learning for Java (DL4J). This means PMI can now perform deep learning operations (training, validating, testing, building, evaluating and using deep learning models) directly from PDI.

 

AIDomainsDiagram.png

 

Deep Learning is gaining lots of attention in the industry for its ability to operate on unstructured data like images, video, audio etc. Deep Learning is a recent addition to the Artificial Intelligence domain of machine learning, though technically the technology has been around for quite some time.

 

 

NNComparison.png

 

Deep learning gets its name, to some degree, from the deep, complex, hidden neural network layers the technology creates to analyze data. To be clear, both machine learning and deep learning can operate on both structured and unstructured data. It’s just that, in current general practice, deep learning has had greater success on unstructured data and classical machine learning on structured data.

 

The reason we’re blogging about this now is because we showcased and demonstrated PMI v1.4 with deep learning at Hitachi NEXT 2018 in San Diego. Along with a series of one-on-one workshops showing the new deep learning step with PDI and PMI v1.4, we demonstrated an example application using deep learning in an interactive apparatus that uses two deep learning models in a PDI transformation, and then uses PDI to drive the entire application.

 

DLTransformation.png

 

When called, this PDI transformation runs through several parts:

  • The “Data Capture and Data Preparation” phase
    • This portion of the transformation starts by narrating what the entire transformation will do
    • Then communicates with a Raspberry Pi to capture a picture of a physical x-ray - essentially analog to digital conversion
    • Information about the image is then transformed into image metadata: basically, an in-memory location of the actual digital image
  • The PDI transformation then executes the two deep learning models on the x-ray image. The two deep learning models vectorize the image into usable numbers, determine the probability of each body part being the one the image focuses on, and detect whether an injury or anomaly exists.
  • The results of the two deep learning models are the probabilities from:
    • A multi-class classifier – Shoulder, Humerus, Elbow, Forearm and Hand
    • And a 2-class classifier, injury or anomaly detected – yes or no
    • These probabilities are numbers between 0 and 1
  • The next phase of the PDI transformation, “Results Preparation” takes the output probabilities (numbers between 0 and 1) from the deep learning models and prepares the result for use.
    • Determine the most likely value – max value is the “answer”
    • Format the 5 decimal digit value into a percentage and into a string
      • This formatting allows the next phase to say “Forty seven percent” instead of "4, 7, percent sign"
  • The last phase, “Confidence Dialog Preparation”, builds logic for the different speaking phrases and applies confidence to the result as an analysis.
    • For example, instead of saying “There is a 98% chance that this elbow is injured,” just say “I detect that this elbow is injured.” At 98%, we’ve determined that it is injured, but at 47%, we’re not too sure, so the spoken analysis would be “I detect a 47% probability that this elbow is injured, you might want to have it checked out.”
    • This confidence logic applies to both the body part identification and the injury detection parts of the spoken analysis.
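
The “Results Preparation” and “Confidence Dialog Preparation” phases described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the PDI steps used in the demo: the function names are made up for this example, and the 0.90 cut-off for "confident" is an assumed value, since the post does not state the threshold that was used.

```python
# Body-part labels come from the post; the 0.90 cut-off is an assumed value.
BODY_PARTS = ["Shoulder", "Humerus", "Elbow", "Forearm", "Hand"]
CONFIDENT = 0.90  # hypothetical threshold, not taken from the demo


def most_likely(labels, probabilities):
    """Return the (label, probability) pair with the highest probability."""
    return max(zip(labels, probabilities), key=lambda pair: pair[1])


def as_percent(probability):
    """Format a probability between 0 and 1 as a whole-number percent string."""
    return f"{round(probability * 100)}%"


def injury_phrase(body_part, injury_probability):
    """Build the spoken analysis, hedging when confidence is low."""
    part = body_part.lower()
    if injury_probability >= CONFIDENT:
        return f"I detect that this {part} is injured."
    return (
        f"I detect a {as_percent(injury_probability)} probability that this "
        f"{part} is injured, you might want to have it checked out."
    )


part, _ = most_likely(BODY_PARTS, [0.01, 0.02, 0.93, 0.03, 0.01])
print(injury_phrase(part, 0.98))
print(injury_phrase(part, 0.47))
```

The same pattern would apply to the body part identification: pick the maximum, then choose between a plain statement and a hedged one depending on how high that maximum is.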

 

A diagram of the "Deep Learning Pipeline" can be seen here.

HeyRayTweet.png

  • We use a "Speech Recognition Module" written in Python to capture spoken phrases and determine the actions to be taken.
  • In case the environment is too noisy for sound, a special remote control application is available to manually execute the "Hey Ray!" command set.
    • A main transformation is used to interpret the incoming tasks and orchestrate the execution of other transformations as needed.
      • The tasks include:
        • Introduction narration
        • Help on how to use "Hey Ray!"
        • Analyze the x-ray film and provide the results speech
        • Save the current analysis session to the Hitachi Content Platform (HCP)
          • During this operation, the content, x-ray image and analysis phrases, are converted into a single image movie file, then all of the content is saved to HCP
        • You can have "Hey Ray!" tweet the movie file
        • Provide insightful thoughts and opinions
        • And finally, "Hey Ray!" can tell radiologist jokes

 

 

HeyRayDemoConfiguration.png

 

We call this demonstration “Hey Ray!”. “Hey Ray!” is just an example of applying deep learning to a situation. We came up with "Hey Ray!" because of the dataset we had access to, which just happens to be x-ray images. We could have created something with flowers, food, automobiles, etc. We also decided to speak the results and add speech recognition for demonstration and "Wow Factor" for the Hitachi NEXT conference. Also, we felt that creating charts of probability distributions of numbers between 0 and 1 would take too long to explain, so why not have the demonstration state the results? This demonstration turned out to be highly interactive, as the attendees could select an x-ray picture, insert it into the x-ray viewing screen and tell the device to "Analyze the x-ray".

 

 

NEXTDemoPictureLabeled.png

 

 

We will be providing more blogs about PMI 1.4 with deep learning and other information on the artificial intelligence that goes into “Hey Ray!” in the coming months to help support this release. Stay tuned!

 

What can you do with machine learning and now deep learning in Pentaho?

 

 

IMPORTANT NOTE:

It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time.  It is recommended that this experimental feature be used for testing, educational and exploration purposes only. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.

Sandra Wagner is part of our Customer Success & Support team dedicated to Pentaho and Analytics. You might also know her as The Goddess of Best Practices from the Support Portal. We want to make sure all customers who are using Pentaho know where to find helpful resources including Support, Best Practices and so much more.

 

 

Confused about how to upgrade Pentaho?

 

Upgrading to Pentaho 8.1 can seem like a complicated process, but it does not have to be difficult. We have published guidelines and best practices that answer some common questions about upgrading Pentaho. We’ve included a checklist of steps to take, such as which upgrade path to use to get to Pentaho 8.1, what to back up and restore, when to update the design tools, and more:

 

Picture1.png

 

You should have all necessary information and software available to you, and then it will be a simple matter of following your upgrade path from its beginning to its end. There is a comprehensive and downloadable version of this checklist to help you record and keep track of the information you’ll need to upgrade.

 

Picture1.png

If you have custom configurations, contact your CSM, then Support, and let them know before you upgrade.

 

 

There is also a PDF version available for download at the bottom of Guidelines for Successfully Upgrading to Pentaho 8.1. Here are a few more links that you might find helpful:

 

 

Click here to download a full guide on Upgrading to Pentaho 8.1

MyRepublic, one of the fastest growing telecom operators in Asia-Pacific, is disrupting the traditional telecommunications market with the introduction of TelcoTech, which uses data and new open source technologies, analytics and machine learning to create new business models.

 

One of the key tenets of the TelcoTech vision is providing telecommunications operators with the ability to enter markets quickly and provide services rapidly.

 

MyRepublic partners with Hitachi Vantara to revolutionize TelcoTech. “The implementation of Pentaho has strengthened MyRepublic’s TelcoTech strategy across the region which will help us scale quickly and expand our offerings to other markets in future.” Eugene Yeo, Group Chief Information Officer, MyRepublic.

 

The efficiencies gained from integrating the Pentaho open platform and leveraging the extensive library of data integration connectors helps MyRepublic further enhance the ability of its platform to deliver on this promise.

 

To learn more about MyRepublic’s success, check out their case study and podcast.

 

“While we have made significant manpower savings on data integration and reporting, the bigger benefit is the robust data pipeline that has been built. Pentaho allows us to add data to this pipeline rapidly, which is important to this vision. It paves the way for us to create new data monetization models, which will lead to innovation in the industry, just like what FinTech players achieved with the financial services industry.”

 

Eugene Yeo, Group Chief Information Officer, MyRepublic

download.jpg

After much research I was able to run Pentaho Carte on a Raspberry Pi 3. It was necessary to customize the SWT library, adjust packages in the operating system and add support for the ARMv7 architecture in Pentaho.

 

 

We will have news soon

 

Pentaho PDI Kettle Carte on Raspberry Pi 3 - YouTube

Running Pentaho Kettle PDI Carte on Raspberry Pi 3 - YouTube

I'm excited to spotlight one of our amazing employees, Sandra Wagner. She is part of our Customer Success & Support team dedicated to Pentaho and Analytics. You might also know her as The Goddess of Best Practices from the Support Portal. We want to make sure all customers who are using Pentaho know where to find helpful resources including Support, Best Practices and so much more.

 

 

Wagner_Sandra.jpg

I’ve been with Pentaho (now part of Hitachi Vantara) for nearly six years. Coming to Pentaho at that early stage was very exciting, and I feel lucky that I’ve been able to learn a variety of things. When I first arrived here as a senior technical writer, the team was quite small and, out of necessity, we all wore many hats to deliver our product documentation. Over the years, my role has morphed from technical writer to project lead to team lead to process owner to editor, often all at once! I am currently leading a small group to craft and maintain the Best Practices suite of documentation for the Customer Care group, Support, and wearing all the previously mentioned hats.

 

Since most people have never heard of technical writing or technical writers, here is what we do: we explain hard technical stuff using simplified language. We need to be curious enough about technology to want to play with it, be able to use it if possible, and then we need to be able to show others how to use it as well.

 

Once we figure those things out, our goal is to explain and arrange the information so that it is as easy to absorb and use as possible. If the material is too dense, full of jargon and unnecessary information, then no one will want to wade through all of that to find what they need! That is why we try to keep things as simple as possible.

 

How do I Get to the Pentaho Documentation?

Hitachi Vantara has several types of documentation to help you learn about and use Pentaho software. This can include knowledge base articles, best practices, Pentaho documentation, webinars, and videos, among others.

 

It helps a bit to think of the content called Pentaho Documentation as a virtual set of “product manuals” that typically come with any new product, while the content on the Pentaho Customer Portal can best be described as a collection of knowledge base articles, best practices, guidelines, and webinars that cover methods on the best ways to do things while using the Pentaho suite “out in the wild”.

We’ll be talking chiefly about the articles found on the Customer Portal here, explaining a bit about the different types of content and showing you how to find the information you need.

 

About the Hitachi Vantara – Pentaho Customer Portal

 

Screen Shot 2018-06-29 at 1.27.14 PM.png

The idea behind the original Pentaho Customer Portal was to give customers a single point of entry to access everything to do with Pentaho, and to ask for help if they need it. For example, you can quickly find Pentaho evaluation materials, best practices, knowledge base articles, Pentaho training, Pentaho documentation, and software / service-pack downloads from the front page of the portal. You can subscribe to any of the articles available in the Pentaho Customer Portal after you sign in to the portal.

 

To access the Pentaho product support portal, click here.

 

Where Do We Get Our Ideas?

We gather ideas about topics from customers asking questions, from support tickets, from Solution Architects, from Services; if it is being asked frequently, we'll figure out a way to produce some content about it. If you have an idea or request, please comment below.

 

Knowledge Base Articles

Hitachi Vantara’s Technical Support Engineers for Pentaho create and update our Knowledge Base articles daily. You can access the Knowledge Base after you log into the Customer Portal and select the Knowledge Base widget from the front page. The knowledge base primarily consists of troubleshooting tips and how-to articles. If you would like to subscribe to any article hosted through the Portal, just click the Subscribe button next to the page title.

 

When you click on the Best Practices button from the Portal, you will find this:

 

Screen Shot 2018-06-29 at 1.25.36 PM.png

 

Hitachi Vantara’s Webinar Series on Pentaho

Our webinars are created and conducted by Solutions Architects and members of the Services team to engage with and educate our customers, giving them an opportunity to learn from experts and ask questions. The webinars are conducted live on the last Tuesday of the month. After the webinar is over, all the associated materials – video presentations, supplemental documentation and videos, FAQs, related links – are published together in the Customer Portal.

 

To find all upcoming webinars, click here.

 

Best Practices and Guidelines for Pentaho

The Best Practices and Guidelines are developed over time by our Solution Architects and Services teams during customer implementations. These spring out of the internal notes that each architect would create on site; we wanted to capture that information to share with everyone. We produce content that explains the best ways to configure environments, supplements the product documentation with details from the field, and gives guidance on optimal integration of Pentaho with 3rd party tools. Eventually, we had so many field-tested best practices, guidelines, and how-to articles, all published on individual pages, that we ended up restructuring everything so that information would be easier to find.

 

For all Best Practices available through the Pentaho Customer Portal, click here.

We apologize! This blog from Mark Hall went offline when the Pentaho.com website was replaced with the HitachiVantara.com site. We know many of our followers and supporters in the Pentaho community, as well as the data science community, still refer to this great piece of work. So, here it is back online at its new location here in the Hitachi Vantara Community. Hopefully, this wasn't a huge inconvenience. Thank you for your understanding.

 

 

Line.png

 

by Mark Hall | March 14, 2017

Original4StepsML.png

 

The power of Pentaho Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. However, perhaps less well known is how PDI as a platform, with all its data munging[1] power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM[2] life-cycle for the data science practitioner: generic data preparation/feature engineering, predictive modeling, and model deployment.

crisp-dm-process2.jpg

The CRISP-DM Process

 

By "generic data preparation" we are referring to the process of connecting to (potentially) multiple heterogeneous data sources and then joining, blending, cleaning, filtering, deriving and denormalizing data so that it is ready for consumption by machine learning (ML) algorithms. Further ML-specific data transformations, such as supervised discretization, one-hot encoding etc. can then be applied as needed in an ML tool. For the data scientist, PDI can be used to remove the drudgery of manually repeating similar data preparation processes from one dataset to the next. Furthermore, Pentaho's Streamlined Data Refinery can be used to deliver modeling-ready datasets to the data scientist at the click of a button, removing the need to burden the IT department with requests for such data.

 

When it comes to deploying a predictive solution, PDI accelerates the process of operationalizing machine learning by working seamlessly with popular libraries and languages, such as R, Python, WEKA and Spark MLlib. This allows output from team members developing in different environments to be integrated within the same framework, without dictating the use of a single predictive tool.

 

In this blog, we present a common predictive use case, and step through the typical workflow involved in developing a predictive application using Pentaho Data Integration and Pentaho Data Mining.

 

Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their web site, and ship goods directly to the customer. Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database. Orders, as they come in, are stored in a MongoDB database. There is also a report of historical instances of fraud contained in a CSV spreadsheet.

 

Step 1

GENERIC DATA PREPARATION/FEATURE ENGINEERING

 

With the goal of preparing a dataset for ML, we can use PDI to combine these disparate data sources and engineer some features for learning from it. The following figure shows a transformation demonstrating an example of just that, and includes some steps for deriving new fields. To begin with, customer data is joined from several relational database tables, and then blended with transactional data from MongoDB and historical fraud occurrences contained in a CSV file. Following this, there are steps for deriving additional fields that might be useful for predictive modeling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.

 

blending_data_engineering_features.png

Blending data and engineering features
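
For readers who think in code rather than in transformation steps, the three derived fields can be sketched in plain Python. This is not what PDI executes; it is a minimal illustration of the same logic. The in-memory `customers`, `orders` and `fraud_transaction_ids` structures stand in for the relational database, MongoDB and CSV sources, and the `date_of_birth` and `order_time` field names are assumptions made for the example.

```python
from datetime import date, datetime

# Hypothetical stand-ins for the three sources in the scenario.
customers = {  # relational database rows keyed by customer_id
    101: {"date_of_birth": date(1990, 5, 17), "customer_billing_zip": "92101"},
}
orders = [  # documents as they might arrive from MongoDB
    {"transaction_id": "T1", "customer_id": 101, "ship_to_zip": "10001",
     "order_time": datetime(2017, 3, 1, 23, 15)},
]
fraud_transaction_ids = {"T1"}  # transaction ids listed in the fraud CSV


def age_on(day, born):
    """Whole years between a birth date and a reference day."""
    before_birthday = (day.month, day.day) < (born.month, born.day)
    return day.year - born.year - before_birthday


def build_row(order):
    """Join one order to its customer and derive the extra fields."""
    customer = customers[order["customer_id"]]
    placed = order["order_time"]
    return {
        "transaction_id": order["transaction_id"],
        "age": age_on(placed.date(), customer["date_of_birth"]),
        "hour_of_day": placed.hour,
        "billing_shipping_zip_equal":
            customer["customer_billing_zip"] == order["ship_to_zip"],
        "reported_as_fraud_historic":
            order["transaction_id"] in fraud_transaction_ids,
    }


rows = [build_row(order) for order in orders]
print(rows[0])
```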

 

This process culminates in the output of flattened data (a data scientist’s preferred data shape) in both CSV and ARFF (Attribute-Relation File Format), the latter being the native file format used by PDM (Pentaho Data Mining, AKA WEKA). We end up with 100,000 examples (rows) containing the following fields:

 

customer_name            

customer_id               

customer_billing_zip           

transaction_id           

card_number               

expiry_date               

ship_to_address     

ship_to_city              

ship_to_country          

ship_to_customer_number             

ship_to_email            

ship_to_name              

ship_to_phone            

ship_to_state            

ship_to_zip               

first_time_customer            

order_dollar_amount            

num_items           

age            

web_order           

total_transactions_to_date          

hour_of_day                               

billing_shipping_zip_equal

reported_as_fraud_historic

 

From this list, for the purposes of predictive modeling, we can drop the customer name, ID fields, email addresses, phone numbers and physical addresses. These fields are unlikely to be useful for learning purposes and, in fact, can be detrimental due to the large number of distinct values they contain.
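
One rough way to spot such fields automatically is to measure how many distinct values each text field contains relative to the number of rows. The sketch below is an illustration only: the helper name and the 0.5 ratio are assumptions, the toy rows are invented, and `web_order` is shown as a "yes"/"no" string purely for the sake of the example.

```python
def high_cardinality_fields(rows, max_ratio=0.5):
    """Return text fields whose share of distinct values exceeds max_ratio."""
    flagged = []
    for field in rows[0]:
        values = [row[field] for row in rows]
        if not all(isinstance(value, str) for value in values):
            continue  # numeric fields are left for the learner to handle
        if len(set(values)) / len(values) > max_ratio:
            flagged.append(field)
    return flagged


rows = [
    {"customer_name": "A. Lee", "web_order": "yes", "age": 31},
    {"customer_name": "B. Cho", "web_order": "yes", "age": 45},
    {"customer_name": "C. Diaz", "web_order": "no", "age": 27},
    {"customer_name": "D. Roy", "web_order": "yes", "age": 52},
]
drop = high_cardinality_fields(rows)
model_rows = [{k: v for k, v in row.items() if k not in drop} for row in rows]
print(drop)
```

A field such as `customer_name` is nearly unique per row, so a learner can only memorize it; a field with a handful of repeated values can actually generalize.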

 

Step 2

TRAIN, TUNE, TEST MACHINE LEARNING MODELS TO IDENTIFY THE MOST ACCURATE MODEL

 

So, what does the data scientist do at this point? Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations, followed by applying quick techniques for assessing the relationship between individual attributes (fields) and the target of interest which, in this example, is the "reported_as_fraud_historic" field. Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining - i.e. the process of determining which predictive techniques are going to give the best result for a given problem.

 

The following figure shows an ML process, for initial exploration, designed in WEKA's Knowledge Flow environment. It demonstrates three main exploratory activities:

 

    1. Assessment of variable importance. In this example, the top five variables most correlated with "reported_as_fraud_historic" are found, and can be visualized as stacked bar charts/histograms.
    2. Knowledge discovery via decision tree learning to find key variable interactions.
    3. Initial predictive evaluation. Four ML classifiers—two from WEKA, and one each from Python Scikit-learn and R respectively—are evaluated via 10-fold cross validation.

 

WekaKnowledgeFlowDiagram.png

Exploratory Data Mining
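
The first exploratory activity, ranking variables by how strongly they are correlated with the target, can be approximated with a few lines of Python. The post does not say which correlation measure the WEKA flow uses, so this sketch assumes plain Pearson correlation on numerically coded fields; the function names and the five toy rows are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def top_correlated(rows, target, k=5):
    """Rank numeric fields by absolute correlation with the target field."""
    ys = [float(row[target]) for row in rows]
    scores = {
        field: abs(pearson([float(row[field]) for row in rows], ys))
        for field in rows[0] if field != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


TARGET = "reported_as_fraud_historic"
rows = [
    {"billing_shipping_zip_equal": 0, "num_items": 9, "hour_of_day": 3, TARGET: 1},
    {"billing_shipping_zip_equal": 0, "num_items": 7, "hour_of_day": 14, TARGET: 1},
    {"billing_shipping_zip_equal": 1, "num_items": 2, "hour_of_day": 9, TARGET: 0},
    {"billing_shipping_zip_equal": 1, "num_items": 1, "hour_of_day": 20, TARGET: 0},
    {"billing_shipping_zip_equal": 1, "num_items": 6, "hour_of_day": 11, TARGET: 0},
]
print(top_correlated(rows, TARGET, k=2))
```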

 

Visualization of the top five variables (ordered from left-to-right, top-to-bottom) correlated with fraud shows some clear patterns. In the figure below, blue indicates fraud, and red the non-fraudulent orders. There are more instances of fraud when the billing and shipping zip codes are different. Fraudulent orders also tend to have a higher total dollar value attached to them, involve more individual items and be perpetrated by younger customers.

 

top_drivers_of_fraud.png

Top Drivers of Fraud

 

The next figure visualizes attribute interactions in a WEKA decision tree viewer. The tree has been limited to a depth of five in order to focus on the strongest (most likely to be statistically stable) interactions, i.e., those closest to the root of the tree. As expected, given the correlation analysis, the attribute "billing_shipping_zip_equal" forms the decision at the root of the tree. Inner (decision) nodes are shown in green, and predictions (leaves) are white. The first number in parentheses at a leaf shows how many training examples reach that leaf; the second how many were misclassified. The numbers in brackets are similar, but apply to the examples that were held out by the algorithm to use when pruning the tree. Variable interactions can be seen by tracing a path from the root of the tree to a leaf. For example, in the top half of the tree, where billing and shipping zip codes are different, we can see that young, first-time customers, who spend a lot on a given order (of which there are 5,530 in the combined training and pruning sets), have a high likelihood of committing credit card fraud.

 

variable_interactions.png

Variable Interactions

 

The last part of the exploratory process involves an initial evaluation of four different supervised classification algorithms. Given that our visualization shows that decision trees appear to be capturing some strong relationships between the input variables, it is worthwhile including them in the analysis. Furthermore, because WEKA has no-coding integration with ML algorithms in the R[4] statistical software and the Python Scikit-learn[5] package, we can get a quick comparison of decision tree implementations from all three tools. Also included is the ever-popular logistic regression learner. This will give us a feel for how well a linear method does in comparison to the non-linear decision trees. There are many other learning schemes that could be considered; however, trees and linear functions are popular starting points.

 

four_different_supervised_classification_algorithms.png

Four Different Supervised Classification Algorithms

 

The WEKA Knowledge Flow process captures metrics relating to the predictive performance of the classifiers in a Text Viewer step, and ROC curves - a type of graphical performance evaluation - are captured in the Image Viewer step. The figure below shows WEKA's standard evaluation output for the J48 decision tree learner.

 

evaluation_output_for_the_j48_decision_tree.png

Evaluation Output for the J48 Decision Tree

 

It is beyond the scope of this article to discuss all the evaluation metrics shown in the figure but, suffice to say, decision trees appear to perform quite well on this problem. J48 only misclassifies 2.7% of the instances. The Scikit-learn decision tree's performance is similar to that of WEKA's J48 (2.63% incorrect), but the R "rpart" decision tree fares worse, with 14.9% incorrectly classified. The logistic regression method performs the worst with 17.3% incorrectly classified. It is worth noting that default settings were used with all four algorithms.

 

For a problem like this, where a fielded solution would produce a top-n report listing those recently received orders that have the greatest likelihood of being fraudulent according to the model, we are particularly interested in the ranking performance of the different classifiers. That is, how well each does at ranking actual historic fraud cases above non-fraud ones when the examples are sorted in descending order of predicted likelihood of fraud. This is important because we'll want to manually investigate the cases that the algorithm is most confident about, and not waste time on potential red herrings. Receiver Operating Characteristic (ROC) curves graphically depict ranking performance, and the area under such a curve is a statistic that conveniently summarizes the curve[6]. The figure below shows the ROC curves for the four classifiers, with the number of true positives shown on the y axis and false positives shown on the x axis. Each point on the curve, increasing from left to right, shows the number of true and false positives in the n rows taken from the top of our top-n report. In a nutshell, the more a curve bulges towards the upper left-hand corner, the better the ranking performance of the associated classifier is.

 

comparing_performance_with_roc_curves.png

Comparing Performance with ROC Curves
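
To make the link between the top-n report and the ROC curve concrete, here is a small Python sketch. It is not WEKA's implementation: `roc_points` simply walks down the ranked list counting true and false positives (assuming no tied scores), and `auc` uses the equivalent pairwise definition of the area, namely the share of fraud/non-fraud pairs in which the fraud case is ranked higher. The labels and scores are invented.

```python
def roc_points(labels, scores):
    """Walk down the ranked list, counting true and false positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    points = [(0, 0)]
    for _, is_fraud in ranked:
        if is_fraud:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))  # (x, y) after the top n rows
    return points


def auc(labels, scores):
    """Area under the ROC curve as the share of correctly ordered pairs."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]              # 1 = historic fraud
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # predicted likelihood of fraud
print(roc_points(labels, scores))
print(auc(labels, scores))
```

Here one non-fraud order is ranked above one fraud order, so eight of the nine fraud/non-fraud pairs are ordered correctly and the area falls just short of a perfect 1.0.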

 

At this stage, the practitioner might be satisfied with the analysis and be ready to build a final production-ready model. Clearly decision trees are performing best, but is there a (statistically) significant difference between the different implementations? Is it possible to improve performance further? There might be more than one dataset (from different stores/sites) that needs to be considered. In such situations, it is a good idea to perform a more principled experiment to answer these questions. WEKA has a dedicated graphical environment, aptly called the Experimenter, for just this purpose. The Experimenter allows multiple algorithm and parameter setting combinations to be applied to multiple datasets, using repeated cross-validation or hold-out set testing. All of WEKA's evaluation metrics are computed and results are presented in tabular fashion, along with tests for statistically significant differences in performance. The figure below shows the WEKA Experimenter configured to run a 10 x 10-fold cross-validation[3] experiment involving seven learning algorithms on the fraud dataset. We've used decision tree and random forest implementations from WEKA and Scikit-learn, and gradient tree boosting from WEKA, Scikit-learn and R. Random forests and boosting are two ensemble learning methods that can improve the performance of decision trees. Parameter settings for implementations of these in WEKA, R and Python have been kept as similar as possible to make a fair comparison.

 

The second figure below shows the analysis of the results once the experiment has completed. Average area under the ROC curve is compared, with the J48 decision tree classifier set as the base comparison on the left. Asterisks and "v" symbols indicate where a scheme performs significantly worse or better, respectively, than J48 according to a paired corrected t-test. Although Scikit-learn's decision trees are less accurate than J48, when boosted they slightly (but significantly) outperform boosted versions in R and WEKA. However, when analyzing elapsed time, they are significantly slower to train and test than the R and WEKA versions.

 

configuring_an_experiment.png

Configuring an Experiment

 

analyzing_results.png

Analyzing Results
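
The 10 x 10-fold cross-validation design used in the experiment can be sketched as a generator of train/test index splits. This is a simplified illustration rather than the Experimenter's own code: the function name and seed are arbitrary, and this sketch does not stratify the folds by class.

```python
import random


def repeated_kfold(n_rows, folds=10, repeats=10, seed=1):
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    for _ in range(repeats):
        rng.shuffle(indices)  # reshuffle the data between runs
        for fold in range(folds):
            test = indices[fold::folds]  # every folds-th row, offset by fold
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


splits = list(repeated_kfold(n_rows=1000))
print(len(splits))  # 10 runs x 10 folds
```

Each learning scheme is trained and evaluated once per split, which is where the 100 models and 100 sets of performance statistics per scheme and dataset come from.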

 

Step 3

DEPLOY PREDICTIVE MODELS IN PENTAHO

 

Now that the best predictive scheme for the problem has been identified, we can return to PDI to see how the model can be deployed and then periodically re-built on up-to-date historic data. Rebuilding the model from time-to-time will ensure that it remains accurate with respect to underlying patterns in the data. If a trained model is exported from WEKA, then it can be imported directly into a PDI step called Weka Scoring. This step handles passing each incoming row of data to the model for prediction, and then outputting the row with predictions appended. The step can import any WEKA classification or clustering model, including those that invoke a different environment (such as R or Python). The following figure shows a PDI transformation for scoring orders using the Scikit-learn gradient boosting model trained in WEKA. Note that we don't need the historic fraud spreadsheet in this case as that is what we want the model to predict for the new orders!

 

deploying_a_predictive_model_in_pdi.png

Deploy a Predictive Model in PDI
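
Conceptually, the scoring pattern described above (pass each incoming row to the model, then output the row with the prediction appended) looks like the following Python sketch. It is not the Weka Scoring step itself: `toy_model` is a hypothetical stand-in for a trained model, and the `fraud_probability` output field name is an assumption.

```python
def score_rows(rows, predict_fraud_probability):
    """Yield each incoming row with the model's prediction appended."""
    for row in rows:
        scored = dict(row)  # leave the incoming row untouched
        scored["fraud_probability"] = predict_fraud_probability(row)
        yield scored


def toy_model(row):
    """Hypothetical stand-in for a trained model; not a real classifier."""
    return 0.1 if row["billing_shipping_zip_equal"] else 0.8


new_orders = [
    {"transaction_id": "T7", "billing_shipping_zip_equal": True},
    {"transaction_id": "T8", "billing_shipping_zip_equal": False},
]
report = sorted(score_rows(new_orders, toy_model),
                key=lambda row: row["fraud_probability"], reverse=True)
print([row["transaction_id"] for row in report])
```

Sorting the scored rows in descending order of predicted likelihood gives the top-n report discussed in Step 2.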

 

PDI also supports the data scientist who prefers to work directly in R or Python when developing predictive models and engineering features. Scripting steps for R and Python allow existing code to be executed on PDI data that has been converted into data frames. With respect to machine learning, care needs to be taken when dealing with separate training and test sets in R and Python, especially with respect to categorical variables. Factor levels in R need to be consistent between datasets (same values and order); the same is true for Scikit-learn and, furthermore, because only numeric inputs are allowed, all categorical variables need to be converted to binary indicators via one-hot encoding (or similar). WEKA's wrappers around MLR and Scikit-learn take care of these details automatically, and ensure consistency between training and test sets.
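
The consistency requirement is easy to get wrong, so here is a minimal Python sketch of the idea: learn the category levels once from the training data, then reuse exactly those levels, in the same order, when encoding the test data. The helper names and the `card_type` field are hypothetical and exist only for this example; this is not how WEKA's wrappers are implemented.

```python
def fit_levels(rows, field):
    """Record the category values seen in training, in a fixed order."""
    return sorted({row[field] for row in rows})


def one_hot(row, field, levels):
    """Replace one categorical field with 0/1 indicators, one per level."""
    encoded = {k: v for k, v in row.items() if k != field}
    for level in levels:
        encoded[f"{field}={level}"] = int(row[field] == level)
    return encoded


train = [{"card_type": "visa"}, {"card_type": "mastercard"}, {"card_type": "visa"}]
test = [{"card_type": "visa"}, {"card_type": "amex"}]

levels = fit_levels(train, "card_type")  # learned once, from training data
train_x = [one_hot(row, "card_type", levels) for row in train]
test_x = [one_hot(row, "card_type", levels) for row in test]
print(test_x)
```

Because the levels come from the training set, both datasets end up with identical columns in identical order, and a value never seen in training (here "amex") is encoded as all zeros instead of silently adding a new column.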

 

Step 4

DYNAMICALLY UPDATING PREDICTIVE MODELS

 

The following figure shows how to automate the creation of a predictive model using the PDI WEKA Knowledge Flow step. This step takes incoming rows and injects them into a WEKA Knowledge Flow process. The user can select either an existing flow to be executed, or design one on-the-fly in the step's embedded graphical Knowledge Flow editor. Using this step to rebuild a predictive model is simply an exercise in adding it to the end of our original data prep transformation.

 

building_a_weka_model_in_pdi.jpg

Building a WEKA Model in PDI

 

To build a model directly in Python (rather than via WEKA's wrapper classifiers), we can simply add a CPython Script Executor step to the transformation. PDI materializes incoming batches of rows as a pandas data frame in the Python environment. The following figure shows using this step to execute code that builds and saves a Scikit-learn gradient boosted trees classifier.

 

BuildPythonModel.png

Scripting to Build a Python Scikit-Learn Model in PDI

 

 

A similar script, as shown in the figure below, can be used to leverage the saved model to predict the likelihood of new orders being fraudulent.

 

CPythonScriptExecutor.png

Scripting to Make Predictions with a Python Scikit-Learn Model
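Since the scripts themselves only appear in the figures, here is a rough sketch of what such a build-and-score pair might look like. It assumes pandas and scikit-learn are installed in the Python environment; the data frame, its column names and the in-memory pickle are hypothetical stand-ins for the frame PDI hands to the step and for a model file on disk.

```python
import pickle

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the data frame PDI would pass to the script (hypothetical columns).
orders = pd.DataFrame({
    "amount": [20.0, 950.0, 35.0, 1200.0, 15.0, 870.0, 40.0, 1100.0],
    "items": [1, 7, 2, 9, 1, 6, 2, 8],
    "is_fraud": [0, 1, 0, 1, 0, 1, 0, 1],
})
features = ["amount", "items"]

# Build step: fit a gradient boosted trees classifier and serialize it.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(orders[features], orders["is_fraud"])
saved = pickle.dumps(model)  # a real script would write these bytes to a file

# Scoring step: restore the model and append a fraud probability to new rows.
new_orders = pd.DataFrame({"amount": [25.0, 1000.0], "items": [1, 8]})
restored = pickle.loads(saved)
new_orders["fraud_probability"] = restored.predict_proba(new_orders[features])[:, 1]
```

In a real transformation the two halves would live in separate CPython Script Executor steps, with the model written to and read from a location both can reach.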

 

This predictive use-case walkthrough demonstrates the power and flexibility of Pentaho afforded to the data engineer and data scientist. From data preparation through to model deployment, Pentaho provides machine learning orchestration capabilities that streamline the entire workflow.

 

[1] Also known as data wrangling: the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data by semi-automated tools.

[2] The Cross Industry Standard Process for Data Mining.

[3] 10 separate runs of 10-fold cross validation, where the data is randomly shuffled between each run. This results in 100 models being learned and 100 sets of performance statistics for each learning scheme on each dataset.

[4] https://cran.r-project.org/web/packages/mlr/index.html. 132 classification and regression algorithms in the MLR package.

[5] http://scikit-learn.org/stable/index.html. 55 classification and regression algorithms.

[6] The area under the ROC curve has a nice interpretation as the probability (on average) that a classifier will rank a randomly selected positive case higher than a randomly selected negative case.
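The ranking interpretation in note [6] can be checked directly on a small example. The sketch below computes the statistic by comparing every positive case with every negative case; the scores and labels are made up, and real tools typically use a faster sort-based method that gives the same number.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the usual definition.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative scores only: two positives and three negatives.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = auc_by_ranking(scores, labels)
```

Here five of the six positive/negative pairs are ranked correctly, so the value is 5/6.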

From industry giants to bright startups, brands are frantic to deliver the best interactions with customers and to burn the churn. In other words, master the customer experience (CX).

The frenzy for telcos is to morph from staunch utility companies trying to keep millions of customers happy into innovative, tuned-in service providers that keep happy customers. To compete means being on par with OTT companies, which were born in the cloud and exemplify dexterity, whereas telcos did not have agility built into their DNA.

Gene Therapy

To uplevel CX, some telcos have successfully added OTT–like agility (or actual OTTs) to their DNA (T-Mobile/Layer3 TV, Comcast/NBC Universal, Verizon/Yahoo/AOL and pending AT&T/Time Warner). Some have partnered (Sprint/Hulu). Some OTTs offer solutions (Whisbi.com meshes live voice, video, chat and chatbot data in real time to create a personalized, conversational CX solution for savvy telcos and other enterprises).

Competitive relevancy, however, also demands a consummate understanding of every single customer. While companies are rightfully focused on providing omnichannel content, I’m looking at how "omnichannel data integration" will be the force shaping CX long into the future.

Omnichannel Data

CX 2018 predictions exclusively forecast the preciseness of deep intelligence of customer data. In a short time, we’ve seen data technology mature from summarized averages of customer behaviors to pinpoint accuracy, so you can zoom into every customer touchpoint.

One example is the net promoter score (NPS) index, which tracks, weights and packages each online imprint, call, complaint, purchase and contact per customer. A low NPS indicates that a particular customer needs to be saved or they will leave. A high NPS means this individual will promote your company or service. Computational frameworks applied to NPS systems provide a dashboard of, say, an enterprise’s overall NPS.

Comcast recently paired with Convergys, a billing company focused on helping telcos improve CX by analyzing billing data, call center data and operations data. By fully understanding customer drivers and its own CX shortfalls, Comcast recently removed 25 million calls from its business: a huge win.

Bringing meaning to billions or trillions of data points across the endless parade of sources is anything but easy and requires more than trial-and-error. Data scientists can spend up to 79% of their time cleaning, organizing and collecting data sets[i]. But what’s the right data? While each datum has the potential to be correlated and actionable, not all data is useful.

Pentaho Orchestration

CX orders of magnitude require predictive data analytics nimble and comprehensive enough to extract intelligence and aggregate the most value from omnichannel data. Deftly integrating and mining data means knowing how and what to ingest and validate, which algorithms to use and how to prepare and blend traditional sources, machine intelligence, social media, etc. After all, we’re talking about reaching online audiences in real time at scale.

Figure out how to operationalize and capitalize on omnichannel data, and you’re looking at vast opportunities to master CX.

I believe Pentaho gets it right. Pentaho Data Integration (PDI) and Pentaho Machine Intelligence (PMI) use enterprise-grade orchestration capabilities to train, tune, test and deploy predictive modeling for big data and machine learning (ML) workflows.

Why is this important? Because these capabilities buffer ridiculously difficult tasks of big data onboarding, transformation and validation. Regardless of whether predictive models were built in R, Python, Scala or Weka, the Pentaho tools enable smooth collaboration for faster, more complete intel. Pentaho uses an impressive automated drag-and-drop environment that accelerates collaboration across platforms and mitigates recoding and reengineering.

Let’s apply Pentaho to the successes already mentioned. For NPS, Pentaho could streamline how scoring frameworks are computed and delivered, to readily adapt and capture touchpoints for added or changed customer products and services. Companies like Convergys and Whisbi can benefit from Pentaho’s supervised, unsupervised and transfer learning algorithms, to measure ROI of the software tools and customer behaviors being collected.

With a bring-your-own ML philosophy and transparency across algorithms, Pentaho integrates and mines omnichannel data in the most complete and meaningful fashion. Our Hitachi Vantara Labs has even been working on ML model management plug-ins for the Pentaho Marketplace.

http://www.pentaho.com/marketplace/

The bottom line here is that a simplified data-in-data-out analytics approach can deliver optimal value and choice to customers, new revenue and monetized opportunities for telcos and the OTTs looking to help.

For more information, please contact me or visit http://www.pentaho.com/machine-learning-capabilities.

[i] Source: Survey of 80 data scientists conducted by Crowdflower, provider of a data enrichment platform for data scientists.

Author: Rakesh Saha, Sr. Product Manager, Pentaho Product Line

 

In the world of enterprise IT, managing data in multiple clouds is now the new normal — whether it’s the result of a deliberate strategy or from shadow IT doing their own thing. Enterprises are not only moving data to the cloud at an unprecedented pace, but they are also embracing different cloud platforms from different vendors at the same time for good business and technical reasons. That means IT leaders need a plan to manage multiple clouds uniformly. But it’s not just about maintaining resource utilization views anymore. If left unchecked, multi-cloud sprawl can put your data assets at tremendous risk.

 

According to a study by Forrester Research, 65 percent of IT leaders believe “data integration becomes more complex in the public cloud”.  To give you some perspective, these cloud data integration challenges came in behind only security and compliance challenges.

 

With Pentaho 8.1, we are continuing to enhance our data integration and analytics platform to be more cloud-friendly so that enterprises can develop data pipelines on and with data in any of the leading cloud platforms without the complexity. Now, following our support for AWS and then Microsoft Azure, Pentaho 8.1 supports Google Cloud Platform.

 

By supporting Google Cloud, Pentaho 8.1 is a significant step toward helping our customers with their multi-cloud strategies.  We now provide even more choice regarding which public cloud vendor to use for their data management.

 

Pentaho 8.1 also delivers new capabilities which directly and indirectly support multi-cloud data strategies.  With Pentaho, for example, you can: 

 

  • Visually manage data in multiple-cloud storage environments, now using Google Cloud Storage (see Figure 1)
  • Load data in bulk to Google BigQuery (see Figure 4)
  • Visualize and analyze data in Google BigQuery
  • Elastically deploy Pentaho in the cloud to scale up and down based on workload
  • Use Spark in the Cloud (AWS EMR) for visual data processing
  • Load & download data files from Google Drive

 

Fig 1.png

 

Figure 1: Job spanning on-premise to multi-cloud

 

Each cloud platform offers its own services, but data integration platforms like Pentaho also need to support a set of common components, like those shown in Figure 2. What also differentiates us from the data integration tools specific to the vendors themselves is our flexible deployment architecture. This means you can use Pentaho to access and process data where it lives, whether the data is in the cloud or on premises, and whether it’s in AWS, Azure or Google Cloud Platform, rather than needing to move data around, thereby reducing latency.

 

Fig 2.png

Figure 2: Job spanning on-premise to multi-cloud

 

Now Pentaho can also be used to move files from on-premise to one cloud, and then to another cloud vendor with any data format because of the seamless integration of different cloud storage technologies via VFS (see figure 4). Pentaho encapsulates security and other integration details and makes it easy to load data into the appropriate cloud data management or warehouse services with new and existing capabilities.

 

Fig 3.png

Figure 3: Data Loading to Google Cloud Storage

 

Fig 4.png

Figure 4: Data loading to Google BigQuery

 

After loading data in cloud data warehouses, data can be consumed in data pipelines running in Pentaho data integration and directly by data analysts using Pentaho’s Business Analytics.  With all these cloud data sources and our data management services, we can facilitate end-to-end ETL, analytics solutions and help solve even more problems.

 

With the emergence of multi-cloud IT deployments, data professionals need to work with data they understand and trust, and now more than ever need a platform to harmonize the data with transformation processes, across different cloud and on-premise environments. Data integration platforms like Pentaho have an enormous role to play for those enterprises and for our cloud future. Pentaho’s multi-cloud capabilities squarely address this enterprise need, especially with the new capabilities introduced in the 8.1 release.

Pentaho 8.1 Webinar | May 24 8:30am PST/ 4:30 BST

Get an update from the Pentaho product team

 

We will be discussing Pentaho 8.1 features to help modernize your analytic data pipeline:

  • Deploy in hybrid and multi-cloud environments
  • Connect, process and visualize streaming data
  • Get better platform performance and increase user productivity.

 

Register Today

 

Customer Spotlight

Learn how Hitachi customers are transforming their businesses

 

ResEvo (ResearchEvolution) Enters the Big Data World

  • Learn how ResEvo is leveraging the Pentaho platform to provide a differentiated analytics offering to their customers in the European market which include being in the forefront of the country-wide smart city initiatives in Slovenia.

CPFL Energia's Success Powers Positive Brand Exposure

  • Learn how CPFL initiated a digital transformation program and converted their operations to an intelligent power distribution network with the Hitachi smart grid universe.

 

Share your story! Email Caitlin.Croft@hitachivantara.com  to be featured in the next Customer Spotlight.

 

Pre-Recorded Webinars

Find recordings of recent webinars

 

Pentaho Product Training

Get the most out of Pentaho through instructor-led training

 

Best Practices

Ensure you're utilizing Pentaho

Last summer, one of my healthcare clients asked my team if there was a better way to project the costs from surgery. Patients who experience complications need supplementary care for recovery, which increases the overall cost of care. In healthcare, the set of services to treat a clinical condition from start to finish is defined as an episode of care.

 

Everyone from providers to patients desires the best outcomes in an episode of care. However, outcomes from major resections of vital organs or joint replacements can be unpredictable. Successful surgery depends on many factors, such as the patient’s condition before and during surgery.

 

Machine learning is a perfect candidate in cases like this, providing insights on issues that have multiple factors. The ability to predict outcomes from surgery can improve patient experiences and also allow practices to better manage the costs of care.

 

The machine learning solution for this article was developed with Plugin Machine Intelligence (PMI) for PDI and used a publicly available surgery dataset from the University of California, Irvine’s data archive. In the dataset, one of the variables flags whether or not a patient survived beyond one year after surgery. With this in hand, I instructed the machine learning algorithms to predict survivability. The resulting solution features a family of tree algorithms which visualize how the machine came to a certain prediction.

 

A map of the decision tree algorithm is illustrated in Fig. 1. Navigating through the map is very easy. Each node (depicted as a brown circle) represents a medical condition that the algorithm determined to be important when it classified patients. Each path ends with a leaf (depicted as a green rectangle) that represents one of the two classifications: patients who are predicted to survive or those who are not.
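For readers who prefer code to pictures, navigating such a map can be sketched in a few lines of Python. The tree below is invented for illustration: the feature names echo conditions mentioned in this article, but the structure is hypothetical, carries no medical meaning and is not the model shown in the figure.

```python
# Hypothetical tree: internal nodes test a condition, leaves hold a classification.
tree = {
    "feature": "diabetes",
    "yes": {
        "feature": "weakness",
        "yes": {"leaf": "not predicted to survive"},
        "no": {"leaf": "predicted to survive"},
    },
    "no": {"leaf": "predicted to survive"},
}


def classify(node, patient, path=None):
    """Follow yes/no branches from the root until a leaf is reached.

    Returns the leaf label and the list of conditions tested along the way.
    """
    path = [] if path is None else path
    if "leaf" in node:
        return node["leaf"], path
    answer = "yes" if patient[node["feature"]] else "no"
    return classify(node[answer], patient, path + [(node["feature"], answer)])


label, path = classify(tree, {"diabetes": True, "weakness": True})
```

The returned path is what makes a tree readable: it lists exactly which conditions led to the classification.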

 

Fig. 1 Decision tree algorithm to classify survivability of patients
2map.JPG

 

One path in the tree is quite interesting (Fig. 3). The algorithm is able to predict accurately even when the medical conditions seem fairly benign compared to other paths.

 

Fig. 2 Decision tree interest area
2map_edited.jpg

Fig. 3 A path in the decision tree
3decisiontree_edit.jpg

 

The algorithm predicted correctly that three patients did not survive within one year of their operations. This set of patients had diabetes and experienced weakness before surgeries which elevated their risks. Other serious conditions such as hemoptysis and dyspnea were not observed. Hypothetically, surgeons may have looked at the conditions of these patients and determined that the risks are low enough to proceed with the surgeries. They did not have an objective way to weigh how different factors contributed to the overall risks of each patient.

 

Fig. 4 Detailed procedure record and model confidence in classifications of outcomes
detail.JPG

 

By studying the data, the model determined that certain conditions are particularly good at inferring the outcomes. In machine learning terminology, this is called feature importance. Forced vital capacity (FVC) and TNM are the features that appear the most in the decision tree. Forced vital capacity is the maximum volume of air a person is able to exhale. This is one of the metrics used by doctors to diagnose patients and determine severities of respiratory illnesses. TNM codes measure the size of the original tumor observed in cancer patients. The two features appear most frequently because they are closely linked to the severity of the illness.
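Counting how often each feature appears in a tree is simple enough to sketch. The nested structure below is a made-up example, not the model from this article, and many libraries compute feature importance differently (for example from how much each split improves the fit), so treat the count as a rough proxy.

```python
from collections import Counter


def count_split_features(node, counts=None):
    """Count how many internal nodes of the tree test each feature."""
    counts = Counter() if counts is None else counts
    if "leaf" in node:
        return counts
    counts[node["feature"]] += 1
    for child in node["children"]:
        count_split_features(child, counts)
    return counts


# Hypothetical tree layout, for illustration only.
tree = {
    "feature": "FVC",
    "children": [
        {
            "feature": "TNM",
            "children": [
                {"leaf": "survives"},
                {"feature": "FVC", "children": [{"leaf": "survives"}, {"leaf": "does not survive"}]},
            ],
        },
        {"feature": "diabetes", "children": [{"leaf": "does not survive"}, {"leaf": "survives"}]},
    ],
}

ranking = count_split_features(tree).most_common()
```

The `ranking` list is sorted from the most to the least frequently used feature.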

 

One strength of machine learning is its ability to learn or adjust weights of each condition as new data stream in. Here is another way the decision tree algorithm adapted with a different set of data.

 

Fig. 5 Adaptation of decision tree algorithm under different set of data
decisiontree2.JPG

 

The algorithm is able to learn on its own as the underlying data changes. The hierarchy of the decision tree changed, with diagnosis appearing as the root node. Compared to before, a new feature, FEV1, plays a major role in classifying patients. FEV1 is similar to FVC: it is the maximum amount of air a patient can exhale in one second. The two metrics are used in conjunction for diagnosis. The algorithm is adapting in order to maintain its predictive capabilities.

 

Hitachi Vantara Labs recently unveiled Plugin Machine Intelligence (PMI) for PDI which vastly accelerates development of machine learning models. From a data scientist’s point of view, what makes this solution unique is that the complete stack was developed with very minimal coding.

 

Before the announcement of PMI, I had been developing the machine learning solution via traditional methods, meaning writing many lines of code. Careful code management was required so that if issues arose, my colleagues or I could address them. All of these nuances are taken care of by PMI under the hood. The complexity of managing machine learning models and the benefits of PMI are explained in Mark Hall and Ken Wood’s article “4-Steps to Machine Learning Model Management”.

 

From a business analyst’s point of view, PMI allows machine learning to be a natural addition to one’s analytical toolkit. Analysts often have deep insights in their domains. Once folks are familiar with the thought process of articulating questions that machine learning is designed to solve, analysts can produce powerful insights by employing machine intelligence models. This is because PDI + PMI are fundamentally visual tools: drag-and-drop steps to manage data and machine learning models.

 

Fig. 6 Plugin Machine Intelligence (PMI): Drag and drop development of machine learning models

pmi.JPG

 

Fig. 7 Neural Network Editor in PMI: Artificial neural network used in gradient boosted tree algorithm

neuralnetwork_public.jpg

 

The synergy produced by the technologies is explained in Hu Yoshida’s article, “Orchestrating Machine Learning Models and Improving Business Outcomes”. Yoshida notes, “tools can be used in a data pipeline built in Pentaho to help improve business outcomes and reduce risk by making it easier to update models in response to continual change. Improved transparency gives people inside organizations better insights and confidence in their algorithms.” I can attest to this from my experience working in the platform.

 

The PMI toolkit allows people to explore the capabilities of machine learning and see their relevance in solving specific business problems. With PMI, machine learning is no longer a mysterious black box. Machine intelligence is now available for everyone.

 

External Links

Interactive Report

Downloadable Complete Stack

Introduction

 

 

Pentaho Report Designer can be internationalized and localized. To accomplish that, translation files used by the report need to have a specific format and their names need to obey certain rules.

 

 

How to configure translation files?

 

 

By default, each report that you create contains an empty, editable fall-back/default translation file: translations.properties. You can access it by clicking the File tab, then the Resources option, selecting translations.properties and choosing the Edit option.

 

The other translation files can be created or imported by accessing File Tab/Resources option. These files need to contain key-value pairs, separated by an equal symbol (=):

 

  • dashboardTitle = Internationalization dashboard

 

Additionally, translation file names need to follow this format:

 

 

    • translations_en.properties
    • translations_ja.properties

 

 

    • translations_en_US.properties

 

Note: Special characters, such as Japanese characters, "Ç" or even accented letters, need to be ASCII-encoded (Native2ascii Online); otherwise the characters are not displayed correctly in the report.
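If you would rather not paste text into an online converter, the same escaping can be approximated with a few lines of Python. This is a sketch of the escape format only, not a replacement for the native2ascii tool; the function name is made up for this example.

```python
def to_properties_escape(text):
    """Replace every non-ASCII character with \\uXXXX escapes.

    Characters outside the Basic Multilingual Plane become a surrogate pair,
    because the escapes represent UTF-16 code units.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


# Example: a key-value line whose value is the Japanese word for "Japanese".
escaped = to_properties_escape("ja=日本語")
```

The result can be pasted directly into a .properties file as the right-hand side of a key-value pair.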

 

Resource-labels are the only elements that perform translations in PRD. You need to set the value property of each resource-label to a key defined in the translation files. For instance:

 

  • value property = dashboardTitle

 

The value property is in the resource-label's Attributes panel.

 

What is the hierarchy of the translations.properties files?

 

 

Translations or messages can be created in multiple translation files. Whenever that happens, keep in mind the following hierarchy:

 

     + translations.properties

     ++ translations_en.properties

     ++++ translations_en_US.properties

     ++++ translations_en_GB.properties

 

In conclusion, translations_<language>_<COUNTRY>.properties overrides  translations_<language>.properties file, which overrides translations.properties file.
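The override order can be pictured as a lookup that tries the most specific file first and falls back step by step. The sketch below mimics that order with in-memory dictionaries; it is an illustration of the rule, not the code PRD actually runs, and the keys and values are examples.

```python
def candidate_files(base, locale):
    """List bundle file names from most specific to least specific."""
    parts = locale.split("_") if locale else []
    names = []
    for end in range(len(parts), 0, -1):
        names.append("%s_%s.properties" % (base, "_".join(parts[:end])))
    names.append(base + ".properties")
    return names


def lookup(key, base, locale, bundles):
    """Return the first value found while walking the fallback chain."""
    for name in candidate_files(base, locale):
        if key in bundles.get(name, {}):
            return bundles[name][key]
    return None


# In-memory stand-ins for the translation files (illustrative content).
bundles = {
    "translations.properties": {"dashboardTitle": "Default title", "footer": "Default footer"},
    "translations_en.properties": {"dashboardTitle": "English title"},
    "translations_en_US.properties": {"dashboardTitle": "US English title"},
}

chain = candidate_files("translations", "en_US")
title = lookup("dashboardTitle", "translations", "en_US", bundles)
footer = lookup("footer", "translations", "en_US", bundles)
```

A key missing from the country-specific file is picked up from the language file, and one missing from both comes from the default file.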

 

 

Pentaho Report Designer Example

 

 

Imagine that you want to create a PRD report for English and Japanese. You need to configure/create translations.properties, translations_en.properties and translations_ja.properties with the keys and values that you want to translate, and add resource-labels with the keys that you just defined as their value properties. Then, to test the translation functionality, go to the File tab, click the Configuration option and set ~.environment.designtime.Locale to the language code that you want to see: ja or en for a Japanese or English translation, respectively.

 

In case you want to publish this PRD in the repository, PUC, and test the translation functionality,  you need to click on View Tab, choose Languages option and then select English or Japanese language. Afterwards, open the PRD report or reload it if it was already opened and see that the report shows the translation for the language chosen in PUC.

 

Note: If you have to translate two reports that share some translations, you will want to maintain one file for each language instead of several. To do so, import the translation file from your file system in the File tab/Resources option. You can edit the imported translation file by clicking the Edit option in File tab/Resources, or by editing the translation file in your file system. In these situations there are some points to keep in mind. Whenever you edit the translation file inside File tab/Resources, you will need to export it; otherwise, the translation file in your file system won't contain the changes you made. In the same way, if you change the translation file in your file system, you will need to remove the file that you imported in File tab/Resources and import it again.

This example was tested in Pentaho 8.0.


Introduction

 

 

Dashboards can be internationalized and localized. Several files need to be created in order to perform the dashboard translation, all with the .properties extension, and they need to be located in the same place as the dashboard.

 

 

How to configure?

 

 

As mentioned before, some files need to be created and configured properly. The messages_supported_languages.properties file lists the supported languages, for example:

 

    • en_US=English
    • ja=\u65E5\u672C\u8A9E

     en_US and ja correspond to the language and country codes, and English and \u65E5\u672C\u8A9E are the language names.

 

          Note: Special characters, such as Japanese characters, "Ç" or even accented letters, need to be ASCII-encoded (Native2ascii Online); otherwise the characters are not displayed correctly in the dashboard.

 

  • The translation file name should have this format
    • messages_<language>.properties - language corresponds to the language code, specified in the  messages_supported_languages.properties. File name examples:

 

      • messages_en.properties
      • messages_ja.properties

    

    • messages_<language>_<COUNTRY>.properties - COUNTRY corresponds to the country code, the uppercase ISO 3166-1 alpha-2 code. Whenever you add this type of file, do not forget to add <language>_<COUNTRY> to the messages_supported_languages.properties file. Otherwise, the configured translation file will never be used.

 

      • messages_en_US.properties

 

   The latter files should contain their translation information organised as key value pairs, separated by an equal symbol (=):

    • dashboardTitle = Internationalization dashboard

    

  • messages.properties - is the translation fallback file, the default translation, in case of something not being configured well in translation functionality. The file should have the same structure as described for messages_<language>_<COUNTRY>.properties and messages_<language>.properties.
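Because a translation file is ignored unless its code is listed in messages_supported_languages.properties, a quick consistency check can save some head-scratching. The sketch below derives the file names that should exist from the supported-languages content. It uses a deliberately simplified parser (plain key=value lines only), and the set of existing files is a hypothetical stand-in for a directory listing.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def expected_message_files(supported_text):
    """Name the fallback file plus one messages file per supported code."""
    codes = parse_properties(supported_text)
    return ["messages.properties"] + ["messages_%s.properties" % code for code in codes]


# Content of messages_supported_languages.properties, as in the example above.
supported = "en_US=English\nja=\\u65E5\\u672C\\u8A9E\n"
files = expected_message_files(supported)

# Hypothetical listing of the files that currently sit next to the dashboard.
existing = {"messages.properties", "messages_ja.properties"}
missing = [name for name in files if name not in existing]
```

Anything that ends up in `missing` is a language that is advertised as supported but has no translation file yet.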

 

 

What is the hierarchy of the messages.properties files?

 

 

Translations or messages can be created in multiple translation files. Whenever that happens, keep in mind the following hierarchy:

 

     + messages.properties

     ++ messages_en.properties

     ++++ messages_en_US.properties

     ++++ messages_en_GB.properties

 

In conclusion, messages_<language>_<COUNTRY>.properties overrides messages_<language>.properties file, which overrides messages.properties file.

 

 

CDE dashboard Example

 

 

Imagine that you want to build a CDE dashboard with only a title, in English and Japanese. You construct the CDE layout and then add a Text Component element. In the Expression property of the latter, place the following code, which calls the prop function of the CDF i18nSupport API; it uses the translation file that matches the PUC language:

     function f(){

         return this.dashboard.i18nSupport.prop('dashboardTitle');

     }

 

Additionally, messages.properties and messages_en.properties should have:

 

  • dashboardTitle = Internationalization dashboard

 

and messages_ja.properties should have:

 

  • dashboardTitle = \u56fd\u969b\u5316\u30c0\u30c3\u30b7\u30e5\u30dc\u30fc\u30c9

 

To test the translation configuration, go to the View tab, choose the Languages option and then select English or Japanese. After that, reload the dashboard if it was already open, or open it, and you will see that the translation shown respects the language chosen in PUC.

 

Note: This example was tested in Pentaho 8.0.

 

 

Using a language that is not available in PUC

 

 

To add new languages, you need to install them from the marketplace. Please follow these instructions: pentahoLanguagePacks/README.md at master · webdetails/pentahoLanguagePacks · GitHub

After your language pack is installed, please add the language code to messages_supported_languages.properties file and create the translations file using the format name previously specified.

Hi everyone,

 

With 100 participants from Austria, Germany and Switzerland, the Pentaho User Meeting was a great success. I'm happy to share with you the live blog covering all presentations and speakers:

 

  • Migrating from Business Objects to Pentaho (CERN, Gabriele Thiede)
  • Pentaho 8 (Pedro Alves)
  • Best Practices for Data Integration Architectures (Matt Casters)
  • Operating Pentaho at Scale (Jens Bleuel)
  • Running Pentaho in Kubernetes (Nis Christian Carstensen, Netfonds)
  • Data handling with Pentaho (Marco Menzel, Hansainvest)
  • IoT and Predictive Analytics (Jonathan Doering, Hitachi Vantara)
  • Adding Pentaho Dashboards to Angular 5 applications (Francesco Corti, Alfresco)
  • Predictive Analytics with PDI and R (Dr. David James, it-novum)
  • Integrating and analyzing SAP data with SAP/Pentaho Connector (Stefan Müller, it-novum)
  • Analyzing IT service management data with openLighthouse (Dirk Rönsch, it-novum)

 

See the full agenda at

Pentaho User Meeting 2018 in Frankfurt - it-novum

 

Thanks to all who have contributed to this event!

There are currently three Installation Guides and one Developer's Guide to accompany the Plug-In Machine Intelligence (PMI) plug-in. The demonstration transformations and sample datasets are also available, for demonstration and educational purposes. They can be downloaded from the following links:

 

Download Link and Document Name: Description

  • PMI_Installation_Linux.pdf: Installation guide for the Linux OS platform.
  • PMI_Installation_Windows.pdf: Installation guide for the Windows OS platform.
  • PMI_Installation_Mac_OSX.pdf: Installation guide for the Mac OS X platform.
  • PMI_Developer_Docs.pdf: A developer's guide to extending and contributing to the PMI framework.
  • PMI_MLChampionChallengeSamples.zip: This zip file contains all of the sample transformations, sample folder layouts and datasets for running the Machine Learning demonstrations and the Machine Learning Model Management samples. This is for demonstration and educational purposes.
  • PMI_AddingANewScheme.pdf: This document describes the development process of exposing the Multi-Layer Perceptron (MLP) regressor and classifier in the Weka and scikit-learn engines.

Introducing Plug-in Machine Intelligence

by Mark Hall and Ken Wood

 

 

Today, the need to swiftly operationalize machine learning based solutions to meet the challenges of businesses is more pressing than ever. The ability to create, deploy and scale a company’s business logic to quickly take advantage of opportunities or react to changes is exceeding the capabilities of people and legacy thinking. Better and more machine learning is vital going forward but, more importantly, easier machine learning is essential. Leveraging an organization’s existing staff levels, business domain knowledge, and skillsets by lowering the entry into the realm of data science can dramatically expand business opportunities.

 

“Everytime I am in PMI, I am seeing more and more of its value!!! Great stuff!!!”

Carl Speshock - Hitachi Vantara Product Manager, Hitachi Vantara Analytics Group

 

The world of Machine Learning is empowering an ever-increasing breadth of applications and services from IoT to Healthcare to Manufacturing to Energy to Telecom, and everything in between. Yet the skills gap between business domain knowledge and the analytic tools used to solve these challenges needs to be bridged. People are doing their part through education, training and experimentation in order to become data scientists, but that’s only half of the equation. Making the analytic tools easier to use can help bridge this gap quickly. Throw in the ability to access and blend different data sources, cleanse, format and engineer features into these datasets, and you have a unique and powerful tool. In fact, the combination of PDI and PMI is an evolution of the PDI tool suite for deeper analytics and data integration capabilities.

 

"While exploring solutions with a major healthcare provider that was using predictive analytics to reduce the costs and negative patient care incurred from complications from surgery, we were given internal access to PMI. Working with python and using the Scikit-Learn library required 2 weeks of coding and prototyping to perform just the machine learning model selection and training. With PDI and PMI, I was able to prep the data, engineer in the features and train the models in about 3 hours. And, I could include other machine learning engines from R and Weka and evaluate the results. The combination of PDI and PMI makes machine learning solutions easier to use and maintain."

Dave Huh - Data Scientist - Hitachi Vantara Analytics Services

 

Hitachi Vantara Labs is excited to introduce a new PDI capability, Plug-in Machine Intelligence (PMI), to the PDI Marketplace. PMI is a series of steps for Pentaho Data Integration (PDI) that provides direct access to various supervised machine learning algorithms as full PDI steps that can be designed directly into your PDI data flow transformations. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). Installation guides for your platform, the Developer's Document, and the sample transformation and datasets are available here. The motivation for PMI is:

 

  1. To make machine learning easier to use by combining it with our data integration tool as a suite of easy to consume steps, and ensuring these steps guide the developer through their usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
  2. To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. And no more jumping between multiple tools with inconsistent data passing methods, or complex and tedious manipulation of performance evaluation results.
  3. To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently implements: WEKA, Python scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic PDI step generation feature.

 

PMI also incorporates revamped versions of the Weka steps that were originally part of the Pentaho Data Science Pack. Essentially, this could be looked at as version 2 of the Data Science Pack. These include:

 

  • PMI Forecasting for deploying time-series models learned in WEKA’s time series forecasting environment.
  • PMI Scoring for deploying trained supervised and unsupervised ML models. This includes new features to support evaluation/monitoring of existing supervised models on fresh data (when class labels are available).
  • PMI Flow Executor for executing arbitrary WEKA Knowledge Flow processes. This revamped step supports WEKA’s new Knowledge Flow execution engine and UI.

 

PMI tightly integrates four popular machine learning “engines” into the PDI “Data Mining” category via their machine learning libraries. These four engines are Weka, Python, R and Spark. This first phase of PMI incorporates the supervised machine learning algorithms of these four engines from their associated machine learning libraries - Weka, Scikit-Learn, MLR and MLlib, respectively. Not all of the engines support all of the same algorithms. Essentially, 12 new PMI algorithms are added to the Data Mining category, and they execute across the four different engines:

 

PMIList.png

  1. Decision Tree Classifier – Weka, Python, Spark & R
  2. Decision Tree Regressor – Weka, Python, Spark & R
  3. Gradient Boosted Trees – Weka, Python, Spark & R
  4. Linear Regression – Weka, Python, Spark & R
  5. Logistic Regression – Weka, Python, Spark & R
  6. Naive Bayes – Weka, Python, Spark & R
  7. Naive Bayes Multinomial – Weka, Python & Spark
  8. Random Forest Classifier – Weka, Python, Spark & R
  9. Random Forest Regressor – Weka, Python & Spark
  10. Support Vector Classifier – Weka, Python, Spark & R
  11. Support Vector Regressor – Weka, Python & R
  12. Naive Bayes Incremental – Weka
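The engine coverage in the list above can also be written down as a small lookup table, which makes it easy to ask which algorithms a given engine offers, or which ones can be compared across all four engines. The following is only an illustrative Python sketch: the dictionary is transcribed from the list, while the names `PMI_SUPPORT`, `algorithms_for` and `on_every_engine` are hypothetical and are not part of PMI.

```python
# Engine coverage per PMI algorithm, transcribed from the list above.
# All names here are illustrative; they are not PMI identifiers.
ALL_ENGINES = {"Weka", "Python", "Spark", "R"}

PMI_SUPPORT = {
    "Decision Tree Classifier": {"Weka", "Python", "Spark", "R"},
    "Decision Tree Regressor": {"Weka", "Python", "Spark", "R"},
    "Gradient Boosted Trees": {"Weka", "Python", "Spark", "R"},
    "Linear Regression": {"Weka", "Python", "Spark", "R"},
    "Logistic Regression": {"Weka", "Python", "Spark", "R"},
    "Naive Bayes": {"Weka", "Python", "Spark", "R"},
    "Naive Bayes Multinomial": {"Weka", "Python", "Spark"},
    "Random Forest Classifier": {"Weka", "Python", "Spark", "R"},
    "Random Forest Regressor": {"Weka", "Python", "Spark"},
    "Support Vector Classifier": {"Weka", "Python", "Spark", "R"},
    "Support Vector Regressor": {"Weka", "Python", "R"},
    "Naive Bayes Incremental": {"Weka"},
}


def algorithms_for(engine):
    """Return the algorithms that the given engine supports, sorted by name."""
    return sorted(name for name, engines in PMI_SUPPORT.items() if engine in engines)


def on_every_engine():
    """Return the algorithms that can be compared across all four engines."""
    return sorted(name for name, engines in PMI_SUPPORT.items() if engines == ALL_ENGINES)


for engine in sorted(ALL_ENGINES):
    print(engine, len(algorithms_for(engine)))
print("Available on every engine:", on_every_engine())
```

By this count, eight of the 12 algorithms are available on every engine, so those are the ones that can be compared across all four.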

 

As such, eventually the existing “Weka Scoring” step will be deprecated and replaced with the new “PMI Scoring” step. This step can consume (and evaluate for model management monitoring processes) any model produced by PMI, regardless of which underlying engine is employed.

 

I know what you’re thinking: “Why implement machine learning across four engines?” Good question. Believe it or not, data scientists are picky and set in their ways, and… not all engines and algorithms perform the same (think accuracy and speed) for any given dataset. Many analysts, data scientists, data engineers and others who look to these tools to solve their challenges tend to use their favorite tool/engine. With PMI, you can compare up to four different engines and up to 12 different algorithms against each other to determine the best fit for your requirement.

 

 

What Happens When Data Patterns Change?

 

An important benefit of PMI is that the evaluation metrics used to measure accuracy are uniform. Because all steps are built into the same PMI framework, the resulting metrics are all calculated the same way and can be used to easily compare performance, even across the different engines' algorithms. This unique characteristic has resulted in a whole new use case in the form of model management. Concepts around model management with the PMI framework have enabled the ability to Auto-Retrain models, Auto-Re-evaluate, Dynamic-Deploy models and so on. A concept that we have recently proven with a demonstration is the Champion / Challenger model management strategy. This strategy allows the currently active model(s) to be re-evaluated and compared against the performance of other candidate models, and the new Champion model to be "hot-swap" deployed. A more detailed discussion on Machine Learning Model Management can be found in the accompanying blog called "4-Steps to Machine Learning Model Management".
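To make the Champion / Challenger idea concrete, here is a minimal Python sketch of the selection logic. It is not how PMI implements it: the "models" are plain functions standing in for trained models, accuracy is the only metric, and the names `accuracy`, `select_champion` and `margin` are hypothetical. The point it illustrates is that once every candidate is scored with the same metric on the same fresh, labeled data, picking the model to deploy is a simple comparison.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a trained model: a function from a row to a label.
Model = Callable[[Sequence[float]], str]


def accuracy(model: Model, rows: List[Sequence[float]], labels: List[str]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(labels)


def select_champion(
    champion: str,
    models: Dict[str, Model],
    rows: List[Sequence[float]],
    labels: List[str],
    margin: float = 0.0,
) -> Tuple[str, Dict[str, float]]:
    """Score every model with the same metric and return the one to deploy.

    A challenger replaces the champion only if it beats the champion's
    score by more than `margin`; otherwise the champion stays active.
    """
    scores = {name: accuracy(model, rows, labels) for name, model in models.items()}
    best = max(scores, key=lambda name: scores[name])
    if best != champion and scores[best] > scores[champion] + margin:
        return best, scores
    return champion, scores


# Toy data and threshold "models", invented purely for illustration.
rows = [[0.2], [0.4], [0.6], [0.8]]
labels = ["low", "low", "high", "high"]
models = {
    "champion": lambda row: "high" if row[0] > 0.7 else "low",
    "challenger": lambda row: "high" if row[0] > 0.5 else "low",
}

active, scores = select_champion("champion", models, rows, labels)
print(active, scores)
```

On this toy data the challenger classifies all four rows correctly while the champion misses one, so the challenger becomes the active model. A non-zero `margin` is one way to avoid swapping models over differences that are too small to matter.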

 

 

BA-Champ.png

HotSwap-Models.png

 

The Fail Fast Approach

 

Thomas Edison is quoted as saying, "I have not failed. I've just found 10,000 ways that won't work." Back in the days leading up to the invention of the light bulb, these 10,000 ways that won't work took years to iterate through. What if you could eliminate candidates in days or hours? PMI allows a “Fail Fast” approach to achieving results. With the ease of using PMI on datasets, many combinations of algorithms and configurations can be tried and tested very quickly, weeding out the approaches that won’t work and narrowing down to promising candidates. The days of churning on code until it finally works, then finding out the results aren’t good enough and a new approach is needed, are coming to an end.
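As a rough illustration of the “Fail Fast” idea, the Python sketch below tries every combination of algorithm and setting, scores each one, and immediately drops anything under a threshold. The function `fail_fast`, the scoring callback and the score table are all hypothetical; the numbers are made up to show the mechanics and are not PMI results.

```python
from itertools import product


def fail_fast(algorithms, settings, score, threshold):
    """Try every algorithm/setting pair and keep only those scoring at or above threshold.

    `score` is any function (algorithm, setting) -> float; in practice it would
    train and evaluate a model. Survivors are returned best-first.
    """
    survivors = []
    for algorithm, setting in product(algorithms, settings):
        result = score(algorithm, setting)
        if result >= threshold:
            survivors.append((result, algorithm, setting))
    return sorted(survivors, key=lambda item: item[0], reverse=True)


# Made-up scores used only to exercise the function; not measured results.
TOY_SCORES = {
    ("Decision Tree", 5): 0.62,
    ("Decision Tree", 10): 0.71,
    ("Random Forest", 5): 0.83,
    ("Random Forest", 10): 0.88,
}


def toy_score(algorithm, max_depth):
    """Stand-in for a real train-and-evaluate run."""
    return TOY_SCORES[(algorithm, max_depth)]


shortlist = fail_fast(["Decision Tree", "Random Forest"], [5, 10], toy_score, 0.7)
for result, algorithm, max_depth in shortlist:
    print(algorithm, max_depth, result)
```

Here three of the four combinations survive the 0.7 cut-off and the weakest is eliminated right away, leaving a short, ranked list of candidates to look at more closely.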

 

Over the next few months, Hitachi Vantara Labs will continue to provide blogs and videos to demonstrate how to use PMI, how to extend the PMI framework and how to add additional algorithms to PMI.

 

  PMIMOREktr.png

 

It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time. It is recommended that this experimental feature be used for testing only and not used in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and, as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward-thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.