
by John Magee, Vice President Portfolio Marketing

 

The Pentaho engine is a key component of Hitachi’s Lumada data and analytics platform for building innovative data-driven solutions for general business and IoT use cases.  Recently we delivered Pentaho 8.3, an important update to our flagship data integration and analytics software and a key part of our DataOps strategy (keep reading to learn why).

You can read the press release here: https://www.hitachivantara.com/en-us/news-resources/press-releases/2019/gl190710.html

The new release of Pentaho offers enhancements in three key areas:

  • Data pipelining – Improved drag-and-drop capabilities for accessing and blending data that is difficult to access, such as SAP and AWS Kinesis. For example, SAP data is often hard to access from outside of SAP. Now, with our connector to SAP, we enable drag-and-drop onboarding, blending, offloading and writing data to and from SAP ERP and Business Warehouse.
  • Data visibility for governance – Enhanced capabilities for metadata management with Hitachi Content Platform, integration with IBM Information Governance Catalog (IGC) and streaming data lineage tracking. The goal is to make it easier to understand what data you have and what its retention, compliance, and other governance requirements are, so you can analyze, share, and manage it appropriately.
  • Expanded multicloud support – New hybrid cloud data management capabilities for leveraging the public cloud, including an AWS Redshift bulk loader and Snowflake interoperability.

 

These new capabilities are important enhancements for our Pentaho users, but they also reflect and align with our broader portfolio strategy around DataOps. Organizations everywhere are looking to transform digitally and get more value out of their data. But getting the right information to the right place at the right time continues to be a challenge.  

Many of you have probably experienced the following DataOps challenges: analytic velocity is slowed by data engineering work that makes it time-consuming to discover, blend, and deliver the data required; data governance and compliance requirements place new demands on what data can be shared and how it should be managed; and distributed edge-to-cloud infrastructures mean that data is more dispersed than ever before, which introduces a new set of operational data management challenges.

This lack of data agility has existed for years despite—or in some cases, because of—all the new tools and technologies. And for most organizations, it represents the biggest obstacle to achieving the promise of analytics, machine learning, and AI to transform their operations and drive innovation. Clearly, something needs to change.

DataOps has emerged as a collaborative data management discipline focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization. At Hitachi Vantara, we are increasingly seeing customers who are embracing this more modern and holistic approach to managing data that spans applications, data centers, clouds, branch offices, the IoT edge, and other places where enterprise data resides.

Our strategy is laser-focused on providing the data management infrastructure, metadata-driven data management tools, and policy-based automation that organizations need to improve data agility through DataOps. In addition to Pentaho, we’re delivering new capabilities across our portfolio to make DataOps a reality for our customers, and you can expect to see more announcements from Hitachi Vantara in the coming months that reflect our strategic focus (especially at NEXT 2019).

DataOps is still a relatively new approach for many organizations, and first steps often start with improving data pipelines for the purposes of analytics and ML, whether that be real-time streaming data for embedded analytics in production apps or chatbots.  The end game could also be more complex data science projects involving building data lakes and managing related infrastructure and dev tools.

Automating and accelerating the discovery, integration, and delivery of data is certainly key to shortening the time it takes to get from raw data to actionable insights. And the often-cited analogy of DataOps as the data equivalent of Dev Ops—in the way it seeks to make formerly offline batch development processes more collaborative and automated—is a good one. But for the enterprise customers we work with, addressing the analytic data pipeline is only one part of the opportunity they see around DataOps. The others include data governance and operational agility.

Given the data compliance and regulatory requirements most enterprises operate under today, any analytic or data science projects that involve accessing, sharing, and analyzing data need to also include appropriate data access and governance controls. And the complexity of the edge-to-core-to-cloud infrastructures that most of our customers are building out today demand new approaches to managing data to optimize data retention and other policies in new ways. 

We believe the right DataOps strategy needs to address all three of these key areas – analytic velocity, governance, and edge-to-cloud operational agility. The good news is that many of the core technologies required to make DataOps a reality – discovery, metadata management, policy-based governance and retention management, automated data migrations, data pipelining, and so on – can all be addressed with the right data platform.

Takeaways: DataOps is all about making companies more innovative and agile by getting the right data, to the right place, at the right time. Hitachi is laser-focused on providing a broad portfolio of tools and related services to make DataOps a reality. The new Pentaho 8.3 release provides key capabilities for customers looking to begin their DataOps journey.

Visit https://www.hitachivantara.com/go/pentaho.html to learn more.

 

 

 

 

 

So, in all of the cases where I've blogged or talked about running the Plug-in Machine Intelligence (PMI) plugin for Pentaho Data Integration (PDI), it has been on a workstation-like environment, basically a high-end laptop. This is more than adequate for building machine learning models and many deep learning models with interesting datasets. However, not all datasets will fit in the constrained environment of a laptop, and I don't mean just the storage size of the datasets. Sometimes, you just need a large machine and a powerful server-class GPU to process complex datasets.

 

In this case, we are processing a deep learning dataset that consists of 101 classes of different foods. The entire dataset is only 5.2GB on disk, which on the surface isn't a large amount of data; even if you break the color images into their RGB channels (3 images each) for preprocessing, that's only about 16GB. No, the "complexity size" comes from the nature of the dataset. The Food-101 dataset contains 1,000 images per food class for a total of 101,000 images (files). Multiply this by the RGB channels and the preprocessing and processing involved in training and building a deep learning model, and you start to see that your laptop or workstation may be overwhelmed. Even if you could force some file caching and other mechanisms to fit the dataset within the confines of your environment, the processing time to build your deep learning models might take a while.
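As a rough back-of-the-envelope check, here is a minimal sketch of that sizing arithmetic; the 5.2GB figure and per-class counts come from the dataset description above, and everything else is approximation rather than a measured footprint.

```python
# Back-of-the-envelope sizing for Food-101, using the figures quoted above.
classes = 101
images_per_class = 1_000
total_images = classes * images_per_class      # 101,000 image files

dataset_gb = 5.2                               # size on disk
rgb_expanded_gb = dataset_gb * 3               # ~15.6 GB if split into R, G, B channels

print(f"{total_images:,} images, roughly {rgb_expanded_gb:.1f} GB after RGB channel expansion")
```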

 

Here we'll describe what happens when we have access to a Hitachi Vantara DS225 server with a lot of memory, a lot of CPU cores, and an NVidia Tesla V100 GPU. We're also going to use PDI and PMI to do some data science work, building deep learning models against this 101-class dataset. You could easily use your favorite data science tools and code your own working models, and they would perform equally well, but this is a PDI and PMI deep learning experiment, so we'll use PDI and PMI for this.

 

This project basically compares CPUs against a GPU on actual performance (in this case, speed) using only PDI and PMI. Very little attempt was made to tune the model for accuracy or speed. AlexNet was selected from the PMI/DL4j zoo model to run this experiment. The only modification made was to increase the heap memory for PDI/Spoon to 128GB in order to fit the dataset and processing. The experiment started by running the PDI transformation below to train and test a deep learning model with PMI and Deep Learning for Java (DL4j) without configuring the NVidia CUDA API and drivers; as a result, DL4j isn't able to "see" the NVidia Tesla V100 GPU and executes the preprocessing, training and testing of the deep learning model on the CPUs and memory. In this case, there are 2 Intel Xeon Gold 6154 18-core 3GHz CPUs for a total of 36 cores and 512GB of DDR4 RAM.

 

Once that model was completed, CUDA was installed and configured for PMI and DL4j to use. In this case, the NVidia Tesla V100 GPU is configured with 16GB of HBM2 memory and 5,120 CUDA cores. The exact same transformation was executed again.

 

This is the transformation used for training the deep learning models.

 

 

The PMI algorithm configuration settings are very close to the defaults.

 

 

As you can see in this chart, the reduction in training time (in minutes) is dramatic: completing the deep learning model on the GPU is up to 18 times faster than training it on top-of-the-line CPUs.

 

 

A Reference Architecture White Paper was written detailing this experiment and how all the pieces were put together and installed. This White Paper will be released in the fall timeframe to coincide with the Hitachi Vantara NEXT 2019 conference in Las Vegas.

Hidden Results from Machine Learning Models – Blog 2 of 2

 

As mentioned in the first blog of this two-part series, there are hidden results that need to be interpreted and extrapolated beyond the initial or assumed prediction. As the previous blog showed, a binary classifier can effectively have three results – YES, NO and a state that should be interpreted as UNKNOWN – regardless of the Class max prob answer.

 

This is an area of research called "learning to abstain from making a prediction" which is closely related to "active learning machine learning" where, in a semi-supervised scenario, a classifier can identify which cases/examples it would like a human to label in order to improve its model.

 

This situation compounds itself in a multi-class problem (more than two classes). Let's now examine one of the "Hey Ray!" demonstrations again (sorry, I love these demonstration applications). First, there are two deep learning models in the analytic pipeline of this Pentaho Data Integration application. The first deep learning model is a binary classifier that detects an injury or anomaly in an x-ray image, and as we learned previously, there are more than two answers. The second deep learning model identifies the body part in the x-ray with a five-class classifier. The parts being identified come from the arm: the hand, forearm, elbow, humerus and shoulder.

 

In the demonstration, I always try to zoom in on a specific body part for injury detection and arm part identification, as shown in the first image chart below. In this case, the dominant feature is the elbow, and the model will produce a Class max prob and Class:elbow predicted prob result of "elbow" with a probability of 0.894, or 89%, for this specific perspective of the image. Different zoom levels or angled views could result in different probability values, but they would all point to this image as being an "elbow", with maybe some insignificant pieces of other parts. The chart below shows what the model "sees" in this x-ray image.

 

 

 

 

This is important to the "Hey Ray!" demonstration because we are only looking for one of five body parts exclusively. The PDI transformation prepares the results using filter steps to determine the confidence of the results. At 89%, this would be considered a pretty high confidence level, and the text-to-speech response would reflect that high probability. If the probability were lower, like 60% (which is still significant compared to the other probabilities), then the spoken dialog would reflect the questionable result. We take the Class max prob, regardless of its value, to predict the body part, and use filter steps in PDI to determine the confidence expressed in the dialog.

 

What we haven't implemented yet is interpreting what happens when there are multiple significant predictions. What happens when the Class max prob prediction is below 50%? All class probabilities must sum to 1, or 100%, so when the top probability is less than 50%, there must be other classes that are significantly greater than zero.

 

Let's use the exact same x-ray image and zoom out to show more of it. The full x-ray image shows significantly more of the rest of the arm beyond just the elbow. We can see much more of the humerus and the forearm in addition to the elbow. So what would our deep learning body part identification model predict with this image? Check out the chart image below.

 

 

 

 

As you can see in this second chart image, the humerus probability is now the Class max prob, but at 48% it is less than 50%. More importantly, the Class:forearm predicted and Class:elbow predicted probabilities are also significant, at 26% and 25% respectively. The sum of the probabilities of these three body parts is 0.992. We could interpret this as our body part identification model identifying the middle arm area – forearm, elbow and humerus.

 

 

Description                  Class:part prediction
Class:humerus predicted      0.479
Class:elbow predicted        0.259
Class:forearm predicted      0.254
Probability sum              0.992
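As a minimal sketch of how such a multi-class interpretation could be automated, the snippet below keeps every class whose probability clears a significance cutoff rather than trusting Class max prob alone; the 10% threshold and the two near-zero class values are illustrative assumptions, while the three main probabilities come from the table above.

```python
# Hypothetical multi-class interpretation: keep every class whose probability
# is "significant" (assumed cutoff of 10%) instead of only the Class max prob.
SIGNIFICANCE_THRESHOLD = 0.10   # assumption for illustration

predictions = {                 # zoomed-out x-ray example from the table above
    "humerus": 0.479,
    "elbow": 0.259,
    "forearm": 0.254,
    "shoulder": 0.005,          # assumed near-zero remainder
    "hand": 0.003,              # assumed near-zero remainder
}

significant = {part: p for part, p in predictions.items() if p >= SIGNIFICANCE_THRESHOLD}
covered = sum(significant.values())

label = " and ".join(sorted(significant, key=significant.get, reverse=True))
print(f"{label} (covers {covered:.1%} of the probability mass)")
# -> humerus and elbow and forearm (covers 99.2% of the probability mass)
```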

 

 

When we created this demonstration, we were not aware of this interesting artifact – that is, that the deep learning model could identify multiple classes simultaneously – but it makes sense after exploring the results more closely for different types of image capture perspectives. We are looking to add a multi-class interpretation flow to more accurately identify the x-ray images, where the possibilities could include, but are not limited to, the following body part combinations:

 

  • Shoulder
  • Shoulder and Humerus
  • Shoulder and Humerus and Elbow
  • Shoulder and Humerus and Elbow and Forearm
  • Shoulder and Humerus and Elbow and Forearm and Hand
  • Humerus
  • Humerus and Elbow
  • Humerus and Elbow and Forearm
  • Humerus and Elbow and Forearm and Hand
  • Elbow
  • Elbow and Forearm
  • Elbow and Forearm and Hand
  • Forearm
  • Forearm and Hand
  • Hand

 

There are more possible combinations, and they should be accounted for. This includes x-ray images of a hand and shoulder, or just the shoulder and elbow, and so forth. While it may seem unlikely to capture these types of x-ray images in a normal situation, most x-rays are taken because of injuries, so you never know what you might analyze. Now consider the scenario where the entire body is being identified. Yikes! That's a lot of body parts and combinations. We're going to need a bigger team.

We are excited to highlight our active customer experts and their learnings in these Customer Spotlight interviews. In this interview, we feature Jude Vanniasinghe, Development Manager at Bell Canada.

 

 


DESCRIBE YOUR ROLE AT BELL CANADA

I started my career with Bell Canada through a recent-grad program as a Tools and Productivity Manager. In this role, I primarily focused on automation. A year later, I took on another role managing blocks of telephone numbers within Ontario and Quebec. With a team of seven, I worked closely with the Canadian Numbering Association and the CRTC to successfully deliver two new area codes in Ontario (London and Ottawa).

 

Later I moved on to become a Business Intelligence (BI) Manager as part of the transport-network provisioning team, managing BI reporting developers using web applications and various other BI platforms. In 2008, I moved into the Business Intelligence Specialist position, and this is where I was introduced to Pentaho.

 

In 2011, I was promoted to Senior Business Intelligence Specialist. Some of the most interesting responsibilities in this current role are managing the development team, maintaining the architecture of the multi-tier system, training resources on Hitachi Vantara products, and so on.

 


BELL CANADA HAS BEEN A CUSTOMER SINCE 2008 - HOW DOES BELL LEVERAGE HITACHI VANTARA'S SOLUTIONS TO MEET YOUR DAY-TO-DAY BUSINESS NEEDS?

 

We mainly use Hitachi Vantara products for data integration and reporting for the Professional Services team within Bell Business Market (BBM) under Bell Canada. Using Hitachi Vantara's Pentaho Data Integration capabilities, we solve the most complex data manipulations within a short period of time. This helps us produce cubes, Analyzer reports, and dashboards that go well beyond stakeholders' expectations. Quick turnaround time is one of the main reasons why Pentaho is the winning tool for our day-to-day activities.

 

 

HOW HAS PENTAHO CHANGED YOUR CAREER?

 

Prior to 2008, I had used Data Transformation Services (DTS) from MS SQL Server and WebFOCUS (Information Builders) for data manipulation. Since 2008, Pentaho has changed the way I work with data integration and data visualization; Pentaho has made it simpler and easier than ever before. The prebuilt ETL (extract, transform, load) objects and GUI-based ETL builder have helped me transform many complex data sets into meaningful data for reporting. Also, the modest nature of the Pentaho product architecture keeps me in the Business Intelligence field and supports my growth in this area.

 

 

ANY FUTURE GOALS OR PROJECTS THAT BELL HAS THAT INCLUDES PENTAHO?

 

Now we are expanding BI reporting to other services within BBM, such as Managed Services, Connectivity, and more. We expect to double in size by 2021.

 

 

AS A LONG TIME CUSTOMER, CAN YOU SHARE SOME BEST PRACTICES? 

 

Basically, have a plan from the beginning. First, identify the top business needs and develop a set of business requirements and goals. Identify the necessary data sources. Identify the KPIs and metrics and choose the right visualization for each metric. Once these are in place, the final product will be developed smoothly.

 

Click here to discover more Pentaho Best Practices.

 

 

YOU HOSTED A PENTAHO USER GROUP IN TORONTO - TELL US ABOUT THAT EXPERIENCE. 

 

It was really a great experience. We were surprised by the number of users from a wide variety of companies who participated. It shows that the word is out there and more companies are starting to leverage the usefulness of Hitachi Vantara products as we do. We are looking forward to having another one this summer.

 

Click here to learn more about the Pentaho Toronto User Group. 

 

Click here If you're interested in other Pentaho User Groups. 

 

Do you have a story you want to share? Please reach out to Caitlin Croft to learn how you can be featured. Email Caitlin at: Caitlin.Croft@HitachiVantara.com.

Our Customer Success & Support teams are always working on tips and tricks that will help our customers get the most out of the Pentaho platform.

 

DevOps is a set of practices centered around communication, collaboration and integration between software development and IT operations teams and automating the processes between them. The main objective is to provide guidance on creating an automated environment where iteratively building, testing, and releasing a software solution can be faster and more reliable.

 

Our continuous integration (CI) and DevOps series was started to fill a need for increasingly complex information as you learn more. The main objective with this series is to provide guidance on creating this automated environment in which building, testing, and releasing a Pentaho Data Integration (PDI) solution can be faster, more reliable, and result in a high-quality solution that meets customer expectations at a functional and operational level. The Introduction to PDI and DevOps webinar will serve as the “prequel” to more complex concepts.

 

Our intended audience is Pentaho administrators and developers, as well as IT professionals who help plan software development.

 

Join us on May 7, 2019 at 9am Pacific for the first webinar of our PDI + DevOps webinar series. Click here to register now! If you miss the webinar, you can always watch it on demand afterwards.

Guest Contributor: Matt Aslett, Research Vice President, Data, Analytics & Artificial Intelligence, 451 Research

 

451 research logo sm.JPG
Aslett_Matt head shot.jpg

Most companies are increasing their investment in data-processing, analytics and machine-learning software with a desire to become more data-driven. Data – and the rapid processing of data – is a key driver in enabling companies to grasp the opportunities presented by digital transformation to deliver improved operational efficiency and competitive advantage.

 

We have moved from the transactional era, through the interaction era to the engagement era, in which enterprises have recognized that they must store, process and analyze as much, if not all, data that is available to them in order to survive and thrive in the digital economy. This includes data produced by the myriad of sensors, embedded computers, industrial controllers and connected devices such as vehicles, wearable computing devices, robots and drones that make up the emerging Internet of Things (IoT).

 

Data from 451 Research’s Voice of the Enterprise: Internet of Things indicates that analytics is seen as the most critical technology for success in IoT projects, but also that the largest impediment to IoT projects is technology deployment and integration challenges, followed by security concerns, and a lack of a compelling business case or uncertain ROI.

 

For IoT projects, the primary use cases are optimizing operations (for preventative maintenance and reduced downtime, for example), followed by reducing risk (such as security and compliance); the development of new, or enhancement of existing, products or services; and enhanced customer targeting for increased sales.

 

In all of these cases, while data from IoT devices is extremely valuable, and has been stored and processed alone for many years, the greater value comes from blending that IoT data with enterprise data sources. Combining IoT data with data from existing enterprise applications makes the link with customer behavior data, employee behavior data, marketing/advertising data and sales data, for example, to provide a more complete picture and ensure the IoT data is seen in the context of the business goal.

 

Data from 451 Research’s Voice of the Enterprise Data and Analytics indicates that the complexity involved in integrating and managing data actually grows, the more data-driven a company is. The results show that while the most data-driven companies enjoy benefits such as increased focus on competitive advantage, they are also faced with more data integration and preparation overheads.

 

Data from each of these sources is likely to be delivered in different formats, meaning that it needs to be blended, transformed and cleansed before it can be used to generate business insights. Data will also be delivered via different mechanisms. Although most will likely be delivered from traditional enterprise applications in batch form, increasingly it will be generated at the edge by Internet of Things devices and delivered via stream processing, for IoT analytics.

 

Additionally, in attempting to become more data-driven, many organizations are investing in machine-learning tools and developing ML-driven applications. The success of these projects depends on the ability of the organization to operationalize experimental data-science projects through training and testing to model deployment and management.

 

Much attention is paid to the outputs of these data-processing pipelines – including visualizations and machine-learning models used to drive business decision-making. However, intelligent and automated data-processing pipelines that are able to rapidly integrate data from multiple sources, including enterprise applications and IoT devices, should not be overlooked as a foundation for delivering successful IoT projects that deliver improved operational efficiency and new revenue streams.

 

To learn more about how to combine IoT and business data to deliver business value, click on one of the links below:

 

Visit the Hitachi Vantara website.

 

Download the Business Impact Brief from 451 Research entitled "Agile Data Management as the Basis for the Data-Driven Enterprise.”

“New Hey Ray!” or “Hey Shin Ray!” or “Hey Ray!”

HeyShinRayVideoDemo3 - YouTube

(click on this link to watch a video of "Hey Shin Ray!" in action on youtube)

 

 

by Ken Wood

 

HVLabs_RdI_Logo.png
The word 'shin' means 'new' in Japanese.

 

Back in November, Mark Hall and I shared with you a uniquely interactive demonstration that used the Plugin Machine Intelligence (PMI) plugin's new feature, Deep Learning with DL4j, to analyze an x-ray film through voice commands and speech responses. This original demonstration was featured at Hitachi NEXT 2018 in September 2018 and was used to demonstrate how Pentaho could use deep learning in an application.

 

The apparatus used to build the original "Hey Ray!" had multiple practical functions. First, it was easy to build, since we were up against a deadline for getting "Hey Ray!" up and running in time for the conference. Second, "Hey Ray!" is an example application of how to use PMI, deep learning, Pentaho, speech recognition, IoT, text-to-speech and a Raspberry Pi in a uniquely integrated way to solve a particular problem and demonstrate the power of deep learning and Pentaho. And finally, the steampunk-themed design, or as I like to say "… a Jules Verne inspired artificial intelligence", let everyone interested in this application understand that this is not a product or solution to be purchased, but an example of the simplicity of building advanced solutions with Pentaho and the new Hitachi Vantara Labs plugin, PMI.

HeyRayEvolution.png

As you can imagine, there were a couple of snafus during the conference. Speech recognition in a crowded, noisy environment is challenging at best. So, to overcome the loud ambient noise, I quickly built a remote control on my iPhone to mimic the voice command set that "Hey Ray!" used. Since we used the IoT protocol MQTT, it was easy to slide commands into the demonstration apparatus through a remote command queue. Another hiccup was the physical x-ray film. I only had 16 usable x-ray films (purchased off eBay) to choose from, so it started to get a little monotonous rotating through them. While this isn't a major problem, and it flowed nicely with the whole interactive steampunk theme of the demonstration, x-ray film is a little too antiquated, and having additional access to digital x-ray images would improve the overall experience. Lastly, the initial size and "non-slickness" of the physical apparatus and the number of components made it difficult to travel with. Since I live in San Diego, the location of the 2018 conference, it was easy for me to transport and set up this demonstration.
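For readers curious how such a remote command queue works mechanically, here is a minimal sketch of publishing one command over MQTT with the paho-mqtt client; the broker address, topic name and command string are illustrative assumptions rather than the demo's actual configuration.

```python
# Hypothetical remote control: publish a voice-command equivalent to an MQTT
# topic that the demonstration apparatus subscribes to. Broker, topic and
# command names are assumptions for illustration.
import paho.mqtt.publish as publish

publish.single(
    "heyray/commands",        # assumed remote command-queue topic
    payload="analyze",        # mimic one of the spoken commands
    qos=1,
    hostname="192.168.1.50",  # assumed broker address on the demo network
)
```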

 

So, with that, I started on two major upgrades: the first was to reduce the size of the demo to something that could fit in one Pelican shipping case (or smaller), and the second was to change the format of the "Hey Ray!" demonstration. I only used the first upgrade once, and it was fine, but still bulky, and it still only used physical x-ray film.

 

Over the 2018-2019 holidays I learned Apple Swift and created an iPhone application that uses the internal camera or the internal photo library as the source of x-ray images. Overall, this is so much better. I can browse the internet looking for interesting x-ray images and save them to the photo library, or I can use the internal camera to take live pictures of physical x-rays for the more interactive experience we were originally looking for. The resulting analysis is still a text-to-speech response, but I have dropped the speech recognition portion for the time being. However, since the Swift environment has access to the Siri framework, I could still incorporate speech recognition into the overall application later.

NewArchitecture.png

The iPhone application is basically a user interface to the analytics server. There is still a Pentaho server running Pentaho Data Integration (PDI) and using PMI with deep learning to analyze the incoming image. In fact, the original analytic transformation is mostly the same, and the two deep learning models used to detect injuries and identify the body part being analyzed are the same as the originals. A text analysis is formed from the findings and sent back to the iPhone, where the results are spoken using the iPhone voice synthesizer, the same voice synthesizer that Siri uses. The iPhone-based application is still able to store all analysis artifacts to a Hitachi Content Platform (HCP) system, now including the ingestion of custom metadata (the results of the deep learning analysis: body part and probability, and injury detection and probability), and it can tweet a movie of the analysis – a rendering of the audio analysis and the x-ray image – to Twitter.

 

I am planning to build similar iPhone applications with other datasets to use as demonstrations and as examples of how to build artificial intelligence based applications with Pentaho and PMI, and possibly other Hitachi Vantara Labs experimental prototypes. Stay tuned for more...

Blog 1 of 2

AllTweets.jpg

 

I have a few pent up blogs that need to be written, but I’ve been waiting for a couple of Pentaho User Meeting speaking events to finish before I share them. I don’t want to "steal my own thunder" if you know what I mean.

 

I did start tweeting some thoughts the other day on one of these viewpoints – hidden results from machine learning models – which led me to start writing this first of two blogs on the matter.

 

Much of this idea comes from the "Hey Ray!" demonstration that uses deep learning in Pentaho with the Plugin Machine Intelligence plugin. In this demonstration, occasionally a "wrong" body part would be identified. I say wrong, because I personally see a "forearm" but "Hey Ray!" sees an "elbow". In working with this example application and the interpretation of the results, it began to dawn on me: "maybe the neural networks in this transformation do 'see' a forearm". Technically, both answers, forearm and elbow, are correct. Then you have to ask yourself, "what is the dominant feature of the image being analyzed?" or, more importantly, "what image perspective is being analyzed?". When looking at the results from the deep learning models used, you can see that both classes score a high output probability. So what are you supposed to do with no "clear" result?

 

ElbowForearm.jpg

 

This is where hidden results come in. In my talks, I typically describe starting and using machine learning on problems and datasets that can be described with two classes, or binary classes: "Yes or No", "True or False", "Left or Right", "Up or Down", and so on. Ask a question about the data, for example: "Is this a fraudulent transaction, yes or no?". With supervised machine learning, you have the answers from historical data. You are just training the models to recognize the patterns in the historical data so they can recognize future fraudulent transactions in live or production data.

 

Here's the issue: there could actually be three possible answers from binary classes – Yes, No or Unknown. However, you have to interpret the Unknown answer with thresholds and flag the implied Unknown results. Typically, the Type:Yes_Predicted_prob outcomes from a machine learning model (assuming you are using PMI with Pentaho) are based on the halfway point – 0.5 or 50%. Any prediction above the 50% line is the Yes class and any prediction below the 50% line is the No class. This means a prediction can land as a Yes even when Type:No_Predicted_prob is 0.49, or 49%. These statistical "ties" need to be contained in a range, and this range needs to be defined as your policy.

 

For example, you could test and retest your models and define a range of results for Type:Yes_Predicted_prob or Type:No_Predicted_prob and the extrapolated Unknown results. The predicted results would be:

 

  • 4 predicted “yes” diabetic,
  • 4 predicted “no” diabetic
  • and 3 “unknown” or “undetermined” predictions.

 

The figure below shows the results of a PMI Scoring step on the last 11 case study records from the Pima Indians diabetes dataset.

 

NOTE:

For experimental purposes, I split the 768 records in the diabetes case study dataset into three separate datasets: a training dataset, a validation dataset and a "production" dataset. For the production dataset, I used the last 11 records, removed the "outcome" column (the study's results) and scored this dataset to get the results you see below.

 

ResultsInterpretation2.png

ResultsInterpretationKTR.png

 

This example illustrates that any result predicted with a probability score between 0.30 and 0.70 (30% to 70%) should be interpreted as Unknown, while the top 30% and bottom 30% ranges should be used to define your Yes and No results. A result needs a predicted Yes score of 0.70 or higher to be interpreted as a Yes, and a score of 0.30 or lower to be interpreted as a No; all other results should be considered Unknown.
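Here is a minimal sketch of that thresholding policy applied to scored rows; the field name follows the Type:Yes_Predicted_prob convention used above, and the 0.30/0.70 cutoffs are the ones chosen for this example.

```python
# Interpret a binary PMI score as Yes / No / Unknown using the 0.30-0.70
# policy described above, instead of trusting the 0.5 halfway point alone.
YES_THRESHOLD = 0.70
NO_THRESHOLD = 0.30

def interpret(yes_prob: float) -> str:
    """Map a Type:Yes_Predicted_prob value to a three-valued outcome."""
    if yes_prob >= YES_THRESHOLD:
        return "Yes"
    if yes_prob <= NO_THRESHOLD:
        return "No"
    return "Unknown"   # statistical "tie" zone - abstain and flag for review

# Example probabilities in the spirit of the scored production rows above
for p in (0.92, 0.51, 0.49, 0.12):
    print(f"Type:Yes_Predicted_prob={p:.2f} -> {interpret(p)}")
```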

 

If you only used the type_MAX_prob to determine the outcome, the results would be,

 

  • 5 predicted “yes” diabetic,
  • 6 predicted “no” diabetic

 

So, use thresholds and ranges to define your results from machine learning models. Do not just go by the type_MAX_prob, Type:No_Predicted_prob or Type:Yes_Predicted_prob. By using thresholds, you will discover hidden results in binary classes.

 

Hidden results are more pronounced when working with datasets with more than two classes. Multi-class datasets may actually include combinations of answers as well as unknown results. This will be the subject of the next blog. Stay tuned!

PMIDLLogo.png

Why Streaming Data? Why Now?

In the past 10 years, the data market has undergone a vast transformation, expanding far beyond traditional data integration and business intelligence involving the analysis of transactional data. One of the most important changes has been a huge growth in the data created by devices, vehicles, and other items outfitted with sensors and connected to the internet – the so-called Internet of Things (IoT). IoT data is being generated by these devices at an unprecedented rate, and the rate of growth itself continues to rise. This wealth of new data provides many opportunities to improve business, societal, and personal outcomes, if used properly. However, it is quite easy to get swamped by the flood of data and neglect it until a later date, potentially resulting in many missed opportunities to improve sales, reduce downtime, or respond to a problem in real time. This is where streaming data tools come in handy. By processing, analyzing, and displaying data in real time, you can act upon your IoT data when it matters most – right now.

 

3068665_keytothecity_londonnightconnections.jpg

 

Fortunately, Hitachi Vantara realizes the importance of streaming data and has invested significant resources into the field, resulting in new products like the Lumada platform and Smart Data Center as well as new features for our software such as Pentaho Data Integration’s support of MQTT and Kafka via its native engine or Spark via the Adaptive Execution Layer (AEL). I’ve been exploring many of these new features, especially those specific to Pentaho, and wanted to share some architectures I’ve had success with that allow me to receive, process, and visualize data in real time.

 

Use Case & Requirements

One example of an IoT case where I’ve used Pentaho’s streaming data capabilities is the Smart Business booth at the Hitachi Next 2018 conference. For this demo, we had a table with 3 bowls of various candies available and asked conference visitors to select a treat. Meanwhile, a Hitachi lidar camera was watching the visitor, detecting which candy they picked as well as their estimated age, gender, emotion, and height. The goal was to have the camera’s analysis streamed to a real time dashboard where guests could see their estimated demographics and candy choice as well as historical choices made by other conference attendees.

Requirement graphic (draw.io).png

This demonstration was a bit intimidating at first, as it had some considerable requirements. The lidar camera needed to capture and analyze the depth "images", generate an analysis data file including candy choice, gender, age, height, and emotion, and send this data to a computer. Then, the computer needed to store this data for historical analysis purposes while simultaneously visualizing the data in an attractive manner. Furthermore, all of this needed to happen in a sub-second time frame for the demonstration to function as guests expect. Thankfully, with the new features in Pentaho, I was confident we could pull this off, even with a relatively short time frame.

 

Possible Architectures

I considered a few possible architectures, all of which required the same basic flow. The data needed to be sent by the camera to the computer using some protocol (step 1), the data needed to be received by a messaging/queuing system (step 2), the data needed to be stored in a database (step 3), and the data needed to be quickly queried from the database or sent directly to the dashboard via another streaming protocol (step 4). The possibilities I considered for each of the steps are as follows:

  1. Camera to computer data protocol: Kafka, MQTT, or HTTP post
  2. Message receiver/queue: Pentaho Data Integration (PDI) transform using Kafka consumer or MQTT consumer or PostgREST, a program extending a Postgres database so it can be used via a RESTful API
  3. Receiver to database: A table output step in PDI after the Kafka or MQTT consumer or having PostgREST call a stored procedure that in turn writes data to the database
  4. Query or stream to dashboard: A Pentaho streaming data service, an optimized SQL query, or postgres-websockets, a program extending a Postgres database allowing notifications issued by postgres stored procedures to be sent to a client over a websocket.

 

Here is a diagram detailing how the components from each step would work together:

Possible architectures (draw.io).png

 

I had previously used and had success with an architecture using HTTP POST (step 1), PostgREST (step 2), a stored procedure (step 3), and postgres-websockets (step 4), which made it a tempting solution to reach for. However, I realized that the postgres-websockets component might be overkill, as it was only required on the prior project because we had run into issues with the stored procedure taking too long to write data to disk. We remedied that issue by sending the data first to the websocket and then writing the data to the database. For this use case, I didn't believe the data would take long to be written to disk, so I opted to swap out postgres-websockets for an optimized SQL query. Overall this solution was a success, but in 1 out of every 1,000 cases there would be a slight delay in the optimized SQL query finishing, causing a bit of lag before the data showed up on the dashboard. Although this wasn't a make-or-break for the demo, I decided to press on to see if we could find a better solution.

 

The second architecture we tried was Kafka (step 1), a PDI transformation with a Kafka consumer step (step 2), a table output step after the Kafka consumer step (step 3), and a streaming data service (step 4). This solution proved to be much faster and more reliable for live data, with no timing issues in contrast to the prior solution. However, it did require us to query historical data separately, causing some difficulties in joining the data between the sources for components where we were trying to display both live and historical figures. Although this joining was a bit tricky, this all-Pentaho solution proved to be much easier to set up and manage, and it performed better than the other architectures I've used, including those built on third-party tools. Therefore, we chose to implement this architecture.
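To make step 1 of the chosen architecture concrete, here is a minimal sketch of a camera-side producer publishing one analysis event to Kafka with the kafka-python client; the broker address, topic name and message fields are illustrative assumptions, not the booth's actual configuration.

```python
# Hypothetical camera-side producer: publish one analysis event as JSON to the
# Kafka topic that the PDI Kafka consumer transformation reads from.
import json
from kafka import KafkaProducer  # kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {                      # field names are assumptions for illustration
    "candy": "chocolate",
    "estimated_age": 34,
    "gender": "female",
    "emotion": "happy",
    "height_cm": 168,
}

producer.send("booth-analysis", value=event)   # assumed topic name
producer.flush()                               # ensure delivery before exiting
```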

 

Putting it All Together

Although I won’t go into it, we next hardened the architecture, developed the transformations, created the dashboard, and tested our solution. Before we knew it, it was time to head to Next 2018 where the demo would be put to the real test. How’d we do? Check out the media below!

 

Dashboard.png

In person.jpg

Takeaway

By utilizing an all-Pentaho architecture, we were able to create a reliable, fast, end-to-end live data pipeline that met our goal of sub-second response time. It was impressive enough to wow conference attendees and simple enough that it didn't require a line of code. Streaming data is here, and growing, and Pentaho provides real solutions to receive, process, and visualize it, all in near-real time.

Our Customer Success & Support teams are always working on tips and tricks that will help our customers get the most out of the Pentaho platform.

 

PDI Development and Lifecycle Management

 

A successful data integration (DI) project incorporates design elements for your DI solution to integrate and transform your data in a controlled manner. Planning out the project before starting the actual development work can help your developers work in the most efficient way.

 

While the data transformation and mapping rules will fulfill your project’s functional requirements, there are other nonfunctional requirements to keep in mind as well. Consider the following common project setup and governance challenges that many projects face:

 

  • Multiple developers will be collaborating, and such collaboration requires development standards and a shared repository of artifacts.
  • Projects can contain many solutions and there will be the need to share artifacts across projects and solutions.
  • The solution needs to integrate with your Version Control System (VCS).
  • The solution needs to be environment-agnostic, requiring the separation of content and configuration.
  • Deployment of artifacts across environments needs to be automated.
  • Failed jobs should require a simple restart, so restartability must be part of job design.
  • The finished result will be supported by a different team, so logging and monitoring should be put in place to support the solution.

 

We have produced a best practices document – PDI Development and Lifecycle Management – that focuses on the essential and general project setup and lifecycle management requirements for a successful DI solution. It includes a DI framework that illustrates how to implement these concepts and separate them from your actual DI solution, and it is available for download from Pentaho Data Integration Best Practices.

 

Guidelines for Pentaho Server and SAML

 

SAML is a specification that provides a means to exchange an authentication assertion of the principal (user) between an identity provider (IdP) and a service provider (SP). Once the plugin is built and installed, your Pentaho Server will become a SAML service provider, relying on the assertion from the IdP to provide authentication.

 

Security and server administrators, or anyone with a background in authentication or authorization with SAML, will be most interested in the Pentaho Server SAML Authentication with Hybrid Authorization document available for download from Advanced Security.

Thank you to all speakers and participants for contributing to Pentaho User Meeting 2019!

It was a great day full of news about, and experiences with, Pentaho. We had a lot of fun hosting 80 folks from Germany, Switzerland and Austria.

You couldn't join the event? Here is the live blog covering all talks. It's in German, but Google Translate helps. We will upload presentation videos, slides and pictures shortly.

 

Update 03/13: the agenda is final now! The following users will share their projects with us:

  • Ken Wood: updates on machine learning and Plugin Machine Intelligence. Read the interview
  • Helmut Borghorst: Pentaho sales reporting in a €400 million food business with 35,000 customers. Read the interview
  • Jens Junker: "Pentaho is our Swiss Army knife." Systems and business process integration. Read the interview
  • Jürgen Sluyterman: big data and analytics for recycling and upcycling, including PDI integration of SAP systems
  • Gunther Dell: video analytics
  • Christoph Acker: GDPR-compliant big data analytics and AI on personal data
  • Pedro Alves, Jens Bleuel: updates on Pentaho and PDI 8.2

 

Hey Pentaho users,

 

we finally fixed the date for Pentaho User Meeting for the German/Austrian/Swiss community!

 

March 26, 2019 in Frankfurt, Germany

Check out the details on the event page

 

Join us to learn from other users and share experiences with Pentaho (and maybe your latest hack). As usual, we're trying to organize a full day of presentations, including the notorious Pedro Alves and Jens Bleuel, who will give us an update on what's going on in Pentaho development. In the evening, we will have pizza, beer and lots of socializing.

 

NEW - NEW - NEW

This year we decided to do something new. As the Pentaho ecosystem keeps growing fast, we're preparing a little showroom area to present products and solutions from the Pentaho universe.

If you're interested in presenting your solution, please contact Sarah at sarah.brunn@it-novum.com

 

Call for Papers

Please share your Pentaho experience with us! We're accepting all kinds of proposals: use cases, projects, technical topics and more. The language is German, though exceptions are admitted.

Please send your proposal to Stefan at stefan.mueller@it-novum.com

 

Agenda

09:30 - 10:00 Registration

10:00 - 10:15 Introduction by Stefan Müller, Director Big Data Analytics, it-novum

10:15 - 10:45 Pentaho 8.2: Big data integration, IoT analytics and data from the cloud. Pedro Alves, Head of Product Design, Hitachi Vantara

10:45 - 11:15 What is new at Pentaho Data Integration 8.2? Jens Bleuel, Senior Product Manager - Pentaho Data Integration, Hitachi Vantara

11:15 - 11:45 Pentaho at Deutsche See: challenges, problems and solutions. Helmut Borghorst, Head of Inside Sales and IT Systems, Deutsche See. Read the interview

 

11:45 - 13:30 Showroom and lunch break: learn more about solutions from Pentaho's ecosystem and take the chance to talk to the developers!

 

13:30 - 14:00 Machine and Deep Learning with Pentaho. Ken Wood, VP Hitachi Vantara Labs read the interview

14:00 - 14:40 Recycling with big data: resource and waste management with Pentaho data warehouse and SAP integration. Jürgen Sluyterman, IT Director, RSAG & Christopher Keller, it-novum

 

14:40 - 15:10 Coffee break

 

15:10 - 15:40 Pentaho Data Integration – glue and Swiss Army knife. Connecting IT silos, automating business processes and implementing business requirements. Jens Junker, Application Support PFM / Risk Management, VNG Handel & Vertrieb GmbH. Read the interview

15:40 - 16:10 Video Analytics. Gunther Dell, Director Global Business Development SAP, Hitachi Vantara

16:10 - 16:40 GDPR-compliant big data analytics and AI on personal data. Christoph Acker, Data Scientist, it-novum


16:40: Get together with pizza and beer!

 

See you on March 26!

by Carl Speshock, Senior Technical Product Manager

 

1. Why Model Management

 

Data Scientists are constantly exploring and working with the latest Machine Learning (ML) and Deep Learning (DL) libraries. This has increased the number of model taxonomies that Data Engineers manage and monitor in production environments. To address this need, Hitachi Vantara developed a model management solution that supports the evolving requirements of the data scientist and engineer community.

 

Pentaho offers the concept of the Model Management Reference Architecture as shown below.

 

MM_1.jpg

 

2. Why Model Management with Pentaho?

 

Data Engineers choose PDI for model management for the following reasons, it:

 

  • Operationalizes ML/DL models for production usage, expanding Pentaho's data science capabilities with the ability to implement a Model Management solution.
  • Enables data engineers to work in PDI's drag-and-drop GUI environment to implement Python script files received from Data Scientists within the new PDI Python Executor step.
  • Utilizes inputs from previous PDI steps as variables, Python NumPy objects, and/or pandas DataFrame objects within DL processing. This integrates the DL Python script file with the PDI workflow and data stream (a sketch of such a script follows this list).
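Below is a minimal sketch of the kind of script body a Data Engineer might place in the Python Executor step, assuming the step has been configured to expose the incoming rows as a pandas DataFrame named df and to read the variable scored back out; both variable names and the column used are illustrative, not fixed by the step.

```python
# Hypothetical PDI Python Executor script body. The step is assumed to inject
# the incoming rows as a pandas DataFrame named "df" and to pick up "scored"
# as its output; the names and the "amount" column are illustrative.
import pandas as pd

# When run outside PDI (e.g. while testing the script), fall back to sample rows.
df = globals().get("df", pd.DataFrame({"amount": [10.0, 20.0, 30.0]}))

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Toy feature engineering: append a normalized copy of the 'amount' column."""
    out = frame.copy()
    out["amount_norm"] = (out["amount"] - out["amount"].mean()) / out["amount"].std()
    return out

scored = add_features(df)   # read back by the step as its output (assumed mapping)
print(scored.head())
```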

 

The Pentaho model management architecture includes:

  • PDI: Data wrangling and feature engineering performed by the Data Engineer to fulfill production data set requests from the Data Scientist.
  • Python Executor step: Executes Python scripts containing ML and DL libraries, and more.
  • The ML/DL model: Developed by a Data Scientist within a data science notebook such as Jupyter Notebook.
  • The Model Catalog: Stores versioned models within an RDBMS, NoSQL database, or other data store.
  • Champion/Challenger multi-model analysis: Pulls models from the Model Catalog and runs them iteratively with hyperparameters to analyze and evaluate model performance.

 

3. How do we Manage Models within PDI?

 

Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step
  • Python.org 2.7.x or Python 3.5.x

Review the Pentaho 8.2 Python Executor step in the Pentaho online help for the list of dependencies: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Transformation_Step_Reference/Python_Executor

Basic process:

 

1. Build a Model Management workflow in PDI following the example below:

 

  • Create Model Catalog: Build models and store model run info, evaluation metrics, model version, etc. in the Model Catalog (see the sketch after this list).
  • Set up Champion/Challenger runs: For each run, competing models are pitted against the Champion model – the current model in production that Data Scientists are using.
  • Run Champion/Challenger models: Pre-determined Challenger models are pulled from the Model Catalog and run against the current Champion. The model with the best evaluation metrics is set as Champion. This either keeps the current Champion or replaces it with a more accurate Challenger model.
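As a rough illustration of the Create Model Catalog idea referenced above, here is a minimal sketch that persists a serialized model together with its version and an evaluation metric in a small SQLite catalog; the table layout, file paths and metric are assumptions for illustration, not the structure PDI or PMI actually uses.

```python
# Hypothetical model catalog: store a serialized model with its version and
# evaluation metric so later Champion/Challenger runs can compare candidates.
import pickle
import sqlite3
from datetime import datetime, timezone

def register_model(db_path: str, name: str, version: str, auc: float, model) -> None:
    """Insert one model entry into a simple SQLite-backed catalog (illustrative schema)."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS model_catalog (
               name TEXT, version TEXT, auc REAL,
               created_utc TEXT, model_blob BLOB)"""
    )
    con.execute(
        "INSERT INTO model_catalog VALUES (?, ?, ?, ?, ?)",
        (name, version, auc, datetime.now(timezone.utc).isoformat(), pickle.dumps(model)),
    )
    con.commit()
    con.close()

# Example usage with a stand-in "model" object:
register_model("model_catalog.db", "churn_classifier", "1.3", 0.87, {"weights": [0.1, 0.4]})
```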

 

MM_2.jpg

 

2. In PDI, use the new 8.2 Python Executor Step

 

MM_3.jpg

 

3. Model experimentation is the first part of the Model Management workflow; it builds and trains models and stores evaluation metric information in a Model Catalog (the figure below showcases five functional areas of processing). PDI offers a GUI drag-and-drop environment that could also add drift detection, data deviation detection, canary pipelines and more.

 

MM_4.jpg

 

This example uses four models. One of the models has in-line model hyperparameters injected using a subset of the labeled production data. All of this can be adjusted to fit your environment (i.e. modified to incorporate labeled or unlabeled, non-production data to accommodate your requirements).

 

 

4. The next stage in the workflow is setting up the Champion/Challenger run. Place one model into a Champion folder and the remaining models into a Challenger folder. This can be accomplished using the PDI Copy Files step, with an example shown below:

 

MM_5.jpg

 

 

 

5. Now, perform a subset of a Champion/Challenger run with the three highest-performing models. Decide on an evaluation metric or other value for comparing which model performed the best. The output of the Champion/Challenger run can produce log entries or send emails to save the results.
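To make the comparison concrete, here is a minimal sketch of the Champion/Challenger decision itself, assuming each candidate has already been scored and its evaluation metric recorded; the model names and metric values are illustrative.

```python
# Hypothetical Champion/Challenger comparison: the candidate with the best
# evaluation metric (here, accuracy - higher is better) becomes the Champion.
champion = ("gradient_boosting_v2", 0.891)     # assumed current production model
challengers = [
    ("random_forest_v3", 0.874),
    ("logistic_regression_v5", 0.902),
    ("neural_net_v1", 0.888),
]

best_name, best_score = max([champion] + challengers, key=lambda m: m[1])

if (best_name, best_score) == champion:
    print(f"Champion {champion[0]} retained (score {champion[1]:.3f})")
else:
    print(f"Challenger {best_name} promoted to Champion (score {best_score:.3f})")
```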

 

MM_6.jpg

The example shows the models using production data, which could be modified to incorporate either labeled or unlabeled, non-production data to accommodate your requirements. PDI offers a GUI drag and drop environment that could add Drift Detection, Data Deviation Detection, Canary Pipelines and more.

 

 

6. Run the PDI Job to execute the Model Management Workflow. It will write its output to the PDI log, with the Champion/Challenger results displayed (see below):

 

MM_7.jpg

 

4. Why do Organizations use PDI and Python for Deep Learning?

It is no longer optional to have a dedicated data preparation tool when operationalizing your models in a production environment; it is a necessity. Pentaho 8.2 can assist you in this effort.

 

  • PDI makes the implementation and execution of Model Management solutions easier using its graphical development environment for the related pipelines and workflows. It is easy to customize your Model Management workflow to meet specific requirements.
  • Data Engineers and Data Scientists can collaborate on the workflow and utilize their skills and time more efficiently.
  • The Data Engineer can use PDI to create the workflow in preparation for the Python scripts/models received from the Data Scientist and implemented in the Model Catalog.

Pentaho Data Integration (PDI) and Data Science Notebook Integration

Evan Cropper – PMM - Analytics

1. Using PDI, Data Science Notebook, and Python Together

 

Data Scientists are great at developing analytical models to achieve specific business results. However, deploying a model in a production environment requires a different skillset than data exploration and model development. The result is wasted human capital: the data scientist spends a significant amount of time engineering the data, and that work is often re-written by your data engineers for production deployments.

 

So why not let the data scientist and data engineer each focus on what they do best? With Pentaho PDI's drag-and-drop GUI environment, data engineers can prepare and orchestrate data to seamlessly flow into a Data Scientist's notebook, such as Jupyter (the focus of this blog). With Pentaho, Data Scientists can explore analytical models using the Python programming language and related Machine Learning and Deep Learning frameworks, working with cleansed data. The result? Models ready for production environments at a fraction of the cost and delivered in a fraction of the time.

 

2. Why would you want to use PDI and the Jupyter Notebook together to develop models in Python?

Pentaho allows Data Scientists to spend their time on data science models instead of data prep tasks and makes it easier to share Python scripts between data scientists and data engineers. By choosing Pentaho to operationalize data science, organizations can:

 

  1. Utilize a graphical drag and drop development environment, which makes data engineering easier with a toolbox of connectors to many data sources - easily configured instead of coded, tools that can blend data from multiple sources, and transformation steps to cleanse and normalize the data.
  2. Migrate data to production environments with minimal changes.
  3. Scale seamlessly to address growing production data volumes, and
  4. Share production-quality data sets and Python scripts between Data Engineers and Data Scientists, as shown below in a collaborative workflow between the two personas:

 

JN-1.jpg

 

 

https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/

3. How do you develop Python models using PDI and Jupyter Notebooks?

JN-2.jpg

Dependencies and components tested with:

  • Applicable Pentaho versions: 8.1/8.2
  • Python.org 2.7.x or Python 3.5.x
  • Jupyter Notebook 5.6.0
  • Python JDBC package dependencies, i.e. JayDeBeApi and jpype

See the Pentaho PDI Data Service online help for configuration, installation, client JARs, and more: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Services
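For orientation, here is a minimal sketch of the JDBC connection a notebook makes to a PDI Data Service using JayDeBeApi; the driver class, connection URL format, credentials, JAR path and data service name are placeholders to adapt from the Data Service documentation linked above.

```python
# Hypothetical notebook cell: query a PDI Data Service over JDBC and load the
# rows into a pandas DataFrame named df. All connection details are placeholders.
import jaydebeapi
import pandas as pd

conn = jaydebeapi.connect(
    "org.pentaho.di.trans.dataservice.jdbc.ThinDriver",     # assumed thin driver class
    "jdbc:pdi://localhost:8080/kettle?webappname=pentaho",  # placeholder connection URL
    ["admin", "password"],                                  # placeholder credentials
    "/path/to/pdi-dataservice-client.jar",                  # placeholder client JAR
)

curs = conn.cursor()
curs.execute("SELECT * FROM my_data_service")               # placeholder data service name
rows = curs.fetchall()
columns = [d[0] for d in curs.description]
df = pd.DataFrame(rows, columns=columns)

curs.close()
conn.close()
df.head()
```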

 

Basic process:

 

  1. In PDI, create a new transformation connected to the Pentaho Server repository. Implement all of your data connections, blending, filtering, cleansing, etc., as shown in the example below:

 

JN-3.jpg

 

2. Use PDI's Data Service feature to export rows from the PDI transformation (which later will be consumed in a Jupyter Notebook). Create a New Data Service by right-clicking on the last step in the transformation. Test the Data Service within the UI and select Save Transformation As to save the Data Service to Pentaho Server.

 

JN-4.jpg

JN-5.jpg

3. Before the Data Scientist can work in the Jupyter Notebook, use a Data Grid step to review the Data Grid fields and data values. These input variables will flow into the Python Executor step. Note that the Data Engineer can easily change them for new PDI Data Services.

 

 

JN-6.jpg

JN-7.jpg

JN-8.jpg

 

4. Below, the Python Executor – Create Jupyter Notebook – Python API step contains the Python script, input and output references, and more. From here, the Data Engineer can create the Jupyter Notebook for the Data Scientist to consume.

JN-9.jpg

 

5. The Python Executor – Create Jupyter Notebook – Python API step automatically populates the Jupyter Notebook (shown below) with the cleansed and orchestrated data from the transformation. The Data Scientist is connected directly to the PDI Data Service created earlier by the Data Engineer.

JN-10.jpg

 

 

6. The Data Scientist retrieves the Jupyter Notebook file created by the Data Engineer and PDI (for example, from an enterprise data catalog or file share) and confirms the output of the Python Pandas DataFrame named df in the last cell.

JN-11.jpg

7. From here, the Data Scientist can begin building, evaluating, processing, and saving machine and deep learning models using the Pandas DataFrame named df. An example using a machine learning Decision Tree Classifier is shown below.

JN-12.jpg
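
For reference, here is a minimal sketch of such a notebook cell, assuming scikit-learn is installed and that df contains a label column named target (a hypothetical name; substitute your own label and feature columns):

```python
# Minimal sketch: train and evaluate a Decision Tree Classifier on the df
# returned by the PDI Data Service. "target" is a hypothetical label column.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = df.drop(columns=["target"])   # feature columns
y = df["target"]                  # label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```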

 

JN-13.jpg

JN-14.jpg

 

 

4. How can Data Engineers and Data Scientists using Python collaborate better with PDI and Jupyter Notebooks?

 

  1. PDI’s graphical development environment makes data engineering easier.
  2. Data Engineers can easily migrate PDI applications to production environments with minimal changes.
  3. Data Engineers can scale PDI applications to meet production data volumes.
  4. Data Engineers can quickly respond to Data Scientists' data set requests with PDI Data Services.
  5. Data Scientists can easily access Jupyter Notebook templates connected to a PDI Data Service.
  6. Data Scientists can quickly pull data on demand from the Data Service and get to work on what they do best!

Anand Rao, Principal Product Marketing Manager, Pentaho

 

 

1 Deep Learning – What is the Hype?

 

According to Zion Market Research, the deep learning (DL) market will grow from $2.3 billion in 2017 to over $23.6 billion by 2024. With a CAGR of almost 40%, DL has become one of the hottest areas for Data Scientists to create models[1]. Before we jump into how Pentaho can help operationalize your organization's DL models within production environments, let's take a step back and review why DL can be so disruptive. Below are some of the characteristics of DL; it:

DL-10.jpg

DL-11.jpg

 

 

 

  • Uses artificial neural networks with multiple hidden layers that can perform powerful image recognition, computer vision/object detection, video stream processing, natural language processing, and more (a minimal sketch of such a multi-layer network follows this list). Improvements in DL offerings and in processing power, such as GPUs and the cloud, have accelerated the DL boom in the last few years.
  • Attempts to mimic the activity of the human brain via layers of neurons, learning to recognize patterns in digital representations of sounds, video streams, images, and other data.
  • Reduces the need to perform feature engineering before running the model: the multiple hidden layers perform feature extraction on the fly as the model runs.
  • Improves performance and accuracy over traditional machine learning algorithms thanks to updated frameworks, the availability of very large data sets (i.e. big data), and major improvements in processing power, i.e. GPUs.
  • Provides development frameworks, environments, and offerings, i.e. Tensorflow, Keras, Caffe, PyTorch, etc., that make DL more accessible to data scientists.
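
As a concrete, if simplified, picture of "multiple hidden layers", here is a minimal Keras sketch; the input dimension, layer sizes, and placeholder training data are assumptions and not part of any Pentaho workflow described here.

```python
# Minimal sketch of a network with two hidden layers (Keras 2.x Sequential API).
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_dim=20),  # first hidden layer (20 input features assumed)
    Dense(32, activation="relu"),                # second hidden layer
    Dense(1, activation="sigmoid"),              # binary-classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data, just to show the training call.
X = np.random.rand(256, 20)
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```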

 

2 Why should you use PDI to develop and operationalize Deep Learning models in Python?

 

Today, Data Scientists and Data Engineers have collaborated on hundreds of data science projects built in PDI. With Pentaho, they've been able to migrate complex data science models to production environments at a fraction of the cost compared to traditional data preparation tools. We are excited to announce that Pentaho can now bring this ease of use to DL frameworks, furthering Hitachi Vantara's goal of enabling organizations to innovate with all their data. With PDI's new Python Executor step, Pentaho can:

  • Integrate with popular DL frameworks in a transformation step, expanding on Pentaho's existing robust data science capabilities.
  • Easily implement DL Python script files received from Data Scientists within the new PDI Python Executor step.
  • Run DL models on any CPU/GPU hardware, enabling organizations to use GPU acceleration to improve the performance of their DL models.
  • Incorporate data from previous PDI steps, via the data pipeline flow, as a Python Pandas DataFrame or NumPy array within the Python Executor step for DL processing.
  • Integrate with Hitachi Content Platform (HDFS, local, S3, Google Storage, etc.), allowing unstructured data files to be moved and positioned to a locale (e.g. a data lake) and reducing DL storage and processing costs.

 

Benefits:

  • PDI supports the most widely used DL frameworks, i.e. Tensorflow, Keras, PyTorch, and others that have a Python API, allowing Data Scientists to work within their favorite libraries.
  • PDI enables Data Engineers and Data Scientists to collaborate while implementing DL.
  • PDI allows for efficient allocation of the skills and resources of the Data Scientist (build, evaluate, and run DL models) and the Data Engineer (create data pipelines in PDI for DL processing).

 

3 How does PDI operationalize Deep Learning?

 

Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step, Hitachi Content Platform (HCP) VFS
  • Python.org 2.7.x or Python 3.5.x
  • Tensorflow 1.10
  • Keras 2.2.0

Review the Python Executor step in the Pentaho 8.2 online help for a list of dependencies: Python Executor - Pentaho Documentation

Basic process:

 

1. Reference an HCP VFS file location within a PDI step to copy and stage unstructured data files for DL framework processing in the PDI Python Executor step.

DL-9.jpg

 

Additional info: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Integration_Perspective/Virtual_File_System

 

2. Create a new Transformation that implements the workflows for processing the DL frameworks and associated data sets. Inject hyperparameters (values used to tune and execute the models) to evaluate the best-performing model. Below is an example that implements four DL framework workflows, three using Tensorflow and one using Keras, with the Python Executor step.

 

 

DL-12.jpg

 

DL-13.jpg

 

 

 

3. Focusing on the Tensorflow DNN Classifier workflow (which implements the injection of hyperparameters), use a PDI Data Grid step, here named Injected Hyperparameters, with values used by the corresponding Python Executor steps.

 

 

4. Within the Python Executor step, use a Pandas DataFrame and implement the injected hyperparameters and values as variables on the Input tab.

 

DL-4.jpg

5. Execute the DL-related Python script (either embedded or via a URL to a file), referencing a DL framework and the injected hyperparameters from the inputs. You can also set the Python virtual environment to a path other than the default Python installation. An illustrative sketch follows the screenshot below.

DL-5.jpg
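
For illustration only, the following sketch shows the shape such a script could take with TensorFlow 1.10's estimator API. The variable names df, hidden_units, learning_rate, and steps stand in for whatever the Input tab actually maps from PDI, and the label column name is hypothetical.

```python
# Illustration only: a DNN Classifier driven by hyperparameters injected from PDI.
# Assumes the Input tab maps the training rows to a Pandas DataFrame named df and
# the injected hyperparameters to hidden_units (e.g. "10,20,10"), learning_rate,
# and steps. Assumes integer class labels 0..2 in a column named "label".
import tensorflow as tf

LABEL = "label"
features = [c for c in df.columns if c != LABEL]
feature_columns = [tf.feature_column.numeric_column(c) for c in features]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[int(u) for u in str(hidden_units).split(",")],
    n_classes=3,
    optimizer=tf.train.AdagradOptimizer(learning_rate=float(learning_rate)),
)

input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df[features], y=df[LABEL], batch_size=32, num_epochs=None, shuffle=True)
classifier.train(input_fn=input_fn, steps=int(steps))
```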

 

6. Verify that Tensorflow is installed, configured, and importing correctly into a Python shell.

DL-6.jpg
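
A quick sanity check along these lines (the expected versions are the ones listed above):

```python
# Run in the same Python environment configured in the Python Executor step.
import tensorflow as tf
print(tf.__version__)     # expect 1.10.x

import keras
print(keras.__version__)  # expect 2.2.x
```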

 

 

7. Going back to the Python Executor step, click the Output tab and then the Get Fields button. PDI will run a pre-check on the script file to verify it for errors, outputs, etc.

DL-7.jpg

 

8. This completes the configuration required to run the transformation.

 

4 Does Hitachi Vantara also offer GPU offerings to accelerate Deep Learning execution?

 

DL frameworks can benefit substantially from executing on a GPU rather than a CPU, because most of them include some form of GPU acceleration. In 2018, Hitachi Vantara engineered and delivered the DS225 Advanced Server with NVIDIA Tesla V100 GPUs, its first GPU server designed specifically for DL implementations.

 

DL-8.jpg

More details about the GPU offering are available here: https://www.hitachivantara.com/en-us/pdfd/datasheet/advanced-server-ds225-datasheet.pdf

 

5 Why Organizations use PDI and Python with Deep Learning

 

  • Intuitive drag-and-drop tools: PDI makes the implementation and execution of DL frameworks easier with its graphical development environment for DL-related pipelines and workflows.
  • Improved collaboration: Data Engineers and Data Scientists can work on a shared workflow and use their skills and time efficiently.
  • Better allocation of valuable resources: The Data Engineer can use PDI to create the workflows, move and stage unstructured data files from/to HCP, and configure injected hyperparameters in preparation for the Python script received from the Data Scientist.
  • Best-in-class GPU processing: Hitachi Vantara offers the DS225 Advanced Server with NVIDIA Tesla V100 GPUs, which lets DL frameworks benefit from GPU acceleration.

 

 

 

1. Global Deep Learning Market Will Reach USD 23.6 Billion By 2024 End: Zion Market Research

Pentaho CDE Real Time Analysis Dashboard

About a month ago, Ken Wood posted a blog post sharing a real-time IoT data stream with LiDAR data and challenging the Pentaho Community to create an analysis and visualization with it. A few days later, he added another real-time IoT data stream with dust particulate sensor data.

 

Using the latest Pentaho 8.2.0 release, I was able to use PDI to fetch the data from each sensor, read the values from the JSON payload, and write all the incoming lines to a stage table.

 

For the dust particulate sensor, all that was needed was to create a Data Service to feed the CDE dashboard's charts, which show the last 10 minutes of data in a CCC line chart and the most recent measure in a CCC bar chart.

 

As for the LiDAR motion sensor data, some analysis was needed to determine the direction of the people detected by the sensor. After writing the data to the stage table, a sub-transformation is called that performs this analysis and writes the output to a fact table. The fact table holds one row per person, describing the behavior of that person crossing the entrance and hallway: where the person is coming from and going to, the timestamps of entering and leaving each area, whether they are going in, going out, or crossing the hallway, and the lag time spent in each area.
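
The direction analysis itself is implemented as a PDI sub-transformation, but the underlying idea can be sketched in Python; the field names and event structure below are assumptions for illustration only.

```python
# Illustration only: classify one person's movement from area timestamps.
# A timestamp of None means the person was never detected in that area.
def classify_movement(entered_entrance, left_entrance, entered_hallway, left_hallway):
    if entered_entrance and entered_hallway:
        # Seen in both areas: the order of entry gives the direction.
        direction = "in" if entered_entrance < entered_hallway else "out"
    elif entered_hallway:
        direction = "crossing"       # only seen in the hallway
    else:
        direction = "entrance only"  # only seen in the entrance
    lag_entrance = (left_entrance - entered_entrance) if entered_entrance else None
    lag_hallway = (left_hallway - entered_hallway) if entered_hallway else None
    return direction, lag_entrance, lag_hallway
```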

 

Here are the screenshots of the ETL Analysis Process created:

1dustDS.png

Image 1 - Particulate Sensor transformation

 

 

2lidarDS.png

Image 2 - LiDAR Motion Sensor transformation

 

3subLidar.png

Image 3 - LiDAR Motion Sensor sub-transformation

 

After this, I have all the data needed to make a cool visualization using CDE Dashboards.

 

 

The dashboard is divided into three sections:

  • The first, for the LiDAR data, shows the KPIs and the last 10 movements table, which is refreshed every time a person enters or leaves the hallway or entrance.
  • The second, related to the dust particulate sensor, shows the last 10 minutes of data in a line chart and the most recent measure received in a bar chart; it is updated every second, since we receive data from this sensor every second.
  • Finally, the third section shows the correlation between the LiDAR data and the dust particulate sensor over the last 24 hours, and it is refreshed every minute.

 

Here is the final result:

LiDAR and Dust Dashboard 2.png

Image 4 - CDE Real Time Dashboard

 

 

We will be deploying this solution on a server in the near future so everyone can access it and see it working, and we will update this blog post when that happens.