
Why Streaming Data? Why Now?

In the past 10 years, the data market has undergone a vast transformation, expanding far beyond traditional data integration and business intelligence involving the analysis of transactional data. One of the most important changes has been a huge growth in the data created by devices, vehicles, and other items outfitted with sensors and connected to the internet – the so-called Internet of Things (IoT). These devices generate IoT data at an unprecedented rate, and the rate of growth itself continues to rise. This wealth of new data provides many opportunities to improve business, societal, and personal outcomes, if used properly. However, it is quite easy to get swamped by the flood of data and neglect it until a later date, potentially resulting in many missed opportunities to improve sales, reduce downtime, or respond to a problem in real time. This is where streaming data tools come in handy. By processing, analyzing, and displaying data in real time, you can act upon your IoT data when it matters most – right now.


Fortunately, Hitachi Vantara realizes the importance of streaming data and has invested significant resources into the field, resulting in new products like the Lumada platform and Smart Data Center as well as new features for our software such as Pentaho Data Integration’s support of MQTT and Kafka via its native engine or Spark via the Adaptive Execution Layer (AEL). I’ve been exploring many of these new features, especially those specific to Pentaho, and wanted to share some architectures I’ve had success with that allow me to receive, process, and visualize data in real time.

Use Case & Requirements

One example of an IoT case where I’ve used Pentaho’s streaming data capabilities is the Smart Business booth at the Hitachi Next 2018 conference. For this demo, we had a table with 3 bowls of various candies available and asked conference visitors to select a treat. Meanwhile, a Hitachi lidar camera was watching the visitor, detecting which candy they picked as well as their estimated age, gender, emotion, and height. The goal was to have the camera’s analysis streamed to a real time dashboard where guests could see their estimated demographics and candy choice as well as historical choices made by other conference attendees.

Requirement graphic

The demonstration was a bit intimidating at first, as it had some considerable requirements. The lidar camera needed to capture and analyze the depth “images”, generate an analysis data file including candy choice, gender, age, height, and emotion, and send this data to a computer. Then, the computer needed to store this data for historical analysis while simultaneously visualizing it in an attractive manner. Furthermore, all of this needed to happen in a sub-second time frame for the demonstration to function as guests expect. Thankfully, with the new features in Pentaho I was confident we could pull this off, even with a relatively short time frame.

Possible Architectures

I considered a few possible architectures, all of which required the same basic flow. The data needed to be sent by the camera to the computer using some protocol (step 1), the data needed to be received by a messaging/queuing system (step 2), the data needed to be stored in a database (step 3), and the data needed to be quickly queried from the database or sent directly to the dashboard via another streaming protocol (step 4). The possibilities I considered for each of the steps are as follows:

  1. Camera to computer data protocol: Kafka, MQTT, or HTTP post
  2. Message receiver/queue: A Pentaho Data Integration (PDI) transformation using a Kafka consumer or MQTT consumer step, or PostgREST, a program that extends a Postgres database so it can be used via a RESTful API
  3. Receiver to database: A table output step in PDI after the Kafka or MQTT consumer, or PostgREST calling a stored procedure that in turn writes data to the database
  4. Query or stream to dashboard: A Pentaho streaming data service, an optimized SQL query, or postgres-websockets, a program that extends a Postgres database so that notifications issued by Postgres stored procedures can be sent to a client over a websocket
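
The four steps can be sketched with standard-library stand-ins. This is only an illustration of the flow, not the demo's implementation: `queue.Queue` stands in for the Kafka/MQTT/HTTP transport, an in-memory SQLite database stands in for Postgres, and the table, field, and function names are all hypothetical.

```python
import json
import queue
import sqlite3

# Stand-in for the camera-to-computer transport (Kafka, MQTT, or HTTP POST).
messages = queue.Queue()

# Stand-in for the Postgres database; the schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE picks (id INTEGER PRIMARY KEY, candy TEXT, age INTEGER, emotion TEXT)"
)


def camera_send(reading: dict) -> None:
    """Step 1: the camera publishes one analysis result as JSON."""
    messages.put(json.dumps(reading))


def receive_and_store() -> dict:
    """Steps 2 and 3: consume one message and write it to the database."""
    reading = json.loads(messages.get())
    db.execute(
        "INSERT INTO picks (candy, age, emotion) VALUES (?, ?, ?)",
        (reading["candy"], reading["age"], reading["emotion"]),
    )
    db.commit()
    return reading


def latest_pick() -> tuple:
    """Step 4 (query variant): fetch the newest row for the dashboard."""
    return db.execute(
        "SELECT candy, age, emotion FROM picks ORDER BY id DESC LIMIT 1"
    ).fetchone()


camera_send({"candy": "chocolate", "age": 34, "emotion": "happy"})
receive_and_store()
print(latest_pick())
```

The streaming variant of step 4 would push the reading to the dashboard directly instead of querying the table after the write.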

Here is a diagram detailing how the components from each step would work together:

Possible architectures

I had previously had success with an architecture using HTTP POST (step 1), PostgREST (step 2), a stored procedure (step 3), and postgres-websockets (step 4), which made it a tempting solution to reach for. However, I realized that the postgres-websockets component might be overkill: it was only required on the prior project because the stored procedure had been taking too long to write data to disk. We remedied that issue by sending the data to the websocket first and then writing it to the database. For this use case, I didn’t believe the data would take long to be written to disk, so I opted to swap out postgres-websockets for an optimized SQL query. Overall this solution was a success, but in 1 out of every 1000 cases there would be a slight delay in the optimized SQL query finishing, causing a bit of lag before the data showed up on the dashboard. Although this wasn’t make-or-break for the demo, I decided to press on to see if we could find a better solution.


The second architecture we tried was Kafka (step 1), a PDI transform with a Kafka consumer step (step 2), a table output step after the Kafka consumer step (step 3), and a streaming data service (step 4). This solution proved to be much faster and more reliable for live data, with none of the timing issues of the prior solution. However, it did require us to query historical data separately, which made it difficult to join the two sources in components that display both live and historical figures. Although this joining was a bit tricky, this all-Pentaho solution proved to be much easier to set up and manage, and it performed better than the other architectures I’ve used, including those built on third-party tools. Therefore, we chose to implement this architecture.

Putting it All Together

Although I won’t go into it, we next hardened the architecture, developed the transformations, created the dashboard, and tested our solution. Before we knew it, it was time to head to Next 2018 where the demo would be put to the real test. How’d we do? Check out the media below!


In person.jpg


By utilizing an all-Pentaho architecture, we were able to create a reliable, fast, end-to-end live data pipeline that met our goal of sub-second response time. It was impressive enough to wow conference attendees and simple enough that it didn’t require a line of code. Streaming data is here, and growing, and Pentaho provides real solutions to receive, process, and visualize it, all in near-real time.

Our Customer Success & Support teams are always working on tips and tricks that will help our customers with the Pentaho platform.


PDI Development and Lifecycle Management


A successful data integration (DI) project incorporates design elements for your DI solution to integrate and transform your data in a controlled manner. Planning out the project before starting the actual development work can help your developers work in the most efficient way.


While the data transformation and mapping rules will fulfill your project’s functional requirements, there are other nonfunctional requirements to keep in mind as well. Consider the following common project setup and governance challenges that many projects face:


  • Multiple developers will be collaborating, and such collaboration requires development standards and a shared repository of artifacts.
  • Projects can contain many solutions and there will be the need to share artifacts across projects and solutions.
  • The solution needs to integrate with your Version Control System (VCS).
  • The solution needs to be environment-agnostic, requiring the separation of content and configuration.
  • Deployment of artifacts across environments needs to be automated.
  • Failed jobs should require a simple restart, so restartability must be part of job design.
  • The finished result will be supported by a different team, so logging and monitoring should be put in place to support the solution.
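
As one illustration of the restartability requirement, the sketch below records each completed step in a checkpoint file so that a rerun skips work that already succeeded. It is a generic pattern written with the Python standard library, not the mechanism described in the best practices document; `run_job`, the step names, and the checkpoint file name are hypothetical.

```python
import json
import os
import tempfile


def run_job(steps, checkpoint_path):
    """Run named steps in order, skipping any that a previous run completed."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        func()
        done.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)  # record progress after each successful step
    os.remove(checkpoint_path)  # job finished; the next run starts fresh
    return done


calls = []
state = {"fail_once": True}


def extract():
    calls.append("extract")


def load():
    if state["fail_once"]:
        state["fail_once"] = False
        raise RuntimeError("simulated failure")
    calls.append("load")


steps = [("extract", extract), ("load", load)]
checkpoint = os.path.join(tempfile.mkdtemp(), "job.checkpoint.json")

try:
    run_job(steps, checkpoint)  # first run fails inside "load"
except RuntimeError:
    pass

run_job(steps, checkpoint)  # simple restart: "extract" is skipped
print(calls)
```

The restart only works cleanly if each step is safe to rerun from its beginning, which is the design constraint the bullet above points at.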


We have produced a best practices document, PDI Development and Lifecycle Management, that focuses on the essential and general project setup and lifecycle management requirements for a successful DI solution. It includes a DI framework that illustrates how to implement these concepts and separate them from your actual DI solution. It is available for download from Pentaho Data Integration Best Practices.


Guidelines for Pentaho Server and SAML


SAML is a specification that provides a means to exchange an authentication assertion of the principal (user) between an identity provider (IdP) and a service provider (SP). Once the plugin is built and installed, your Pentaho Server will become a SAML service provider, relying on the assertion from the IdP to provide authentication.


Security and server administrators, or anyone with a background in authentication or authorization with SAML, will be most interested in the Pentaho Server SAML Authentication with Hybrid Authorization document available for download from Advanced Security.

Update 02/20: 330 million tons of plastic waste are produced annually. So why not use big data for recycling and upcycling? Jürgen Sluyterman will share RSAG's analytics project with Pentaho, which integrates data from waste management systems and SAP.


Update 02/19: Ken Wood will share updates on machine learning and PMI, Plugin Machine Intelligence. Learn more in this interview with Ken


Update 01/31: Did you ever think of Pentaho as a Swiss Army knife? Then Jens Junker's talk is for you! Coming from VNG, Jens will explain how they integrate their IT systems and implement business processes with Pentaho Data Integration. Read the interview with Jens


Update 01/23: Deutsche See, one of Europe's leading fish wholesalers, will talk about why they have used Pentaho since 2009 and how they do sales reporting for their €400 million in revenue and 35,000 business customers. Learn more in this interview with Helmut Borghorst, Head of IT and Sales at Deutsche See.


Hey Pentaho users,


We have finally fixed the date for the Pentaho User Meeting for the German/Austrian/Swiss community!


March 26, 2019 in Frankfurt, Germany

Check out the details on the event page


Join us to learn from other users and share experiences with Pentaho (and maybe your latest hack). As usual, we're organizing a full day of presentations, including the notorious Pedro Alves and Jens Bleuel, who will give us an update on what's going on in Pentaho development. In the evening, we will have pizza, beer and lots of socializing.



This year we decided to do something new. As the Pentaho ecosystem keeps growing fast, we're preparing a little showroom area to present products and solutions from the Pentaho universe.

If you're interested in presenting your solution, please contact Sarah at


Call for Papers

Please share your Pentaho experience with us! We're accepting all kinds of proposals: use cases, projects, technical stuff... The language is German, though exceptions are allowed.

Please send your proposal to Stefan at



09:30 - 10:00 Registration

10:00 - 10:10 Introduction by Stefan Müller, Director Big Data Analytics, it-novum

10:10 - 10:40 Pentaho 8.2: Big data integration, IoT analytics and data from the cloud. Pedro Alves, Head of Product Design, Hitachi Vantara

10:40 - 11:10 What is new at Pentaho Data Integration 8.2? Jens Bleuel, Senior Product Manager - Pentaho Data Integration, Hitachi Vantara

11:10 - 11:40 Pentaho at Deutsche See: challenges, problems and solutions. Helmut Borghorst, Leiter Verkaufsinnendienst und IT-Systeme, Deutsche See read the interview

11:40 - 12:20 Resource management with Pentaho data warehouse and SAP and waste management software integration. Jürgen Sluyterman, IT Director, RSAG & Christopher Keller, it-novum


12:20 - 14:20 Showroom and lunch break – learn more about exciting solutions from the Pentaho eco system!


14:20 - 14:50 Machine Learning with Pentaho. Ken Wood, VP Hitachi Vantara Labs read the interview

14:50 - 15:20 Pentaho Data Integration – glue and swiss knife. Connecting IT silos, automation of business processes and implementation of business requirements. Jens Junker, HR Anwendungsbetreung PFM / Risikomanagement, VNG Handel & Vertrieb GmbH read the interview

15:20 - 15:40 Video Analytics. Gunther Dell, Director Global Business Development SAP, Hitachi Vantara

Pentaho/SAP Connector. Christopher Keller, Head of Consulting BDA, it-novum read the interview


15:40 - 16:00 Coffee break

16:00 - 17:00 More talks…


17:00 - open end: Get together with pizza and beer!


See you on March 26!

by Carl Speshock, Senior Technical Product Manager


1. Why Model Management


Data Scientists are constantly exploring and working with the latest Machine Learning (ML) and Deep Learning (DL) libraries. This has increased the number of model taxonomies that Data Engineers manage and monitor in production environments. To address this need, Hitachi Vantara developed a model management solution that supports the evolving requirements of the data scientist and engineer community.


Pentaho offers the concept of the Model Management Reference Architecture as shown below.




2. Why Model Management with Pentaho?


Data Engineers choose PDI for model management for the following reasons. It:


  • Operationalizes ML/DL Models for Production usage, expanding upon Pentaho’s Data Science abilities via the ability to implement a Model Management solution.
  • Enables data engineers to work in PDI's drag-and-drop GUI environment to implement Python script files received from Data Scientists within the new PDI Python Executor Step.
  • Utilizes input(s) from previous PDI steps as variables, NumPy arrays, and/or Pandas DataFrame objects within DL processing. This integrates the DL Python script file with the PDI workflow and data stream.


The Pentaho model management architecture includes:

  • PDI: Handles the Data Wrangling/Feature Engineering performed by the Data Engineer to fulfill production data set request(s) from the Data Scientist.
  • Python Executor Step: Executes Python scripts containing ML and DL libraries, and more.
  • The ML/DL Model: Developed by a Data Scientist within a Data Science notebook such as Jupyter Notebook.
  • The Model Catalog: Stores versioned models within an RDBMS, NoSQL database, or other data store.
  • Champion/Challenger Multi-Model Analysis: Pulls models from Model Catalog and runs iteratively with hyperparameters to analyze and evaluate model performance.


3. How do we Manage Models within PDI?


Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step
  • Python 2.7.x or Python 3.5.x

Review Pentaho 8.2 Python Executor Step in Pentaho online help for list of dependencies.

Basic process:


1. Build a Model Management workflow in PDI following the example below:


  • Create Model Catalog: Build models and store model run info, evaluation metrics, model version, etc. into the Model Catalog.
  • Setup Champion/Challenger runs: For each run, Challenger models compete against the Champion Model, which is the current model in production that Data Scientists are using.
  • Run Champion/Challenger Models: Pre-determined Challenger models will be utilized from the Model Catalog and run against the current Champion. The Model with the best evaluation metrics will be set to Champion. This keeps the current Champion or replaces it with a more accurate Challenger model.
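
A Model Catalog of this kind can be as small as one table. The sketch below is an assumption-laden illustration, not the catalog from the example workflow: it uses an in-memory SQLite database, a hypothetical `model_catalog` schema with a single `accuracy` metric, and made-up model names and scores.

```python
import sqlite3

# Hypothetical catalog schema: one row per model run.
catalog = sqlite3.connect(":memory:")
catalog.execute(
    """CREATE TABLE model_catalog (
           model_name TEXT,
           version    INTEGER,
           accuracy   REAL
       )"""
)


def register_run(name: str, version: int, accuracy: float) -> None:
    """Store one model run and its evaluation metric in the catalog."""
    catalog.execute(
        "INSERT INTO model_catalog (model_name, version, accuracy) VALUES (?, ?, ?)",
        (name, version, accuracy),
    )


def top_models(n: int) -> list:
    """Return the n best runs as (model_name, version, accuracy), best first."""
    return catalog.execute(
        "SELECT model_name, version, accuracy FROM model_catalog "
        "ORDER BY accuracy DESC LIMIT ?",
        (n,),
    ).fetchall()


# Made-up runs, for illustration only.
register_run("decision_tree", 1, 0.81)
register_run("random_forest", 1, 0.88)
register_run("dnn_classifier", 2, 0.86)
register_run("logistic_regression", 1, 0.77)

print(top_models(3))
```

Storing the version alongside the metric is what lets a later Champion/Challenger run pull specific, reproducible candidates from the catalog.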




2. In PDI, use the new 8.2 Python Executor Step




3. Model Experimentation is the first part of the Model Management workflow, which builds and trains models as well as stores evaluation metric information in a Model Catalog (figure below showcases 5 functional areas of processing). PDI offers a GUI drag and drop environment that could also add Drift Detection, Data Deviation Detection, Canary Pipelines and more.




This example uses four models. One of the models has in-line hyperparameters injected, using a subset of the labeled production data. All of this can be adjusted to fit your environment (e.g. modified to incorporate labeled or unlabeled, non-production data to accommodate your requirements).



4. The next stage in the workflow is the Setup Champion/Challenger run. Place one model into a Champion folder and the remaining models in the Challenger folder. This can be accomplished using the PDI Copy Files step, with an example shown below:
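
Outside PDI, the same folder setup can be sketched in a few lines of Python. This is only an equivalent of what the Copy Files step is described as doing; the folder names, the `stage_models` helper, and the placeholder `.pkl` files are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def stage_models(model_files, champion_name, root):
    """Copy one model into champion/ and every other model into challenger/."""
    champion_dir = Path(root) / "champion"
    challenger_dir = Path(root) / "challenger"
    champion_dir.mkdir(parents=True, exist_ok=True)
    challenger_dir.mkdir(parents=True, exist_ok=True)
    for path in model_files:
        name = Path(path).name
        target = champion_dir if name == champion_name else challenger_dir
        shutil.copy(path, target / name)
    return champion_dir, challenger_dir


# Create four placeholder model files so the sketch runs anywhere.
source = Path(tempfile.mkdtemp())
files = []
for name in ("model_a.pkl", "model_b.pkl", "model_c.pkl", "model_d.pkl"):
    placeholder = source / name
    placeholder.write_bytes(b"placeholder")
    files.append(placeholder)

champion_dir, challenger_dir = stage_models(files, "model_a.pkl", tempfile.mkdtemp())
print(sorted(f.name for f in champion_dir.iterdir()))
print(sorted(f.name for f in challenger_dir.iterdir()))
```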






5. Now, perform a Champion/Challenger run with a subset of the models, here the three highest-performing ones. Decide on an evaluation metric or other value for comparing which model performed best. The Champion/Challenger run can write log entries or send emails to record the results.
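
The keep-or-replace decision itself is simple to express. The sketch below assumes a single numeric metric where higher is better and keeps the current Champion on ties; `run_champion_challenger`, the model names, and the scores are made up for illustration, and the logging calls stand in for the log entries mentioned above.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("champion_challenger")


def run_champion_challenger(champion, challengers, evaluate):
    """Return (winner, score); the champion is kept unless strictly beaten."""
    best_name, best_score = champion, evaluate(champion)
    log.info("champion %s scored %.3f", champion, best_score)
    for name in challengers:
        score = evaluate(name)
        log.info("challenger %s scored %.3f", name, score)
        if score > best_score:  # only a strictly better score replaces the champion
            best_name, best_score = name, score
    log.info("champion after this run: %s", best_name)
    return best_name, best_score


# Made-up scores standing in for real model evaluation.
scores = {"model_a": 0.84, "model_b": 0.88, "model_c": 0.79}
winner, score = run_champion_challenger("model_a", ["model_b", "model_c"], scores.get)
print(winner, score)
```

Keeping the Champion on ties avoids swapping the production model without a measurable gain; a real workflow might also require a minimum margin.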



The example shows the models using production data, which could be modified to incorporate either labeled or unlabeled, non-production data to accommodate your requirements. PDI offers a GUI drag and drop environment that could add Drift Detection, Data Deviation Detection, Canary Pipelines and more.



6. Run the PDI Job to execute the Model Management Workflow. It will write its output to the PDI log, with the Champion/Challenger results displayed (see below):




4. Why do Organizations use PDI and Python for Deep Learning?

A dedicated data preparation tool is no longer optional when operationalizing your models in a production environment; it is a necessity. Pentaho 8.2 can assist you in this effort.


  • PDI makes the implementation and execution of Model Management solutions easier with its graphical development environment for related pipelines and workflows. It is easy to customize your Model Management workflow to meet specific requirements.
  • Data Engineers and Data Scientists can collaborate on the workflow and utilize their skills and time more efficiently.
  • The Data Engineer can use PDI to create the workflow in preparation for the Python scripts/models received from the Data Scientist and implemented in the Model Catalog.

Pentaho Data Integration (PDI) and Data Science Notebook Integration

Evan Cropper – PMM - Analytics

1. Using PDI, Data Science Notebook, and Python Together


Data Scientists are great at developing analytical models to achieve specific business results. However, deploying a model in a production environment requires a different skillset than data exploration and model development. The result is wasted human capital: the data scientist spends a significant amount of time engineering the data, work that is often rewritten by your data engineers for production deployments.


So why not allow the data scientist and data engineer to focus on what they do best? With Pentaho PDI’s drag-and-drop GUI environment, data engineers can prepare and orchestrate data to flow seamlessly into a Data Scientist’s notebook, e.g. Jupyter (which is the focus of this blog). With Pentaho, Data Scientists can explore analytical models using the Python programming language and related Machine Learning and Deep Learning frameworks with cleansed data. The result? Models ready for production environments at a fraction of the cost and delivered in a fraction of the time.


2. Why would you want to use PDI and the Jupyter Notebook together to develop models in Python?

Pentaho allows Data Scientists to spend their time on data science models instead of data prep tasks and makes it easier to share Python scripts between data scientists and data engineers. By choosing Pentaho to operationalize data science, organizations can:


  1. Utilize a graphical drag and drop development environment, which makes data engineering easier with a toolbox of connectors to many data sources - easily configured instead of coded, tools that can blend data from multiple sources, and transformation steps to cleanse and normalize the data.
  2. Migrate data to production environments with minimal changes.
  3. Scale seamlessly to address growing production data volumes, and
  4. Share production-quality data sets and Python scripts between Data Engineers and Data Scientists, as shown below in a collaborative workflow between the two personas:



3. How do you develop Python models using PDI and Jupyter Notebooks?


Dependencies and components tested with:

  • Pentaho versions: 8.1/8.2
  • Python 2.7.x or Python 3.5.x
  • Jupyter Notebook 5.6.0
  • Python JDBC package dependencies, i.e. JayDeBeApi and jpype

See the Pentaho PDI Data Service online help for configuration, installation, client jars, etc.


Basic process:


  1. In PDI, create a new Transformation connected to the Pentaho Server repository. Implement all of your data connections, blending, filtering, cleansing, etc., as shown in the example below:




2. Use PDI's Data Service feature to export rows from the PDI transformation (which later will be consumed in a Jupyter Notebook). Create a New Data Service by right-clicking on the last step in the transformation. Test the Data Service within the UI and select Save Transformation As to save the Data Service to Pentaho Server.




3. Before the Data Scientist can work in the Jupyter Notebook, utilize a Data Grid Step to review the Data Grid Fields and Data Values. These input variables will flow into the Python Executor Step. Note that the Data Engineer can easily change them for new PDI Data Services.







4. Below, the Python Executor – Create Jupyter Notebook – Python API step contains the Python Script, Input and Output references, and more. From here, the Data Engineer can create the Jupyter Notebook for the Data Scientist to consume.



5. The Python Executor – Create Jupyter Notebook – Python API step automatically populates the Jupyter Notebook (shown below) with the cleansed and orchestrated data from the transformation. The Data Scientist is connected directly to the PDI Data Service created earlier by the Data Engineer.




6. Data Scientists will retrieve the Jupyter Notebook file created by the Data Engineer and PDI (e.g. from an Enterprise Data Catalog or file share). The Data Scientist will confirm the output from the Python Pandas DataFrame named df in the last cell.
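
For readers without the screenshots, a notebook cell that reads from a PDI Data Service over JDBC might look roughly like the sketch below. Treat every specific here as an assumption to check against the Pentaho Data Service help: the URL format, the thin driver class name, the host and port, and the `load_rows` helper are not taken from this post. Only `pdi_data_service_url` runs without a server; `load_rows` additionally needs JayDeBeApi, JPype, a JVM, and the Pentaho client jars.

```python
def pdi_data_service_url(host: str, port: int, webapp: str = "pentaho") -> str:
    """Build a thin-driver JDBC URL for a Pentaho Server (assumed URL format)."""
    return f"jdbc:pdi://{host}:{port}/kettle?webappname={webapp}"


def load_rows(service_name: str, url: str, user: str, password: str, jars: list):
    """Query a data service by name and return (column_names, rows)."""
    import jaydebeapi  # imported lazily so the URL helper works without it

    conn = jaydebeapi.connect(
        "org.pentaho.di.trans.dataservice.jdbc.ThinDriver",  # assumed driver class
        url,
        [user, password],
        jars,  # paths to the Pentaho data service client jars
    )
    try:
        cursor = conn.cursor()
        cursor.execute(f'SELECT * FROM "{service_name}"')
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()


url = pdi_data_service_url("localhost", 8080)
print(url)
```

The returned columns and rows could then be handed to `pandas.DataFrame(rows, columns=columns)` to produce a frame like the df referred to above.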


7. From here, the Data Scientist can begin building, evaluating, processing and saving the machine and deep learning models by utilizing the Pandas Data Frame named df. An example is shown below using a Machine Learning Decision Tree Classifier.







4. How can Data Engineers and Data Scientists using Python collaborate better with PDI and Jupyter Notebooks?


  1. PDI’s graphical development environment makes data engineering easier.
  2. Data Engineers can easily migrate PDI applications to production environments with minimal changes.
  3. Data Engineers can scale PDI applications to meet production data volumes.
  4. Data Engineers can quickly respond to Data Scientist’s data set requests with PDI Data Services.
  5. Data Scientists can easily access Jupyter Notebook templates connected to a PDI Data Service.
  6. Data Scientists can quickly pull data on demand from the Data Service and get to work on what they do best!

Anand Rao, Principal Product Marketing Manager, Pentaho



1 Deep Learning – What is the Hype?


According to Zion Market Research, the deep learning (DL) market will increase from $2.3 billion in 2017 to over $23.6 billion by 2024. With a compound annual growth rate (CAGR) of almost 40%, DL has become one of the hottest areas for Data Scientists to create models[1]. Before we jump into how Pentaho can help operationalize your organization’s DL models within production environments, let’s take a step back and review why DL can be so disruptive. Below are some of the characteristics of DL. It:






  • Uses Artificial Neural Networks with multiple hidden layers that can perform powerful image recognition, computer vision/object detection, video stream processing, natural language processing and more. Improvements in DL offerings and in processing power, such as GPUs and the cloud, have accelerated the DL boom in the last few years.
  • Attempts to mimic the activity in the human brain via layers of neurons, learning to recognize patterns in digital representations of sounds, video streams, images, and other data.
  • Reduces the need to perform feature engineering prior to running the model through the use of multiple hidden layers, performing feature extraction on the fly when the model runs.
  • Improves on the performance and accuracy of traditional Machine Learning algorithms due to updated frameworks, the availability of very large data sets (i.e. Big Data), and major improvements in processing power (e.g. GPUs).
  • Provides development frameworks, environments, and offerings, e.g. Tensorflow, Keras, Caffe, and PyTorch, that make DL more accessible to data scientists.
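
As a quick check on the market figures quoted at the start of this section, the compound annual growth rate follows from the start value, end value, and number of years. The `cagr` helper is just the standard formula; only the three input numbers come from the text.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.40 means 40% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# $2.3 billion in 2017 growing to $23.6 billion by 2024.
rate = cagr(2.3, 23.6, 2024 - 2017)
print(f"{rate:.1%}")
```

The result lands just under 40% per year, consistent with the "almost 40%" stated above.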


2 Why should you use PDI to develop and operationalize Deep Learning models in Python?


Today, Data Scientists and Data Engineers have collaborated on hundreds of data science projects built in PDI. With Pentaho, they’ve been able to migrate complex data science models to production environments at a fraction of the costs as compared to traditional data preparation tools. We are excited to announce that Pentaho can now bring this ease of use to DL frameworks, furthering Hitachi Vantara’s goal to enable organizations to innovate with all their data. With PDI’s new Python executor step, Pentaho can:

  • Integrate with popular DL frameworks in a transformation step, expanding upon Pentaho’s existing robust data science capabilities.
  • Easily implement DL Python script files received from Data Scientists within the new PDI Python Executor Step
  • Run DL models on any CPU/GPU hardware, enabling organizations to use GPU acceleration to enhance performance of their DL models.
  • Incorporate data from previous PDI steps, via data pipeline flow, as a Python Pandas DataFrame or NumPy array within the Python Executor step for DL processing
  • Integrate with Hitachi Content Platform (HDFS, Local, S3, Google Storage, etc.) allowing for the movement and positioning of unstructured data files to a locale, (i.e. Data Lake, etc.) and reducing DL storage and processing costs.



  • PDI supports most widely used DL frameworks, i.e. Tensorflow, Keras, PyTorch and others that have a Python API, allowing Data Scientists to work within their favorite libraries.
  • PDI enables Data Engineers and Data Scientists to collaborate while implementing DL
  • PDI allows for efficient allocation of skills and resources of the Data Scientist (i.e. build, evaluate and run DL models) and the Data Engineer (Create Data pipelines in PDI for DL processing) personas.


3 How does PDI operationalize Deep Learning?


Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step, Hitachi Content Platform (HCP) VFS
  • Python 2.7.x or Python 3.5.x
  • Tensorflow 1.10
  • Keras 2.2.0

Review Pentaho 8.2 Python Executor Step in Pentaho On-line Help for list of dependencies. Python Executor - Pentaho Documentation

Basic process:


1. Select an HCP VFS file location within a PDI step. Copy and stage unstructured data files for use by DL framework processing within the PDI Python Executor Step.



2. Utilize a new Transformation that will implement workflows for processing DL frameworks and associated data sets, etc. Inject Hyperparameters (values to be used for tuning and execution of models) to evaluate the best performing model. Below is an example that implements four DL framework workflows, three using Tensorflow and one using Keras, with the Python Executor step.









3. Focusing on the Tensorflow DNN Classifier workflow (which implements injection of hyperparameters), utilize a PDI Data Grid Step, e.g. one named Injected Hyperparameters, with values used by the corresponding Python Script Executor steps.



4. Within the Python Script Executor step use Pandas DF and implement the Injected Hyperparameters and values as variables in the Input Tab
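
Inside the script, injected values typically need to be turned into typed parameters before they reach the DL framework. The sketch below shows one way to do that, assuming the values arrive as strings in a single row; the hyperparameter names, the `to_hyperparameters` helper, and the sample values are hypothetical and not taken from the example transformation.

```python
def to_hyperparameters(row: dict) -> dict:
    """Convert one row of string fields into typed hyperparameter values."""
    casts = {
        "learning_rate": float,
        "batch_size": int,
        "steps": int,
        # A comma-separated list of layer sizes, e.g. "10,20,10".
        "hidden_units": lambda text: [int(u) for u in text.split(",")],
    }
    # Fields without a known cast are passed through as strings.
    return {name: casts.get(name, str)(value) for name, value in row.items()}


# One row as it might arrive from an "Injected Hyperparameters" Data Grid step.
row = {
    "learning_rate": "0.01",
    "batch_size": "32",
    "steps": "2000",
    "hidden_units": "10,20,10",
}
params = to_hyperparameters(row)
print(params)
```

Keeping the casts in one table makes it easy for the Data Engineer to change injected values without touching the model code.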



5. Execute the DL related Python script (either via Embedding or a URL to a file) and reference a DL framework and Injected Hyperparameters from inputs. Also, you can set the Python Virtual Environment to a path other than what is the default Python install.



6. Verify that Tensorflow is installed and configured, and that it imports correctly in a Python shell.
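
A small check like the following can be run in the Python environment that the Python Executor step points to. It uses only the standard library; `check_framework` is a hypothetical helper, and the module names are the frameworks listed above.

```python
import importlib
import importlib.util


def check_framework(module_name: str):
    """Return (ok, detail): detail is a version string or the reason for failure."""
    if importlib.util.find_spec(module_name) is None:
        return False, "not installed"
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # installed, but the import itself fails
        return False, f"import failed: {exc}"
    return True, getattr(module, "__version__", "unknown version")


for name in ("tensorflow", "keras"):
    ok, detail = check_framework(name)
    print(f"{name}: {'OK' if ok else 'PROBLEM'} ({detail})")
```

Separating "not installed" from "import failed" helps distinguish a missing package from a misconfigured one, such as a mismatched GPU library.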




7. Going back to the Python Executor Step, click on the Output Tab and then click on the Get Fields button. PDI will pre-check the script file for errors, outputs, etc.



8. This completes the configurations for the running of the transformation.


4 Does Hitachi Vantara also offer GPU offerings to accelerate Deep Learning execution?


DL frameworks can benefit substantially from executing on a GPU rather than a CPU because most DL frameworks support some type of GPU acceleration. In 2018, Hitachi Vantara engineered and delivered the DS225 Advanced Server with NVIDIA Tesla V100 GPUs. This is Hitachi Vantara’s first GPU server designed specifically for DL implementation.



More details about the GPU offering are available here:


5 Why Organizations use PDI and Python with Deep Learning:


  • Intuitive drag and drop tools: PDI makes the implementation and execution of DL frameworks easier with its graphical development environment for DL related pipelines and workflows.
  • Improved collaboration: Data Engineers and Data Scientists can work on a shared workflow and utilize their skills and time efficiently.
  • Better allocation of valuable resources: The Data Engineer can use PDI to create the workflows, move and stage unstructured data files from/to HCP, and configure injected hyperparameters in preparation for the Python script received from the Data Scientist.
  • Best-in-Class GPU Processing: Hitachi Vantara offers the DS225 Advanced Server with NVIDIA Tesla V100 GPUs that allow DL frameworks to benefit from GPU acceleration.




1. Global Deep Learning Market Will Reach USD 23.6 Billion By 2024 End: Zion Market Research

Pentaho CDE Real Time Analysis Dashboard

About a month ago, Ken Wood published a blog post sharing a real-time IoT data stream with LiDAR data and challenging the Pentaho Community to create an analysis and visualization with it. A few days later, he added another real-time IoT data stream with dust particulate sensor data.


Using the latest Pentaho 8.2.0 release, with PDI I was able to fetch the data from each sensor, read the values from the JSON format and write all the incoming lines to a stage table.


For the dust particulate sensor, all that was then needed was a Data Service to feed the CDE Dashboard's charts, which show the last 10 minutes of data in a CCC Line Chart and the most recent measure in a CCC Bar Chart.
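
The windowing behind those two charts can be illustrated in plain Python. This is a sketch of the idea only, not how the Data Service computes it: `RollingWindow` is a hypothetical class, timestamps are plain seconds, readings are assumed to arrive in time order, and the values are synthetic.

```python
from collections import deque

WINDOW_SECONDS = 10 * 60  # the line chart covers the last 10 minutes


class RollingWindow:
    """Keep only the readings that fall inside the most recent time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.readings = deque()  # (timestamp_seconds, value), oldest first

    def add(self, timestamp, value):
        self.readings.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self.readings and self.readings[0][0] <= cutoff:
            self.readings.popleft()  # drop readings older than the window

    def line_chart_data(self):
        """All readings in the window, for the line chart."""
        return list(self.readings)

    def latest(self):
        """The most recent reading, for the bar chart."""
        return self.readings[-1] if self.readings else None


window = RollingWindow()
for second in range(900):  # 15 minutes of synthetic one-per-second readings
    window.add(second, 10.0 + second % 5)

print(len(window.line_chart_data()), window.latest())
```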


As for the LiDAR motion sensor data, some analysis was needed to determine the direction of the people detected by the sensor. After the data is written to the stage table, a sub-transformation is called to do that analysis and write the output to a fact table. This fact table holds only 1 line per person, summarizing the behavior of the person crossing the entrance and hallway: where the person is coming from and going to, the timestamps of entering and leaving each area, whether the person is going in, going out, or crossing the hallway, and the lag time in each of the areas.
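
The reduction to one row per person can be sketched as follows. The sensor's actual message format and the sub-transformation's rules are not given in this post, so everything here is an assumption: the event fields, the two area names, the rule that ending in the hallway means "in", and the reading of lag time as seconds spent in an area.

```python
from collections import defaultdict


def build_fact_rows(events):
    """Collapse per-area sightings into one summary row per person."""
    by_person = defaultdict(list)
    for event in events:
        by_person[event["person"]].append(event)

    rows = []
    for person, sightings in by_person.items():
        sightings.sort(key=lambda e: e["enter"])
        first, last = sightings[0], sightings[-1]
        if first["area"] == last["area"]:
            direction = "crossing"  # never left the area they started in
        elif last["area"] == "hallway":
            direction = "in"  # assumed: entrance -> hallway means going in
        else:
            direction = "out"
        seconds_in_area = defaultdict(int)
        for e in sightings:
            seconds_in_area[e["area"]] += e["leave"] - e["enter"]
        rows.append({
            "person": person,
            "from_area": first["area"],
            "to_area": last["area"],
            "direction": direction,
            "seconds_in_area": dict(seconds_in_area),
        })
    return rows


# Hypothetical stage-table rows; times are seconds since the start of the stream.
events = [
    {"person": 1, "area": "entrance", "enter": 0, "leave": 4},
    {"person": 1, "area": "hallway", "enter": 4, "leave": 9},
    {"person": 2, "area": "hallway", "enter": 2, "leave": 7},
]
for row in build_fact_rows(events):
    print(row)
```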


Here are the screenshots of the ETL Analysis Process created:


Image 1 - Particulate Sensor transformation



Image 2 - LiDAR Motion Sensor transformation


Image 3 - LiDAR Motion Sensor sub-transformation


After this, I have all the data needed to make a cool visualization using CDE Dashboards.



The dashboard is divided in 3 sections:

  • The first, for the LiDAR data, shows the KPIs and the Last 10 movements table, which is refreshed every time a person enters or leaves the hallway or entrance.
  • The second, for the dust particulate sensor, shows the last 10 minutes of data in a line chart and the most recent measurement received in a bar chart. It is updated every second, since we receive data from this sensor every second.
  • Finally, the third section shows the correlation between the LiDAR data and the dust particulate sensor data for the last 24 hours, and is refreshed every minute.


Here is the final result:

Image 4 - CDE Real Time Dashboard



We will be deploying this solution on a server in the near future so everyone can access it and see it working, and we will update this blog post when it is available.

Hi everyone,


Thank you for attending Pentaho Community Meeting 2018! With 220 attendees from 5 continents and 29 speakers, it was a great event.


This page contains all PCM18 resources - summaries of the talks, presentation slides and pictures.





Do you think content is missing, or would you like to add yours? Please leave a comment below.


Thank you again for contributing to PCM18! Thanks to Pentaho User Group Italia for welcoming us, thanks to all participants for joining, thanks to Dan Keeley for sponsoring the hackathon and A BIG THANK YOU to the fantastic speakers who shared their experiences and innovations with us. We at it-novum were happy to help make this an inspiring community meeting again!


Keep the good spirit and see you at PCM19!


PS: Save the date for German Pentaho User Meeting on March 26, 2019 in Frankfurt

Easy to Use Deep Learning and Neural Networks with Pentaho

By Ken Wood and Mark Hall



Hitachi Vantara Labs is excited to release a new version of the experimental plugin, Plugin Machine Intelligence version 1.4. Along with several minor fixes and enhancements is the addition of a new execution engine for performing deep learning and executing other machine learning algorithms using neural networks. The whole mission of Pentaho and Hitachi Vantara Labs is to make complex technology simple to use and deploy, and the Plugin Machine Intelligence (PMI) is a huge advancement towards making machine learning and artificial intelligence part of this mission.


Back in October, I shared a glimpse of what's coming in a blog, Artificial Intelligence with Pentaho, that describes a demonstration using artificial intelligence elements. PMI and Pentaho Data Integration with deep learning provide the main artificial intelligence capability behind that demonstration. Feel free to ask us more questions about the use of deep learning models in PDI transformations. We will also be blogging more details and "how to" posts about that demonstration and how to build some of those elements with PDI.


We call this plugin "experimental" because it is a research project from HV Labs and is released openly for the Pentaho community and users to try out and experiment with. We refer to this as "early access to advanced, experimental capabilities". As such, it is not a supported product or service at this time.


Deep learning is a recent addition to the artificial intelligence domain of machine learning. PMI initially focuses on supervised machine learning schemes which means there is a continuous or categorical target variable that is being "learned" from a dataset of labeled training data. This deep learning integration is also a supervised learning scheme.




The new release of PMI v1.4 can be downloaded and installed from the PDI and Spoon Marketplace. If you are already running a previous version of PMI, check the installation documentation for guidance on getting your system ready for PMI v1.4. If you are not using PMI at all, the Marketplace will install the new PMI v1.4 for you. During the PMI v1.4 installation from the Marketplace, PMI will automatically install WEKA, Spark MLlib and Deep Learning for Java (DL4j) as included machine learning engines. You will need to install and set up Python with the scikit-learn library, and R with the Machine Learning with R (MLR) library, yourself. The installation process will then configure them into PMI if they are installed and set up correctly. Again, check the installation documentation for your system.


This means there are now 5 machine learning execution engines integrated in PMI for PDI providing you with many options for training, building, evaluating and executing machine learning models. In fact, some of the existing machine learning algorithms that are available for WEKA, scikit-learn, MLlib and MLR, can also execute on DL4j, like Logistic Regression, Linear Regression and Support Vector Classifier. There are also 2 new machine learning algorithms "exposed" from the scikit-learn, Weka and MLR libraries. They are the Multi-layer Perceptron Classifier and a Multi-layer Perceptron Regressor. These algorithms were exposed from the scikit-learn library to help us write some additional developer documentation on how to expose algorithms to the PMI framework.


Of course the most exciting part of this release is the ability to train, build, evaluate and execute deep learning models with PDI. Stated another way, the ability to analyze unstructured data with PDI. In addition, by using DL4j, you can train your deep learning models using a locally attached graphics processing unit (GPU) that is either internal to your system or externally attached, like an eGPU. DL4j uses the CUDA API from NVIDIA and thus only uses NVIDIA GPUs at this time. Training for image processing is much faster on a GPU than on a CPU.






There is a lot of reference material available to help you get started with PMI including some new installation documents to help setup PMI v1.4 and how to setup your GPU and CUDA environment for DL4j. The list of materials and references can be found at this location.






It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time. It is recommended that this experimental feature be used for testing, educational and exploration purposes only. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.

In addition to the LiDAR Motion Sensor real-time data feed from the 8th floor lobby of the HLDS facility, we've added another sensor to the configuration. The new real-time sensor data comes from a prototype sensor that is being developed by the same LiDAR Hitachi LG Data Systems (HLDS) development team. This sensor is a Particulate Matter sensor, or dust sensor. We thought it would be an interesting combination of sensor data to detect human traffic AND the amount of dust or particles being "kicked up" from this traffic. The lobby is a carpeted area.



In Korea, there is an increasing concern with particulate matter and pollution in the environment coming from their neighboring country. This new sensor allows monitoring of air quality by the detection of particulate matter. There is a Particulate Matter, or PM, standard for defining dust in the air. While the eventual sensor device will be used both indoors and outdoors, today we are deploying the sensor indoors and making the data from this sensor available to everyone to analyze. In the future, we will deploy an outdoor sensor to monitor the air pollution in the city of Seoul.


The PM sensor data uses MQTT to publish its data. The real-time data feed can be accessed at the following MQTT broker and topic.



There is a problem with the original broker and we have moved this
data stream to a new broker. Please note the new broker URL below.
Sorry for the inconvenience.



Broker location - tcp://


Topic - hlds/korea/8thFloor/lobbyDust



The data streamed from this sensor is a json formatted message that has the following definition,


  • Event: AirQuality - the event type
  • Time: TimeStamp - time of the sample
  • PM1_0: Particulate Matter at 1 micrometer and smaller - quantity of sample
  • PM2_5: Particulate Matter at 2.5 micrometer and smaller - quantity of sample
  • PM10: Particulate Matter at 10 micrometer and smaller - quantity of sample
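As a starting point outside of PDI, a message with the fields listed above could be parsed like this in Python. This is a sketch under assumptions: the sample payload, its values and its timestamp format are invented for illustration, and the JSON key spelling is assumed to match the field names in the list exactly.

```python
import json
from dataclasses import dataclass


@dataclass
class AirQualitySample:
    event: str
    time: str      # kept as text; the real timestamp format is not documented here
    pm1_0: float
    pm2_5: float
    pm10: float


def parse_air_quality(payload):
    """Turn one JSON message from the dust sensor topic into a typed record."""
    message = json.loads(payload)
    return AirQualitySample(
        event=message["Event"],
        time=message["Time"],
        pm1_0=float(message["PM1_0"]),
        pm2_5=float(message["PM2_5"]),
        pm10=float(message["PM10"]),
    )


# Hypothetical payload, not captured from the real broker.
sample_payload = (
    '{"Event": "AirQuality", "Time": "2018-12-01 09:00:00", '
    '"PM1_0": 4, "PM2_5": 7, "PM10": 9}'
)
sample = parse_air_quality(sample_payload)
```

The same function could be called from the message callback of whichever MQTT client you prefer.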


Here is a screen shot of MQTT Spy inspecting these messages.



What kind of Pentaho transformations, dashboards and analysis can you create with this data? Is there a correlation between human traffic through the lobby and the amount of dust detected? We want to see your creations. Please share your work in the comments area below, or write up your own blog and share it with us. Who knows, there might be something in it for you.
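One simple way to start on the correlation question is a Pearson coefficient between people counted per time bucket and the average dust reading in the same bucket. The sketch below is plain Python with made-up numbers, chosen only so the arithmetic is easy to check by hand. It says nothing about what the real lobby data shows.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / sqrt(spread_x * spread_y)


# Synthetic per-minute buckets (hypothetical values, not sensor data).
people_per_minute = [0, 2, 4, 6]
average_pm10 = [5.0, 6.0, 7.0, 8.0]
r = pearson(people_per_minute, average_pm10)
```

With real data you would first bucket both streams onto the same time grid, for example per minute, before comparing them.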

There are currently 3 Installation Guides to accompany the Plug-In Machine Intelligence (PMI) plug-in and one Developers Guide. Also, the demonstration transformations and sample datasets are available. These sample transformations and sample datasets are for demonstration and educational purposes. They are downloadable at the following,


Download Link and Document Name
PMI_1.4_Installation_Linux.pdf - Installation guide for the Linux OS platform.
PMI_1.4_Installation_Windows.pdf - Installation guide for the Windows OS platform.
PMI_1.4_Installation_Mac_OSX.pdf - Installation guide for the Mac OS X platform.
PMI_Developer_Docs.pdf - A developer's guide to extending and contributing to the PMI framework.
PMI_MLChampionChallengeSamples.zip - This zip file contains all of the sample transformations, sample folder layouts and datasets for running the Machine Learning demonstrations and the Machine Learning Model Management samples. This is for demonstration and educational purposes.
PMI_AddingANewScheme.pdf - This document describes the development process of exposing the Multi-Layer Perceptron (MLP) regressor and classifier in the Weka and scikit-learn engines.

REAL! Real-time IoT data stream available for Pentaho Analysis and Visualization

Everyone knows how hard it is to get access to real-time data feeds. Well, here is a chance to access real-time data using a 3D LiDAR motion sensor.





There has been a lot of talk recently about the new 3D LiDAR (Light Detection and Ranging) motion sensor from Hitachi LG Data Systems (HLDS). The 3D LiDAR is a Time of Flight (ToF) motion sensor that calculates distance by measuring the time it takes for an infrared laser to emit light and receive the reflection back. Because it measures a pixel-by-pixel image via the sensor, it shows the shape, size and position of a human and/or an object in 3D at 10 to 30 fps (frames per second), so it is possible to detect and track the motion, direction, height, volume, etc. of humans or objects.
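The Time of Flight principle itself is simple arithmetic: the light travels to the object and back, so the distance is half the round-trip time multiplied by the speed of light. Here is a small Python illustration. The 20 nanosecond round trip is an arbitrary example value, not a specification of the HLDS sensor.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def tof_distance_m(round_trip_seconds):
    """Distance to the reflecting object for a measured round-trip time."""
    # Divide by 2 because the pulse covers the distance twice (out and back).
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2


round_trip = 20e-9  # hypothetical measurement: 20 nanoseconds
distance = tof_distance_m(round_trip)
```

A 20 nanosecond round trip corresponds to an object roughly 3 meters away, which gives a feel for how precisely such a sensor has to time each pixel.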


Unfortunately, general access to this sensor is a bit difficult to come by at the moment, and setting one up in a useful location, like a bank, retail store or casino, is also a challenge. So, in a partnership with HLDS, we have set up a LiDAR configuration in a company lobby on the 8th floor at HLDS in Seoul, South Korea and will make the real-time output stream available to Hitachi Vantara Pentaho developers to use and develop against. The real-time data stream will be published from an MQTT broker at,



There is a problem with the original broker and we have moved this
data stream to a new broker. Please note the new broker URL below.
Sorry for the inconvenience.


Broker location – tcp://

Topic – hlds/korea/TOFData



An example .json formatted data record published from this broker and topic looks like this,




The data stream will be published in clear text. The data is not sensitive. We are looking for real-time dashboards, visuals, analytics and integration transformations.


To help start this off, there is a collection of transformations to start from here.





The setup scenario is a “Human Direction Detection” challenge using the filter processor "Human Counter Pro". There are two zones being monitored by the 2 ceiling mounted LiDARs (the two LiDARs are grouped together to cover the wide area). The first zone is the entrance area called “entrance” and the second zone is the lobby area called the “hallway”. The following can happen in this configuration,


  • People arrive (out of the elevator) and enter the “entrance” area, then they enter the “hallway” area, and are either walking towards the South Wing doorway or the North Wing doorway. This is the most common scenario and is basically employees arriving on their floor and heading to their work area.
    • This scenario can also happen in reverse order where people enter in the "hallway" from either the North Wing or South Wing and enter the "entrance" signifying leaving.
  • Someone enters and stays in the “hallway” for a period of time. Someone or others arrive in the entrance area and the group heads to one of the doorways. This scenario is basically an employee waiting for visitors to be escorted to a meeting or other activity.
  • Someone or a group crosses the “hallway” from the South Wing to the North Wing, or from the North Wing to the South Wing. This is a scenario where people are crossing over from one side of the building to the other side.
  • Someone enters the “hallway” area and stays there for a period of time, then heads to one of the doorways. In this scenario, someone is probably looking at one of the live demos or items in the lobby’s display area.
  • There could be other scenarios that you can identify with the data from the LiDARs, these are just a few that we came up with.






The published data stream will have identified and tracked people as they move into the “entrance” area and then move to the “hallway” area. It includes timing information for when each person enters (Appear) a zone and when they leave (Disappear) it. You will need to calculate the duration spent in each zone yourself.
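Calculating the duration comes down to pairing each Appear with the matching Disappear for the same person and zone. The Python sketch below shows one way to do that. The event dictionaries and their keys (`person`, `zone`, `event`, `time`) are hypothetical and will need to be mapped to the actual fields in the published records.

```python
from datetime import datetime


def zone_durations(events):
    """Pair Appear/Disappear events and return (person, zone, seconds) tuples."""
    open_visits = {}  # (person, zone) -> time of the pending Appear
    durations = []
    for event in sorted(events, key=lambda e: e["time"]):
        key = (event["person"], event["zone"])
        if event["event"] == "Appear":
            open_visits[key] = event["time"]
        elif event["event"] == "Disappear" and key in open_visits:
            seconds = (event["time"] - open_visits.pop(key)).total_seconds()
            durations.append((event["person"], event["zone"], seconds))
    return durations


# Made-up events for one person walking from the entrance into the hallway.
events = [
    {"person": 7, "zone": "entrance", "event": "Appear", "time": datetime(2018, 11, 5, 8, 59, 50)},
    {"person": 7, "zone": "entrance", "event": "Disappear", "time": datetime(2018, 11, 5, 8, 59, 56)},
    {"person": 7, "zone": "hallway", "event": "Appear", "time": datetime(2018, 11, 5, 8, 59, 56)},
    {"person": 7, "zone": "hallway", "event": "Disappear", "time": datetime(2018, 11, 5, 9, 0, 26)},
]
durations = zone_durations(events)
```

A Disappear without a pending Appear is ignored here. Whether that is the right choice depends on how the real stream behaves when tracking starts mid-visit.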


Lastly, remember South Korea is 16 hours ahead of Pacific Daylight Time (17 hours ahead of Pacific Standard Time), so the work day and work week activity is very skewed. It will be busy in the evening Pacific time, and it will already be the weekend there on Friday Pacific time.
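If you want to label the data by Seoul working hours, converting timestamps is a one-liner. This Python sketch uses fixed UTC offsets to stay self-contained. Korea Standard Time is UTC+9 all year, while the Pacific offset of UTC-7 is an assumption that only holds during daylight saving time.

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9), "KST")   # Korea has no daylight saving time
PDT = timezone(timedelta(hours=-7), "PDT")  # assumption: Pacific *daylight* time


def to_seoul(pacific_time):
    """Convert a timezone-aware Pacific timestamp to Seoul local time."""
    return pacific_time.astimezone(KST)


friday_evening_pacific = datetime(2018, 10, 12, 18, 0, tzinfo=PDT)  # a Friday
seoul_time = to_seoul(friday_evening_pacific)
```

Friday at 6 pm Pacific lands on Saturday at 10 am in Seoul, which is why Friday looks so quiet from the US West Coast.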


You can use a MQTT inspection tool like "MQTT Spy" to explore and examine the data coming from the sensor.




Some background


Originally, this was going to be set up just for me. Then it was discussed that since this is an MQTT design, we could open it up company wide. Access to real world IoT data is hard to come by.


There are other Processor Filters in the LiDAR device middleware suite that provide different functions from the sensor. We are starting with the Human Counter Pro because this one publishes via MQTT. If this is successful, the other Processor Filters will also be integrated with MQTT as a simple mechanism for integrating Pentaho with the LiDAR sensor, and with future physical sensors and Processor Filters.


No special plugin development is required to integrate a state-of-the-art motion sensor with Pentaho. We’ve had access to MQTT steps for PDI for a few years now. There are a few blogs in the Vantara Community here and here describing how to use MQTT with Pentaho.


Some analysis ideas,


  • How many people entered the “entrance” only and then “Disappeared” (wrong floor?)?
  • How many people exited from “entrance”?
  • How many people went to North Wing?
  • How many people went to South Wing?
  • How many people crossed the “hallway”?
  • How long did people stay in the “hallway”?
  • At what times of the day are the most people in the “hallway”?
  • Does the time of day matter?
  • What reports, visuals, dashboards and/or real-time dashboards can be created from this data?


Please share what you come up with in the comments section and/or submit your own write-up or blog. Who knows, there might be some recognition in it for you. Enjoy!



What Can You Do with Deep Learning in Pentaho?


By Ken Wood and Mark Hall


For those of you who have installed and are using the Plugin Machine Intelligence (PMI) plugin that Hitachi Vantara Labs released to the Pentaho Marketplace back in March 2018, get ready for an exciting new update. This fall, we will release PMI version 1.4 as an update to the existing PMI, which is an experimental plugin for Pentaho Data Integration (PDI). Our initial release of PMI focused on classical machine learning and the ability to build, use and manage machine learning models from four popular machine learning libraries – Python’s scikit-learn, R’s Machine Learning with R, Spark’s Machine Learning library and WEKA.


I say classical machine learning because traditionally classical machine learning has had its best success on structured data. With the next release of PMI, we integrate a new machine learning library, or what we refer to as an “execution engine” – Deep Learning for Java (DL4J). This means PMI can now perform deep learning operations - training, validating, testing, building, evaluating and using deep learning models - directly from PDI.




Deep Learning is gaining lots of attention in the industry for its ability to operate on unstructured data like images, video, audio etc. Deep Learning is a recent addition to the Artificial Intelligence domain of machine learning, though technically the technology has been around for quite some time.




Deep learning gets its name, to some degree, from the deep, complex, hidden neural network layers the technology creates to analyze data. To be clear, both machine learning and deep learning can operate on both structured and unstructured data. It’s just that current general practice, and the greater success rate, is to apply deep learning to unstructured data and classical machine learning to structured data. That is the state of understanding at this time.


The reason we’re blogging about this now is because we showcased and demonstrated PMI v1.4 with deep learning at Hitachi NEXT 2018 in San Diego. Along with a series of one-on-one workshops showing the new deep learning step with PDI and PMI v1.4, we demonstrated an example application using deep learning in an interactive apparatus that uses two deep learning models in a PDI transformation, and then uses PDI to drive the entire application.




This PDI transformation contains several parts when called,

  • The “Data Capture and Data Preparation” phase
    • This portion of the transformation starts by narrating what the entire transformation will do
    • Then communicates with a Raspberry Pi to capture a picture of a physical x-ray - essentially analog to digital conversion
    • Information about the image is then transformed into image metadata. Basically, an in-memory location of the actual digital image
  • The PDI transformation then executes the two deep learning models on the x-ray image. The two deep learning models vectorize the image into usable numbers, determine the probability of identifying the body part focused on in the image and detect whether an injury or anomaly exists.
  • The results of the two deep learning models are the probabilities of,
    • A multi-class classifier – Shoulder, Humerus, Elbow, Forearm and Hand
    • And a 2-class classifier, injury or anomaly detected – yes or no
    • These probabilities are numbers between 0 and 1
  • The next phase of the PDI transformation, “Results Preparation” takes the output probabilities (numbers between 0 and 1) from the deep learning models and prepares the result for use.
    • Determine the most likely value – max value is the “answer”
    • Format the 5 decimal digit value into a percentage and into a string
      • This formatting allows the next phase to say “Forty seven percent” instead of "4, 7, percent sign"
  • The last phase, “Confidence Dialog Preparation”, builds logic for the different speaking phrases and applies confidence to the result as an analysis.
    • For example, instead of saying, “There is a 98% chance that this elbow is injured.”. Just say “I detect that this elbow is injured.”. At 98%, we’ve determined that it is injured, but at 47%, we’re not too sure, so the spoken analysis would be “I detect a 47% probability that this elbow is injured, you might want to have it checked out.”.
    • This confidence logic applies to both the body part identification and the injury detection parts of the spoken analysis.
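The “Results Preparation” and “Confidence Dialog Preparation” phases above can be sketched in a few lines of Python. This is an illustration of the idea, not the PDI transformation used in the demo. The 0.90 cut-off between a flat statement and a hedged one is an assumed value (the post only says 98% is confident and 47% is not), and the helper names are made up.

```python
BODY_PARTS = ["Shoulder", "Humerus", "Elbow", "Forearm", "Hand"]
CONFIDENT = 0.90  # hypothetical threshold; the original value is not published

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out a whole percentage (0 to 100) so it can be spoken."""
    if n == 100:
        return "one hundred"
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def most_likely(labels, probabilities):
    """Results preparation: the class with the highest probability wins."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[index], probabilities[index]


def injury_phrase(body_part, injury_probability):
    """Confidence dialog: state the result flatly only when confident."""
    part = body_part.lower()
    if injury_probability >= CONFIDENT:
        return f"I detect that this {part} is injured."
    percent = number_to_words(round(injury_probability * 100))
    return (f"I detect a {percent} percent probability that this {part} "
            "is injured, you might want to have it checked out.")


# Made-up model outputs for illustration.
part, part_probability = most_likely(BODY_PARTS, [0.01, 0.02, 0.90, 0.05, 0.02])
confident_result = injury_phrase(part, 0.98)
unsure_result = injury_phrase(part, 0.47)
```

A fuller version would also need a phrase for the case where the injury probability is low. The post does not describe that wording, so it is left out here.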


A diagram of the "Deep Learning Pipeline" can be seen here.

  • We use a "Speech Recognition Module" written in Python to capture spoken phrases and determine the actions to be taken.
  • In case the environment is too noisy for sound, a special remote control application is available to manually execute the "Hey Ray!" command set.
    • A main transformation is used to interpret the incoming tasks and orchestrate the execution of other transformations as needed.
      • The tasks include,
        • Introduction narration
        • Help on how to use "Hey Ray!"
        • Analyze the x-ray film and provide the results speech
        • Save the current analysis session to the Hitachi Content Platform (HCP)
          • During this operation, the content, x-ray image and analysis phrases, are converted into a single image movie file, then all of the content is saved to HCP
        • You can have "Hey Ray!" tweet the movie file
        • Provide insightful thoughts and opinions
        • And finally, "Hey Ray!" can tell radiologist jokes




We call this demonstration “Hey Ray!”. “Hey Ray!” is just an example of applying deep learning to a situation. We came up with "Hey Ray!" because of the dataset we had access to, which just happens to be x-ray images. We could have created something with flowers, food, automobiles, etc. We also decided to speak the results and add speech recognition for demonstration and "Wow Factor" at the Hitachi NEXT conference. Also, we felt that creating charts of probability distributions of numbers between 0 and 1 would take too long to explain, so why not have the demonstration state the results? This demonstration turned out to be highly interactive, as the attendees could select an x-ray picture, insert it into the x-ray viewing screen and tell the device to "Analyze the x-ray".






We will be providing more blogs about PMI 1.4 with deep learning and other information on the artificial intelligence that goes into “Hey Ray!” in the coming months to help support this release. Stay tuned!


What can you do with machine learning and now deep learning in Pentaho?




It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time.  It is recommended that this experimental feature be used for testing, educational and exploration purposes only. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.

Sandra Wagner is part of our Customer Success & Support team dedicated to Pentaho and Analytics. You might also know her as The Goddess of Best Practices from the Support Portal. We want to make sure all customers who are using Pentaho know where to find helpful resources including Support, Best Practices and so much more.



Confused about how to upgrade Pentaho?


Upgrading to Pentaho 8.1 can seem like a complicated process, but it does not have to be difficult. We have published guidelines and best practices that answer some common questions about upgrading Pentaho. We’ve included a checklist of steps to take, such as which upgrade path to use to get to Pentaho 8.1, what to back up and restore, when to update the design tools, and more:




You should have all necessary information and software available to you, and then it will be a simple matter of following your upgrade path from its beginning to its end. There is a comprehensive and downloadable version of this checklist to help you record and keep track of the information you’ll need to upgrade.



If you have custom configurations, contact your CSM, then Support, and let them know before you upgrade.



There is also a PDF version available for download at the bottom of Guidelines for Successfully Upgrading to Pentaho 8.1. Here are a few more links that you might find helpful:



Click here to download a full guide on Upgrading to Pentaho 8.1

MyRepublic, one of the fastest growing telecom operators in Asia-Pacific, is disrupting the traditional telecommunications market with the introduction of TelcoTech, which uses data and new open source technologies, analytics and machine learning to create new business models.


One of the key tenets of the TelcoTech vision is providing telecommunications operators with the ability to enter markets quickly and provide services rapidly.


MyRepublic partners with Hitachi Vantara to revolutionize TelcoTech. “The implementation of Pentaho has strengthened MyRepublic’s TelcoTech strategy across the region which will help us scale quickly and expand our offerings to other markets in future.” Eugene Yeo, Group Chief Information Officer, MyRepublic.


The efficiencies gained from integrating the Pentaho open platform and leveraging the extensive library of data integration connectors helps MyRepublic further enhance the ability of its platform to deliver on this promise.


To learn more about MyRepublic’s success, check out their case study and podcast.


“While we have made significant manpower savings on data integration and reporting, the bigger benefit is the robust data pipeline that has been built. Pentaho allows us to add data to this pipeline rapidly, which is important to this vision. It paves the way for us to create new data monetization models, which will lead to innovation in the industry, just like what FinTech players achieved with the financial services industry.”


Eugene Yeo, Group Chief Information Officer, MyRepublic