Noah Schneider

Processing & Visualizing Streaming Data Using Pentaho

Blog Post created by Noah Schneider Employee on Feb 19, 2019

Why Streaming Data? Why Now?

In the past 10 years, the data market has undergone a vast transformation, expanding far beyond traditional data integration and business intelligence involving the analysis of transactional data. One of the most important changes has been a huge growth in the data created by devices, vehicles, and other items outfitted with sensors and connected to the internet – the so-called Internet of Things (IoT). IoT data is being generated by these devices at an unprecedented rate and the rate of growth itself continues to rise. This wealth of new data provides many opportunities to improve business, societal, and personal outcomes, if used properly. However, it is quite easy to get swamped by the flood of data and neglect it until a later date, potentially resulting in many missed opportunities to improve sales, reduce downtime, or respond to a problem in real time. This is where streaming data tools comes in handy. By processing, analyzing, and displaying data in real time you can act upon your IoT data when it matters most – right now.




Fortunately, Hitachi Vantara realizes the importance of streaming data and has invested significant resources into the field, resulting in new products like the Lumada platform and Smart Data Center as well as new features for our software such as Pentaho Data Integration’s support of MQTT and Kafka via its native engine or Spark via the Adaptive Execution Layer (AEL). I’ve been exploring many of these new features, especially those specific to Pentaho, and wanted to share some architectures I’ve had success with that allow me to receive, process, and visualize data in real time.


Use Case & Requirements

One example of an IoT case where I’ve used Pentaho’s streaming data capabilities is the Smart Business booth at the Hitachi Next 2018 conference. For this demo, we had a table with 3 bowls of various candies available and asked conference visitors to select a treat. Meanwhile, a Hitachi lidar camera was watching the visitor, detecting which candy they picked as well as their estimated age, gender, emotion, and height. The goal was to have the camera’s analysis streamed to a real time dashboard where guests could see their estimated demographics and candy choice as well as historical choices made by other conference attendees.

Requirement graphic ( demonstration was a bit intimidating at first, as it had some considerable requirements. The lidar camera needed to capture and analyze the depth “images”, generate an analysis data file including candy choice, gender, age, height, emotion and send this to data to a computer. Then, the computer needed to store this data for historical analysis purposes while simultaneously visualizing the data in an attractive manner. Furthermore, all of this needed to happen in a sub-second time frame for the demonstration to function as guests expect. Thankfully, with the new features in Pentaho I was confident we could pull this off, even with a relatively short time frame.


Possible Architectures

I considered a few possible architectures, all of which required the same basic flow. The data needed to be sent by the camera to the computer using some protocol (step 1), the data needed to be received by a messaging/queuing system (step 2), the data needed to be stored in a database (step 3), and the data needed to be quickly queried from the database or sent directly to the dashboard via another streaming protocol (step 4). The possibilities I considered for each of the steps are as follows:

  1. Camera to computer data protocol: Kafka, MQTT, or HTTP post
  2. Message receiver/queue: Pentaho Data Integration (PDI) transform using Kafka consumer or MQTT consumer or PostgREST, a program extending a Postgres database so it can be used via a RESTful API
  3. Receiver to database: A table output step in PDI after the Kafka or MQTT consumer or having PostgREST call a stored procedure that in turn writes data to the database
  4. Query or stream to dashboard: A Pentaho streaming data service, an optimized SQL query, or postgres-websockets, a program extending a Postgres database allowing notifications issued by postgres stored procedures to be sent to a client over a websocket.


Here is a diagram detailing how the components from each step would work together:

Possible architectures (


I had previously used and had success with an architecture using HTTP POST (step 1), PostgREST (step 2), stored procedure (step 3), and postgres-websockets (step 4), which made it a tempting solution to reach for. However, I realized that the postgres-websockets component might be overkill, as it was only required on the prior project because we had run into issues with the stored procedure taking too long to write data to disk. We remedied this issue by sending the data first to the websocket and then writing the data to the database. However, for this use case, I didn’t believe the data would take long to be written to disk, so I opted to swap out postgres-websockets for an optimized SQL query. Overall this solution was a success, but in 1 out of every 1000 cases there would be a slight delay in the optimized SQL query finishing, causing a bit of lag for the data to show up to the dashboard. Although this wasn’t a make or break for the demo, I decided to press on to see if we could find a better solution.


The second architecture we tried was Kafka (step 1), a PDI transform with a Kafka consumer step (step 2), a table output step after the Kafka consumer step (step 3), and a streaming data service (step 4). This solution proved to be much faster and more reliable for live data, with no timing issues in contrast to the prior solution. However, it did require for us to separately query for historical data, causing some difficulties in trying to join the data between the sources for components where we were trying to display both live and historical figures. Although this joining was a bit tricky, this completely Pentaho solution proved to be much easier to set up, manage, and performed better than the other architectures I’ve used including third-party tools. Therefore, we chose to implement this architecture.


Putting it All Together

Although I won’t go into it, we next hardened the architecture, developed the transformations, created the dashboard, and tested our solution. Before we knew it, it was time to head to Next 2018 where the demo would be put to the real test. How’d we do? Check out the media below!



In person.jpg



By utilizing an all Pentaho architecture, we were able to create a reliable, fast, end to end live data pipeline that met our goals of sub-second response time. It was impressive enough to wow conference attendees and simple enough that it didn’t require a line of code. Streaming data is here, and growing, and Pentaho provides real solutions to receive, process, and visualize it, all in near-real time.