Ken Wood

Internet of Things Data – How do you get it? Make it yourself.

Blog post created by Ken Wood on Jun 2, 2015

First off, I want to officially welcome Pentaho to Hitachi. I have enjoyed working with these guys for over a year and a half now and I am really excited to be working with them on future projects and solutions.


During an off-site planning meeting with some of these cool Pentaho guys, we discussed the Internet of Things (IoT) and what we could create with Pentaho. First of all, you need to get IoT data in order to experiment with it and analyze it. IoT data isn’t as prevalent as it might seem, at least not in quantities worth analyzing, and data in the realm of the Internet of Things That Matter (IoTTM) is even harder to obtain. Even in Hitachi, casual train sensor data is hard to come by unless you are deeply involved in the project (and even that’s a guess on my part).


Another subject we discussed was running Pentaho Data Integration (PDI) on platforms more suited to the IoT. In theory, PDI should run on a Raspberry Pi computer even though it’s not listed as a supported platform. It is safe to say that it doesn’t work out of the box, but after following Mark Melton’s blog and some additional fiddling with scripts, I can confidently say that YES, PDI (at least the enterprise edition) runs on a Raspberry Pi (RPi) under the Raspbian OS. It is probably not a “supported” configuration (yet), but the possibilities are interesting. The transformations run on the RPi under the lightweight Carte server, which is PDI’s remote scale-out mechanism. Now, what can we do with this?


How do you create IoT data with this setup? It should be something useful and interesting: not a hobbyist project measuring aquarium water temperature, but something that combines the power of an intelligent sensor with the ability to process data at the edge of the network.


Introducing the intelligent edge “People Detector”. This device takes pictures and analyzes the images for faces and upper bodies (a head-and-shoulders pattern) using the Simple Computer Vision (SimpleCV) libraries for Python. Since I don't live in a crowded area, I’ve forced my new device to watch cable news TV for days (occasionally flipping the channels to try out different venues). It’s a little like the opening scene of Robot Chicken on Adult Swim where the cyborg chicken is forced to watch TV monitors. The application is about 90 lines of Python, half of which are comments for clarity.


This device is controlled by PDI, but it runs on its own, collecting images, analyzing them, and logging the results in a sensor log. For each image, a decision is made: if there is a “people” hit, meaning either a detected face or a detected body, the image is saved in the Hitachi Content Platform (HCP).
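The control flow described above can be sketched in a few lines of Python. This is only a minimal sketch and not the actual People Detector code: `is_people_hit`, `handle_scan`, the log line layout, and the `store_image` callback are hypothetical names, and the face and body counts are assumed to come from whatever detector is in use.

```python
from datetime import datetime


def is_people_hit(faces, bodies):
    """A scan is a "people" hit if at least one face or one body was detected."""
    return faces > 0 or bodies > 0


def handle_scan(sensor_id, faces, bodies, when, log_lines, store_image):
    """Log every scan; hand the image to storage only on a hit.

    log_lines   -- list standing in for the sensor log (assumed line format)
    store_image -- callback standing in for the upload to the object store
    """
    hit = is_people_hit(faces, bodies)
    # Every scan is logged, whether or not anything was detected.
    log_lines.append(
        "{},{},faces={},bodies={},hit={}".format(
            when.isoformat(), sensor_id, faces, bodies, int(hit)
        )
    )
    # Only images with a "people" hit are passed on for storage.
    if hit:
        store_image(sensor_id, when, faces, bodies)
    return hit


if __name__ == "__main__":
    log, stored = [], []
    save = lambda *args: stored.append(args)
    handle_scan("rpi-01", 2, 1, datetime(2015, 6, 2, 12, 0, 0), log, save)
    handle_scan("rpi-01", 0, 0, datetime(2015, 6, 2, 12, 0, 5), log, save)
    print(len(log), "scans logged,", len(stored), "image(s) stored")
```

Passing the storage step in as a callback keeps the decision logic testable without a camera or an object store attached.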


So, everything is logged in the sensor log, but if a person is detected, the image, along with the outlined areas of detection, is ingested into the object store with a descriptive filename. Repeat forever. The sensor log looks like this (I’ve added the color for some clarity).



PDI (the Carte server) also runs on the RPi to clean up the sensor logs and load the data into an external database for analysis with Pentaho Business Analytics. The resulting transformed data looks like this in the database.



The images that have people detected in them are ingested into HCP. The HCP browser view looks like this. Since I started running this, I’ve collected over 20K images, and they are all stored in HCP. There have been some modifications and changes over the past week, and with my new Raspberry Pi V2, I can now double the resolution of my images without any decrease in performance, at the cost of only a threefold increase in image file size.



Each image is stored as a JPEG with a very descriptive file name that includes the timestamp, sensor ID, number of faces, number of bodies, and number of people detected. Everything stored in this HCP system has a “people” hit. This is all processed by the RPi.
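As an illustration of a descriptive file name carrying those five fields, here is a small Python sketch that builds such a name and parses it back. The exact layout used by the People Detector is not shown in this post, so the pattern below (`<timestamp>_<sensor>_f<faces>_b<bodies>_p<people>.jpg`) and the function names are assumptions made for the example.

```python
import re
from datetime import datetime

# Assumed layout: <timestamp>_<sensor>_f<faces>_b<bodies>_p<people>.jpg
FILENAME_RE = re.compile(
    r"^(?P<ts>\d{8}T\d{6})_(?P<sensor>[A-Za-z0-9-]+)"
    r"_f(?P<faces>\d+)_b(?P<bodies>\d+)_p(?P<people>\d+)\.jpg$"
)


def build_filename(when, sensor_id, faces, bodies, people):
    """Encode the detection details into a JPEG file name."""
    return "{}_{}_f{}_b{}_p{}.jpg".format(
        when.strftime("%Y%m%dT%H%M%S"), sensor_id, faces, bodies, people
    )


def parse_filename(name):
    """Recover the detection details from a name built by build_filename."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError("unrecognized filename: " + name)
    return {
        "timestamp": datetime.strptime(match.group("ts"), "%Y%m%dT%H%M%S"),
        "sensor_id": match.group("sensor"),
        "faces": int(match.group("faces")),
        "bodies": int(match.group("bodies")),
        "people": int(match.group("people")),
    }


if __name__ == "__main__":
    name = build_filename(datetime(2015, 6, 2, 12, 0, 5), "rpi-01", 2, 1, 3)
    print(name)  # 20150602T120005_rpi-01_f2_b1_p3.jpg
    print(parse_filename(name))
```

Because the name can be parsed back into its fields, a listing of the object store can double as a rough index of detections.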


You can click on any filename and preview the image with the detected objects outlined. Granted, a camera pointing at a TV monitor isn’t the highest-quality image capture, but it is a constant feed of people that I can detect with some level of accuracy. This logic can, of course, be reversed to store images that have no "people" hits if that's more interesting.


My diagram of how this process flow works looks something like this. One more configuration note: the RPi running the Carte server gets its transformation from a Pentaho repository. This allows the transformations on the RPi to be controlled remotely.



So now that I have a lot of IoT data to analyze, what can I do with it? Granted, watching a cable news network skews reality a bit, unless you want to analyze the proportion of scans with people detected to scans with no people on the screen. Still, we can look for outliers. Below is a quick scatter chart using Pentaho’s Business Analytics server on the more than 15K scans captured during the new version 3 run, with 11K images with “hits” ingested into HCP. By the way, this means there were about 4,000 scans where no people were detected on a cable news channel.


There is an interesting outlier data point in this chart where 21 bodies were detected but no faces. Pulling up the image for that data point shows the reason: an infographic for a commercial. It is an interesting find, and it helps illustrate the data that can be collected and analyzed with an IoT-like device and Pentaho analytics.
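The same kind of outlier can also be pulled out of the scan records with a simple filter. The sketch below is not how the chart was produced (that was done in Pentaho Business Analytics); it assumes each scan is available as a dictionary with `faces` and `bodies` counts, and the `min_bodies` threshold of 10 is an arbitrary value chosen for illustration. The sample records are made up, apart from the 21-bodies, zero-faces case mentioned above.

```python
def bodies_without_faces(scans, min_bodies=10):
    """Return scans that report many bodies but no faces at all."""
    return [
        scan for scan in scans
        if scan["faces"] == 0 and scan["bodies"] >= min_bodies
    ]


if __name__ == "__main__":
    # Made-up sample records; only the (0 faces, 21 bodies) case comes from the post.
    sample = [
        {"id": 1, "faces": 2, "bodies": 1},
        {"id": 2, "faces": 0, "bodies": 21},
        {"id": 3, "faces": 0, "bodies": 1},
        {"id": 4, "faces": 0, "bodies": 0},
    ]
    for scan in bodies_without_faces(sample):
        print("outlier scan:", scan)
```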


There are several enhancements I have planned for this experiment, including running several detectors simultaneously and in real-life situations. Using a TV has its advantages for testing, but the image quality and capture aren’t optimal. It would also be a best practice to forgo the descriptive image filename and ingest the image details into HCP as metadata. In addition, I’d like to add some sort of machine-to-machine real-time action based on a detection event, such as using a high number of detected bodies with zero faces as a trigger to send me the picture of the event directly, tweet the picture, or do something else of interest. I’m also thinking of creating a version of this that detects vehicle traffic patterns in my neighborhood and possibly determines the truck-to-car-to-motorcycle ratio.
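To make the metadata idea concrete, here is a minimal Python sketch that turns the detection details into a small XML document that could accompany the image instead of being packed into its file name. It assumes the metadata would be attached as XML; the element names and the `detection_metadata` function are hypothetical, and the step of actually sending the document to HCP is deliberately left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def detection_metadata(when, sensor_id, faces, bodies, people):
    """Build an XML string describing one detection (assumed element names)."""
    root = ET.Element("detection")
    ET.SubElement(root, "timestamp").text = when.isoformat()
    ET.SubElement(root, "sensorId").text = sensor_id
    ET.SubElement(root, "faces").text = str(faces)
    ET.SubElement(root, "bodies").text = str(bodies)
    ET.SubElement(root, "people").text = str(people)
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(detection_metadata(datetime(2015, 6, 2, 12, 0, 5), "rpi-01", 0, 21, 21))
```

Keeping the details in structured metadata rather than the file name means new fields can be added later without renaming stored objects.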


So, I’d like to hear your ideas for generating IoT data with the Raspberry Pi and Pentaho. Better yet, you can post your creation on the HDS Community as well.